- Jan 06, 2023
-
-
David Douard authored
This generates swh.model objects from ORC files, which should allow rebuilding a storage from an ORC dataset of the archive. Note: not all object types are supported for now (e.g. ExtID and metadata-related objects are not yet supported).
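For illustration, a minimal sketch of the idea, assuming pyorc for reading and a single-'url'-column 'origin' table (the actual swh.dataset entry point and table layout may differ):

    import pyorc

    from swh.model.model import Origin

    def iter_origins(path):
        # Rebuild swh.model Origin objects from an ORC dump of the
        # 'origin' table; the single 'url' column layout is an assumption.
        with open(path, "rb") as f:
            for (url,) in pyorc.Reader(f):
                yield Origin(url=url)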
-
- Apr 07, 2022
-
-
David Douard authored
-
David Douard authored
This feature requires the config parameter `with_data=true` and an objstorage configuration.
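For illustration, the loaded configuration could look like the following sketch (the objstorage values and the exact placement of `with_data` are assumptions):

    # Sketch of the exporter configuration once loaded from YAML;
    # the objstorage settings below are illustrative placeholders.
    config = {
        "journal": {"brokers": ["kafka1:9092"]},
        "objstorage": {
            "cls": "pathslicing",
            "root": "/srv/softwareheritage/objects",
            "slicing": "0:2/2:4",
        },
        "with_data": True,
    }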
-
- Apr 06, 2022
-
-
Antoine Lambert authored
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7, so we can now drop the pytest pinning.
-
- Apr 01, 2022
-
-
Antoine Pietri authored
-
- Mar 31, 2022
-
-
David Douard authored
Partially reverting 5a8a8a78. This is needed to ensure better compliance with usual ORC semantics, where one directory = one table.
-
- Mar 30, 2022
-
-
David Douard authored
- use shortened values in the progress bar (e.g. '11.3M/566M' instead of something like '11310201/566123456'),
- reduce the description strings and align them.

The result looks like:

    Exporting release:
    - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
    - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
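The shortened values match what tqdm produces with unit scaling; a sketch, assuming tqdm is the progress bar library in use:

    from tqdm import tqdm

    # unit_scale=True renders counts as '11.3M/566M' instead of
    # '11310201/566123456'; desc is the shortened, aligned label.
    pbar = tqdm(total=566_123_456, desc="Offset", unit_scale=True)
    pbar.update(11_310_201)
    pbar.close()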
-
David Douard authored
To prevent a possible race condition leading to kafka errors like:

    rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed

This could occur when, at the time of unsubscribing from a partition, another partition is also depleted. Since unsubscribing consists in resubscribing to all partitions except the unsubscribed ones, we could try to subscribe to such a depleted partition, leading to the error message above.
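A conceptual sketch of the safer reassignment with confluent-kafka (names are illustrative, not the actual swh.dataset code):

    # 'assignment' is the full list of confluent_kafka.TopicPartition;
    # 'eof_partitions' is the set of partition numbers already depleted.
    def unsubscribe(consumer, assignment, eof_partitions, partition):
        eof_partitions.add(partition)
        # Re-assign only partitions that are not at EOF, so a partition
        # depleted in the meantime is never re-subscribed by accident.
        live = [tp for tp in assignment if tp.partition not in eof_partitions]
        consumer.assign(live)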
-
David Douard authored
E.g. ORC-exporter-specific config entries are now under the 'orc' section, like:

    journal:
      brokers: [...]
    orc:
      remove_pull_requests: true
      max_rows:
        revision: 100000
        directory: 10000
-
David Douard authored
Make it possible to specify the maximum number of rows a table can store in a single ORC file. The limit can only be set on main tables for now (i.e. it cannot be specified for tables like revision_history or directory_entry). It can be set via configuration only (no extra cli options).
-
- Mar 29, 2022
-
-
David Douard authored
-
David Douard authored
-
David Douard authored
-
David Douard authored
Related ORC files are the ORC files involved in the serialization of a given object type, namely:
- snapshot and snapshot_branch,
- revision and revision_history,
- directory and directory_entry.

Also include the object_type in the generated file name (in place of the static 'graph'), so the result will typically look like:

    output/orc/snapshot/
      snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
      snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
      snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
      snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc
-
David Douard authored
Add:
- object type,
- uuid,
- version of swh.model used at file generation time,
- version of swh.dataset used at file generation time.
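For illustration, a sketch of attaching such metadata with pyorc (the key names and version values here are assumptions, not swh.dataset's actual ones):

    import pyorc

    with open("revision-18a575cb-3a92-4753-9267-e3475fa30857.orc", "wb") as f:
        writer = pyorc.Writer(f, "struct<id:binary>")
        # ORC user metadata values must be bytes.
        writer.set_user_metadata(
            swh_object_type=b"revision",
            swh_uuid=b"18a575cb-3a92-4753-9267-e3475fa30857",
            swh_model_version=b"5.0.0",
            swh_dataset_version=b"0.3.2",
        )
        writer.close()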
-
David Douard authored
And split it into 2 parts (needed for changes to come).
-
David Douard authored
Rather than hardcoding it to 'swh-dataset-export-', use the 'group_id' value from the 'journal' section of the config file as the prefix, if given (otherwise default to the former value). This is needed because the current auth policy of the swh kafka cluster only allows group_ids that start with the actual login for authenticated connections, so we need to be able to specify this group_id prefix.
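For illustration, the relevant part of the loaded config might look like this sketch (broker and login values are placeholders):

    # The export group_id will be prefixed with this value instead of
    # the hardcoded 'swh-dataset-export-'.
    config = {
        "journal": {
            "brokers": ["kafka1:9092"],
            "group_id": "mylogin-swh-dataset-export",
        },
    }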
-
David Douard authored
And add a few more debug logging statements.
-
David Douard authored
deserialize_message() now takes an optional 'object_type' argument.
-
Antoine Pietri authored
Some use cases, such as building reproducible test datasets, require exporting data to a deterministic location. This adds a config option to exporters to make them always write to the same file. It also refactors the shared logic to generate file UUIDs.
-
David Douard authored
I.e. use the standard ORC Timestamp format (a (seconds, nanoseconds) pair) with 2 extra fields for the offset. The offset is stored as an integer (in minutes), but the raw offset value is also present as a binary string representation, following recent evolutions of swh-model. This makes swh-dataset compatible with swh-model 5.
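For illustration, a revision date then occupies three columns in the ORC schema; the field names in this sketch are assumptions based on the description above:

    # One swh-model date mapped to ORC columns (names assumed):
    #   date:                  timestamp (the (seconds, nanoseconds) pair)
    #   date_offset:           smallint  (offset in minutes)
    #   date_raw_offset_bytes: binary    (raw offset string from swh-model)
    schema = (
        "struct<id:binary,date:timestamp,"
        "date_offset:smallint,date_raw_offset_bytes:binary>"
    )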
-
- Feb 10, 2022
-
-
Antoine Lambert authored
To install the new hook:

    $ pre-commit install -t commit-msg
-
- Feb 07, 2022
-
-
Antoine R. Dumont authored
Related to T3916
-
- Jan 25, 2022
-
-
Antoine Pietri authored
-
Antoine Pietri authored
-
- Jan 20, 2022
-
-
vlorentz authored
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also drops spurious copyright headers from those files if present. Related to T3812
-
- Oct 11, 2021
-
-
vlorentz authored
-
- Sep 13, 2021
-
-
David Douard authored
This is necessary to ensure these messages are committed in kafka; otherwise, since the (considered) empty partition is unsubscribed from, it never gets committed in `JournalClient.handle_messages()` (since the latter only commits assigned partitions).

Ensure offsets are committed only after worker_fn is executed without error. This requires overloading the `JournalClient.handle_messages()` method in `JournalClientOffsetRanges` to make sure "pending" messages are committed after the proper execution of `worker_fn`.

Doing so, we can both unsubscribe from "eof" partitions on the fly (with "eof" meaning the partition has been consumed up to the high watermark offset at the beginning of the export), and commit ALL offsets that need to be, but only after proper execution of the `worker_fn` callback. This should guarantee proper and consistent behavior (famous last words...).
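A conceptual sketch of the commit-after-worker_fn ordering (simplified; not the actual swh.dataset implementation):

    # 'messages' are consumed confluent_kafka messages; commits run only
    # if worker_fn completed without raising.
    def handle_messages(consumer, messages, worker_fn):
        worker_fn(messages)
        for message in messages:
            # Also commits offsets of partitions unsubscribed on the fly.
            consumer.commit(message=message)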
-
David Douard authored
to make the code a bit clearer.
-
David Douard authored
-
David Douard authored
So we get a chance to actually have a visible progress bar:
- reduce the label size (shorter desc),
- use a single 'workers' postfix (like "workers=n/m").
-
- Sep 10, 2021
-
-
David Douard authored
- ensure the last offset is sent to the queue,
- fix the computation of the progress value (off-by-one).
-
David Douard authored
Used to retrieve partitions and low/high offsets. It could sometimes cause deadlocks or long timeouts (especially in the developer docker environment).
-
David Douard authored
The computation of low and high offsets used to be done in 2 steps:
- first get the watermark offsets (thus the absolute min and max offsets of the whole partition),
- then, as a "hook" in `process()`, retrieve the last committed offset for the partition and "push" these current offsets to the progress queue.

Instead, this simplifies the process a bit by querying the committed offsets while computing the high/low offsets.
-
- Sep 09, 2021
-
-
David Douard authored
-
Nicolas Dandrimont authored
-
- Aug 06, 2021
-
-
vlorentz authored
Before this commit, between 30 and 40% of the run time was spent in this function (especially ExtendedSWHID.__init__). Now, it is under 10%.
-
- Jul 28, 2021
-
-
vlorentz authored
This caused `JournalClientOffsetRanges` to ignore the last batch of messages in each assignment, because `JournalClient.handle_messages` deserializes all messages in the batch before calling the worker function, and raising `EOFError` from `deserialize_message` makes it exit early (before calling the worker fn). Additionally, it doesn't make much sense for a `deserialize_message` function to raise this kind of exception.

Instead, this commit removes the explicit `raise EOFError` and tells `JournalClient` to stop on EOF. `deserialize_message` calls `handle_offset`, which updates the assignment of the Kafka consumer to be the empty set, which causes it to reach EOF (since there are no more partitions to read from).
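For illustration, stopping on EOF is a standard knob of swh.journal's client; a sketch with placeholder broker and group_id values:

    from swh.journal.client import JournalClient

    client = JournalClient(
        brokers=["kafka1:9092"],        # placeholder broker
        group_id="swh-dataset-export",  # placeholder group id
        object_types=["revision"],
        stop_on_eof=True,  # return once all assigned partitions hit EOF
    )
    client.process(lambda objects: None)  # no-op worker fn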
-
- Jul 27, 2021
-
-
vlorentz authored
When running outside a TTY (e.g. in Docker), progress bar updates each appear on a new line, so it's harder to find the current step if it is not shown on the same line.
-