- Jan 06, 2023
-
-
David Douard authored
This generates swh.model objects from ORC files, which should allow rebuilding a storage from an ORC dataset of the archive. Note: not all object types are supported for now (e.g. ExtID and metadata-related objects are not yet supported).
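For illustration, a minimal sketch of the idea, assuming pyorc for reading and a single-'url'-column 'origin' table (the actual swh.dataset entry point and table layout may differ):

    import pyorc

    from swh.model.model import Origin

    def iter_origins(path):
        # Rebuild swh.model Origin objects from an ORC dump of the
        # 'origin' table; the single 'url' column layout is an assumption.
        with open(path, "rb") as f:
            for (url,) in pyorc.Reader(f):
                yield Origin(url=url)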
-
- Apr 07, 2022
-
-
David Douard authored
-
David Douard authored
This feature requires the config parameter `with_data=true` and an objstorage configuration.
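For illustration, the loaded configuration could look like the following sketch (the objstorage values and the exact placement of `with_data` are assumptions):

    # Sketch of the exporter configuration once loaded from YAML;
    # the objstorage settings below are illustrative placeholders.
    config = {
        "journal": {"brokers": ["kafka1:9092"]},
        "objstorage": {
            "cls": "pathslicing",
            "root": "/srv/softwareheritage/objects",
            "slicing": "0:2/2:4",
        },
        "with_data": True,
    }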
-
- Apr 06, 2022
-
-
Antoine Lambert authored
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7, so we can now drop the pytest pinning.
-
- Apr 01, 2022
-
-
Antoine Pietri authored
-
- Mar 31, 2022
-
-
David Douard authored
Partially reverting 5a8a8a78. This is needed to ensure better compliance with usual ORC semantics, where one directory = one table.
-
- Mar 30, 2022
-
-
David Douard authored
- use shortened values in the progress bar (e.g. '11.3M/566M' instead of something like '11310201/566123456'),
- reduce the description strings and align them.

The result looks like:

    Exporting release:
    - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
    - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
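The shortened values match what tqdm produces with unit scaling; a sketch, assuming tqdm is the progress bar library in use:

    from tqdm import tqdm

    # unit_scale=True renders counts as '11.3M/566M' instead of
    # '11310201/566123456'; desc is the shortened, aligned label.
    pbar = tqdm(total=566_123_456, desc="Offset", unit_scale=True)
    pbar.update(11_310_201)
    pbar.close()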
-
David Douard authored
To prevent a possible race condition leading to kafka errors like:

    rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed

This could occur when, at the time of unsubscribing from a partition, another partition is also depleted. Since unsubscribing consists in resubscribing to all partitions except the unsubscribed ones, we could try to subscribe to such a depleted partition, leading to the error message above.
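A conceptual sketch of the safer reassignment with confluent-kafka (names are illustrative, not the actual swh.dataset code):

    # 'assignment' is the full list of confluent_kafka.TopicPartition;
    # 'eof_partitions' is the set of partition numbers already depleted.
    def unsubscribe(consumer, assignment, eof_partitions, partition):
        eof_partitions.add(partition)
        # Re-assign only partitions that are not at EOF, so a partition
        # depleted in the meantime is never re-subscribed by accident.
        live = [tp for tp in assignment if tp.partition not in eof_partitions]
        consumer.assign(live)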
-
David Douard authored
E.g. ORC-exporter-specific config entries are now under the 'orc' section, like:

    journal:
      brokers: [...]
    orc:
      remove_pull_requests: true
      max_rows:
        revision: 100000
        directory: 10000
-
David Douard authored
Make it possible to specify the maximum number of rows a table can store in a single ORC file. The limit can only be set on main tables for now (i.e. it cannot be specified for tables like revision_history or directory_entry). It can be set via configuration only (no extra cli options).
-
- Mar 29, 2022
-
-
David Douard authored
-
David Douard authored
-
David Douard authored
-
David Douard authored
Related ORC files are the ORC files involved in the serialization of a given object type, namely:
- snapshot and snapshot_branch,
- revision and revision_history,
- directory and directory_entry.

Also include the object_type in the generated file name (in place of the static 'graph'), so the result will typically look like:

    output/orc/snapshot/
      snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
      snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
      snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
      snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc
-
David Douard authored
Add:
- object type,
- uuid,
- version of swh.model used at file generation time,
- version of swh.dataset used at file generation time.
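For illustration, a sketch of attaching such metadata with pyorc (the key names and version values here are assumptions, not swh.dataset's actual ones):

    import pyorc

    with open("revision-18a575cb-3a92-4753-9267-e3475fa30857.orc", "wb") as f:
        writer = pyorc.Writer(f, "struct<id:binary>")
        # ORC user metadata values must be bytes.
        writer.set_user_metadata(
            swh_object_type=b"revision",
            swh_uuid=b"18a575cb-3a92-4753-9267-e3475fa30857",
            swh_model_version=b"5.0.0",
            swh_dataset_version=b"0.3.2",
        )
        writer.close()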
-
David Douard authored
And split it into 2 parts (needed for changes to come).
-
David Douard authored
Rather than hardcoding it to 'swh-dataset-export-', use the 'group_id' value from the 'journal' section of the config file as the prefix, if given (otherwise default to the former value). This is needed because the current auth policy of the swh kafka cluster only allows group_ids that start with the actual login for authenticated connections, so we need to be able to specify this group_id prefix.
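For illustration, the relevant part of the loaded config might look like this sketch (broker and login values are placeholders):

    # The export group_id will be prefixed with this value instead of
    # the hardcoded 'swh-dataset-export-'.
    config = {
        "journal": {
            "brokers": ["kafka1:9092"],
            "group_id": "mylogin-swh-dataset-export",
        },
    }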
-
David Douard authored
And add a few more debug logging statements.
-
David Douard authored
deserialize_message() now takes an optional 'object_type' argument.
-
Antoine Pietri authored
Some use cases, such as building reproducible test datasets, require exporting data to a deterministic location. This adds a config option to exporters to make them always write to the same file. It also refactors the shared logic to generate file UUIDs.
-
David Douard authored
I.e. use the standard ORC Timestamp format (a (seconds, nanoseconds) pair) with 2 extra fields for the offset. The offset is stored as an integer (in minutes), but the raw offset value is also present as a binary string representation, following recent evolutions of swh-model. This makes swh-dataset compatible with swh-model 5.
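For illustration, a revision date then occupies three columns in the ORC schema; the field names in this sketch are assumptions based on the description above:

    # One swh-model date mapped to ORC columns (names assumed):
    #   date:                  timestamp (the (seconds, nanoseconds) pair)
    #   date_offset:           smallint  (offset in minutes)
    #   date_raw_offset_bytes: binary    (raw offset string from swh-model)
    schema = (
        "struct<id:binary,date:timestamp,"
        "date_offset:smallint,date_raw_offset_bytes:binary>"
    )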
-
- Feb 10, 2022
-
-
Antoine Lambert authored
To install the new hook:

    $ pre-commit install -t commit-msg
-
- Feb 07, 2022
-
-
Antoine R. Dumont authored
Related to T3916
-
- Jan 25, 2022
-
-
Antoine Pietri authored
-
Antoine Pietri authored
-
- Jan 20, 2022
-
-
vlorentz authored
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also drops spurious copyright headers from those files if present. Related to T3812
-
- Oct 11, 2021
-
-
vlorentz authored
-
- Sep 13, 2021
-
-
David Douard authored
This is necessary to ensure these messages are committed in kafka; otherwise, since the (considered) empty partition is unsubscribed from, it never gets committed in `JournalClient.handle_messages()` (since the latter only commits assigned partitions).

Ensure offsets are committed only after worker_fn is executed without error. This requires overloading the `JournalClient.handle_messages()` method in `JournalClientOffsetRanges` to make sure "pending" messages are committed after the proper execution of `worker_fn`.

Doing so, we can both unsubscribe from "eof" partitions on the fly (with "eof" meaning the partition has been consumed up to the high watermark offset at the beginning of the export), and commit ALL offsets that need to be, but only after proper execution of the `worker_fn` callback. This should guarantee proper and consistent behavior (famous last words...).
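A conceptual sketch of the commit-after-worker_fn ordering (simplified; not the actual swh.dataset implementation):

    # 'messages' are consumed confluent_kafka messages; commits run only
    # if worker_fn completed without raising.
    def handle_messages(consumer, messages, worker_fn):
        worker_fn(messages)
        for message in messages:
            # Also commits offsets of partitions unsubscribed on the fly.
            consumer.commit(message=message)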
-
David Douard authored
to make the code a bit clearer.
-
David Douard authored
-
David Douard authored
So we get a chance to actually have a visible progress bar:
- reduce the label size (shorter desc),
- use a single 'workers' postfix (like "workers=n/m").
-
- Sep 10, 2021
-
-
David Douard authored
- ensure the last offset is sent to the queue,
- fix the computation of the progress value (off-by-one).
-
David Douard authored
Used to retrieve partitions and low/high offsets. It could sometimes cause deadlocks or long timeouts (especially in the developer docker environment).
-
David Douard authored
The computation of low and high offsets used to be done in 2 steps:
- first get the watermark offsets (thus the absolute min and max offsets of the whole partition),
- then, as a "hook" in `process()`, retrieve the last committed offset for the partition and "push" these current offsets to the progress queue.

Instead, this simplifies the process a bit by querying the committed offsets while computing the high/low offsets.
-
- Sep 09, 2021
-
-
David Douard authored
-
Nicolas Dandrimont authored
-
- Aug 06, 2021
-
-
vlorentz authored
Before this commit, between 30 and 40% of the run time was spent in this function (especially ExtendedSWHID.__init__). Now, it is under 10%.
-
- Jul 28, 2021
-
-
vlorentz authored
This caused `JournalClientOffsetRanges` to ignore the last batch of messages in each assignment, because `JournalClient.handle_messages` deserializes all messages in the batch before calling the worker function, and raising `EOFError` from `deserialize_message` makes it exit early (before calling the worker fn). Additionally, it doesn't make much sense for a `deserialize_message` function to raise this kind of exception.

Instead, this commit removes the explicit `raise EOFError` and tells `JournalClient` to stop on EOF. `deserialize_message` calls `handle_offset`, which updates the assignment of the Kafka consumer to be the empty set, which causes it to reach EOF (since there are no more partitions to read from).
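For illustration, stopping on EOF is a standard knob of swh.journal's client; a sketch with placeholder broker and group_id values:

    from swh.journal.client import JournalClient

    client = JournalClient(
        brokers=["kafka1:9092"],        # placeholder broker
        group_id="swh-dataset-export",  # placeholder group id
        object_types=["revision"],
        stop_on_eof=True,  # return once all assigned partitions hit EOF
    )
    client.process(lambda objects: None)  # no-op worker fn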
-
- Jul 27, 2021
-
-
vlorentz authored
When running outside a TTY (e.g. in Docker), progress bar updates each appear on a new line, so it's harder to find the current step if it is not shown on the same line.
-