  1. Jan 06, 2023
    • Add an ORC file loading function · ce372d96
      David Douard authored
      This generates swh.model objects from ORC files, which should make it
      possible to rebuild a storage from an ORC dataset of the archive.
      
      Note: not all object types are supported yet (e.g. ExtID and
            metadata-related objects are not).
  2. Apr 07, 2022
  3. Apr 06, 2022
  4. Apr 01, 2022
  5. Mar 31, 2022
  6. Mar 30, 2022
    • Improve the progress reporting look and feel · 58b9f92e
      David Douard authored
      - use shortened values in the progress bar (e.g. '11.3M/566M' instead of
        something like '11310201/566123456'),
      - shorten the description strings and align them.
      
      The result looks like:
      
      Exporting release:
        - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
        - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
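The shortened values mentioned in this commit can be produced with a small formatter. The sketch below is only an illustration: the helper name `shorten` is hypothetical, and the commit does not say how the values are actually computed (progress bar libraries such as tqdm offer a `unit_scale` option for the same purpose).

```python
def shorten(count: int) -> str:
    """Format a non-negative count with 3 significant digits and an SI suffix."""
    value = float(count)
    for suffix in ("", "k", "M", "G", "T"):
        # From 999.5 on, 3 significant digits would round up to 1000,
        # so switch to the next unit instead.
        if value < 999.5:
            return f"{value:.3g}{suffix}"
        value /= 1000
    return f"{value:.3g}P"


print(f"{shorten(11310201)}/{shorten(566123456)}")  # 11.3M/566M
```

Small counts such as `64` or `130` are left untouched, which matches the sample output shown in the commit message.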
    • Delay the unsubscribe to the end of handle_messages · be3c5da2
      David Douard authored
      This prevents a possible race condition leading to Kafka errors like:
      
        rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed
      
      This could occur when, at the time of unsubscribing from a partition,
      another partition is also depleted. Since unsubscribing consists of
      resubscribing to all partitions except the unsubscribed ones, we could
      try to subscribe to such a depleted partition, leading to the error
      message listed above.
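The ordering described in this commit can be illustrated with a toy model. Everything here is an assumption made for the example: `FakeConsumer`, the free function `handle_messages` and the `last_offsets` mapping are hypothetical stand-ins, not the swh code, and the sketch does not reproduce the librdkafka assertion itself. It only shows depleted partitions being collected during the batch and removed in a single reassignment at the end.

```python
class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer: it only records assignments."""

    def __init__(self, partitions):
        self.assigned = set(partitions)
        self.assign_calls = []

    def assign(self, partitions):
        self.assigned = set(partitions)
        self.assign_calls.append(sorted(self.assigned))


def handle_messages(consumer, batch, last_offsets):
    """Process a batch of (partition, offset) pairs, then unsubscribe once."""
    depleted = set()
    for partition, offset in batch:
        # ... the message would be deserialized and processed here ...
        if offset >= last_offsets[partition]:
            # Only remember the partition; do not touch the assignment yet.
            depleted.add(partition)
    if depleted:
        # Single reassignment at the end: every partition that ran dry during
        # the batch is excluded, so none of them is subscribed to again.
        consumer.assign(consumer.assigned - depleted)
    return depleted


consumer = FakeConsumer({0, 1, 2})
done = handle_messages(consumer, [(0, 5), (1, 3), (2, 4)], {0: 5, 1: 3, 2: 9})
print(sorted(done), consumer.assign_calls)  # [0, 1] [[2]]
```

Partitions 0 and 1 both run dry in the same batch, yet the consumer is reassigned only once, to the remaining partition 2.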
    • Move exporter config entries in dedicated sections · e01daba4
      David Douard authored
      E.g. config entries specific to the ORC exporter are now under the
      'orc' section, like:
      
        journal:
          brokers: [...]
      
        orc:
          remove_pull_requests: true
          max_rows:
            revision: 100000
            directory: 10000
    • Add support for limited row numbers in ORC files · 3df08fd7
      David Douard authored
      Make it possible to specify a maximum number of rows a table can store
      in a single ORC file. The limit can only be set on main tables for now
      (i.e. cannot be specified for tables like revision_history or
      directory_entry).
      
      This can be set by configuration only (there is no extra CLI option).
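A per-table row limit amounts to starting a new file whenever the current one is full. The sketch below is a minimal illustration, assuming the `orc` / `max_rows` configuration layout shown elsewhere in this log; `RotatingTableWriter` is a hypothetical name, and plain Python lists stand in for ORC files so the example stays self-contained.

```python
class RotatingTableWriter:
    """Hypothetical writer: each inner list stands in for one ORC file."""

    def __init__(self, table, config):
        self.table = table
        # No entry for the table means no limit.
        self.max_rows = config.get("orc", {}).get("max_rows", {}).get(table)
        self.files = []

    def write(self, row):
        current_is_full = (
            self.max_rows is not None
            and self.files
            and len(self.files[-1]) >= self.max_rows
        )
        if not self.files or current_is_full:
            self.files.append([])  # start a new "file"
        self.files[-1].append(row)


config = {"orc": {"max_rows": {"revision": 3}}}  # tiny limit for the demo

limited = RotatingTableWriter("revision", config)
unlimited = RotatingTableWriter("release", config)
for i in range(7):
    limited.write({"id": i})
    unlimited.write({"id": i})

print([len(f) for f in limited.files])    # [3, 3, 1]
print([len(f) for f in unlimited.files])  # [7]
```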
  7. Mar 29, 2022
  8. Feb 10, 2022
  9. Feb 07, 2022
  10. Jan 25, 2022
  11. Jan 20, 2022
  12. Dec 16, 2021
  13. Oct 11, 2021
  14. Sep 13, 2021
    • Commit Kafka messages whose offset has reached the high limit · 94be817f
      David Douard authored
      This is necessary to ensure these messages are committed in Kafka.
      Otherwise, since the partition (considered empty) is unsubscribed from,
      it never gets committed in `JournalClient.handle_messages()` (since the
      latter only commits assigned partitions).
      
      Ensure offsets are committed only after worker_fn has executed without
      error.
      
      This requires overriding the `JournalClient.handle_messages()` method in
      `JournalClientOffsetRanges` to make sure "pending" messages are
      committed after the proper execution of `worker_fn`.
      
      Doing so, we can both unsubscribe from "eof" partitions on the fly (with
      "eof" meaning when the partition has been consumed up to the high
      watermark offset at the beginning of the export), and commit ALL offsets
      that need to be, but only after proper execution of the `worker_fn`
      callback.
      
      This should guarantee proper and consistent behavior (famous last
      words...).
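The "commit only after `worker_fn` succeeded" rule can be sketched as follows. This is not the `JournalClientOffsetRanges` code: the free function `handle_messages`, its `commit` callback and the `(partition, offset, value)` tuples are assumptions made for the example. The only Kafka fact relied upon is that a committed offset designates the next message to read, hence the `offset + 1`.

```python
def handle_messages(messages, worker_fn, commit):
    """Run worker_fn on a batch of (partition, offset, value) tuples, then commit."""
    pending = {}
    values = []
    for partition, offset, value in messages:
        values.append(value)
        # Kafka convention: the committed offset is the next offset to read.
        pending[partition] = max(pending.get(partition, 0), offset + 1)
    worker_fn(values)  # if this raises, commit() below is never reached
    commit(pending)
    return pending


committed = []
handle_messages([(0, 4, "a"), (0, 5, "b"), (1, 3, "c")], print, committed.append)
print(committed)  # [{0: 6, 1: 4}]


def failing_worker(values):
    raise RuntimeError("worker failed")


try:
    handle_messages([(0, 6, "d")], failing_worker, committed.append)
except RuntimeError:
    pass
print(len(committed))  # 1
```

Because the pending offsets are kept per partition rather than per assignment, a partition that was unsubscribed from during the batch still gets its last offset committed.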
    • Add a JournalClientOffsetRanges.unsubscribe() method · a3c1f390
      David Douard authored
      to make the code a bit clearer.
    • Fix a missing f-string prefix · 0425bdea
      David Douard authored
    • Reduce the size of the progress bar · 358d8493
      David Douard authored
      so we get a chance to actually have a visible progress bar:
      
      - reduce the label size (shorter desc),
      - use a single 'workers' postfix (like "workers=n/m").
  15. Sep 10, 2021
    • Make sure the progress bar for the export reaches 100% · 47713ee3
      David Douard authored
      - ensure the last offset is sent to the queue,
      - fix the computation of the progress value (off-by-one).
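The commit does not show the formula that was fixed, so the following is only a plausible illustration of this kind of off-by-one. It assumes `hi` is the Kafka high watermark, i.e. the offset of the next message to be produced, so the last existing message has offset `hi - 1`. The function name `progress` is hypothetical.

```python
def progress(offset, lo, hi):
    """Fraction of the range [lo, hi) done once message `offset` is processed."""
    total = hi - lo
    if total <= 0:
        return 1.0  # nothing to export on this partition
    # +1 because the message at `offset` itself has been processed.
    return (offset - lo + 1) / total


lo, hi = 0, 130
last = hi - 1                  # offset of the last message in the partition
print(progress(last, lo, hi))  # 1.0
print((last - lo) / (hi - lo) < 1.0)  # True: without the +1, 100% is never reached
```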
    • Explicitly close the temporary kafka consumer in `get_offsets` · d07b2a63
      David Douard authored
      The temporary consumer is used to retrieve partitions and lo/hi
      offsets.
      
      Leaving it open could sometimes cause deadlocks or long timeouts
      (especially in the developer Docker environment).
    • Simplify the lo/high partition offset computation · 2760e322
      David Douard authored
      The computation of lo and high offsets used to be done in 2 steps:
      - first get the watermark offsets (thus the absolute min and max offsets
        of the whole partition),
      - then, as a "hook" in `process()`, retrieve the last committed offset
        for the partition and "push" these current offsets in the progress
        queue.
      
      Instead, this simplifies the process a bit by querying the committed
      offsets while computing the hi/low offsets.
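The single-step computation can be sketched like this. The stub below is hypothetical and deliberately simplified: a real Kafka client exposes comparable calls for watermark and committed offsets, but with different signatures (for instance taking topic/partition objects), and the actual `get_offsets` implementation is not shown in this log. Here `committed()` returning `None` is an assumption meaning "nothing committed yet".

```python
class FakeConsumer:
    """Hypothetical stub exposing watermark and committed offsets per partition."""

    def __init__(self, watermarks, committed):
        self._watermarks = watermarks  # partition -> (low, high)
        self._committed = committed    # partition -> committed offset

    def get_watermark_offsets(self, partition):
        return self._watermarks[partition]

    def committed(self, partition):
        return self._committed.get(partition)  # None if nothing was committed


def get_offsets(consumer, partitions):
    """Return {partition: (lo, hi)}, starting from the committed offset if any."""
    offsets = {}
    for partition in partitions:
        low, high = consumer.get_watermark_offsets(partition)
        committed = consumer.committed(partition)
        # Resume from the committed offset, but never below the low watermark.
        lo = low if committed is None else max(low, committed)
        offsets[partition] = (lo, high)
    return offsets


consumer = FakeConsumer({0: (0, 130), 1: (10, 64)}, {0: 42})
print(get_offsets(consumer, [0, 1]))  # {0: (42, 130), 1: (10, 64)}
```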
  16. Sep 09, 2021
  17. Aug 06, 2021
  18. Jul 28, 2021
    • journalprocessor: Fix deserialize_message raising EOFError on the last message of each assignment · 202bcdf2
      vlorentz authored
      This caused `JournalClientOffsetRanges` to ignore the last batch of messages
      in each assignment, because `JournalClient.handle_messages` deserializes
      all messages in the batch before calling the worker function;
      and raising `EOFError` from `deserialize_message` makes it exit early
      (before calling the worker fn).
      
      Additionally, it doesn't make much sense for a `deserialize_message`
      function to raise this kind of exception.
      
      Instead, this commit removes the explicit `raise EOFError`, and tells
      `JournalClient` to stop on EOF. `deserialize_message` calls
      `handle_offset`, which updates the assignment of the Kafka consumer to
      be the empty set, which causes it to be EOF (since there are no more
      partitions to read from).
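The failure mode described in this commit can be reproduced with a toy batch handler. The names below (`Reader`, `raise_on_eof`, the free function `handle_messages`) are hypothetical and much simpler than the real `JournalClient`; the sketch only shows that an exception raised while deserializing a batch prevents the worker from ever seeing that batch, whereas recording the end-of-input state lets the batch go through first.

```python
class Reader:
    """Hypothetical deserializer that knows the last offset to export."""

    def __init__(self, last_offset, raise_on_eof):
        self.last_offset = last_offset
        self.raise_on_eof = raise_on_eof
        self.eof = False

    def deserialize(self, message):
        offset, payload = message
        if offset >= self.last_offset:
            self.eof = True
            if self.raise_on_eof:  # old behavior
                raise EOFError
        return payload.upper()


def handle_messages(batch, reader, worker_fn):
    # The whole batch is deserialized before the worker is called.
    objects = [reader.deserialize(message) for message in batch]
    worker_fn(objects)
    return not reader.eof  # False tells the caller to stop consuming


batch = [(0, "a"), (1, "b"), (2, "c")]

lost = []
try:
    handle_messages(batch, Reader(2, raise_on_eof=True), lost.extend)
except EOFError:
    pass
print(lost)  # []

kept = []
keep_going = handle_messages(batch, Reader(2, raise_on_eof=False), kept.extend)
print(kept, keep_going)  # ['A', 'B', 'C'] False
```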
  19. Jul 27, 2021