  1. Sep 09, 2021
  2. Aug 06, 2021
  3. Jul 28, 2021
      journalprocessor: Fix deserialize_message raising EOFError on the last message of each assignment · 202bcdf2
      vlorentz authored
      This caused `JournalClientOffsetRanges` to ignore the last batch of messages
      in each assignment, because `JournalClient.handle_messages` deserializes
      all messages in the batch before calling the worker function;
      and raising `EOFError` from `deserialize_message` makes it exit early
      (before calling the worker fn).
      
      Additionally, it doesn't make much sense for a `deserialize_message`
      function to raise this kind of exception.
      
      Instead, this commit removes the explicit `raise EOFError` and tells
      `JournalClient` to stop on EOF. `deserialize_message` calls
      `handle_offset`, which updates the assignment of the Kafka consumer to
      be the empty set, so the consumer reaches EOF (since there are no more
      partitions to read from).
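The failure mode described in the commit above can be reproduced with a minimal sketch. Everything here is a hypothetical stand-in, not the actual `JournalClient` code: `handle_messages`, `run`, `LAST_OFFSET`, and the two `deserialize_*` functions are assumptions that only mimic the "deserialize the whole batch, then call the worker" ordering.

```python
# Hypothetical stand-ins; this is NOT the real swh journal client code.

LAST_OFFSET = 3  # assumed offset of the last message in the assignment


def handle_messages(raw_batch, deserialize, worker_fn):
    """Deserialize the whole batch first, then hand it to the worker."""
    # If deserialize() raises here, worker_fn is never called for this batch.
    objects = [deserialize(raw) for raw in raw_batch]
    worker_fn(objects)


def run(batches, deserialize, worker_fn):
    """Process batches until one of them raises EOFError."""
    try:
        for batch in batches:
            handle_messages(batch, deserialize, worker_fn)
    except EOFError:
        pass


def deserialize_raising(raw):
    # Old behaviour: signal the end of the assignment by raising.
    if raw == LAST_OFFSET:
        raise EOFError
    return raw


def deserialize_plain(raw):
    # New behaviour: return the message; stopping is left to the caller.
    return raw


def processed_with(deserialize):
    """Return the messages that actually reached the worker function."""
    seen = []
    run([[0, 1], [2, 3]], deserialize, seen.extend)
    return seen


print(processed_with(deserialize_raising))  # [0, 1]
print(processed_with(deserialize_plain))    # [0, 1, 2, 3]
```

With the raising variant, the whole final batch `[2, 3]` is lost, not just the last message, because the exception escapes before the worker function runs.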
  4. Jul 27, 2021
  5. Jun 09, 2021
  6. Apr 19, 2021
  7. Apr 17, 2021
  8. Apr 16, 2021
  9. Apr 15, 2021
  10. Apr 14, 2021
  11. Apr 13, 2021
  12. Apr 08, 2021
  13. Mar 29, 2021
  14. Mar 26, 2021
  15. Mar 23, 2021
  16. Mar 15, 2021
  17. Feb 12, 2021
      Refactor export paths in the base Exporter class · bf8d2625
      Antoine Pietri authored
      Previously, each exporter had to be given a custom __init__ argument
      containing the export path. This behavior is now shared in the base
      Exporter class.
      
      This required a change in the way GraphEdgesExporter behaved, since it
      is no longer instantiated with an export path that already contains the
      obj_type. Its behavior is now closer to ORCExporter, where a new
      writer is created on the fly in the appropriate directory for each
      object type.
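The pattern described in the commit above (a base class that owns the export path, and a subclass that creates one writer per object type on first use) could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual implementation: `EdgesExporter`, `get_writer`, `write_edge`, the `edges.txt` file name, and the line format are all hypothetical.

```python
import os
import tempfile


class Exporter:
    """Base class that owns the export path (hypothetical sketch)."""

    def __init__(self, export_path):
        self.export_path = export_path
        os.makedirs(export_path, exist_ok=True)


class EdgesExporter(Exporter):
    """Creates one writer per object type, lazily, under export_path."""

    def __init__(self, export_path):
        super().__init__(export_path)
        self.writers = {}

    def get_writer(self, obj_type):
        # Open the writer on first use, in a directory named after obj_type.
        if obj_type not in self.writers:
            obj_dir = os.path.join(self.export_path, obj_type)
            os.makedirs(obj_dir, exist_ok=True)
            path = os.path.join(obj_dir, "edges.txt")  # assumed file name
            self.writers[obj_type] = open(path, "w")
        return self.writers[obj_type]

    def write_edge(self, obj_type, src, dst):
        self.get_writer(obj_type).write(f"{src} {dst}\n")

    def close(self):
        for writer in self.writers.values():
            writer.close()
        self.writers.clear()


def demo():
    """Export a few edges and return the per-type directories created."""
    with tempfile.TemporaryDirectory() as root:
        exporter = EdgesExporter(root)
        exporter.write_edge("revision", "rev1", "dir1")
        exporter.write_edge("release", "rel1", "rev1")
        exporter.write_edge("revision", "rev2", "dir2")
        exporter.close()
        return sorted(os.listdir(root))


print(demo())  # ['release', 'revision']
```

The subclass no longer needs an export path that already contains the object type; the directory is derived from `obj_type` at write time.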
      ORC exporter: Add unit tests · 35253c89
      Antoine Pietri authored
      These tests do a roundtrip between relational and ORC formats, using the
      test data from swh-model. It allows us to check that the data is
      properly readable and not corrupted.
      Add ORC exporter · cf125983
      Antoine Pietri authored
      This new exporter allows us to export the SWH graph dataset as a set of
      relational tables in a static columnar format called ORC. This can then
      be loaded into data processing engines like Amazon Athena, BigQuery or
      Azure Databricks.
      
      This replaces the old scripts that extracted the data directly from the
      PostgreSQL database; the export is now integrated with the journal instead.
      A notable change is that we now use the ORC format instead of Parquet,
      as it supports streamed writes, which simplifies the data buffering and
      allows for larger dataset files.
  18. Dec 17, 2020
  19. Dec 16, 2020