  1. Apr 16, 2021
  2. Apr 15, 2021
  3. Apr 14, 2021
  4. Apr 13, 2021
  5. Apr 08, 2021
  6. Mar 29, 2021
  7. Mar 26, 2021
  8. Mar 23, 2021
  9. Mar 15, 2021
  10. Feb 12, 2021
    • Antoine Pietri's avatar
      Refactor export paths in the base Exporter class · bf8d2625
      Antoine Pietri authored
      Previously, each exporter had to be given a custom __init__ argument
      containing the export path. This behavior is now shared in the base
      Exporter class.
      
      This required a change in the way GraphEdgesExporter behaved, since it
      is no longer instantiated with an export path that already contains the
      obj_type. Its behavior is now closer to ORCExporter, where a new
      writer is created on the fly in the appropriate directory for each
      object type.
      bf8d2625
    • Antoine Pietri's avatar
      ORC exporter: Add unit tests · 35253c89
      Antoine Pietri authored
      These tests do a round trip between the relational and ORC formats, using
      the test data from swh-model. This lets us check that the exported data
      is readable and not corrupted.
      35253c89
    • Antoine Pietri's avatar
      Add ORC exporter · cf125983
      Antoine Pietri authored
      This new exporter allows us to export the SWH graph dataset as a set of
      relational tables in a static columnar format called ORC. These tables
      can then be uploaded to data processing engines such as Amazon Athena,
      BigQuery or Azure Databricks.
      
      This replaces the old scripts that extracted the data directly from the
      PostgreSQL database; the export is now integrated with the journal instead.
      A notable change is that we now use the ORC format instead of Parquet,
      as it supports streamed writes, which simplifies the data buffering and
      allows for larger dataset files.
      cf125983
  11. Dec 17, 2020
  12. Dec 16, 2020
  13. Dec 15, 2020
  14. Dec 11, 2020
    • Antoine Pietri's avatar
      1a33cba7
    • Antoine Pietri's avatar
      Exporter documentation fixes · 35d9ef00
      Antoine Pietri authored
      35d9ef00
    • Antoine Pietri's avatar
      Rewrite of the export pipeline using Exporters · f7cd059b
      Antoine Pietri authored
      Summary:
      This rewrite enables multiple things:
      
      1. Each export can now have multiple exporters, so we can read the
         journal a single time, then export the objects we read in different
         formats without having to re-read them every time.
      
      2. We use a shared on-disk set for the nodes, to avoid storing them
         unnecessarily in each exporter.
      
      3. The SQLite files are sharded depending on the partition ID of the
         incoming messages. This reduces performance issues we had when using
         a single large set per process. It's also now easier to rewrite the
         on-disk set logic to use a different set backend, or to change the
         sharding.
      
      4. The new abstractions make it a lot nicer to write exporters. You just
         need to override the methods corresponding to each object type, and
         you can do your setup and teardown in the __enter__ and __exit__
         methods of your exporter, which is used as a context manager.
         Exporters also don't have to worry about duplicates, since this is
         already done in the journal processor itself.
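
The sharding described in point 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this commit: the class name ShardedSQLiteSet, the shard file naming and the table layout are all hypothetical. It keeps one SQLite file per partition ID and reports whether a key was already seen.

```python
import os
import sqlite3
import tempfile


class ShardedSQLiteSet:
    """Hypothetical on-disk set, split into one SQLite file per partition ID."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self._shards = {}  # partition ID -> open sqlite3 connection

    def _shard(self, partition_id: int) -> sqlite3.Connection:
        # Open (and create if needed) the SQLite file for this partition.
        conn = self._shards.get(partition_id)
        if conn is None:
            path = os.path.join(self.directory, f"shard-{partition_id}.sqlite")
            conn = sqlite3.connect(path)
            conn.execute("CREATE TABLE IF NOT EXISTS seen (key BLOB PRIMARY KEY)")
            self._shards[partition_id] = conn
        return conn

    def add(self, partition_id: int, key: bytes) -> bool:
        """Insert key into its shard; return True only if it was not there yet."""
        conn = self._shard(partition_id)
        cursor = conn.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (key,)
        )
        conn.commit()
        return cursor.rowcount == 1

    def close(self) -> None:
        for conn in self._shards.values():
            conn.close()
        self._shards.clear()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        nodes = ShardedSQLiteSet(tmp)
        print(nodes.add(0, b"node-a"))  # first insertion in shard 0
        print(nodes.add(0, b"node-a"))  # duplicate in the same shard
        nodes.close()
```

Because each shard is an independent file, swapping the backend or changing the sharding key only touches the _shard method in this sketch.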
      
      Differential Revision: https://forge.softwareheritage.org/D4718
      f7cd059b
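The exporter abstraction described in the commit above (one method per object type, setup and teardown in __enter__ and __exit__, deduplication done by the processor) might look roughly like this. It is a minimal sketch, not the actual swh-dataset API: the names Exporter, process_object, RevisionCounter and run_export are assumptions made for illustration, and an in-memory set stands in for the shared on-disk set.

```python
import contextlib
from typing import Any, Dict, Iterable, List, Tuple


class Exporter:
    """Hypothetical base class: subclasses override process_<object_type>."""

    def __enter__(self) -> "Exporter":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        return None

    def process_object(self, object_type: str, obj: Dict[str, Any]) -> None:
        # Dispatch to the method matching the object type, if one is defined.
        handler = getattr(self, "process_" + object_type, None)
        if handler is not None:
            handler(obj)


class RevisionCounter(Exporter):
    """Toy exporter that only counts revisions."""

    def __enter__(self) -> "RevisionCounter":
        self.count = 0       # setup
        self.closed = False
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.closed = True   # teardown

    def process_revision(self, obj: Dict[str, Any]) -> None:
        self.count += 1


def run_export(
    messages: Iterable[Tuple[str, Dict[str, Any]]],
    exporters: List[Exporter],
) -> None:
    """Read the messages once and hand each new object to every exporter."""
    seen = set()  # stand-in for the shared on-disk set
    with contextlib.ExitStack() as stack:
        active = [stack.enter_context(exporter) for exporter in exporters]
        for object_type, obj in messages:
            key = (object_type, obj["id"])
            if key in seen:
                continue  # duplicates are filtered before reaching exporters
            seen.add(key)
            for exporter in active:
                exporter.process_object(object_type, obj)
```

Passing several exporters to run_export in this sketch mirrors point 1: the input is read a single time and each exporter writes its own format.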
    • Antoine Pietri's avatar
      Graph export: add labels to the export CSV format · f1952316
      Antoine Pietri authored
      We want to have the labels in the edge dataset to differentiate between
      the file names and branch names that have the same src/dst.
      
      This changes the format of the edge files to be:
      
      <SRC> <DST> [LABEL] [PERMISSION]
      
      where LABEL is the optional base64-encoded file or branch name, and
      PERMISSION is an optional base-10 integer giving the permission of the
      file.
      f1952316
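As an illustration of the edge line format in the commit above, here is a small sketch that builds one line. The function name format_edge and the example identifiers are hypothetical, and the sketch assumes that fields are separated by a single space, that the standard base64 alphabet is used, and that PERMISSION only appears when a LABEL is present; the commit message does not state these details.

```python
import base64
from typing import Optional


def format_edge(
    src: str,
    dst: str,
    label: Optional[bytes] = None,
    permission: Optional[int] = None,
) -> str:
    """Build one '<SRC> <DST> [LABEL] [PERMISSION]' line (assumed layout)."""
    fields = [src, dst]
    if label is not None:
        # LABEL: base64-encoded file or branch name.
        fields.append(base64.b64encode(label).decode("ascii"))
        if permission is not None:
            # PERMISSION: integer written in base 10.
            fields.append(str(permission))
    return " ".join(fields)


if __name__ == "__main__":
    # Placeholder node identifiers, not real SWHIDs.
    print(format_edge("dir-node", "cnt-node", b"README.md", 0o100644))
    print(format_edge("rev-node", "dir-node"))
```

With the label present, two edges sharing the same src/dst but different file names produce distinct lines, which is the differentiation the commit message asks for.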
  15. Dec 08, 2020
  16. Sep 25, 2020
  17. Sep 23, 2020
  18. Sep 18, 2020
  19. Sep 17, 2020