- Apr 16, 2021
-
-
Antoine Pietri authored
-
- Apr 15, 2021
-
-
Antoine Pietri authored
-
- Apr 14, 2021
-
-
Antoine Pietri authored
-
Antoine Pietri authored
Before that fix, we could spam the same partition reassignments because we would keep calling this function for each message we receive even after a partition has been unassigned.
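The guard described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from this commit: `PartitionTracker`, `unassign` and the `reassign` callback are hypothetical names, and the real consumer API is not shown in this log.

```python
from typing import Callable, Iterable, List, Set


class PartitionTracker:
    """Hypothetical helper: remembers which partitions are still assigned."""

    def __init__(
        self, partitions: Iterable[int], reassign: Callable[[List[int]], None]
    ) -> None:
        self.assigned: Set[int] = set(partitions)
        # `reassign` stands in for whatever call updates the consumer's
        # assignment; its real name and signature are assumptions here.
        self.reassign = reassign

    def unassign(self, partition: int) -> bool:
        """Unassign a partition once; later calls for it are no-ops."""
        if partition not in self.assigned:
            # Already unassigned: do not trigger another reassignment,
            # even if more messages from this partition are still arriving.
            return False
        self.assigned.discard(partition)
        self.reassign(sorted(self.assigned))
        return True
```

With this guard, calling `unassign` for every received message is harmless: only the first call per partition triggers a reassignment.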
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
- Apr 13, 2021
-
-
vlorentz authored
-
- Apr 08, 2021
-
-
vlorentz authored
-
- Mar 29, 2021
-
-
vlorentz authored
-
- Mar 26, 2021
-
-
Antoine Pietri authored
Summary: SQLite seems to slow down over time when a lot of data is inserted in it. This is an attempt at using LevelDB to have more efficient node sets.
Differential Revision: https://forge.softwareheritage.org/D5315
-
- Mar 23, 2021
-
-
Antoine Pietri authored
-
- Mar 15, 2021
-
-
David Douard authored
swhid() has been removed from swh.model.identifiers since swh-model v1.0.1.
-
- Feb 12, 2021
-
-
Antoine Pietri authored
-
Antoine Pietri authored
Previously, each exporter had to be given a custom __init__ argument containing the export path. This behavior is now shared in the base Exporter class. This required a change in the way GraphEdgesExporter behaved, since it is no longer instantiated with an export path that already contains the obj_type. Its behavior is now closer to ORCExporter, where a new writer is created on the fly in the appropriate directory for each object type.
-
Antoine Pietri authored
These tests do a roundtrip between relational and ORC formats, using the test data from swh-model. This allows us to check that the data is properly readable and not corrupted.
-
Antoine Pietri authored
This new exporter allows us to export the SWH graph dataset as a set of relational tables in a static columnar format called ORC. The tables can then be uploaded to data processing engines like Amazon Athena, BigQuery or Azure Databricks. This replaces the old scripts that extracted the data directly from the PostgreSQL database; the export is now integrated with the journal instead. A notable change is that we now use the ORC format instead of Parquet, as it supports streamed writes, which simplifies the data buffering and allows for larger dataset files.
-
- Dec 17, 2020
-
-
Antoine Pietri authored
-
- Dec 16, 2020
-
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
- Dec 15, 2020
-
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
- Dec 11, 2020
-
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
Summary: This rewrite enables multiple things:
1. Each export can now have multiple exporters, so we can read the journal a single time, then export the objects we read in different formats without having to re-read them every time.
2. We use a shared on-disk set for the nodes, to avoid storing them unnecessarily in each exporter
3. The SQLite files are sharded depending on the partition ID of the incoming messages. This reduces performance issues we had when using a single large set per process. It's also now easier to rewrite the on-disk set logic to use a different set backend, or to change the sharding.
4. The new abstractions make it a lot nicer to write exporters. You just need to override the methods corresponding to each object type, and you can do your setup and teardown in the __enter__ and __exit__ methods of your exporter, which is used as a context manager. Exporters also don't have to worry about duplicates, since this is already done in the journal processor itself.
Differential Revision: https://forge.softwareheritage.org/D4718
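The design described in this summary can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: the method names (`process_object`, `process_revision`), the `JournalProcessor` and `run_export` helpers, and the message shape are all assumptions, and in-memory Python sets stand in for the sharded on-disk SQLite sets.

```python
from contextlib import ExitStack
from typing import Any, Dict, Iterable, List, Set, Tuple

Message = Tuple[str, int, Dict[str, Any]]  # (object type, partition ID, object)


class Exporter:
    """Hypothetical base class: override one process_<type> method per object type."""

    def __enter__(self) -> "Exporter":
        return self  # setup goes here in subclasses

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        return None  # teardown goes here in subclasses

    def process_object(self, object_type: str, obj: Dict[str, Any]) -> None:
        # Dispatch to process_<object_type> if the subclass defines it.
        handler = getattr(self, "process_" + object_type, None)
        if handler is not None:
            handler(obj)


class RevisionListExporter(Exporter):
    """Toy exporter that only records revision IDs."""

    def __init__(self) -> None:
        self.ids: List[str] = []
        self.closed = False

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        self.closed = True

    def process_revision(self, revision: Dict[str, Any]) -> None:
        self.ids.append(revision["id"])


class JournalProcessor:
    """Deduplicates objects once, then hands them to every exporter."""

    def __init__(self, exporters: List[Exporter]) -> None:
        self.exporters = exporters
        # One node set per partition ID (stand-in for the sharded on-disk sets).
        # Assumes a given object always arrives on the same partition.
        self.node_sets: Dict[int, Set[str]] = {}

    def process(self, object_type: str, partition: int, obj: Dict[str, Any]) -> None:
        shard = self.node_sets.setdefault(partition, set())
        if obj["id"] in shard:
            return  # duplicate: exporters never see it
        shard.add(obj["id"])
        for exporter in self.exporters:
            exporter.process_object(object_type, obj)


def run_export(messages: Iterable[Message], exporters: List[Exporter]) -> None:
    """Read the messages a single time and feed all exporters."""
    processor = JournalProcessor(exporters)
    with ExitStack() as stack:
        for exporter in exporters:
            stack.enter_context(exporter)
        for object_type, partition, obj in messages:
            processor.process(object_type, partition, obj)
```

Keeping deduplication in the processor means each exporter only contains format-specific logic, and swapping the set backend or the sharding only touches `JournalProcessor`.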
-
Antoine Pietri authored
We want to have the labels in the edge dataset to differentiate between the file names and branch names that have the same src/dst. This changes the format of the edge files to be:
<SRC> <DST> [LABEL] [PERMISSION]
Where LABEL is an optional base64-encoded label of the file or branch name, and PERMISSION an integer in base 10 corresponding to the permission of the file.
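As an illustration of this line format, here is a minimal sketch of how one edge line could be built. The function name `format_edge` is hypothetical, and the sketch assumes that a permission is only written when a label is present, which the message above does not state explicitly.

```python
import base64
from typing import Optional


def format_edge(
    src: str,
    dst: str,
    label: Optional[bytes] = None,
    permission: Optional[int] = None,
) -> str:
    """Build one line in the `<SRC> <DST> [LABEL] [PERMISSION]` format."""
    fields = [src, dst]
    if label is not None:
        # File and branch names are raw bytes; base64 keeps the line
        # free of spaces and newlines.
        fields.append(base64.b64encode(label).decode("ascii"))
    if permission is not None:
        if label is None:
            # Assumption of this sketch: without a label, the permission
            # field would be ambiguous when reading the line back.
            raise ValueError("permission requires a label")
        fields.append(str(permission))  # base 10 integer
    return " ".join(fields)
```

A reader of such a file can split each line on spaces and base64-decode the third field, if present, to recover the original name.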
-
- Dec 08, 2020
-
-
Antoine Pietri authored
Differential Revision: https://forge.softwareheritage.org/D4691
-
- Sep 25, 2020
-
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
- Sep 23, 2020
-
-
David Douard authored
-
- Sep 18, 2020
-
-
Thibault Allançon authored
Use the new SWHID naming convention instead of SWH PID.
-
- Sep 17, 2020
-
-
Antoine Lambert authored
Related to T2610
-