- Sep 09, 2021
-
-
Nicolas Dandrimont authored
-
- Aug 06, 2021
-
-
vlorentz authored
Before this commit, 30 to 40% of the run time was spent in this function (especially in `ExtendedSWHID.__init__`). It is now under 10%.
-
- Jul 28, 2021
-
-
vlorentz authored
This caused `JournalClientOffsetRanges` to ignore the last batch of messages in each assignment: `JournalClient.handle_messages` deserializes all messages in a batch before calling the worker function, so raising `EOFError` from `deserialize_message` made it exit early, before the worker function was called. Additionally, it doesn't make much sense for a `deserialize_message` function to raise this kind of exception. Instead, this commit removes the explicit `raise EOFError` and tells `JournalClient` to stop on EOF. `deserialize_message` calls `handle_offset`, which updates the assignment of the Kafka consumer to be the empty set; the consumer then reaches EOF, since there are no more partitions to read from.
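The early-exit problem described above can be reproduced with a small toy model. This is only a sketch: `handle_messages`, `run` and the two deserializer factories are hypothetical stand-ins, not the real `JournalClient` code, and no Kafka consumer is involved. It assumes only what the message states, namely that the whole batch is deserialized before the worker function runs.

```python
def handle_messages(raw_batch, deserialize, worker_fn):
    """Toy batch handler: deserialize every message, then call the worker."""
    objects = [deserialize(raw) for raw in raw_batch]
    worker_fn(objects)


def raising_deserializer(state):
    """Old behaviour (hypothetical model): raise EOFError at the last offset."""
    def deserialize(raw):
        if raw == 3:
            raise EOFError  # aborts the batch before worker_fn is called
        return raw
    return deserialize


def flagging_deserializer(state):
    """New behaviour (hypothetical model): record EOF, keep the message."""
    def deserialize(raw):
        if raw == 3:
            state["eof"] = True  # the caller stops once the batch is handled
        return raw
    return deserialize


def run(deserializer_factory):
    processed = []
    state = {"eof": False}
    batch = [1, 2, 3]  # 3 plays the role of the last offset of the assignment
    try:
        handle_messages(batch, deserializer_factory(state), processed.extend)
    except EOFError:
        pass  # the whole batch is lost, including messages 1 and 2
    return processed, state["eof"]


old_result = run(raising_deserializer)
new_result = run(flagging_deserializer)
```

In this model the raising variant never reaches the worker, while the flagging variant processes the full batch and still signals that consumption should stop.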
-
- Jul 27, 2021
-
-
vlorentz authored
When running outside a TTY (e.g. in Docker), each progress bar update is printed on a new line, so it is harder to find the current step if it is not shown on the same line.
-
vlorentz authored
The state is only updated after consuming messages, so if there are no messages at all, `JournalClientOffsetRanges` never notices that there is nothing to wait for.
-
vlorentz authored
`export_id` was used both for the UUID itself and for the `group_id` (built from the UUID).
-
- Jun 09, 2021
-
-
Antoine Lambert authored
-
Antoine Lambert authored
-
Antoine Lambert authored
-
- Apr 19, 2021
-
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Lambert authored
Related to T2265
-
- Apr 17, 2021
-
-
Stefano Zacchiroli authored
-
- Apr 16, 2021
-
-
Antoine Pietri authored
-
- Apr 15, 2021
-
-
Antoine Pietri authored
-
- Apr 14, 2021
-
-
Antoine Pietri authored
-
Antoine Pietri authored
Before this fix, we could spam the same partition reassignments, because we kept calling this function for each message received even after a partition had been unassigned.
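One way to picture the guard is the sketch below. It is an assumption about the shape of the fix, not the actual code: `OffsetTracker`, `high_offsets` and `reassign_calls` are hypothetical names, and the real Kafka reassignment call is replaced by a counter.

```python
class OffsetTracker:
    """Toy model of per-partition offset tracking (hypothetical names)."""

    def __init__(self, high_offsets):
        # partition id -> last offset we want to read from that partition
        self.high_offsets = dict(high_offsets)
        self.assigned = set(self.high_offsets)
        self.reassign_calls = 0  # stands in for calls to the Kafka consumer

    def handle_offset(self, partition, offset):
        if partition not in self.assigned:
            # Already unassigned: ignore stray messages, do not reassign again.
            return
        if offset >= self.high_offsets[partition]:
            self.assigned.discard(partition)
            self.reassign_calls += 1


tracker = OffsetTracker({0: 10})
for offset in (9, 10, 11, 12):  # 11 and 12 arrive after the unassignment
    tracker.handle_offset(0, offset)
```

With the early return in place, messages that arrive for an already unassigned partition no longer trigger another reassignment.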
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
- Apr 13, 2021
-
-
vlorentz authored
-
- Apr 08, 2021
-
-
vlorentz authored
-
- Mar 29, 2021
-
-
vlorentz authored
-
- Mar 26, 2021
-
-
Antoine Pietri authored
Summary: SQLite seems to slow down over time when a lot of data is inserted into it. This is an attempt at using LevelDB to get more efficient node sets. Differential Revision: https://forge.softwareheritage.org/D5315
-
- Mar 23, 2021
-
-
Antoine Pietri authored
-
- Mar 15, 2021
-
-
David Douard authored
`swhid()` has been removed from `swh.model.identifiers` since swh-model v1.0.1.
-
- Feb 12, 2021
-
-
Antoine Pietri authored
-
Antoine Pietri authored
Previously, each exporter had to be given a custom `__init__` argument containing the export path. This behavior is now shared in the base `Exporter` class. This required a change in the way `GraphEdgesExporter` behaved, since it is no longer instantiated with an export path that already contains the `obj_type`. Its behavior is now closer to `ORCExporter`, where a new writer is created on the fly in the appropriate directory for each object type.
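The pattern described above might look roughly like the following sketch. `BaseExporter`, `ToyEdgesExporter`, `get_writer`, `write_edge` and the `edges.csv` file name are hypothetical and chosen for illustration; they are not the actual swh-dataset classes or file layout.

```python
import tempfile
from pathlib import Path


class BaseExporter:
    """The export path is handled once, in the base class (hypothetical)."""

    def __init__(self, config, export_path):
        self.config = config
        self.export_path = Path(export_path)


class ToyEdgesExporter(BaseExporter):
    """Creates one writer per object type, lazily, under export_path."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.writers = {}

    def get_writer(self, obj_type):
        if obj_type not in self.writers:
            obj_dir = self.export_path / obj_type
            obj_dir.mkdir(parents=True, exist_ok=True)
            self.writers[obj_type] = open(obj_dir / "edges.csv", "w")
        return self.writers[obj_type]

    def write_edge(self, obj_type, src, dst):
        self.get_writer(obj_type).write(f"{src} {dst}\n")

    def close(self):
        for writer in self.writers.values():
            writer.close()
        self.writers.clear()


def demo():
    """Export two edges and return the per-type directories that were created."""
    with tempfile.TemporaryDirectory() as tmp:
        exporter = ToyEdgesExporter({}, tmp)
        exporter.write_edge("revision", "rev1", "dir1")
        exporter.write_edge("release", "rel1", "rev1")
        exporter.close()
        return sorted(p.name for p in Path(tmp).iterdir())
```

Calling `demo()` shows that the exporter itself decides where each object type goes, so callers only pass the top-level export path.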
-
Antoine Pietri authored
These tests do a round trip between the relational and ORC formats, using the test data from swh-model. They allow us to check that the data is properly readable and not corrupted.
-
Antoine Pietri authored
This new exporter allows us to export the SWH graph dataset as a set of relational tables in a static columnar format called ORC. These tables can then be uploaded to data processing engines like Amazon Athena, BigQuery or Azure Databricks. This replaces the old scripts that extracted the data directly from the PostgreSQL database; the export is now integrated with the journal instead. A notable change is that we now use the ORC format instead of Parquet, as it supports streamed writes, which simplifies the data buffering and allows for larger dataset files.
-
- Dec 17, 2020
-
-
Antoine Pietri authored
-
- Dec 16, 2020
-
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-