- Nov 10, 2022
-
-
vlorentz authored
They are more tuned toward running automatically, as they call each other as needed, and can be imported by workflows defined in other modules (e.g. the future swh.graph.luigi module).
-
vlorentz authored
So it can be reused by a Luigi task
-
vlorentz authored
For some reason, using a non-existent database works when working with credentials with unnecessarily high privileges (though it is not clear to me which permissions allow this).
-
- Nov 04, 2022
-
-
vlorentz authored
-
- Nov 03, 2022
-
-
vlorentz authored
- Oct 18, 2022
-
-
David Douard authored
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4.

Also freeze flake8 dependencies. Also change flake8's repo config to github (the gitlab mirror being outdated).
-
- Sep 08, 2022
-
-
vlorentz authored
-
- Aug 29, 2022
- Jul 06, 2022
-
-
Antoine Pietri authored
-
- Jun 21, 2022
-
-
Nicolas Dandrimont authored
We are removing support for the objstorage computing the object id itself.
-
- May 23, 2022
-
-
Antoine Pietri authored
-
- May 09, 2022
-
-
Pratyush authored
-
- Apr 29, 2022
-
-
Antoine Pietri authored
Significantly improves performance by reducing the number of levels in each DB, and thus reducing the amount of compaction.
-
Antoine Pietri authored
Reviewers: #reviewers, olasd
Reviewed By: #reviewers, olasd
Subscribers: olasd
Differential Revision: https://forge.softwareheritage.org/D7711
-
- Apr 28, 2022
-
-
Antoine Pietri authored
-
- Apr 26, 2022
-
-
David Douard authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
We no longer support exporting the dataset as PostgreSQL dumps. These are of little use for big data analysis; instead, we encourage researchers to use big data engines (Hadoop, Hive, Presto...), which all support the ORC format. Alternatives for importing the dataset into PostgreSQL include https://github.com/HighgoSoftware/orc_fdw/ , or using the swh mirroring pipeline to spawn a local storage instance with a PostgreSQL backend.
-
vlorentz authored
-
- Apr 21, 2022
-
-
Antoine Lambert authored
That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
-
David Douard authored
This option allows restarting Kafka consumption a bit earlier than the last committed offsets. This can be useful for testing and debugging.
-
David Douard authored
This makes it easy to set the list of exported object types. If not set, all supported object types are exported. Note that ``--exclude`` is also applied.
-
- Apr 14, 2022
-
-
Antoine Pietri authored
The origin table now contains the origin URL and a sha1 of the URL as the "ID" field. This allows us to join this table more easily with the SWHIDs retrieved from the compressed graph, as well as to generate the edge dataset without having to compute sha1s manually.
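The ID can thus be recomputed directly from an origin URL when joining by hand; a minimal sketch (the example URL, and the assumption that the URL is hashed as UTF-8 bytes, are mine):

```python
import hashlib

# Recompute the origin "ID" field described above: the sha1 of the origin URL.
# Assumption: the URL is encoded as UTF-8 before hashing.
url = "https://github.com/python/cpython"
origin_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
print(origin_id)  # a 40-character hex digest
```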
-
- Apr 12, 2022
-
-
Antoine Pietri authored
-
- Apr 08, 2022
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Related to T3922
-
Antoine Lambert authored
black is considered stable since release 22.1.0, and the version we are currently using is quite outdated and not compatible with click 8.1.0, so it is time to bump it to its latest stable release. Please note that the E501 pycodestyle warning related to line length is replaced by the B950 one from flake8-bugbear, as recommended by black: https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length

Related to T3922
-
- Apr 07, 2022
-
-
David Douard authored
-
David Douard authored
This feature requires a config parameter `with_data=true` and an objstorage configuration.
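A minimal sketch of what such a configuration might look like, assuming the exporter's usual YAML layout (the objstorage settings shown are placeholders, not a recommendation):

```yaml
with_data: true
objstorage:
  cls: remote
  url: https://objstorage.example.org/
```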
-
- Apr 06, 2022
-
-
Antoine Lambert authored
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7 so we can now drop the pytest pinning.
-
- Apr 01, 2022
-
-
Antoine Pietri authored
-
- Mar 31, 2022
-
-
David Douard authored
Partially reverting 5a8a8a78. This is needed to ensure better compliance with usual ORC semantics, where one directory = one table.
-
- Mar 30, 2022
-
-
David Douard authored
- use shortened values in the progress bar (eg. '11.3M/566M' instead of something like '11310201/566123456')
- reduce the description strings and align them.

The result looks like:

    Exporting release:
    - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
    - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
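The shortening of large counts can be sketched with a small helper (a hypothetical stand-in for illustration; in practice tqdm's own `unit_scale=True` option likely provides this behaviour):

```python
def naturalsize(n: float) -> str:
    """Shorten a large count, e.g. 11310201 -> '11.3M'."""
    for unit in ["", "k", "M", "G", "T"]:
        if abs(n) < 1000:
            # .3g keeps three significant digits, matching '11.3M'-style output
            return f"{n:.3g}{unit}"
        n /= 1000
    return f"{n:.3g}P"


print(naturalsize(11310201), naturalsize(566123456))  # 11.3M 566M
```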
-
David Douard authored
This prevents a possible race condition leading to kafka errors like:

    rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed

This could occur when, at the time of unsubscribing from a partition, another partition is also depleted. Since unsubscribing consists of resubscribing to all partitions except the unsubscribed ones, we could try to subscribe to such a depleted partition, leading to the error message above.
-
David Douard authored
e.g. ORC-exporter-specific config entries are now under the 'orc' section, like:

    journal:
      brokers: [...]
    orc:
      remove_pull_requests: true
      max_rows:
        revision: 100000
        directory: 10000
-
David Douard authored
Make it possible to specify the maximum number of rows a table can store in a single ORC file. The limit can only be set on main tables for now (i.e. it cannot be specified for tables like revision_history or directory_entry). This can be set by configuration only (no extra cli options).
-
- Mar 29, 2022
-
-
David Douard authored
-