- Oct 18, 2022
-
-
David Douard authored
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4.

Also freeze flake8 dependencies.
Also change flake8's repo config to github (the gitlab mirror being outdated).
-
- Sep 08, 2022
-
-
vlorentz authored
-
- Aug 29, 2022
- Jul 06, 2022
-
-
Antoine Pietri authored
-
- Jun 21, 2022
-
-
Nicolas Dandrimont authored
We are removing support for the objstorage computing the object id itself.
-
- May 23, 2022
-
-
Antoine Pietri authored
-
- May 09, 2022
-
-
Pratyush authored
-
- Apr 29, 2022
-
-
Antoine Pietri authored
Significantly improves performance by reducing the number of levels in each DB, and thus reducing the amount of compaction.
-
Antoine Pietri authored
Reviewers: #reviewers, olasd
Reviewed By: #reviewers, olasd
Subscribers: olasd
Differential Revision: https://forge.softwareheritage.org/D7711
-
- Apr 28, 2022
-
-
Antoine Pietri authored
-
- Apr 26, 2022
-
-
David Douard authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
We no longer support exporting the dataset as PostgreSQL dumps. That format is of little use for big data analysis; we instead encourage researchers to use big data engines (Hadoop, Hive, Presto...), which all support the ORC format. Alternatives for importing the dataset into PostgreSQL include https://github.com/HighgoSoftware/orc_fdw/ , or using the swh mirroring pipeline to spawn a local storage instance with a PostgreSQL backend.
-
vlorentz authored
-
- Apr 21, 2022
-
-
Antoine Lambert authored
That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
-
David Douard authored
This option allows restarting Kafka consumption a bit earlier than the last committed offsets. This can be useful for testing and debugging.
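A minimal sketch of the idea, assuming the option boils down to subtracting a fixed margin from each committed offset. The function name, the `margin` parameter and the low-watermark clamp are illustrative assumptions, not taken from the actual implementation:

```python
def rewind_offsets(committed, low_watermarks, margin):
    """Move each committed offset back by `margin` messages.

    `committed` and `low_watermarks` map a partition id to an offset.
    An offset is never moved below the partition's low watermark.
    """
    rewound = {}
    for partition, offset in committed.items():
        low = low_watermarks.get(partition, 0)
        rewound[partition] = max(low, offset - margin)
    return rewound
```

Clamping to the low watermark keeps the consumer from requesting messages that the broker no longer holds.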
-
David Douard authored
to easily set the list of exported object types. If not set, export all supported object types. Note that ``--exclude`` is also applied.
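The selection rule described above could be sketched as follows. The list of object types and the function name are hypothetical and only serve the illustration:

```python
# Hypothetical subset of object types, for illustration only.
ALL_TYPES = ["origin", "snapshot", "release", "revision", "directory", "content"]


def select_object_types(requested=None, exclude=()):
    """Return the object types to export.

    If `requested` is empty or None, start from every supported type.
    The exclusion list is applied in both cases.
    """
    selected = list(requested) if requested else list(ALL_TYPES)
    excluded = set(exclude)
    return [t for t in selected if t not in excluded]
```

For example, requesting `revision` and `release` while excluding `release` leaves only `revision`.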
-
- Apr 14, 2022
-
-
Antoine Pietri authored
The origin table now contains the origin URL and a sha1 of the URL as the "ID" field. Allows us to join this table more easily with the SWHIDs retrieved from the compressed graph, as well as generate the edge dataset without having to compute sha1s manually.
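As an illustration, the "ID" of an origin can be computed from its URL as below. This is a sketch that assumes the URL is hashed as UTF-8 bytes and that the result is used in the extended SWHID form `swh:1:ori:<sha1>`; both function names are made up for the example:

```python
import hashlib


def origin_id(url: str) -> str:
    """Return the hex sha1 of an origin URL (assumes UTF-8 encoding)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def origin_swhid(url: str) -> str:
    """Build an origin identifier in the assumed 'swh:1:ori:<sha1>' form."""
    return f"swh:1:ori:{origin_id(url)}"
```

Storing this value in the table means joins against graph identifiers no longer need the sha1 to be recomputed on the fly.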
-
- Apr 12, 2022
-
-
Antoine Pietri authored
-
- Apr 08, 2022
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Related to T3922
-
Antoine Lambert authored
black is considered stable since release 22.1.0, and the version we are currently using is quite outdated and not compatible with click 8.1.0, so it is time to bump it to its latest stable release.

Please note that the E501 pycodestyle warning related to line length is replaced by the B950 one from flake8-bugbear, as recommended by black.
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length

Related to T3922
-
- Apr 07, 2022
-
-
David Douard authored
-
David Douard authored
this feature requires a config parameter `with_data=true` and an objstorage configuration.
-
- Apr 06, 2022
-
-
Antoine Lambert authored
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7 so we can now drop the pytest pinning.
-
- Apr 01, 2022
-
-
Antoine Pietri authored
-
- Mar 31, 2022
-
-
David Douard authored
Partially reverting 5a8a8a78. This is needed to ensure better compliance with usual ORC semantics, where one directory = one table.
-
- Mar 30, 2022
-
-
David Douard authored
- use shortened values in the progress bar (eg. '11.3M/566M' instead of something like '11310201/566123456')
- reduce the description strings and align them.

The result looks like:

  Exporting release:
  - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
  - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
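The shortened values can be illustrated with a small helper. This is a standalone sketch, not the code used by the exporter (tqdm's `unit_scale` option provides comparable formatting); the name `shorten` is hypothetical:

```python
def shorten(value):
    """Format a count with an SI suffix and about 3 significant digits."""
    units = ["", "k", "M", "G", "T"]
    value = float(value)
    index = 0
    # Divide by 1000 until the value fits in three digits once rounded.
    while abs(value) >= 999.5 and index < len(units) - 1:
        value /= 1000
        index += 1
    return f"{value:.3g}{units[index]}"


print(f"{shorten(11310201)}/{shorten(566123456)}")  # prints 11.3M/566M
```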
-
David Douard authored
to prevent some possible race condition leading to kafka errors like:

  rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed

This could occur when, at the time of unsubscribing from a partition, another partition is also depleted. Since the unsubscription consists in resubscribing to all partitions except the unsubscribed ones, we could try to subscribe to such a depleted partition, leading to the error message listed above.
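One way to avoid resubscribing to a depleted partition is sketched below. This only illustrates the set arithmetic and is not necessarily how the fix is implemented; the function and argument names are assumptions:

```python
def partitions_to_resubscribe(assigned, unsubscribed, depleted):
    """Return the partitions that are still safe to subscribe to.

    Partitions already unsubscribed and partitions known to be depleted
    are both left out, so a depleted partition is never re-added.
    """
    return sorted(set(assigned) - set(unsubscribed) - set(depleted))
```

With partitions 0 to 3 assigned, partition 1 unsubscribed and partition 3 depleted, only partitions 0 and 2 remain.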
-
David Douard authored
eg. orc exporter specific exporter config entries are now under the 'orc' section, like:

  journal:
    brokers: [...]
  orc:
    remove_pull_requests: true
    max_rows:
      revision: 100000
      directory: 10000
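Once loaded, such a configuration can be read per exporter as sketched here. The dict mirrors the example above with the broker list left empty as a placeholder, and `exporter_config` is a hypothetical helper:

```python
# Configuration as it might look once parsed into a dict.
config = {
    "journal": {"brokers": []},  # placeholder, real brokers elided
    "orc": {
        "remove_pull_requests": True,
        "max_rows": {"revision": 100000, "directory": 10000},
    },
}


def exporter_config(config, exporter_name):
    """Return the section dedicated to one exporter, or an empty dict."""
    return config.get(exporter_name, {})


orc_config = exporter_config(config, "orc")
print(orc_config["max_rows"]["revision"])  # prints 100000
```

Keeping each exporter's entries in a section named after it avoids name clashes between exporters.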
-
David Douard authored
Make it possible to specify a maximum number of rows a table can store in a single ORC file. The limit can only be set on main tables for now (i.e. cannot be specified for tables like revision_history or directory_entry). This can be set by configuration only (no extra cli options).
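The effect of such a limit can be sketched as splitting the stream of rows into batches, each of which would end up in its own ORC file. This is a simplified illustration with a hypothetical helper, not the exporter's actual file rotation code:

```python
def split_rows(rows, max_rows=None):
    """Yield lists of rows holding at most `max_rows` entries each.

    With `max_rows=None`, all rows end up in a single batch.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if max_rows is not None and len(batch) >= max_rows:
            yield batch
            batch = []
    if batch:
        yield batch
```

For instance, 25 rows with `max_rows=10` give batches of 10, 10 and 5 rows.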
-
- Mar 29, 2022
-
-
David Douard authored
-
David Douard authored
-
David Douard authored
-
David Douard authored
related ORC files being ORC files involved in the serialization of a given object type, namely:

- snapshot and snapshot_branches,
- revision and revision_history,
- directory and directory_entry.

Also include the object_type in the generated file name (in place of the static 'graph').

So the result will typically be like:

  output/orc/snapshot/
    snapshot-18a575cb-3a92-4753-9267-e3475fa30857.orc
    snapshot_branch-18a575cb-3a92-4753-9267-e3475fa30857.orc
    snapshot-1f41d206-994a-49bb-917f-e096e40c2856.orc
    snapshot_branch-1f41d206-994a-49bb-917f-e096e40c2856.orc
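The naming scheme above can be sketched as follows. The table names are taken from the example listing; the mapping, the function name and its parameters are assumptions made for the illustration:

```python
import uuid
from pathlib import PurePosixPath

# Tables written together for each main object type (illustrative mapping).
RELATED_TABLES = {
    "snapshot": ["snapshot", "snapshot_branch"],
    "revision": ["revision", "revision_history"],
    "directory": ["directory", "directory_entry"],
}


def orc_file_paths(root, object_type, file_uuid=None):
    """Return the paths of the related ORC files sharing one uuid."""
    file_uuid = file_uuid or uuid.uuid4()
    tables = RELATED_TABLES.get(object_type, [object_type])
    return [
        PurePosixPath(root) / object_type / f"{table}-{file_uuid}.orc"
        for table in tables
    ]
```

Sharing the uuid makes it easy to tell which secondary file goes with which main file.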
-
David Douard authored
add:

- object type
- uuid
- version of swh.model used at file generation time,
- version of swh.dataset used at file generation time.
-
David Douard authored
and split it in 2 parts (needed for changes to come).
-
David Douard authored
rather than hardcoding it to 'swh-dataset-export-', use the 'group_id' value from the 'journal' section of the config file as prefix, if given (otherwise default to the former value). This is needed because the current auth policy of the swh kafka cluster only allows the group_id to start with the actual login for authenticated connections. So we need to be able to specify this group_id prefix.
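The fallback rule can be sketched as below. How the prefix is then combined with an export identifier is an assumption, as are the function names:

```python
DEFAULT_PREFIX = "swh-dataset-export-"


def group_id_prefix(config):
    """Return the configured 'group_id' prefix, or the former default."""
    return config.get("journal", {}).get("group_id") or DEFAULT_PREFIX


def group_id(config, export_id):
    """Build a consumer group id (concatenation is an assumption)."""
    return f"{group_id_prefix(config)}{export_id}"
```

With an authenticated connection, setting `group_id` to a value starting with the login yields group ids accepted by the cluster's policy.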
-