- Dec 06, 2022
- Nov 29, 2022
-
-
vlorentz authored
Otherwise, UploadToS3, DownloadToS3, and RunAll would conflict with tasks about to be defined in swh.graph, because Luigi requires task names to be globally unique.
- Nov 24, 2022
- Nov 21, 2022
-
-
vlorentz authored
They are only useful while exporting the dataset. After the export is finished, meta.json is good enough, and stamp files only save a couple of minutes when only some object types are needed (i.e. never in practice).
-
- Nov 15, 2022
-
-
vlorentz authored
This will allow running swh-graph tasks easily on machines that didn't export the graph themselves.
-
- Nov 10, 2022
-
-
vlorentz authored
Other tasks will import them in order to depend on tasks defined here
-
vlorentz authored
They are better suited to running automatically, as they call each other as needed, and can be imported by workflows defined in other modules (e.g. the future swh.graph.luigi module).
-
vlorentz authored
So it can be reused by a Luigi task
-
vlorentz authored
For some reason, using a nonexistent database works when connecting with credentials that have unnecessarily high privileges (though it is not clear to me which permissions allow this).
-
- Nov 04, 2022
-
-
vlorentz authored
-
- Nov 03, 2022
-
-
vlorentz authored
- Oct 18, 2022
-
-
David Douard authored
pre-commit from 4.1.0 to 4.3.0, codespell from 2.2.1 to 2.2.2, black from 22.3.0 to 22.10.0, and flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies and change flake8's repo config to GitHub (the GitLab mirror being outdated).
-
- Sep 08, 2022
-
-
vlorentz authored
-
- Aug 29, 2022
- Jul 06, 2022
-
-
Antoine Pietri authored
-
- Jun 21, 2022
-
-
Nicolas Dandrimont authored
We are removing support for the objstorage computing the object id itself.
-
- May 23, 2022
-
-
Antoine Pietri authored
-
- May 09, 2022
-
-
Pratyush authored
-
- Apr 29, 2022
-
-
Antoine Pietri authored
Significantly improves performance by reducing the number of levels in each DB, and thus reducing the amount of compaction.
-
Antoine Pietri authored
Reviewers: #reviewers, olasd Reviewed By: #reviewers, olasd Subscribers: olasd Differential Revision: https://forge.softwareheritage.org/D7711
-
- Apr 28, 2022
-
-
Antoine Pietri authored
-
- Apr 26, 2022
-
-
David Douard authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
We no longer support exporting the dataset as PostgreSQL dumps. It's pretty much useless for big data analysis; we instead encourage researchers to use big data engines (Hadoop, Hive, Presto...), which all support the ORC format. Alternatives for importing the dataset into PostgreSQL include https://github.com/HighgoSoftware/orc_fdw/ , or using the swh mirroring pipeline to spawn a local storage instance with a PostgreSQL backend.
-
vlorentz authored
-
- Apr 21, 2022
-
-
Antoine Lambert authored
That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
-
David Douard authored
This option allows restarting Kafka consumption a bit earlier than the last committed offsets. This can be useful for testing and debugging.
-
David Douard authored
to easily set the list of exported object types. If not set, export all supported object types. Note that ``--exclude`` is also applied.
-
- Apr 14, 2022
-
-
Antoine Pietri authored
The origin table now contains the origin URL and a sha1 of the URL as the "ID" field. This allows us to join this table more easily with the SWHIDs retrieved from the compressed graph, and to generate the edge dataset without having to compute sha1s manually.
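As a rough illustration of the ID described above, the sketch below hashes an origin URL with SHA-1 and builds an origin identifier from it. The function names `origin_id` and `origin_swhid` are hypothetical, and the UTF-8 encoding, hexadecimal representation, and `swh:1:ori:` prefix are assumptions for this example rather than details stated in the commit message.

```python
import hashlib


def origin_id(url: str) -> str:
    """Return the hex SHA-1 of an origin URL.

    Assumption: the URL is hashed as UTF-8 bytes and the ID is stored
    as a hexadecimal string; the actual export may differ.
    """
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def origin_swhid(url: str) -> str:
    """Build an origin identifier from the URL hash.

    Assumption: origin identifiers use the 'swh:1:ori:' prefix
    followed by the hex SHA-1 of the URL.
    """
    return f"swh:1:ori:{origin_id(url)}"


if __name__ == "__main__":
    # Hypothetical origin URL, used only for illustration.
    url = "https://example.org/user/repo.git"
    print(origin_id(url))
    print(origin_swhid(url))
```

With the hash stored in the table, a join against identifiers coming from the graph only needs a string comparison on this column instead of recomputing the hash for every URL.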
-
- Apr 12, 2022
-
-
Antoine Pietri authored
-
- Apr 08, 2022
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Related to T3922
-