- Nov 10, 2022
-
-
vlorentz authored
They are more tuned toward running automatically, as they call each other as needed, and can be imported by workflows defined in other modules (e.g. the future swh.graph.luigi module).
-
vlorentz authored
So it can be reused by a Luigi task
-
vlorentz authored
For some reason, using a non-existent database works when working with credentials with unnecessarily high privileges (though it is not clear to me which permissions allow this).
-
- Nov 04, 2022
-
-
vlorentz authored
-
- Nov 03, 2022
-
-
vlorentz authored
- Oct 18, 2022
-
-
David Douard authored
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4.

Also freeze flake8 dependencies. Also change flake8's repo config to github (the gitlab mirror being outdated).
-
- Sep 08, 2022
-
-
vlorentz authored
-
- Aug 29, 2022
- Jul 06, 2022
-
-
Antoine Pietri authored
-
- Jun 21, 2022
-
-
Nicolas Dandrimont authored
We are removing support for the objstorage computing the object id itself.
-
- May 23, 2022
-
-
Antoine Pietri authored
-
- May 09, 2022
-
-
Pratyush authored
-
- Apr 29, 2022
-
-
Antoine Pietri authored
Significantly improves performance by reducing the number of levels in each DB, and thus reducing the amount of compaction.
-
Antoine Pietri authored
Reviewers: #reviewers, olasd
Reviewed By: #reviewers, olasd
Subscribers: olasd
Differential Revision: https://forge.softwareheritage.org/D7711
-
- Apr 28, 2022
-
-
Antoine Pietri authored
-
- Apr 26, 2022
-
-
David Douard authored
-
Antoine Pietri authored
-
Antoine Pietri authored
-
Antoine Pietri authored
We no longer support exporting the dataset as PostgreSQL dumps. These are of little use for big data analysis; instead, we encourage researchers to use big data engines (Hadoop, Hive, Presto...), which all support the ORC format. Alternatives for importing the dataset into PostgreSQL include https://github.com/HighgoSoftware/orc_fdw/ , or using the swh mirroring pipeline to spawn a local storage instance with a PostgreSQL backend.
-
vlorentz authored
-
- Apr 21, 2022
-
-
Antoine Lambert authored
That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
-
David Douard authored
This option allows restarting Kafka consumption a bit earlier than the last committed offsets. This can be useful for testing and debugging.
-
David Douard authored
This makes it easy to set the list of exported object types. If not set, all supported object types are exported. Note that ``--exclude`` is also applied.
-
- Apr 14, 2022
-
-
Antoine Pietri authored
The origin table now contains the origin URL and a sha1 of the URL as the "ID" field. This allows us to join this table more easily with the SWHIDs retrieved from the compressed graph, as well as to generate the edge dataset without having to compute sha1s manually.
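The ID can thus be recomputed directly from an origin URL when joining by hand; a minimal sketch (the example URL, and the assumption that the URL is hashed as UTF-8 bytes, are mine):

```python
import hashlib

# Recompute the origin "ID" field described above: the sha1 of the origin URL.
# Assumption: the URL is encoded as UTF-8 before hashing.
url = "https://github.com/python/cpython"
origin_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
print(origin_id)  # a 40-character hex digest
```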
-
- Apr 12, 2022
-
-
Antoine Pietri authored
-
- Apr 08, 2022
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Related to T3922
-
Antoine Lambert authored
black is considered stable since release 22.1.0, and the version we are currently using is quite outdated and not compatible with click 8.1.0, so it is time to bump it to its latest stable release. Please note that the E501 pycodestyle warning related to line length is replaced by the B950 one from flake8-bugbear, as recommended by black: https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length

Related to T3922
-
- Apr 07, 2022
-
-
David Douard authored
-
David Douard authored
This feature requires a config parameter `with_data=true` and an objstorage configuration.
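A minimal sketch of what such a configuration might look like, assuming the exporter's usual YAML layout (the objstorage settings shown are placeholders, not a recommendation):

```yaml
with_data: true
objstorage:
  cls: remote
  url: https://objstorage.example.org/
```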
-
- Apr 06, 2022
-
-
Antoine Lambert authored
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7 so we can now drop the pytest pinning.
-
- Apr 01, 2022
-
-
Antoine Pietri authored
-
- Mar 31, 2022
-
-
David Douard authored
Partially reverting 5a8a8a78. This is needed to ensure better compliance with usual ORC semantics, where one directory = one table.
-
- Mar 30, 2022
-
-
David Douard authored
- use shortened values in the progress bar (eg. '11.3M/566M' instead of something like '11310201/566123456')
- reduce the description strings and align them.

The result looks like:

    Exporting release:
    - Offset: 100%|███████████████████████████████| 64/64 [00:03<00:00, 18.61it/s]
    - Export: 100%|█████████████████| 130/130 [00:01<00:00, 10.3it/s, workers=4/4]
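The shortening of large counts can be sketched with a small helper (a hypothetical stand-in for illustration; in practice tqdm's own `unit_scale=True` option likely provides this behaviour):

```python
def naturalsize(n: float) -> str:
    """Shorten a large count, e.g. 11310201 -> '11.3M'."""
    for unit in ["", "k", "M", "G", "T"]:
        if abs(n) < 1000:
            # .3g keeps three significant digits, matching '11.3M'-style output
            return f"{n:.3g}{unit}"
        n /= 1000
    return f"{n:.3g}P"


print(naturalsize(11310201), naturalsize(566123456))  # 11.3M 566M
```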
-
David Douard authored
This prevents a possible race condition leading to kafka errors like:

    rd_kafka_assignment_partition_stopped: Assertion `rktp->rktp_started' failed

This could occur when, at the time of unsubscribing from a partition, another partition is also depleted. Since unsubscribing consists of resubscribing to all partitions except the unsubscribed ones, we could try to subscribe to such a depleted partition, leading to the error message above.
-
David Douard authored
e.g. ORC-exporter-specific config entries are now under the 'orc' section, like:

    journal:
      brokers: [...]
    orc:
      remove_pull_requests: true
      max_rows:
        revision: 100000
        directory: 10000
-
David Douard authored
Make it possible to specify the maximum number of rows a table can store in a single ORC file. The limit can only be set on main tables for now (i.e. it cannot be specified for tables like revision_history or directory_entry). This can be set by configuration only (no extra cli options).
-
- Mar 29, 2022
-
-
David Douard authored
-