-
v1.3.3 · 8dd81df3
* luigi: Work around absence of metadata files in old exports
* luigi: Add support for downloading exports in parallel
* luigi: Delete all files in the root dir instead of the root dir itself
* docs: Add links to Terms of Use and 'How to use SWH data'
* Fix Sphinx role (:cls: -> :class:)
-
v1.2.0 · a9692a7e
* luigi: Actually check whether RunExportAll is complete.
* Advertize 2022-12-07 dataset
* luigi: Make AthenaDatabaseTarget check tables exist
-
v1.1.0 · 5a7bb58f
* Rename 'object_type' to 'object_types' in export metadata
* luigi: Clarify the meaning of 'object_types'
* Invert symlink and content for the README file
* Bump mypy to 1.0.1 and isort to 5.11.5
* Fix tox and pytest config
-
v1.0.3 · a01a82fc
* luigi.UploadExportToS3: Skip upload of already-uploaded files
* luigi: Dynamically list directories instead of using object_types
* luigi: Read meta/export.json instead of relying on stamp files
* docs/index.rst: Add missing new line at end of file
* docs/index.rst: Fix sphinx tag name
* docs: Include module indices only when building standalone package doc
-
v1.0.2 · c717f60f
* exporters/orc: Fix crash on visit status with no type
* luigi.CreateAthena: Fix validation of DB name
* luigi.RunExportAll: Default to exporting all formats
-
v1.0.1 · a1cf9b87
* luigi: Rename classes to be globally unambiguous
* luigi: Send progress reports to the scheduler
-
v1.0.0 · f8a13718
* Update for swh-objstorage >= 2.0.0
* docs/athena: Fix value of --location-prefix
* Fix link to the 2021-03-23 compressed dataset
* cli: Sort object types to be processed in the right order
* cli: Increase open file descriptor limit to support 256 open LevelDBs
* athena: Fix create_table to work with restricted permissions
* Add luigi tasks
-
v0.2.0 / 2021-04-17 · dffb127c
* athena: pass database name as an attribute
* docs: Update for new schema
* Add two ORC tools (orc-merge, orc-print-contents)
* journalprocessor: only reassign partitions when needed
* journalprocessor: disable in-partition sharding for LevelDB tests
* ORC: export missing revision_history table
* athena: add documentation and licensing info
* Add athena subcommand to create/query AWS Athena database
* Move ORC table schema in relational.py
* test_edges: fix mypy error while mocking a method
* Fix duplicate reference target
* Swap README.rst and docs/README.rst to match the new template.
* Include README.rst in the documentation.
* Add LevelDB backend for exporter node sets
* ORC exporter: handle releases with empty authors/dates
* Update exporters.edged to swh.model 1.0
* ORC exporter: avoid fromtimestamp(), use datetime() from epoch instead
* Refactor export paths in the base Exporter class
* ORC exporter: Add unit tests
* Add ORC exporter
* Edge exporter: use common remove_pull_requests() function
* journalprocessor: be resilient to exporter errors
* Export CLI: add a way to exclude specific object types
* Namespace exporters in exporters/ dir
* journalprocessor: don't shadow the object function
* journalprocessor: fix hashing of origin_visit_status objects
* journalprocessor: remove comment about deserialize_message overload being a 'hack'
* tests: fix test_export_origin
* SQLite on-disk set: disable journalling and synchronous mode
* journalprocessor: also partition sqlite files by first byte
* Journal processor: fetch offsets in parallel
* Exporter documentation fixes
* Rewrite of the export pipeline using Exporters
* Graph export: add labels to the export CSV format
* graph exporter: schema upgrade for origin_visit_status
* Replace vcversioner with setuptools-scm
* Run isort after the CLI import changes