
Add luigi tasks

These tasks are geared toward running automatically: they call each other as needed, and can be imported by workflows defined in other modules (e.g. the future swh.graph.luigi module).

This reuses the CLI heavily, so most of the code consists of:

  • telling Luigi how to deduplicate tasks, and when and how to reuse the output of tasks that already ran
  • adding stamp files to avoid accidentally using a partially written export (for example, one that was interrupted midway)
  • writing the meta.json file, which acts as a final stamp and provides information about the dataset export itself (required for swh-graph#2579 (closed))
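
As a rough illustration of the stamp-file idea above (not the code in this merge request), here is a minimal standard-library sketch. Luigi's default completeness check considers a task done once its output targets exist, so the point is to create the stamp only after the data is fully written, and meta.json only after the stamp. The function names, file names other than meta.json, and the metadata fields are all hypothetical.

```python
import json
from pathlib import Path


def write_export(export_dir: Path, rows: list) -> None:
    """Hypothetical export step: write the data first, the stamp last."""
    export_dir.mkdir(parents=True, exist_ok=True)
    (export_dir / "data.txt").write_text("\n".join(rows))
    # The stamp is created only once the data is fully written, so an
    # interrupted run leaves data behind without a stamp.
    (export_dir / "stamp").touch()


def write_meta(export_dir: Path, info: dict) -> None:
    """Write meta.json last; its presence marks the export as usable."""
    if not (export_dir / "stamp").exists():
        raise RuntimeError("export is incomplete, refusing to write meta.json")
    # Write to a temporary name, then rename, so that readers never see
    # a half-written meta.json.
    tmp = export_dir / "meta.json.tmp"
    tmp.write_text(json.dumps(info))
    tmp.replace(export_dir / "meta.json")


def is_complete(export_dir: Path) -> bool:
    """What a downstream task would check before reusing the export."""
    return (export_dir / "meta.json").exists()
```

A downstream step would call is_complete() before reading the export, and rerun the export otherwise; in Luigi this check would typically be expressed through the task's output target rather than a standalone function.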

Depends on !73 (closed)

Test Plan

This is mostly declarative code, and the issues all arise when interfacing with external services (mostly S3 and Athena), so I do not think writing tests is worthwhile.

I played with this code in various scenarios while debugging, so I am confident task deduplication works fine.


Migrated from D8829 (view on Phabricator)
