Export ORC data of newly ingested swh-objects on Adastra

Why this MR

  • Convert Kafka messages into swh-objects
  • Implement the update_local_orc_files method, which uses the ORCExporter from the swh.export package
  • Integrate the ORCExporter configuration into the deduplicator configuration
  • Handle stop-signals in the deduplication worker (a stop script can be added later)
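The stop-signal handling mentioned in the last item could look roughly like the sketch below. It is not the code from this MR: GracefulStopper, run_worker, and the fetch_batch / process_batch / flush callables are hypothetical names used only for illustration. The idea is that SIGINT and SIGTERM only set a flag, so the worker finishes its current batch and flushes pending output before exiting.

```python
import signal
import threading


class GracefulStopper:
    """Hypothetical helper: turns SIGINT/SIGTERM into a stop flag."""

    def __init__(self):
        self._stop = threading.Event()

    def install(self, signals=(signal.SIGINT, signal.SIGTERM)):
        # signal.signal() may only be called from the main thread.
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only record the request; the worker loop decides when to exit.
        self._stop.set()

    @property
    def stop_requested(self):
        return self._stop.is_set()


def run_worker(stopper, fetch_batch, process_batch, flush):
    """Process batches until a stop is requested, then flush once.

    fetch_batch, process_batch and flush are placeholders for the real
    Kafka polling, deduplication and ORC export steps (assumptions).
    """
    processed = 0
    try:
        while not stopper.stop_requested:
            batch = fetch_batch()
            if batch:
                process_batch(batch)
                processed += len(batch)
    finally:
        # Runs on a normal stop and on unexpected errors alike.
        flush()
    return processed
```

In a start script one would call stopper.install() once before entering run_worker. Because the flag is only checked between batches, a batch that is already in flight is completed rather than cut off.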

References and related:

How to review this MR

  • Run the full ingestion pipeline and make sure newly ingested data is written to Kafka. Make sure the ingested objects include some that had not yet been ingested when you ran the index initialization (which you have to run beforehand, see the readme).
  • Update the deduplicator configuration file deduplicator-config.yml
  • In the deduplicator venv (make sure to load the gcc module if testing on Adastra), run the deduplicator via start_deduplicator.py
  • Interrupt the deduplicator via a kill signal or a keyboard interrupt
  • Make sure the script stops gracefully
  • Check that ORC data is present in the ORC export folder
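The last check can be scripted. The sketch below assumes that the exporter writes files with an .orc extension somewhere below the export folder; both the extension and the list_orc_files helper name are assumptions for illustration, not something this MR defines.

```python
from pathlib import Path


def list_orc_files(export_dir):
    """Return all non-empty *.orc files below export_dir, sorted by path.

    Assumes the export uses the .orc file extension (not confirmed here).
    """
    root = Path(export_dir)
    return sorted(
        path
        for path in root.rglob("*.orc")
        if path.is_file() and path.stat().st_size > 0
    )
```

Calling list_orc_files on the configured export folder after stopping the deduplicator should return a non-empty list if the export worked. Empty files are skipped on purpose, since a zero-byte file would hint at an interrupted write.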
Edited by Simeon Carstens
