Export orc data of newly ingested swh-objects in Adastra (!31) · Merge requests · Teams / CodeCommons / github-ingestion

Convert Kafka messages into swh-objects
Implement the update_local_orc_files method, which uses the ORCExporter from the swh.export package
Integrate the ORCExporter configuration into the deduplicator configuration
Handle stop signals in the deduplication worker (a stop script can be added later)
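
As an illustration of the stop-signal handling mentioned above, here is a minimal sketch of a worker loop that finishes its current batch and flushes once before exiting. It is not the code from this merge request: `StopFlag`, `run_worker`, `process_batch` and `flush` are hypothetical names, and only the standard-library `signal` and `threading` modules are assumed.

```python
import signal
import threading


class StopFlag:
    """Records that a stop was requested so the worker loop can exit cleanly."""

    def __init__(self):
        self._event = threading.Event()

    def install(self, signals=(signal.SIGINT, signal.SIGTERM)):
        # Must be called from the main thread; signal.signal raises otherwise.
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; the actual cleanup happens in the worker loop.
        self._event.set()

    def request_stop(self):
        self._event.set()

    def is_set(self):
        return self._event.is_set()


def run_worker(process_batch, flush, stop_flag):
    """Call process_batch until a stop is requested, then flush exactly once.

    process_batch and flush are hypothetical callables standing in for the
    deduplication step and the final write of pending data.
    """
    processed = 0
    while not stop_flag.is_set():
        process_batch()
        processed += 1
    flush()
    return processed
```

In a start script, one would create the flag, call `install()` once, and pass it to `run_worker`. Because the handler only sets a flag, a batch in progress is never cut off halfway; the loop notices the request at the next iteration.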

References and related:

Run the full ingestion pipeline and make sure newly ingested data is written to Kafka. Make sure this run ingests objects that had not yet been ingested when you ran the index initialization (which must be run beforehand; see the README).
Update the deduplicator configuration file deduplicator-config.yml
In the deduplicator venv (make sure to load the gcc module if testing in Adastra), run the deduplicator with start_deduplicator.py
Interrupt the deduplicator via a kill signal or a keyboard interrupt
Make sure the script stops gracefully
Check that ORC data is present in the ORC export folder
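
The last check can be scripted. The sketch below is only an assumption about how one might verify the output: it lists non-empty files ending in `.orc` under a given folder. The helper name `list_orc_files` and the `.orc` suffix are hypothetical, and the export folder path is whatever is set in deduplicator-config.yml.

```python
from pathlib import Path


def list_orc_files(export_dir):
    """Return all non-empty *.orc files below export_dir, sorted by path.

    The ".orc" suffix is an assumption about how the exported files are named.
    """
    root = Path(export_dir)
    return sorted(
        path
        for path in root.rglob("*.orc")
        if path.is_file() and path.stat().st_size > 0
    )
```

An empty result after a graceful stop would suggest that the pending data was not flushed before the worker exited.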

Edited Apr 15, 2025 by Simeon Carstens
