Export ORC data of newly ingested swh-objects in Adastra
Why this MR
- Convert kafka messages into swh-objects
- Implement the update_local_orc_files method, which uses the ORCExporter from the swh.export package
- Integrate the ORCExporter configuration in the deduplicator configuration
- Handle stop-signals in the deduplication worker (a stop script can be added later)
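As an illustration of the stop-signal handling mentioned in the last point, here is a minimal sketch of a worker loop that stops cleanly on SIGINT or SIGTERM. It is not the code from this MR: the class StoppableWorker and its methods process_batch and flush are hypothetical placeholders, and only the standard-library signal and threading modules are used.

```python
import signal
import threading


class StoppableWorker:
    """Hypothetical worker loop that exits cleanly on SIGINT/SIGTERM."""

    def __init__(self):
        self._stop = threading.Event()
        self.processed = 0
        self.flushed = False

    def request_stop(self, signum=None, frame=None):
        # Keep the handler minimal: only set a flag that the loop checks.
        self._stop.set()

    def install_signal_handlers(self):
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGINT, self.request_stop)
        signal.signal(signal.SIGTERM, self.request_stop)

    def process_batch(self):
        # Placeholder for consuming messages and deduplicating them.
        self.processed += 1

    def flush(self):
        # Placeholder for finalizing work, e.g. closing open export files.
        self.flushed = True

    def run(self, max_batches=None):
        try:
            while not self._stop.is_set():
                if max_batches is not None and self.processed >= max_batches:
                    break
                self.process_batch()
        finally:
            # Always finalize, whether we stopped on a signal or an error.
            self.flush()
```

A caller would create the worker, call install_signal_handlers() once from the main thread, and then call run(). Because the handler only sets a flag, the batch in progress finishes before the loop exits and flush() runs.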
References and related:
How to review this MR
- Run the full ingestion pipeline and make sure newly ingested data is written to kafka. Make sure this includes objects that had not yet been ingested when you ran the index initialization (which you have to run beforehand, see the readme).
- Update the deduplicator configuration file deduplicator-config.yml
- In the deduplicator venv (make sure to load the gcc module if testing in Adastra), run the deduplicator script start_deduplicator.py
- Interrupt the deduplicator via a kill signal or a keyboard interrupt (Ctrl+C)
- Make sure the script stops gracefully
- Check that ORC data is present in the ORC export folder
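The last check can be scripted. The sketch below lists the non-empty files ending in .orc below a given folder. The helper find_orc_files is hypothetical, and the .orc file extension is an assumption about how the exporter names its output. Pass the export folder configured in deduplicator-config.yml.

```python
from pathlib import Path


def find_orc_files(export_dir):
    """Return all non-empty *.orc files below export_dir, sorted by path."""
    root = Path(export_dir)
    return sorted(
        p for p in root.rglob("*.orc") if p.is_file() and p.stat().st_size > 0
    )
```

An empty result after a run that ingested new objects suggests the export did not write anything.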
Edited by Simeon Carstens