Add a backfiller cli command
This command backfills a Kafka journal from an existing PostgreSQL provenance storage.
The command runs a given number of workers in parallel. The state of the backfilling process is saved in a LevelDB store, so an interrupted run can be restarted, with one limitation: resuming will not work properly if the range generation has been modified in the meantime.
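To illustrate the resume behaviour and its limitation, here is a minimal sketch, not the code from this merge request. It assumes a hypothetical `byte_ranges` range generator and uses a plain dict as a stand-in for the LevelDB store; the key format and the function names are invented for the example.

```python
import queue
import threading


def byte_ranges(n_chunks):
    """Split [0x00, 0x100) into n_chunks half-open ranges (hypothetical scheme)."""
    step = 256 // n_chunks
    bounds = list(range(0, 256, step))[:n_chunks] + [256]
    yield from zip(bounds, bounds[1:])


def run_backfill(etype, n_chunks, state, process, n_workers=2):
    """Process every range not yet recorded in `state`; return the ranges handled.

    `state` is any dict-like store (a stand-in for the LevelDB store here).
    """
    todo = queue.Queue()
    for lo, hi in byte_ranges(n_chunks):
        # The key embeds the range bounds, so it only matches on restart
        # if the range generation is unchanged.
        key = f"{etype}:{lo:02x}-{hi:02x}"
        if key not in state:
            todo.put((key, lo, hi))

    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                key, lo, hi = todo.get_nowait()
            except queue.Empty:
                return
            process(etype, lo, hi)
            with lock:
                state[key] = b"done"  # mark the range as done only after success
                processed.append((lo, hi))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(processed)
```

With the same `n_chunks`, a second call against the same `state` finds nothing left to do. Changing `n_chunks` changes every key, so the saved state no longer matches and all ranges are processed again, which is the kind of limitation the description warns about.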
Migrated from D8964 (view on Phabricator)
Build is green
Patch application report for D8964 (id=32301)
Rebasing onto c626cc21...
Current branch diff-target is up to date.
Changes applied before test
commit a98494dafd797ae6c1fb8e509d53fa17afa07374
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 16 15:09:11 2022 +0100

    Add a backfiller cli command

    This command allowd to backfill a kafka journal from an existing
    Postgresql provenance storage.

    The command will run a given number of workers in parallel. The state
    of the backfilling process is saved in a leveldb store, so interrupting
    and restarting a backfilling process is possible, with limitations: it
    won't work properly if the range generation is modified.

commit 0b9df1a11c798767beacf09dfed6179ddc593419
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 9 15:06:16 2022 +0100

    Extract the journal writer part from the ProvenanceStorageJournal class

    This allows to use the journal writing part independently from the
    ProvenanceStorage proxy class, eg. for the backfiller mechanism.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/711/ for more details.
- swh/provenance/storage/backfill.py 0 → 100644

                            logger.debug("Exiting producer")
                            return
                except StopIteration:
                    logger.info("range generator for %s is over, removing", etype)

        def backfill_worker(self, queue, exiting):
            logger.info("Spawning backfiller worker %s", threading.current_thread().name)
            while not exiting.is_set():
                etype, start, stop = queue.get()
                logger.info("backfill %s [%s, %s)", etype, start.hex(), stop.hex())
                backfiller = getattr(self, f"backfill_{etype}")
                try:
                    n = backfiller(start=start, stop=stop)
                except Exception as e:
                    print("argh", etype, e, repr(e))
                    raise

- swh/provenance/storage/backfill.py 0 → 100644

                etype, range_gen = range_gens.pop(0)
                try:
                    start, stop = next(range_gen)
                    range_gens.append((etype, range_gen))
                    logger.debug(
                        "Adding range %s: [%s, %s)", etype, start.hex(), stop.hex()
                    )
                    try:
                        queue.put((etype, start, stop), timeout=1)
                    except Full:
                        if exiting.is_set():
                            logger.debug("Exiting producer")
                            return
                except StopIteration:
                    logger.info("range generator for %s is over, removing", etype)

Apply @vlorentz's comments
Build is green
Patch application report for D8964 (id=32334)
Could not rebase; Attempt merge onto c626cc21...
Updating c626cc2..e66a2bf
Fast-forward
 mypy.ini                                           |   3 +
 requirements.txt                                   |   1 +
 swh/provenance/cli.py                              |  91 +++++-
 swh/provenance/storage/backfill.py                 | 344 +++++++++++++++++++++
 swh/provenance/storage/journal.py                  | 104 ++++---
 .../tests/test_provenance_journal_writer.py        |  64 ++--
 6 files changed, 539 insertions(+), 68 deletions(-)
 create mode 100644 swh/provenance/storage/backfill.py
Changes applied before test
commit e66a2bf98615a59ffbea30f1269c364bdf4db57e
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 16 15:09:11 2022 +0100

    Add a backfiller cli command

    This command allowd to backfill a kafka journal from an existing
    Postgresql provenance storage.

    The command will run a given number of workers in parallel. The state
    of the backfilling process is saved in a leveldb store, so interrupting
    and restarting a backfilling process is possible, with limitations: it
    won't work properly if the range generation is modified.

commit 0b9df1a11c798767beacf09dfed6179ddc593419
Author: David Douard <david.douard@sdfa3.org>
Date:   Fri Dec 9 15:06:16 2022 +0100

    Extract the journal writer part from the ProvenanceStorageJournal class

    This allows to use the journal writing part independently from the
    ProvenanceStorage proxy class, eg. for the backfiller mechanism.
See https://jenkins.softwareheritage.org/job/DPROV/job/tests-on-diff/712/ for more details.
In !163 (closed), @vlorentz wrote:
> You might want to remove your calls to logger.error() and print() before re-raising

What about these?
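One possible reading of that suggestion, sketched under assumptions rather than taken from the merged code: if the `except` block only prints and re-raises, it can be dropped, since the propagating exception already carries the traceback. `run_range` is a hypothetical helper name; `backfiller` stands for any callable that accepts `start` and `stop`.

```python
import logging

logger = logging.getLogger(__name__)


def run_range(backfiller, etype, start, stop):
    """Run one range and let any exception propagate to the caller unchanged."""
    logger.info("backfill %s [%s, %s)", etype, start.hex(), stop.hex())
    # No try/except here: a handler that only prints and re-raises adds
    # nothing that the propagating exception does not already carry.
    return backfiller(start=start, stop=stop)
```

If extra context is still wanted at the failure site, `logger.exception(...)` followed by a bare `raise` is the usual alternative to `print()`.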
merged via e66a2bf9