Indexers - Find and implement a proper method for scheduling content indexing messages

We are currently indexing all our contents (more than 3.8b).

(This is probably a prerequisite to indexing new contents on the fly, most probably by leveraging our journal stack.)

So far, I've been regularly scheduling batches of contents from a snapshot file stored on uffizi (the latest one being /srv/storage/space/lists/indexer/orchestrator-all/last-remaining-hashes-less-than-100mib.txt.gz).

source: https://forge.softwareheritage.org/diffusion/DSNIP/browse/master/ardumont/send-batch-sha1s.sh (well, some derivative form of it, running on worker01.euwest.azure).
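
For illustration only, the batch-reading half of that approach could look roughly like the sketch below. This is not the actual script: the function names `read_hashes` and `batched` are hypothetical, and it assumes the snapshot file is gzip-compressed text with one hash per line.

```python
import gzip
from itertools import islice
from typing import Iterable, Iterator, List


def read_hashes(path: str) -> Iterator[str]:
    """Yield one hash per non-empty line of a gzip-compressed text file.

    Assumption: the snapshot file holds one hash per line.
    """
    with gzip.open(path, "rt") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group items into lists of at most `size` elements, lazily."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Both helpers are lazy, so a multi-billion-line snapshot is never loaded into memory at once; `batched(read_hashes(path), 1000)` would yield lists of up to 1000 hashes each.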

This is:

ugly
heavy on my daily workload (it increased recently, since the mimetype implementation change, T849, boosted the performance :)

Find a proper implementation for automating this (as 3.2b contents remain to be indexed).

Note:

We cannot directly use our rabbitmq host as we will not have enough disk space for it. That would make the host explode and break other workers (loader, lister, checker, etc...).
As per discussion with the team, we cannot use the scheduler infrastructure either (with oneshot tasks). That would make the scheduler's db size explode, as we don't have a cleanup routine for that db yet.
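
One possible direction that respects both constraints is to feed the queue incrementally: only send a new batch when the queue has drained below a threshold, so neither the broker nor the scheduler db ever holds the full backlog. The sketch below is a minimal illustration under that assumption, not a decided design. `feed_queue`, `queue_length` and `send_batch` are hypothetical names; the two callables stand in for whatever the real broker client provides.

```python
import time
from typing import Callable, Iterable, List


def feed_queue(
    batches: Iterable[List[str]],
    queue_length: Callable[[], int],
    send_batch: Callable[[List[str]], None],
    max_queue_length: int = 10000,
    poll_interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send batches one by one, waiting while the queue is full.

    queue_length and send_batch are hypothetical callables: the first
    returns the current number of pending messages, the second
    publishes one batch as a single message.
    Returns the number of batches sent.
    """
    sent = 0
    for batch in batches:
        # Back off until the workers have consumed enough messages.
        while queue_length() >= max_queue_length:
            sleep(poll_interval)
        send_batch(batch)
        sent += 1
    return sent
```

With this shape, the number of pending messages stays bounded by `max_queue_length`, whatever the size of the snapshot file, and the process can run unattended instead of being triggered by hand.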

Note 2: It has been useful to trigger it that way, though. Sometimes the queue was almost empty except for a few jobs, which kept being rescheduled because of unexpected errors (the latest ones being #861 (closed)/#862 (closed), but other issues were raised that way).

Note 3: That was how the rehash computations were scheduled (#712 (closed)), but that took "only" 2-3 months (June-August 2017). The indexers have been running for longer than that (since May or June 2017). I never quite found the time to make it more appropriate...

Migrated from T864 (view on Phabricator)

Edited Jan 07, 2023 by Phabricator Migration user