Improve handling of recurrent loading tasks in scheduler

The approach we're currently using for recurrent loading tasks in the scheduler has a lot of shortcomings:

does not take into account "freshness" information provided by a lister. Consequences:
- lots of lag accrued on origins with updates
- substantial amount of time wasted on origins with no updates
- some amount of time wasted on completely dead origins
uses (apparently unreliable) scheduler information as feedback loop
- lots of tasks end up lost in space, when we now have a reliable mechanism (the journal) to subscribe to updates about objects in the archive
feedback loop is very inflexible
- the "visit interval" target has never been met, or even calibrated to our bandwidth
- the way we adapt intervals is very stiff (x2 for inactive origins, /2 for active origins); no idea if it's stable or not
- save code now requests are completely ignored by the recurrent tasks

To handle this functionality better, I propose introducing a separate task scheduler for recurrent origin visits, with a few components:

a new table and set of API endpoints in the scheduler backend, to record information about recurrent origin loading tasks, replacing the current contents of the task table in the scheduler (for tasks that come from listers)
a new runner for these origin visit tasks
- TBD: generate one-shot tasks in the swh scheduler? send tasks directly to celery?
a journal client, feeding off origin_visits / origin_visit_updates, recording the status of all origin loading tasks

Common goals for a new origin visit scheduler

Some common goals that seem desirable:

loading origins at least once, as soon as possible after they appear in a lister
minimizing the number of "useless visits" with no updates to integrate
"smooth" the size of visits, by visiting active origins more often and doing less work each time (which reduces memory pressure on the workers)
make use of forge-provided "last modification times" to reduce the amount of useless work done

Baseline for the recurrence of origin visits

As a common baseline, we can handle the list of origins to be visited as a queue.

The way origins are ordered in this queue can be materialized by using two variables:

a //next visit target//, the "time" at which we expect the origin to have new objects, and we should visit it again.
a //visit interval//, which is the duration that we expect to wait between visits of this origin

While conceptually a timestamp, the //next visit target// value is only really used as a queue index; the //visit interval// is an offset by which we move the origin within this queue when a visit completes.

As we should keep our infrastructure busy, and attempt origin visits as often as possible (with a minimal cooldown period between visits of a given origin, to avoid DoSing hosters), the //next visit target// currently being serviced can drift away from the current clock according to the behavior of our infrastructure.

The visit interval is really an index in a list of possible visit intervals. This allows us to set a smooth increase at first, and a stiffer increase later:

index 0 and 1, interval 1 day
index 2, 3 and 4, interval 2 days
index 5 and up, interval 4^(n-4) days, where n is the index (4, 16, 64, 256, 1024)

The breakpoints of this "exponential" can be adjusted to match the reality of our loading infrastructure, by monitoring the skew between the next visit targets and the actual scheduling time of visits.
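As a rough sketch, the index-to-interval table above could be written as a small function. The name `visit_interval` and the cap `MAX_INDEX = 9` (the last value listed, 1024 days) are assumptions made for illustration, not part of the proposal.

```python
from datetime import timedelta

# Assumption: cap the index at the last value listed above (1024 days).
MAX_INDEX = 9


def visit_interval(index: int) -> timedelta:
    """Map a visit interval index to a duration, following the table above."""
    if index < 0:
        raise ValueError("visit interval index must be non-negative")
    index = min(index, MAX_INDEX)
    if index <= 1:
        return timedelta(days=1)
    if index <= 4:
        return timedelta(days=2)
    # index 5 and up: 4^(n-4) days
    return timedelta(days=4 ** (index - 4))
```

Keeping the breakpoints in one place like this would make them easy to adjust once the skew between next visit targets and actual scheduling times is being monitored.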

The next visit target of an origin is updated after each visit completes:

if the visit has failed, increase the next visit target by the minimal visit interval (to take into account transient loading issues)
- if the visit has failed several times in a row, disable the origin until the next run of the lister
if the visit is successful, and records some changes, decrease the visit interval index by 2 (visit the origin way more often).
if the visit is successful, and records no changes, increase the visit interval index by 1 (visit the origin less often).

We set the next visit target to its current value + the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range).

The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins e.g. when a number of origins from a single hoster are processed at once.
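The update rules above can be sketched as follows. All names (`OriginSchedule`, `update_after_visit`, `interval_for`) are hypothetical, and two points are assumptions rather than decisions from this task: the index is clamped to the range 0..9, and the fudge factor is not applied to the retry delay after a failed visit. Disabling an origin after repeated failures is left out.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_INDEX = 9  # assumption: last entry of the interval table (1024 days)


def interval_for(index: int) -> timedelta:
    """Interval table from the text: 1 day, 2 days, then 4^(n-4) days."""
    if index <= 1:
        return timedelta(days=1)
    if index <= 4:
        return timedelta(days=2)
    return timedelta(days=4 ** (min(index, MAX_INDEX) - 4))


@dataclass
class OriginSchedule:
    next_visit_target: datetime
    interval_index: int


def update_after_visit(state, failed, eventful, rng=random):
    """Move an origin within the queue once a visit has completed."""
    if failed:
        # Transient issue: retry after the minimal interval, index unchanged.
        state.next_visit_target += interval_for(0)
        return state
    if eventful:
        # Changes recorded: visit way more often.
        state.interval_index = max(0, state.interval_index - 2)
    else:
        # No changes: visit less often.
        state.interval_index = min(MAX_INDEX, state.interval_index + 1)
    fudge = rng.uniform(0.9, 1.1)  # -/+ 10% spread
    state.next_visit_target += interval_for(state.interval_index) * fudge
    return state
```

Passing `rng` explicitly (for example a seeded `random.Random`) keeps the function testable.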

Bootstrapping of the //next visit target// and //visit interval// values for new origins

There is an obvious bootstrap issue: new origins do not have a next visit target. (We can, however, bootstrap the visit interval index to a default value, e.g. 4.)

The first of our stated goals is visiting origins at least once as soon as possible.

There are multiple ways to achieve this:

we can schedule new origins separately
- pros
  no fudging of the next visit target needed
  
  can precisely control when new origins are loaded
- cons
  makes the scheduling logic more complex
  
  needs careful monitoring to make sure the number doesn't grow unbounded
we can generate a //next visit target// for new origins
- pros
  simpler scheduling logic: origins are picked from a single queue
  
  can handle spreading the visits of new origins over a longer time in a "oneshot" fashion
- cons
  needs careful consideration to avoid DoSing new hosters by bursting requests to new origins
Once the first visit happens, we can set the next visit target to the "current next visit target being scheduled" + the default visit interval * the random fudge factor.
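A minimal sketch of that post-first-visit computation, assuming a default interval index of 4 (2 days in the interval table) and a hypothetical `queue_position` argument standing for the "current next visit target being scheduled":

```python
import random
from datetime import datetime, timedelta

# Assumption: default index 4 maps to a 2-day interval in the table above.
DEFAULT_INTERVAL = timedelta(days=2)


def first_next_visit_target(queue_position: datetime, rng=random) -> datetime:
    """Place a freshly visited origin in the queue, relative to the
    next visit target currently being serviced (not the wall clock)."""
    fudge = rng.uniform(0.9, 1.1)  # -/+ 10% spread
    return queue_position + DEFAULT_INTERVAL * fudge
```

Anchoring on the queue position rather than the wall clock keeps new origins consistent with the drift of the rest of the queue.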

Optimizations for listers providing a date of last modification

When the lister provides a date of last modification for origins, we can do some more subtle management of the scheduling of origin visits. Instead of scheduling according to the next visit target, we can schedule visits to origins in the following three pools:

origins that have never successfully loaded

ordered by increasing date of last modification (oldest first)

NOTE: if the lister returns a creation date for the origin, we could instead sort by decreasing interval between //creation time// and //last update time//: this would favor "more active" origins, but could leave origins only updated once behind.

origins where the date of last modification is more recent than the latest (successful) load date

ordered by decreasing difference between last load date and date of last modification

NOTE: this favors more recently active origins to the detriment of origins which had some activity right after being loaded, then went silent. There's a good chance that this heuristic will not converge (and, if the infrastructure struggles, a bunch of modifications to origins that happened right after a visit will never be recorded). To reduce the impact of this, we could mix both ends of the heuristic: do some amount of visits to active origins, and do some amount of "easy" visits to origins that haven't been updated very much.

other origins

ordered by next visit target

NOTE: this is only a last-resort strategy:

listers might not be running all the time, but we still want a chance to update repositories
the modification time information provided by listers might not be reliable
if everything works perfectly, these last-resort updates should mostly be short no-ops
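The three pools could be computed along these lines. The `Origin` record and its field names are hypothetical. The second pool's ordering is read here as "largest gap between last modification and last successful load first", which is one interpretation of the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Origin:
    url: str
    last_modified: datetime
    last_successful_load: Optional[datetime] = None
    next_visit_target: Optional[datetime] = None


def pool_of(origin: Origin) -> str:
    """Assign an origin to one of the three scheduling pools."""
    if origin.last_successful_load is None:
        return "never_loaded"
    if origin.last_modified > origin.last_successful_load:
        return "modified_since_load"
    return "other"


def build_pools(origins):
    """Group origins by pool and sort each pool by its own heuristic."""
    pools = {"never_loaded": [], "modified_since_load": [], "other": []}
    for origin in origins:
        pools[pool_of(origin)].append(origin)
    # Oldest modification first.
    pools["never_loaded"].sort(key=lambda o: o.last_modified)
    # Largest gap between modification and last load first.
    pools["modified_since_load"].sort(
        key=lambda o: o.last_modified - o.last_successful_load, reverse=True
    )
    # Last resort: plain queue order; origins without a target go last.
    pools["other"].sort(key=lambda o: o.next_visit_target or datetime.max)
    return pools
```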

Actual visit scheduling policy

The //next visit target// for origins without date of last modification (group A) gives them a total ordering.

For origins with an upstream-provided date of last modification (group B), the three scheduling pools don't give us a total ordering of visits.

To handle both of these groups when the execution queue has n slots free, we can:

get the ratio of "last modification time-provided" origins r = #B / (#A + #B)
schedule (1-r) n origins from group A according to their //next visit target//
schedule n r/3 origins from each of the 3 pools of origin group B.

We need to monitor the number of origins in each of the pools of group B, to make sure that our sorting heuristics are not making our work diverge.
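The slot split described above can be sketched with integer arithmetic. The rounding policy is an assumption: each group B pool gets the floor of n·r/3, and the remaining slots go to group A, capped at the number of group A origins. The function name `allocate_slots` is hypothetical.

```python
def allocate_slots(n: int, count_a: int, count_b: int):
    """Split n free slots between group A and the three pools of group B.

    Returns (slots_for_group_a, slots_per_group_b_pool).
    """
    total = count_a + count_b
    if n <= 0 or total == 0:
        return (0, 0)
    # n * r / 3 with r = #B / (#A + #B), rounded down.
    per_pool = (n * count_b) // (3 * total)
    # Assumption: slots lost to rounding are handed to group A.
    from_a = min(count_a, n - 3 * per_pool)
    return (from_a, per_pool)
```

For example, with 100 free slots, 300 origins in group A and 700 in group B, this gives 23 slots per group B pool and 31 slots for group A.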

Potential future improvements

Prioritize non-fork repositories over forked repositories?
- not in the first draft; still, record the fork information if possible so we can adjust the scheduling policies afterwards

Feedback loop in the origin_visit journal client

Update last visit date, status, and eventfulness
- keep time of last successful visit
- if status is failed, keep same last visit date, increase failure count
  - if failure count too high (3 ?) disable origin until next run of lister
- else, reset failure count
Update visit interval index and next visit target
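A possible shape for the bookkeeping part of that feedback loop, leaving out the journal subscription itself. The record `OriginVisitStats`, the function `record_visit` and the status strings are illustrative assumptions. The threshold of 3 failures is the tentative value mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

MAX_FAILURES = 3  # tentative threshold from the list above


@dataclass
class OriginVisitStats:
    last_visit: Optional[datetime] = None
    last_status: Optional[str] = None
    last_eventful: Optional[bool] = None
    failure_count: int = 0
    enabled: bool = True


def record_visit(stats, date, status, eventful):
    """Fold one visit status message into the per-origin stats."""
    stats.last_status = status
    if status == "failed":
        # Keep the previous last visit date, only count the failure.
        stats.failure_count += 1
        if stats.failure_count >= MAX_FAILURES:
            # Disabled until the next run of the lister re-enables it.
            stats.enabled = False
        return stats
    stats.last_visit = date
    stats.last_eventful = eventful
    stats.failure_count = 0
    return stats
```

Updating the visit interval index and next visit target would then happen from the same message, using the recorded eventfulness.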

Proposed metrics for the new scheduler

number of origins for every scheduling policy group and subgroup (probably grouped by lister).
number of active origins (last modification in [previous listing, current listing]), by lister.
drift between real time and current //next visit target// scheduled
...

Discussion on implementation detail [1]

[1] https://hedgedoc.softwareheritage.org/jF0v5LZITGqhpZ-hgiy8zw

Migrated from T2345 (view on Phabricator)

Edited Jan 07, 2023 by Phabricator Migration user
