Direct scheduling of origin visits in Celery

This stack of changes builds up to a CLI endpoint allowing us to schedule origin visits directly in Celery, bypassing the legacy scheduler entirely.

This has zero test coverage, save for the old tests still passing, which is already something... It is being used on the actual production database to schedule actual tasks for git, npm and pypi.

Included changes:

  • Drop duplicate docstring from backend
  • Make the origin visit scheduling cooldown configurable

(Cosmetic changes)

  • Add a (longer) specific cooldown for failed origin visits
  • Add a specific cooldown for notfound origins

Both of these changes prevent repeated visits to failing origins. This is necessary because we use a consistent ordering with respect to the upstream information: without a cooldown, we would always try to load the same failing origins first and never reach origins further down the stack. Listers should eventually disable these origins.
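To illustrate the idea behind the per-status cooldowns, here is a minimal sketch, not the code from this stack. The function name `is_eligible`, the status strings and the cooldown durations are all assumptions made up for the example; the actual defaults are configurable and not stated here.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Hypothetical cooldown values, for illustration only.
COOLDOWNS: Dict[str, timedelta] = {
    "successful": timedelta(days=7),
    "failed": timedelta(days=14),      # longer cooldown for failed visits
    "not_found": timedelta(days=31),   # specific cooldown for notfound origins
}


def is_eligible(
    last_status: Optional[str],
    last_scheduled: Optional[datetime],
    now: datetime,
    cooldowns: Dict[str, timedelta] = COOLDOWNS,
) -> bool:
    """Return True if the origin may be scheduled again at `now`."""
    if last_scheduled is None:
        # Never scheduled before: nothing to cool down from.
        return True
    # Unknown statuses fall back to the regular cooldown.
    cooldown = cooldowns.get(last_status, cooldowns["successful"])
    return now - last_scheduled >= cooldown


now = datetime(2021, 6, 1, tzinfo=timezone.utc)
print(is_eligible("failed", now - timedelta(days=8), now))
```

With these made-up values, an origin whose last visit failed 8 days ago is skipped, which leaves room for origins further down the ordering.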

  • Add table sampling option to grab_next_visits

Running common operations on all git origins is pretty intense. Using table sampling gives us the opportunity to at least schedule some jobs in a (decently short) time.
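As a rough sketch of what table sampling looks like at the query level, assuming a PostgreSQL backend (where `TABLESAMPLE SYSTEM (percentage)` is available): the helper `build_candidates_query` and the table name used below are hypothetical and only serve the illustration; they are not the query built by `grab_next_visits`.

```python
from typing import Optional


def build_candidates_query(table: str, sample_percent: Optional[float] = None) -> str:
    """Build a SELECT over `table`, optionally restricted to a random sample.

    `table` must be a trusted identifier: it is interpolated as-is.
    """
    query = f"SELECT * FROM {table}"
    if sample_percent is not None:
        if not 0 < sample_percent <= 100:
            raise ValueError("sample_percent must be in (0, 100]")
        # SYSTEM sampling picks whole table pages, which is what makes it
        # cheap on very large tables (at the cost of a less uniform sample).
        query += f" TABLESAMPLE SYSTEM ({sample_percent})"
    return query


# Hypothetical table name, for illustration only.
print(build_candidates_query("listed_origins", 0.1))
```

The trade-off is that a sampled run only considers a fraction of the origins, so the "best" candidates are the best within the sample rather than overall.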

  • Add a (very basic) scheduling policy for origins with no known last update

This is especially useful for pypi, as well as some git hosters that do not provide the right info in their APIs. We will need to implement smarter heuristics to avoid repeated uneventful visits on these origins.

  • Split off the helper for available slots in a celery queue

This is needed for the send-to-celery subcommand as well, so split it off of the runner module.
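The underlying computation can be sketched as follows. This is a simplified stand-in, assuming the caller already knows the current queue length and a maximum length; the name `available_slots` and its signature are made up for the example and do not reflect the actual helper, which has to query the Celery broker.

```python
def available_slots(queue_length: int, max_length: int) -> int:
    """Number of tasks that can still be sent before the queue is full."""
    # Never return a negative number when the queue is already over capacity.
    return max(0, max_length - queue_length)


# The result would typically cap how many origin visits are grabbed and sent.
print(available_slots(queue_length=900, max_length=1000))
```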

  • Add a swh scheduler origin send-to-celery subcommand

Yes, finally!

Test Plan

Obviously needs at least /some/ test coverage.


Migrated from D5809 (view on Phabricator)
