
scheduler/origin: Add scheduling origins cli from file/stdin to celery

It has lived in the snippet repository for years (since 2017 or earlier) and has been used regularly since then. It was recently reused for the osdn origins scheduling and, as usual, improvements to it ensued. It's time to make it an official cli.

It's now migrated under the 'swh scheduler origin' subcommand. Its name is 'send-origins-from-file-to-celery'. The input is either a local file or the standard input. It expects a list of origins (urls), which are pushed directly to the proper queue according to the task-type argument.
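As an illustration of the file/stdin input handling described above, here is a minimal sketch. The helper names `open_input` and `iter_origins` are hypothetical, and skipping blank lines is an assumption; the actual cli may handle its input differently.

```python
import io
import sys
from typing import Iterator, TextIO


def open_input(path: str) -> TextIO:
    """Return stdin when path is '-', otherwise open the local file."""
    return sys.stdin if path == "-" else open(path, encoding="utf-8")


def iter_origins(stream: TextIO) -> Iterator[str]:
    """Yield one origin url per line, ignoring blank lines (assumption)."""
    for line in stream:
        url = line.strip()
        if url:
            yield url


# An in-memory stream stands in for a file or a pipe here.
sample = io.StringIO("https://example.org/repo-a\n\nhttps://example.org/repo-b\n")
origins = list(iter_origins(sample))
```

Yielding origins lazily means a large dataset file does not have to be loaded in memory before the first message is sent.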

Help message:

Usage: swh scheduler origin send-origins-from-file-to-celery
           [OPTIONS] TASK_TYPE [FILE_INPUT] [OPTIONS]...

  Send origins directly from file/stdin to celery, filling the queue according
  to its standard configuration (and some optional adjustments).

  Arguments:

      TASK_TYPE: Scheduler task type (e.g. load-git, load-svn, ...)

      INPUT: Dataset file of origins to schedule, use '-' when piping to the
      cli.

      OPTIONS: Extra options (in the key=value form, e.g. base_git_url=<foo>)
      passed directly to the task to be scheduled

Options:
  --queue-name-prefix TEXT  Prefix to add to the default queue name (if
                            needed). Usually needed to treat special origins
                            (e.g. large repositories, fill-in-the-hole
                            datasets, ... ) in another dedicated queue.
  --threshold INTEGER       Threshold override for the queue.
  --limit INTEGER           Number of origins to send. Usually to limit to a
                            small number for debug purposes.
  --waiting-period INTEGER  Waiting time between checks
  --dry-run                 Print only messages to send to celery
  --debug                   Print extra messages
  -h, --help                Show this message and exit.
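The trailing OPTIONS arguments use the key=value form. A minimal sketch of how such arguments could be turned into task keyword arguments follows; `parse_options` is a hypothetical helper, not necessarily what the cli does internally, and it assumes the split happens on the first '=' only.

```python
from typing import Dict, Iterable


def parse_options(options: Iterable[str]) -> Dict[str, str]:
    """Turn ['key=value', ...] into a dict, splitting on the first '='."""
    kwargs: Dict[str, str] = {}
    for option in options:
        key, sep, value = option.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected key=value, got: {option!r}")
        kwargs[key] = value
    return kwargs


# The value may itself contain '=' (e.g. a url with a query string).
extra = parse_options(["base_git_url=https://example.org/git?ref=main"])
```

Splitting on the first '=' only keeps values such as urls with query strings intact.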

For example:

# --queue-name-prefix is optional
export SWH_CONFIG_FILENAME=~/.config/swh/scheduler.yml; \
  head -20 /tmp/20230509-1539-priority.list.github | \
  shuf | \
  swh scheduler -C $SWH_CONFIG_FILENAME origin send-origins-from-file-to-celery \
    --queue-name-prefix large-repository \
    --debug \
    --dry-run \
    load-git

or directly:

# --queue-name-prefix is optional
# note that here, the full file is passed as input, without shuffling it
swh scheduler -C $SWH_CONFIG_FILENAME origin send-origins-from-file-to-celery \
    --queue-name-prefix large-repository \
    --debug \
    --dry-run \
    --limit 10 \
    load-git \
    /tmp/20230509-1539-priority.list.github

Origins can be routed to an extra queue with the --queue-name-prefix flag. The destination queue name is the standard one (as configured in the scheduler) with the given prefix prepended (queue-name-prefix:standard-queue-name). This requires the destination queue to be consumed on the infra side (ping sysadm if it is not).
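The naming rule can be sketched as follows. `build_queue_name` is a hypothetical helper written for illustration; the sketch assumes the standard queue name is the task's backend name, as the dry-run output in this description suggests.

```python
from typing import Optional


def build_queue_name(standard_queue: str, prefix: Optional[str] = None) -> str:
    """Prepend 'prefix:' to the standard queue name when a prefix is given."""
    return f"{prefix}:{standard_queue}" if prefix else standard_queue


standard = "swh.loader.git.tasks.UpdateGitRepository"
prefixed = build_queue_name(standard, "add_forge_now")
default = build_queue_name(standard)
```

With the `add_forge_now` prefix, this yields `add_forge_now:swh.loader.git.tasks.UpdateGitRepository`; without a prefix, the standard queue name is returned unchanged.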

The number of messages sent can be capped with the --limit flag. The cli can also be tested with --dry-run, which only prints the actions without sending anything. Extra logging can be enabled with the --debug flag.
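A minimal sketch of how the limit and dry-run behaviours could fit together is given below. `send_origins` and its `send` callback are hypothetical names for illustration; in the real cli the sending side is celery's `app.send_task`, and the message fields here simply mirror those printed in dry-run mode.

```python
import itertools
import uuid
from typing import Any, Callable, Dict, Iterable, Optional


def send_origins(
    origins: Iterable[str],
    task_name: str,
    queue: str,
    send: Callable[[Dict[str, Any]], None],
    limit: Optional[int] = None,
    dry_run: bool = False,
) -> int:
    """Send at most `limit` messages (all when None); return how many were handled."""
    count = 0
    # islice with a stop of None iterates over every origin.
    for url in itertools.islice(origins, limit):
        message = {
            "name": task_name,
            "task_id": str(uuid.uuid4()),
            "args": (),
            "kwargs": {"url": url},
            "queue": queue,
        }
        if dry_run:
            # Only print what would be sent; nothing reaches the queue.
            print(f"** DRY-RUN ** call app.send_task with: {message}")
        else:
            send(message)
        count += 1
    return count


sent = []
handled = send_origins(["u1", "u2", "u3"], "some.task", "some-queue", sent.append, limit=2)
```

Passing the sender as a callback keeps the sketch testable without a running broker.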

Manual test run (debug and dry-run mode, limited to 3 messages):

export SWH_CONFIG_FILENAME=~/.config/swh/scheduler.yml; head -10 ~/downloads/20230509-1539-priority.list.github | shuf | my-swh scheduler -C $SWH_CONFIG_FILENAME origin send-origins-from-file-to-celery --queue-name-prefix add_forge_now --debug --dry-run --limit 3 load-git
{'type': 'load-git', 'description': 'Update an origin of type git', 'backend_name': 'swh.loader.git.tasks.UpdateGitRepository', 'default_interval': datetime.timedelta(days=64), 'min_interval': datetime.timedelta(seconds=43200), 'max_interval': datetime.timedelta(days=64), 'backoff_factor': 2.0, 'max_queue_length': 1000, 'num_retries': 3, 'retry_delay': None}
** DRY-RUN ** call app.send_task with: {'name': 'swh.loader.git.tasks.UpdateGitRepository', 'task_id': '9ef7bef0-2391-4804-99ba-6df54d5e0d01', 'args': (), 'kwargs': {'url': 'https://github.com/0xADE1A1DE/MeasureSuite'}, 'queue': 'add_forge_now:swh.loader.git.tasks.UpdateGitRepository'}
** DRY-RUN ** call app.send_task with: {'name': 'swh.loader.git.tasks.UpdateGitRepository', 'task_id': '371c0d06-63b9-466a-8202-dee33edf7da5', 'args': (), 'kwargs': {'url': 'https://github.com/0ssamaak0/labelmm'}, 'queue': 'add_forge_now:swh.loader.git.tasks.UpdateGitRepository'}
** DRY-RUN ** call app.send_task with: {'name': 'swh.loader.git.tasks.UpdateGitRepository', 'task_id': 'fecffc04-7992-445a-8d92-1dee5db12dcc', 'args': (), 'kwargs': {'url': 'https://github.com/0xJepsen/CRC_Research'}, 'queue': 'add_forge_now:swh.loader.git.tasks.UpdateGitRepository'}

Manual test run with an unknown task type (this fails):

export SWH_CONFIG_FILENAME=~/.config/swh/scheduler.production.yml; cat ~/downloads/only-20-20230509-1539-priority.list.github | shuf | my-swh scheduler -C $SWH_CONFIG_FILENAME origin send-origins-from-file-to-celery  --queue-name-prefix large-repository --debug --dry-run --limit 10 unknown-stuff
Usage: swh scheduler origin send-origins-from-file-to-celery
           [OPTIONS] TASK_TYPE [FILE_INPUT] [OPTIONS]...
Try 'swh scheduler origin send-origins-from-file-to-celery -h' for help.

Error: Could not find scheduler <unknown-stuff> task type

TODO:

  • documentation
  • tests
  • Find an acceptable name for the new subcommand

Refs. swh/infra/sysadm-environment#4872

Edited by Antoine R. Dumont
