Commits · 0496c3977a56b66dbbe2682bcb8f37390f650c1f · Platform / Development / swh-scheduler

Jun 03, 2022

Remove unused get_current_version method · 0496c397

Attribute current_version is already set and directly used by swh db
[version|init|upgrade] clis.

Related to T4305

Verified

0496c397

May 31, 2022
- tests: use stock pytest_postgresql factory function · ef153850
  David Douard authored 2 years ago
  
  instead of (soon-to-be-deprecated) swh-core's postgresql_fact one.
  ef153850
May 12, 2022

interface: Return enabled origins only by default in get_listed_origins · e56fc4d1

Antoine Lambert authored 2 years ago

Add a new enabled_only parameter set to True by default in
get_listed_origins scheduler method.

It filters out disabled listed origins by default
when requesting the result of a listing, which avoids possible
errors in lister implementations.

e56fc4d1
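
The default filtering described in e56fc4d1 can be pictured with a small in-memory sketch. Only the `enabled_only` parameter name and its default of True come from the commit message; the `ListedOrigin` dataclass, the list-based store and the `get_listed_origins_sketch` function are simplified, hypothetical stand-ins, not the real swh.scheduler model or backend method.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ListedOrigin:
    # Simplified stand-in for the scheduler's listed origin model (assumption).
    url: str
    visit_type: str
    enabled: bool = True


def get_listed_origins_sketch(
    origins: List[ListedOrigin], enabled_only: bool = True
) -> List[ListedOrigin]:
    """Return listed origins, skipping disabled ones unless asked otherwise."""
    if enabled_only:
        return [origin for origin in origins if origin.enabled]
    return list(origins)


# Hypothetical data: one enabled and one disabled origin.
store = [
    ListedOrigin("https://example.org/a.git", "git"),
    ListedOrigin("https://example.org/b.git", "git", enabled=False),
]
default_result = get_listed_origins_sketch(store)
full_result = get_listed_origins_sketch(store, enabled_only=False)
```

Callers that still need disabled origins have to pass `enabled_only=False` explicitly.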

May 09, 2022
- add strict asyncio_mode in pytest.ini · c7c53eab
  Pratyush authored 2 years ago
  
  c7c53eab
May 06, 2022
- cli/task: Fix sphinx >= 4.4 warning · 1d50b2e1
  Antoine Lambert authored 2 years ago
  
  Fix "more than one target found for cross-reference 'Origin'" sphinx warning.
  1d50b2e1
Apr 28, 2022
- Add missing sentry captures · 881b521f
  Benoit Chauvet authored 2 years ago
  
  881b521f
Apr 27, 2022
- cli/utils: Fix parsing of empty strings · 82274c1b
  vlorentz authored 2 years ago
  
  v1.1.1
  
  82274c1b
Apr 26, 2022
- Bump mypy to v0.942 · 353cf2a6
  vlorentz authored 2 years ago
  
  353cf2a6
- Add a 'lister_instance_name' argument to all tasks created from ListedOrigin · 0365b853
  vlorentz authored 2 years ago
  
  This will allow loaders to use the right API credentials to fetch extrinsic metadata for the origin from the forge.
  v1.1.0
  
  0365b853
- Add a 'lister_name' argument to all tasks created from ListedOrigin · 42e362db
  vlorentz authored 2 years ago
  
  This will allow loaders to guess the forge type, and use the right API to fetch extrinsic metadata for the origin from the forge.
  42e362db
- Update a bit the documentation for the new origin visit scheduler · 3687931f
  David Douard authored 2 years ago
  
  3687931f
Apr 21, 2022
- Make create_origin_task_dict a standalone function · 9483493f
  vlorentz authored 2 years ago
  
  It feels off as an object method; and I am going to make it use joins in a future commit, so it makes more sense this way.
  9483493f
- test_utils.py: Convert to pytest-style tests · 5e9ee60d
  vlorentz authored 2 years ago
  
  5e9ee60d
- pre-commit: Remove codespell commit-msg hook · 9627e6d6
  Antoine Lambert authored 2 years ago
  
  That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
  9627e6d6
Apr 20, 2022

Make scheduling policy used in schedule_recurrent configurable · a76bb02f

David Douard authored 2 years ago

Add support for a configuration option "scheduling_policy" in the config
file loaded by the 'swh scheduler schedule-recurrent' command. This
config entry allows specifying the scheduling policies used by the
schedule-recurrent tool, instead of having them hardcoded in the source
code.

A visit type policy config entry should have at least a 'weight' value
for each policy.

Default values are unchanged.

Eg.:

  scheduling_policy:
    git:
      - policy: already_visited_order_by_lag
        weight: 55
        tablesample: 0.5
      - policy: never_visited_oldest_update_first
        weight: 45
        tablesample: 0.5

Note: there may not be configuration entries for all visit types, but if
      a visit type policy is configured, the config entry should be complete
      (in other words, the merging of the configuration with the default
      values is only done at first config level).

a76bb02f
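
The note in a76bb02f about merging "only at first config level" can be illustrated with a shallow merge. This is a sketch under assumptions: the default values and the `merge_first_level` helper are hypothetical, and only the policy names and the `policy`/`weight` keys are taken from the commit message.

```python
from typing import Any, Dict, List

PolicyList = List[Dict[str, Any]]

# Hypothetical defaults, for illustration only.
DEFAULT_POLICIES: Dict[str, PolicyList] = {
    "default": [
        {"policy": "already_visited_order_by_lag", "weight": 50},
        {"policy": "never_visited_oldest_update_first", "weight": 50},
    ],
    "git": [
        {"policy": "already_visited_order_by_lag", "weight": 50},
        {"policy": "never_visited_oldest_update_first", "weight": 50},
    ],
}


def merge_first_level(
    defaults: Dict[str, PolicyList], configured: Dict[str, PolicyList]
) -> Dict[str, PolicyList]:
    """Shallow merge: a configured visit type replaces its default entry whole."""
    merged = dict(defaults)  # copy the top level so defaults stay untouched
    merged.update(configured)
    return merged


# Configuring a single policy for git drops the other default git policy,
# which is why a configured entry has to be complete.
configured = {"git": [{"policy": "already_visited_order_by_lag", "weight": 100}]}
merged = merge_first_level(DEFAULT_POLICIES, configured)
```

Visit types that are absent from the configuration keep their default entry.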

Apr 08, 2022

Add .git-blame-ignore-revs file with automatic reformatting commits · 5302efda

Antoine Lambert authored 2 years ago

5302efda

python: Reformat code with black 22.3.0 · 3f0843bd

Antoine Lambert authored 2 years ago

Related to T3922

3f0843bd

pre-commit, tox: Bump black from 19.10b0 to 22.3.0 · d9a25121

Antoine Lambert authored 2 years ago

black is considered stable since release 22.1.0 and the version
we are currently using is quite outdated and not compatible with
click 8.1.0, so it is time to bump it to its latest stable release.

Please note that E501 pycodestyle warning related to line length
is replaced by B950 one from flake8-bugbear as recommended by black.
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length

Related to T3922

d9a25121

Apr 06, 2022

requirements-test: Remove pytest pinning to < 7 · bafe03f4

Antoine Lambert authored 2 years ago

pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for
pytest >= 7 so we can now drop the pytest pinning.

bafe03f4

Mar 22, 2022

pytest: Exclude build directory for tests discovery · 78f5579b

Antoine Lambert authored 3 years ago

Due to test modules being copied in subdirectories of the
build directory by setuptools, it makes pytest fail by raising
ImportPathMismatchError exceptions when invoked from root
directory of the module.

So ignore the build folder to discover tests.

78f5579b

Feb 24, 2022

Adapt to swh.core 2.0.0 · 5cc62be1

David Douard authored 3 years ago

- add the `get_datastore` function in `swh.scheduler`
- add the `get_current_version` method in `SchedulerBackend`,
- remove dbversion management from sql init script
- update tests accordingly

5cc62be1

Feb 10, 2022
- pre-commit: Bump hooks and add new one to check commit message spelling · 234e1659
  Antoine Lambert authored 3 years ago
  
  To install the new hook: $ pre-commit install -t commit-msg
  234e1659
Feb 09, 2022

requirements: Remove click version pin · fddec020

Antoine Lambert authored 3 years ago

Latest versions of celery and flask now support click >= 8.0 so
we can remove the version pin.

fddec020

Feb 08, 2022

Prefix task types used in tests with 'test-' · c46ffadf

David Douard authored 3 years ago

so that tests do not depend on a lucky guess about what the scheduler db
state actually is. DB initialization scripts do create task types for
git, hg and svn (used in tests), but these tests depend on the fact that the
db fixture has already been called once before, so tables are
truncated (especially the task and task_type ones).

For example running a single test involved in task-type creation was
failing (eg. 'pytest swh -k test_create_task_type_idempotence').

This commit does make tests not collide with any existing task or task
type initialization scripts may create.

Note that this also means that there is actually no test dealing with
the scheduler db state after initialization, which is not great and
should be addressed.

c46ffadf

Feb 07, 2022
- requirements-test: Pin pytest to < 7.0.0 · 9f601f56
  Antoine R. Dumont authored 3 years ago
  
  Related to T3916
  Verified
  
  9f601f56
Jan 21, 2022
- Fix ReST syntax · ce112835
  vlorentz authored 3 years ago
  
  ce112835
Jan 12, 2022

sql: Clean up task/task_run data model · b5477ea2

Antoine R. Dumont authored 3 years ago

This archives current task and task_run tables, creating new ones filtering only
necessary tasks (last 2 months' oneshot tasks plus some recurring tasks; lister,
indexer, ...). Those filtered tasks are the ones scheduled by the runner and runner
priority services.

This archiving will allow those services to be faster (the corresponding queries
will return results faster without the archived data).

Related to T3837

Verified

b5477ea2

Jan 05, 2022
- Allow to specify the visit grab parameters per visit type and policy · 5c836d64
  Vincent Sellier authored 3 years ago
  
  Related to T3827
  v0.23.0 Verified
  
  5c836d64
Dec 16, 2021
- Pin mypy and drop type annotations which makes mypy unhappy · 559f3451
  Antoine R. Dumont authored 3 years ago
  
  This also drops spurious copyright headers from those files if present. Related to T3812
  Verified
  
  559f3451
Dec 09, 2021

Use a temporary table to update scheduler metrics · e051b320

Nicolas Dandrimont authored 3 years ago

When using ``insert into <...> select <...>``, PostgreSQL disables
parallel querying. Under some circumstances (in our large production
database), this makes updating the scheduler metrics take a (very) long
time.

Parallel querying is allowed for ``create table <...> as select <...>``,
and doing so restores the small(er) runtimes for this query (15 minutes
instead of multiple hours). To use that, we have to turn the function
into plpgsql instead of plain sql.

e051b320

Dec 08, 2021
- Clean up disabled scheduler archival task related services · a8edbdbb
  Antoine R. Dumont authored 3 years ago
  
  This is dead code now as this has long been stopped and disabled in production. Related to T3777
  Verified
  
  a8edbdbb
Dec 07, 2021

Make next_visit_queue_position an integer · 5de8ba42

Nicolas Dandrimont authored 3 years ago

In visit types with small amounts of origins having no last_update
field, we would end up overflowing Python datetimes (which only go up to
31 December 9999) pretty quickly. Making the queue position a 64-bit
integer should give us some more leeway.

The queue position now defaults to zero instead of an arbitrary point in
time. Queue offsets are still commensurate with seconds, but that's
mostly to give them some space to be splayed by the fudge factors.

5de8ba42
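
The overflow mentioned in 5de8ba42 is easy to reproduce with the standard library. The sketch below uses hypothetical helper names and illustrative offsets; it only shows that a datetime-based position cannot move past year 9999, while an integer position counted in seconds stays far below the 64-bit limit.

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1


def advance_datetime_position(position: datetime, seconds: int) -> datetime:
    """Queue position stored as a datetime (the former representation)."""
    return position + timedelta(seconds=seconds)


def advance_integer_position(position: int, seconds: int) -> int:
    """Queue position stored as an integer commensurate with seconds."""
    return position + seconds


# A datetime position close to the maximum cannot be pushed any further.
near_max = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
try:
    advance_datetime_position(near_max, 10)
    datetime_overflowed = False
except OverflowError:
    datetime_overflowed = True

# An integer position starting at zero has plenty of headroom, even after
# advancing by (roughly) ten thousand years' worth of seconds.
integer_position = advance_integer_position(0, 10_000 * 365 * 24 * 3600)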

Dec 06, 2021

Ensure there is no duplicated origins in the insertion batches · 0a6aac58

Vincent Sellier authored 3 years ago

When a lister tries to insert duplicate origins in the same batch,
the insertion fails because the "on conflict do update" instruction
cannot handle duplicates in the same transaction.

Related to T3769

Verified

0a6aac58

Nov 22, 2021

Fix CardinalityViolation in grab_next_visits on duplicate origins · 2abb3936

vlorentz authored 3 years ago

grab_next_visits grabs from `listed_origins`, whose primary key is
`(lister_id, url, visit_type)` and uses it to upsert in origin_visit_stats,
whose primary key is `(url, visit_type)`.
This causes the error `ON CONFLICT DO UPDATE command cannot affect row a
second time` when the same (origin, type) pair is grabbed twice.

This commit deduplicates the (origin, type) pairs before upserting.

2abb3936
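
The deduplication step described in 2abb3936 amounts to collapsing rows keyed by `(lister_id, url, visit_type)` into unique `(url, visit_type)` pairs before the upsert. The sketch below does this in Python on plain tuples; the function name and sample rows are hypothetical, and the actual fix may well be expressed in SQL.

```python
from typing import Iterable, List, Tuple

ListedRow = Tuple[str, str, str]  # (lister_id, url, visit_type)
VisitKey = Tuple[str, str]  # (url, visit_type)


def unique_origin_visit_pairs(rows: Iterable[ListedRow]) -> List[VisitKey]:
    """Keep the first occurrence of each (url, visit_type) pair, in order."""
    seen = set()
    pairs: List[VisitKey] = []
    for _lister_id, url, visit_type in rows:
        key = (url, visit_type)
        if key not in seen:
            seen.add(key)
            pairs.append(key)
    return pairs


# The same origin listed by two listers yields a single pair to upsert.
rows = [
    ("lister-1", "https://example.org/repo", "git"),
    ("lister-2", "https://example.org/repo", "git"),
    ("lister-1", "https://example.org/other", "hg"),
]
pairs = unique_origin_visit_pairs(rows)
```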

Oct 29, 2021
- recurrent visits: use policy weights instead of ratios · 00ff02ea
  Nicolas Dandrimont authored 3 years ago
  
  The ratios weren't checked for normalization; using relative weights explicitly ensures that the settings won't be misinterpreted.
  00ff02ea
- Improve docs rendering for recurrent visits scheduler · 7f434c3f
  Nicolas Dandrimont authored 3 years ago
  
  7f434c3f
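
The switch from ratios to relative weights in 00ff02ea means that only the proportions between policies matter. A minimal sketch of such a split follows, assuming a largest-remainder rounding; the `split_by_weight` helper is hypothetical and is not the scheduler's actual implementation.

```python
from typing import Dict


def split_by_weight(total: int, weights: Dict[str, int]) -> Dict[str, int]:
    """Split `total` visits between policies proportionally to their weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one policy needs a positive weight")
    # Integer share of each policy, rounded down.
    shares = {name: total * weight // weight_sum for name, weight in weights.items()}
    # Hand the few remaining visits to the policies with the largest remainders.
    leftover = total - sum(shares.values())
    by_remainder = sorted(
        weights, key=lambda name: (total * weights[name]) % weight_sum, reverse=True
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Weights need not add up to 100 (or to 1): 55/45 and 11/9 give the same split.
split_a = split_by_weight(
    100, {"already_visited_order_by_lag": 55, "never_visited_oldest_update_first": 45}
)
split_b = split_by_weight(
    100, {"already_visited_order_by_lag": 11, "never_visited_oldest_update_first": 9}
)
```

Because the weights are divided by their sum, there is no normalization for a user to get wrong.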
Oct 28, 2021

Add a new cli endpoint to schedule recurrent visits in Celery · 50d7fd7f

Nicolas Dandrimont authored 3 years ago and Antoine R. Dumont committed 3 years ago

For each known visit type, we run a loop which:
 - monitors the size of the relevant celery queue
 - schedules more visits of the relevant type once the number of
 available slots goes over a given threshold (currently set to 5% of the
 max queue size).

The scheduling of visits combines multiple scheduling policies, for now
using static ratios set in the `POLICY_RATIOS` dict. We emit a warning
if the ratio of origins fetched for each policy is skewed with respect
to the original request (allowing, for now, manual adjustment of the
ratios).

The CLI endpoint spawns one thread for each visit type, which all handle
connections to RabbitMQ and the scheduler backend separately. For now,
we handle exceptions in the visit scheduling threads by (stupidly)
respawning the relevant thread directly. We should probably improve this
to give up after a specific number of tries.

Co-authored-by: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>

Verified

50d7fd7f
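
The threshold check described in 50d7fd7f can be sketched as a pure function. The function and parameter names are hypothetical; only the rule itself (schedule more visits once the free slots exceed 5% of the max queue size) comes from the commit message, and the real loop reads the queue length from Celery/RabbitMQ rather than taking it as an argument.

```python
def slots_to_fill(
    max_queue_length: int, current_queue_length: int, min_free_ratio: float = 0.05
) -> int:
    """Return how many visits to schedule now, or 0 if it is too early."""
    available = max(max_queue_length - current_queue_length, 0)
    threshold = max_queue_length * min_free_ratio
    # Only schedule once the number of free slots goes over the threshold.
    return available if available > threshold else 0


# With a max queue size of 1000, the threshold is 50 free slots.
too_early = slots_to_fill(1000, 960)  # 40 free slots: wait
ready = slots_to_fill(1000, 900)  # 100 free slots: fill them
```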

Oct 27, 2021

grab_next_visits: avoid time interval calculations in PostgreSQL · 0c7ef27b

Nicolas Dandrimont authored 3 years ago

When the database is in a non-UTC timezone with DST, and a
`timestamptz - interval` calculation crosses a DST change, the result of the
calculation can be one hour off from the expected value:

PostgreSQL will vary the timestamp by the number of days in the
interval, and will keep the same (local) time, which will be offset by
an hour because of the DST change.

Doing the datetime +- timedelta calculations in Python instead of
PostgreSQL avoids this caveat altogether.

0c7ef27b
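
The one-hour discrepancy described in 0c7ef27b can be reproduced in Python by comparing a wall-clock subtraction in a DST timezone with an absolute subtraction in UTC. This is an illustration under assumptions: the Europe/Paris zone, the dates and the helper names are chosen for the example, `wall_clock_subtract` only mimics the database behaviour described in the commit message, and `zoneinfo` needs Python 3.9+ with time zone data available.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")  # example non-UTC timezone with DST


def wall_clock_subtract(moment: datetime, days: int) -> datetime:
    """Move back `days` days while keeping the same local (Paris) time."""
    local = moment.astimezone(PARIS)
    return (local - timedelta(days=days)).astimezone(timezone.utc)


def absolute_subtract(moment: datetime, days: int) -> datetime:
    """Move back exactly `days` * 24 hours, computed in UTC."""
    return moment.astimezone(timezone.utc) - timedelta(days=days)


# DST ended in Paris on 31 October 2021, so going back two days from
# 1 November crosses the change.
now = datetime(2021, 11, 1, 11, 0, tzinfo=timezone.utc)
difference = absolute_subtract(now, 2) - wall_clock_subtract(now, 2)
```

Keeping the arithmetic on UTC datetimes gives intervals of exactly `days * 24` hours regardless of the database timezone setting.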

Oct 22, 2021
- Restrict the click version to avoid conflict version with celery's · ecc0e280
  Antoine R. Dumont authored 3 years ago
  
  Otherwise, in some edge cases, such as running in Docker, the install fails on a version conflict. Related to P1205#8092
  Verified
  
  ecc0e280
Oct 20, 2021
- Add docstring to runner and listener modules · 243a69fc
  Antoine R. Dumont authored 3 years ago
  
  Related to T3667
  Verified
  
  243a69fc