- Jun 03, 2022
-
-
Antoine R. Dumont authored
The current_version attribute is already set and directly used by the swh db [version|init|upgrade] CLIs. Related to T4305
-
- May 31, 2022
-
-
David Douard authored
instead of swh-core's (soon-to-be-deprecated) postgresql_fact one.
-
- May 12, 2022
-
-
Antoine Lambert authored
Add a new enabled_only parameter, set to True by default, to the get_listed_origins scheduler method. It filters out disabled listed origins by default when requesting the results of a listing, avoiding possible errors in lister implementations.
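A minimal usage sketch; the backend instantiation arguments below are assumptions, only the enabled_only flag comes from this change:

    # Hedged sketch; backend instantiation arguments are assumptions.
    from swh.scheduler import get_scheduler

    scheduler = get_scheduler(cls="postgresql", db="dbname=swh-scheduler")

    # Default behaviour: disabled listed origins are filtered out.
    enabled_page = scheduler.get_listed_origins()

    # Explicitly include disabled origins as well.
    full_page = scheduler.get_listed_origins(enabled_only=False)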
-
- May 09, 2022
-
-
Pratyush authored
-
- May 06, 2022
-
-
Antoine Lambert authored
Fix "more than one target found for cross-reference 'Origin'" sphinx warning.
-
- Apr 28, 2022
-
-
Benoit Chauvet authored
-
- Apr 27, 2022
-
- Apr 26, 2022
-
-
vlorentz authored
-
vlorentz authored
This will allow loaders to guess the forge type, and use the right API to fetch extrinsic metadata for the origin from the forge.
-
David Douard authored
-
- Apr 21, 2022
-
-
vlorentz authored
It feels off as an object method, and I am going to make it use joins in a future commit, so it makes more sense this way.
-
vlorentz authored
-
Antoine Lambert authored
That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
-
- Apr 20, 2022
-
-
David Douard authored
Add support for a "scheduling_policy" configuration option in the config file loaded by the 'swh scheduler schedule-recurrent' command. This config entry allows specifying the scheduling policies used by the schedule-recurrent tool, instead of having them hardcoded in the source code. A visit type policy config entry should have at least a 'weight' value for each policy. Default values are unchanged. E.g.:

    scheduling_policy:
      git:
        - policy: already_visited_order_by_lag
          weight: 55
          tablesample: 0.5
        - policy: never_visited_oldest_update_first
          weight: 45
          tablesample: 0.5

Note: there may not be configuration entries for all visit types, but if a visit type policy is configured, the config entry should be complete (in other words, the merging of the configuration with the default values is only done at the first config level).
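As an illustration of how such weights could translate into per-policy visit counts, here is a hedged sketch (not the actual implementation):

    # Hedged sketch, not the actual swh-scheduler code: split `num_slots`
    # visits across policies proportionally to their configured weights.
    def split_by_weight(policies, num_slots):
        total = sum(p["weight"] for p in policies)
        counts = {p["policy"]: num_slots * p["weight"] // total for p in policies}
        # Give any remainder from the integer division to the first policy.
        counts[policies[0]["policy"]] += num_slots - sum(counts.values())
        return counts

    git_policies = [
        {"policy": "already_visited_order_by_lag", "weight": 55},
        {"policy": "never_visited_oldest_update_first", "weight": 45},
    ]
    print(split_by_weight(git_policies, 100))
    # {'already_visited_order_by_lag': 55, 'never_visited_oldest_update_first': 45}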
-
- Apr 08, 2022
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Related to T3922
-
Antoine Lambert authored
black is considered stable since release 22.1.0 and the version we are currently using is quite outdated and not compatible with click 8.1.0, so it is time to bump it to its latest stable release. Please note that the E501 pycodestyle warning related to line length is replaced by the B950 one from flake8-bugbear, as recommended by black. https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length Related to T3922
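For reference, a minimal flake8 configuration along the lines of what the linked black documentation recommends (the exact values used by the project may differ):

    # .flake8 / setup.cfg -- hedged sketch of the black-recommended settings
    [flake8]
    max-line-length = 88
    extend-select = B950
    extend-ignore = E203, E501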
-
- Apr 06, 2022
-
-
Antoine Lambert authored
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7 so we can now drop the pytest pinning.
-
- Mar 22, 2022
-
-
Antoine Lambert authored
Because setuptools copies test modules into subdirectories of the build directory, pytest fails with ImportPathMismatchError exceptions when invoked from the root directory of the module. So ignore the build folder when discovering tests.
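A hedged sketch of what such an exclusion can look like in a pytest configuration (the project may express it differently):

    # pytest.ini (or the [pytest] section of tox.ini) -- illustrative only;
    # norecursedirs is the standard pytest option for skipping directories
    # during test discovery.
    [pytest]
    norecursedirs = build docs .*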
-
- Feb 24, 2022
-
-
David Douard authored
- add the `get_datastore` function in `swh.scheduler`
- add the `get_current_version` method in `SchedulerBackend`
- remove dbversion management from the sql init script
- update tests accordingly
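A hedged sketch of how these new entry points might be used (the instantiation arguments are assumptions):

    # Hedged sketch; backend instantiation arguments are assumptions.
    from swh.scheduler import get_datastore

    datastore = get_datastore(cls="postgresql", db="dbname=swh-scheduler")
    # The expected schema version now lives in the backend instead of the
    # sql init script.
    print(datastore.get_current_version())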
-
- Feb 10, 2022
-
-
Antoine Lambert authored
To install the new hook:

    $ pre-commit install -t commit-msg
-
- Feb 09, 2022
-
-
Antoine Lambert authored
Latest versions of celery and flask now support click >= 8.0 so we can remove the version pin.
-
- Feb 08, 2022
-
-
David Douard authored
so that tests do not depend on a lucky guess about what the scheduler db state actually is. DB initialization scripts do create task types for git, hg and svn (used in tests), but these tests depend on the db fixture having already been called once before, so that tables are truncated (especially the task and task_type ones). For example, running a single test involved in task-type creation was failing (e.g. 'pytest swh -k test_create_task_type_idempotence'). This commit makes tests not collide with any existing task or task type the initialization scripts may create. Note that this also means that there is actually no test dealing with the scheduler db state after initialization, which is not great and should be addressed.
-
- Feb 07, 2022
-
-
Antoine R. Dumont authored
Related to T3916
-
- Jan 21, 2022
-
-
vlorentz authored
-
- Jan 12, 2022
-
-
Antoine R. Dumont authored
This archives the current task and task_run tables, creating new ones containing only the necessary tasks (the last 2 months' oneshot tasks plus some recurring tasks: lister, indexer, ...). Those filtered tasks are the ones scheduled by the runner and runner-priority services. This archiving will allow those services to be faster (the corresponding queries will output results faster without the archived data). Related to T3837
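The general shape of such an archival migration, as a hedged SQL illustration (table filters and details are simplified, not the actual migration):

    -- Hedged illustration of the archival pattern (simplified; not the
    -- actual migration shipped with swh-scheduler).
    ALTER TABLE task RENAME TO task_archived;
    ALTER TABLE task_run RENAME TO task_run_archived;

    -- Recreate the live tables with only the tasks still needed by the
    -- runner/runner-priority services.
    CREATE TABLE task AS
      SELECT * FROM task_archived
      WHERE policy = 'recurring'
         OR next_run > now() - interval '2 months';
    CREATE TABLE task_run AS
      SELECT tr.* FROM task_run_archived tr
      JOIN task t ON t.id = tr.task;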
-
- Jan 05, 2022
-
-
Vincent Sellier authored
Related to T3827
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also drops spurious copyright headers from those files if present. Related to T3812
-
- Dec 09, 2021
-
-
Nicolas Dandrimont authored
When using ``insert into <...> select <...>``, PostgreSQL disables parallel querying. Under some circumstances (in our large production database), this makes updating the scheduler metrics take a (very) long time. Parallel querying is allowed for ``create table <...> as select <...>``, and doing so restores the small(er) runtimes for this query (15 minutes instead of multiple hours). To use that, we have to turn the function into plpgsql instead of plain sql.
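A hedged, simplified sketch of the pattern described above (the real function computes the scheduler metrics; table and function names here are illustrative):

    -- Hedged illustration: materialize the SELECT with CREATE TABLE ... AS
    -- so PostgreSQL can parallelize it, then copy the precomputed rows.
    -- This requires plpgsql rather than plain sql.
    CREATE OR REPLACE FUNCTION update_metrics_example() RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
      -- Parallel query is allowed for this statement...
      CREATE TEMPORARY TABLE new_metrics AS
        SELECT visit_type, count(*) AS origins
        FROM listed_origins
        GROUP BY visit_type;
      -- ...and the final insert only moves the precomputed rows.
      INSERT INTO scheduler_metrics_example SELECT * FROM new_metrics;
      DROP TABLE new_metrics;
    END;
    $$;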
-
- Dec 08, 2021
-
-
Antoine R. Dumont authored
This is dead code now, as it has long been stopped and disabled in production. Related to T3777
-
- Dec 07, 2021
-
-
Nicolas Dandrimont authored
In visit types with small numbers of origins having no last_update field, we would end up overflowing Python datetimes (which only go up to 31 December 9999) pretty quickly. Making the queue position a 64-bit integer should give us some more leeway. The queue position now defaults to zero instead of an arbitrary point in time. Queue offsets are still commensurate with seconds, but that's mostly to give them some space to be splayed by the fudge factors.
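The overflow described above is easy to reproduce with plain Python datetimes:

    # Python datetimes stop at year 9999, which the old queue-position
    # computation could run into.
    from datetime import datetime, timedelta

    try:
        datetime.max + timedelta(days=1)
    except OverflowError as exc:
        print("overflow:", exc)  # date value out of range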
-
- Dec 06, 2021
-
-
Vincent Sellier authored
When a lister tries to insert duplicate origins in the same batch, the insertion fails because the "on conflict do update" instruction cannot manage duplicates in the same statement. Related to T3769
-
- Nov 22, 2021
-
-
vlorentz authored
grab_next_visits grabs from `listed_origins`, whose primary key is `(lister_id, url, visit_type)` and uses it to upsert in origin_visit_stats, whose primary key is `(url, visit_type)`. This causes the error `ON CONFLICT DO UPDATE command cannot affect row a second time` when the same (origin, type) pair is grabbed twice. This commit deduplicates the (origin, type) pairs before upserting.
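A hedged sketch of the deduplication step (not the actual code):

    # Hedged sketch: keep a single entry per (url, visit_type) pair so the
    # ON CONFLICT DO UPDATE upsert never touches the same row twice.
    def dedup_visit_stats(rows):
        seen = {}
        for row in rows:
            # later entries overwrite earlier ones for the same key
            seen[(row["url"], row["visit_type"])] = row
        return list(seen.values())

    rows = [
        {"url": "https://example.org/repo", "visit_type": "git", "lister_id": "a"},
        {"url": "https://example.org/repo", "visit_type": "git", "lister_id": "b"},
    ]
    assert len(dedup_visit_stats(rows)) == 1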
-
- Oct 29, 2021
-
-
Nicolas Dandrimont authored
The ratios weren't checked for normalization; using relative weights explicitly ensures that the settings won't be misinterpreted.
-
Nicolas Dandrimont authored
-
- Oct 28, 2021
-
-
For each known visit type, we run a loop which:
- monitors the size of the relevant celery queue
- schedules more visits of the relevant type once the number of available slots goes over a given threshold (currently set to 5% of the max queue size).

The scheduling of visits combines multiple scheduling policies, for now using static ratios set in the `POLICY_RATIOS` dict. We emit a warning if the ratio of origins fetched for each policy is skewed with respect to the original request (allowing, for now, manual adjustment of the ratios).

The CLI endpoint spawns one thread for each visit type, which all handle connections to RabbitMQ and the scheduler backend separately. For now, we handle exceptions in the visit scheduling threads by (stupidly) respawning the relevant thread directly. We should probably improve this to give up after a specific number of tries.

Co-authored-by:
Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
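A hedged sketch of the per-visit-type loop described above (names, threshold handling and helpers are illustrative, not the actual implementation):

    # Hedged sketch of the per-visit-type scheduling loop; queue_length and
    # schedule_origins stand in for the real RabbitMQ/scheduler calls.
    import time

    MAX_QUEUE_SIZE = 10000
    MIN_FREE_SLOTS = MAX_QUEUE_SIZE * 5 // 100  # threshold: ~5% of the queue

    def scheduling_loop(visit_type, queue_length, schedule_origins):
        while True:
            free_slots = MAX_QUEUE_SIZE - queue_length(visit_type)
            if free_slots > MIN_FREE_SLOTS:
                schedule_origins(visit_type, free_slots)
            time.sleep(10)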
-
- Oct 27, 2021
-
-
Nicolas Dandrimont authored
When the database is in a non-UTC timezone with DST, and a `timestamptz - interval` calculation crosses a DST change, the result of the calculation can be one hour off from the expected value: PostgreSQL will vary the timestamp by the amount of days in the interval, and will keep the same (local) time, which will be offset by an hour because of the DST change. Doing the datetime +- timedelta calculations in Python instead of PostgreSQL avoids this caveat altogether.
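A hedged illustration of the workaround: compute the cutoff in Python with timezone-aware datetimes and pass it as a plain query parameter instead of computing `timestamptz - interval` in SQL (the connection string below is an assumption for the sake of the example):

    # Hedged sketch: do the datetime arithmetic in Python (UTC-aware) and
    # hand the result to PostgreSQL as a parameter.
    from datetime import datetime, timedelta, timezone

    import psycopg2

    # dbname is an assumption for the sake of the example
    cur = psycopg2.connect("dbname=swh-scheduler").cursor()

    cutoff = datetime.now(tz=timezone.utc) - timedelta(days=30)
    cur.execute(
        "SELECT url FROM listed_origins WHERE last_update < %s",
        (cutoff,),
    )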
-
- Oct 22, 2021
-
-
Antoine R. Dumont authored
Otherwise, in some edge cases, like running in docker, the install fails on a conflict. Related to P1205#8092
-
- Oct 20, 2021
-
-
Antoine R. Dumont authored
Related to T3667
-