- Feb 08, 2022
-
-
David Douard authored
so that tests do not depend on a lucky guess about what the scheduler db state actually is. DB initialization scripts do create task types for git, hg and svn (used in tests), but these tests depend on the db fixture having already been called once before, so that tables are truncated (especially the task and task_type ones). For example, running a single test involved in task-type creation was failing (e.g. 'pytest swh -k test_create_task_type_idempotence'). This commit makes tests not collide with any existing task or task type that initialization scripts may create. Note that this also means that there is actually no test dealing with the scheduler db state after initialization, which is not great and should be addressed.
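A minimal sketch of the idea, assuming the dict-based `create_task_type`/`get_task_type` backend API and a `swh_scheduler` fixture; names and values below are illustrative:

```
import datetime
import uuid


def test_create_task_type_idempotence(swh_scheduler):
    # Use a task type name that cannot collide with the ones created by the
    # DB initialization scripts (git, hg, svn, ...).
    task_type = {
        "type": f"test-task-{uuid.uuid4()}",       # unique per test run
        "description": "task type created by the test itself",
        "backend_name": "swh.example.tasks.Noop",  # hypothetical celery task
        "default_interval": datetime.timedelta(days=1),
        "min_interval": datetime.timedelta(hours=1),
        "max_interval": datetime.timedelta(days=7),
        "backoff_factor": 2,
    }
    # Creating the same task type twice must be a no-op the second time.
    swh_scheduler.create_task_type(task_type)
    swh_scheduler.create_task_type(task_type)
    assert swh_scheduler.get_task_type(task_type["type"]) is not None
```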
-
- Feb 07, 2022
-
-
Antoine R. Dumont authored
Related to T3916
-
- Jan 21, 2022
-
-
vlorentz authored
-
- Jan 12, 2022
-
-
Antoine R. Dumont authored
This archives the current task and task_run tables, creating new ones that keep only the necessary tasks (the last 2 months' oneshot tasks plus some recurring tasks: lister, indexer, ...). Those filtered tasks are the ones scheduled by the runner and runner-priority services. This archiving will allow those services to be faster (the corresponding queries will return results faster without the archived data). Related to T3837
-
- Jan 05, 2022
-
-
Vincent Sellier authored
Related to T3827
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also drops spurious copyright headers from those files if present. Related to T3812
-
- Dec 09, 2021
-
-
Nicolas Dandrimont authored
When using ``insert into <...> select <...>``, PostgreSQL disables parallel querying. Under some circumstances (in our large production database), this makes updating the scheduler metrics take a (very) long time. Parallel querying is allowed for ``create table <...> as select <...>``, and doing so restores the small(er) runtimes for this query (15 minutes instead of multiple hours). To use that, we have to turn the function into plpgsql instead of plain sql.
-
- Dec 08, 2021
-
-
Antoine R. Dumont authored
This is dead code now, as it has long been stopped and disabled in production. Related to T3777
-
- Dec 07, 2021
-
-
Nicolas Dandrimont authored
In visit types with small numbers of origins having no last_update field, we would end up overflowing Python datetimes (which only go up to 31 December 9999) pretty quickly. Making the queue position a 64-bit integer should give us some more leeway. The queue position now defaults to zero instead of an arbitrary point in time. Queue offsets are still commensurate with seconds, but that's mostly to give them some space to be splayed by the fudge factors.
-
- Dec 06, 2021
-
-
Vincent Sellier authored
When a lister tries to insert duplicate origins in the same batch, the insertion fails because the "on conflict do update" instruction cannot handle duplicates in the same transaction. Related to T3769
-
- Nov 22, 2021
-
-
vlorentz authored
grab_next_visits grabs from `listed_origins`, whose primary key is `(lister_id, url, visit_type)` and uses it to upsert in origin_visit_stats, whose primary key is `(url, visit_type)`. This causes the error `ON CONFLICT DO UPDATE command cannot affect row a second time` when the same (origin, type) pair is grabbed twice. This commit deduplicates the (origin, type) pairs before upserting.
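A sketch of the deduplication step, with hypothetical names: keep a single entry per `(url, visit_type)` key before building the upsert, so one statement never touches the same origin_visit_stats row twice.

```
def deduplicate_grabbed_origins(grabbed):
    """Keep one record per (url, visit_type) pair.

    `grabbed` is assumed to be a list of objects with `url` and `visit_type`
    attributes; when the same pair appears several times (e.g. listed by two
    different listers), only the last occurrence is kept, so the subsequent
    `INSERT ... ON CONFLICT (url, visit_type) DO UPDATE` cannot affect the
    same row twice.
    """
    seen = {}
    for origin in grabbed:
        seen[(origin.url, origin.visit_type)] = origin
    return list(seen.values())
```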
-
- Oct 29, 2021
-
-
Nicolas Dandrimont authored
The ratios weren't checked for normalization; using relative weights explicitly ensures that the settings won't be misinterpreted.
-
Nicolas Dandrimont authored
-
- Oct 28, 2021
-
-
For each known visit type, we run a loop which:
- monitors the size of the relevant celery queue
- schedules more visits of the relevant type once the number of available slots goes over a given threshold (currently set to 5% of the max queue size).
The scheduling of visits combines multiple scheduling policies, for now using static ratios set in the `POLICY_RATIOS` dict. We emit a warning if the ratio of origins fetched for each policy is skewed with respect to the original request (allowing, for now, manual adjustment of the ratios). The CLI endpoint spawns one thread for each visit type, which all handle connections to RabbitMQ and the scheduler backend separately. For now, we handle exceptions in the visit scheduling threads by (stupidly) respawning the relevant thread directly. We should probably improve this to give up after a specific number of tries. Co-authored-by:
Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
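A rough sketch of one such per-visit-type loop; `get_queue_length` and `schedule_origins` are hypothetical stand-ins for the RabbitMQ inspection and the actual scheduling calls, and the ratio values are illustrative:

```
import logging
import time

logger = logging.getLogger(__name__)

# Static scheduling policy ratios, as described above (illustrative values).
POLICY_RATIOS = {
    "already_visited_order_by_lag": 0.5,
    "never_visited_oldest_update_first": 0.5,
}

MIN_SLOTS_RATIO = 0.05  # refill once 5% of the max queue size is available


def visit_scheduler_loop(visit_type, max_queue_size, get_queue_length, schedule_origins):
    """Keep the celery queue for `visit_type` filled; one thread runs per visit type."""
    while True:
        available = max_queue_size - get_queue_length(visit_type)
        if available >= MIN_SLOTS_RATIO * max_queue_size:
            for policy, ratio in POLICY_RATIOS.items():
                wanted = int(available * ratio)
                scheduled = schedule_origins(visit_type, policy, wanted)
                if scheduled < wanted:
                    # origins fetched for this policy are skewed w.r.t. the request
                    logger.warning(
                        "%s: only %d/%d origins for policy %s",
                        visit_type, scheduled, wanted, policy,
                    )
        time.sleep(10)  # polling period, arbitrary for this sketch
```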
-
- Oct 27, 2021
-
-
Nicolas Dandrimont authored
When the database is in a non-UTC timezone with DST, and a `timestamptz - interval` calculation crosses a DST change, the result of the calculation can be one hour off from the expected value: PostgreSQL will vary the timestamp by the number of days in the interval, but will keep the same (local) time, which will be offset by an hour because of the DST change. Doing the datetime +- timedelta calculations in Python instead of PostgreSQL avoids this caveat altogether.
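A sketch of the workaround, with an illustrative query: compute the cutoff with UTC datetimes on the Python side and pass it as a bind parameter, instead of letting PostgreSQL evaluate `timestamptz - interval` in a DST-prone local timezone.

```
from datetime import datetime, timedelta, timezone


def visits_older_than(cursor, days):
    # UTC has no DST, so subtracting a timedelta here always shifts the
    # timestamp by exactly `days` * 24 hours, whatever timezone the
    # database server is configured with.
    cutoff = datetime.now(tz=timezone.utc) - timedelta(days=days)
    cursor.execute(
        "select url, visit_type from origin_visit_stats where last_visit < %s",
        (cutoff,),
    )
    return cursor.fetchall()
```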
-
- Oct 22, 2021
-
-
Antoine R. Dumont authored
Otherwise, in some edge cases, like running in docker, the install fails on a conflict. Related to P1205#8092
-
- Oct 20, 2021
-
-
Antoine R. Dumont authored
Related to T3667
-
Antoine R. Dumont authored
It's been deprecated for long enough. Related to T3667
-
Antoine R. Dumont authored
-
- Oct 18, 2021
-
-
Antoine R. Dumont authored
This actually fixes the debian build failure. Related to T3666
-
- Oct 15, 2021
-
-
Antoine R. Dumont authored
This scenario happens with the oneshot loader, for example. This loader deals with more than one type of origin to ingest in the same queue, so the computation in that function could return a negative value [1], which is ultimately not possible to execute in SQL. This commit fixes that behavior. It also makes explicit in the function's docstring that it must return positive values.

[1]
```
...
psycopg2.errors.InvalidRowCountInLimitClause: LIMIT must not be negative
```
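A minimal sketch of the fix (hypothetical helper name): clamp the computed number of slots so a negative value can never reach the SQL LIMIT clause.

```
def num_visits_to_schedule(max_queue_length, queue_length):
    """Number of visits that can still be scheduled for a queue.

    Several origin types may share the same queue, so `queue_length` can
    exceed `max_queue_length`; clamping to zero avoids generating
    `LIMIT -n` in the SQL query.
    """
    return max(0, max_queue_length - queue_length)
```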
-
- Sep 02, 2021
-
- Aug 27, 2021
-
-
Antoine R. Dumont authored
In the non-optimal case, we may want to trigger specific cases (not-yet-enabled origins, origins from a specific lister, ...). Related to T3350
-
- Aug 26, 2021
-
-
Nicolas Dandrimont authored
For origins that have never been visited, and for which we don't have a queue position yet, we want to visit them in the order they've been added.
-
Nicolas Dandrimont authored
The subcommand bypasses the legacy task-based mechanism to directly send new origin visits to celery
-
Nicolas Dandrimont authored
Running common operations on all git origins is pretty intense. Using table sampling gives us the opportunity to at least schedule some jobs in (decently small) time.
-
Antoine R. Dumont authored
-
Antoine R. Dumont authored
Queue positions are dates, and the next_position_offset used to compute the new queue position was not bounded. This had the side effect of causing overflow errors. This commit adapts the journal client computations to limit next_position_offset to 10. This value was chosen because above that exponent the dates overflow (and we are already way in the future). Related to T3502
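An illustrative sketch of the clamping; the growth formula below is a hypothetical exponential backoff, the point being only that the offset is capped at 10 before it is used as an exponent:

```
from datetime import datetime, timedelta, timezone

MAX_NEXT_POSITION_OFFSET = 10  # beyond this exponent the dates overflow


def clamp_next_position_offset(current_offset, delta):
    # `delta` depends on the visit outcome (see the journal client logic)
    return min(max(current_offset + delta, 0), MAX_NEXT_POSITION_OFFSET)


def next_queue_position(base_interval_seconds, offset):
    # hypothetical exponential backoff; with the offset capped at 10 this
    # stays comfortably within the range of Python datetimes
    return datetime.now(tz=timezone.utc) + timedelta(
        seconds=base_interval_seconds * 2 ** offset
    )
```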
-
- Aug 18, 2021
-
-
vlorentz authored
We changed the task name/interface a while ago
-
- Aug 03, 2021
-
-
Antoine R. Dumont authored
This disables origins after either failed or not-found attempts 3 times in a row. It's not definitive, though, as it's the lister's responsibility to re-enable origins if they get listed again. Related to T2345
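A sketch of the disabling rule, with hypothetical names; it relies on a counter of successive visits ending with the same status:

```
MAX_FAILED_OR_NOT_FOUND = 3  # successive bad visits before disabling an origin


def should_disable(visit_status, successive_same_status):
    """Decide whether an origin should be disabled after a visit.

    `successive_same_status` counts how many visits in a row ended with
    `visit_status`. This is not definitive: listers remain responsible for
    re-enabling the origin if it shows up in a new listing.
    """
    return (
        visit_status in ("failed", "not_found")
        and successive_same_status >= MAX_FAILED_OR_NOT_FOUND
    )
```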
-
This maintains the number of successive visits resulting in the same status. This will help implement the disabling of origins after too many successive failed or not_found visits. Related to T2345
-
- Jul 30, 2021
-
-
Antoine R. Dumont authored
Related to T2345
-
Antoine R. Dumont authored
This is no longer required as it's called once. Related to T2345
-
- Jul 23, 2021
-
-
Nicolas Dandrimont authored
After using this schema for a while, all queries can be implemented in terms of these two timestamps, instead of the four original last_eventful, last_uneventful, last_failed and last_notfound timestamps. This ends up simplifying the logic within the journal client, as well as that of the grab_next_visits query builder. To make this change work, we also stop considering out-of-order messages altogether in journal_client. This welcome simplification is an accuracy tradeoff that is explained in the updated documentation of the journal client:

.. [1] Ignoring out-of-order messages makes the initialization of the origin_visit_status table (from a full journal) less deterministic: only the `last_visit`, `last_visit_state` and `last_successful` fields are guaranteed to be exact; the `next_position_offset` field is a best-effort estimate (which should converge once the client has run for a while on in-order messages).
-
Antoine R. Dumont authored
Related to D5917
-
Antoine R. Dumont authored
This properly simplifies and unifies the utility test function used to compare visit stats.
-
- Jul 22, 2021
-
-
This is in charge of scheduling origins without a last update. This also updates the global queue position so the journal client can correctly initialize the next position per origin and visit type. Related to T2345
-
Nicolas Dandrimont authored
This allows us to insert extra CTEs if a scheduling policy needs it.
-
- Jul 06, 2021
-
-
Antoine R. Dumont authored
For origins without any last_update information [1], the journal client is now also in charge of moving their next position in the queue for rescheduling. Depending on the visit status, the next position offset and next_visit_queue_position are updated after each visit completes:
- if the visit has failed, increase the next visit target by the minimal visit interval (to take transient loading issues into account);
- if the visit is successful and records some changes, decrease the visit interval index by 2 (visit the origin *way* more often);
- if the visit is successful and records no changes, increase the visit interval index by 1 (visit the origin less often).
We then set the next visit target to its current value plus the new visit interval multiplied by a random fudge factor (picked in the -/+ 10% range). The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins, e.g. when a number of origins from a single hoster are processed at once. Note that the computations happen for all origins for simplicity and code maintenance, but they will only be used by a new scheduling policy coming soon.

[1] The lister cannot provide it for some reason.
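A simplified sketch of the update rule described above; the names and the interval table are illustrative, not the journal client's actual ones:

```
import random
from datetime import timedelta

# Illustrative table of visit intervals, indexed by the "visit interval index".
VISIT_INTERVALS = [timedelta(days=d) for d in (1, 2, 4, 8, 16, 32, 64)]
MIN_VISIT_INTERVAL = VISIT_INTERVALS[0]


def update_next_visit(status, eventful, interval_index, next_visit_target):
    """Return the updated (interval_index, next_visit_target) after a visit."""
    if status == "failed":
        # transient loading issue: only push the target back by the minimal interval
        return interval_index, next_visit_target + MIN_VISIT_INTERVAL
    if eventful:
        # the visit recorded changes: visit the origin *way* more often
        interval_index = max(interval_index - 2, 0)
    else:
        # no changes recorded: visit the origin less often
        interval_index = min(interval_index + 1, len(VISIT_INTERVALS) - 1)
    # fudge factor in the -/+ 10% range, to spread visits out and avoid bursts
    fudge = random.uniform(0.9, 1.1)
    return interval_index, next_visit_target + VISIT_INTERVALS[interval_index] * fudge
```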
-
- Jul 01, 2021
-
-
Antoine R. Dumont authored
-