Commits · debian/upstream/0.11.0 · Platform / Development / swh-scheduler

Apr 14, 2021
- New upstream version 0.11.0 · 77ea5902
  Jenkins for Software Heritage authored 3 years ago
  
  debian/upstream/0.11.0
  
  77ea5902
Apr 13, 2021

backend: Open endpoints to peek/grab tasks with any priority · 3e2ae3d4

The notion of priority becomes blurred: any task with a non-null priority is considered for
reading or grabbing.

In a future commit, this should allow the runner to evolve and reroute tasks with
priority to other queues.

Related to T3084

3e2ae3d4

Feb 11, 2021

Make origin_visit_stats_get return results from all pages · ecab745a

Nicolas Dandrimont authored 3 years ago

psycopg2.extras.execute_values executes queries in batches of 100 by
default. At the end of execute_values, only the last batch of results is
available in the cursor. To fetch all results, one needs to set
fetch=True and use the returned rows instead of reading from the cursor.

ecab745a

journal client: Filter out status messages without type · 86ada443
Nicolas Dandrimont authored 3 years ago
```
This allows us to support reading the journal from the beginning,
ignoring messages with the old schema.
```
86ada443

Simplify max_date() · cdb1775f

Nicolas Dandrimont authored 3 years ago

The built-in `max` function can take an iterable directly, no need to
reimplement it.

cdb1775f
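A minimal sketch of what such a simplified helper might look like. The signature and the handling of `None` values are assumptions made for illustration, not the actual swh-scheduler code:

```python
from datetime import datetime, timezone
from typing import Optional


def max_date(*dates: Optional[datetime]) -> datetime:
    """Return the latest of the given dates, ignoring None values.

    Hypothetical signature, for illustration only.
    """
    known = [d for d in dates if d is not None]
    if not known:
        raise ValueError("at least one date must be set")
    # max() accepts any iterable, so no manual comparison loop is needed.
    return max(known)


older = datetime(2020, 1, 1, tzinfo=timezone.utc)
newer = datetime(2021, 1, 1, tzinfo=timezone.utc)
latest = max_date(older, None, newer)
```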

Feb 09, 2021

journal_client: Fix date computations for (un)eventful visits · cf32e376

Vincent Sellier authored 3 years ago

Fix a wrong computation when several messages (>=3) for the same
snapshot are received in the wrong order.
For example, before the fix, the following occurs:
```
| date | snapshot |     | last_ev  | last_unev | Snap |
| ---- | -------- | --- | -------- | --------- | ---- |
| 2022 | S2       |     | 2022     |           | S2   |
| 2020 | S2       |     | 2020     | 2022      | S2   |
| 2021 | S2       |     | **2021** | **2020**  | S2   |
```

whereas it should be:
```
| date | snapshot |     | last_ev  | last_unev | Snap |
| ---- | -------- | --- | -------- | --------- | ---- |
| 2022 | S2       |     | 2022     |           | S2   |
| 2020 | S2       |     | 2020     | 2022      | S2   |
| 2021 | S2       |     | **2020** | **2022**  | S2   |
```

Related to T3000

cf32e376
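The expected table from cf32e376 can be reproduced with a small sketch. It assumes a simplified rule for messages about the same snapshot: the earliest date is the eventful one and the latest later date is the uneventful one. The `VisitStats` class and `update_stats` function are hypothetical names, and years stand in for full timestamps:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisitStats:
    # Hypothetical, simplified stand-in for the visit stats row.
    last_eventful: Optional[int] = None
    last_uneventful: Optional[int] = None
    last_snapshot: Optional[str] = None


def update_stats(stats: VisitStats, date: int, snapshot: str) -> VisitStats:
    if stats.last_snapshot != snapshot:
        # New snapshot: the visit is eventful. Out-of-order messages across
        # different snapshots are deliberately not handled in this sketch.
        return VisitStats(date, stats.last_uneventful, snapshot)
    if date == stats.last_eventful:
        # Duplicated message (same date, same snapshot): nothing changes.
        return stats
    if date < stats.last_eventful:
        # An older message for the same snapshot: it becomes the eventful
        # visit, and the previous eventful date is now an uneventful one.
        known = [d for d in (stats.last_uneventful, stats.last_eventful) if d is not None]
        return VisitStats(date, max(known), snapshot)
    # A later message for the same snapshot is an uneventful visit.
    if stats.last_uneventful is None:
        last_uneventful = date
    else:
        last_uneventful = max(stats.last_uneventful, date)
    return VisitStats(stats.last_eventful, last_uneventful, snapshot)


stats = VisitStats()
history = []
for year in (2022, 2020, 2021):
    stats = update_stats(stats, year, "S2")
    history.append((stats.last_eventful, stats.last_uneventful))
```

Replaying the three messages in the order 2022, 2020, 2021 leaves `last_eventful` at 2020 and `last_uneventful` at 2022, as in the expected table.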

Feb 05, 2021
- journal_client: Deal with failed status message · aa507ac5
  Antoine R. Dumont authored 3 years ago
```
As loaders will start to create failed status messages, deal with them if any.

Related to T3030
```
  aa507ac5
Feb 03, 2021

New upstream version 0.10.0 · 2cf46e3b
Jenkins for Software Heritage authored 3 years ago

debian/upstream/0.10.0

2cf46e3b

celery: acknowledge tasks as soon as they're received · 14feab95

Nicolas Dandrimont authored 3 years ago

With late acknowledgements, RabbitMQ will re-send tasks to clients even
if they can't ever complete the task (e.g. when the task gets killed
because the machine is out of memory).

This problem only increases over time, leading to complete starvation of
the ingestion system.

Now that we have multiple mechanisms to issue retries of tasks, we can
use early acknowledgements for tasks instead, which should mitigate the
ongoing starvation, at the expense of having to retry tasks externally.

14feab95
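In Celery, the timing of acknowledgements is governed by the `task_acks_late` setting. The fragment below is only a sketch of the kind of configuration 14feab95 describes. The dictionary name is hypothetical, and the actual change in swh-scheduler may be expressed differently:

```python
# Settings in the shape accepted by a Celery application's
# `app.conf.update(...)`. The dictionary name is illustrative.
CELERY_ACK_SETTINGS = {
    # False selects early acknowledgement: the message is acknowledged
    # before the task runs, so a worker killed mid-task does not cause
    # RabbitMQ to redeliver the same task over and over.
    "task_acks_late": False,
}
```

With early acknowledgement, a task lost to a crashed worker is not redelivered by the broker, which is why retries have to be issued externally.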

Feb 01, 2021
- Simulator: allow to export results in a csv file · aaffff26
  David Douard authored 3 years ago
  
  aaffff26
- Add minimal tests for the SimulationReport.format() method · 9fce3f6f
  David Douard authored 3 years ago
  
  9fce3f6f
Jan 29, 2021
- Make plottings optional in simulator cli output · aaf7dd6f
  David Douard authored 3 years ago
  
  aaf7dd6f
Jan 26, 2021
- simulator: stop validating the scheduling policy in the CLI · cf0583b0
  Nicolas Dandrimont authored 3 years ago and vlorentz committed 3 years ago
```
We already do that in the scheduler backend function.
```
  cf0583b0
- Run simulator tests on all known scheduling policies · ebb5847e
  Nicolas Dandrimont authored 3 years ago and vlorentz committed 3 years ago
  
  ebb5847e
- simulator: record visit metrics alongside scheduler metrics · 1f77521d
  Nicolas Dandrimont authored 3 years ago and vlorentz committed 3 years ago
```
This allows us to check the behavior of the archive over time in terms
of number of visits.
```
  1f77521d
- simulator: stop using the database as a cache for origin data · 88983944
  Nicolas Dandrimont authored 3 years ago and vlorentz committed 3 years ago
```
This was a significant bottleneck of the simulator. To work around this,
we:

 - Generate snapshot ids consistently in the OriginModel
 - Cache the origin data locally in the simulator, to compute the
   eventfulness of visits
 - Cache the last visit time for all origins to compute the estimated
   run time of visit tasks.
```
  88983944
- grab_next_visits: don't re-schedule visits too fast · c92ead58
  Nicolas Dandrimont authored 3 years ago and vlorentz committed 3 years ago
```
The earlier implementation would just schedule new visits for origins
forever, regardless of whether they were already scheduled or not.
```
  c92ead58
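The idea behind c92ead58 can be sketched in memory as follows. The `OriginState` class, the `cooldown` parameter and its default value are assumptions for illustration. The real implementation is an SQL query in the scheduler backend:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class OriginState:
    # Hypothetical in-memory stand-in for a row of scheduling state.
    url: str
    last_scheduled: Optional[datetime] = None


def grab_next_visits(
    origins: List[OriginState],
    count: int,
    now: datetime,
    cooldown: timedelta = timedelta(days=7),  # assumed value
) -> List[OriginState]:
    """Select up to `count` origins not scheduled within `cooldown` of `now`."""
    eligible = [
        o
        for o in origins
        if o.last_scheduled is None or o.last_scheduled < now - cooldown
    ]
    selected = eligible[:count]
    for origin in selected:
        # Record the scheduling time so the origin is skipped next time.
        origin.last_scheduled = now
    return selected


now = datetime(2021, 1, 26, tzinfo=timezone.utc)
origins = [OriginState("https://example.org/a"), OriginState("https://example.org/b")]
first = grab_next_visits(origins, 10, now)
second = grab_next_visits(origins, 10, now + timedelta(hours=1))
later = grab_next_visits(origins, 10, now + timedelta(days=8))
```

Passing `now` explicitly rather than reading the system clock also mirrors the timestamp override introduced in 2b39cbca, which lets a simulator drive the function with simulated time.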
- Allow overriding the timestamp of grab_next_visits · 2b39cbca
  Nicolas Dandrimont authored 3 years ago and vlorentz committed 3 years ago
```
This makes the simulator behavior more consistent with reality.
```
  2b39cbca
- Construct grab_next_visits query arguments incrementally · 7ffbdd1b
  Nicolas Dandrimont authored 3 years ago and vlorentz committed 3 years ago
  
  7ffbdd1b
- simulator: add simple lister simulation · ea068b46
  vlorentz authored 3 years ago
  
  ea068b46
Jan 25, 2021
- New upstream version 0.9.2 · db8fa8e8
  Jenkins for Software Heritage authored 3 years ago
  
  debian/upstream/0.9.2
  
  db8fa8e8
- Factor out ListedOrigin generation to use the OriginModel · 7af98e2b
  vlorentz authored 3 years ago
```
This generates consistent last_update values according to the model and
simulated time.
```
  7af98e2b
- model/ListedOrigin: Set extra_loader_arguments type to Dict[str, Any] · 2906b4e8
  Antoine Lambert authored 3 years ago
```
Some loaders, for instance the debian one, can have non-string arguments,
so change the extra_loader_arguments type of the ListedOrigin model to
something more generic.

Related to T2979
```
  v0.9.2
  
  2906b4e8
Jan 23, 2021

Solve uneventful/eventful with unordered messages with snapshots · 3d13cda4

Vincent Sellier authored 3 years ago

Fix the case:
m1: date2/snapshot1
m2: date1/snapshot1
which results in:
last_eventful = date2
last_uneventful = date2

The upsert was always keeping the most recent date when the
eventful/uneventful dates were switched.

Related to T2978

3d13cda4

Do not consider duplicated messages as uneventful event · d528998d

Vincent Sellier authored 3 years ago

Avoid copying the eventful date to the uneventful date when a
duplicated message (same date/same snapshot) is received.

Related to T2978

d528998d

Jan 22, 2021
- Add a --num-origins option to the fill-test-data cli command · 86b25554
  David Douard authored 4 years ago
  
  86b25554
- Simulation: log at info level recorded metrics · abb513ca
  David Douard authored 3 years ago
```
This makes it possible to follow what the simulation is doing.
```
  abb513ca
Jan 21, 2021
- New upstream version 0.9.1 · 70532dcd
  Jenkins for Software Heritage authored 3 years ago
  
  debian/upstream/0.9.1
  
  70532dcd
- Solve uneventful/eventful with unordered messages with snapshots · 82b7a8a4
  Vincent Sellier authored 3 years ago
```
Fix the case:
m1: date2/snapshot1
m2: date1/snapshot1
which results in:
last_eventful = date2
last_uneventful = date2

The upsert was always keeping the most recent date when the
eventful/uneventful dates were switched.

Related to T2978
```
  v0.9.1
  
  82b7a8a4
- Do not consider duplicated messages as uneventful event · 25d036ef
  Vincent Sellier authored 3 years ago
```
Avoid copying the eventful date to the uneventful date when a
duplicated message (same date/same snapshot) is received.

Related to T2978
```
  25d036ef
- Make PaginatedListedOriginList a concretization of PagedResult · b93aa5be
  vlorentz authored 3 years ago
```
1. consistent with swh-storage and swh-indexer-storage
2. we can use swh.core.api.classes.stream_results on scheduler.get_listed_origins.
```
  b93aa5be
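To illustrate why a common paged-result type is convenient, here is a self-contained sketch of the pattern. The `PagedResult` and `stream_results` definitions below are simplified re-implementations written for this example, not the code from swh.core, and the `get_listed_origins` stub and its data are made up:

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, Iterator, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class PagedResult(Generic[T]):
    # One page of results plus an opaque token for the next page, if any.
    results: List[T] = field(default_factory=list)
    next_page_token: Optional[str] = None


def stream_results(fetch_page: Callable[..., PagedResult[T]], **kwargs) -> Iterator[T]:
    """Yield every result by following next_page_token until it is None."""
    page_token = None
    while True:
        page = fetch_page(page_token=page_token, **kwargs)
        yield from page.results
        page_token = page.next_page_token
        if page_token is None:
            break


def get_listed_origins(page_token: Optional[str] = None, limit: int = 2) -> PagedResult[str]:
    # Stub endpoint returning made-up URLs, two per page by default.
    origins = [f"https://example.org/repo{i}" for i in range(5)]
    start = int(page_token) if page_token is not None else 0
    end = start + limit
    token = str(end) if end < len(origins) else None
    return PagedResult(results=origins[start:end], next_page_token=token)


all_origins = list(stream_results(get_listed_origins, limit=2))
```

Any endpoint returning this shape can be consumed by the same generic helper, so callers do not need to write their own pagination loop.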
- Reorganize grab_next_visits tests to better check sorting behavior · 03460207
  Nicolas Dandrimont authored 4 years ago and vlorentz committed 3 years ago
```
 - factor out test setup and results checking
 - properly exercise corner cases of the oldest_scheduled_first policy
```
  03460207
- Add scheduling policy for already visited origins with known last update · 2f479367
  Nicolas Dandrimont authored 4 years ago and vlorentz committed 3 years ago
```
This policy schedules origins by decreasing order of "visit lag" (that
is, origins with the most lag are scheduled first).
```
  2f479367
- Add scheduling policy for never visited origins · acad712a
  Nicolas Dandrimont authored 4 years ago and vlorentz committed 3 years ago
```
This policy orders never visited origins by increasing date of last
update (scheduling the "oldest" never visited origins first).
```
  acad712a
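The two orderings described in 2f479367 and acad712a can be sketched in plain Python. The `Origin` class and the function names are hypothetical, and "visit lag" is assumed here to mean the time between the last visit and the last known upstream update. The real policies are implemented as SQL in `grab_next_visits`:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Origin:
    # Hypothetical in-memory stand-in for a listed origin and its stats.
    url: str
    last_update: datetime
    last_visit: Optional[datetime] = None


def order_already_visited_by_lag(origins: List[Origin]) -> List[Origin]:
    """Visited origins, largest visit lag first."""
    visited = [o for o in origins if o.last_visit is not None]
    # Assumed definition of lag: last_update - last_visit.
    return sorted(visited, key=lambda o: o.last_update - o.last_visit, reverse=True)


def order_never_visited_oldest_update_first(origins: List[Origin]) -> List[Origin]:
    """Never visited origins, oldest last_update first."""
    never_visited = [o for o in origins if o.last_visit is None]
    return sorted(never_visited, key=lambda o: o.last_update)


def day(n: int) -> datetime:
    return datetime(2021, 1, n, tzinfo=timezone.utc)


origins = [
    Origin("a", last_update=day(10), last_visit=day(9)),  # lag: 1 day
    Origin("b", last_update=day(20), last_visit=day(5)),  # lag: 15 days
    Origin("c", last_update=day(15)),                     # never visited
    Origin("d", last_update=day(3)),                      # never visited
]
by_lag = [o.url for o in order_already_visited_by_lag(origins)]
never_visited = [o.url for o in order_never_visited_oldest_update_first(origins)]
```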
- Run Black. · af378989
  vlorentz authored 3 years ago
```
It wasn't run on d464b4cc.
```
  af378989
- New upstream version 0.9.0 · 2fd414f6
  Jenkins for Software Heritage authored 3 years ago
  
  debian/upstream/0.9.0
  
  2fd414f6
- Make the grab_next_visits sql query modular · b641ac83
  Nicolas Dandrimont authored 4 years ago and vlorentz committed 4 years ago
```
This will allow us to easily plug new scheduling policies in that
function.
```
  v0.9.0
  
  b641ac83
- journal_client: Read visit_stats entries by batch out of the loop · 9fb0dd6c
  Antoine R. Dumont authored 4 years ago
```
Related to T2967
```
  9fb0dd6c
- scheduler: Make origin_visit_stats_get read multiple entries · d464b4cc
  Antoine R. Dumont authored 4 years ago
```
Related to T2967
```
  d464b4cc
Jan 20, 2021

Simplify journal client tests · ffe2aed2

David Douard authored 4 years ago

- sort visits by default (there is a test dedicated to dealing with unsorted
  messages from the journal),
- remove "intermediate checks" in several tests: these do not help much
  but make the code more difficult to read and maintain,
- rename VISIT_STATUSES1 as VISIT_STATUSES_1 to make it less prone to
  being confused with VISIT_STATUSES (which also exists).

ffe2aed2