  1. Feb 08, 2022
    • Prefix task types used in tests with 'test-' · c46ffadf
      David Douard authored
      so that tests do not depend on a lucky guess about what the scheduler db
      state actually is. The DB initialization scripts do create task types for
      git, hg and svn (used in tests), but the tests then depend on the db
      fixture having already been called once before, so that tables
      (especially task and task_type) are truncated.
      
      For example, running a single test involved in task-type creation was
      failing (e.g. 'pytest swh -k test_create_task_type_idempotence').
      
      This commit makes the tests avoid colliding with any existing task or
      task type that the initialization scripts may create.
      
      Note that this also means that there is currently no test covering the
      scheduler db state right after initialization, which is not great and
      should be addressed.
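      A minimal sketch of what such a test-scoped task type can look like; the
      `swh_scheduler` fixture name and the exact task-type fields are
      approximations for illustration, not the code added by this commit.

```python
# Illustrative only: a 'test-'-prefixed task type cannot collide with the
# git/hg/svn task types that the db initialization scripts may have created.
TEST_TASK_TYPE = {
    "type": "test-git",  # prefixed, unlike the real 'load-git' task type
    "description": "Test loading of git repositories",
    "backend_name": "swh.loader.git.tasks.UpdateGitRepository",
    "default_interval": "1 day",
    "min_interval": "1 day",
    "max_interval": "1 day",
    "backoff_factor": 1,
}


def test_create_task_type_idempotence(swh_scheduler):
    swh_scheduler.create_task_type(TEST_TASK_TYPE)
    # creating the same task type a second time must be a no-op
    swh_scheduler.create_task_type(TEST_TASK_TYPE)
    assert swh_scheduler.get_task_type("test-git") is not None
```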
  2. Feb 07, 2022
  3. Jan 21, 2022
  4. Jan 12, 2022
    • sql: Clean up task/task_run data model · b5477ea2
      Antoine R. Dumont authored
      This archives the current task and task_run tables, creating new ones
      that keep only the necessary tasks (the last 2 months' oneshot tasks plus
      some recurring tasks: lister, indexer, ...). Those filtered tasks are the
      ones scheduled by the runner and runner-priority services.
      
      This archiving will allow those services to be faster (the corresponding
      queries return results faster without the archived data).
      
      Related to T3837
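      The archiving pattern described above could look roughly like the sketch
      below; the `next_run` recency filter, the connection string and the lack
      of index/constraint handling are assumptions for illustration, not the
      migration actually shipped.

```python
# Rough sketch of the rename-then-filter archiving approach (assumed details).
import psycopg2

ARCHIVE_SQL = """
ALTER TABLE task RENAME TO archive_task;
ALTER TABLE task_run RENAME TO archive_task_run;

-- Recreate 'task' with only the rows the runner services still need:
-- recurring (lister, indexer, ...) tasks and recent oneshot tasks.
CREATE TABLE task AS
  SELECT * FROM archive_task
  WHERE policy = 'recurring'
     OR next_run > now() - interval '2 months';  -- recency filter is assumed

CREATE TABLE task_run AS
  SELECT tr.* FROM archive_task_run tr
  JOIN task t ON t.id = tr.task;

-- (a real migration would also recreate indexes, sequences and constraints)
"""

with psycopg2.connect("service=swh-scheduler") as db:  # hypothetical DSN
    with db.cursor() as cur:
        cur.execute(ARCHIVE_SQL)
```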
  5. Jan 05, 2022
  6. Dec 16, 2021
  7. Dec 09, 2021
    • Use a temporary table to update scheduler metrics · e051b320
      Nicolas Dandrimont authored
      When using ``insert into <...> select <...>``, PostgreSQL disables
      parallel querying. Under some circumstances (in our large production
      database), this makes updating the scheduler metrics take a (very) long
      time.
      
      Parallel querying is allowed for ``create table <...> as select <...>``,
      and doing so restores the small(er) runtimes for this query (15 minutes
      instead of multiple hours). To use that, we have to turn the function
      into plpgsql instead of plain sql.
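      In plpgsql, the pattern boils down to something like the sketch below;
      the function and helper names are placeholders, not the actual ones
      touched by this commit.

```python
# Sketch: build the metrics with CREATE TABLE ... AS SELECT (parallelizable),
# then copy them into the metrics table. Names are illustrative.
UPDATE_METRICS_SKETCH = """
CREATE OR REPLACE FUNCTION update_metrics_sketch() RETURNS void
  LANGUAGE plpgsql
AS $$
BEGIN
  -- CREATE TABLE AS keeps parallel query plans available, unlike
  -- INSERT INTO ... SELECT ...
  CREATE TEMPORARY TABLE new_metrics ON COMMIT DROP AS
    SELECT * FROM compute_scheduler_metrics();  -- hypothetical helper

  DELETE FROM scheduler_metrics;
  INSERT INTO scheduler_metrics SELECT * FROM new_metrics;
END;
$$;
"""
# executed e.g. with: cur.execute(UPDATE_METRICS_SKETCH)
```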
  8. Dec 08, 2021
  9. Dec 07, 2021
    • Make next_visit_queue_position an integer · 5de8ba42
      Nicolas Dandrimont authored
      In visit types with a small number of origins that have no last_update
      field, we would end up overflowing Python datetimes (which only go up to
      31 December 9999) pretty quickly. Making the queue position a 64-bit
      integer should give us some more leeway.
      
      The queue position now defaults to zero instead of an arbitrary point in
      time. Queue offsets are still commensurate with seconds, but that's
      mostly to give them some space to be splayed by the fudge factors.
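      The overflow is easy to reproduce; the backoff sizes below are made up,
      but they show why a datetime-based position hits the year-9999 ceiling
      while a 64-bit integer does not.

```python
# Made-up backoff steps, just to illustrate the overflow described above.
import datetime

position = datetime.datetime(2021, 12, 7, tzinfo=datetime.timezone.utc)
try:
    for _ in range(20):
        position += datetime.timedelta(days=365 * 500)  # large backoff step
except OverflowError as exc:
    print("datetime-based queue position overflows:", exc)

int_position = 0
for _ in range(20):
    int_position += 500 * 365 * 86400  # same offsets, counted in seconds
print("integer queue position is fine:", int_position)  # far below 2**63 - 1
```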
  10. Dec 06, 2021
  11. Nov 22, 2021
  12. Oct 29, 2021
  13. Oct 28, 2021
    • Add a new cli endpoint to schedule recurrent visits in Celery · 50d7fd7f
      Nicolas Dandrimont authored and Antoine R. Dumont committed
      
      For each known visit type, we run a loop which:
       - monitors the size of the relevant celery queue
       - schedules more visits of the relevant type once the number of
       available slots goes over a given threshold (currently set to 5% of the
       max queue size).
      
      The scheduling of visits combines multiple scheduling policies, for now
      using static ratios set in the `POLICY_RATIOS` dict. We emit a warning
      if the ratio of origins fetched for each policy is skewed with respect
      to the original request (allowing, for now, manual adjustment of the
      ratios).
      
      The CLI endpoint spawns one thread for each visit type, which all handle
      connections to RabbitMQ and the scheduler backend separately. For now,
      we handle exceptions in the visit scheduling threads by (stupidly)
      respawning the relevant thread directly. We should probably improve this
      to give up after a specific number of tries.
      
      Co-authored-by: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
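      The loop run by each visit-type thread can be sketched as follows; the
      `POLICY_RATIOS` values, the injected queue/Celery helpers and the exact
      `grab_next_visits` signature are assumptions for illustration.

```python
import logging
import time

logger = logging.getLogger(__name__)

# Static policy ratios (illustrative values, not the ones from the commit)
POLICY_RATIOS = {
    "already_visited_order_by_lag": 0.5,
    "never_visited_oldest_update_first": 0.5,
}
MIN_SLOTS_RATIO = 0.05  # act once free slots exceed 5% of the max queue size


def schedule_visits_loop(scheduler, visit_type, max_queue_size,
                         get_free_slots, send_visit_task, period=10):
    """One loop per visit type, run in its own thread.

    `get_free_slots` and `send_visit_task` stand in for the RabbitMQ queue
    inspection and the Celery task submission, respectively.
    """
    while True:
        free_slots = get_free_slots(visit_type)
        if free_slots >= MIN_SLOTS_RATIO * max_queue_size:
            for policy, ratio in POLICY_RATIOS.items():
                wanted = int(free_slots * ratio)
                origins = scheduler.grab_next_visits(
                    visit_type=visit_type, count=wanted, policy=policy)
                if len(origins) < wanted:
                    logger.warning("policy %s only yielded %d/%d origins",
                                   policy, len(origins), wanted)
                for origin in origins:
                    send_visit_task(visit_type, origin)
        time.sleep(period)
```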
  14. Oct 27, 2021
    • grab_next_visits: avoid time interval calculations in PostgreSQL · 0c7ef27b
      Nicolas Dandrimont authored
      When the database is in a non-UTC timezone with DST, and a `timestamptz
      - interval` calculation crosses a DST change, the result of the
      calculation can be one hour off from the expected value:
      
      PostgreSQL shifts the timestamp by the number of days in the interval
      while keeping the same (local) time of day, which ends up an hour off
      once the DST change is crossed.
      
      Doing the datetime +- timedelta calculations in Python instead of
      PostgreSQL avoids this caveat altogether.
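      In other words, the cutoff timestamp is computed with timezone-aware
      Python datetimes and passed to the query as a bound parameter; the query
      below is hypothetical, only the pattern matters.

```python
import datetime


def count_visits_since(db, days: int) -> int:
    # datetime arithmetic done in Python: UTC-aware, no DST surprises
    cutoff = datetime.datetime.now(tz=datetime.timezone.utc) - datetime.timedelta(days=days)
    with db.cursor() as cur:
        cur.execute(
            # hypothetical query: the point is binding the precomputed cutoff
            # instead of writing `now() - interval '...'` in SQL
            "select count(*) from origin_visit_stats where last_visit >= %s",
            (cutoff,),
        )
        return cur.fetchone()[0]
```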
  15. Oct 22, 2021
  16. Oct 20, 2021
  17. Oct 18, 2021
  18. Oct 15, 2021
    • Return 0 slot if no more slots available in the queues · 3aed7bf1
      Antoine R. Dumont authored
      This scenario happens with the oneshot loader, for example. This loader
      deals with more than one type of origin to ingest in the same queue, so
      the computation in that function could return a negative value, which
      SQL ultimately refuses to execute [1].
      
      This commit fixes that behavior, and the docstring now makes explicit
      that the function must not return negative values.
      
      [1]
      ```
      ...
      psycopg2.errors.InvalidRowCountInLimitClause: LIMIT must not be negative
      ```
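      The fix amounts to clamping the computed slot count at zero before it
      reaches the SQL LIMIT clause; function and argument names below are
      illustrative.

```python
def num_visits_to_schedule(max_queue_size: int, queue_length: int) -> int:
    """How many visits can still be scheduled for this task type; never negative."""
    return max(0, max_queue_size - queue_length)


assert num_visits_to_schedule(100, 40) == 60
assert num_visits_to_schedule(100, 130) == 0  # previously -30, breaking LIMIT
```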
  19. Sep 02, 2021
  20. Aug 27, 2021
  21. Aug 26, 2021
  22. Aug 18, 2021
  23. Aug 03, 2021
  24. Jul 30, 2021
  25. Jul 23, 2021
    • Only record last_visited and last_successful in origin_visit_stats · 87e66faa
      Nicolas Dandrimont authored
      After using this schema for a while, all queries can be implemented in
      terms of these two timestamps, instead of the four original
      last_eventful, last_uneventful, last_failed and last_notfound
      timestamps.
      
      This ends up simplifying the logic within the journal client, as well as
      that of the grab_next_visits query builder.
      
      To make this change work, we also stop considering out of order messages
      altogether in journal_client. This welcome simplification is an accuracy
      tradeoff that is explained in the updated documentation of the journal
      client:
      
      .. [1] Ignoring out of order messages makes the initialization of the
            origin_visit_status table (from a full journal) less deterministic: only the
            `last_visit`, `last_visit_state` and `last_successful` fields are guaranteed
            to be exact, the `next_position_offset` field is a best effort estimate
            (which should converge once the client has run for a while on in-order
            messages).
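      A minimal sketch of the two-timestamp bookkeeping, with assumed field
      names and simplified status handling; the real logic lives in the
      scheduler's journal client.

```python
def update_visit_stats(stats: dict, message: dict) -> None:
    """Fold one origin_visit_status message into the per-origin stats row."""
    # out-of-order messages are no longer special-cased
    stats["last_visit"] = message["date"]
    stats["last_visit_status"] = message["status"]
    if message["status"] in ("full", "partial"):  # successful visit
        stats["last_successful"] = message["date"]
```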
    • test_journal_client: Unify test assertion like the rest · 3ca0d659
      Antoine R. Dumont authored
      Related to D5917
    • test: Refactor assert_visit_stats_ok to ignore_fields · 8cf2238e
      Antoine R. Dumont authored
      This simplifies and unifies properly the utility test function to compare visit stats.
  26. Jul 22, 2021
  27. Jul 06, 2021
    • journal_client: Compute next position for origin visit · 8c4ae9f1
      Antoine R. Dumont authored
      For origins without any last_update information [1], the journal client
      is now also in charge of moving their next position in the queue for
      rescheduling. Depending on the visit status, the next position offset
      and next_visit_queue_position are updated after each visit completes:
      
      - if the visit has failed, increase the next visit target by the minimal visit
        interval (to take into account transient loading issues)
      - if the visit is successful, and records some changes, decrease the visit interval
        index by 2 (visit the origin *way* more often).
      - if the visit is successful, and records no changes, increase the visit interval index
        by 1 (visit the origin less often).
      
      We then set the next visit target to its current value + the new visit interval
      multiplied by a random fudge factor (picked in the -/+ 10% range).
      
      The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins
      e.g. when a number of origins from a single hoster are processed at once.
      
      Note that the computations happen for all origins, for simplicity and
      maintainability, but the result will only be used by a new, upcoming
      scheduling policy.
      
      [1] i.e. origins whose lister could not provide it for some reason.
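      The rules above can be sketched as follows; the interval ladder, the
      seconds-based position and the status handling are simplifications for
      illustration, not the code added by this commit.

```python
import random
from datetime import timedelta

# Illustrative ladder of visit intervals (the real values are configurable).
VISIT_INTERVALS = [timedelta(days=d) for d in (1, 2, 4, 8, 16, 32, 64)]


def update_queue_position(position: float, interval_index: int,
                          status: str, eventful: bool):
    """Return (next_visit_queue_position, next_position_offset) after a visit."""
    if status not in ("full", "partial"):
        # failed visit: push back by the minimal interval (transient issues)
        return position + VISIT_INTERVALS[0].total_seconds(), interval_index

    if eventful:
        interval_index = max(0, interval_index - 2)   # visit *way* more often
    else:
        interval_index = min(len(VISIT_INTERVALS) - 1, interval_index + 1)

    fudge = random.uniform(0.9, 1.1)  # spread visits out by +/-10%
    interval = VISIT_INTERVALS[interval_index].total_seconds()
    return position + interval * fudge, interval_index
```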
  28. Jul 01, 2021