  1. Jun 03, 2022
  2. May 31, 2022
  3. May 12, 2022
  4. May 09, 2022
  5. May 06, 2022
  6. Apr 28, 2022
  7. Apr 27, 2022
  8. Apr 26, 2022
  9. Apr 21, 2022
  10. Apr 20, 2022
    • Make scheduling policy used in schedule_recurrent configurable · a76bb02f
      David Douard authored
      Add support for a configuration option "scheduling_policy" in the config
      file loaded by the 'swh scheduler schedule-recurrent' command. This
      config entry specifies the scheduling policies used by the
      schedule-recurrent tool, instead of having them hardcoded in the source
      code.
      
      A visit type policy config entry should have at least a 'weight' value
      for each policy.
      
      Default values are unchanged.
      
      E.g.:
      
        scheduling_policy:
          git:
            - policy: already_visited_order_by_lag
              weight: 55
              tablesample: 0.5
            - policy: never_visited_oldest_update_first
              weight: 45
              tablesample: 0.5
      
      Note: not every visit type needs a configuration entry, but if
            a visit type policy is configured, its config entry should be complete
            (in other words, the configuration is merged with the default
            values at the first config level only).
      a76bb02f
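
The first-level merge described in the note above can be sketched as follows. This is a minimal illustration, not the code from the commit: `DEFAULT_POLICY`, its contents, and `merge_policy_config` are hypothetical names and values, and only the `policy` and `weight` keys come from the commit message.

```python
from copy import deepcopy

# Hypothetical defaults; the real default values live in the scheduler source.
DEFAULT_POLICY = {
    "default": [
        {"policy": "already_visited_order_by_lag", "weight": 50},
        {"policy": "never_visited_oldest_update_first", "weight": 50},
    ],
}


def merge_policy_config(defaults, configured):
    """Merge at the first level only.

    A configured visit type replaces the default entry for that visit type
    wholesale; individual policies are never merged key by key.
    """
    merged = deepcopy(defaults)
    for visit_type, policies in configured.items():
        for entry in policies:
            # Each policy entry needs at least a name and a weight.
            if "policy" not in entry or "weight" not in entry:
                raise ValueError(f"incomplete policy entry for {visit_type!r}: {entry!r}")
        merged[visit_type] = deepcopy(policies)
    return merged


git_config = {
    "git": [
        {"policy": "already_visited_order_by_lag", "weight": 55, "tablesample": 0.5},
        {"policy": "never_visited_oldest_update_first", "weight": 45, "tablesample": 0.5},
    ]
}
merged = merge_policy_config(DEFAULT_POLICY, git_config)
```

Because the merge stops at the visit-type level, a partial entry for `git` would not inherit anything from the defaults, which is why the entry has to be complete.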
  11. Apr 08, 2022
  12. Apr 06, 2022
  13. Mar 22, 2022
    • pytest: Exclude build directory for tests discovery · 78f5579b
      Antoine Lambert authored
      Setuptools copies test modules into subdirectories of the
      build directory, which makes pytest fail with
      ImportPathMismatchError exceptions when invoked from the root
      directory of the module.
      
      So ignore the build folder during test discovery.
      78f5579b
  14. Feb 24, 2022
    • Adapt to swh.core 2.0.0 · 5cc62be1
      David Douard authored
      - add the `get_datastore` function in `swh.scheduler`
      - add the `get_current_version` method in `SchedulerBackend`
      - remove dbversion management from sql init script
      - update tests accordingly
      v1.0.0
      5cc62be1
  15. Feb 10, 2022
  16. Feb 09, 2022
  17. Feb 08, 2022
    • Prefix task types used in tests with 'test-' · c46ffadf
      David Douard authored
      so that tests do not depend on a lucky guess about what the scheduler db
      state actually is. DB initialization scripts do create task types for
      git, hg and svn (used in tests), but these tests depend on the db fixture
      having already been called once before, so that tables are
      truncated (especially the task and task_type ones).
      
      For example, running a single test involved in task-type creation was
      failing (e.g. 'pytest swh -k test_create_task_type_idempotence').
      
      This commit makes tests not collide with any existing task or task
      type that initialization scripts may create.
      
      Note that this also means that there is actually no test dealing with
      the scheduler db state after initialization, which is not great and
      should be addressed.
      c46ffadf
  18. Feb 07, 2022
  19. Jan 21, 2022
  20. Jan 12, 2022
    • sql: Clean up task/task_run data model · b5477ea2
      Antoine R. Dumont authored
      This archives the current task and task_run tables and creates new ones containing only
      the necessary tasks (the last 2 months' oneshot tasks plus some recurring tasks: lister,
      indexer, ...). Those filtered tasks are the ones scheduled by the runner and runner
      priority services.
      
      This archiving will make those services faster (the corresponding queries will return
      results faster without the archived data).
      
      Related to T3837
      Verified
      b5477ea2
  21. Jan 05, 2022
  22. Dec 16, 2021
  23. Dec 09, 2021
    • Use a temporary table to update scheduler metrics · e051b320
      Nicolas Dandrimont authored
      When using ``insert into <...> select <...>``, PostgreSQL disables
      parallel querying. Under some circumstances (in our large production
      database), this makes updating the scheduler metrics take a (very) long
      time.
      
      Parallel querying is allowed for ``create table <...> as select <...>``,
      and doing so restores the small(er) runtimes for this query (15 minutes
      instead of multiple hours). To use that, we have to turn the function
      into plpgsql instead of plain sql.
      e051b320
  24. Dec 08, 2021
  25. Dec 07, 2021
    • Make next_visit_queue_position an integer · 5de8ba42
      Nicolas Dandrimont authored
      In visit types with a small number of origins having no last_update
      field, we would end up overflowing Python datetimes (which only go up to
      31 December 9999) pretty quickly. Making the queue position a 64-bit
      integer should give us some more leeway.
      
      The queue position now defaults to zero instead of an arbitrary point in
      time. Queue offsets are still commensurate with seconds, but that's
      mostly to give them some space to be splayed by the fudge factors.
      v0.22.0
      5de8ba42
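
The overflow mentioned in this commit is easy to reproduce in plain Python. The sketch below is only an illustration: the two `advance_*` helpers are hypothetical and are not the scheduler's functions. It contrasts a datetime-based queue position with an integer one measured in seconds.

```python
from datetime import datetime, timedelta, timezone


def advance_datetime_position(position, offset_seconds):
    """Queue position stored as a datetime: bounded by datetime.max (year 9999)."""
    return position + timedelta(seconds=offset_seconds)


def advance_integer_position(position, offset_seconds):
    """Queue position stored as an integer number of seconds."""
    return position + offset_seconds


# A position that has already drifted to the last representable day.
far_future = datetime(9999, 12, 31, tzinfo=timezone.utc)
try:
    advance_datetime_position(far_future, 86400)
    overflowed = False
except OverflowError:
    # Python refuses to build a date past 31 December 9999.
    overflowed = True

# The same one-day offset on an integer position starting from zero.
next_position = advance_integer_position(0, 86400)
```

Python integers are unbounded, so in the integer case the practical limit is the 64-bit database column rather than the Python type.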
  26. Dec 06, 2021
  27. Nov 22, 2021
    • Fix CardinalityViolation in grab_next_visits on duplicate origins · 2abb3936
      vlorentz authored
      grab_next_visits grabs from `listed_origins`, whose primary key is
      `(lister_id, url, visit_type)` and uses it to upsert in origin_visit_stats,
      whose primary key is `(url, visit_type)`.
      This causes the error `ON CONFLICT DO UPDATE command cannot affect row a
      second time` when the same (origin, type) pair is grabbed twice.
      
      This commit deduplicates the (origin, type) pairs before upserting.
      v0.20.0
      2abb3936
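
The deduplication step can be pictured with a small Python sketch. This is an assumption-laden illustration (the actual fix may well be done in SQL): `deduplicate_origins` and the sample rows are hypothetical, and rows are modelled as `(lister_id, url, visit_type)` tuples to mirror the `listed_origins` primary key.

```python
def deduplicate_origins(rows):
    """Collapse (lister_id, url, visit_type) rows to unique (url, visit_type) pairs.

    The first occurrence of each pair wins and input order is preserved, so the
    result contains each upsert key at most once.
    """
    seen = set()
    unique = []
    for _lister_id, url, visit_type in rows:
        key = (url, visit_type)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical rows: the same origin listed by two different listers.
rows = [
    (1, "https://example.org/repo.git", "git"),
    (2, "https://example.org/repo.git", "git"),
    (2, "https://example.org/other", "svn"),
]
pairs = deduplicate_origins(rows)
```

With each `(url, visit_type)` pair present only once, a single upsert statement never touches the same `origin_visit_stats` row twice.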
  28. Oct 29, 2021
  29. Oct 28, 2021
    • Add a new cli endpoint to schedule recurrent visits in Celery · 50d7fd7f
      Nicolas Dandrimont authored and Antoine R. Dumont committed
      
      For each known visit type, we run a loop which:
       - monitors the size of the relevant celery queue
       - schedules more visits of the relevant type once the number of
         available slots goes over a given threshold (currently set to 5% of the
         max queue size).
      
      The scheduling of visits combines multiple scheduling policies, for now
      using static ratios set in the `POLICY_RATIOS` dict. We emit a warning
      if the ratio of origins fetched for each policy is skewed with respect
      to the original request (allowing, for now, manual adjustment of the
      ratios).
      
      The CLI endpoint spawns one thread for each visit type, which all handle
      connections to RabbitMQ and the scheduler backend separately. For now,
      we handle exceptions in the visit scheduling threads by (stupidly)
      respawning the relevant thread directly. We should probably improve this
      to give up after a specific number of tries.
      
      Co-authored-by: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
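
The threshold check and the ratio split described in this commit can be sketched as below. This is a rough illustration under stated assumptions: `slots_to_fill` and `split_by_weight` are hypothetical helpers, the weights are sample values, and only the 5% threshold comes from the commit message.

```python
def slots_to_fill(max_queue_size, current_queue_size, threshold_ratio=0.05):
    """Return how many visits to schedule, or 0 while below the threshold.

    Nothing is scheduled until the number of free slots exceeds
    threshold_ratio * max_queue_size.
    """
    available = max_queue_size - current_queue_size
    if available <= max_queue_size * threshold_ratio:
        return 0
    return available


def split_by_weight(total, weights):
    """Split `total` visits between policies proportionally to integer weights.

    Rounding leftovers go to the highest-weighted policy, so the counts
    always add up to `total`.
    """
    weight_sum = sum(weights.values())
    counts = {policy: total * weight // weight_sum for policy, weight in weights.items()}
    heaviest = max(weights, key=weights.get)
    counts[heaviest] += total - sum(counts.values())
    return counts


# Hypothetical static weights for one visit type.
weights = {
    "already_visited_order_by_lag": 55,
    "never_visited_oldest_update_first": 45,
}
to_schedule = slots_to_fill(max_queue_size=1000, current_queue_size=900)
per_policy = split_by_weight(to_schedule, weights)
```

Comparing the counts actually fetched for each policy against `per_policy` is one way to detect the skew that the commit warns about.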
  30. Oct 27, 2021
    • grab_next_visits: avoid time interval calculations in PostgreSQL · 0c7ef27b
      Nicolas Dandrimont authored
      When the database is in a non-UTC timezone with DST, and a `timestamptz
      - interval` calculation crosses a DST change, the result of the
      calculation can be one hour off from the expected value:
      
      PostgreSQL will shift the timestamp by the number of days in the
      interval and keep the same (local) time, which will be offset by
      an hour because of the DST change.
      
      Doing the datetime +- timedelta calculations in Python instead of
      PostgreSQL avoids this caveat altogether.
      0c7ef27b
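
Doing the arithmetic in Python might look like the sketch below. It is a minimal illustration, not the code from the commit: `visit_cutoff` is a hypothetical helper. It normalises to UTC before subtracting, so the result is always an exact number of elapsed seconds, and the value can then be passed to the query as a plain parameter.

```python
from datetime import datetime, timedelta, timezone


def visit_cutoff(now, days):
    """Return the instant exactly `days` * 24 hours before `now`, in UTC.

    Working in UTC means no DST transition can stretch or shrink the
    interval; naive datetimes are rejected because their offset is unknown.
    """
    if now.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return now.astimezone(timezone.utc) - timedelta(days=days)


# 31 October 2021 is a day on which many European timezones leave DST.
now = datetime(2021, 10, 31, 12, 0, tzinfo=timezone.utc)
cutoff = visit_cutoff(now, 1)
elapsed_seconds = (now - cutoff).total_seconds()
```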
  31. Oct 22, 2021
  32. Oct 20, 2021