  1. Oct 28, 2021
    • Add a new cli endpoint to schedule recurrent visits in Celery · 50d7fd7f
      Nicolas Dandrimont authored and Antoine R. Dumont committed
      
      For each known visit type, we run a loop which:
       - monitors the size of the relevant Celery queue
       - schedules more visits of the relevant type once the number of
         available slots goes over a given threshold (currently set to 5% of
         the max queue size).
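
The threshold check described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `visits_to_schedule` and its parameters are hypothetical, and only the 5% ratio comes from the commit message.

```python
def visits_to_schedule(
    queue_length: int, max_queue_length: int, threshold_ratio: float = 0.05
) -> int:
    """Return how many visits to send for one visit type (0 means wait)."""
    # Slots still free in the queue for this visit type.
    available_slots = max(max_queue_length - queue_length, 0)
    # Only schedule once the free slots go over the threshold
    # (assumed here to be a strict comparison).
    if available_slots <= max_queue_length * threshold_ratio:
        return 0
    return available_slots
```

With a maximum queue size of 1000, nothing is scheduled while 50 slots or fewer are free; once 100 slots are free, 100 visits are requested.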
      
      The scheduling of visits combines multiple scheduling policies, for now
      using static ratios set in the `POLICY_RATIOS` dict. We emit a warning
      if the ratio of origins fetched for each policy is skewed with respect
      to the original request (allowing, for now, manual adjustment of the
      ratios).
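
As a rough sketch of how static ratios could be applied and checked for skew: the ratio values, policy names, tolerance and both function names below are assumptions made for illustration, not the contents of the actual `POLICY_RATIOS` dict.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)

# Hypothetical ratios; the real values live in POLICY_RATIOS.
EXAMPLE_RATIOS: Dict[str, float] = {"policy_a": 0.5, "policy_b": 0.3, "policy_c": 0.2}


def split_by_ratio(total: int, ratios: Dict[str, float]) -> Dict[str, int]:
    """Split a number of visits between policies according to static ratios."""
    counts = {policy: int(total * ratio) for policy, ratio in ratios.items()}
    # Hand any rounding remainder to the policy with the largest ratio.
    largest = max(ratios, key=lambda policy: ratios[policy])
    counts[largest] += total - sum(counts.values())
    return counts


def skewed_policies(
    requested: Dict[str, int], fetched: Dict[str, int], tolerance: float = 0.1
) -> List[str]:
    """Warn about, and return, policies whose fetched share is off the request."""
    total_requested = sum(requested.values())
    total_fetched = sum(fetched.values())
    if not total_requested or not total_fetched:
        return []
    skewed = []
    for policy, count in requested.items():
        expected = count / total_requested
        actual = fetched.get(policy, 0) / total_fetched
        if abs(actual - expected) > tolerance:
            logger.warning(
                "Policy %s: expected share %.2f, got %.2f", policy, expected, actual
            )
            skewed.append(policy)
    return skewed
```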
      
      The CLI endpoint spawns one thread for each visit type, which all handle
      connections to RabbitMQ and the scheduler backend separately. For now,
      we handle exceptions in the visit scheduling threads by (stupidly)
      respawning the relevant thread directly. We should probably improve this
      to give up after a specific number of tries.
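
The improvement suggested above (giving up after a number of tries) could look roughly like this. It is a sketch under assumptions: `run_with_respawn` and `max_tries` are hypothetical names, and the thread is joined right away for simplicity, whereas the CLI runs one long-lived thread per visit type.

```python
import threading
from typing import Callable, List


def run_with_respawn(target: Callable[[], None], max_tries: int = 3) -> int:
    """Run target in a thread, respawning it on failure; return the tries used."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_errors: List[BaseException] = []
    for attempt in range(1, max_tries + 1):
        errors: List[BaseException] = []

        def wrapper() -> None:
            # Capture the exception so the supervising code can see it.
            try:
                target()
            except Exception as exc:
                errors.append(exc)

        thread = threading.Thread(target=wrapper)
        thread.start()
        thread.join()
        if not errors:
            return attempt
        last_errors = errors
    # All tries failed: stop respawning and surface the last error.
    raise RuntimeError(f"giving up after {max_tries} tries") from last_errors[0]
```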
      
      Co-authored-by: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
      v0.19.0
  2. Oct 27, 2021
    • grab_next_visits: avoid time interval calculations in PostgreSQL · 0c7ef27b
      Nicolas Dandrimont authored
      When the database is in a non-UTC timezone with DST, and a `timestamptz
      - interval` calculation crosses a DST change, the result of the
      calculation can be one hour off from the expected value:
      
      PostgreSQL shifts the timestamp by the number of days in the
      interval while keeping the same (local) time, so the result is offset
      by an hour when a DST change falls within the interval.
      
      Doing the datetime +- timedelta calculations in Python instead of
      PostgreSQL avoids this caveat altogether.
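
A minimal sketch of the Python-side calculation, assuming timezone-aware timestamps normalized to UTC (the helper name `visit_cutoff` is hypothetical). The resulting value would then be passed to the query as a bound parameter instead of computing `timestamptz - interval` in SQL.

```python
from datetime import datetime, timedelta, timezone


def visit_cutoff(now: datetime, cooldown: timedelta) -> datetime:
    """Compute the cutoff timestamp in Python, in UTC."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # Normalizing to UTC first means the subtraction is an exact duration,
    # regardless of any DST change in the database's local timezone.
    return now.astimezone(timezone.utc) - cooldown


# 7 days before 2021-11-01 12:00 UTC; many European zones left DST on 2021-10-31.
cutoff = visit_cutoff(
    datetime(2021, 11, 1, 12, tzinfo=timezone.utc), timedelta(days=7)
)
```

Here `cutoff` is exactly 7 × 24 hours before the reference time, which is the behaviour the commit wants to guarantee.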
  3. Oct 22, 2021
  4. Oct 20, 2021
  5. Oct 18, 2021
  6. Oct 15, 2021
  7. Sep 02, 2021
  8. Aug 27, 2021
  9. Aug 26, 2021
  10. Aug 18, 2021
  11. Aug 06, 2021
  12. Aug 03, 2021
  13. Jul 30, 2021
  14. Jul 23, 2021
    • Only record last_visited and last_successful in origin_visit_stats · 87e66faa
      Nicolas Dandrimont authored
      After using this schema for a while, all queries can be implemented in
      terms of these two timestamps, instead of the four original
      last_eventful, last_uneventful, last_failed and last_notfound
      timestamps.
      
      This ends up simplifying the logic within the journal client, as well as
      that of the grab_next_visits query builder.
      
      To make this change work, we also stop considering out of order messages
      altogether in journal_client. This welcome simplification is an accuracy
      tradeoff that is explained in the updated documentation of the journal
      client:
      
      .. [1] Ignoring out-of-order messages makes the initialization of the
            origin_visit_status table (from a full journal) less deterministic: only the
            `last_visit`, `last_visit_state` and `last_successful` fields are guaranteed
            to be exact; the `next_position_offset` field is a best-effort estimate
            (which should converge once the client has run for a while on in-order
            messages).
    • test_journal_client: Unify test assertion like the rest · 3ca0d659
      Antoine R. Dumont authored
      Related to D5917
    • test: Refactor assert_visit_stats_ok to ignore_fields · 8cf2238e
      Antoine R. Dumont authored
      This simplifies and unifies the test utility function used to compare visit stats.
  15. Jul 22, 2021
  16. Jul 06, 2021
    • journal_client: Compute next position for origin visit · 8c4ae9f1
      Antoine R. Dumont authored
      For origins without any last_update information [1], the journal client is now also in
      charge of moving their next position in the queue for rescheduling. Depending on their
      status, the next position offset and next_visit_queue_position are updated after each
      visit completes:
      
      - if the visit has failed, increase the next visit target by the minimal visit
        interval (to take into account transient loading issues)
      - if the visit is successful, and records some changes, decrease the visit interval
        index by 2 (visit the origin *way* more often).
      - if the visit is successful, and records no changes, increase the visit interval index
        by 1 (visit the origin less often).
      
      We then set the next visit target to its current value + the new visit interval
      multiplied by a random fudge factor (picked in the -/+ 10% range).
      
      The fudge factor allows the visits to spread out, avoiding "bursts" of loaded origins
      e.g. when a number of origins from a single hoster are processed at once.
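
The rules above can be sketched as follows. The interval table, the status labels, the clamping of the index and the function name are all assumptions made for illustration; applying the fudge factor to failed visits as well is also an assumption, since the message does not spell it out.

```python
import random
from datetime import datetime, timedelta
from typing import Tuple

# Hypothetical table of visit intervals, indexed by the visit interval index.
VISIT_INTERVALS = [timedelta(days=d) for d in (1, 2, 4, 7, 14, 30)]


def compute_next_visit(
    current_target: datetime, interval_index: int, status: str, rng: random.Random
) -> Tuple[int, datetime]:
    """Return the new interval index and the new next visit target."""
    if status == "failed":
        # Failed visit: keep the index, retry after the minimal interval.
        new_index = interval_index
        interval = VISIT_INTERVALS[0]
    else:
        # Changes recorded: visit way more often; no changes: less often.
        step = -2 if status == "eventful" else 1
        new_index = min(max(interval_index + step, 0), len(VISIT_INTERVALS) - 1)
        interval = VISIT_INTERVALS[new_index]
    # Random fudge factor in the -/+ 10% range, to spread visits out.
    fudge = 1 + rng.uniform(-0.1, 0.1)
    return new_index, current_target + interval * fudge
```

Passing the random generator in as an argument keeps the sketch reproducible in tests; a real client would more likely use the module-level `random` functions.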
      
      Note that the computations happen for all origins, for simplicity and code maintenance,
      but their results will only be used by a new, soon-to-be-added scheduling policy.
      
      [1] The lister cannot provide it for some reason.
  17. Jul 01, 2021
  18. Jun 29, 2021
  19. Jun 23, 2021