Skip to content
Snippets Groups Projects

Hackage: Implement incremental mode

Use http api lastUpload argument in search query to retrieve new or updated origins since last run

Related to #4597


Migrated from D8663 (view on Phabricator)

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Build is green

    Patch application report for D8663 (id=31286)

    Rebasing onto 05cd1de1...

    First, rewinding head to replay your work on top of it...
    Applying: Hackage: Implement incremental mode
    Changes applied before test
    commit 2390eefdd68e0adb58e758bc691b8ce95464a986
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed Oct 12 10:08:38 2022 +0200
    
        Hackage: Implement incremental mode
        
        Use http api lastUpload argument in search query to retrieve new or
        updated origins since last run
        
        Related #4597

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/787/ for more details.

  •         for entry in page:
                last_update = iso8601.parse_date(entry["lastUpload"])
                if not self.earliest_update or last_update > self.earliest_update:
                    self.earliest_update = last_update

    This makes self.earliest_update the latest lastUpload, not the earliest.

    Either way, I don't think this is the right way to do it, because:

    1. if you flip the operator, then index_last_update will barely advance in time across listings, so it won't really be incremental
    2. if you change the name and definition to match the computation, then we will miss packages uploaded while the lister is running, because it won't see some of them (listed early in its run, but updated later in its run), but may see some more recent ones (listed late in its run, and uploaded earlier)

    Assuming Hackage has no lag between the lastUpload timestamp and the moment packages are available, you should use the time the lister started.

  • Merge request was returned for changes

  • Author Contributor

    ! In !325 (closed), @vlorentz wrote:

            for entry in page:
                last_update = iso8601.parse_date(entry["lastUpload"])
                if not self.earliest_update or last_update > self.earliest_update:
                    self.earliest_update = last_update

    This makes self.earliest_update the latest lastUpload, not the earliest.

    Ok earliest is a too ambiguous term. I wanted to have something to store the max update_date of a complete listing.

    Either way, I don't think this is the right way to do it, because:

    1. if you flip the operator, then index_last_update will barely advance in time across listings, so it won't really be incremental For now the operator is greater than, but do you mean it should be dynamic at lister instantiation?
    1. if you change the name and definition to match the computation, then we will miss packages uploaded while the lister is running, because it won't see some of them (listed early in its run, but updated later in its run), but may see some more recent ones (listed late in its run, and uploaded earlier)

    Assuming Hackage has no lag between the lastUpload timestamp and the moment packages are available, you should use the time the lister started.

    ok got it, will use the time the lister started Thanks

  • ! In !325 (closed), @franckbret wrote:

    1. if you flip the operator, then index_last_update will barely advance in time across listings, so it won't really be incremental For now the operator is greater than, but do you mean it should be dynamic at lister instantiation?

    No, I was just confused by the term "earliest" and the use of greater-than

  • Author Contributor

    Incremental operations are now related to the last_listing_date

  • Build is green

    Patch application report for D8663 (id=31304)

    Rebasing onto 05cd1de1...

    First, rewinding head to replay your work on top of it...
    Applying: Hackage: Implement incremental mode
    Changes applied before test
    commit 88f8389a8506e03e311f801609febb231eb03f5a
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed Oct 12 10:08:38 2022 +0200
    
        Hackage: Implement incremental mode
        
        Use http api lastUpload argument in search query to retrieve new or
        updated origins since last run
        
        Related to #4597

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/788/ for more details.

  • lastUpload is documented as working on dates rather than datetime, so we are going to need to do the same stuff as with NuGet.

    Also, this code is sending datetimes, instead of dates, so the API ignores the filter entirely. For example:

    curl "https://hackage.haskell.org/packages/search" -H "Accept: application/json" -H "Content-Type: application/json" --data '{"page": 0, "sortColumn": "default", "sortDirection": "ascending", "searchQuery": "(lastUpload > 2022-11-10T01:00:00Z)"}' -X POST | jq . | head
    {
      "numberOfResults": 15644,
      "pageContents": [
        {
          "description": "Examples of 3D graphics programming with OpenGL",
          "downloads": 18,
          "lastUpload": "2016-07-22T14:26:23.038905Z",
          "maintainers": [
            {
              "display": "WolfgangJeltsch",
  • Merge request was returned for changes

  • Oh, I missed that you actually convert using .date() in the code, my bad.

  • Merge request was accepted

  • vlorentz approved this merge request

    approved this merge request

  • buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.

    Also, you should apply .astimezone(tz=timezone.utc) before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.

    (Sorry for the back-and-forth; hopefully I'm done now.)

  • Merge request was returned for changes

  • vlorentz unapproved this merge request

    unapproved this merge request

  • Author Contributor

    Use greater than or equal instead of strict comparison when building http api query params for incremental listing

  • Author Contributor

    ! In !325 (closed), @vlorentz wrote: buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.

    Also, you should apply .astimezone(tz=timezone.utc) before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.

    (Sorry for the back-and-forth; hopefully I'm done now.)

    The http api is able to do greater than equal filtering. Now the filter operator is ">=" instead of ">".

  • Build is green

    Patch application report for D8663 (id=31843)

    Rebasing onto ea146ce2...

    Current branch diff-target is up to date.
    Changes applied before test
    commit d475c1529985162f7c4e0b098e73f958b349decb
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed Oct 12 10:08:38 2022 +0200
    
        Hackage: Implement incremental mode
        
        Use http api lastUpload argument in search query to retrieve new or
        updated origins since last run
        
        Related to #4597

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/859/ for more details.

  • sweet.

    One last thing: could you make tests check the request body is as expected? See https://requests-mock.readthedocs.io/en/latest/history.html

  • Author Contributor

    Improve test for incremental listing, ensure the http searchQuery/lastUpload value is a is a date

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply
Loading