Hackage: Implement incremental mode
Use http api lastUpload argument in search query to retrieve new or updated origins since last run
Related to #4597
Migrated from D8663 (view on Phabricator)
Merge request reports
Activity
Build is green
Patch application report for D8663 (id=31286)
Rebasing onto 05cd1de1...
First, rewinding head to replay your work on top of it... Applying: Hackage: Implement incremental mode
Changes applied before test
commit 2390eefdd68e0adb58e758bc691b8ce95464a986 Author: Franck Bret <franck.bret@octobus.net> Date: Wed Oct 12 10:08:38 2022 +0200 Hackage: Implement incremental mode Use http api lastUpload argument in search query to retrieve new or updated origins since last run Related #4597
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/787/ for more details.
for entry in page: last_update = iso8601.parse_date(entry["lastUpload"]) if not self.earliest_update or last_update > self.earliest_update: self.earliest_update = last_update
This makes
self.earliest_update
the latestlastUpload
, not the earliest.Either way, I don't think this is the right way to do it, because:
- if you flip the operator, then
index_last_update
will barely advance in time across listings, so it won't really be incremental - if you change the name and definition to match the computation, then we will miss packages uploaded while the lister is running, because it won't see some of them (listed early in its run, but updated later in its run), but may see some more recent ones (listed late in its run, and uploaded earlier)
Assuming Hackage has no lag between the
lastUpload
timestamp and the moment packages are available, you should use the time the lister started.- if you flip the operator, then
! In !325 (closed), @vlorentz wrote:
for entry in page: last_update = iso8601.parse_date(entry["lastUpload"]) if not self.earliest_update or last_update > self.earliest_update: self.earliest_update = last_update
This makes
self.earliest_update
the latestlastUpload
, not the earliest.Ok earliest is a too ambiguous term. I wanted to have something to store the max update_date of a complete listing.
Either way, I don't think this is the right way to do it, because:
- if you flip the operator, then
index_last_update
will barely advance in time across listings, so it won't really be incremental For now the operator is greater than, but do you mean it should be dynamic at lister instantiation?
- if you change the name and definition to match the computation, then we will miss packages uploaded while the lister is running, because it won't see some of them (listed early in its run, but updated later in its run), but may see some more recent ones (listed late in its run, and uploaded earlier)
Assuming Hackage has no lag between the
lastUpload
timestamp and the moment packages are available, you should use the time the lister started.ok got it, will use the time the lister started Thanks
- if you flip the operator, then
! In !325 (closed), @franckbret wrote:
- if you flip the operator, then
index_last_update
will barely advance in time across listings, so it won't really be incremental For now the operator is greater than, but do you mean it should be dynamic at lister instantiation?
No, I was just confused by the term "earliest" and the use of greater-than
- if you flip the operator, then
Build is green
Patch application report for D8663 (id=31304)
Rebasing onto 05cd1de1...
First, rewinding head to replay your work on top of it... Applying: Hackage: Implement incremental mode
Changes applied before test
commit 88f8389a8506e03e311f801609febb231eb03f5a Author: Franck Bret <franck.bret@octobus.net> Date: Wed Oct 12 10:08:38 2022 +0200 Hackage: Implement incremental mode Use http api lastUpload argument in search query to retrieve new or updated origins since last run Related to #4597
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/788/ for more details.
lastUpload
is documented as working on dates rather than datetime, so we are going to need to do the same stuff as with NuGet.Also, this code is sending datetimes, instead of dates, so the API ignores the filter entirely. For example:
curl "https://hackage.haskell.org/packages/search" -H "Accept: application/json" -H "Content-Type: application/json" --data '{"page": 0, "sortColumn": "default", "sortDirection": "ascending", "searchQuery": "(lastUpload > 2022-11-10T01:00:00Z)"}' -X POST | jq . | head { "numberOfResults": 15644, "pageContents": [ { "description": "Examples of 3D graphics programming with OpenGL", "downloads": 18, "lastUpload": "2016-07-22T14:26:23.038905Z", "maintainers": [ { "display": "WolfgangJeltsch",
buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.
Also, you should apply
.astimezone(tz=timezone.utc)
before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.(Sorry for the back-and-forth; hopefully I'm done now.)
! In !325 (closed), @vlorentz wrote: buuuut you are using a strict inequality, so you need to subtract one day, in order not to miss uploads submitted after the previous run of the lister but on the same day.
Also, you should apply
.astimezone(tz=timezone.utc)
before converting to date, because the database is not guaranteed to return timestamps in UTC even when they were written in UTC.(Sorry for the back-and-forth; hopefully I'm done now.)
The http api is able to do greater than equal filtering. Now the filter operator is ">=" instead of ">".
Build is green
Patch application report for D8663 (id=31843)
Rebasing onto ea146ce2...
Current branch diff-target is up to date.
Changes applied before test
commit d475c1529985162f7c4e0b098e73f958b349decb Author: Franck Bret <franck.bret@octobus.net> Date: Wed Oct 12 10:08:38 2022 +0200 Hackage: Implement incremental mode Use http api lastUpload argument in search query to retrieve new or updated origins since last run Related to #4597
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/859/ for more details.
sweet.
One last thing: could you make tests check the request body is as expected? See https://requests-mock.readthedocs.io/en/latest/history.html