When we save an unknown origin due to a Save code now request, we schedule a one-shot task for the ingestion, but don't add the origin for future crawling. It might make sense to do both.
It is possibly also the only reasonable place where we can have heuristics to de-duplicate URLs that point to the same repo, e.g., non-canonical GitHub repo URLs.
> It is possibly also the only reasonable place where we can have heuristics to
> de-duplicate URLs that point to the same repo, e.g., non-canonical GitHub repo URLs.
That concern will be kept out of this task for now. It can be dealt with alongside [1].
> When we save an unknown origin due to a Save code now request, we schedule a one-shot
> task for the ingestion, but don't add the origin for future crawling. It might make
> sense to do both.
It makes sense.
Implementation-wise, "save code now" is treated as a Lister: it records origins in the
scheduler model, from which the "next-gen" scheduler will schedule recurrent visits.
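A minimal sketch of that idea. All names here (`ListedOrigin`, `SchedulerBackend`, `record_listed_origins`, `on_save_code_now_success`) are illustrative stand-ins, not the actual swh.scheduler API: on a successful save code now request, the origin is also upserted into the scheduler's listed-origins store under a pseudo-lister named "save-code-now".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Tuple

# Hypothetical stand-in for the scheduler's listed-origins model: each
# successful "save code now" request is recorded as an origin listed by
# a pseudo-lister named "save-code-now".

@dataclass(frozen=True)
class ListedOrigin:
    lister_name: str
    url: str
    visit_type: str
    last_seen: datetime

class SchedulerBackend:
    """Toy in-memory scheduler: stores listed origins keyed by (lister, url)."""

    def __init__(self) -> None:
        self._origins: Dict[Tuple[str, str], ListedOrigin] = {}

    def record_listed_origins(self, origins: List[ListedOrigin]) -> None:
        # Upsert: re-saving the same origin just refreshes last_seen.
        for o in origins:
            self._origins[(o.lister_name, o.url)] = o

    def count_for_lister(self, lister_name: str) -> int:
        return sum(1 for (name, _url) in self._origins if name == lister_name)

def on_save_code_now_success(scheduler: SchedulerBackend, url: str,
                             visit_type: str = "git") -> None:
    """Besides the one-shot loading task, register the origin for future crawling."""
    scheduler.record_listed_origins([
        ListedOrigin("save-code-now", url, visit_type,
                     datetime.now(timezone.utc)),
    ])

scheduler = SchedulerBackend()
on_save_code_now_success(scheduler, "https://github.com/example/repo")
on_save_code_now_success(scheduler, "https://github.com/example/repo")  # idempotent
print(scheduler.count_for_lister("save-code-now"))  # → 1
```

The upsert keyed on (lister, url) makes repeated save requests for the same origin idempotent, mirroring how re-listing an origin should only refresh it, not duplicate it.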
The first part has been deployed (modifications in the webapp routines to update the
save code now statuses).
The webapp now records successful save code now origins in the listed_origins model of
the scheduler. [1]
What remains is deploying the runner that actually consumes those origins regularly (or maybe that is already the case; I need to check that part).
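As a rough illustration of what that runner would do (the real policy lives in the "next-gen" scheduler; the names and the fixed revisit interval below are assumptions for the sketch): periodically select the listed origins whose last visit is old enough and schedule a visit for each.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# Toy sketch of a recurrent-visit runner: given (url, last_visit) pairs,
# return the URLs that are due for a new visit. A fixed interval is assumed
# here; a real policy would adapt per origin.

REVISIT_INTERVAL = timedelta(days=1)

def origins_due(origins: List[Tuple[str, datetime]],
                now: datetime) -> List[str]:
    """Return URLs whose last visit is older than REVISIT_INTERVAL."""
    return [url for url, last_visit in origins
            if now - last_visit >= REVISIT_INTERVAL]

now = datetime(2021, 6, 16, tzinfo=timezone.utc)
listed = [
    ("https://example.org/repo1", now - timedelta(days=2)),   # due
    ("https://example.org/repo2", now - timedelta(hours=3)),  # visited recently
]
print(origins_due(listed, now))  # → ['https://example.org/repo1']
```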
[1]
softwareheritage-scheduler=> select now(), count(*) from listed_origins lo inner join listers l on lo.lister_id=l.id where name='save-code-now';
+-------------------------------+-------+
|              now              | count |
+-------------------------------+-------+
| 2021-06-16 06:36:44.269692+00 |    22 |
+-------------------------------+-------+
(1 row)

Time: 27.663 ms