GitHub loading optimization: skip repos with old enough updated_at/pushed_at timestamps
The GitHub API lets us inspect when a repo was last modified; see the updated_at / pushed_at fields in this example.
Given how significant GitHub is in our archive coverage, it makes sense to add a forge-specific optimization that skips loading repos for which those timestamps are older than our last visit of the corresponding origins.
(Note: I'm not exactly sure what the difference between the two fields is; I'm assuming pushed_at is for git push and updated_at for metadata changes. But I think even the most conservative approach, skipping only if //both// fields are older than our last visit, would be a good start.)
Assuming that doing an API call at the loader level is faster than actually trying to load the repo (which seems obvious to me, though I haven't actually benchmarked it), this optimization should help a lot in clearing our backlog of repos to re-visit, for all GitHub repos that haven't changed.
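To make the proposal concrete, here is a minimal sketch of the conservative check in Python. The function names (`parse_github_timestamp`, `can_skip_visit`, `fetch_repo_metadata`) are hypothetical and not part of any existing loader code. It assumes the last visit date is available as a timezone-aware datetime, and that the metadata comes from the GitHub REST endpoint `GET /repos/{owner}/{repo}`, which returns `updated_at` and `pushed_at` as ISO 8601 strings. Authentication and rate limiting are left out.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Any, Mapping


def parse_github_timestamp(value: str) -> datetime:
    """Parse a GitHub timestamp such as '2020-01-31T12:00:00Z' as aware UTC."""
    # fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def can_skip_visit(repo: Mapping[str, Any], last_visit: datetime) -> bool:
    """Return True only if both timestamps are older than our last visit.

    `last_visit` must be timezone-aware; comparing it with the parsed
    (aware) GitHub timestamps would raise TypeError otherwise.
    """
    timestamps = [repo.get("updated_at"), repo.get("pushed_at")]
    # Conservative choice (an assumption of this sketch): if either field
    # is missing or null, do not skip and let the loader visit the repo.
    if any(ts is None for ts in timestamps):
        return False
    return all(parse_github_timestamp(ts) < last_visit for ts in timestamps)


def fetch_repo_metadata(owner: str, name: str) -> Mapping[str, Any]:
    """Fetch repository metadata with one unauthenticated API call."""
    url = f"https://api.github.com/repos/{owner}/{name}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A loader could call `can_skip_visit(fetch_repo_metadata(owner, name), last_visit)` before cloning, and fall back to a normal load whenever the API call fails. Keeping the decision in a pure function separate from the HTTP call should make it easier to reuse for another forge later.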
I'm not sure where this forge-specific optimization belongs, but if it works it's something we can extend in the future to, e.g., GitLab.
Migrated from T2242 (view on Phabricator)