Listers can't fetch rich metadata for all forges. On GitHub and Bitbucket, they don't get any metadata other than the description (and, on GitHub, whether the repository is a fork). On GitLab, they also get the number of stars and forks, but not much more.
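For illustration, roughly the subset of fields a lister sees per repository from each forge's listing endpoint (abridged and illustrative; field names follow the respective REST APIs):

```python
# Roughly what a lister gets per repository (abridged, illustrative):
github_entry = {
    "full_name": "octocat/Hello-World",
    "description": "My first repository",
    "fork": False,              # the only extra bit GitHub's listing gives us
}
gitlab_entry = {
    "path_with_namespace": "group/project",
    "description": "A project",
    "star_count": 42,           # GitLab also exposes stars...
    "forks_count": 7,           # ...and forks, but not much more
}
```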
Existing package loaders currently load some metadata, but we may want to use dedicated loaders for that, e.g. with a specific visit type.
vlorentz changed the title from "Define an architecture to fetch extrinsic metadata outside listers" to "Define an architecture to fetch extrinsic metadata outside listers and loaders"
It doesn't provide a way to fetch metadata for inactive repositories, but we can deal with that later (e.g. with an option to make loaders load only metadata, or simply by running loaders on repos even if the forge does not report changes).
It "feels wrong" to have forge-specific code in loaders, but we can make metadata loading pluggable (e.g. with setuptools entry points) if this ever becomes an issue.
In #1739 (closed), @vlorentz wrote:
The original idea for this was to have separate tasks fetch the metadata, so that loaders would not need forge-specific code.
However, the idea of loading metadata from the loaders is more appealing the more I think about it:
Metadata is fetched at about the same time as we snapshot the code, which allows showing more consistent states of repositories
Active repositories automatically have their metadata fetched more often than inactive ones
We don't have one more moving part to monitor and schedule
This allows the Git loader to know a new repo is a "forge fork" of another one before it starts loading, so it can do an incremental load
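As a sketch of that last point, assuming the GitHub REST API (the function and token handling are illustrative, not existing loader code):

```python
from typing import Optional

import requests


def parent_origin(owner: str, repo: str, token: str) -> Optional[str]:
    """Return the URL of the repository this one was forked from, if any.

    Uses GitHub's GET /repos/{owner}/{repo}; the "parent" field is only
    present on forks, so a loader could check it before its first visit
    and seed an incremental load from the parent's objects.
    """
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    data = resp.json()
    if data.get("fork") and data.get("parent"):
        return data["parent"]["html_url"]
    return None
```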
Yes, all these are good points. As long as forges don't provide a way of loading the metadata in bulk, it makes sense to do it at the same time as loading.
I think we will want to ensure that a failing metadata fetch doesn't fail the whole loading operation, to avoid coupling these components too strongly.
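A minimal sketch of that decoupling, with hypothetical method names (`fetch_extrinsic_metadata`, `store_metadata` and `load_code` are not existing APIs):

```python
import logging

logger = logging.getLogger(__name__)


class Loader:
    def visit(self):
        # Metadata fetching is best-effort: failures are logged (and can
        # be reported to monitoring), but never abort the code loading.
        try:
            metadata = self.fetch_extrinsic_metadata()
            self.store_metadata(metadata)
        except Exception:
            logger.exception(
                "Extrinsic metadata fetch failed for %s", self.origin_url
            )
        # Code loading proceeds regardless of the metadata outcome.
        self.load_code()
```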
I doubt we will be giving git loaders SSH keys any time soon, and I'd rather we spelled out, and maybe contracted out, the improvements to dulwich that we need for better generic HTTPS support (which will automatically be useful for all upstreams, not just GitHub).
Either way, giving a set of forge API credentials to the git loader is just a matter of duplicating an entry in the puppet config (from lister to loader-git), so it's really not a big practical deal.
It doesn't provide a way to fetch metadata for inactive repositories, but we can deal with that later (e.g. with an option to make loaders load only metadata, or simply by running loaders on repos even if the forge does not report changes).
I would expect most forges to actually report a change on the origin if only the metadata changes, so they'd be scheduled for loading again anyway. In a situation where we're not lagging behind forges, we would also be re-loading origins with no known changes.
In terms of testing, I think we will want the option to do metadata-only/code-only/both loads, so we could consider scheduling some metadata-only loads more often.
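For instance, the entry point could take a mode flag (a sketch against a hypothetical loader interface):

```python
from enum import Enum


class LoadMode(Enum):
    CODE_ONLY = "code-only"
    METADATA_ONLY = "metadata-only"
    BOTH = "both"


class Loader:
    def load(self, mode: LoadMode = LoadMode.BOTH):
        # Metadata-only loads can then be scheduled on their own cadence,
        # independently of full code loads.
        if mode is not LoadMode.CODE_ONLY:
            self.load_metadata()
        if mode is not LoadMode.METADATA_ONLY:
            self.load_code()
```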
It "feels wrong" to have forge-specific code in loaders, but we can make metadata loading pluggable (e.g. with setuptools entry points) if this ever becomes an issue.
I think we want to design the metadata loading as a pluggable / third-party module from the get-go, because for forges which support multiple different VCSes, we will want to share the logic between them. This will also make it easier to test the metadata fetching/mangling logic in isolation.
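A sketch of what that discovery could look like with setuptools entry points (the group name `swh.loader.metadata` and the fetcher classes are assumptions, not an existing interface):

```python
from importlib.metadata import entry_points  # Python 3.10+ signature


def get_metadata_fetcher(forge_type: str):
    """Look up a metadata fetcher registered for a forge type.

    A third-party package would declare, in its setup.cfg:

        [options.entry_points]
        swh.loader.metadata =
            github = swh_loader_github.metadata:GitHubMetadataFetcher
    """
    for ep in entry_points(group="swh.loader.metadata"):
        if ep.name == forge_type:
            return ep.load()
    raise ValueError(f"no metadata fetcher registered for {forge_type!r}")
```

The same fetcher could then be shared by every loader visiting origins on that forge, whatever the VCS.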
In #1739 (closed), @olasd wrote:
Yes, all these are good points. As long as forges don't provide a way of loading the metadata in bulk, it makes sense to do it at the same time as loading.
Some do, including GitHub, but I think the benefits of these bulk APIs are minimal because of the way GitHub implements rate-limiting.
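For reference, GitHub's GraphQL API does allow batching several repositories into one request via aliases; each aliased node still consumes rate-limit points, which is why the win is limited (a sketch, with placeholder repositories):

```python
import requests

# One request covering two repositories via GraphQL aliases; the query
# still costs rate-limit points roughly proportional to the nodes touched.
BULK_QUERY = """
{
  repo0: repository(owner: "torvalds", name: "linux") {
    description
    forkCount
    stargazerCount
  }
  repo1: repository(owner: "python", name: "cpython") {
    description
    forkCount
    stargazerCount
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": BULK_QUERY},
    headers={"Authorization": "bearer <token>"},
)
```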
Either way, giving a set of forge API credentials to the git loader is just a matter of duplicating an entry in the puppet config (from lister to loader-git), so it's really not a big practical deal.
Sure, but I meant it from a security point of view: these credentials will need to be available to processes that handle untrusted data. Not to mention that they may accidentally be leaked to Sentry too.
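If loaders do end up holding forge credentials, the Sentry SDK's `before_send` hook is one place to scrub them before events leave the process (a minimal sketch; the list of keys is illustrative):

```python
import sentry_sdk

FORGE_TOKEN_KEYS = {"Authorization", "private_token", "access_token"}


def scrub_credentials(event, hint):
    # Redact anything that looks like a forge API credential before the
    # event is sent to Sentry.
    headers = event.get("request", {}).get("headers", {})
    for key in list(headers):
        if key in FORGE_TOKEN_KEYS:
            headers[key] = "[redacted]"
    return event


sentry_sdk.init(dsn="...", before_send=scrub_credentials)
```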
It doesn't provide a way to fetch metadata for inactive repositories, but we can deal with that later (e.g. with an option to make loaders load only metadata, or simply by running loaders on repos even if the forge does not report changes).
I would expect most forges to actually report a change on the origin if only the metadata changes, so they'd be scheduled for loading again anyway. In a situation where we're not lagging behind forges, we would also be re-loading origins with no known changes.
Probably not all metadata (e.g. the number of stars). Either way, the semantics of `updated_at` are not documented by GitHub.
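For what it's worth, GitHub's listings expose both `updated_at` and `pushed_at`; only the latter clearly tracks code pushes, so a scheduler would have to treat the pair conservatively (a hypothetical helper):

```python
from datetime import datetime


def needs_new_visit(listed_repo: dict, last_visit: datetime) -> bool:
    """Hypothetical scheduling helper based on GitHub listing fields.

    "pushed_at" tracks code pushes; "updated_at" also moves on other
    repository events, with semantics GitHub does not document, so we
    treat either timestamp advancing as a reason to revisit.
    """
    def parse(ts: str) -> datetime:
        # GitHub timestamps look like "2020-01-01T12:00:00Z"
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    return max(parse(listed_repo["pushed_at"]),
               parse(listed_repo["updated_at"])) > last_visit
```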