Decide what metadata we want to / can collect from GitHub
What we can collect is a subset of what we can see here: https://api.github.com/repos/SoftwareHeritage/swh-core
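As a rough illustration of "a subset of what we can see", the sketch below picks a few keys out of an already-parsed response from that endpoint. The key names are the ones the REST repository object uses as far as I know; `WANTED_KEYS`, `pick_metadata` and the sample dict are made up for the example and are not part of any existing loader.

```python
# Keys of the REST repository object we might care about (assumed selection,
# not a decision about what will actually be collected).
WANTED_KEYS = (
    "description",
    "fork",
    "mirror_url",
    "created_at",
    "updated_at",
    "pushed_at",
    "homepage",
    "topics",
    "stargazers_count",
    "watchers_count",
    "forks_count",
    "license",
    "language",
)


def pick_metadata(repo: dict) -> dict:
    """Return only the wanted keys; keys missing from the response map to None."""
    return {key: repo.get(key) for key in WANTED_KEYS}


# Hand-written sample, NOT a real API response.
sample = {"description": "example", "fork": False, "id": 1, "homepage": ""}
picked = pick_metadata(sample)
```

Keys that an endpoint does not return simply come back as `None`, which makes it easy to see which rows of the table below a given endpoint leaves empty.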
The way we'll collect it depends on what info we want; so let's try to list it exhaustively:
property | priority | REST /repositories | REST /users/{username}/repos (works for orgs too) | REST (specific query) | GraphQL via user (impossible for orgs) | GraphQL direct | comment |
---|---|---|---|---|---|---|---|
owner avatar + homepage URL | low | free (avatar only) | no | 1 req/user | free | 1 req + 1 point/user | |
description | high | free | free | 1 req | free | N/A | |
whether it's a fork | high | free | free | N/A | free | N/A | |
what it's a fork of | high | no | no | 1 req/forked-repo | 1 point/100repos | 1 req + 1 point/forked-repo | |
whether it's a mirror | high | no | free | N/A | free | N/A | |
what it's a mirror of | high | no | free | N/A | free | N/A | |
created_at / updated_at | high | no | free | N/A | free | N/A | |
pushed_at | high | free | free | N/A | free | N/A | |
homepage | high | no | free | N/A | free | N/A | |
"topics" | high | no | no | 1 preview req/100 repos/user | MAX_TOPICS points/100repos/user | 1 req/repo + 1 point/repo (assuming less than 100 topics) | Not available in the "production" REST API |
stargazers_count / watchers_count | mid | no | free | N/A | free | N/A | |
list of stargazers / watchers | low | no | no | 1 req/100people | too expensive | 1 req/repo/100people + ceil(1 point/1repo + 1 point/100people) | |
forks count | mid | no | free | N/A | free | N/A | |
list of forks | no | no | no | 1 req/100forks/repo | too expensive | ceil(1 req/repo/100forks) + ceil(1 point/1repo + 1 point/100forks) | |
license | low/mid | no | free | N/A | free | N/A | (GH extracts it from the intrinsic metadata we collect too, so probably not very useful) |
main language | low/mid | no | free | N/A | no | no | (ditto) |
all languages | low | no | no | 1 req/repo (assuming <100 languages) | too expensive | 1 req/repo + 1 point/repo (assuming <100 languages) | (ditto) |
assets | out of scope | | | | | | this should probably be done by a specific loader; it's closer to a package manager than to metadata |
release notes (those that aren't on git tags) | out of scope | | | | | | |
Notes:
- Costs are computed assuming we will send a REST or GraphQL query per user/org regardless of whether we want the property. These baseline costs are:
  - for REST: ceil(1 req/100repos/user)
  - for GraphQL: ceil(1 req/100repos/user) + ceil(1 point/100repos)
- Rate-limits are:
  - for REST: 5000req/hour/token
  - for GraphQL: 5000points/hour/token (no req/hour limit AFAICT, but I'm including requests in the calculation because they use resources on our side)
- N/A means it's pointless to send that extra query, as we can get it in a strictly more efficient way
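The baseline costs and rate limits above can be turned into a quick back-of-the-envelope calculator. This is only a sketch of the arithmetic in these notes: the function names are made up, and charging at least one request for a user with zero repositories is my own assumption (we still have to ask once to find out).

```python
import math

PAGE_SIZE = 100                  # repos returned per page
REST_REQUESTS_PER_HOUR = 5000    # per token
GRAPHQL_POINTS_PER_HOUR = 5000   # per token


def baseline_pages(n_repos: int) -> int:
    """ceil(n_repos / 100), but at least 1 (assumption: empty users cost one call)."""
    return max(1, math.ceil(n_repos / PAGE_SIZE))


def rest_baseline_requests(n_repos: int) -> int:
    """REST baseline: one request per page of 100 repos."""
    return baseline_pages(n_repos)


def graphql_baseline_cost(n_repos: int) -> tuple[int, int]:
    """GraphQL baseline as (requests, points): one of each per page of 100 repos."""
    pages = baseline_pages(n_repos)
    return pages, pages


def rest_hours(repo_counts: list[int], tokens: int = 1) -> float:
    """Hours needed to list every user's repos over REST at the rate limit."""
    total = sum(rest_baseline_requests(n) for n in repo_counts)
    return total / (REST_REQUESTS_PER_HOUR * tokens)


def graphql_hours(repo_counts: list[int], tokens: int = 1) -> float:
    """Same for GraphQL, counting only points (no req/hour limit AFAICT)."""
    total = sum(graphql_baseline_cost(n)[1] for n in repo_counts)
    return total / (GRAPHQL_POINTS_PER_HOUR * tokens)
```

For example, `rest_hours([100] * 10000)` is 10000 requests at 5000 per hour, so 2.0 hours with one token. Per-property extras from the table would be added on top of these baselines.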
Migrated from T3542 (view on Phabricator)