
Decide what metadata we want to / can collect from GitHub

What we can collect is a subset of what we can see here: https://api.github.com/repos/SoftwareHeritage/swh-core

How we collect it depends on which information we want, so let's try to list it exhaustively:

| property | priority | REST /repositories | REST /users/{username}/repos (works for orgs too) | REST (specific query) | GraphQL via user (impossible for orgs) | GraphQL direct | comment |
|---|---|---|---|---|---|---|---|
| owner avatar + homepage URL | low | free (avatar only) | no | 1 req/user | free | 1 req + 1 point/user | |
| description | high | free | free | 1 req | free | N/A | |
| whether it's a fork | high | free | free | N/A | free | N/A | |
| what it's a fork of | high | no | no | 1 req/forkedrepo | 1 point/100repos | 1 req + 1 point/forked-repo | |
| whether it's a mirror | high | no | free | N/A | free | N/A | |
| what it's a mirror of | high | no | free | N/A | free | N/A | |
| created_at / updated_at | high | no | free | N/A | free | N/A | |
| pushed_at | high | free | free | N/A | free | N/A | |
| homepage | high | no | free | N/A | free | N/A | |
| "topics" | high | no | no | 1 preview req/100 repos/user | MAX_TOPICS points/100repos/user | 1 req/repo + 1 point/repo (assuming less than 100 topics) | Not available in the "production" REST API |
| stargazers_count / watchers_count | mid | no | free | N/A | free | N/A | |
| list of stargazers / watchers | low | no | no | 1 req/100people | too expensive | 1 req/repo/100people + ceil(1 point/1repo + 1 point/100people) | |
| forks count | mid | no | free | N/A | free | N/A | |
| list of forks | no | no | no | 1 req/100forks/repo | too expensive | ceil(1 req/repo/100forks) + ceil(1 point/1repo + 1 point/100forks) | |
| license | low/mid | no | free | N/A | free | N/A | (GH extracts it from the intrinsic metadata we collect too, so probably not very useful) |
| main language | low/mid | no | free | N/A | no | no | (ditto) |
| all languages | low | no | no | 1 req/repo (assuming <100 languages) | too expensive | 1 req/repo + 1 point/repo (assuming <100 languages) | (ditto) |
| assets | out of scope | | | | | | this should probably be done by a specific loader though; it's closer to a package manager than to metadata |
| release notes | out of scope | | | | | | that aren't on git tags |
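
As a rough illustration of the listing above, here is a minimal sketch of picking the properties out of one repository object as returned by the REST API (such as the JSON at the URL given earlier). The key names are the ones GitHub uses in its repository objects; the function name `extract_metadata`, the `is_mirror` key, and the choice to keep only the SPDX id of the license are assumptions made for this example, not an agreed design.

```python
# Keys as they appear in GitHub REST repository objects.
# Which of them are present depends on the endpoint that returned the object.
REST_FIELDS = (
    "description",
    "fork",
    "mirror_url",
    "created_at",
    "updated_at",
    "pushed_at",
    "homepage",
    "stargazers_count",
    "watchers_count",
    "forks_count",
    "language",
)


def extract_metadata(repo: dict) -> dict:
    """Hypothetical helper: keep only the properties listed in this task.

    Missing keys become None, so the same function works on the sparse
    objects of one endpoint and the fuller objects of another.
    """
    meta = {key: repo.get(key) for key in REST_FIELDS}
    # "license" is either null or an object; keeping only its SPDX id
    # is an assumption of this sketch.
    license_info = repo.get("license") or {}
    meta["license"] = license_info.get("spdx_id")
    # "whether it's a mirror" is derived here from "what it's a mirror of".
    meta["is_mirror"] = repo.get("mirror_url") is not None
    return meta
```

Properties marked "no" for an endpoint simply come back as `None` with this approach, which makes it easy to see what still needs a specific query.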

Notes:

  • Costs are computed assuming we will send a REST or GraphQL query per user/org regardless of whether we want the property. These costs are:
    • for REST: ceil(1 req/100 repos/user)
    • for GraphQL: ceil(1 req/100 repos/user) + ceil(1 point/100 repos)
  • Rate limits are:
    • for REST: 5000 req/hour/token
    • for GraphQL: 5000 points/hour/token (no req/hour limit AFAICT, but I'm including requests in the calculation because they use resources on our side)
  • N/A means it's pointless to send that extra query, as we can get it in a strictly more efficient way
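
The baseline costs and rate limits in these notes can be turned into a quick back-of-the-envelope calculation. The sketch below is only that: the function names are made up for this example, and it assumes that a user/org with zero repositories still costs one page (one request, and one point for GraphQL).

```python
import math

PAGE_SIZE = 100          # repos returned per request/page
REST_LIMIT = 5000        # requests per hour per token
GRAPHQL_LIMIT = 5000     # points per hour per token


def pages(repo_count: int) -> int:
    """Pages needed to list one user's repos (assumed to be at least 1)."""
    return max(1, math.ceil(repo_count / PAGE_SIZE))


def baseline_cost(repo_counts) -> dict:
    """Baseline cost of listing every repo of every user/org.

    repo_counts: iterable with the number of repos of each user/org.
    """
    total_pages = sum(pages(n) for n in repo_counts)
    return {
        "rest_requests": total_pages,
        # GraphQL: one request and one point per page of 100 repos.
        "graphql_requests": total_pages,
        "graphql_points": total_pages,
    }


def hours_needed(units: int, limit_per_hour: int = REST_LIMIT) -> float:
    """Hours one token needs to spend `units` requests or points."""
    return units / limit_per_hour


# Three users/orgs with 250, 3 and 100 repos: 3 + 1 + 1 pages.
cost = baseline_cost([250, 3, 100])
print(cost, hours_needed(cost["rest_requests"]))
```

Per-property extras from the listing (for example 1 req/forkedrepo) would be added on top of this baseline.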

Migrated from T3542 (view on Phabricator)

Edited by Phabricator Migration user