In addition, it would be nice to know the number of contributors and get a sense of how active the project is. The latest commit date on the main branch could serve as a proxy for that.
I did a quick review of the different forges a while ago, and GitHub seemed to expose the most metadata at the organisation level, which makes retrieval much easier.
In addition, it would be nice to know the number of contributors
Good idea! I added it to the list.
get a sense of how active the project is. The latest commit date on the main branch could serve as a proxy for that.
There's developmentStatus, but I highly doubt many projects define it (I've seen it as a badge on some GitHub repos, but it's rare). Using "dateModified" seems like a good idea indeed.
Actually, we can derive "dateModified" from data already in the SWH archive: on each visit of a repo we take a snapshot of the repo and hash it, so it's just a matter of listing the visits and finding the last change to this hash. But it's rather coarse-grained, since we only take a snapshot of each repo every one or two years.
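To make the idea concrete, here is a rough sketch of the "find the last change to the snapshot hash" step. It assumes each visit is a plain dict with a `date` (ISO 8601 string) and a `snapshot` (hash, or empty when the visit produced nothing); those field names and the function name `estimate_date_modified` are assumptions for illustration, not a confirmed archive schema. The result is only an upper bound: the actual change happened somewhere between the previous visit and the returned one.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def estimate_date_modified(visits: List[Dict[str, Any]]) -> Optional[datetime]:
    """Return the date of the first visit that saw the current snapshot hash.

    `visits` is an assumed structure: dicts with "date" and "snapshot" keys.
    """
    # Ignore visits that did not produce a snapshot (e.g. failed visits).
    usable = [v for v in visits if v.get("snapshot")]
    if not usable:
        return None

    # Order visits chronologically, oldest first.
    usable.sort(key=lambda v: datetime.fromisoformat(v["date"]))

    current_hash = usable[-1]["snapshot"]
    first_seen = usable[-1]

    # Walk backwards while the hash is unchanged; stop at the last change.
    for visit in reversed(usable):
        if visit["snapshot"] != current_hash:
            break
        first_seen = visit

    return datetime.fromisoformat(first_seen["date"])


# Made-up visit history, purely for illustration.
example_visits = [
    {"date": "2019-01-10T00:00:00+00:00", "snapshot": "aaa"},
    {"date": "2020-06-02T00:00:00+00:00", "snapshot": "bbb"},
    {"date": "2021-11-20T00:00:00+00:00", "snapshot": "bbb"},
    {"date": "2022-02-01T00:00:00+00:00", "snapshot": None},
]
estimated = estimate_date_modified(example_visits)
```

With visits one or two years apart, the estimate can be off by as much as that interval, which is the coarse-grained caveat mentioned above.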
I think that we should fetch all the metadata we find in its raw form (keep it as XML if it's XML, etc.),
then apply translation techniques with CodeMeta, while identifying the relevant metadata we want to keep in a translated format,
so there's no need to discriminate and choose what to fetch.
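A minimal sketch of what "keep raw, translate on the side" could look like, using a `package.json` file as the example input. The `MetadataRecord` class, the `translate_package_json` function and the four-entry mapping are all hypothetical and only for illustration; this is not the official CodeMeta crosswalk, and a real translation would cover many more fields.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical, deliberately tiny mapping; NOT the official CodeMeta crosswalk.
PACKAGE_JSON_TO_CODEMETA = {
    "name": "name",
    "version": "version",
    "description": "description",
    "license": "license",
}


@dataclass
class MetadataRecord:
    raw: bytes        # original bytes, stored untouched
    raw_format: str   # label for the source format
    translated: Dict[str, Any] = field(default_factory=dict)


def translate_package_json(raw: bytes) -> MetadataRecord:
    """Keep the raw file as-is and translate only the fields we recognise."""
    source = json.loads(raw.decode("utf-8"))
    translated: Dict[str, Any] = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
    }
    for source_key, codemeta_term in PACKAGE_JSON_TO_CODEMETA.items():
        if source_key in source:
            translated[codemeta_term] = source[source_key]
    return MetadataRecord(raw=raw, raw_format="package.json", translated=translated)


raw_file = b'{"name": "demo", "version": "1.0.0", "scripts": {"test": "jest"}}'
record = translate_package_json(raw_file)
```

Since the raw bytes are never discarded, fields that are not mapped today (like `scripts` here) can still be translated later if they turn out to be useful.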
Here are a few rarely used metadata fields that are useful in certain use cases:
datePublished: we use it for HAL
referencePublication: used for software citation
releaseNotes: will be used in the deposit use case for creating releases