Collect Project Context data (issues, PRs, ...)

Main steps

GitHub

  • Integrate GHarchive raw data (from 2015 to 2025)
  • Adapt GHarchive (mainly the "cron") for raw data; data transformation to fit SWH model; experiments with Github API (some events are delayed, > 6 hours eg 23 days!) to update
  • Connect with the SWH data model => meetings/iterations with 2.3, based on the GitHub API metamodel and confronted/discussed with respect to the SWH data model; three main patterns (pattern (2) is sketched after this list):
    • keep GitHub concepts "as is"
    • refine GitHub concepts to gain generality
    • leave GitHub information out of scope (philosophy: we keep the raw data so as to have no regrets)
  • POC swh-project-context implemented: equivalent to gharchive.org for hourly files, deployed on a laptop (no RAM or CPU issues); the hourly-file ingestion is sketched after this list
  • Software architecture for swh-project-context almost done => open questions about the data model (see the UML diagram below); next steps:
    • refactor the POC to implement the new architecture
    • collect sizing metrics (RAM, CPU, disk space)
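
As a rough illustration of the delay experiments mentioned above, the sketch below polls GitHub's documented public Events API and flags events older than the 6-hour threshold. It is a simplistic probe written for this note, not the actual experiment setup; unauthenticated requests are heavily rate-limited.

    from datetime import datetime, timezone
    import json
    import urllib.request

    # Public events feed; per_page=100 is the documented maximum.
    EVENTS_URL = "https://api.github.com/events?per_page=100"

    def event_delays():
        """Yield (event_id, delay_in_seconds) for the latest public events."""
        req = urllib.request.Request(
            EVENTS_URL, headers={"Accept": "application/vnd.github+json"}
        )
        now = datetime.now(timezone.utc)
        with urllib.request.urlopen(req) as resp:
            events = json.load(resp)
        for ev in events:
            # created_at is ISO 8601 with a trailing "Z" (UTC).
            created = datetime.fromisoformat(ev["created_at"].replace("Z", "+00:00"))
            yield ev["id"], (now - created).total_seconds()

    if __name__ == "__main__":
        for event_id, delay in event_delays():
            if delay > 6 * 3600:  # flag events older than the 6-hour threshold
                print(f"event {event_id} delayed by {delay / 3600:.1f} h")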
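
The sketch below illustrates what pattern (2) could look like: a GitHub issue refined into a more general, forge-agnostic record, with the full raw payload kept alongside (the "no regrets" philosophy). All names here (Ticket, ticket_from_github_event) are hypothetical, not the actual SWH/2.3 data model; the payload fields are those of GHArchive's IssuesEvent.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Ticket:
        """Hypothetical forge-agnostic refinement of a GitHub issue."""
        origin: str          # URL of the forge repository the ticket came from
        external_id: str     # native id on the forge, kept as-is (pattern 1)
        title: str
        state: str           # "open" / "closed", normalized across forges
        created_at: datetime
        raw: dict            # full original payload, kept "to not have regrets"

    def ticket_from_github_event(event: dict) -> Ticket:
        """Map a GHArchive IssuesEvent to the generic Ticket."""
        issue = event["payload"]["issue"]
        return Ticket(
            origin=event["repo"]["url"],
            external_id=str(issue["number"]),
            title=issue["title"],
            state=issue["state"],
            created_at=datetime.fromisoformat(
                issue["created_at"].replace("Z", "+00:00")
            ),
            raw=event,
        )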
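
A minimal sketch of the hourly-file ingestion the POC performs, assuming GHArchive's documented URL scheme (https://data.gharchive.org/YYYY-MM-DD-H.json.gz, with the hour not zero-padded); the event-type count at the end is purely illustrative.

    import gzip
    import json
    import urllib.request

    # GHArchive publishes one gzipped JSON-lines file per hour.
    GHARCHIVE_URL = "https://data.gharchive.org/{date}-{hour}.json.gz"

    def fetch_hourly_events(date: str, hour: int):
        """Yield raw GitHub events from one GHArchive hourly file.

        `date` is "YYYY-MM-DD"; `hour` is 0-23 (no zero padding in the URL).
        """
        url = GHARCHIVE_URL.format(date=date, hour=hour)
        with urllib.request.urlopen(url) as resp:
            # The HTTP response is a file-like object gzip can decompress.
            with gzip.open(resp, mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    yield json.loads(line)

    if __name__ == "__main__":
        # Count events per type for a single hour.
        from collections import Counter
        counts = Counter(ev["type"] for ev in fetch_hourly_events("2015-01-01", 15))
        print(counts.most_common(5))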

GitLab

  • Implement a crawler based on the GitLab API to collect metrics and analyse GitLab data volumes (a crawler sketch follows this list)
    • conclusion: GitHub represents far more data volume than all the other forges combined
  • Collect metrics across all GitLab forges => check whether the GitLab.com crawler requires adaptation
  • Feed it with the list of GitLab instances supported by SWH
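
A minimal sketch of what such a crawler could look like against any GitLab instance, using the documented /api/v4/projects endpoint with keyset pagination; the instance URL in the usage example is an arbitrary choice, and a real run would need rate-limit handling plus authentication for non-public data.

    import json
    import urllib.request
    from urllib.parse import urlencode

    def iter_projects(base_url: str, per_page: int = 100):
        """Iterate over the public projects of a GitLab instance.

        `base_url` is e.g. "https://gitlab.com" or a self-hosted instance.
        """
        params = {
            "pagination": "keyset",
            "order_by": "id",
            "sort": "asc",
            "per_page": per_page,
        }
        url = f"{base_url}/api/v4/projects?{urlencode(params)}"
        while url:
            with urllib.request.urlopen(url) as resp:
                projects = json.load(resp)
                # Keyset pagination advertises the next page in the Link header.
                link = resp.headers.get("Link", "")
            yield from projects
            url = None
            for part in link.split(","):
                if 'rel="next"' in part:
                    url = part[part.find("<") + 1 : part.find(">")]

    if __name__ == "__main__":
        # Rough volumetric probe: count public projects on one instance.
        n = sum(1 for _ in iter_projects("https://gitlab.gnome.org"))
        print(f"{n} public projects")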

Actors

  • DiverSE (Inria)