Collect Project Context data (issues, PRs, ...)
Main steps
GitHub
- Integrate GHArchive raw data (from 2015 to 2025)
- Adapt GHArchive (mainly the "cron") for raw-data collection; transform the data to fit the SWH model; experiment with the GitHub API to keep the data up to date (some events are delayed by more than 6 hours, e.g. 23 days!); see the fetch sketch after this list
- Connect with the SWH data model => meetings/iterations with 2.3, based on the GitHub API metamodel, confronted with and discussed w.r.t. the SWH data model; three main patterns (see the mapping sketch after this list): (1) keep GitHub concepts "as is"; (2) refine GitHub concepts toward more generality; (3) GitHub information out of scope (philosophy: keep the raw data so there are no regrets)
- POC swh-project-context implemented: equivalent to gharchive.org for hourly files, deployed on a laptop (no RAM or CPU issues)
- Software architecture for swh-project-context almost done => open questions about the data model (see UML diagram below); next steps:
  - refactor the POC to implement the new architecture
  - collect sizing metrics (RAM, CPU, disk space); see the sizing probe sketch below
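
A minimal sketch of the hourly pull, assuming GHArchive's public https://data.gharchive.org/YYYY-MM-DD-H.json.gz layout (gzip-compressed, newline-delimited JSON, one GitHub event per line); the name fetch_hour and the driver loop are ours, not taken from the POC:

```python
import gzip
import io
import json
import urllib.request

GHARCHIVE_URL = "https://data.gharchive.org/{date}-{hour}.json.gz"

def fetch_hour(date: str, hour: int):
    """Download one hourly archive (e.g. date='2015-01-01', hour=0)
    and yield its newline-delimited JSON events."""
    url = GHARCHIVE_URL.format(date=date, hour=hour)
    with urllib.request.urlopen(url) as resp:
        with gzip.open(io.BytesIO(resp.read()), mode="rt", encoding="utf-8") as lines:
            for line in lines:
                yield json.loads(line)

# A "cron"-style driver would call fetch_hour for every elapsed hour,
# then re-poll the GitHub API later to catch delayed events (> 6 hours).
for event in fetch_hour("2015-01-01", 0):
    print(event["type"], event["repo"]["name"])
    break  # show only the first event of the hour
```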
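
To make the three patterns concrete, here is an illustrative routing sketch; the Pattern enum, the sample event types, and their classification are hypothetical, not the mapping agreed with 2.3:

```python
from enum import Enum

class Pattern(Enum):
    AS_IS = 1         # keep the GitHub concept unchanged
    GENERALIZED = 2   # refine into a forge-agnostic concept
    OUT_OF_SCOPE = 3  # not modeled; preserved only in the raw data

# Hypothetical classification of a few GitHub event types:
MAPPING = {
    "IssuesEvent": Pattern.GENERALIZED,       # issues exist on every forge
    "PullRequestEvent": Pattern.GENERALIZED,  # PR ~ GitLab MR
    "GollumEvent": Pattern.AS_IS,             # GitHub-specific wiki edits
    "SponsorshipEvent": Pattern.OUT_OF_SCOPE,
}

def route(event: dict) -> Pattern:
    # Unknown concepts default to OUT_OF_SCOPE: the raw data is kept
    # anyway, so nothing is lost ("no regrets").
    return MAPPING.get(event.get("type"), Pattern.OUT_OF_SCOPE)
```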
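
For the sizing-metrics step, a probe along these lines could be run next to the collector; psutil and shutil are real libraries, but sizing_snapshot and the chosen metrics are our assumptions:

```python
import shutil
import psutil  # third-party: pip install psutil

def sizing_snapshot(data_dir: str) -> dict:
    """Sample RAM, CPU and disk usage for the running collector."""
    proc = psutil.Process()  # the current (collector) process
    return {
        "rss_bytes": proc.memory_info().rss,            # resident memory
        "cpu_percent": psutil.cpu_percent(interval=1),  # system-wide CPU over 1 s
        "disk_used_bytes": shutil.disk_usage(data_dir).used,  # data volume usage
    }
```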
GitLab
- Implement a crawler based on the GitLab API to collect metrics and analyse GitLab volumetrics (see the crawler sketch after this list); conclusion: GitHub represents much more data volume than all other forges combined
- Collect metrics across all GitLab forges >> check whether the GitLab.com crawler requires adaptation
- Feed it with the list of SWH's supported GitLab instances
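
A sketch of what the volumetrics crawler could look like; /api/v4/projects and keyset pagination are documented GitLab v4 features, while count_projects and its aggregation are assumptions about the actual crawler:

```python
import requests

def count_projects(base_url: str, token: str | None = None) -> int:
    """Walk /api/v4/projects with keyset pagination and count projects."""
    session = requests.Session()
    if token:
        session.headers["PRIVATE-TOKEN"] = token
    url = f"{base_url}/api/v4/projects"
    params = {"pagination": "keyset", "order_by": "id",
              "sort": "asc", "per_page": 100}
    total = 0
    while url:
        resp = session.get(url, params=params, timeout=30)
        resp.raise_for_status()
        total += len(resp.json())
        # GitLab returns the next keyset page in the Link header.
        url = resp.links.get("next", {}).get("url")
        params = None  # the next URL already carries the query parameters
    return total

# e.g. count_projects("https://gitlab.com"); in practice the crawler is
# fed the list of GitLab instances known to SWH.
```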
Actors
- DiverSE (Inria)