Collect Project Context data (issues, PRs, ...)

Main steps

GitHub

  • Integrate GHarchive raw data (from 2015 to 2025)
  • Adapt GHarchive (mainly the "cron") for raw data; data transformation to fit SWH model; experiments with Github API (some events are delayed, > 6 hours eg 23 days!) to update
  • Connect with the SWH data model => meetings/iterations with 2.3, based on the GitHub API metamodel and confronted/discussed with respect to the SWH data model; three main patterns (pattern (2) is sketched after this list):
    • keep GitHub concepts "as is"
    • refine GitHub concepts to gain generality
    • leave GitHub information out of scope (philosophy: we keep the raw data so as to have no regrets)
  • POC swh-project-context implemented: equivalent to gharchive.org for hourly files, deployed on a laptop (no RAM or CPU issues); the hourly-file ingestion is sketched after this list
  • Software architecture for swh-project-context almost done => open questions about the data model (see the UML diagram below); next steps:
    • refactor the POC to implement the new architecture
    • collect sizing metrics (RAM, CPU, disk space)
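
As a rough illustration of the delay experiments mentioned above, the sketch below polls GitHub's documented public Events API and flags events older than the 6-hour threshold. It is a simplistic probe written for this note, not the actual experiment setup; unauthenticated requests are heavily rate-limited.

    from datetime import datetime, timezone
    import json
    import urllib.request

    # Public events feed; per_page=100 is the documented maximum.
    EVENTS_URL = "https://api.github.com/events?per_page=100"

    def event_delays():
        """Yield (event_id, delay_in_seconds) for the latest public events."""
        req = urllib.request.Request(
            EVENTS_URL, headers={"Accept": "application/vnd.github+json"}
        )
        now = datetime.now(timezone.utc)
        with urllib.request.urlopen(req) as resp:
            events = json.load(resp)
        for ev in events:
            # created_at is ISO 8601 with a trailing "Z" (UTC).
            created = datetime.fromisoformat(ev["created_at"].replace("Z", "+00:00"))
            yield ev["id"], (now - created).total_seconds()

    if __name__ == "__main__":
        for event_id, delay in event_delays():
            if delay > 6 * 3600:  # flag events older than the 6-hour threshold
                print(f"event {event_id} delayed by {delay / 3600:.1f} h")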
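
The sketch below illustrates what pattern (2) could look like: a GitHub issue refined into a more general, forge-agnostic record, with the full raw payload kept alongside (the "no regrets" philosophy). All names here (Ticket, ticket_from_github_event) are hypothetical, not the actual SWH/2.3 data model; the payload fields are those of GHArchive's IssuesEvent.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class Ticket:
        """Hypothetical forge-agnostic refinement of a GitHub issue."""
        origin: str          # URL of the forge repository the ticket came from
        external_id: str     # native id on the forge, kept as-is (pattern 1)
        title: str
        state: str           # "open" / "closed", normalized across forges
        created_at: datetime
        raw: dict            # full original payload, kept "to not have regrets"

    def ticket_from_github_event(event: dict) -> Ticket:
        """Map a GHArchive IssuesEvent to the generic Ticket."""
        issue = event["payload"]["issue"]
        return Ticket(
            origin=event["repo"]["url"],
            external_id=str(issue["number"]),
            title=issue["title"],
            state=issue["state"],
            created_at=datetime.fromisoformat(
                issue["created_at"].replace("Z", "+00:00")
            ),
            raw=event,
        )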
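
A minimal sketch of the hourly-file ingestion the POC performs, assuming GHArchive's documented URL scheme (https://data.gharchive.org/YYYY-MM-DD-H.json.gz, with the hour not zero-padded); the event-type count at the end is purely illustrative.

    import gzip
    import json
    import urllib.request

    # GHArchive publishes one gzipped JSON-lines file per hour.
    GHARCHIVE_URL = "https://data.gharchive.org/{date}-{hour}.json.gz"

    def fetch_hourly_events(date: str, hour: int):
        """Yield raw GitHub events from one GHArchive hourly file.

        `date` is "YYYY-MM-DD"; `hour` is 0-23 (no zero padding in the URL).
        """
        url = GHARCHIVE_URL.format(date=date, hour=hour)
        with urllib.request.urlopen(url) as resp:
            # The HTTP response is a file-like object gzip can decompress.
            with gzip.open(resp, mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    yield json.loads(line)

    if __name__ == "__main__":
        # Count events per type for a single hour.
        from collections import Counter
        counts = Counter(ev["type"] for ev in fetch_hourly_events("2015-01-01", 15))
        print(counts.most_common(5))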

GitLab

  • Implement a crawler based on the GitLab API to collect metrics and analyse GitLab data volumes (a crawler sketch follows this list)
    • conclusion: GitHub represents far more data volume than all the other forges combined
  • Collect metrics across all GitLab forges => check whether the GitLab.com crawler requires adaptation
  • Feed it with the list of GitLab instances supported by SWH
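
A minimal sketch of what such a crawler could look like against any GitLab instance, using the documented /api/v4/projects endpoint with keyset pagination; the instance URL in the usage example is an arbitrary choice, and a real run would need rate-limit handling plus authentication for non-public data.

    import json
    import urllib.request
    from urllib.parse import urlencode

    def iter_projects(base_url: str, per_page: int = 100):
        """Iterate over the public projects of a GitLab instance.

        `base_url` is e.g. "https://gitlab.com" or a self-hosted instance.
        """
        params = {
            "pagination": "keyset",
            "order_by": "id",
            "sort": "asc",
            "per_page": per_page,
        }
        url = f"{base_url}/api/v4/projects?{urlencode(params)}"
        while url:
            with urllib.request.urlopen(url) as resp:
                projects = json.load(resp)
                # Keyset pagination advertises the next page in the Link header.
                link = resp.headers.get("Link", "")
            yield from projects
            url = None
            for part in link.split(","):
                if 'rel="next"' in part:
                    url = part[part.find("<") + 1 : part.find(">")]

    if __name__ == "__main__":
        # Rough volumetric probe: count public projects on one instance.
        n = sum(1 for _ in iter_projects("https://gitlab.gnome.org"))
        print(f"{n} public projects")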

Actors

  • DiverSE (Inria)