
Publish a "File Dataset"

This should generalize what we currently call the "popular file names" dataset, which contains the columns SWHID, length, filename, and occurrences, by adding the timestamp of the first occurrence of each content, computed using swh_graph_provenance.

The rationale for using a single dataset for this varied metadata is to avoid duplicating the set of SWHIDs across multiple datasets, as it takes ~700GB every time. It also makes things easier for consumers, who would otherwise need to join CSV tables whenever they wanted to use data spread across multiple datasets.
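
To illustrate the join that consumers would otherwise have to do themselves, here is a minimal sketch. Everything in it is an assumption made for illustration: the sample rows, the `first_occurrence` column name, the timestamp format, and the `join_on_swhid` helper are hypothetical and do not describe the actual dataset layout.

```python
import csv
import io

# Hypothetical miniature of the existing "popular file names" table.
FILENAMES_CSV = """SWHID,length,filename,occurrences
swh:1:cnt:0000000000000000000000000000000000000001,12,README.md,40
swh:1:cnt:0000000000000000000000000000000000000002,7,Makefile,3
"""

# Hypothetical separate table holding first-occurrence timestamps.
# The column name and timestamp format are assumptions.
FIRST_SEEN_CSV = """SWHID,first_occurrence
swh:1:cnt:0000000000000000000000000000000000000001,2015-03-02T10:00:00Z
"""


def join_on_swhid(left_csv, right_csv):
    """Left-join two CSV texts on their SWHID column."""
    # Index the right-hand table by SWHID (in memory, for illustration only).
    right = {
        row["SWHID"]: row
        for row in csv.DictReader(io.StringIO(right_csv))
    }
    joined = []
    for row in csv.DictReader(io.StringIO(left_csv)):
        extra = right.get(row["SWHID"], {})
        # Contents with no known timestamp get an empty field.
        joined.append(
            {**row, "first_occurrence": extra.get("first_occurrence", "")}
        )
    return joined


rows = join_on_swhid(FILENAMES_CSV, FIRST_SEEN_CSV)
for row in rows:
    print(row)
```

An in-memory index like this would not scale to the full set of SWHIDs, which is part of why shipping a single pre-joined dataset is more convenient for consumers.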

For the 2024-05-16 dataset, this will be backfilled from the existing "popular file names" dataset (because it already exists) and the heads-only provenance (so that it can serve as a reproduction package for @AdeleDesmazieres's internship report). Future versions of the "File Dataset" will compute it directly, and will use provenance from all revisions and releases.

Edited by vlorentz