Define a set of rules / heuristics to filter "noise" revisions
The current implementation of the provenance index performs no filtering at all when ingesting revisions. At the same time, the provenance algorithm is very sensitive to erroneous/rogue commit dates: a single revision carrying a very early commit date for existing (legitimate) content can spoof the whole "first occurrence" provenance index, and any revision with a mistakenly (or intentionally) wrong date can move the recorded first occurrence to an arbitrary point in time.
However, a significant proportion of the revisions in the archive are noise from a provenance perspective:
- revisions coined by a VCS conversion tool (e.g. svn2git, git-cinnabar, ...) that uses a made-up commit date (typically EPOCH),
- revisions from "toy/experiment" repositories (e.g. https://github.com/illacceptanything/illacceptanything, https://github.com/d-xo/untrustix-git-testdata, or many other script-generated repositories),
- revisions tracking non-software (historical) material (French code of law, US constitution, etc.),
- genuine mistakes by a VCS user (wrong commit date, e.g. two swapped fields such as year and month; CS student repositories; etc.),
- revisions coined intentionally with wrong dates on legitimate content (e.g. xxx).
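A first, cheap heuristic covering several of the cases above is a plausibility check on the commit date itself. The sketch below is only illustrative (the function name and the `EARLIEST_PLAUSIBLE` cutoff are assumptions, not part of the current implementation): it rejects dates at or before EPOCH (the typical default of conversion tools), implausibly early dates, and dates in the future.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumed cutoff: commit timestamps earlier than this cannot be genuine,
# since no VCS-recorded software history predates the early 1970s.
EARLIEST_PLAUSIBLE = datetime(1971, 1, 1, tzinfo=timezone.utc)

def is_plausible_commit_date(date: datetime) -> bool:
    """Return False for commit dates that are almost certainly noise."""
    now = datetime.now(timezone.utc)
    if date <= EPOCH:
        # Conversion tools (svn2git, git-cinnabar, ...) often default to EPOCH.
        return False
    if date < EARLIEST_PLAUSIBLE:
        # Earlier than any plausible VCS history (swapped fields, typos, ...).
        return False
    if date > now:
        # A commit date in the future is necessarily wrong.
        return False
    return True
```

Such a check would not catch noisy-but-plausible dates (toy repositories, intentionally backdated content); those still need origin-level filtering.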
List of known very noisy origins:
- https://github.com/Denton-L/00
- https://github.com/xwvvvvwx/untrustix-git-testdata
- https://github.com/illacceptanything/illacceptanything
- https://github.com/cirosantilli/test-many-commits-10m
- https://github.com/sujayshah/CProgramming
XXX to be continued
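For origins that are known to be noisy regardless of their commit dates, a simple denylist lookup at ingestion time could complement the date heuristics. The sketch below is hypothetical (the helper name is illustrative); its denylist just copies the origins listed above.

```python
# Known very noisy origins, copied from the list above.
NOISY_ORIGINS = {
    "https://github.com/Denton-L/00",
    "https://github.com/xwvvvvwx/untrustix-git-testdata",
    "https://github.com/illacceptanything/illacceptanything",
    "https://github.com/cirosantilli/test-many-commits-10m",
    "https://github.com/sujayshah/CProgramming",
}

def should_skip_origin(origin_url: str) -> bool:
    """Hypothetical helper: skip revisions coming from a denylisted origin.

    Trailing slashes are normalized so equivalent URLs match.
    """
    return origin_url.rstrip("/") in NOISY_ORIGINS
```

A static denylist obviously does not scale to the whole archive; it is best seen as a stopgap while automated heuristics (date plausibility, repository shape, commit-rate anomalies) are developed.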