Ideas/notes about building a reduced provenance index:
filters related to revision date:
no commit date <= (EPOCH + epsilon)
no commit date <= (VCS birth date), e.g. no git commit dated before git's birth date; but this makes provenance ignore history rebuilt a posteriori ("burn baby burn")
no commit date in the future (but these are not an issue in terms of provenance, just probably useless content)
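The three date filters above can be sketched as a single predicate. This is only an illustration: the function name `keep_revision`, the one-day value for epsilon and the `GIT_BIRTH` constant are assumptions, not something fixed by these notes.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumption: epsilon is one day; the notes do not fix a value.
EPSILON = timedelta(days=1)
# Assumption: git's "birth date" taken as the day of the first commit
# in git's own repository (2005-04-07).
GIT_BIRTH = datetime(2005, 4, 7, tzinfo=timezone.utc)


def keep_revision(date, now, vcs_birth=None, epsilon=EPSILON):
    """Return True if a revision with this commit date passes the date filters."""
    if date <= EPOCH + epsilon:
        return False  # at or around EPOCH: most likely a buggy date
    if vcs_birth is not None and date <= vcs_birth:
        return False  # older than the VCS itself (drops rebuilt history too)
    if date > now:
        return False  # dated in the future
    return True
```

Passing `vcs_birth=None` disables the VCS birth date filter, which is the debatable one since it discards history rebuilt a posteriori.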
select ingested revisions:
releases
VCS branch heads, pruning the noisy ones (MRs, review- and CI-related ones)
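A minimal sketch of this selection, assuming branches are available as a mapping from ref name to revision id. The prefix list is an assumption based on ref namespaces commonly used by forges (GitHub pull requests, GitLab merge requests and pipelines, Gerrit changes), and tags are used here as a stand-in for releases; `select_heads` is a hypothetical helper.

```python
from typing import Dict

# Assumption: ref namespaces that usually carry review/CI noise.
NOISY_REF_PREFIXES = (
    "refs/pull/",            # GitHub pull requests
    "refs/merge-requests/",  # GitLab merge requests
    "refs/pipelines/",       # GitLab CI pipelines
    "refs/changes/",         # Gerrit changes under review
)
# Assumption: regular branch heads, plus tags as a proxy for releases.
KEPT_REF_PREFIXES = ("refs/heads/", "refs/tags/")


def select_heads(branches: Dict[str, str]) -> Dict[str, str]:
    """Keep branch heads and tags, dropping review- and CI-related refs."""
    kept = {}
    for name, target in branches.items():
        if name.startswith(NOISY_REF_PREFIXES):
            continue
        if name.startswith(KEPT_REF_PREFIXES):
            kept[name] = target
    return kept
```

Anything outside the two kept namespaces is dropped as well, so the noisy prefix list mostly documents intent; it would matter if the kept list were widened later.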
use some a priori knowledge to white/black list revisions(/origins?):
prune content for which we know there are a lot of references (typically a license file), as identified by the "most famous content" datasets
Note that for the dates filtering:
filtering on commit date <= EPOCH is debatable (same argument as for VCS birth date based filtering)
the main issue is to identify and prune buggy/rogue revisions; buggy revisions are mostly in the [EPOCH, EPOCH+1day] range (typically a badly configured VCS conversion tool); rogue ones are harder to identify
a possible solution could also involve whitelisting revisions whose commit date is < EPOCH from a list of known origins (e.g. the rebuilt git history of a famous vintage code base, like Apollo)
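The whitelisting idea could look like the sketch below. The `classify` helper, its three labels and the origin URL are all hypothetical; a real list of known origins would have to be curated by hand.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Hypothetical whitelist of origins known to carry rebuilt vintage history.
PRE_EPOCH_ORIGINS = frozenset({"https://example.org/vintage/apollo.git"})


def classify(date, origins):
    """Label a revision as 'buggy', 'suspect' or 'keep' from its commit date."""
    if EPOCH <= date <= EPOCH + timedelta(days=1):
        return "buggy"  # typical output of a badly configured conversion tool
    if date < EPOCH:
        # Pre-EPOCH dates are only trusted when they come from a known origin.
        if PRE_EPOCH_ORIGINS.intersection(origins):
            return "keep"
        return "suspect"
    return "keep"
```

The "suspect" label does not claim the revision is rogue; it only marks it as a candidate for closer inspection, since rogue revisions cannot be identified from the date alone.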
For the record, in the current complete provenance index (2.3B ingested revisions so far), around EPOCH we have:
date         number of revs
1970-01-01   248519
1970-01-02   29204
1970-01-03   477
1970-01-04   432
1970-01-05   426
1970-01-06   415
1970-01-07   406
1970-01-08   465
1970-01-09   433
1970-01-10   394
1970-01-11   364
1970-01-12   440
1970-01-13   418
1970-01-14   397
1970-01-15   350
1970-01-16   377
1970-01-17   374
1970-01-18   393
1970-01-19   402
1970-01-20   434
1970-01-21   413
1970-01-22   430
1970-01-23   373
1970-01-24   403
1970-01-25   346
1970-01-26   382
1970-01-27   384
1970-01-28   388
1970-01-29   400
1970-01-30   396
1970-01-31   5405
So keeping only revisions dated 1970-01-03 or later seems good enough; the 1970-01-31 count (5405) is suspicious, and more digging might be useful.
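One way to make "suspicious" concrete is to compare each day's count against the median of the month. The sketch below uses the counts from the table above; the factor of 5 and the `suspicious_days` name are arbitrary choices for illustration, not a validated threshold.

```python
from statistics import median

# Revision counts for 1970-01-01 .. 1970-01-31, copied from the table above.
COUNTS = [
    248519, 29204, 477, 432, 426, 415, 406, 465, 433, 394,
    364, 440, 418, 397, 350, 377, 374, 393, 402, 434,
    413, 430, 373, 403, 346, 382, 384, 388, 400, 396, 5405,
]


def suspicious_days(counts, factor=5):
    """Return the January 1970 days whose count exceeds factor * median."""
    baseline = median(counts)  # 403 for the counts above
    return [
        f"1970-01-{day:02d}"
        for day, n in enumerate(counts, start=1)
        if n > factor * baseline
    ]


print(suspicious_days(COUNTS))
# prints ['1970-01-01', '1970-01-02', '1970-01-31']
```

The median is used rather than the mean because the first two days would otherwise dominate the baseline.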
Query:
with toto as (
  select sha1, date_trunc('day', date) as date
  from revision
  where date < '1970-02-01'
)
select date, count(sha1) from toto group by date order by date;
Here is a more complete version of the query above: revs-by-date