Open Specify the filters to apply and the values to apply for a first run

    Open Task created by Benoit Chauvet

    Ideas/notes about building a reduced provenance index:

    • filters related to revision date:

      • no commit date <= (EPOCH + epsilon)
      • no commit date <= (VCS birth date), e.g. no git commit earlier than git's birth date; but this makes provenance ignore history rebuilt a posteriori ("burn baby burn")
      • no commit date in the future (but these are not an issue in terms of provenance, just probably useless content)
    • select ingested revisions:

      • releases
      • VCS branch heads, pruning the noisy ones (MR, review and CI related ones)
    • use some a priori knowledge to white/black list revisions(/origins?):

      • prune content for which we know there are a lot of references (typically a license file), as identified by the "most famous content" datasets
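
The branch-head pruning idea above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the patterns assume the ref namespaces commonly used by GitHub (`refs/pull/`), GitLab (`refs/merge-requests/`) and Gerrit (`refs/changes/`); the actual list of noise refs would have to be established from the archive.

```python
import re
from typing import Dict

# Assumed "noise" ref namespaces (review/MR refs); not an exhaustive list.
NOISE_REF_PATTERNS = [
    re.compile(r"^refs/pull/\d+/(head|merge)$"),            # GitHub pull requests
    re.compile(r"^refs/merge-requests/\d+/(head|merge)$"),  # GitLab merge requests
    re.compile(r"^refs/changes/"),                          # Gerrit changes
]


def is_noise_ref(ref_name: str) -> bool:
    """Return True if the ref name matches one of the assumed noise patterns."""
    return any(p.search(ref_name) for p in NOISE_REF_PATTERNS)


def select_heads(branches: Dict[str, str]) -> Dict[str, str]:
    """Keep releases and regular branch heads, drop review-related refs.

    `branches` maps a ref name to the revision id it points to.
    """
    return {name: rev for name, rev in branches.items() if not is_noise_ref(name)}
```

CI-related branches have no standard namespace, so they would probably need per-forge or per-origin patterns added to the list.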

    Note that for the date filtering:

    • filtering on commit date <= EPOCH is debatable (same argument as for the filtering based on VCS birth date)
    • the main issue is to identify and prune buggy/rogue revisions; buggy revisions are mostly in the [EPOCH, EPOCH+1day] range (typically a badly configured VCS conversion tool); rogue ones are harder to identify
    • a possible solution could also involve whitelisting revisions whose commit date is < EPOCH from a list of known origins (e.g. rebuilt git history of a famous vintage code base, like Apollo)
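
A minimal sketch of such a date filter, assuming timezone-aware commit dates. The function name, the `vintage_origins` parameter and the default `epsilon` of two days are placeholders chosen for illustration, not values settled in this task.

```python
from datetime import datetime, timedelta, timezone
from typing import FrozenSet, Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def keep_revision(
    commit_date: datetime,
    origin: str,
    *,
    epsilon: timedelta = timedelta(days=2),          # placeholder value
    vintage_origins: FrozenSet[str] = frozenset(),   # hypothetical whitelist
    now: Optional[datetime] = None,
) -> bool:
    """Decide whether a revision passes the date filters.

    `commit_date` must be timezone-aware, otherwise the comparisons raise.
    """
    now = now or datetime.now(timezone.utc)
    if commit_date > now:
        # Future dates: harmless for provenance, but probably useless content.
        return False
    if commit_date < EPOCH:
        # Pre-EPOCH dates are kept only for whitelisted (rebuilt-history) origins.
        return origin in vintage_origins
    # Drop the [EPOCH, EPOCH + epsilon] range, where buggy revisions cluster.
    return commit_date > EPOCH + epsilon
```

Rogue revisions with plausible-looking dates are not caught by this kind of predicate; it only addresses the buggy cluster around EPOCH.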

    For the record, in the current complete provenance index (2.3B ingested revisions so far), around EPOCH we have:

    date number of revs
    1970-01-01 248519
    1970-01-02 29204
    1970-01-03 477
    1970-01-04 432
    1970-01-05 426
    1970-01-06 415
    1970-01-07 406
    1970-01-08 465
    1970-01-09 433
    1970-01-10 394
    1970-01-11 364
    1970-01-12 440
    1970-01-13 418
    1970-01-14 397
    1970-01-15 350
    1970-01-16 377
    1970-01-17 374
    1970-01-18 393
    1970-01-19 402
    1970-01-20 434
    1970-01-21 413
    1970-01-22 430
    1970-01-23 373
    1970-01-24 403
    1970-01-25 346
    1970-01-26 382
    1970-01-27 384
    1970-01-28 388
    1970-01-29 400
    1970-01-30 396
    1970-01-31 5405

    So filtering from 1970-01-03 seems good enough; 1970-01-31 is suspicious; more digging might be useful.
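
One way to make "suspicious" slightly more systematic is to compare each day against the median daily count. The sketch below uses the figures from the table above; the helper name and the factor of 5 are arbitrary assumptions, not a validated threshold.

```python
from statistics import median
from typing import Dict, List

# Daily revision counts for 1970-01-01 .. 1970-01-31, copied from the table.
COUNTS = [
    248519, 29204, 477, 432, 426, 415, 406, 465, 433, 394,
    364, 440, 418, 397, 350, 377, 374, 393, 402, 434,
    413, 430, 373, 403, 346, 382, 384, 388, 400, 396,
    5405,
]
REVS_BY_DAY = {f"1970-01-{day:02d}": n for day, n in enumerate(COUNTS, start=1)}


def suspicious_days(counts: Dict[str, int], factor: float = 5.0) -> List[str]:
    """Return the days whose count exceeds `factor` times the median count."""
    threshold = factor * median(counts.values())
    return sorted(day for day, n in counts.items() if n > threshold)


print(suspicious_days(REVS_BY_DAY))
```

With these figures, the flagged days are 1970-01-01, 1970-01-02 and 1970-01-31, which matches the reading above.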

    Query:

    with toto as (
      select sha1, date_trunc('day', date) as date
      from revision
      where date < '1970-02-01'
    )
    select date, count(sha1) from toto group by date order by date;
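
For experimenting on a sample of revisions outside the database, the same per-day aggregation can be approximated in memory. This is a sketch under assumptions: `revs_by_day` is a hypothetical helper, dates are timezone-aware, and each date is truncated in its own timezone, which may differ from what `date_trunc` does under the database session timezone.

```python
from collections import Counter
from datetime import date, datetime, timezone
from typing import Iterable

CUTOFF = datetime(1970, 2, 1, tzinfo=timezone.utc)


def revs_by_day(commit_dates: Iterable[datetime], before: datetime = CUTOFF) -> "Counter[date]":
    """Count revisions per calendar day for commit dates strictly before `before`."""
    return Counter(d.date() for d in commit_dates if d < before)
```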

    Here is a more complete version of the query above: revs-by-date

    See swh/devel/swh-provenance#4706

    Edited by David Douard