Provenance in production [Roadmap - Tooling and infrastructure]
Milestone ID: 85

  • Lead: douardda
  • Priority: high
  • Effort: ??

Description:

Publish the swh-provenance services in production, including the revision and origin layers.

Included work:

  • Build and deploy a content index based on a winnowing algorithm
  • Filter the provenance pipeline to process only tags and releases
  • Set up a production infrastructure for the Kafka-based revision layer (including monitoring)
  • Refactor and process the origin layer
  • Release the provenance documentation

KPIs:

  • Provenance services available in production
  • % of archive covered

Grooming meeting - 2023-04-12

Wrap-up

Origin of the project:

  • Licence compliance tools

Principles

Identify a piece of code against a reference corpus

  • basic approach: index of all occurrences

  • smarter approach: identify the first occurrence + all other occurrences (scale problem)

    • browse the reverse graph (potentially huge processing time)
    • trade-off: keep the first occurrence and iterate on the others if needed
    • constraint: growth of the archive
  • World of Code (code search tool, but restricted access)

    • their solution: heuristics on the first discovered date, first date after Subversion, first date after Git
  • ScanOSS: snippet detection (open source, weak documentation)

    • based on the DMOS(?) algorithm
    • snippet detection database ~5 TB
  • optimization: filter on releases only

    • is the performance gain significant enough? Query the graph to compare content counts from origins vs. releases
    • con: some content will be lost / missed
    • to avoid this loss -> use the winnowing algorithm (see the sketch after this list)
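
For reference, winnowing (Schleimer et al., SIGMOD 2003, the algorithm behind MOSS) picks a deterministic subset of k-gram hashes: hash every k-gram of the normalized text, then keep the minimum hash of each sliding window of hashes. A minimal sketch, assuming CRC32 as the hash (as in the POC below); the K and W values are illustrative, not project settings:

```python
# Minimal winnowing sketch (Schleimer et al. 2003). CRC32 matches the POC
# described below; K and W are illustrative assumptions.
import re
import zlib

K = 32  # k-gram length in characters (assumption)
W = 16  # winnowing window size (assumption)

def normalize(text: str) -> str:
    # POC-level formatting: remove all whitespace.
    return re.sub(r"\s+", "", text)

def fingerprints(text: str) -> set[int]:
    doc = normalize(text)
    if len(doc) < K:
        return set()
    # CRC32 of every k-gram of the normalized document.
    hashes = [zlib.crc32(doc[i:i + K].encode()) for i in range(len(doc) - K + 1)]
    # Keep the minimum hash of each window of W consecutive k-gram hashes;
    # any match of length >= K + W - 1 is then guaranteed to share a fingerprint.
    return {min(hashes[i:i + W]) for i in range(max(len(hashes) - W + 1, 1))}
```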

Current project status

  • Running on met at 5 revisions/second

  • It was OK on MMCA (OVH); recurrent failures on met

  • Testing of a version using only the releases is in progress: still slow (not using RabbitMQ, contention on the DB)

  • POC of the winnowing algorithm

    • current file formatting is basic (only removal of all whitespace)
  • ~280 fingerprints (CRC32) per file

  • Estimated 3×10^12 lines in the database ≈ 75 TB with line numbers

  • Stored in Redis

  • Search algo (see the sketch after this list):

    • search for the matching snippets
    • sort by number of matching snippets
  • file headers introduce some bias: heuristics need to be introduced

  • A real matching file is a file with several snippets matching in the same order

  • MIME type indexer: not reliable enough >> when needed, MIME types must be computed
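
A hypothetical sketch of the Redis reverse index and lookup described above; the key layout and identifiers are assumptions, not the actual POC schema. It ranks by match count only; the same-order verification would run on the top candidates:

```python
# Hypothetical reverse index in Redis: one set per fingerprint, listing the
# files (e.g. SWHIDs) it appears in; lookup ranks candidate files by the
# number of matching snippets.
from collections import Counter

import redis

r = redis.Redis()  # assumes a reachable Redis instance

def index_file(file_id: str, fps: set[int]) -> None:
    for fp in fps:
        r.sadd(f"fp:{fp}", file_id)

def search(query_fps: set[int], limit: int = 10) -> list[tuple[str, int]]:
    counts: Counter[str] = Counter()
    for fp in query_fps:
        for member in r.smembers(f"fp:{fp}"):
            counts[member.decode()] += 1
    return counts.most_common(limit)
```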

Possible applications

  • Extension of swh-scanner

  • MIME type computation (a sketch of on-demand computation follows this list):

    • During dataset generation -> additional table in Athena
    • Stored somewhere in a table to avoid a full recomputation each time a graph is compressed
    • Computation possible on the ENEA cluster or the HPC?
  • Extend the swh-graph API to allow querying the most common file name table
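
Since the indexer's output is not reliable enough, on-demand MIME type computation could look like the following sketch, assuming python-magic (a libmagic wrapper) is acceptable; reading only a 4 KiB prefix is an illustrative choice:

```python
# On-demand MIME type detection with python-magic (wraps libmagic).
import magic

def mime_type(data: bytes) -> str:
    # libmagic only needs a prefix of the content to classify it.
    return magic.from_buffer(data[:4096], mime=True)

print(mime_type(b"#!/bin/sh\necho hello\n"))  # e.g. "text/x-shellscript"
```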

Specify the usage of winnowing/fingerprints in the context of Provenance

  • Build a reverse index of all fingerprints with refs to the files where they appear
  • The heuristic for code snippets is clearly defined (ScanOSS recommendations)
  • Filter out files (under a minimum size, identified data files...); see the sketch after this list
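
A sketch of such a pre-index filter; the size threshold and extension list are illustrative assumptions to be tuned, not agreed project settings:

```python
# Illustrative pre-index filter: drop tiny files, non-text content, and
# obvious data files. All thresholds and lists here are assumptions.
MIN_SIZE = 512  # bytes (assumption)
DATA_EXTENSIONS = (".json", ".csv", ".svg", ".lock", ".min.js")

def should_index(name: str, size: int, mime: str) -> bool:
    if size < MIN_SIZE:
        return False
    if not mime.startswith("text/"):
        return False  # only fingerprint textual content
    return not name.endswith(DATA_EXTENSIONS)
```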

Define and validate the targeted architecture

  • what do we use as input for a reduced provenance index?

    • All package managers
    • releases
    • tags (where applicable) -> every concrete branch of a snapshot
  • Git (see the sketch after this list):

    • if tags, keep the tags
    • else, keep the head
  • Other VCS:

    • keep the head
  • Use the most common file name dataset to identify the MIME type

    • import the content occurrence counts pre-computed in Athena into an ad-hoc table
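
The selection rule above could look like this sketch, where `branches` maps branch names to target ids as in a Software Heritage snapshot; the function name and shapes are assumptions:

```python
# Sketch of the input-selection rule: for git, keep the tags when there are
# any, otherwise the head; for other VCS, keep the head only.
def select_branches(branches: dict[str, str], vcs: str) -> dict[str, str]:
    if vcs == "git":
        tags = {name: target for name, target in branches.items()
                if name.startswith("refs/tags/")}
        if tags:
            return tags
    head = branches.get("HEAD")
    return {"HEAD": head} if head is not None else {}
```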

Action plan, task identification

  • Define the heuristics of the input filters for the provenance index (tags, exclude epoch +/- a given range, content sizes, number of occurrences, MIME types, file names...); see the epoch-filter sketch after this list

  • Define the infrastructure / pipeline

  • Refactor provenance code to handle input filters

  • Run the reduced index generation

  • Build a winnowing index on the scope of provenance indexed data

  • Run and test with a set of rules

    • Identify the interesting metrics
    • Identify known projects for testing
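
As an example of the epoch filter named in the first task: revision dates sitting close to the Unix epoch (1970-01-01) are usually bogus and would poison first-occurrence dates. A sketch, with an illustrative one-year range:

```python
# Reject author dates within +/- EPOCH_RANGE_S of the Unix epoch; such
# timestamps are almost always bogus. The one-year range is an assumption.
EPOCH_RANGE_S = 365 * 24 * 3600

def plausible_date(author_timestamp: int) -> bool:
    return abs(author_timestamp) > EPOCH_RANGE_S
```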

Misc

  • Prepare a possible scenario to compute the MIME types on the ENEA servers/HPC
  • Repair the pipeline to keep running the current index on met