Provenance in production [Roadmap - Tooling and infrastructure]
- Lead: douardda
- Priority: high
- Effort: ??
Description:
Publish swh-provenance services in production, including revision and origin layers.
Includes work:
- Build and deploy a content index based on a winnowing algorithm
- Filter the provenance pipeline to process only tags and releases
- Set up a production infrastructure for the Kafka-based revision layer (including monitoring)
- Refactor and process the origin layer
- Release provenance documentation
KPIs:
- Provenance services available in production
- % of archive covered
Grooming meeting - 2023-04-12 - wrap-up
Origin of the project:
- Licence compliance tools
Principles
Identify a piece of code against a reference corpus
- basic approach: an index of all occurrences
- smarter approach: identify the first occurrence + all the other occurrences (scaling problem)
  - browse the reverse graph (potentially huge processing time)
  - trade-off: keep the first occurrence and iterate on the others if needed (see the sketch after this list)
  - constraint: the growth of the archive
- World of Code (code search tool, but restricted access)
  - solution: heuristics on first discovered date, first date after Subversion, first date after Git
- ScanOSS: snippet detection (open source, sparse documentation)
  - based on the DMOS(?) algorithm
  - snippet detection database: ~5 TB
- optimization: filter on releases only
  - is the performance gain significant enough? Query the graph to compare content counts from origins vs. releases
  - cons: some content will be lost / missed
  - to avoid this -> use the winnowing algorithm
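A minimal sketch of that trade-off, assuming a hypothetical pre-computed first-occurrence table with an expensive fallback over all occurrences; all names and types here are illustrative, not the actual swh-provenance schema:

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Occurrence:
    # illustrative record: where and when a content was seen
    content_swhid: str
    revision_swhid: str
    date: str  # ISO 8601 timestamp


# illustrative in-memory stand-ins for the provenance tables
FIRST_OCCURRENCE: dict[str, Occurrence] = {}
ALL_OCCURRENCES: dict[str, list[Occurrence]] = {}


def first_occurrence(content_swhid: str) -> Optional[Occurrence]:
    """Cheap path: return only the pre-computed first occurrence."""
    return FIRST_OCCURRENCE.get(content_swhid)


def all_occurrences(content_swhid: str) -> Iterator[Occurrence]:
    """Expensive path: iterate over every known occurrence, only when needed."""
    yield from ALL_OCCURRENCES.get(content_swhid, [])
```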
Current project status
- Running on met at 5 revisions/second
- It was OK on MMCA (OVH), recurrent failures on met
- Testing in progress of a version using only the releases: still slow (not using RabbitMQ, contention on the DB)
- POC of the winnowing algorithm (sketched below)
  - current file formatting is basic (only removal of all whitespace)
  - ~280 fingerprints (CRC32) per file
  - estimated 3×10^12 lines in the database = ~75 TB with line numbers
  - stored in Redis
  - search algorithm:
    - search for the matching snippets
    - sort by number of matching snippets
  - file headers introduce some bias: heuristics need to be added
  - a real matching file is one whose matching snippets appear in the same order
- MIME type indexer: not reliable enough -> when needed, MIME types must be computed
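The notes do not record the POC's exact parameters, so the following is only a minimal sketch of winnowing-style fingerprinting and snippet search, with assumed values for the k-gram and window sizes and a plain dict standing in for the Redis reverse index:

```python
import zlib
from collections import defaultdict

K = 20  # k-gram length in characters (assumed; not given in the notes)
W = 16  # winnowing window size (assumed)


def normalize(source: bytes) -> bytes:
    """Basic formatting as in the POC: remove all whitespace."""
    return source.translate(None, b" \t\r\n\f\v")


def fingerprints(source: bytes, k: int = K, w: int = W) -> list[tuple[int, int]]:
    """Winnowing: hash every k-gram with CRC32 and keep, for each sliding
    window of w consecutive k-grams, the rightmost minimal hash.
    Returns (hash, offset in normalized text) pairs."""
    text = normalize(source)
    if len(text) < k:
        return []
    hashes = [zlib.crc32(text[i:i + k]) for i in range(len(text) - k + 1)]
    selected: list[tuple[int, int]] = []
    previous = None
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        pos = start + len(window) - 1 - window[::-1].index(smallest)
        if (smallest, pos) != previous:
            selected.append((smallest, pos))
            previous = (smallest, pos)
    return selected


# Reverse index: fingerprint -> {(file swhid, offset)}; in the POC this
# lives in Redis, a dict keeps the sketch self-contained.
INDEX: dict[int, set[tuple[str, int]]] = defaultdict(set)


def index_file(swhid: str, source: bytes) -> None:
    for h, pos in fingerprints(source):
        INDEX[h].add((swhid, pos))


def search(source: bytes) -> list[tuple[str, int, int]]:
    """Rank candidate files by number of matching snippets, then by the
    longest run of snippets matching in the same order (a proxy for a
    "real" match as described above)."""
    matches: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for qpos, (h, _) in enumerate(fingerprints(source)):
        for swhid, fpos in INDEX.get(h, ()):
            matches[swhid].append((qpos, fpos))
    ranked = []
    for swhid, pairs in matches.items():
        pairs.sort()
        best = run = 0
        last = -1
        for _, fpos in pairs:
            run = run + 1 if fpos > last else 1
            best = max(best, run)
            last = fpos
        ranked.append((swhid, len(pairs), best))
    ranked.sort(key=lambda t: (t[1], t[2]), reverse=True)
    return ranked
```

The header bias noted above could be mitigated on top of this, for instance by ignoring fingerprints that occur in a very large number of files; that is one of the heuristics still to be defined.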
Possible applications
- Extension of swh-scanner
- MIME type computation (a sketch follows this list):
  - during dataset generation -> additional table in Athena
  - stored in a table somewhere to avoid a full recomputation each time a graph is compressed
  - computation possible on the ENEA cluster or the HPC?
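Where and how MIME types get computed is still open (Athena table, ENEA cluster, HPC); as a minimal sketch, an on-demand computation with libmagic via the python-magic binding could look like this (the function names and row layout are illustrative):

```python
import magic  # python-magic, a binding to libmagic


def mime_type(data: bytes) -> str:
    """Guess the MIME type of a raw content from its first bytes."""
    return magic.from_buffer(data[:4096], mime=True)


def compute_mime_types(contents):
    """Map (swhid, raw bytes) rows to (swhid, MIME type) rows, e.g. to fill
    an additional table next to the dataset (Athena or an ad-hoc table)."""
    for swhid, data in contents:
        yield swhid, mime_type(data)
```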
- Extend the swh-graph API to allow querying the most common file name table
Specify the usage of winnowing/fingerprints in the context of Provenance
- Constitute a reverse index of all fingerprints with references to the files where they appear
- The heuristic for code snippets is clearly defined (ScanOSS recommendations)
- Filter out files (under a minimum size, identified data files, ...); see the sketch below
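A minimal sketch of such a content filter; the thresholds, MIME prefixes, and excluded file names below are placeholders, the real heuristics are to be defined as part of the action plan:

```python
# Placeholder values; the actual heuristics are still to be defined.
MIN_SIZE = 64                  # bytes: too small to contain a meaningful snippet
MAX_SIZE = 10 * 1024 * 1024    # bytes: likely a data file above this
TEXT_MIME_PREFIXES = ("text/",)
EXCLUDED_NAMES = {"package-lock.json", "yarn.lock"}


def keep_for_fingerprinting(name: str, size: int, mime: str) -> bool:
    """Decide whether a content is worth fingerprinting at all."""
    if size < MIN_SIZE or size > MAX_SIZE:
        return False
    if not mime.startswith(TEXT_MIME_PREFIXES):
        return False
    if name in EXCLUDED_NAMES:
        return False
    return True
```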
Define and validate the targeted architecture
- What do we use as input for a reduced provenance index?
  - all package managers
  - releases
  - tags (when applicable) -> every concrete branch of a snapshot
- Git (see the branch-selection sketch after this list):
  - if there are tags, keep the tags
  - else, keep HEAD
- Other VCS:
  - keep the head
- Use the most common file name dataset to identify the MIME type
  - import the content occurrence counts pre-computed in Athena into an ad-hoc table
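A minimal sketch of the branch-selection rule above, over a plain mapping from branch name to target; branch alias resolution is left out, and the head names used for non-Git VCS are illustrative:

```python
def select_inputs(branches: dict[bytes, bytes], vcs: str) -> list[bytes]:
    """Pick the snapshot branches that feed the reduced provenance index:
    for Git, keep the tags if there are any, otherwise HEAD;
    for other VCS, keep the head only."""
    if vcs == "git":
        tags = [target for name, target in branches.items()
                if name.startswith(b"refs/tags/")]
        if tags:
            return tags
        head = branches.get(b"HEAD")
        return [head] if head is not None else []
    # other VCS: keep the head; the candidate names here are illustrative
    for name in (b"HEAD", b"tip", b"trunk"):
        if name in branches:
            return [branches[name]]
    return []
```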
Action plan, task identification
- Define the heuristics of the input filters for the provenance index (tags, exclude epoch +/- a given range, content sizes, number of occurrences, MIME types, file names, ...)
- Define the infrastructure / pipeline
- Refactor the provenance code to handle input filters
- Run the reduced index generation
- Build a winnowing index on the scope of the provenance-indexed data
- Run and test with a set of rules
  - identify the relevant metrics
  - identify known projects for testing
Misc
- Prepare a possible scenario to compute the MIME types on the ENEA servers/HPC
- Repair the pipeline to keep running the current index on met