Provenance in production [Roadmap - Tooling and infrastructure]
- Lead: douardda
- Priority: high
- Effort: ??
Description:
Publish swh-provenance services in production, including revision and origin layers.
Includes work:
- Build and deploy a content index based on a winnowing algorithm
- Filter the provenance pipeline to process only tags and releases
- Set up a production infrastructure for the Kafka-based revision layer (including monitoring)
- Refactor and process the origin layer
- Release provenance documentation
KPIs:
- Provenance services available in production
- % of archive covered
Grooming meeting - 2023-04-12 - wrap-up
Origin of the project:
- Licence compliance tools
Principles
Identify a piece of code against a reference corpus
- basic approach: an index of all occurrences
- smarter approach: identify the first occurrence + all the other occurrences (scaling problem)
  - browse the reverse graph (potentially huge processing time)
  - trade-off: keep the first occurrence and iterate on the others if needed (see the sketch after this list)
  - constraint: the growth of the archive
- World of Code (code search tool, but restricted access)
  - solution: heuristics on first discovered date, first date after Subversion, first date after Git
- ScanOSS: snippet detection (open source, sparse documentation)
  - based on the DMOS(?) algorithm
  - snippet detection database: ~5 TB
- optimization: filter on releases only
  - is the performance gain significant enough? Query the graph to compare content counts from origins vs. releases
  - cons: some content will be lost / missed
  - to avoid this -> use the winnowing algorithm
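A minimal sketch of that trade-off, assuming a hypothetical pre-computed first-occurrence table with an expensive fallback over all occurrences; all names and types here are illustrative, not the actual swh-provenance schema:

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Occurrence:
    # illustrative record: where and when a content was seen
    content_swhid: str
    revision_swhid: str
    date: str  # ISO 8601 timestamp


# illustrative in-memory stand-ins for the provenance tables
FIRST_OCCURRENCE: dict[str, Occurrence] = {}
ALL_OCCURRENCES: dict[str, list[Occurrence]] = {}


def first_occurrence(content_swhid: str) -> Optional[Occurrence]:
    """Cheap path: return only the pre-computed first occurrence."""
    return FIRST_OCCURRENCE.get(content_swhid)


def all_occurrences(content_swhid: str) -> Iterator[Occurrence]:
    """Expensive path: iterate over every known occurrence, only when needed."""
    yield from ALL_OCCURRENCES.get(content_swhid, [])
```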
Current project status
- Running on met at 5 revisions/second
- It was OK on MMCA (OVH), recurrent failures on met
- Testing in progress of a version using only the releases: still slow (not using RabbitMQ, contention on the DB)
- POC of the winnowing algorithm (sketched below)
  - current file formatting is basic (only removal of all whitespace)
  - ~280 fingerprints (CRC32) per file
  - estimated 3×10^12 lines in the database = ~75 TB with line numbers
  - stored in Redis
  - search algorithm:
    - search for the matching snippets
    - sort by number of matching snippets
  - file headers introduce some bias: heuristics need to be added
  - a real matching file is one whose matching snippets appear in the same order
- MIME type indexer: not reliable enough -> when needed, MIME types must be computed
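The notes do not record the POC's exact parameters, so the following is only a minimal sketch of winnowing-style fingerprinting and snippet search, with assumed values for the k-gram and window sizes and a plain dict standing in for the Redis reverse index:

```python
import zlib
from collections import defaultdict

K = 20  # k-gram length in characters (assumed; not given in the notes)
W = 16  # winnowing window size (assumed)


def normalize(source: bytes) -> bytes:
    """Basic formatting as in the POC: remove all whitespace."""
    return source.translate(None, b" \t\r\n\f\v")


def fingerprints(source: bytes, k: int = K, w: int = W) -> list[tuple[int, int]]:
    """Winnowing: hash every k-gram with CRC32 and keep, for each sliding
    window of w consecutive k-grams, the rightmost minimal hash.
    Returns (hash, offset in normalized text) pairs."""
    text = normalize(source)
    if len(text) < k:
        return []
    hashes = [zlib.crc32(text[i:i + k]) for i in range(len(text) - k + 1)]
    selected: list[tuple[int, int]] = []
    previous = None
    for start in range(max(len(hashes) - w + 1, 1)):
        window = hashes[start:start + w]
        smallest = min(window)
        pos = start + len(window) - 1 - window[::-1].index(smallest)
        if (smallest, pos) != previous:
            selected.append((smallest, pos))
            previous = (smallest, pos)
    return selected


# Reverse index: fingerprint -> {(file swhid, offset)}; in the POC this
# lives in Redis, a dict keeps the sketch self-contained.
INDEX: dict[int, set[tuple[str, int]]] = defaultdict(set)


def index_file(swhid: str, source: bytes) -> None:
    for h, pos in fingerprints(source):
        INDEX[h].add((swhid, pos))


def search(source: bytes) -> list[tuple[str, int, int]]:
    """Rank candidate files by number of matching snippets, then by the
    longest run of snippets matching in the same order (a proxy for a
    "real" match as described above)."""
    matches: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for qpos, (h, _) in enumerate(fingerprints(source)):
        for swhid, fpos in INDEX.get(h, ()):
            matches[swhid].append((qpos, fpos))
    ranked = []
    for swhid, pairs in matches.items():
        pairs.sort()
        best = run = 0
        last = -1
        for _, fpos in pairs:
            run = run + 1 if fpos > last else 1
            best = max(best, run)
            last = fpos
        ranked.append((swhid, len(pairs), best))
    ranked.sort(key=lambda t: (t[1], t[2]), reverse=True)
    return ranked
```

The header bias noted above could be mitigated on top of this, for instance by ignoring fingerprints that occur in a very large number of files; that is one of the heuristics still to be defined.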
Possible applications
- Extension of swh-scanner
- MIME type computation (a sketch follows this list):
  - during dataset generation -> additional table in Athena
  - stored in a table somewhere to avoid a full recomputation each time a graph is compressed
  - computation possible on the ENEA cluster or the HPC?
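Where and how MIME types get computed is still open (Athena table, ENEA cluster, HPC); as a minimal sketch, an on-demand computation with libmagic via the python-magic binding could look like this (the function names and row layout are illustrative):

```python
import magic  # python-magic, a binding to libmagic


def mime_type(data: bytes) -> str:
    """Guess the MIME type of a raw content from its first bytes."""
    return magic.from_buffer(data[:4096], mime=True)


def compute_mime_types(contents):
    """Map (swhid, raw bytes) rows to (swhid, MIME type) rows, e.g. to fill
    an additional table next to the dataset (Athena or an ad-hoc table)."""
    for swhid, data in contents:
        yield swhid, mime_type(data)
```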
- Extend the swh-graph API to allow querying the most common file name table
Specify the usage of winnowing/fingerprints in the context of Provenance
- Constitute a reverse index of all fingerprints with references to the files where they appear
- The heuristic for code snippets is clearly defined (ScanOSS recommendations)
- Filter out files (under a minimum size, identified data files, ...); see the sketch below
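A minimal sketch of such a content filter; the thresholds, MIME prefixes, and excluded file names below are placeholders, the real heuristics are to be defined as part of the action plan:

```python
# Placeholder values; the actual heuristics are still to be defined.
MIN_SIZE = 64                  # bytes: too small to contain a meaningful snippet
MAX_SIZE = 10 * 1024 * 1024    # bytes: likely a data file above this
TEXT_MIME_PREFIXES = ("text/",)
EXCLUDED_NAMES = {"package-lock.json", "yarn.lock"}


def keep_for_fingerprinting(name: str, size: int, mime: str) -> bool:
    """Decide whether a content is worth fingerprinting at all."""
    if size < MIN_SIZE or size > MAX_SIZE:
        return False
    if not mime.startswith(TEXT_MIME_PREFIXES):
        return False
    if name in EXCLUDED_NAMES:
        return False
    return True
```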
Define and validate the targeted architecture
- What do we use as input for a reduced provenance index?
  - all package managers
  - releases
  - tags (when applicable) -> every concrete branch of a snapshot
- Git (see the branch-selection sketch after this list):
  - if there are tags, keep the tags
  - else, keep HEAD
- Other VCS:
  - keep the head
- Use the most common file name dataset to identify the MIME type
  - import the content occurrence counts pre-computed in Athena into an ad-hoc table
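A minimal sketch of the branch-selection rule above, over a plain mapping from branch name to target; branch alias resolution is left out, and the head names used for non-Git VCS are illustrative:

```python
def select_inputs(branches: dict[bytes, bytes], vcs: str) -> list[bytes]:
    """Pick the snapshot branches that feed the reduced provenance index:
    for Git, keep the tags if there are any, otherwise HEAD;
    for other VCS, keep the head only."""
    if vcs == "git":
        tags = [target for name, target in branches.items()
                if name.startswith(b"refs/tags/")]
        if tags:
            return tags
        head = branches.get(b"HEAD")
        return [head] if head is not None else []
    # other VCS: keep the head; the candidate names here are illustrative
    for name in (b"HEAD", b"tip", b"trunk"):
        if name in branches:
            return [branches[name]]
    return []
```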
Action plan, task identification
- Define the heuristics of the input filters for the provenance index (tags, exclude epoch +/- a given range, content sizes, number of occurrences, MIME types, file names, ...)
- Define the infrastructure / pipeline
- Refactor the provenance code to handle input filters
- Run the reduced index generation
- Build a winnowing index on the scope of the provenance-indexed data
- Run and test with a set of rules
  - identify the relevant metrics
  - identify known projects for testing
Misc
- Prepare a possible scenario to compute the MIME types on the ENEA servers/HPC
- Repair the pipeline to keep running the current index on met