Filter the provenance pipeline input to reduce the index volume
Define the heuristics for the input filters of the provenance index.
Proposed options are:
- Process only tags/releases
- Exclude dates within a given range of the epoch (epoch +/- range)
- Filter contents by size range
- Exclude "too popular" contents (number of occurrences)
- Filter by MIME type
- Filter by file name
For each of these options:
- Identify the data source to query to get the metric
- Define the values to apply for a first iteration
- Implement the filter handling in provenance
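
As a starting point for discussion, the options above could be sketched as simple predicates driven by one configuration object. This is only a minimal sketch: `ContentInfo`, `FilterConfig`, `keep_revision` and `keep_content` are hypothetical names, not existing provenance code. The threshold values are placeholders rather than proposed first-iteration values. It also assumes that "epoch" means the Unix epoch and that each metric is already available from some data source that is still to be identified.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Tuple


@dataclass(frozen=True)
class ContentInfo:
    """Hypothetical per-content metrics; the real data sources are still to be identified."""
    size: int          # content length in bytes
    occurrences: int   # number of times the content has been seen
    mime_type: str
    file_name: str


@dataclass(frozen=True)
class FilterConfig:
    """Placeholder thresholds for illustration only, not proposed values."""
    epoch_margin: int = 86_400                               # seconds around the Unix epoch
    size_range: Tuple[int, Optional[int]] = (1, 10_000_000)  # (min, max) bytes, max may be None
    max_occurrences: int = 1_000_000
    excluded_mime_prefixes: Tuple[str, ...] = ("image/",)
    excluded_name_patterns: Tuple[str, ...] = ("*.min.js",)


def keep_revision(timestamp: int, is_release_target: bool, cfg: FilterConfig) -> bool:
    """Keep only revisions targeted by a tag/release and dated outside epoch +/- margin."""
    if not is_release_target:
        return False
    return abs(timestamp) > cfg.epoch_margin


def keep_content(content: ContentInfo, cfg: FilterConfig) -> bool:
    """Apply the size, popularity, MIME type and file name filters in turn."""
    min_size, max_size = cfg.size_range
    if content.size < min_size:
        return False
    if max_size is not None and content.size > max_size:
        return False
    if content.occurrences > cfg.max_occurrences:
        return False
    if content.mime_type.startswith(cfg.excluded_mime_prefixes):
        return False
    # Case-sensitive glob match against each excluded pattern.
    return not any(
        fnmatchcase(content.file_name, pattern)
        for pattern in cfg.excluded_name_patterns
    )
```

Keeping all thresholds in one configuration object would make it straightforward to adjust the values between iterations without touching the filter logic.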