Loaders do not have this information themselves; it is handled by the "filtering storage proxy" (swh.storage.proxies.filter), which runs in the same process as the loaders.
! In #4185, @vlorentz wrote:
Loaders do not have this information themselves; it is handled by the "filtering storage proxy" (swh.storage.proxies.filter), which runs in the same process as the loaders.
In practice, the loaders do have this information: they send a list of objects to <foo>_add(), so they know how many objects they have processed, and the <foo>_add() methods return counters for the number of objects that were actually added.
We could add statsd probes in the loaders before calling <foo>_add() to count the number of objects processed (which would give us a "global" ratio of number of objects processed to objects effectively added), or we could just make the filtering storage proxy send said statsd probes itself (a count of inbound objects, before any filtering).
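As a rough illustration of the first approach (probes in the loader, around the <foo>_add() call), here is a minimal sketch. Everything in it is hypothetical: RecordingStatsd is an in-memory stand-in for a real statsd client, the metric names are made up, and the "<type>:add" key in the returned summary is an assumption about the shape of the counters that <foo>_add() returns.

```python
from typing import Callable, Dict, Iterable


class RecordingStatsd:
    """Hypothetical stand-in for a statsd client: keeps counters in memory."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = {}

    def increment(self, name: str, value: int = 1) -> None:
        self.counters[name] = self.counters.get(name, 0) + value


def add_with_probes(
    statsd: RecordingStatsd,
    object_type: str,
    add_fn: Callable[[list], Dict[str, int]],
    objects: Iterable,
) -> Dict[str, int]:
    """Call a <foo>_add()-like function and count inbound vs. added objects."""
    batch = list(objects)
    # Probe 1: objects processed by the loader, before any filtering.
    statsd.increment(f"{object_type}_processed_total", len(batch))
    summary = add_fn(batch)
    # Probe 2: objects the storage reports as actually added.
    # Assumption: the summary uses a "<type>:add" key.
    statsd.increment(
        f"{object_type}_added_total", summary.get(f"{object_type}:add", 0)
    )
    return summary


def make_fake_revision_add(already_stored: Iterable[str]):
    """Build a fake revision_add() that skips objects it already holds."""
    stored = set(already_stored)

    def revision_add(revisions: list) -> Dict[str, int]:
        new = set(revisions) - stored
        stored.update(new)
        return {"revision:add": len(new)}

    return revision_add
```

The "global" ratio is then the added counter divided by the processed counter, computed on the monitoring side rather than in the loader.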
To get per-task metrics, each task needs to keep both the number of objects processed and the number of objects added, and to compute the ratio between them. We have multiple options there:
Keep track of the counts in the loader directly.
Add a method to the filtering storage proxy to retrieve these counts.
Create a new "counting" storage proxy which would keep track of both the inbound and outbound objects and have a method to retrieve cumulative counts. We could use this proxy explicitly in the loaders.
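To make the third option more concrete, here is a minimal sketch of what a "counting" proxy could look like. This is not existing swh.storage code: CountingProxy, counts() and added_ratio() are hypothetical names, the proxy wraps an already-built storage object instead of being instantiated from configuration, and it assumes that each <foo>_add() method returns a summary dict with a "<foo>:add" key.

```python
from collections import Counter
from typing import Any, Dict, Optional


class CountingProxy:
    """Hypothetical proxy counting objects sent to, and added by, <foo>_add()."""

    def __init__(self, storage: Any) -> None:
        self.storage = storage
        self.inbound: Counter = Counter()
        self.added: Counter = Counter()

    def __getattr__(self, name: str) -> Any:
        # Only called for attributes not found on the proxy itself.
        attr = getattr(self.storage, name)
        if not name.endswith("_add") or not callable(attr):
            return attr  # pass every other method through untouched
        object_type = name[: -len("_add")]

        def counted_add(objects, *args, **kwargs) -> Dict[str, int]:
            batch = list(objects)
            self.inbound[object_type] += len(batch)
            summary = attr(batch, *args, **kwargs)
            # Assumption: the summary uses a "<type>:add" key.
            self.added[object_type] += summary.get(f"{object_type}:add", 0)
            return summary

        return counted_add

    def counts(self) -> Dict[str, Dict[str, int]]:
        """Cumulative inbound/added counts per object type."""
        return {
            t: {"inbound": self.inbound[t], "added": self.added[t]}
            for t in self.inbound
        }

    def added_ratio(self, object_type: str) -> Optional[float]:
        """Share of inbound objects that were actually added, if any were seen."""
        inbound = self.inbound[object_type]
        return self.added[object_type] / inbound if inbound else None


class FakeStorage:
    """Toy storage used only to exercise the proxy."""

    def __init__(self) -> None:
        self._revisions: set = set()

    def revision_add(self, revisions: list) -> Dict[str, int]:
        new = set(revisions) - self._revisions
        self._revisions.update(new)
        return {"revision:add": len(new)}


storage = CountingProxy(FakeStorage())
storage.revision_add(["a", "b"])
storage.revision_add(["b", "c"])  # "b" is already known and is not re-added
per_task_counts = storage.counts()
```

In a loader, such a proxy would presumably sit in front of the filtering proxy, so that its inbound count reflects the objects before any filtering; the task could then read counts() at the end of the run to report its per-task ratio.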
The Git loader now exports a swh_loader_filtered_objects_total metric. We should eventually generalize this to the other loaders, using one of the options above.