add a tool to generate "diffstats" for one or more commits
We would like to be able to generate, for a given (potentially very large) set of commits, the list of files each of them modifies. Slightly more formally:
- Input: a set of revisions identified by SWHIDs
- Output: a mapping from each input SWHID to the set of paths, relative to the root directory, of the files that have been modified in the corresponding revision
Intuitively, this feature would be analogous to the output of git diff --stat, but without the detail of the number of lines added to/removed from each modified path.
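To make the intended output concrete, here is a minimal sketch of the "changed paths" computation between two directory trees. It is only an illustration: the nested-dict Tree model and the changed_paths name are hypothetical and not part of swh-graph, and it assumes that a commit is compared against a single parent tree. Identical subtrees are skipped without being visited, which is what keeps this cheap on a Merkle structure.

```python
from typing import Dict, Set, Union

# Hypothetical in-memory model of a directory (NOT the swh-graph API):
# entry name -> content id (str) for files, or a nested dict for sub-directories.
Tree = Dict[str, Union[str, "Tree"]]


def changed_paths(old: Tree, new: Tree, prefix: str = "") -> Set[str]:
    """Return the paths of files added, removed, or modified between two trees."""
    paths: Set[str] = set()
    for name in old.keys() | new.keys():
        path = prefix + name
        a, b = old.get(name), new.get(name)
        if a == b:
            continue  # same content id or identical subtree: nothing changed below
        if isinstance(a, dict) and isinstance(b, dict):
            # Both sides are directories: recurse to find what differs inside.
            paths |= changed_paths(a, b, path + "/")
            continue
        # At least one side is a file or is missing: report every file involved.
        for side in (a, b):
            if isinstance(side, dict):
                paths |= changed_paths({}, side, path + "/")
            elif side is not None:
                paths.add(path)
    return paths


old = {"README": "b1", "src": {"main.c": "b2", "util.c": "b3"}}
new = {"README": "b1", "src": {"main.c": "b9", "util.c": "b3"}, "docs": {"index.md": "b4"}}
print(sorted(changed_paths(old, new)))  # ['docs/index.md', 'src/main.c']
```

Note that this sketch reports a rename as one removed path plus one added path, and says nothing about merge commits.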
Caveats apply to merge commits. I'm not sure what the exact semantics of "diff" are in that case; that will need to be looked up. Also, we already have this feature implemented in swh-web. I don't know if there is code to be refactored there, but it would be nice to at least have consistent semantics between the two implementations.
In terms of implementation, all of this should be implementable using swh-graph only, provided that the maps from edges to path names are available.
As an interface, I suggest implementing this as a UNIX filter that reads one SWHID per line and, for each of them, outputs one ndjson line with a list of paths. This would imply encoding the output as UTF-8, which will rule out some weird paths (e.g., those that are not valid UTF-8), but that could be an acceptable limitation. Alternatively, we will need to pick another single-line format that can encode lists properly.
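The filter interface could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the lookup_stub function stands in for the actual swh-graph query (which is not shown here), the {"swhid": ..., "paths": [...]} record shape is one possible layout for the ndjson line, and silently skipping undecodable paths is just one way to handle the UTF-8 limitation.

```python
import json
import sys
from typing import Callable, Iterable, Iterator, List


def lookup_stub(swhid: str) -> List[bytes]:
    """Hypothetical placeholder for the graph query; returns raw path bytes."""
    return []


def diffstat_lines(
    swhids: Iterable[str], lookup: Callable[[str], List[bytes]]
) -> Iterator[str]:
    """Yield one JSON document per non-empty input line."""
    for line in swhids:
        swhid = line.strip()
        if not swhid:
            continue  # ignore blank input lines
        paths = []
        for raw in lookup(swhid):
            try:
                paths.append(raw.decode("utf-8"))
            except UnicodeDecodeError:
                # Assumed policy: paths that are not valid UTF-8 are dropped.
                continue
        # json.dumps never emits a raw newline, so each record stays on one line.
        yield json.dumps({"swhid": swhid, "paths": sorted(paths)})


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read SWHIDs from stdin, write ndjson to stdout."""
    for record in diffstat_lines(stdin, lookup_stub):
        stdout.write(record + "\n")
```

Calling main() at the end of a script turns this into the filter itself. Passing the lookup as a parameter keeps the I/O loop testable without a loaded graph.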
Before starting to process input, the UNIX filter will load the compressed graph in memory using the Java API (possibly via the /dev/shm trick, which avoids reloading the graph if it is already in memory).