Redesign list-by-earliest-revision to be parallel (!431) · Merge requests · Platform / Development / swh-graph

vlorentz requested to merge list-by-earliest-revision into master Mar 14, 2024

The old algorithm was:

Read all revisions from .orc and sort them by date (SortRevrelByDate)
Initialize an array of timestamps, one for each node
for each revision in increasing timestamp order, traverse all contents and directories from that revision. If their timestamp is unset, set it and print the (date, revrel, cntdir) triple
dump timestamps

This was intrinsically sequential, and took 2 days (1 for SortRevrelByDate, 1 for the others).

The new algorithm takes 1.5 hours. Here is how it works:

Initialize an array of [AtomicOccurrence], one for each node, to the maximum timestamp and an undefined node id
For each revision/release (in parallel): a. Get its author date (if none, skip the revision/release) b. traverse all contents and directories contained by that revision/release. For each content/directory, atomically set the [AtomicOccurrence] to the current rev/rel and its timestamp if the [AtomicOccurrence] has a higher timestamp than the revision/release
For each content/directory (in parallel): a. Get the (timestamp, revrel) pair represented by its [AtomicOccurrence]. b. If the timestamp is not the maximum (ie. if it was set as step 2.b.), then printthe (timestamp, revrel, cntdir) triple
Convert the array of [AtomicOccurrence] to an array of timestamps and dump it.

Redesign list-by-earliest-revision to be parallel