Skip to content

Redesign list-by-earliest-revision to be parallel

vlorentz requested to merge list-by-earliest-revision into master

The old algorithm was:

  1. Read all revisions from .orc and sort them by date (SortRevrelByDate)
  2. Initialize an array of timestamps, one for each node
  3. for each revision in increasing timestamp order, traverse all contents and directories from that revision. If their timestamp is unset, set it and print the (date, revrel, cntdir) triple
  4. dump timestamps

This was intrinsically sequential, and took 2 days (1 for SortRevrelByDate, 1 for the others).

The new algorithm takes 1.5 hours. Here is how it works:

  1. Initialize an array of [AtomicOccurrence], one for each node, to the maximum timestamp and an undefined node id
  2. For each revision/release (in parallel): a. Get its author date (if none, skip the revision/release) b. traverse all contents and directories contained by that revision/release. For each content/directory, atomically set the [AtomicOccurrence] to the current rev/rel and its timestamp if the [AtomicOccurrence] has a higher timestamp than the revision/release
  3. For each content/directory (in parallel): a. Get the (timestamp, revrel) pair represented by its [AtomicOccurrence]. b. If the timestamp is not the maximum (ie. if it was set as step 2.b.), then printthe (timestamp, revrel, cntdir) triple
  4. Convert the array of [AtomicOccurrence] to an array of timestamps and dump it.

Merge request reports