
Draft: Start BFS from (sorted) origins instead of random nodes

vlorentz requested to merge bfs-origins into master

Still needs to be benchmarked to make sure there is a noticeable improvement.

Script I used to sort origins:

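import sys

# Read "<url> <swhid>" lines from stdin and print "<reversed url>\t<swhid>",
# so that a plain `sort` orders origins by their reversed URL components.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        (url, swhid) = line.split()
    except ValueError:
        # whitespace in URL, probably invalid
        continue
    # a SWHID such as swh:1:ori:<40 hex digits> is 50 characters long
    assert len(swhid) == 50, repr(line)
    reversed_url = "/".join(reversed(url.rstrip("/").split("/")))
    print(f"{reversed_url}\t{swhid}")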
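To make the sort key concrete, here is a minimal sketch of the same transformation as a standalone function. The name `reversed_url_key` and the sample URL are hypothetical and only there for illustration; they are not part of the merge request.

```python
def reversed_url_key(url: str) -> str:
    """Reverse the '/'-separated components of a URL, as the script above does."""
    return "/".join(reversed(url.rstrip("/").split("/")))


# Hypothetical sample URL, not taken from the dataset.
key = reversed_url_key("https://example.org/alice/project/")
print(key)  # project/alice/example.org//https:
```

Since the last path component comes first in the key, sorting by it should place origins that share a final component (for instance, repositories with the same name on different hosts) next to each other.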

and then run it with:

pv /srv/softwareheritage/ssd/data/vlorentz/datasets/2023-09-06-recompressed/compressed/origins/* |
    zstdcat |
    python3 ../sort_origins.py |
    TMPDIR=/srv/softwareheritage/tmp/ sort --parallel=96 -S 100M |
    pv --wait |
    sed "s#.*\t##" > /srv/softwareheritage/ssd/data/vlorentz/datasets/2023-09-06-recompressed/compressed/sorted_origins.txt
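For small inputs, the whole pipeline (key, sort, strip the key) can be sketched in memory. This is an illustration under assumptions, not the method used for the real dataset: the function name `sort_swhids_by_reversed_url` and the sample lines are hypothetical, and Python compares strings by code point, so the order may differ from `sort` when it runs under a non-C locale.

```python
def sort_swhids_by_reversed_url(lines):
    """Return SWHIDs ordered by the reversed-URL key of their origin URL."""
    keyed = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            # blank line, or whitespace in the URL: skip, like the script does
            continue
        url, swhid = parts
        key = "/".join(reversed(url.rstrip("/").split("/")))
        keyed.append((key, swhid))
    keyed.sort()  # equivalent of the `sort` step
    return [swhid for _, swhid in keyed]  # equivalent of the `sed` step


# Hypothetical sample lines with made-up SWHIDs.
samples = [
    "https://example.org/bob/zeta swh:1:ori:" + "a" * 40,
    "https://example.com/alice/project swh:1:ori:" + "b" * 40,
    "https://example.org/carol/project/ swh:1:ori:" + "c" * 40,
    "",
]
ordered = sort_swhids_by_reversed_url(samples)
```

Here the two `project` origins end up adjacent and ahead of `zeta`, which is the ordering the BFS would then start from.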
