Complete proposal to implement the above solution:
add this to swh.model.model.DirectoryEntry (Directory.from_dict would set the rank if it's missing):
rank = attr.ib(type=int, validator=type_validator())
"""Zero-based index of this entry in a directory."""
In PostgreSQL, change the type of dir_entries/file_entries/rev_entries from bigint[] to bigint[2][]: pairs of (directory_entry object id, rank). The migration would be:
1. duplicate the columns, and make the Python code write to both but read only the old ones
2. fill the new columns (looooong)
3. drop the old columns, rename the new ones, and make the Python code use them
in cassandra, add column rank to directory_entry table. We can initialize them at 0 and make the Python code fill it when reading. (we can also have a script fill it, but it's not mandatory)
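To make the rank handling above more concrete, here is a minimal sketch, not the actual swh.model or swh.storage code. `Entry` is a hypothetical, simplified stand-in for DirectoryEntry (a stdlib dataclass instead of attrs, to keep the example self-contained), and `entries_from_dicts` and `fill_ranks_on_read` are made-up names. The read-time fallback assumes that rows left at rank 0 come from directories stored in canonical git tree order; the proposal itself does not say how ranks would be recomputed.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, List


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified stand-in for swh.model.model.DirectoryEntry."""

    name: bytes
    type: str  # "file", "dir" or "rev"
    target: bytes
    rank: int  # zero-based index of this entry in its directory


def entries_from_dicts(dicts: List[Dict[str, Any]]) -> List[Entry]:
    """Build entries, using the list position as rank when it is missing."""
    return [
        Entry(
            name=d["name"],
            type=d["type"],
            target=d["target"],
            rank=d["rank"] if d.get("rank") is not None else index,
        )
        for index, d in enumerate(dicts)
    ]


def git_sort_key(entry: Entry) -> bytes:
    """Canonical git tree order: directories compare as if named b'name/'."""
    return entry.name + b"/" if entry.type == "dir" else entry.name


def fill_ranks_on_read(entries: List[Entry]) -> List[Entry]:
    """Return entries ordered by rank, filling ranks that were never set.

    Assumption: if several entries all have rank 0, the ranks were only
    initialized by the schema change, so canonical git order is used.
    """
    if len(entries) > 1 and all(e.rank == 0 for e in entries):
        ordered = sorted(entries, key=git_sort_key)
        return [replace(e, rank=i) for i, e in enumerate(ordered)]
    return sorted(entries, key=lambda e: e.rank)
```

With this shape, newly loaded directories keep whatever order the loader saw, while rows that predate the migration fall back to the canonical order they were presumably stored with.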
I came across a rather small repository [1] which I believe raises the same issue.
Keeping a reference to it here may make it easier to test the improvement discussed above.
Feel free to dismiss if not that useful.
swh-loader_1 | [2021-10-22 11:47:39,586: INFO/MainProcess] Task swh.loader.git.tasks.UpdateGitRepository[3b8b9037-f344-44e5-ab97-a3a310d9214f] received
swh-loader_1 | [2021-10-22 11:47:39,632: INFO/ForkPoolWorker-1] Load origin 'https://github.com/technoweenie/attachment_fu' with type 'git'
swh-loader_1 | Enumerating objects: 3549, done.
swh-loader_1 | Total 3549 (delta 0), reused 0 (delta 0), pack-reused 3549
swh-loader_1 | [2021-10-22 11:47:41,486: INFO/ForkPoolWorker-1] Listed 23 refs for repo https://github.com/technoweenie/attachment_fu
swh-loader_1 | [2021-10-22 11:47:42,063: ERROR/ForkPoolWorker-1] Loading failure, updating to `failed` status
swh-loader_1 | Traceback (most recent call last):
swh-loader_1 |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 339, in load
swh-loader_1 |     self.store_data()
swh-loader_1 |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/core/loader.py", line 457, in store_data
swh-loader_1 |     for directory in self.get_directories():
swh-loader_1 |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/git/loader.py", line 376, in get_directories
swh-loader_1 |     yield converters.dulwich_tree_to_directory(raw_obj)
swh-loader_1 |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/git/converters.py", line 104, in dulwich_tree_to_directory
swh-loader_1 |     check_id(dir_)
swh-loader_1 |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/loader/git/converters.py", line 39, in check_id
swh-loader_1 |     f"Expected {type(obj).__name__} hash to be {obj.id.hex()}, "
swh-loader_1 | swh.loader.git.converters.HashMismatch: Expected Directory hash to be 6ac72a0858a5d5028d7f502de8777fbd5bdb8cae, got e23127f28dd0e1cf6e92a7b81cb9dbc53b44aa37
$ time git clone https://github.com/technoweenie/attachment_fu
Cloning into 'attachment_fu'...
remote: Enumerating objects: 1737, done.
remote: Total 1737 (delta 0), reused 0 (delta 0), pack-reused 1737
Receiving objects: 100% (1737/1737), 377.44 KiB | 3.85 MiB/s, done.
Resolving deltas: 100% (740/740), done.
git clone https://github.com/technoweenie/attachment_fu 0.05s user 0.02s system 8% cpu 0.837 total
$ du -sh attachment_fu
964K attachment_fu