This task will take us one step towards a searchable archive :-)
We should take a very conservative approach: I would suggest keeping the metadata as it is and just copying it.
This way, you don't need to distinguish between the fields that require migrating and those that do not.
Finally, it will be less stressful to run a script that doesn't change the archive but is very useful for the search mechanisms we want to implement on the ERMDS (Extrinsic Raw MetaData Storage).
Processed 0.46M rows (~0.2%, last revision: 0095624edf008b754fb1ed5bd656d22c63f984ff)
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1204, in <module>
    main(storage_dbconn, storage_url, deposit_dbconn, bytes.fromhex(first_id), True)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1165, in main
    handle_row(row, storage, deposit_cur, dry_run)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 975, in handle_row
    storage, row["id"], metadata["original_artifact"][0]["filename"]
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 261, in pypi_origin_from_filename
    project_name = pypi_project_from_filename(filename)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 252, in pypi_project_from_filename
    assert match, original_filename
AssertionError: pypops-201408-r4.tar.gz
I'll make the script log the revisions it's unable to process, rather than uselessly fall flat on its face.
2021-04-06 20:19:19,898 __main__ ERROR Could not parse revision metadata 00959a167bd98452c98ce73382f4b42179d53d32
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1161, in main
    handle_row(row, storage, deposit_cur, dry_run)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 979, in handle_row
    storage, row["id"], metadata["original_artifact"][0]["filename"]
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 265, in pypi_origin_from_filename
    project_name = pypi_project_from_filename(filename)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 256, in pypi_project_from_filename
    assert match, original_filename
AssertionError: pypops-201408-r4.tar.gz
2021-04-06 20:54:44,962 __main__ ERROR Could not parse revision metadata 00c6e2fe046dee3b5ef629f74f4801345840e70a
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 1161, in main
    handle_row(row, storage, deposit_cur, dry_run)
  File "/usr/lib/python3/dist-packages/swh/storage/migrate_extrinsic_metadata.py", line 843, in handle_row
    assert "id" in actual_metadata or "title" in actual_metadata
AssertionError
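For readers who want to see the log-and-continue pattern in isolation, here is a minimal sketch. It is not the actual code of migrate_extrinsic_metadata.py: the handle_row below is a hypothetical stand-in that fails on a single filename, and process_rows is an assumed name for the surrounding loop. Only the log message text is taken from the output above.

```python
import logging

logger = logging.getLogger(__name__)


def handle_row(row):
    """Hypothetical stand-in for the real handle_row: rejects one filename."""
    assert row["filename"] != "pypops-201408-r4.tar.gz", row["filename"]


def process_rows(rows):
    """Process every row; log failures and keep going instead of aborting."""
    failed = []
    for row in rows:
        try:
            handle_row(row)
        except Exception:
            # logger.exception logs at ERROR level and appends the traceback.
            logger.exception(
                "Could not parse revision metadata %s", row["id"].hex()
            )
            failed.append(row["id"])
    return failed
```

Returning the list of failed ids is a design choice of this sketch; it makes it easy to review the unprocessed revisions after the run, which the real script may handle differently.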
Doing a final spot check of revisions with a metadata field that haven't been replicated as raw_extrinsic_metadata:
\copy (select id, type, directory, metadata from revision tablesample system (0.1) where metadata is not null and ((metadata - 'extra_headers') - 'node') != '{}' and not exists (select 1 from raw_extrinsic_metadata where target = ('swh:1:dir:' || encode(revision.directory::bytea, 'hex')))) to '/tmp/no_metadata' csv;
15:29 guest@softwareheritage => \copy (select id, type, directory, metadata from revision where metadata is not null and ((metadata - 'extra_headers') - 'node') != '{}' and not exists (select 1 from raw_extrinsic_metadata where target = ('swh:1:dir:' || encode(revision.directory::bytea, 'hex')))) to '/tmp/no_metadata' csv;
COPY 50
50 revisions have metadata fields but no associated raw_extrinsic_metadata. Most of them are Debian packages.
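To make the two conditions in the query above easier to read, here is a rough Python equivalent, written as a sketch only. The function names are assumptions made for illustration; the logic mirrors the SQL: drop the extra_headers and node keys and check whether anything is left, and build the raw_extrinsic_metadata target from the directory id.

```python
def has_remaining_metadata(metadata):
    """Mirror of: metadata is not null
    and ((metadata - 'extra_headers') - 'node') != '{}'."""
    if metadata is None:
        return False
    remaining = {
        key: value
        for key, value in metadata.items()
        if key not in ("extra_headers", "node")
    }
    return remaining != {}


def directory_target(directory_id: bytes) -> str:
    """Mirror of: 'swh:1:dir:' || encode(revision.directory::bytea, 'hex')."""
    return "swh:1:dir:" + directory_id.hex()
```

A revision counts as unreplicated when has_remaining_metadata is true for its metadata and no raw_extrinsic_metadata row has directory_target(directory) as its target.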