In order to know which objects should not be replayed in the mirrors, we should extract the objects from the redis used to store the errors during the cassandra replay.
extract format: csv <to be defined exactly>
TODO: check what really needs to be extracted
The objects are dispatched according to the following cases:
I'm not sure what the status of this issue is; am I supposed to run the script mentioned above? If so, on which machine? (the script seems to expect redis to be running locally)
It's all on my side. The first step is to run the error checking script for the ids in db1 (I'm currently checking that everything is still ok to avoid any data loss), and then to extract and enrich the data stored in db4.
The script expects redis to be listening on localhost mostly because I run it against a tunneled redis.
The first error triage is done. Some edge cases delayed the execution a little:
548 ids were not SWHIDs and were in the format `directory:<hash>`; there is probably an incorrect error raised when a deserialization error occurs (see the next point). After converting them to `swh:1:dir:<hash>`, the check ran to the end.
...
Renaming directory:848b916a039f98c0bd743544ef9a884fde79f888 to swh:1:dir:848b916a039f98c0bd743544ef9a884fde79f888
Renaming directory:feecdc9867f8d7022b7ebfdaa3e7a6451ce628b0 to swh:1:dir:feecdc9867f8d7022b7ebfdaa3e7a6451ce628b0
Renaming directory:90119b82fcf283a02b57ee5bca6ab20ccc842f95 to swh:1:dir:90119b82fcf283a02b57ee5bca6ab20ccc842f95
Renaming directory:14359a6478e6141dd6d887d770be9fdae5a53b1b to swh:1:dir:14359a6478e6141dd6d887d770be9fdae5a53b1b
Renaming directory:ea2f878cd7952aef5fc2a4fdd17018ddc793bde7 to swh:1:dir:ea2f878cd7952aef5fc2a4fdd17018ddc793bde7
Renaming directory:6f3b91e9a55a366e84620aac36461f2fa4490b5b to swh:1:dir:6f3b91e9a55a366e84620aac36461f2fa4490b5b
Renaming directory:51a2139e8640fb7f3f62117ae891baf60891ed1d to swh:1:dir:51a2139e8640fb7f3f62117ae891baf60891ed1d
Renaming directory:aef01aa03d44d55f9151fbeffd4d9a87b5a1745d to swh:1:dir:aef01aa03d44d55f9151fbeffd4d9a87b5a1745d
Renaming directory:befc125a54a4a205cad8bbad3d86ce492d080285 to swh:1:dir:befc125a54a4a205cad8bbad3d86ce492d080285
Renaming directory:3004eac6d34c2e41172a4c1d8460a088a1d1c497 to swh:1:dir:3004eac6d34c2e41172a4c1d8460a088a1d1c497
Renaming directory:8b5f9efeec8829a63ef44e28e51c063d04680abe to swh:1:dir:8b5f9efeec8829a63ef44e28e51c063d04680abe
...
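For reference, the renaming step can be sketched roughly as below. This is only an illustration, not the actual script: the function name `normalize_id` and the regular expression are assumptions, and it assumes that the legacy ids are always `directory:` followed by a 40-character lowercase hex hash, as in the log above.

```python
import re

# Assumption: legacy ids look like "directory:<40 lowercase hex chars>".
LEGACY_DIRECTORY_ID = re.compile(r"^directory:([0-9a-f]{40})$")


def normalize_id(object_id: str) -> str:
    """Return the swh:1:dir: form of a legacy directory id.

    Ids that do not match the legacy pattern are returned unchanged.
    """
    match = LEGACY_DIRECTORY_ID.match(object_id)
    if match is None:
        return object_id
    return f"swh:1:dir:{match.group(1)}"
```

Ids that are already SWHIDs pass through untouched, so the function can be applied to every key without checking its format first.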
A new db (db9) was created for deserialization errors:
...
Error deserializing swh:1:dir:848b916a039f98c0bd743544ef9a884fde79f888: swh:1:dir:848b916a039f98c0bd743544ef9a884fde79f888 has duplicated entry name: b'main'
Error deserializing swh:1:dir:38c35c38bc318c659192cfc3fbf8a156732a7d37: swh:1:dir:38c35c38bc318c659192cfc3fbf8a156732a7d37 has duplicated entry name: b'test'
Error deserializing swh:1:dir:3cbf64da896442946b9e734997e8a88d7458dfdf: swh:1:dir:3cbf64da896442946b9e734997e8a88d7458dfdf has duplicated entry name: b'First Openlayers experiment.png'
Error deserializing swh:1:dir:8159cd7b10b5c643c52ec8caa5682a5091f2a162: swh:1:dir:8159cd7b10b5c643c52ec8caa5682a5091f2a162 has duplicated entry name: b'js'
Error deserializing swh:1:dir:cac80b8798c7a7ffbceb1601b1c5fabde4eda13c: swh:1:dir:cac80b8798c7a7ffbceb1601b1c5fabde4eda13c has duplicated entry name: b'sfFormExtraPlugin'
Error deserializing swh:1:dir:a0d3bc8f10e5899af1ed85250323fa3d1ccf831b: swh:1:dir:a0d3bc8f10e5899af1ed85250323fa3d1ccf831b has duplicated entry name: b'crypto'
Error deserializing swh:1:dir:a039441f63e6b7c67d691de068f04f1ab28d8465: swh:1:dir:a039441f63e6b7c67d691de068f04f1ab28d8465 has duplicated entry name: b'rails-widgets'
Error deserializing swh:1:dir:56998280c2aefc35e521bdf02a1c1f8618660604: swh:1:dir:56998280c2aefc35e521bdf02a1c1f8618660604 has duplicated entry name: b'MAD2'
Error deserializing swh:1:dir:6f2e3a48d75f350e0c94bdc770880fbea29667b5: swh:1:dir:6f2e3a48d75f350e0c94bdc770880fbea29667b5 has duplicated entry name: b'_posts'
Error deserializing swh:1:dir:934b3f5013da9820a38af45da5c605de4aa1da77: swh:1:dir:934b3f5013da9820a38af45da5c605de4aa1da77 has duplicated entry name: b'_posts'
Error deserializing swh:1:dir:49f0d4ae82d4c3551b01dc50ba28a5b7d4514ca8: swh:1:dir:49f0d4ae82d4c3551b01dc50ba28a5b7d4514ca8 has duplicated entry name: b'_posts'
Error deserializing swh:1:dir:b44dbb5bd7bbb4f22f2cca01fdc1baaaa8afef3a: swh:1:dir:b44dbb5bd7bbb4f22f2cca01fdc1baaaa8afef3a has duplicated entry name: b'src'
Error deserializing swh:1:dir:12bce7a0ab2be9b16d2cb9b4897edf4ff15daf7e: swh:1:dir:12bce7a0ab2be9b16d2cb9b4897edf4ff15daf7e has duplicated entry name: b'data'
Error deserializing swh:1:dir:4692fd14f99f1d40eb89bb36f82e8bb50aa39e16: swh:1:dir:4692fd14f99f1d40eb89bb36f82e8bb50aa39e16 has duplicated entry name: b'web'
...
They look like strange directories with 2 entries with the same name at the same level. They are present in the archive.
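To make the failing condition concrete, here is a minimal sketch of a "duplicated entry name" check. It is not the validation code that produced the errors above: the helper `duplicated_entry_names` is hypothetical, and it assumes that the entry names of one directory are available as a flat list of bytes.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_entry_names(names: Iterable[bytes]) -> List[bytes]:
    """Return, sorted, the entry names that appear more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

For a directory whose entries are named b'main', b'README' and b'main', this returns [b'main'], which matches the kind of directory reported in the log.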
I can process the 123_610 invalid hash entries to generate the final file now.