```
swh-scrubber=# copy (select config_id, first_occurrence, id from missing_object where first_occurrence >= '2024-11-01') to '/tmp/missing_objects.csv' with CSV quote ';';
COPY 14993
```
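For a quick sanity check on such a dump, awk can split the SWHIDs in the third column. This is only a sketch on a fabricated two-line sample (the real `/tmp/missing_objects.csv` is not reproduced here); the file name and values below are illustrative.

```shell
# Fabricated sample in the same shape as the export:
# config_id,first_occurrence,id
cat > /tmp/missing_objects_sample.csv <<'EOF'
5,2024-11-14 10:00:00+00,swh:1:cnt:623c3fa0069d3a40d2901d83183fdc9a8b89bfb5
5,2024-11-19 09:30:00+00,swh:1:dir:033f2aebbae17774b8b7a62539d408bc35d8b86c
EOF

# Count rows per object type: a SWHID is swh:1:<type>:<hash>,
# so the type is the 3rd ":"-separated token of the 3rd ","-field.
by_type=$(awk -F, '{split($3, s, ":"); c[s[3]]++}
                   END {for (t in c) print t, c[t]}' /tmp/missing_objects_sample.csv | sort)
printf '%s\n' "$by_type"
```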
Checking in production the last 10k occurrences of storage-primary-references-directory:
```
ᐅ ./check_missing_object.sh production > last_10k_missing_objects_storage-primary-references-directory_production 2>&1
ᐅ awk '/^Only in/' last_10k_missing_objects_storage-primary-references-directory_production
Only in cassandra 0xeb05e0fd956c4c631f7d7e98939680c2071e789a cnt 2024-09-25
Only in cassandra 0xdfc8655296e1dbeeb7c512ec502331b92b1ab668 cnt 2024-09-25
Only in cassandra 0xf68939343bd9aa3acce85aece2bc7d8debb9e0aa cnt 2024-09-25
Only in cassandra 0xea9730b97f09893aae772c618abd3907aa4a92a1 cnt 2024-09-25
Only in cassandra 0xeedca4d102b017fdfdddc9ac07277ae7ac67896c cnt 2024-09-25
Only in cassandra 0xee66a2aa48d9790c71ef56aa059a56832f808873 cnt 2024-09-25
Only in cassandra 0x5fbac6b9558ef6255e12528367c84948f8fc4308 cnt 2024-09-25
Only in cassandra 0xcc14b234d6076db8c6e7800dab2f05a0afd4b03b cnt 2024-09-25
Only in cassandra 0x3bb7e33352528c519c471cef9eaac9df88f806d9 dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0x623c3fa0069d3a40d2901d83183fdc9a8b89bfb5 cnt 2024-11-14
Only in cassandra 0xcc7cf5895024b1ddd7f69533db7f778836e9abec cnt 2024-09-25
Only in cassandra 0x7e553b1035ddc126a0a80761859f1417f7f7bec7 cnt 2024-09-25
Only in cassandra 0x603efeb7db1824a2a8feef5dd5c960c5f66c5989 cnt 2024-09-25
Only in cassandra 0x41ca986502ccd7fa87534cec87e00bcc3b646395 cnt 2024-09-25
Only in cassandra 0x8937742da07b265a76c781d3c4c2e187fc77efbb dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0x6b1b40331f1d55f01429171966c8717213f5bae1 cnt 2024-09-25
Only in cassandra 0x0cef7f78f0d67a2e2af9b5ec5e97a29557e1992b dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0xa77f0d0092c4b3102d8bf3a1ae1b12c7b478255e cnt 2024-09-25
Only in cassandra 0xf665da6864c4531d96ebacfdf4ca5beceabd9741 cnt 2024-09-25
Only in cassandra 0x17223d31a4d329880430d75f1978cf92f0861936 cnt 2024-09-25
Only in cassandra 0x80b09944856d2d3a75deb43f77d52715689c7df4 cnt 2024-10-26
Only in cassandra 0xdc074d4760983e562739ba16472a3607b011fbfa cnt 2024-09-25
Only in cassandra 0x033f2aebbae17774b8b7a62539d408bc35d8b86c cnt 2024-11-19
Only in cassandra 0xd3dd8ed6fcf10b1021e4139b5ceee38961e42aa5 dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0x1928a679fe330d0d7c7f53b78d1238d9ce977039 dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0x7b1cfb09b837ddd915c81f924cb8a590b98a6cdb dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0x55f74b32a6ae6977ab92699a957084696503cf88 dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0x43279c64392a450cdd7a7df5552d092566b1ce2a cnt 2024-09-25
Only in cassandra 0xb816e7c8496723420595d3e057fbb7aa7f7c5a77 cnt 2024-10-31
Only in cassandra 0x12ef3374dec9a1083bde78e22bed8cb4fb3ae804 cnt 2024-09-25
Only in cassandra 0xe884b280a0b1e1e0613f7cfbd49654f39520fe5d dir <stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="Undefined column name sha256 in table swh.directory"
Only in cassandra 0x547ccb19340801d9451f3b1acbcb8c9c2b4d1fee cnt 2024-09-25
```
```
ᐅ awk '/^ko: both absent/' last_10k_missing_objects_storage-primary-references-directory_production | wc -l
9953
ᐅ awk '/^ko: both absent/{}END{print}' last_10k_missing_objects_storage-primary-references-directory_production
ko: both absent cnt 12a37ea04074cabe2d41eed7d190927dc4ec8d62
ᐅ awk '/^ok: both exist/' last_10k_missing_objects_storage-primary-references-directory_production | wc -l
15
```
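Two awk idioms do the work in the one-liners above: a pattern with no action block prints the matching records (so `| wc -l` counts them), and in the `END` block `$0` still holds the last record read, so `awk '/pat/{} END{print}'` echoes the last line of the file. A tiny self-contained demo (the sample file and values are made up):

```shell
printf '%s\n' \
  'ko: both absent cnt aaa' \
  'ok: both exist cnt bbb' \
  'ko: both absent cnt ccc' > /tmp/demo_status.txt

# A pattern with no action prints the matching lines; wc -l counts them.
n_absent=$(awk '/^ko: both absent/' /tmp/demo_status.txt | wc -l)

# In END, $0 is the last record read from the input.
last=$(awk '/^ko: both absent/{} END {print}' /tmp/demo_status.txt)

echo "$n_absent absent, last input line: $last"
```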
The vast majority of the missing objects are absent from both cassandra and postgresql.
I will fix the script error on retrieving the directory first occurrence date and run the script again on the last 100k occurrences.
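The `Undefined column name sha256 in table swh.directory` errors above come from reusing the content-oriented writetime query for directories, which have no `sha256` column. The fix keeps the `token(sha256)` lookup for contents only and simply terminates the output line for directories. A sketch of that guard, with the real cqlsh call replaced by a hypothetical stub:

```shell
# Hypothetical stand-in for the real cqlsh query:
#   select toDate(writetime(status)/1000) from content where token(sha256) = <token>;
query_content_writetime() { echo "2024-09-25"; }

# Only contents carry a sha256 column to resolve the ingestion date;
# swh.directory does not, so directories get no date.
first_occurrence_date() {
    local type=$1 token=$2
    if [[ "$type" == "cnt" ]]; then
        query_content_writetime "$token"
    else
        echo    # just end the output line for directories
    fi
}

first_occurrence_date cnt 1234567890
first_occurrence_date dir 1234567890
```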
Checking the last 100k occurrences (order by first_occurrence desc limit 100000) of storage-primary-references-directory in the scrubber table missing_object:
```
ᐅ ./check_missing_object.sh production > last_100k_missing_objects_storage-primary-references-directory_production 2>&1
ᐅ awk 'BEGIN{count=0;last_line=""}{if($0 ~ /^Connection error/){count++;last_line=$0}}END{printf "Error: `%s`\nTotal: %s\n",last_line,count}' last_100k_missing_objects_storage-primary-references-directory_production
Error: `Connection error: ('Unable to connect to any servers', {'192.168.100.181:9042': TimeoutError(110, "Tried connecting to [('192.168.100.181', 9042)]. Last error: Connection timed out")})`
Total: 81
```
A very small number of ids were not processed (directories have no ingestion date, and I forgot a line break):
```
ᐅ awk '$0~/Only in cassandra/ && NF>12' last_100k_missing_objects_storage-primary-references-directory_production
Only in cassandra 0x8bfbe7ccac92b1277d0a7ab90cc8d633169228e9 dir Only in cassandra 0x7820373dd5634ce05df6ffd47daa41483f0e9e64 dir Only in cassandra 0x6f734246208126aa489cd1efb3c68b157b758075 dir ko: both absent cnt 1687154f36ed8ff1a74b71668924bb772293619b
Only in cassandra 0xd3dd8ed6fcf10b1021e4139b5ceee38961e42aa5 dir Only in cassandra 0x1928a679fe330d0d7c7f53b78d1238d9ce977039 dir Only in cassandra 0x7b1cfb09b837ddd915c81f924cb8a590b98a6cdb dir Only in cassandra 0x55f74b32a6ae6977ab92699a957084696503cf88 dir ko: both absent cnt 4aad046e3824cfebd96ffa3d0714007747ee6bd9
```
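The `NF > 12` filter works because awk sets `NF` to the number of whitespace-separated fields in the record: a well-formed `Only in cassandra <hash> <type> [date]` record has five or six fields, while a record fused by a missing line break has many more. A demo on fabricated lines:

```shell
printf '%s\n' \
  'Only in cassandra 0xaaa cnt 2024-09-25' \
  'Only in cassandra 0xbbb dir Only in cassandra 0xccc dir ko: both absent cnt ddd' \
  > /tmp/demo_fused.txt

# NF counts whitespace-separated fields: 6 for the normal record,
# 15 for the fused one, so NF > 12 isolates the fused records.
fused=$(awk '$0 ~ /Only in cassandra/ && NF > 12' /tmp/demo_fused.txt)
printf '%s\n' "$fused"
```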
Conclusion
In summary, it seems that the vast majority (99.71%) of the swhids in the scrubber missing_object table for storage-primary-references-directory fall into one of these two cases:
- the object exists in both postgresql and cassandra (43.09%);
- the object is absent from both postgresql and cassandra (56.62%).
Notes
`check_missing_object.sh`
```bash
#!/usr/bin/env bash
set -u

[[ $# == 1 ]] || exit 1
[[ "$1" =~ ^production|staging$ ]] || exit 1

source "$HOME"/cqlshEnv/bin/activate

typeset -r DB_CMD="select id,first_occurrence from missing_object where config_id = 5 order by first_occurrence desc limit 100000;"
typeset -A TYPE
TYPE["cnt"]="content"
TYPE["dir"]="directory"

if [[ "$1" == "production" ]]; then
    typeset -r SCRUBBER_PG_ARGS="dbname=swh-scrubber host=db.internal.softwareheritage.org xxx"
    typeset -r SWH_PG_ARGS="dbname=softwareheritage host=db.internal.softwareheritage.org xxx"
    typeset -r SWH_CASS_ARGS="--cqlshrc=$HOME/.config/swh/prod_cqlshrc"
elif [[ "$1" == "staging" ]]; then
    typeset -r SCRUBBER_PG_ARGS="dbname=swh-scrubber host=db1.internal.staging.swh.network xxx"
    typeset -r SWH_PG_ARGS="dbname=swh host=db1.internal.staging.swh.network xxx"
    typeset -r SWH_CASS_ARGS="--cqlshrc=$HOME/.config/swh/stg_cqlshrc"
fi

HASHES=$(psql -d "$SCRUBBER_PG_ARGS" -c "$DB_CMD" -At | awk -F"|" '{print $1}')

for swhid in $HASHES; do
    type=$(awk -F":" '{print $3}' <<< "$swhid")
    hash=$(awk -F":" '{print $4}' <<< "$swhid")
    if [[ "$type" == "cnt" ]]; then
        pg=$(psql -d "$SWH_PG_ARGS" -c "select sha1_git from ${TYPE[$type]} where sha1_git = '\x${hash}';" -At)
        cs=$(cqlsh "$SWH_CASS_ARGS" -e "select * from content_by_sha1_git where sha1_git = 0x${hash};" | awk '$1~/^0x/')
    fi
    if [[ "$type" == "dir" ]]; then
        pg=$(psql -d "$SWH_PG_ARGS" -c "select id from ${TYPE[$type]} where id = '\x${hash}';" -At)
        cs=$(cqlsh "$SWH_CASS_ARGS" -e "select * from ${TYPE[$type]} where id = 0x${hash};" | awk '$1~/^0x/')
    fi
    if [[ -n "$pg" ]] && [[ -n "$cs" ]]; then
        echo ok: both exist "$type" "$hash"
        continue
    fi
    if [[ -z "$pg" ]] && [[ -z "$cs" ]]; then
        echo ko: both absent "$type" "$hash"
        continue
    fi
    [[ -n "$pg" ]] && echo "Only in postgresql $pg"
    if [[ -n "$cs" ]]; then
        echo -n "Only in cassandra ${cs%%|*}$type "
        token=$(awk -F'|' '{print $2}' <<< "$cs")
        if [[ "$type" == "cnt" ]]; then
            cqlsh "$SWH_CASS_ARGS" -e "select toDate(writetime(status)/1000) from ${TYPE[$type]} where token(sha256) = $token;" | \
                awk 'NF==1 && $1~/^20/ {gsub(" ", "", $0); print}'
        else
            echo
        fi
    fi
done
```
`missing_objects_counters.awk`
```awk
#!/usr/bin/awk -f
# vim: ai ts=4 sts=4 sw=4
BEGIN {
    format = "%-16s %-8s %s\n"
    counters["total"] = 0
}
{
    # print
    counters["total"]++
    if ($0 ~ /^Only in cassandra/) {
        counters["only_in_cs"]++
        if ($6 ~ /^[^20]/) {
            errors_counter["errors"]++
        } else {
            cs_date_counter[$6]++
        }
    }
    if ($0 ~ /^Only in postgresql/) {
        counters["only_in_pg"]++
        if ($6 ~ /^[^20]/) {
            errors_counter["errors"]++
        }
    }
    if ($0 ~ /^Connection error/) {
        errors_counter["errors"]++
    }
    if ($0 ~ /^ok: both exist/) {
        counters["both_exist"]++
    }
    if ($0 ~ /^ko: both absent/) {
        counters["both_absent"]++
    }
}
END {
    "date" | getline date
    close("date")
    printf format, "# cs pg status", "# hits", "# rates"
    printf format, "--------------", "------", "-------"
    format = "%-16s %-8s %.2f\n"
    # sort by description name
    #PROCINFO["sorted_in"] = "@ind_str_asc"
    # sort by counters
    PROCINFO["sorted_in"] = "@val_num_asc"
    for (i in counters)
        printf format, i, counters[i], counters[i] / counters["total"] * 100
    format = "%s\n"
    printf format, "---"
    format = "%-16s %-8s %s\n"
    printf format, "# cs only dates", "# hits", "# rates"
    printf format, "---------------", "------", "-------"
    format = "%-16s %-8s %.2f\n"
    for (i in cs_date_counter)
        printf format, i, cs_date_counter[i], cs_date_counter[i] / counters["only_in_cs"] * 100
    format = "%s\n"
    printf format, "---"
    format = "%-16s %-8s %.2f\n"
    printf format, "errors:", errors_counter["errors"], errors_counter["errors"] / counters["total"] * 100
    printf "---\nReported on %s\n", date
}
```
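A minimal, self-contained run of the same counting logic (simplified: no per-date breakdown, no gawk-specific `PROCINFO` sorting; the four input lines are fabricated):

```shell
result=$(printf '%s\n' \
    'ok: both exist cnt aaa' \
    'ko: both absent cnt bbb' \
    'ko: both absent dir ccc' \
    'Only in cassandra 0xddd cnt 2024-09-25' \
  | awk '
      { counters["total"]++ }
      /^ok: both exist/    { counters["both_exist"]++ }
      /^ko: both absent/   { counters["both_absent"]++ }
      /^Only in cassandra/ { counters["only_in_cs"]++ }
      END {
          # rate = per-status count over total, as a percentage
          for (i in counters)
              printf "%s %d %.2f\n", i, counters[i],
                     counters[i] / counters["total"] * 100
      }' | sort)
printf '%s\n' "$result"
```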
we can conclude that awk is really awesome (as usual).
More seriously, it's good news that the errors are not so frequent.
Maybe it would be worth generating the complete list of contents missing from Cassandra or PostgreSQL to check whether they can be repaired, but that could be done in a separate dedicated issue.
I have no idea whether that is realistic in terms of execution time.
- copying the scripts and config files into the debug pod;
- creating a venv with cqlsh in the debug pod.
Everything seems to work fine:
```
ᐅ kbs exec -ti -n swh-cassandra -c swh-toolbox checking-missing-contents -- bash
swh@checking-missing-contents:~$ check_missing_object staging
ko: both absent dir 09cd2b52b31402b2e34cca97956e932a087a6af5
ko: both absent dir 0d3a983d21acdb0614c47cd12542d74a112a8c74
Only in cassandra cnt 0xa5550dd359146561f0372191c818df603d3c3ff2 2024-10-16
Only in cassandra cnt 0xd126bc8cde3503e55a81452f07f306e8058b5fee 2024-07-24
ko: both absent dir ebe84ddacbca167fdc659ae8cedfc99597fd10bc
ko: both absent cnt 53776a65301dcf3637e8cf75879f2002389cf3fd
ko: both absent dir 8c0ee48e872933185db2fd3e19bb4a69dfd12fc1
Only in cassandra cnt 0xcbc5b755d0fb5c5ca49aed22f57789b5b1e44fb4 2024-11-27
Only in cassandra cnt 0x69dc7e5eac4a18e063860bc9db5bb483e516d7cc 2024-11-14
Only in cassandra cnt 0x2a88e1e4d8068dc2adc0469aea477ed387cd0018 2024-11-14
swh@checking-missing-contents:~$ check_missing_object staging asc
ok: both exist cnt 917f473632e5ba47e613604622b2ab83b6c7ffcb \x917f473632e5ba47e613604622b2ab83b6c7ffcb 0x917f473632e5ba47e613604622b2ab83b6c7ffcb | -2263884508483967155
ok: both exist cnt 003b77527d968c6fa60e12333deb8b75371bf4d5 \x003b77527d968c6fa60e12333deb8b75371bf4d5 0x003b77527d968c6fa60e12333deb8b75371bf4d5 | 7073853444204460170
ok: both exist cnt 3220a2029b15056659440a4d8d1bade024ce8893 \x3220a2029b15056659440a4d8d1bade024ce8893 0x3220a2029b15056659440a4d8d1bade024ce8893 | -7874141961238189960
ok: both exist cnt 3099fb5a937644a1d92f86f07c85b7436a6cde95 \x3099fb5a937644a1d92f86f07c85b7436a6cde95 0x3099fb5a937644a1d92f86f07c85b7436a6cde95 | 3189955565862811696
ok: both exist cnt 3c46c7f7c9a89596b58636d23fad67bd8fe8dd3b \x3c46c7f7c9a89596b58636d23fad67bd8fe8dd3b 0x3c46c7f7c9a89596b58636d23fad67bd8fe8dd3b | 8162213414131792167
ok: both exist cnt 5a32bf57bc347eb9dc8adc32e66befc074a4b069 \x5a32bf57bc347eb9dc8adc32e66befc074a4b069 0x5a32bf57bc347eb9dc8adc32e66befc074a4b069 | 8048195691597345888
ok: both exist cnt 697bbecfd56890d69063019cecc10feb68dd7dad \x697bbecfd56890d69063019cecc10feb68dd7dad 0x697bbecfd56890d69063019cecc10feb68dd7dad | 6516629249060593557
ok: both exist cnt 60622ad1398423cfe0d2aceb7d7c330949896e94 \x60622ad1398423cfe0d2aceb7d7c330949896e94 0x60622ad1398423cfe0d2aceb7d7c330949896e94 | -6895134055073361785
ok: both exist cnt 35592b133028e44781cb9b9bf9b48f0454b21543 \x35592b133028e44781cb9b9bf9b48f0454b21543 0x35592b133028e44781cb9b9bf9b48f0454b21543 | 4807782215291803383
ok: both exist cnt 315d0884829a38016b3a7243df9408ac464efa2b \x315d0884829a38016b3a7243df9408ac464efa2b 0x315d0884829a38016b3a7243df9408ac464efa2b | -1589591585623535194
swh@checking-missing-contents:~$ check_missing_object staging | missing_objects_counters.awk
# status         # hits   # rates
---------        ------   -------
only_in_cs       5        50.00
both_absent      5        50.00
total            10       100.00
# only_in_cs     # hits   # rates
------------     ------   -------
2024-10-16       1        20.00
2024-07-24       4        80.00
# errors         # hits   # rates
---------        ------   -------
Reported on Thu Jan 16 11:51:12 UTC 2025
```
There is a deployment with a ceph-rbd volume, check-missing-contents, in both staging and production.
staging:
```
~ ᐅ kbs get po,deploy,pvc -lapp=check-missing-contents -n swh
NAME                                         READY   STATUS    RESTARTS   AGE
pod/check-missing-contents-ddd9d746f-bp8wq   1/1     Running   0          5d16h

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/check-missing-contents   1/1     1            1           5d16h

NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/cmc-pvc   Bound    pvc-ab5ad099-fe51-4b35-9162-a810b7967d20   5Gi        RWO            ceph-rbd       5d19h
```
production:
```
~ ᐅ kbp get po,deploy,pvc -lapp=check-missing-contents -n swh
NAME                                         READY   STATUS    RESTARTS   AGE
pod/check-missing-contents-ddd9d746f-c7mmj   1/1     Running   0          5d16h

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/check-missing-contents   1/1     1            1           5d16h

NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/cmc-pvc   Bound    pvc-33d8b07a-37b5-4ed6-9ec8-f8f6376b5725   5Gi        RWO            ceph-rbd       5d17h
```
The ceph-rbd volume is mounted on /data.
```
ᐅ kbs get po -lapp=check-missing-contents -n swh -o yaml | \
    yq '.items[].spec.containers[].volumeMounts[] | select(.name == "cmc-persistent-storage")'
mountPath: /data
name: cmc-persistent-storage
```
Scripts and results are on the ceph-rbd volume.
Here are the current results since Tue Jan 21 ~18:00.
staging:
```
ᐅ kbs exec -ti -n swh $(kbs get po -n swh -lapp=check-missing-contents -ojsonpath='{.items[0].metadata.name}') -- \
    /data/swh/missing_objects_counters.awk /data/swh/storage-references-directory_staging_asc_all
# status         # hits   # rates
---------        ------   -------
both_exist       298837   96.13
both_absent      12023    3.87
total            310860   100.00
Reported on Mon Jan 27 11:45:59 UTC 2025
```
production:
```
ᐅ kbp exec -ti -n swh $(kbp get po -n swh -lapp=check-missing-contents -ojsonpath='{.items[0].metadata.name}') -- \
    /data/swh/missing_objects_counters.awk /data/swh/storage-primary-references-directory_production_asc_all
# status         # hits   # rates
---------        ------   -------
both_exist       2347     0.67
only_in_cs       26       0.01
both_absent      348518   99.32
total            350891   100.00
Reported on Mon Jan 27 11:46:03 UTC 2025
```