When retrieving the archives, we checked for size and md5.
This task is about checking the archives' contents, each of which is either an svndump, a git repository, or an hg repository.
About 120k checks done so far.
It's rather slow, around 1.1 jobs/s.
|-------------------------------+----------------|
| date-snapshot                 | messages_ready |
|-------------------------------+----------------|
| Thu May 05 23:32:35 CEST 2016 |        1302268 |
| Fri May 06 10:47:07 CEST 2016 |        1258667 |

#+BEGIN_SRC lisp
(let ((speed (swh-worker-average-speed-per-second
              "Thu May 05 23:32:35 CEST 2016" 1302268
              "Fri May 06 10:47:07 CEST 2016" 1258667)) ;; 1.0773127100217434 j/s
      (remaining-jobs 1258667))
  (swh-worker-remains-in-days speed remaining-jobs))
;; 13.522447992188424 remaining days
#+END_SRC
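The two helper functions called above are not shown here, so the following is only a sketch of the arithmetic they presumably perform, written in Python. The function names `parse_snapshot`, `average_speed_per_second` and `remaining_days` are illustrative, not the actual swh helpers. The timezone token is simply dropped, which is an assumption that only holds because both snapshots are in CEST.

```python
from datetime import datetime


def parse_snapshot(stamp: str) -> datetime:
    """Parse a `date`-style stamp such as 'Thu May 05 23:32:35 CEST 2016'.

    Assumption: the timezone token is ignored. This is only safe when both
    snapshots share the same zone, so their difference is unaffected.
    """
    _weekday, month, day, clock, _tz, year = stamp.split()
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def average_speed_per_second(start: str, ready_start: int,
                             end: str, ready_end: int) -> float:
    """Jobs consumed from the queue per second between two snapshots."""
    elapsed = (parse_snapshot(end) - parse_snapshot(start)).total_seconds()
    return (ready_start - ready_end) / elapsed


def remaining_days(speed: float, remaining_jobs: int) -> float:
    """Days needed to drain `remaining_jobs` at `speed` jobs per second."""
    return remaining_jobs / speed / 86400  # 86400 seconds in a day


speed = average_speed_per_second(
    "Thu May 05 23:32:35 CEST 2016", 1302268,
    "Fri May 06 10:47:07 CEST 2016", 1258667,
)
print(speed)                           # about 1.077 jobs/s
print(remaining_days(speed, 1258667))  # about 13.5 days
```

In other words, 43601 jobs were consumed in 40472 seconds, and draining the remaining 1258667 jobs at that rate takes about two weeks.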
On this sample, there were only 40 errors (which I have not yet analyzed).
psql -c "select level, message from log where src_host='worker01.softwareheritage.org' and ts between '2016-05-04 18:00:00.00+01' and '2016-05-06 10:55:00.00+01' and level = 'error';" service=swh-log > swh-fetcher-googlecode-checks-in-errors-between-04-and-06-may-2016
ardumont@worker01:~$ grep -c FAILURE swh-fetcher-googlecode-checks-in-errors-between-04-and-06-may-2016
40
As this won't complete in the time frame we have left, and I forgot to randomize the sample (duh!), I purged the current queue and rescheduled a complete, randomized sample.
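Randomizing matters here because a run that is cut short over a shuffled queue still covers a representative sample of the archives. A minimal sketch of the idea in Python, assuming the archives are available as a list of names and that `schedule` is some callable that enqueues one check task. Both `schedule_randomized` and `schedule` are hypothetical names, not the actual swh scheduling code.

```python
import random
from typing import Callable, Iterable, List, Optional


def schedule_randomized(archives: Iterable[str],
                        schedule: Callable[[str], None],
                        seed: Optional[int] = None) -> List[str]:
    """Shuffle the archives, then enqueue each one in the shuffled order.

    `schedule` is a stand-in for whatever actually sends a check task to
    the queue. Passing a `seed` makes the order reproducible.
    """
    shuffled = list(archives)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # in-place shuffle with a private RNG
    for archive in shuffled:
        schedule(archive)
    return shuffled


# Usage with an in-memory list standing in for the real queue.
queue: List[str] = []
order = schedule_randomized(["a.svndump", "b.git", "c.hg"], queue.append, seed=0)
```

The returned list records the order in which tasks were enqueued, which can be handy when comparing a partial run against the full set.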
Only 4132 out of 1379346 files had errors during the checks (~0.30%).
Checking some of them manually produced no errors.
It is possible the worker ran out of disk space or memory during the checks (for example, if too many concurrent tasks were run).
So those were rescheduled for checking (with lower concurrency this time).
Looking at those checks in the logs (worker01), I see no errors so far either.