nixguix: fails to finish as downloading artifacts step hangs

Another issue: sometimes the worker just hangs forever. For example, a nixguix process (running on worker0.internal.staging.swh.network) is currently stuck on a download connection [1].

The only solution I see is to kill the process, which leaves the visit unfinished (stuck in the ongoing state). This lends credit to the origin visit reaper proposal, by the way (T2310#43199).

Adding a timeout to the download connection sounds sensible [2] to avoid this kind of pitfall [3]. Quoting the requests documentation [2]: "Failure to do so can cause your program to hang indefinitely". Well, we had been warned :D
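
To illustrate the failure mode and the fix, here is a minimal sketch, not the loader's actual code. It uses only the Python standard library (http.client rather than requests) so it runs without network access; with requests, the analogous knob is the `timeout` argument of `requests.get`. The names `start_silent_server` and `fetch` are hypothetical, and the local listener only stands in for a remote host that accepts the connection and then never sends a byte, as in [1].

```python
import http.client
import socket
import time
from urllib.parse import urlsplit


def start_silent_server():
    """Open a local TCP listener that never answers (hypothetical test helper)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    # The kernel completes the TCP handshake for queued connections, so a
    # client sees an established connection even though nothing is ever sent.
    server.listen(1)
    return server, server.getsockname()[1]


def fetch(url, timeout):
    """Return the body of `url`, or None if a socket operation blocks longer
    than `timeout` seconds (hypothetical helper, not the loader's code)."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        return conn.getresponse().read()
    except socket.timeout:
        # Without a timeout, getresponse() would block in recv forever.
        return None
    finally:
        conn.close()


if __name__ == "__main__":
    server, port = start_silent_server()
    started = time.monotonic()
    body = fetch(f"http://127.0.0.1:{port}/artifact.tar.gz", timeout=0.5)
    print("body:", body, "elapsed: %.1fs" % (time.monotonic() - started))
    server.close()
```

Note that such a timeout bounds each blocking socket operation, not the download as a whole: a server that keeps trickling bytes would not trigger it. It would still be enough for the case here, where the peer stays completely silent.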

Note: This probably affects the other package loaders too. Right now it is more obvious with this one because it processes a lot of artifacts in one run.

  • [1]

Last log entry as of now:

Apr 09 17:37:09 worker0 python3[1914]: [2020-04-09 17:37:09,838: DEBUG/ForkPoolWorker-1] package_info: {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'raw': {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'integrity': 'sha256-+EkmHcfJfvHxXyIulVsNPa+ZTsE8nbd2bxrH53uqQEI='}}

Running strace on the process shows it is currently waiting on file descriptor 95:

# strace -p 1914
strace: Process 1914 attached
recvfrom(95,

That file descriptor points to a socket:

# file /proc/1914/fd/95
/proc/1914/fd/95: symbolic link to socket:[74794390]

Indeed, it's stuck on the HTTP connection:

root@worker0:~# lsof -p 1914 | grep 74794390
python3 1914 swhworker   95u     IPv4           74794390      0t0      TCP worker0.internal.staging.swh.network:58952->hx-xfer-prod.ebi.ac.uk:http (ESTABLISHED)

Migrated from T2357 (view on Phabricator)
