Here are some tests from my laptop and from the toolbox pod in the staging cluster:
swh@swh-toolbox-b78d9ff5f-qkrtl:~$ git clone https://sourceware.org/git/automake.git
Cloning into 'automake'...
fatal: unable to access 'https://sourceware.org/git/automake.git/': Failed to connect to sourceware.org port 443: Connection refused
swh@swh-toolbox-b78d9ff5f-qkrtl:~$ curl -I https://sourceware.org/git/
curl: (7) Failed to connect to sourceware.org port 443: Connection refused
swh@swh-toolbox-66489dbb5f-b76vw:~$ swh scheduler -C$SWH_CONFIG_FILENAME origin check-listed-origins gitweb sourceware.org -l
url                                      last_seen                         last_update
-------------------------------------------------------------------------------------------------------
https://sourceware.org/git/abidb.git     2024-04-03 07:56:26.305784+00:00  2024-02-14 07:55:58.782517+00:00
[...]
https://sourceware.org/git/valgrind.git  2024-04-03 07:56:26.305784+00:00  2024-04-03 06:33:26.292161+00:00
Forge sourceware.org (gitweb) has 70 listed origins in the scheduler database.
Schedule the first ingestion:
swh@swh-toolbox-66489dbb5f-b76vw:~$ swh scheduler -C$SWH_CONFIG_FILENAME \
    add-forge-now --preset production \
    schedule-first-visits \
    --type-name git \
    --lister-name gitweb \
    --lister-instance-name sourceware.org
10000 slots available in celery queue
70 visits to send to celery
After a lot of save-code-now requests, the first ingestion is complete:
swh@swh-toolbox-66489dbb5f-b76vw:~$ swh scheduler -C$SWH_CONFIG_FILENAME origin check-ingested-origins gitweb sourceware.org
Forge sourceware.org (gitweb) has 70 scheduled ingests in the scheduler.
failed      : 2
None        : 0
not_found   : 0
successful  : 68
total       : 70
success rate: 97.14%
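For reference, the reported success rate is consistent with successful visits divided by the total. A quick check, assuming that is how the tool derives it (the dictionary below just copies the counts printed above and is not an swh data structure):

```python
# Counts copied from the check-ingested-origins output above.
counts = {"failed": 2, "None": 0, "not_found": 0, "successful": 68}

# Assumption: success rate = successful / total, expressed as a percentage.
total = sum(counts.values())
rate = 100 * counts["successful"] / total

summary = f"success rate: {rate:.2f}%"
print(total, summary)
```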
@guillaume some feedback from IRC that needs to be addressed:
The connection refused errors seem to come from a combination of apache-modqos and fail2ban. modqos delays parallel requests, so when SWH tries to download really large repos (abidb.git and bunsendb.git) over HTTP, the transfer takes too long, which triggers fail2ban and causes the connection refused errors.
abidb.git (large dumps of Linux distro ABI info) and bunsendb.git (testsuite log archive) contain no code, so they are not interesting to SWH. They are also very large: attempts to archive them fail because of their size and trigger the connection refused errors. Both should therefore be left out of the archive.
Could you block these from the ingestion process? This needs doing for both the sourceware cgit and gitweb instances.
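To make the request concrete, here is a minimal sketch of the kind of exclusion meant. It is not the swh scheduler or lister API: `EXCLUDED_REPOS`, `is_excluded` and `filter_origins` are hypothetical names, and the bunsendb.git URL is assumed to follow the same pattern as the listed gitweb origins. Matching on the last path component means the same rule would apply to the cgit and gitweb instances whatever their URL prefix is.

```python
from urllib.parse import urlparse

# Repository names taken from the IRC feedback above.
EXCLUDED_REPOS = {"abidb.git", "bunsendb.git"}


def is_excluded(origin_url: str) -> bool:
    """Return True if the last path component of the URL is an excluded repo."""
    path = urlparse(origin_url).path.rstrip("/")
    return path.rsplit("/", 1)[-1] in EXCLUDED_REPOS


def filter_origins(origin_urls):
    """Keep only the origins that are not in the exclusion set."""
    return [url for url in origin_urls if not is_excluded(url)]


origins = [
    "https://sourceware.org/git/abidb.git",
    "https://sourceware.org/git/bunsendb.git",  # assumed URL, same pattern as above
    "https://sourceware.org/git/valgrind.git",
]
kept = filter_origins(origins)
print(kept)
```

Where such a filter would actually live (lister configuration, scheduler, or loader) is up to whoever implements the block.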