This lister scrapes instances of gitweb.
It retrieves the list of origins from the main HTML page, then issues one extra request per repository found, to that repository's summary page, to extract the metadata_url if present.
Some instances need instance-specific heuristics.
Pagination is not handled, as no instance was found that uses it.
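The two-step flow described above (collect summary links from the index page, then read the metadata_url from each summary page) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the lister's actual code: the names `SummaryLinkParser`, `MetadataUrlParser`, `list_summary_urls` and `extract_metadata_urls` are hypothetical, and it assumes that summary links carry an `a=summary` query parameter and that the clone URL sits in the second cell of a `<tr class="metadata_url">` row. HTTP fetching is left out on purpose.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SummaryLinkParser(HTMLParser):
    """Collect hrefs of anchors that point at a repository summary page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Assumption: summary links contain an "a=summary" query parameter.
        if "a=summary" in href and href not in self.links:
            self.links.append(href)


class MetadataUrlParser(HTMLParser):
    """Collect the second cell of each <tr class="metadata_url"> row."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._in_row = False
        self._cells_seen = 0
        self._capturing = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            self._in_row = "metadata_url" in classes
            self._cells_seen = 0
        elif tag == "td" and self._in_row:
            self._cells_seen += 1
            # Assumption: first cell is the label, second cell is the URL.
            if self._cells_seen == 2:
                self._capturing = True
                self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._capturing:
            self.urls.append("".join(self._buffer).strip())
            self._capturing = False
        elif tag == "tr":
            self._in_row = False


def list_summary_urls(index_html, base_url):
    """Return absolute summary-page URLs found in the index page HTML."""
    parser = SummaryLinkParser()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.links]


def extract_metadata_urls(summary_html):
    """Return the metadata_url values found in a summary page, if any."""
    parser = MetadataUrlParser()
    parser.feed(summary_html)
    return parser.urls
```

A caller would fetch the index page once, call `list_summary_urls` on it, then fetch each returned URL and pass the body to `extract_metadata_urls`; an empty result corresponds to the "if present" case above.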
Refs. #1800 (moved)
Supersedes !464 (closed)
[1]
2023-07-10 15:40:23 swh-scheduler@localhost:5433 λ select instance_name, count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where lister_id in (select id from listers where name='gitweb') group by instance_name;
+-----------------------------+-------+
| instance_name               | count |
+-----------------------------+-------+
| geeqie.org                  | 2     |
| git.distorted.org.uk        | 132   |
| git.erp5.org                | 35    |
| git.ffmpeg.org              | 10    |
| git.freespeechextremist.com | 12    |
| git.postgresql.org          | 80    |
| git.shadowcat.co.uk         | 256   |
| git.ti.com                  | 527   |
| gitweb.dragonflybsd.org     | 11    |
| gitweb.hugovil.com          | 12    |
| osm.etsi.org                | 27    |
| sourceware.org              | 67    |
+-----------------------------+-------+
(12 rows)
Time: 1.574 ms
Note: for each of those instances, I successfully cloned one randomly chosen origin.