
Add Gitweb lister

This lister scrapes instances of gitweb.

The lister retrieves the list of origins from the instance's main HTML page, then issues one extra request per repository found, to its summary page, to extract the metadata_url if present.
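A minimal sketch of the first step, assuming repository entries on the main page are anchors whose href contains `a=summary` (an assumption about the gitweb markup, not taken from this merge request). The names `SummaryLinkParser` and `list_summary_urls` are hypothetical, and the sketch uses only the standard library rather than whatever HTML parser the lister actually relies on:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SummaryLinkParser(HTMLParser):
    """Collect hrefs of anchors that look like gitweb summary links."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: summary links carry "a=summary" in their query string.
        if href and "a=summary" in href and href not in self.hrefs:
            self.hrefs.append(href)


def list_summary_urls(instance_url, html):
    """Return absolute summary page URLs found in the main page HTML."""
    parser = SummaryLinkParser()
    parser.feed(html)
    # Links may be relative, so resolve them against the instance URL.
    return [urljoin(instance_url, href) for href in parser.hrefs]
```

Each returned URL would then be fetched once to look for the metadata_url on the summary page.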

The lister applies instance-specific heuristics. Some instances:

  • have summary pages which do not list a metadata_url (so the lister computes cloneable git:// origins instead)
  • have summary pages which reference metadata_url as multiple comma-separated URLs
  • list relative URLs for repositories, so they need to be joined with the main instance URL to get a complete cloneable origin (or summary page) URL
  • list "down" http origins (cloning those won't work), so the lister lists them as cloneable https ones instead (when the main URL is behind https).
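The last three heuristics above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this merge request: `candidate_origins` is a hypothetical helper, and the computation of git:// origins for instances without a metadata_url is not covered.

```python
from urllib.parse import urljoin, urlparse


def candidate_origins(instance_url, raw_value):
    """Turn a raw URL field from a summary page into cloneable URL candidates."""
    instance_scheme = urlparse(instance_url).scheme
    origins = []
    # The field may hold several comma-separated URLs.
    for part in raw_value.split(","):
        url = part.strip()
        if not url:
            continue
        # Relative URLs are resolved against the main instance URL;
        # absolute URLs are returned unchanged by urljoin.
        url = urljoin(instance_url, url)
        parsed = urlparse(url)
        # "Down" http origins are listed as https when the instance is behind https.
        if parsed.scheme == "http" and instance_scheme == "https":
            url = parsed._replace(scheme="https").geturl()
        origins.append(url)
    return origins
```

For example, `candidate_origins("https://git.example.org/", "foo/bar.git")` resolves the relative path against the (hypothetical) instance URL.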

Pagination is not handled, as no instance was found that uses it.

Refs. #1800 (moved)

Supersedes !464 (closed)

[1]

2023-07-10 15:40:23 swh-scheduler@localhost:5433 λ select instance_name, count(*) from listed_origins lo inner join listers l on l.id=lo.lister_id where lister_id in (select id from listers where name='gitweb') group by instance_name;
+-----------------------------+-------+
|        instance_name        | count |
+-----------------------------+-------+
| geeqie.org                  |     2 |
| git.distorted.org.uk        |   132 |
| git.erp5.org                |    35 |
| git.ffmpeg.org              |    10 |
| git.freespeechextremist.com |    12 |
| git.postgresql.org          |    80 |
| git.shadowcat.co.uk         |   256 |
| git.ti.com                  |   527 |
| gitweb.dragonflybsd.org     |    11 |
| gitweb.hugovil.com          |    12 |
| osm.etsi.org                |    27 |
| sourceware.org              |    67 |
+-----------------------------+-------+
(12 rows)

Time: 1.574 ms

Note: for each of those instances, I successfully cloned one randomly chosen origin.

Edited by Antoine R. Dumont
