gitweb: Check git URLs can be cloned before creating loading tasks
Git URLs scraped from gitweb pages, or derived from project URLs, are now checked for clonability before being sent to the scheduler.
This makes it possible to discard invalid clone URLs, as there are cases where only one of the URLs listed by a gitweb project can actually be cloned.
This slightly slows down the listing process but gives better results.
Fixes #4713.
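For context, a minimal sketch of what such a clonability check could look like, assuming a plain `git ls-remote` subprocess call (the helper name and its wiring are illustrative, not the actual lister code):

```python
import os
import subprocess


def is_clonable(url: str, timeout: int = 30) -> bool:
    """Return True if `git ls-remote` succeeds on the URL.

    A successful ls-remote means the remote speaks the git protocol,
    which is a cheap proxy for clonability: no objects are fetched.
    """
    try:
        result = subprocess.run(
            ["git", "ls-remote", url],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            # Never hang on a credentials prompt for private repositories.
            env={**os.environ, "GIT_TERMINAL_PROMPT": "0"},
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Keep only the clone URLs that actually answer, e.g.:
urls = ["https://example.org/repo.git", "git://example.org/repo.git"]
clonable = [u for u in urls if is_clonable(u)]
```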
I think it's more than a bit unfriendly to always be doing a ls_remote on all repositories at the listing stage, as it would generate quite some load.

I think the initial request would have been taken care of by `base_git_url`, but the lister doesn't know what to do about that when gitweb is set up without rewrites (that is, when the links to projects are ?p=XXXX URLs). If the URL pattern on the project list is unknown, we could at least have the lister fall back to using the /detected/ URLs that match `base_git_url` rather than ignore the setting altogether.
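A minimal sketch of that fallback, with illustrative names (this is not the actual lister code): prefer a detected clone URL matching the configured `base_git_url`, and only otherwise keep the detected URLs as-is.

```python
def pick_clone_url(detected_urls: list[str], base_git_url: str | None) -> str | None:
    """Prefer the detected clone URL matching the configured base.

    Falls back to the first detected URL when no base is configured
    or none of the detected URLs starts with it.
    """
    if base_git_url:
        for url in detected_urls:
            if url.startswith(base_git_url):
                return url
    return detected_urls[0] if detected_urls else None
```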
> I think it's more than a bit unfriendly to always be doing a ls_remote on all repositories at the listing stage, as it would generate quite some load.

Well, we are already scraping the gitweb pages, so attempting to list the remotes is just as unfriendly from my point of view.
> I think the initial request would have been taken care of by `base_git_url`, but the lister doesn't know what to do about that when gitweb is set up without rewrites (that is, when the links to projects are ?p=XXXX URLs). If the URL pattern on the project list is unknown, we could at least have the lister fall back to using the /detected/ URLs that match `base_git_url` rather than ignore the setting altogether.

Ack, will also push a merge request using that approach.
No, we really scrape all repository home pages, as the clone URLs are displayed there. What we must do when providing `base_git_url` is to return the clone URL that starts with it.

Weird. I was convinced that the point of `base_git_url` was to generate URLs from the project list page directly, without scraping the project pages individually, by grabbing the project relative URLs from links in the listing page and appending them to `base_git_url`. Did I misremember, or did this feature get lost along the way? Scraping individual project pages at all times is a bigger problem, IMO.
The original idea (I had, at least) for `base_git_url` was never to work in all cases; it was to work in the 90+% of cases where there /is/ an easy, regular pattern to map between project page URLs and clone URLs, and where it's therefore unnecessary to hit any of the project pages.

There were some cases with two different patterns (some cgit instances had user repositories in a different namespace than the base namespace), which made it ineffective.
And now that I write this, I notice that the behavior I'm thinking of was implemented in the cgit lister but not in the gitweb lister. Duh.
So I guess it becomes a question of whether the cgit lister behavior can be adopted by the gitweb lister.
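For reference, the cgit-style behavior being described could look roughly like this, a sketch under the assumption that the listing page links projects with relative paths (names are illustrative):

```python
from urllib.parse import urljoin


def build_clone_url(base_git_url: str, project_path: str) -> str:
    """Append a project's relative path to the configured base_git_url,
    avoiding any scraping of the individual project page."""
    # Ensure a trailing slash so urljoin appends rather than replaces.
    if not base_git_url.endswith("/"):
        base_git_url += "/"
    return urljoin(base_git_url, project_path.lstrip("/"))


# e.g. build_clone_url("https://git.example.org/git", "user/repo.git")
# -> "https://git.example.org/git/user/repo.git"
```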
I think it would be nice to automatically discover the clone URL pattern by fetching a single project page, and fall back to using `base_git_url` when clone URLs are not displayed in project pages (such cases exist in the wild). I created #4714 to keep track of this for gitweb, but the same processing could also be adapted for the cgit case.
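A rough sketch of what that discovery step might look like (illustrative only; it assumes the clone URLs scraped from one sample project page are available as a list):

```python
def discover_clone_url_base(
    sample_project_name: str,
    sample_page_clone_urls: list[str],
) -> str | None:
    """Derive a clone URL prefix from a single scraped project page.

    If one of the scraped clone URLs ends with the project name, the
    part before it can serve as a base for all other projects, making
    further per-project scraping unnecessary.
    """
    for url in sample_page_clone_urls:
        if url.endswith(sample_project_name):
            return url[: -len(sample_project_name)]
    return None  # no regular pattern found, fall back to base_git_url


# e.g. with project "foo.git" exposing
# ["https://git.example.org/git/foo.git"], the discovered base is
# "https://git.example.org/git/", so "bar.git" maps to
# "https://git.example.org/git/bar.git" without scraping its page.
```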
Closing this in favor of !552.