
gitweb: Check git URLs can be cloned before creating loading tasks

Closed Antoine Lambert requested to merge anlambert/swh-lister:gitweb-check-clone-urls into master
1 unresolved thread

Git URLs scraped from gitweb pages or derived from project URLs are now checked for clonability before being sent to the scheduler.

This makes it possible to discard invalid clone URLs, since there are cases where only one of the URLs listed by a gitweb project can actually be cloned.

This slightly slows down the listing process but ensures better results.

Fixes #4713.
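
For illustration, a clonability check of the kind described above could look roughly like the sketch below. It is not the code from this merge request: the function names `is_clonable` and `filter_clonable`, the use of the `git` command line through `subprocess`, and the 30 second timeout are all assumptions made for the example.

```python
import os
import subprocess
from typing import Callable, Iterable, List


def is_clonable(url: str, timeout: float = 30.0) -> bool:
    """Return True if `git ls-remote` manages to list refs at `url`.

    Hypothetical helper: it shells out to the git CLI, which is only one
    possible way of probing a remote.
    """
    # Disable interactive credential prompts so a private repo fails fast.
    env = {**os.environ, "GIT_TERMINAL_PROMPT": "0"}
    try:
        result = subprocess.run(
            ["git", "ls-remote", "--quiet", url],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            env=env,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the git executable is not available.
        return False
    return result.returncode == 0


def filter_clonable(
    urls: Iterable[str], check: Callable[[str], bool] = is_clonable
) -> List[str]:
    """Keep only the URLs accepted by `check`, preserving their order."""
    return [url for url in urls if check(url)]
```

Each call to `is_clonable` contacts the remote server once per candidate URL, which is the extra load discussed in the thread below.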

Activity

  • Jenkins job DLS/gitlab-builds #456 succeeded in 3 min 39 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • Antoine Lambert mentioned in issue #4713

  • I think it's more than a bit unfriendly to always be doing a ls_remote on all repositories at the listing stage, as it would generate quite some load.

    I think the initial request would have been taken care of by base_git_url, but the lister doesn't know what to do about that when gitweb is set up without rewrites (that is, when the links to projects are ?p=XXXX URLs). If the url pattern on the project list is unknown, we could at least have the lister fall back to using the /detected/ URLs that match base_git_url rather than ignore the setting altogether.

    Edited by Nicolas Dandrimont
    • Author Maintainer

      I think it's more than a bit unfriendly to always be doing a ls_remote on all repositories at the listing stage, as it would generate quite some load.

      Well, we are already scraping the gitweb pages, so attempting to list the remotes is just as unfriendly, from my point of view.

      I think the initial request would have been taken care of by base_git_url, but the lister doesn't know what to do about that when gitweb is set up without rewrites (that is, when the links to projects are ?p=XXXX URLs). If the url pattern on the project list is unknown, we could at least have the lister fall back to using the /detected/ URLs that match base_git_url rather than ignore the setting altogether.

      Ack, will also push a merge request using that approach.

    • We're only scraping the pages when base_git_url is ineffective, so we should really make sure it works in most cases (and we use it in as many places as possible).

    • Author Maintainer

      No, we really do scrape all repository home pages, as the clone URLs are displayed there.

      What we must do when base_git_url is provided is return the clone URL that starts with it.

    • Weird. I was convinced that the point of base_git_url was to generate URLs from the project list page directly without scraping the project pages individually, by grabbing the project relative URLs from links in the listing page, and appending them to base_git_url. Did I misremember, or did this feature get lost along the way?

      Scraping individual project pages at all times is a bigger problem, IMO.

    • Author Maintainer

      Well, finding the clone URL from a project name is not always straightforward, so it is better to extract it from the project page.

      But we could indeed fetch a single project page and find the clone URL patterns instead of fetching all project pages.

    • The original idea (I had, at least) for base_git_url was never to work in all cases; it was to work in the 90+% of cases where there /is/ an easy, regular pattern to map between project page URLs and clone URLs, and where it's therefore unnecessary to hit any of the project pages.

      There were some cases where there are two different patterns (some cgit instance where user repositories were in a different namespace than the base namespace), which made it ineffective.

      And now that I write this, I notice that the behavior I'm thinking of was implemented in the cgit lister but not in the gitweb lister. Duh.

      So I guess it becomes a question of whether the cgit lister behavior can be adopted by the gitweb lister.

    • Author Maintainer

      I think it would be nice to automatically discover the clone URL pattern by fetching a single project page, and to fall back to base_git_url when clone URLs are not displayed in project pages (such cases exist in the wild). I created #4714 to keep track of this for gitweb, but the same processing could also be adapted for the cgit case.

      Until then !552 fixes the issue observed in #4713.

  • Author Maintainer

    Closing this in favor of !552.
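
The approach discussed in the thread above, returning the detected clone URL that starts with base_git_url, can be sketched as follows. This is only an illustration under stated assumptions: `pick_clone_url` is a hypothetical name, and falling back to the first detected URL when nothing matches is a choice made for the example, not something confirmed by the thread or by !552.

```python
from typing import Optional, Sequence


def pick_clone_url(
    detected_urls: Sequence[str], base_git_url: Optional[str] = None
) -> Optional[str]:
    """Choose one clone URL among those scraped from a project page.

    Hypothetical helper: prefers a URL located under `base_git_url`,
    otherwise falls back to the first detected URL (assumed policy).
    """
    if not detected_urls:
        return None
    if base_git_url:
        # Normalize the trailing slash so ".../git" does not match ".../gitlab".
        prefix = base_git_url.rstrip("/") + "/"
        for url in detected_urls:
            if url.startswith(prefix):
                return url
    return detected_urls[0]
```

With this kind of selection, no network request beyond the page scrape is needed to decide between several advertised URLs.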

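
The idea tracked in #4714, discovering the clone URL pattern from a single project page and reusing it for the other projects, might look roughly like the sketch below. The function names, the `{project}` placeholder and the plain substring matching are assumptions made for illustration; as noted in the thread, an instance with two different URL patterns would defeat a single inferred template.

```python
from typing import Optional

PLACEHOLDER = "{project}"  # hypothetical marker for the project path


def infer_clone_url_template(project_name: str, clone_url: str) -> Optional[str]:
    """Derive a template from one (project name, clone URL) sample.

    Returns None when the project name does not appear in the clone URL,
    in which case no regular pattern can be inferred from this sample.
    """
    if not project_name:
        return None
    # Use the last occurrence, since the project path normally ends the URL.
    idx = clone_url.rfind(project_name)
    if idx == -1:
        return None
    return clone_url[:idx] + PLACEHOLDER + clone_url[idx + len(project_name):]


def apply_template(template: str, project_name: str) -> str:
    """Build the clone URL of another project from an inferred template."""
    return template.replace(PLACEHOLDER, project_name)
```

When `infer_clone_url_template` returns None, a lister following this idea would have to fall back to another strategy, such as base_git_url or scraping each project page.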