Skip to content
Snippets Groups Projects

launchpad: Ignore erratic page and continue listing

1 unresolved thread

In case of all retry tryouts failed, finally ignore that page and continue listing the next pages.

Related to #3948 (closed)

Test Plan

tox

Applied on staging and triggered the listing. That actually allowed the listing to finish properly.

Feb 17 17:40:37 worker1 python3[2168786]: [2022-02-17 17:40:37,571: INFO/MainProcess] Received task: swh.lister.gitlab.tasks.IncrementalGitLabLister[496b82a5-6e7a-4e39-841c-e9158334fc1e]
Feb 17 17:40:37 worker1 python3[2168796]: [2022-02-17 17:40:37,796: INFO/ForkPoolWorker-4] Task swh.lister.gitlab.tasks.IncrementalGitLabLister[496b82a5-6e7a-4e39-841c-e9158334fc1e] succeeded in 0.19229740649461746s: {'pages': 1, 'origins': 0}

It finished:

Feb 17 20:36:17 worker1 python3[2168793]: [2022-02-17 20:36:17,861: INFO/ForkPoolWorker-1] Task swh.lister.launchpad.tasks.FullLaunchpadLister[d8bb1c71-4c2c-406c-9431-f358cdca5ec5] succeeded in 10058.709554322995s: {'pages': 2, 'origins':
216789}

Migrated from D7198 (view on Phabricator)

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
170 origin_url,
171 vcs_type,
172 last_update,
173 )
170 174
171 yield ListedOrigin(
172 lister_id=self.lister_obj.id,
173 visit_type=vcs_type,
174 url=origin_url,
175 last_update=last_update,
176 )
175 yield ListedOrigin(
176 lister_id=self.lister_obj.id,
177 visit_type=vcs_type,
178 url=origin_url,
179 last_update=last_update,
  • I misunderstood what is the repos object in the code so we did not use the throttling_retry decorator the right way.

    repos is an instance of lazr.restfulclient.resource import Collection which implements an iterator fetching pages from the launchpad REST API on a regular basis.

    What we have to retry here is the next operation on the iterator as that call can raise a RestfulError exception.

    But you are right, if after the max attempts of retry we still get an exception, we have to stop the listing of the current page.

    Below is the diff about how I manage to implement what it is described above for you to get the idea:

    diff --git a/swh/lister/launchpad/lister.py b/swh/lister/launchpad/lister.py
    index 722b299..ccc2378 100644
    --- a/swh/lister/launchpad/lister.py
    +++ b/swh/lister/launchpad/lister.py
    @@ -11,7 +11,8 @@ from typing import Any, Dict, Iterator, Optional, Tuple
     import iso8601
     from launchpadlib.launchpad import Launchpad
     from lazr.restfulclient.errors import RestfulError
    -from lazr.restfulclient.resource import Collection
    +from lazr.restfulclient.resource import Collection, Resource
    +from tenacity.before_sleep import before_sleep_log
     
     from swh.lister.utils import retry_if_exception, throttling_retry
     from swh.scheduler.interface import SchedulerInterface
    @@ -99,10 +100,13 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]):
                         d[attribute_name] = date_last_modified.isoformat()
             return d
     
    -    @throttling_retry(retry=retry_if_restful_error)
    +    @throttling_retry(
    +        retry=retry_if_restful_error,
    +        before_sleep=before_sleep_log(logger, logging.WARNING),
    +    )
         def _page_request(
             self, launchpad, vcs_type: str, date_last_modified: Optional[datetime]
    -    ) -> Optional[Collection]:
    +    ) -> Collection:
             """Querying the page of results for a given vcs_type since the date_last_modified. If
             some issues occurs, this will deal with the retrying policy.
     
    @@ -141,7 +145,13 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]):
                     continue
                 yield vcs_type, result
     
    -    @throttling_retry(retry=retry_if_restful_error)
    +    @throttling_retry(
    +        retry=retry_if_restful_error,
    +        before_sleep=before_sleep_log(logger, logging.WARNING),
    +    )
    +    def get_next_repo(self, repos_it: Iterator[Resource]) -> Resource:
    +        return next(repos_it)
    +
         def get_origins_from_page(self, page: LaunchpadPageType) -> Iterator[ListedOrigin]:
             """
             Iterate on all git repositories and yield ListedOrigin instances.
    @@ -149,9 +159,10 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]):
             assert self.lister_obj.id is not None
     
             vcs_type, repos = page
    -
    +        repos_it = iter(repos)
             try:
    -            for repo in repos:
    +            while True:
    +                repo = self.get_next_repo(repos_it)
                     origin_url = origin(vcs_type, repo)
     
                     # filter out origins with invalid URL
    @@ -175,6 +186,8 @@ class LaunchpadLister(Lister[LaunchpadListerState, LaunchpadPageType]):
                         url=origin_url,
                         last_update=last_update,
                     )
    +        except StopIteration:
    +            pass
             except RestfulError as e:
                 logger.warning("Listing %s origins raised %s", vcs_type, e)
     
    diff --git a/swh/lister/launchpad/tests/test_lister.py b/swh/lister/launchpad/tests/test_lister.py
    index 59fe605..12c3095 100644
    --- a/swh/lister/launchpad/tests/test_lister.py
    +++ b/swh/lister/launchpad/tests/test_lister.py
    @@ -26,18 +26,23 @@ class _Repo:
     class _Collection:
         entries: List[_Repo] = []
     
    -    def __init__(self, file):
    -        self.entries = [_Repo(r) for r in file]
    +    def __init__(self, repos):
    +        self.repos = repos
    +        self.it = iter(self.repos)
    +
    +    def __next__(self):
    +        return next(self.it)
     
         def __getitem__(self, key):
    -        return self.entries[key]
    +        return self.repos[key]
     
         def __len__(self):
    -        return len(self.entries)
    +        return len(self.repos)
     
     
     def _launchpad_response(datadir, datafile):
    -    return _Collection(json.loads(Path(datadir, datafile).read_text()))
    +    repos = json.loads(Path(datadir, datafile).read_text())
    +    return _Collection([_Repo(r) for r in repos])
     
     
     @pytest.fixture
    @@ -194,7 +199,9 @@ def test_launchpad_incremental_lister(
     def test_launchpad_lister_invalid_url_filtering(
         swh_scheduler, mocker,
     ):
    -    invalid_origin = [_Repo({"git_https_url": "tag:launchpad.net:2008:redacted",})]
    +    invalid_origin = _Collection(
    +        [_Repo({"git_https_url": "tag:launchpad.net:2008:redacted",})]
    +    )
         _mock_launchpad(mocker, invalid_origin)
         lister = LaunchpadLister(scheduler=swh_scheduler)
         stats = lister.run()
    @@ -213,7 +220,7 @@ def test_launchpad_lister_duplicated_origin(
                 "date_last_modified": "2021-01-14 21:05:31.231406+00:00",
             }
         )
    -    origins = [origin, origin]
    +    origins = _Collection([origin, origin])
         _mock_launchpad(mocker, origins)
         lister = LaunchpadLister(scheduler=swh_scheduler)
         stats = lister.run()
    
  • Merge request was returned for changes

  • Adapt according to anlambert's suggestion

  • repos is an instance of lazr.restfulclient.resource import Collection which implements an iterator fetching pages from the launchpad REST API on a regular basis.

    yes, totally.

    What we have to retry here is the next operation on the iterator as that call can raise a RestfulError exception.

    And that part escaped me, indeed, let's try the next origin... Thanks for pointing it out! I've adapted the diff accordingly. I did not change the diff description though. Since that fixes the same problem with a better implementation, i think it's fine that way.

  • Build is green

    Patch application report for D7198 (id=26102)

    Rebasing onto 6a747955...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 46401122b9596c96c0d5250458eb0ad4bce76904
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Thu Feb 17 18:16:43 2022 +0100
    
        launchpad: Ignore erratic origin reading failure
        
        Related to #3948

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/464/ for more details.

  • No need for a full blown try: except: block, restrict to where the iteration consumption happen.

  • Build is green

    Patch application report for D7198 (id=26103)

    Rebasing onto 6a747955...

    Current branch diff-target is up to date.
    Changes applied before test
    commit d18f7b53a2368ee9ffb5cb86c4b6b87bf8433263
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Thu Feb 17 18:16:43 2022 +0100
    
        launchpad: Ignore erratic origin reading failure
        
        Related to #3948

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/465/ for more details.

  • Add coverage (although i don't see yet how to mock retry here)

  • Build is green

    Patch application report for D7198 (id=26104)

    Rebasing onto 6a747955...

    Current branch diff-target is up to date.
    Changes applied before test
    commit f3c6a9ee80660aef19019d95b747a2f5835b3afd
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Thu Feb 17 18:16:43 2022 +0100
    
        launchpad: Ignore erratic origin reading failure
        
        Related to #3948

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/466/ for more details.

  • Revert to initial commit (other suggestions do not work as retry on iterator cannot work)

  • So after some tests, what I proposed in !267 (closed) simply does not work as iterators cannot be retried.

    So wrapping the repos iteration in a try/except block is equivalent.

    Let's use that and see on staging if the incremental lister can continue where it stopped when a RestfulError is raised.

  • Merge request was accepted

  • Antoine Lambert approved this merge request

    approved this merge request

  • Build is green

    Patch application report for D7198 (id=26105)

    Rebasing onto 6a747955...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 2568ecc7c2fa1082f7a739c383b4659756d476cc
    Author: Antoine R. Dumont (@ardumont) <ardumont@softwareheritage.org>
    Date:   Thu Feb 17 18:16:43 2022 +0100
    
        launchpad: Ignore erratic page and continue listing next page
        
        The decorator is dropped on `get_origins_from_page` as we cannot retry an iterator
        consumption anyway.
        
        Related to #3948

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/467/ for more details.

  • Merge request was merged

  • Please register or sign in to reply
    Loading