Skip to content
Snippets Groups Projects

Setup async interface for discovery module

2 unresolved threads

This will allow us to use this interface in async code like swh-scanner.

Unfortunately, this means calling asyncio.run for sync code, but the performance impact should be negligible.

The swh_storage.*missing* APIs are inconsistent for each type, which requires a lot of boilerplate code. This should be addressed in a follow-up.

There is also one hack (marked as such inline) needed to have an API that isn't also plagued by implementation details of needing to keep ID -> object matchers around. This should be a removed as a direct consequence of aligning the storage APIs.


Migrated from D8538 (view on Phabricator)

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Build is green

    Patch application report for D8538 (id=30790)

    Rebasing onto 798f749e...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 19db5769cbe618b77c322189b66994222e2c8668
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Mon Sep 26 18:21:54 2022 +0200
    
        Setup async interface for discovery module
        
        This will allow us to use this interface in async code like ``swh-scanner``.
        
        Unfortunately, this means calling ``asyncio.run`` for sync code, but the
        performance impact should be negligible.
        
        The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
        requires a lot of boilerplate code. This should be addressed in a
        follow-up.
        
        There is also one hack (marked as such inline) needed to have an API that
        isn't also plagued by implementation details of needing to keep
        ID -> object matchers around. This should be a removed as a direct
        consequence of aligning the storage APIs.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/896/ for more details.

  • Author Contributor

    One question for reviewers: I've tried using Protocol for ArchiveDiscoveryInterface (should I?), but I'm not sure if it's possible to define a concrete method within the protocol, namely if __init__ could be implemented by default in ArchiveDiscoveryInterface.

  • classes should not inherit from protocols. Use an abstract class instead of a protocol if you want to use inheritance.

  • Author Contributor

    Use ABC

  • Build is green

    Patch application report for D8538 (id=30795)

    Rebasing onto 798f749e...

    Current branch diff-target is up to date.
    Changes applied before test
    commit fe6610629b1bff545b30ea86743d8e90a0487a47
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Mon Sep 26 18:21:54 2022 +0200
    
        Setup async interface for discovery module
        
        This will allow us to use this interface in async code like ``swh-scanner``.
        
        Unfortunately, this means calling ``asyncio.run`` for sync code, but the
        performance impact should be negligible.
        
        The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
        requires a lot of boilerplate code. This should be addressed in a
        follow-up.
        
        There is also one hack (marked as such inline) needed to have an API that
        isn't also plagued by implementation details of needing to keep
        ID -> object matchers around. This should be a removed as a direct
        consequence of aligning the storage APIs.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/897/ for more details.

  • vlorentz
    vlorentz @vlorentz started a thread on the diff
39 def __init__(
40 self,
41 contents: List[model.Content],
42 skipped_contents: List[model.SkippedContent],
43 directories: List[model.Directory],
44 ) -> None:
45 self.contents = contents
46 self.skipped_contents = skipped_contents
47 self.directories = directories
48
49 @abc.abstractmethod
50 async def content_missing(self, contents: List[Sha1Git]) -> Iterable[Sha1Git]:
51 """List content missing from the archive by sha1"""
52
53 @abc.abstractmethod
54 async def skipped_content_missing(
  • I don't understand why inconsistencies in swh-storage's API require passing this dict around.

    Could you explain what needs to change in the swh-storage API, and how this code would look like after the change?

  • vlorentz
    vlorentz @vlorentz started a thread on the diff
  • 26 27 )
    27 28
    28 29
    30 class ArchiveDiscoveryInterface(abc.ABC):
  • Author Contributor

    Add docstring, abstract method decorators, explain hack

  • Build is green

    Patch application report for D8538 (id=30851)

    Rebasing onto 798f749e...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 404c1b9a30928f42a8f32bbfbcfe64986fdc8126
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Mon Sep 26 18:21:54 2022 +0200
    
        Setup async interface for discovery module
        
        This will allow us to use this interface in async code like ``swh-scanner``.
        
        Unfortunately, this means calling ``asyncio.run`` for sync code, but the
        performance impact should be negligible.
        
        The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
        requires a lot of boilerplate code. This should be addressed in a
        follow-up.
        
        There is also one hack (marked as such inline) needed to have an API that
        isn't also plagued by implementation details of needing to keep
        ID -> object matchers around. This should be a removed as a direct
        consequence of aligning the storage APIs. See inline comment.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/899/ for more details.

  • I still don't understand why you need the _all_contents hack instead of doing this:

        async def skipped_content_missing(
            self, skipped_contents: List[Sha1Git]
        ) -> Iterable[Sha1Git]:
            """List skipped content missing from the archive by sha1"""
            contents = [{"sha1_git": s} s in skipped_contents]
            return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))
  • Author Contributor

    ! In !321 (closed), @vlorentz wrote: I still don't understand why you need the _all_contents hack instead of doing this:

        async def skipped_content_missing(
            self, skipped_contents: List[Sha1Git]
        ) -> Iterable[Sha1Git]:
            """List skipped content missing from the archive by sha1"""
            contents = [{"sha1_git": s} s in skipped_contents]
            return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))

    Ah. I was sure skipped_content_missing took SkippedContent instead of dicts. Sorry about that, will update this patch soon.

  • Author Contributor

    Remove ugly hack

  • Build is green

    Patch application report for D8538 (id=30912)

    Rebasing onto 798f749e...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 1facea3cd215155f77ae4083a33837b7c6f642b0
    Author: Raphaël Gomès <rgomes@octobus.net>
    Date:   Mon Sep 26 18:21:54 2022 +0200
    
        Setup async interface for discovery module
        
        This will allow us to use this interface in async code like ``swh-scanner``.
        
        Unfortunately, this means calling ``asyncio.run`` for sync code, but the
        performance impact should be negligible.
        
        The ``swh_storage.*missing*`` APIs are inconsistent for each type, which
        requires a lot of boilerplate code. This should be addressed in a
        follow-up.

    See https://jenkins.softwareheritage.org/job/DLDBASE/job/tests-on-diff/905/ for more details.

  • Merge request was accepted

  • vlorentz approved this merge request

    approved this merge request

  • Author Contributor

    Merge request was merged

  • I'm coming late to the battle ¯_(ツ)_/¯.

    Still a couple of remarks/questions.

    1. Isn't this missing an asyncio dependency requirements in /requirements.txt

    2. Why is this defined here and not in the storage since all the current dicsovery methods involve the storage?

  • Still a couple of remarks/questions.

    1. Isn't this missing an asyncio dependency requirements in /requirements.txt

    nope, it's fine [1]

    • [1]
    10:35 <+ardumont> is asyncio part of python3.7?
    10:36 <+ardumont> (for loader.core and the new discovery async code ^)
    10:36 <Alphare> yes
    10:36 <Alphare> https://docs.python.org/3.7/library/asyncio.html
    1. Why is this defined here and not in the storage since all the current dicsovery methods involve the storage?

    see [2]

  • mentioned in merge request swh-scanner!64 (closed)

  • Please register or sign in to reply
    Loading