Draft: Add 'discovery' model

Closed Franck Bret requested to merge generated-differential-D8937-source into master
2 unresolved threads

This merge request adds primitives for efficiently finding which parts of on-disk contents are unknown to the archive. It provides the ArchiveDiscoveryInterface and BaseDiscoveryGraph classes as building blocks for discovery algorithms.

This is a follow-up to comments made in swh-scanner!64 (closed) about splitting some code out into a common module.

Related swh-scanner!64 (closed)

Related swh-scanner#4591


Migrated from D8937 (view on Phabricator)

Activity

        """List content missing from the archive by sha1"""
        return self.storage.content_missing_per_sha1_git(contents)

    async def skipped_content_missing(
        self, skipped_contents: List[Sha1Git]
    ) -> Iterable[Sha1Git]:
        """List skipped content missing from the archive by sha1"""
        contents = [
            {"sha1_git": s, "sha1": None, "sha256": None, "blake2s256": None}
            for s in skipped_contents
        ]
        return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))

    async def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
        """List directories missing from the archive by sha1"""
        return self.storage.directory_missing(directories)
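
To make the shape of these methods easier to check, here is a minimal sketch that exercises the same pattern against an in-memory stand-in. `FakeStorage` and `ToyArchive` are hypothetical names introduced for illustration only. They are not part of swh-model or swh-storage, and the stand-in only mimics the calls as they appear in the diff above.

```python
import asyncio
from typing import Dict, Iterable, List, Optional, Set

Sha1Git = bytes  # stand-in for the type alias used in the diff


class FakeStorage:
    """Hypothetical in-memory storage that knows a fixed set of ids."""

    def __init__(self, known: Set[Sha1Git]) -> None:
        self.known = known

    def skipped_content_missing(
        self, contents: List[Dict[str, Optional[bytes]]]
    ) -> Iterable[Dict[str, Optional[bytes]]]:
        # Return the hash dicts whose sha1_git is not known (assumed behaviour).
        return (c for c in contents if c["sha1_git"] not in self.known)

    def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
        return (d for d in directories if d not in self.known)


class ToyArchive:
    """Hypothetical archive wrapper mirroring the methods shown in the diff."""

    def __init__(self, storage: FakeStorage) -> None:
        self.storage = storage

    async def skipped_content_missing(
        self, skipped_contents: List[Sha1Git]
    ) -> Iterable[Sha1Git]:
        # Only sha1_git is available, so the other hashes are left as None.
        contents = [
            {"sha1_git": s, "sha1": None, "sha256": None, "blake2s256": None}
            for s in skipped_contents
        ]
        return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))

    async def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
        return self.storage.directory_missing(directories)


archive = ToyArchive(FakeStorage(known={b"aaa"}))
missing_skipped = list(asyncio.run(archive.skipped_content_missing([b"aaa", b"bbb"])))
missing_dirs = list(asyncio.run(archive.directory_missing([b"aaa", b"ccc"])))
```

Both methods return lazy iterables, so a caller that needs to iterate more than once has to materialise the result first, as done here with `list(...)`.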
  • As all the logic related to the discovery algorithm should be found here, that module should also contain the RandomDirSamplingDiscoveryGraph class and the filter_known_objects function.

    class RandomDirSamplingDiscoveryGraph(BaseDiscoveryGraph):
        """Use a random sampling using only directories.
    
        This allows us to find a statistically good spread of entries in the graph
        with a smaller population than using all types of entries. When there are
        no more directories, only contents or skipped contents are undecided if any
        are left: we send them directly to the storage since they should be few and
        their structure flat."""
    
        async def get_sample(self) -> Sample:
            if self._undecided_directories:
                if len(self._undecided_directories) <= SAMPLE_SIZE:
                    return Sample(
                        contents=set(),
                        skipped_contents=set(),
                        directories=set(self._undecided_directories),
                    )
                # random.sample() requires a sequence (sets are rejected since Python 3.11)
                sample = random.sample(tuple(self._undecided_directories), SAMPLE_SIZE)
                directories = set(sample)
                return Sample(
                    contents=set(), skipped_contents=set(), directories=directories
                )
    
            contents = set()
            skipped_contents = set()
    
            for sha1 in self.undecided:
                obj = self._all_contents[sha1]
                obj_type = obj.object_type
                if obj_type == model.Content.object_type:
                    contents.add(sha1)
                elif obj_type == model.SkippedContent.object_type:
                    skipped_contents.add(sha1)
                else:
                    raise TypeError(f"Unexpected object type {obj_type}")
    
            return Sample(
                contents=contents, skipped_contents=skipped_contents, directories=set()
            )
    
    
    async def filter_known_objects(
        archive: ArchiveDiscoveryInterface, graph: Optional[BaseDiscoveryGraph] = None
    ):
        """Filter ``archive``'s ``contents``, ``skipped_contents`` and ``directories``
        to only return those that are unknown to the SWH archive using a discovery
        algorithm."""
        contents = archive.contents
        skipped_contents = archive.skipped_contents
        directories = archive.directories
    
        contents_count = len(contents)
        skipped_contents_count = len(skipped_contents)
        directories_count = len(directories)
    
        if graph is None:
            graph = RandomDirSamplingDiscoveryGraph(contents, skipped_contents, directories)
    
        while graph.undecided:
            sample = await graph.get_sample()
            await graph.do_query(archive, sample)
    
        contents = [c for c in contents if c.sha1_git in graph.unknown]
        skipped_contents = [c for c in skipped_contents if c.sha1_git in graph.unknown]
        directories = [c for c in directories if c.id in graph.unknown]
    
        logger.debug(
            "Filtered out %d contents, %d skipped contents and %d directories",
            contents_count - len(contents),
            skipped_contents_count - len(skipped_contents),
            directories_count - len(directories),
        )
    
        return (contents, skipped_contents, directories)
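
The sample-then-query loop quoted above can be tried out on a toy tree. The sketch below is not the BaseDiscoveryGraph implementation: `discover_unknown`, the `tree` mapping and the `archived` set are hypothetical, and it assumes that a directory known to the archive implies its whole subtree is known, while an unknown object implies all of its ancestors are unknown.

```python
import random
from typing import Dict, List, Set

SAMPLE_SIZE = 2  # hypothetical value, kept small so the loop needs several rounds


def discover_unknown(
    children: Dict[str, List[str]], archived: Set[str], seed: int = 0
) -> Set[str]:
    """Return the ids of ``children`` that are absent from ``archived``.

    ``children`` maps every object id to its child ids (empty for contents).
    """
    rng = random.Random(seed)
    parents: Dict[str, List[str]] = {obj: [] for obj in children}
    for parent, kids in children.items():
        for kid in kids:
            parents[kid].append(parent)

    undecided = set(children)
    known: Set[str] = set()
    unknown: Set[str] = set()

    def mark(start: str, links: Dict[str, List[str]], target: Set[str]) -> None:
        # Propagate a decision along ``links``, stopping at decided objects.
        stack = [start]
        while stack:
            obj = stack.pop()
            if obj in undecided:
                undecided.discard(obj)
                target.add(obj)
                stack.extend(links[obj])

    while undecided:
        dirs = sorted(o for o in undecided if children[o])
        if dirs:
            # Sample directories only, as in the quoted get_sample().
            sample = rng.sample(dirs, min(SAMPLE_SIZE, len(dirs)))
        else:
            # Only leaves are left: query all of them at once.
            sample = sorted(undecided)
        for obj in sample:
            if obj in archived:
                mark(obj, children, known)  # the whole subtree is archived
            else:
                mark(obj, parents, unknown)  # every ancestor is new as well
    return unknown


tree = {
    "root": ["src", "docs"],
    "src": ["a.py", "b.py"],
    "docs": ["readme"],
    "a.py": [],
    "b.py": [],
    "readme": [],
}
archived = {"docs", "readme", "a.py"}
result = discover_unknown(tree, archived)
```

In this example a single query on `docs` also settles `readme`, which is where the saving over querying every object individually comes from.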
    
  • Antoine Lambert mentioned in merge request swh-storage!850

  • vlorentz mentioned in merge request !327 (closed)

  • Franck Bret mentioned in merge request !329 (closed)

  • closed
