Draft: Add 'discovery' model
It adds primitives for efficiently finding which parts of on-disk contents are unknown to the archive. It provides the ArchiveDiscoveryInterface and BaseDiscoveryGraph classes for discovery algorithms.
This is a follow-up to comments made in swh-scanner!64 (closed) about splitting some code into a common module.
Related swh-scanner!64 (closed)
Related swh-scanner#4591
Migrated from D8937 (view on Phabricator)
Build has FAILED
Patch application report for D8937 (id=32206)
Rebasing onto 818ad826...
First, rewinding head to replay your work on top of it...
Applying: Add 'discovery' model
Changes applied before test
commit 8b9bf14fc44b42ac0c20b2aac06ff018d4dc3d2c
Author: Franck Bret <franck.bret@octobus.net>
Date:   Wed Dec 7 11:20:27 2022 +0100

    Add 'discovery' model

    It adds primitives for finding the unknown parts of disk contents
    efficiently. It provides ArchiveDiscoveryInterface and BaseDiscoveryGraph
    classes for discovery algorithms.

    This is a follow up of comments made in swh/devel/swh-scanner!64 about
    splitting some code to common module

    Related swh/devel/swh-scanner!64
    Related swh/devel/swh-scanner#4591

Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/546/
See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/546/console
mentioned in merge request swh-scanner!64 (closed)
- swh/model/discovery.py 0 → 100644
            """List content missing from the archive by sha1"""
            return self.storage.content_missing_per_sha1_git(contents)

        async def skipped_content_missing(
            self, skipped_contents: List[Sha1Git]
        ) -> Iterable[Sha1Git]:
            """List skipped content missing from the archive by sha1"""
            contents = [
                {"sha1_git": s, "sha1": None, "sha256": None, "blake2s256": None}
                for s in skipped_contents
            ]
            return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))

        async def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
            """List directories missing from the archive by sha1"""
            return self.storage.directory_missing(directories)

- Comment on lines +63 to +92
This class should be moved to swh.storage.algos.discovery, as it is storage-specific.
Edited by Antoine Lambert
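To make the "storage-specific" point concrete, here is a minimal sketch of an adapter shaped like the excerpt above, run against a fake backend. FakeStorage and StorageBackedArchive are hypothetical names invented for this illustration; only the skipped_content_missing and directory_missing call shapes are taken from the diff, and the real class may differ.

```python
import asyncio
from typing import Any, Dict, Iterable, List, Set

Sha1Git = bytes  # assumption: identifiers are raw bytes


class FakeStorage:
    """Hypothetical in-memory backend, for illustration only."""

    def __init__(self, known_skipped: Set[Sha1Git], known_dirs: Set[Sha1Git]) -> None:
        self._known_skipped = known_skipped
        self._known_dirs = known_dirs

    def skipped_content_missing(
        self, contents: List[Dict[str, Any]]
    ) -> Iterable[Dict[str, Any]]:
        # Return the hash dicts whose sha1_git the backend does not know.
        return [c for c in contents if c["sha1_git"] not in self._known_skipped]

    def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
        return [d for d in directories if d not in self._known_dirs]


class StorageBackedArchive:
    """Hypothetical adapter: every method only delegates to self.storage."""

    def __init__(self, storage: FakeStorage) -> None:
        self.storage = storage

    async def skipped_content_missing(
        self, skipped_contents: List[Sha1Git]
    ) -> Iterable[Sha1Git]:
        # The backend expects full hash dicts, so pad the unknown hashes with None.
        contents = [
            {"sha1_git": s, "sha1": None, "sha256": None, "blake2s256": None}
            for s in skipped_contents
        ]
        return (d["sha1_git"] for d in self.storage.skipped_content_missing(contents))

    async def directory_missing(self, directories: List[Sha1Git]) -> Iterable[Sha1Git]:
        return self.storage.directory_missing(directories)


if __name__ == "__main__":
    archive = StorageBackedArchive(FakeStorage({b"s1"}, {b"d1"}))
    print(list(asyncio.run(archive.directory_missing([b"d1", b"d2"]))))
```

Since the adapter does nothing except translate arguments for the storage backend, keeping it next to the storage code (as the review suggests) would leave the discovery module free of storage details.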
As all the logic related to the discovery algorithm should be found here, that module should also contain the RandomDirSamplingDiscoveryGraph class and the filter_known_objects function.

    class RandomDirSamplingDiscoveryGraph(BaseDiscoveryGraph):
        """Use a random sampling using only directories.

        This allows us to find a statistically good spread of entries in the graph
        with a smaller population than using all types of entries. When there are
        no more directories, only contents or skipped contents are undecided if any
        are left: we send them directly to the storage since they should be few and
        their structure flat."""

        async def get_sample(self) -> Sample:
            if self._undecided_directories:
                if len(self._undecided_directories) <= SAMPLE_SIZE:
                    return Sample(
                        contents=set(),
                        skipped_contents=set(),
                        directories=set(self._undecided_directories),
                    )
                sample = random.sample(self._undecided_directories, SAMPLE_SIZE)
                directories = {o for o in sample}
                return Sample(
                    contents=set(), skipped_contents=set(), directories=directories
                )

            contents = set()
            skipped_contents = set()

            for sha1 in self.undecided:
                obj = self._all_contents[sha1]
                obj_type = obj.object_type
                if obj_type == model.Content.object_type:
                    contents.add(sha1)
                elif obj_type == model.SkippedContent.object_type:
                    skipped_contents.add(sha1)
                else:
                    raise TypeError(f"Unexpected object type {obj_type}")

            return Sample(
                contents=contents, skipped_contents=skipped_contents, directories=set()
            )


    async def filter_known_objects(
        archive: ArchiveDiscoveryInterface, graph: Optional[BaseDiscoveryGraph] = None
    ):
        """Filter ``archive``'s ``contents``, ``skipped_contents`` and ``directories``
        to only return those that are unknown to the SWH archive using a discovery
        algorithm."""
        contents = archive.contents
        skipped_contents = archive.skipped_contents
        directories = archive.directories

        contents_count = len(contents)
        skipped_contents_count = len(skipped_contents)
        directories_count = len(directories)

        if graph is None:
            graph = RandomDirSamplingDiscoveryGraph(contents, skipped_contents, directories)

        while graph.undecided:
            sample = await graph.get_sample()
            await graph.do_query(archive, sample)

        contents = [c for c in contents if c.sha1_git in graph.unknown]
        skipped_contents = [c for c in skipped_contents if c.sha1_git in graph.unknown]
        directories = [c for c in directories if c.id in graph.unknown]

        logger.debug(
            "Filtered out %d contents, %d skipped contents and %d directories",
            contents_count - len(contents),
            skipped_contents_count - len(skipped_contents),
            directories_count - len(directories),
        )

        return (contents, skipped_contents, directories)
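The sample-then-query loop quoted above can be tried in isolation with the toy sketch below. ToyArchive, ToyGraph and filter_unknown are hypothetical stand-ins, not the classes from this merge request, and the SAMPLE_SIZE value is arbitrary. The propagation rules (a known directory implies a known subtree, an unknown entry implies unknown ancestors) are an assumption about what BaseDiscoveryGraph.do_query does, based on how content-addressed trees behave; they are not shown in the quoted code.

```python
import asyncio
import random
from typing import Dict, List, Set

SAMPLE_SIZE = 2  # arbitrary demo value; the real constant is not shown here


class ToyArchive:
    """Hypothetical archive stand-in that knows a fixed set of ids."""

    def __init__(self, known: Set[str]) -> None:
        self._known = known

    async def missing(self, ids: Set[str]) -> Set[str]:
        return ids - self._known


class ToyGraph:
    """Hypothetical, simplified discovery graph over string ids."""

    def __init__(self, children: Dict[str, List[str]], contents: Set[str]) -> None:
        self.children = children  # directory id -> ids of its entries
        self.parents: Dict[str, List[str]] = {}
        for parent, kids in children.items():
            for kid in kids:
                self.parents.setdefault(kid, []).append(parent)
        self.undecided: Set[str] = set(children) | contents
        self.known: Set[str] = set()
        self.unknown: Set[str] = set()

    def _mark(self, node: str, decided: Set[str], links: Dict[str, List[str]]) -> None:
        # Decide `node` and propagate along `links` (children or parents).
        stack = [node]
        while stack:
            current = stack.pop()
            if current in decided:
                continue
            decided.add(current)
            self.undecided.discard(current)
            stack.extend(links.get(current, []))

    def get_sample(self) -> Set[str]:
        # Directories first; once none are left, send the remaining entries at once.
        dirs = sorted(n for n in self.undecided if n in self.children)
        if not dirs:
            return set(self.undecided)
        if len(dirs) <= SAMPLE_SIZE:
            return set(dirs)
        return set(random.sample(dirs, SAMPLE_SIZE))

    async def do_query(self, archive: ToyArchive, sample: Set[str]) -> None:
        missing = await archive.missing(sample)
        for node in sample:
            if node in missing:
                self._mark(node, self.unknown, self.parents)  # ancestors unknown
            else:
                self._mark(node, self.known, self.children)  # subtree known


async def filter_unknown(archive: ToyArchive, graph: ToyGraph) -> Set[str]:
    while graph.undecided:
        await graph.do_query(archive, graph.get_sample())
    return graph.unknown


def run_demo() -> Set[str]:
    children = {"root": ["a", "b", "f0"], "a": ["f1", "f2"], "b": ["f3"]}
    graph = ToyGraph(children, contents={"f0", "f1", "f2", "f3"})
    return asyncio.run(filter_unknown(ToyArchive({"a", "f1", "f2"}), graph))


if __name__ == "__main__":
    print(sorted(run_demo()))
```

In this toy run the entries under directory "a" are never queried individually, which is the saving the directory-only sampling is after.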
mentioned in merge request swh-storage!850
@anlambert I pushed fixes to the same branch on my fork, but it does not link to the existing MR. Can you apply the changes, or should I create another MR?
@vlorentz @anlambert I've opened two new MRs with the updated code, !327 (closed) and !329 (closed), so we can close this one.
mentioned in merge request !327 (closed)
mentioned in merge request !329 (closed)