code scanner prototype
First prototype of the code scanner
Usage example
swh scanner scan "/path/to/scan"
sample output (using swh-py-template as a source path):
swh/
debian/
.git/
│ refs/
│ │ heads/
│ │ │ master
│ │ remotes/
│ │ tags/
│ hooks/
│ │ pre-receive.sample
│ │ pre-push.sample
│ │ update.sample
│ │ pre-rebase.sample
│ │ fsmonitor-watchman.sample
│ │ post-update.sample
│ │ pre-commit
│ │ pre-commit.sample
│ │ prepare-commit-msg.sample
│ │ commit-msg.sample
│ │ pre-applypatch.sample
│ │ applypatch-msg.sample
│ objects/
│ │ pack/
│ │ │ pack-750cdb447ff562005695fc225fc7f81ba936b697.pack
│ │ │ pack-750cdb447ff562005695fc225fc7f81ba936b697.idx
│ │ info/
│ info/
│ logs/
│ │ refs/
│ │ │ heads/
│ │ │ │ master
│ │ │ remotes/
│ │ │ │ origin/
│ │ │ │ │ HEAD
│ │ HEAD
│ branches/
│ description
│ ORIG_HEAD
│ HEAD
│ config
│ FETCH_HEAD
│ packed-refs
│ index
docs/
requirements-swh.txt
README.md
AUTHORS
requirements.txt
MANIFEST.in
pytest.ini
mypy.ini
LICENSE
CONTRIBUTORS
requirements-test.txt
.gitignore
Makefile
tox.ini
setup.py
CODE_OF_CONDUCT.md
.pre-commit-config.yaml
The output has colors indicating if a directory or a file is discovered: If a directory is blue means that all the contents inside are found, otherwise it's red. For each red directory, the file is green if it's found otherwise it's red.
Migrated from D2657 (view on Phabricator)
Merge request reports
Activity
- swh/scanner/cli.py 0 → 100644
13 from swh.core.cli import CONTEXT_SETTINGS 14 15 @click.group(name='scanner', context_settings=CONTEXT_SETTINGS) 16 @click.pass_context 17 def scanner(ctx): 18 '''Software Heritage Scanner tools.''' 19 pass 20 21 22 def parse_url(url): 23 if not url.startswith('http://') or not url.startswith('https://'): 24 url = 'https://' + url 25 if not url.endswith('/'): 26 url += '/' 27 return url 28 - swh/scanner/scanner.py 0 → 100644
14 from .model import Tree 15 16 from swh.model.cli import pid_of_file, pid_of_dir 17 from swh.model.identifiers import ( 18 parse_persistent_identifier, 19 DIRECTORY, CONTENT 20 ) 21 22 23 async def pids_discovery( 24 pids: List[str], session: aiohttp.ClientSession, api_url: str, 25 ) -> Dict[str, Dict[str, bool]]: 26 """API Request to get information about the persistent identifiers given in 27 input. 28 29 Args: - swh/scanner/scanner.py 0 → 100644
55 56 return await resp.json() 57 58 if len(pids) > chunk_size: 59 for pids_chunk in get_chunk(pids): 60 requests.append(asyncio.create_task( 61 make_request(pids_chunk))) 62 63 res = await asyncio.gather(*requests) 64 # concatenate list of dictionaries 65 return dict(itertools.chain.from_iterable(e.items() for e in res)) 66 else: 67 return await make_request(pids) 68 69 70 def get_subpaths( Missing empty lines (see https://sphinxcontrib-napoleon.readthedocs.io/en/latest/ )
- swh/scanner/scanner.py 0 → 100644
72 """Find the persistent identifier of the directories and files under a 73 given path. 74 75 Args: 76 path: the root path 77 78 Yields: 79 pairs of: path, the relative persistent identifier 80 81 """ 82 def pid_of(path): 83 if path.is_dir(): 84 return pid_of_dir(bytes(path)) 85 elif path.is_file(): 86 return pid_of_file(bytes(path)) 87 - swh/scanner/model.py 0 → 100644
40 """ 41 relative_path = path.relative_to(self.path) 42 43 if relative_path == PosixPath('.'): 44 if pid is not None: 45 self.pid = pid 46 return 47 48 new_path = self.path.joinpath(relative_path.parts[0]) 49 if new_path not in self.children: 50 self.children[new_path] = Tree(new_path, self) 51 52 self.children[new_path].addNode(path, pid) 53 54 def show(self) -> None: 55 """Print all the tree""" - swh/scanner/scanner.py 0 → 100644
16 from swh.model.cli import pid_of_file, pid_of_dir 17 from swh.model.identifiers import ( 18 parse_persistent_identifier, 19 DIRECTORY, CONTENT 20 ) 21 22 23 async def pids_discovery( 24 pids: List[str], session: aiohttp.ClientSession, api_url: str, 25 ) -> Dict[str, Dict[str, bool]]: 26 """API Request to get information about the persistent identifiers given in 27 input. 28 29 Args: 30 pids: a list of persistent identifier 31 api_url: url for the API request - swh/scanner/scanner.py 0 → 100644
57 58 if len(pids) > chunk_size: 59 for pids_chunk in get_chunk(pids): 60 requests.append(asyncio.create_task( 61 make_request(pids_chunk))) 62 63 res = await asyncio.gather(*requests) 64 # concatenate list of dictionaries 65 return dict(itertools.chain.from_iterable(e.items() for e in res)) 66 else: 67 return await make_request(pids) 68 69 70 def get_subpaths( 71 path: PosixPath) -> Iterator[Tuple[PosixPath, str]]: 72 """Find the persistent identifier of the directories and files under a As per a recent discussion, you can replace this by
Iterator[Tuple[PosixPath, str]]
.@vlorentz ^ tell me if i misremember ;)
- swh/scanner/scanner.py 0 → 100644
110 parsed_pids = await pids_discovery( 111 list(parsed_paths.values()), session, api_url) 112 113 def unpack(tup): 114 subpath, pid = tup 115 return (subpath, pid, parsed_pids[pid]['known']) 116 117 return map(unpack, parsed_paths.items()) 118 119 120 async def run( 121 root: PosixPath, api_url: str, source_tree: Tree) -> None: 122 """Start scanning from the given root. 123 124 It fills the source tree with the path discovered. 125 Looks good.
Mostly nitpicks on docstrings as well:
- I'd make the color output optional, that'd make for something more easily parsable.
- Can we have a sample output of a run in the diff description, that'd be nice ;)
- The logger module seems overkill as i see it used only once though
Cheers,
! In !1 (closed), @ardumont wrote:
- I'd make the color output optional, that'd make for something more easily parsable.
Ack. A good default behavior on this is to colorize output when stdout is connected to a terminal and //not// do it when it is connected to a pipe. That's what most
git
commands do and it's a very good default.- swh/scanner/cli.py 0 → 100644
19 pass 20 21 22 def parse_url(url): 23 if not url.startswith('http://') or not url.startswith('https://'): 24 url = 'https://' + url 25 if not url.endswith('/'): 26 url += '/' 27 return url 28 29 30 @scanner.command(name='scan') 31 @click.argument('path', required=True, type=click.Path(exists=True)) 32 @click.option('--api-url', default='https://archive.softwareheritage.org/api/1', 33 metavar='API_URL', show_default=True, 34 help="url for the api request") ! In !1 (closed), @ardumont wrote:
- The logger module seems overkill as i see it used only once though
Yes, i see. I manly use it to benchmark different algorithms so i can remove it.
! In !1 (closed), @vlorentz wrote: It's still missing a description of what it does (ie. check that files exist in swh), in the CLI help and/or README.
and I still think you should use a subcommand, in case you want to add new features in the future without breaking scripts that depend on this one.
I will update the description, anyway the only supported command by the module right now is the scan operation so i could add a subcommand when a new feature is needed .