Skip to content
Snippets Groups Projects

code scanner prototype

10 unresolved threads

First prototype of the code scanner

Usage example

swh scanner scan "/path/to/scan"

sample output (using swh-py-template as a source path):

swh/
debian/
.git/
│   refs/
│   │   heads/
│   │   │   master
│   │   remotes/
│   │   tags/
│   hooks/
│   │   pre-receive.sample
│   │   pre-push.sample
│   │   update.sample
│   │   pre-rebase.sample
│   │   fsmonitor-watchman.sample
│   │   post-update.sample
│   │   pre-commit
│   │   pre-commit.sample
│   │   prepare-commit-msg.sample
│   │   commit-msg.sample
│   │   pre-applypatch.sample
│   │   applypatch-msg.sample
│   objects/
│   │   pack/
│   │   │   pack-750cdb447ff562005695fc225fc7f81ba936b697.pack
│   │   │   pack-750cdb447ff562005695fc225fc7f81ba936b697.idx
│   │   info/
│   info/
│   logs/
│   │   refs/
│   │   │   heads/
│   │   │   │   master
│   │   │   remotes/
│   │   │   │   origin/
│   │   │   │   │   HEAD
│   │   HEAD
│   branches/
│   description
│   ORIG_HEAD
│   HEAD
│   config
│   FETCH_HEAD
│   packed-refs
│   index
docs/
requirements-swh.txt
README.md
AUTHORS
requirements.txt
MANIFEST.in
pytest.ini
mypy.ini
LICENSE
CONTRIBUTORS
requirements-test.txt
.gitignore
Makefile
tox.ini
setup.py
CODE_OF_CONDUCT.md
.pre-commit-config.yaml

The output has colors indicating if a directory or a file is discovered: If a directory is blue means that all the contents inside are found, otherwise it's red. For each red directory, the file is green if it's found otherwise it's red.


Migrated from D2657 (view on Phabricator)

Merge request reports

Closed by Phabricator Migration userPhabricator Migration user 5 years ago (Feb 28, 2020 2:12pm UTC)

Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
13 from swh.core.cli import CONTEXT_SETTINGS
14
15 @click.group(name='scanner', context_settings=CONTEXT_SETTINGS)
16 @click.pass_context
17 def scanner(ctx):
18 '''Software Heritage Scanner tools.'''
19 pass
20
21
22 def parse_url(url):
23 if not url.startswith('http://') or not url.startswith('https://'):
24 url = 'https://' + url
25 if not url.endswith('/'):
26 url += '/'
27 return url
28
  • Should have a docstring explaining what it does.

    And if the whole package contains "scanner tools", then the command should have a more specific name telling what it does

  • Please register or sign in to reply
  • vlorentz
    vlorentz @vlorentz started a thread on the diff
  • 14 from .model import Tree
    15
    16 from swh.model.cli import pid_of_file, pid_of_dir
    17 from swh.model.identifiers import (
    18 parse_persistent_identifier,
    19 DIRECTORY, CONTENT
    20 )
    21
    22
    23 async def pids_discovery(
    24 pids: List[str], session: aiohttp.ClientSession, api_url: str,
    25 ) -> Dict[str, Dict[str, bool]]:
    26 """API Request to get information about the persistent identifiers given in
    27 input.
    28
    29 Args:
  • vlorentz
    vlorentz @vlorentz started a thread on the diff
  • 55
    56 return await resp.json()
    57
    58 if len(pids) > chunk_size:
    59 for pids_chunk in get_chunk(pids):
    60 requests.append(asyncio.create_task(
    61 make_request(pids_chunk)))
    62
    63 res = await asyncio.gather(*requests)
    64 # concatenate list of dictionaries
    65 return dict(itertools.chain.from_iterable(e.items() for e in res))
    66 else:
    67 return await make_request(pids)
    68
    69
    70 def get_subpaths(
  • vlorentz
    vlorentz @vlorentz started a thread on the diff
  • 72 """Find the persistent identifier of the directories and files under a
    73 given path.
    74
    75 Args:
    76 path: the root path
    77
    78 Yields:
    79 pairs of: path, the relative persistent identifier
    80
    81 """
    82 def pid_of(path):
    83 if path.is_dir():
    84 return pid_of_dir(bytes(path))
    85 elif path.is_file():
    86 return pid_of_file(bytes(path))
    87
  • I'm missing context on what this code scanner does, so I'm only commenting on documentation for now.

    New code should use type annotations, instead of having types in the docstring.

  • Merge request was returned for changes

  • The cli supports only one operation: swh scanner [options] path, where path is the root path we want to scan. The output now is a tree like structure with a green/blue text if the path is found, red otherwise.

  • 1 # Copyright (C) 2019 The Software Heritage developers
    1 # Copyright (C) 2020 The Software Heritage developers
  • 40 """
    41 relative_path = path.relative_to(self.path)
    42
    43 if relative_path == PosixPath('.'):
    44 if pid is not None:
    45 self.pid = pid
    46 return
    47
    48 new_path = self.path.joinpath(relative_path.parts[0])
    49 if new_path not in self.children:
    50 self.children[new_path] = Tree(new_path, self)
    51
    52 self.children[new_path].addNode(path, pid)
    53
    54 def show(self) -> None:
    55 """Print all the tree"""
  • 16 from swh.model.cli import pid_of_file, pid_of_dir
    17 from swh.model.identifiers import (
    18 parse_persistent_identifier,
    19 DIRECTORY, CONTENT
    20 )
    21
    22
    23 async def pids_discovery(
    24 pids: List[str], session: aiohttp.ClientSession, api_url: str,
    25 ) -> Dict[str, Dict[str, bool]]:
    26 """API Request to get information about the persistent identifiers given in
    27 input.
    28
    29 Args:
    30 pids: a list of persistent identifier
    31 api_url: url for the API request
  • 57
    58 if len(pids) > chunk_size:
    59 for pids_chunk in get_chunk(pids):
    60 requests.append(asyncio.create_task(
    61 make_request(pids_chunk)))
    62
    63 res = await asyncio.gather(*requests)
    64 # concatenate list of dictionaries
    65 return dict(itertools.chain.from_iterable(e.items() for e in res))
    66 else:
    67 return await make_request(pids)
    68
    69
    70 def get_subpaths(
    71 path: PosixPath) -> Iterator[Tuple[PosixPath, str]]:
    72 """Find the persistent identifier of the directories and files under a
  • 110 parsed_pids = await pids_discovery(
    111 list(parsed_paths.values()), session, api_url)
    112
    113 def unpack(tup):
    114 subpath, pid = tup
    115 return (subpath, pid, parsed_pids[pid]['known'])
    116
    117 return map(unpack, parsed_paths.items())
    118
    119
    120 async def run(
    121 root: PosixPath, api_url: str, source_tree: Tree) -> None:
    122 """Start scanning from the given root.
    123
    124 It fills the source tree with the path discovered.
    125
  • Looks good.

    Mostly nitpicks on docstrings as well:

    • I'd make the color output optional, that'd make for something more easily parsable.
    • Can we have a sample output of a run in the diff description, that'd be nice ;)
    • The logger module seems overkill as i see it used only once though

    Cheers,

  • ! In !1 (closed), @ardumont wrote:

    • I'd make the color output optional, that'd make for something more easily parsable.

    Ack. A good default behavior on this is to colorize output when stdout is connected to a terminal and //not// do it when it is connected to a pipe. That's what most git commands do and it's a very good default.

  • vlorentz
    vlorentz @vlorentz started a thread on the diff
  • 19 pass
    20
    21
    22 def parse_url(url):
    23 if not url.startswith('http://') or not url.startswith('https://'):
    24 url = 'https://' + url
    25 if not url.endswith('/'):
    26 url += '/'
    27 return url
    28
    29
    30 @scanner.command(name='scan')
    31 @click.argument('path', required=True, type=click.Path(exists=True))
    32 @click.option('--api-url', default='https://archive.softwareheritage.org/api/1',
    33 metavar='API_URL', show_default=True,
    34 help="url for the api request")
  • It's still missing a description of what it does (ie. check that files exist in swh), in the CLI help and/or README.

    and I still think you should use a subcommand, in case you want to add new features in the future without breaking scripts that depend on this one.

  • Merge request was returned for changes

  • ! In !1 (closed), @ardumont wrote:

    • The logger module seems overkill as i see it used only once though

    Yes, i see. I manly use it to benchmark different algorithms so i can remove it.

  • ! In !1 (closed), @vlorentz wrote: It's still missing a description of what it does (ie. check that files exist in swh), in the CLI help and/or README.

    and I still think you should use a subcommand, in case you want to add new features in the future without breaking scripts that depend on this one.

    I will update the description, anyway the only supported command by the module right now is the scan operation so i could add a subcommand when a new feature is needed .

    • cli: scanner group with scan option, updated description
    • color only when stdout is connected to terminal
  • default url

  • Merge request was accepted

  • vlorentz approved this merge request

    approved this merge request

  • Merge request was merged

  • Please register or sign in to reply
    Loading