
[WIP] Add arch lister module.

First stab at an Arch Linux lister.

Arch Linux provides several ways to discover packages, but no easy way to get the history of a package's previously released versions. After some discussion on the Arch Linux forum (https://bbs.archlinux.org/viewtopic.php?id=275574), I've gone the git repository way.

This lister fetches a git repository to list origins, parsing PKGBUILD files.

The Arch Linux distribution is made of the 'core', 'extra' and 'community' repositories. 'core' and 'extra' packages are listed in https://github.com/archlinux/svntogit-packages, and 'community' packages in https://github.com/archlinux/svntogit-community

For now it fetches only 'core' and 'extra' packages from the first repository (421.44 MiB at this time). I'll add the second one (1.58 GiB) if we are OK with the first implementation. Both git repositories receive several commits a day.

PKGBUILD files are bash scripts. The common way to build a package is to use makepkg, which has an internal PKGBUILD parser: https://gitlab.archlinux.org/pacman/pacman/blob/master/scripts/makepkg.sh.in I did not find a PKGBUILD file parser in Python on PyPI. There is one Python module on GitHub named 'parched': https://github.com/sebnow/parched I have written a naive parser, but it is not yet solid enough to handle all special cases.
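To illustrate what a naive parser can and cannot do, here is a minimal sketch. It is not the parser from this diff: the names `parse_pkgbuild`, `SCALAR_RE`, `ARRAY_RE` and the sample PKGBUILD are hypothetical and written for this example. It only reads top-level `key=value` and `key=(...)` assignments and does not run bash, so variables such as `$pkgver` stay unexpanded.

```python
import re
import shlex
from typing import Dict, List, Union

# Hypothetical sample written for this sketch (not a real PKGBUILD).
SAMPLE_PKGBUILD = '''\
pkgname=acl
pkgver=2.3.1
pkgrel=2
pkgdesc="Access control list utilities, libraries and headers"
arch=('x86_64')
license=('LGPL')
source=("https://example.org/$pkgname-$pkgver.tar.gz"
        "fix.patch")

build() {
  cd "$pkgname-$pkgver"
  make
}
'''

# Top-level scalar assignment: key=value (value does not start with a parenthesis).
SCALAR_RE = re.compile(r"^(\w+)=([^(\n][^\n]*)$", re.MULTILINE)
# Top-level array assignment: key=( ... ), possibly spanning several lines.
ARRAY_RE = re.compile(r"^(\w+)=\((.*?)\)", re.MULTILINE | re.DOTALL)


def unquote(value: str) -> str:
    """Strip one pair of matching enclosing quotes, if present."""
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


def parse_pkgbuild(content: str) -> Dict[str, Union[str, List[str]]]:
    """Collect top-level assignments; bash expansions are NOT evaluated."""
    pkg: Dict[str, Union[str, List[str]]] = {}
    for key, value in SCALAR_RE.findall(content):
        pkg[key] = unquote(value)
    for key, body in ARRAY_RE.findall(content):
        # shlex handles quoting and whitespace (including newlines) between items.
        pkg[key] = shlex.split(body, comments=True)
    return pkg
```

The special cases this misses (computed values, arrays containing a closing parenthesis inside quotes, variables defined inside functions) are the reason a regex-based approach stays fragile.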

Examples of PKGBUILD files I've found that can be really hard to parse:

Related to T4233

Test Plan

I will run this one in Docker to see how long the first run takes and to evaluate the accuracy of the parsing results.


Migrated from D7812 (view on Phabricator)

Activity

  • Build has FAILED

    Patch application report for D7812 (id=28215)

    Rebasing onto aa8c8cb3...

    Current branch diff-target is up to date.
    Changes applied before test
    commit b016519fc6cf9810be42364f140d294c96c9c7c2
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed May 11 13:34:32 2022 +0200
    
        [WIP] Add arch lister module.
        
        For now it fetch a git repository to list origin parsing PKGBUILD files.

    Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/ See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/console

  • Author Contributor

    Mypy fix

  • Build is green

    Patch application report for D7812 (id=28217)

    Could not rebase; Attempt merge onto aa8c8cb3...

    Updating aa8c8cb..5a1dd24
    Fast-forward
     CONTRIBUTORS                                       |   1 +
     setup.py                                           |   1 +
     swh/lister/arch/__init__.py                        |  12 +
     swh/lister/arch/lister.py                          | 303 +++++++++++++++++++++
     swh/lister/arch/tasks.py                           |  19 ++
     swh/lister/arch/tests/__init__.py                  |  31 +++
     .../fake-archlinux-svntogit-packages-index.tar.gz  | Bin 0 -> 12173 bytes
     .../tests/data/fake_archlinux_repository_init.sh   | 129 +++++++++
     swh/lister/arch/tests/test_lister.py               | 131 +++++++++
     swh/lister/arch/tests/test_tasks.py                |  19 ++
     10 files changed, 646 insertions(+)
     create mode 100644 swh/lister/arch/__init__.py
     create mode 100644 swh/lister/arch/lister.py
     create mode 100644 swh/lister/arch/tasks.py
     create mode 100644 swh/lister/arch/tests/__init__.py
     create mode 100644 swh/lister/arch/tests/data/fake-archlinux-svntogit-packages-index.tar.gz
     create mode 100755 swh/lister/arch/tests/data/fake_archlinux_repository_init.sh
     create mode 100644 swh/lister/arch/tests/test_lister.py
     create mode 100644 swh/lister/arch/tests/test_tasks.py
    Changes applied before test
    commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed May 11 14:59:24 2022 +0200
    
        Mypy fix, Use Typing.List instead
    
    commit b016519fc6cf9810be42364f140d294c96c9c7c2
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed May 11 13:34:32 2022 +0200
    
        [WIP] Add arch lister module.
        
        For now it fetch a git repository to list origin parsing PKGBUILD files.

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/521/ for more details.

  • Author Contributor

    Updating !423 (closed): [WIP] Add arch lister module.

  • Build is green

    Patch application report for D7812 (id=28218)

    Rebasing onto aa8c8cb3...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed May 11 14:59:24 2022 +0200
    
        Mypy fix, Use Typing.List instead
    
    commit b016519fc6cf9810be42364f140d294c96c9c7c2
    Author: Franck Bret <franck.bret@octobus.net>
    Date:   Wed May 11 13:34:32 2022 +0200
    
        [WIP] Add arch lister module.
        
        For now it fetch a git repository to list origin parsing PKGBUILD files.

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/522/ for more details.

  • Author Contributor

    I've run several experiments to find a better way to list Arch Linux packages.

    The most efficient way I've found is to download tar.gz files, which contain one directory per package, each holding a "desc" file with easy-to-parse metadata. It works fine but retrieves only the latest version of a package.

    Here are some execution time metrics for downloading the archives and parsing the desc files.

    Found 266 packages from https://archive.archlinux.org/repos/last/core/os/x86_64/core.files.tar.gz in 1.4924319160054438 seconds
    
    Found 3035 packages from https://archive.archlinux.org/repos/last/extra/os/x86_64/extra.files.tar.gz in 5.644616681995103 seconds
    
    Found 9161 packages from https://archive.archlinux.org/repos/last/community/os/x86_64/community.files.tar.gz in 16.14458583202213 seconds
    

    Example of retrieved package data after parsing:

    {'arch': 'x86_64',
     'repo': 'core',
     'base': 'acl',
     'builddate': '1643730617',
     'conflicts': 'xfsacl',
     'csize': '138970',
     'desc': 'Access control list utilities, libraries and headers',
     'filename': 'acl-2.3.1-2-x86_64.pkg.tar.zst',
     'isize': '325349',
     'license': 'LGPL',
     'md5sum': '718c93159ce4dfc6f789ffe27ce276e8',
     'name': 'acl',
     'packager': 'Christian Hesse <eworm@archlinux.org>',
     'pgpsig': 'iHUEABYIAB0WIQQEKYl95fO9rFN6MGltQr3RFuAGjwUCYflW2QAKCRBtQr3RFuAGj/waAP9U7gJZ0YRfftuGdc4shJdSIfspuWb3nZK+fj7My5z4zQD/SBpepSM3Cxr8Pw2LU5adq4UI0HWFZFsHrg3179XJqgI=',
     'project_url': 'https://savannah.nongnu.org/projects/acl',
     'replaces': 'xfsacl',
     'sha256sum': '20873a994a0728de5b05857129c290e9a8c9bba2236cc30bcffa7b746ffe9218',
     'url': 'https://archive.archlinux.org/packages/.all/acl-2.3.1-2-x86_64.pkg.tar.zst',
     'version': '2.3.1-2'}

    If we are OK with getting only the latest version, we can go this way.
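As a rough sketch of the approach described above (not the code of the future patch), the "desc" files can be read straight from the downloaded archive and turned into dictionaries like the one shown. The function names `parse_desc` and `iter_packages` are hypothetical; the sample reuses values from the `acl` example, and the `%KEY%` section layout is the one pacman uses for its database entries.

```python
import io
import tarfile
from typing import Dict, Iterator

# A shortened "desc" file, using values from the acl example above.
SAMPLE_DESC = """\
%FILENAME%
acl-2.3.1-2-x86_64.pkg.tar.zst

%NAME%
acl

%VERSION%
2.3.1-2

%ARCH%
x86_64
"""


def parse_desc(text: str) -> Dict[str, str]:
    """Parse sections made of a %KEY% line followed by one or more value lines."""
    data: Dict[str, str] = {}
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        header, values = lines[0], lines[1:]
        if header.startswith("%") and header.endswith("%"):
            data[header.strip("%").lower()] = "\n".join(values)
    return data


def iter_packages(archive: bytes) -> Iterator[Dict[str, str]]:
    """Yield one parsed dict per '<package>/desc' member of a .tar.gz archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith("/desc"):
                extracted = tar.extractfile(member)
                if extracted is not None:
                    yield parse_desc(extracted.read().decode())
```

Downloading the archive itself is left out; `iter_packages` takes the raw bytes so it can be fed from any HTTP client.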

    Nonetheless, it's possible to get other versions of a package through two different strategies, each with its pros and cons:

    1. Download the index https://archive.archlinux.org/packages/.all/index.0.xz, which contains a file that lists several previous versions, for example:
    mercurial-4.8.2-1-x86_64
    mercurial-4.9-1-x86_64
    mercurial-4.9.1-1-x86_64
    mercurial-5.0-1-x86_64
    mercurial-5.0.1-1-x86_64
    mercurial-5.0.2-1-x86_64
    mercurial-5.1-1-x86_64
    mercurial-5.1.2-1-x86_64
    mercurial-5.2-1-x86_64
    mercurial-5.2.1-1-x86_64
    mercurial-5.2.2-1-x86_64
    mercurial-5.2.2-2-x86_64
    mercurial-5.3-1-x86_64
    mercurial-5.3.1-1-x86_64
    mercurial-5.3.2-1-x86_64
    mercurial-5.4-1-x86_64
    mercurial-5.4.1-1-x86_64
    mercurial-5.4-2-x86_64
    mercurial-5.4.2-1-x86_64
    mercurial-5.5-1-x86_64
    mercurial-5.5.1-1-x86_64
    mercurial-5.5.2-1-x86_64
    mercurial-5.6-1-x86_64
    mercurial-5.6.1-1-x86_64
    mercurial-5.6-2-x86_64
    mercurial-5.6-3-x86_64
    mercurial-5.7-1-x86_64
    mercurial-5.7.1-1-x86_64
    mercurial-5.8-1-x86_64
    mercurial-5.8.1-1-x86_64
    mercurial-5.8-2-x86_64
    mercurial-5.9.1-1-x86_64
    mercurial-5.9.1-2-x86_64
    mercurial-5.9.2-1-x86_64
    mercurial-5.9.3-1-x86_64
    mercurial-6.0-1-x86_64
    mercurial-6.0.1-1-x86_64
    mercurial-6.0-2-x86_64
    mercurial-6.0.2-1-x86_64
    mercurial-6.0-3-x86_64
    mercurial-6.0.3-1-x86_64
    mercurial-6.1-1-x86_64
    mercurial-6.1.1-1-x86_64
    mercurial-6.1-2-x86_64
    mercurial-6.1.2-1-x86_64
    

    Pros: a single 500 kB file to download, and one dynamic regex to find matches.
    Cons: we only get a filename, with no date and no metadata. The file has about 400,000 entries, and it took 16 minutes for the regex to find matches for about 15,000 packages.

    2. Scrape the server directory listing to get the previous versions of a package with their release dates, for example https://archive.archlinux.org/packages/m/mercurial/
    Pros: easy to scrape, and a release date is associated with each version.
    Cons: scraping about 15,000 pages can be quite slow, and there is no metadata.
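For the index strategy, one possible way around the slow lookup is to group the entries by package name in a single pass, instead of searching the whole file once per package. This is only a sketch: `split_entry`, `group_versions` and `read_index` are hypothetical names, and it relies on the fact that pkgver, pkgrel and the architecture never contain a dash, so each entry can be split from the right.

```python
import lzma
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def split_entry(entry: str) -> Optional[Tuple[str, str, str]]:
    """Split 'mercurial-4.8.2-1-x86_64' into (name, 'pkgver-pkgrel', arch).

    Package names may contain dashes, so split from the right.
    Returns None for lines that do not have the expected shape.
    """
    parts = entry.rsplit("-", 3)
    if len(parts) != 4:
        return None
    name, pkgver, pkgrel, arch = parts
    return name, f"{pkgver}-{pkgrel}", arch


def group_versions(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build {package name: [versions...]} in one pass over the index."""
    versions: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        parsed = split_entry(line.strip())
        if parsed is not None:
            name, version, _arch = parsed
            versions[name].append(version)
    return versions


def read_index(path: str) -> Dict[str, List[str]]:
    """Read a local copy of the xz-compressed index, one entry per line."""
    with lzma.open(path, "rt", encoding="utf-8") as index:
        return group_versions(index)
```

With the mapping built once, looking up the versions of any package is a dictionary access. The drawback listed above still applies: the index carries no date or metadata.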

    @vlorentz @ardumont @bchauvet what do you think? Which do you prefer?

    Also, should I cancel this one and create a new one to go on?

  • Note: List the contents of pacman databases as JSON for web applications snippet [1]

  • I've run several experiments to find a better way to list Arch Linux packages.

    The most efficient way I've found is to download tar.gz files, which contain one directory per package, each holding a "desc" file with easy-to-parse metadata. It works fine but retrieves only the latest version of a package.

    Here are some execution time metrics for downloading the archives and parsing the desc files.

    Found 266 packages from https://archive.archlinux.org/repos/last/core/os/x86_64/core.files.tar.gz in 1.4924319160054438 seconds
    
    Found 3035 packages from https://archive.archlinux.org/repos/last/extra/os/x86_64/extra.files.tar.gz in 5.644616681995103 seconds
    
    Found 9161 packages from https://archive.archlinux.org/repos/last/community/os/x86_64/community.files.tar.gz in 16.14458583202213 seconds
    

    Example of retrieved package data after parsing:

    {'arch': 'x86_64',
     'repo': 'core',
     'base': 'acl',
     'builddate': '1643730617',
     'conflicts': 'xfsacl',
     'csize': '138970',
     'desc': 'Access control list utilities, libraries and headers',
     'filename': 'acl-2.3.1-2-x86_64.pkg.tar.zst',
     'isize': '325349',
     'license': 'LGPL',
     'md5sum': '718c93159ce4dfc6f789ffe27ce276e8',
     'name': 'acl',
     'packager': 'Christian Hesse <eworm@archlinux.org>',
     'pgpsig': 'iHUEABYIAB0WIQQEKYl95fO9rFN6MGltQr3RFuAGjwUCYflW2QAKCRBtQr3RFuAGj/waAP9U7gJZ0YRfftuGdc4shJdSIfspuWb3nZK+fj7My5z4zQD/SBpepSM3Cxr8Pw2LU5adq4UI0HWFZFsHrg3179XJqgI=',
     'project_url': 'https://savannah.nongnu.org/projects/acl',
     'replaces': 'xfsacl',
     'sha256sum': '20873a994a0728de5b05857129c290e9a8c9bba2236cc30bcffa7b746ffe9218',
     'url': 'https://archive.archlinux.org/packages/.all/acl-2.3.1-2-x86_64.pkg.tar.zst',
     'version': '2.3.1-2'}

    If we are OK with getting only the latest version, we can go this way.

    (as a data point) That's currently the way we are retrieving information for CRAN packages. CRAN (infra) only exposes the latest version of a package (it exposes archived versions through a dedicated instance that we are not currently listing).

    But our lister lists them every day, so from the moment we started ingesting them, we should already have several versions of a given package. At some point, we'll have to attend to the archived ones as well.

    So I guess, given your current experiments reported here (through the description and this very comment), it'd be OK to do the same as for CRAN here.

    Nonetheless, it's possible to get other versions of a package through two different strategies, each with its pros and cons:

    1. Download the index https://archive.archlinux.org/packages/.all/index.0.xz, which contains a file that lists several previous versions, for example:
    mercurial-4.8.2-1-x86_64
    mercurial-4.9-1-x86_64
    mercurial-4.9.1-1-x86_64
    mercurial-5.0-1-x86_64
    ...

    Pros: a single 500 kB file to download, and one dynamic regex to find matches.
    Cons: we only get a filename, with no date and no metadata. The file has about 400,000 entries, and it took 16 minutes for the regex to find matches for about 15,000 packages.

    2. Scrape the server directory listing to get the previous versions of a package with their release dates, for example https://archive.archlinux.org/packages/m/mercurial/
    Pros: easy to scrape, and a release date is associated with each version.
    Cons: scraping about 15,000 pages can be quite slow, and there is no metadata.
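For the directory-listing strategy, a sketch of the parsing step could look like the following. It assumes the listing is a plain autoindex-style page where each link is followed by a `DD-Mon-YYYY HH:MM` timestamp; that layout, the `parse_listing` name and the sample page (including its dates and sizes) are assumptions made for this example, not taken from the real server.

```python
import re
from datetime import datetime
from typing import Any, Dict, List

# Hypothetical listing excerpt: dates and sizes are placeholders.
SAMPLE_LISTING = """\
<a href="../">../</a>
<a href="mercurial-6.1.1-1-x86_64.pkg.tar.zst">mercurial-6.1.1-1-x86_64.pkg.tar.zst</a>      01-Jan-2022 00:00    1M
<a href="mercurial-6.1.1-1-x86_64.pkg.tar.zst.sig">mercurial-6.1.1-1-x86_64.pkg.tar.zst.sig</a>  01-Jan-2022 00:00   310
<a href="mercurial-6.1.2-1-x86_64.pkg.tar.zst">mercurial-6.1.2-1-x86_64.pkg.tar.zst</a>      02-Jan-2022 00:00    1M
"""

# Package archives only: the href must end right after one compression suffix,
# which leaves out the detached ".sig" signature files.
LINK_RE = re.compile(
    r'<a href="(?P<filename>[^"]+\.pkg\.tar\.\w+)">[^<]*</a>\s+'
    r"(?P<date>\d{2}-\w{3}-\d{4} \d{2}:\d{2})"
)


def parse_listing(html: str) -> List[Dict[str, Any]]:
    """Return one {filename, version, date} dict per package archive link."""
    versions = []
    for match in LINK_RE.finditer(html):
        filename = match.group("filename")
        stem = filename.split(".pkg.tar.")[0]
        _name, pkgver, pkgrel, _arch = stem.rsplit("-", 3)
        versions.append(
            {
                "filename": filename,
                "version": f"{pkgver}-{pkgrel}",
                "date": datetime.strptime(match.group("date"), "%d-%b-%Y %H:%M"),
            }
        )
    return versions
```

One request per package is still needed to fetch each page, which is where the cost mentioned above comes from.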

    @vlorentz @ardumont @bchauvet what do you think? Which do you prefer?

    As mentioned, I'd go for the simplest solution (the first one, which allows simpler metadata retrieval, for the latest version only).

    @vlorentz @bchauvet thoughts?

    Also, should I cancel this one and create a new one to go on?

    You can go either way. If you keep that one, it'd be easier to compare with your future version (and the future review will be simpler, no noisy old comments). If you keep it, we can still find its initial version through the history tab (within the web ui).

    Well, go simple, create a new one? (yeah, the opposite of what i said to you on irc on friday ¯_(ツ)_/¯ ;)

    Cheers,

  • @vlorentz started a thread on the diff

            "b2sums",
        ]

        pkg: Dict = {}

        # For each mapping iterate over to match content
        for k, v in authors_mapping.items():
            AUTHORS_RE = re.compile(rf"{v}\s*(.*$)", re.MULTILINE)
            pkg[k] = AUTHORS_RE.findall(content)

        for k in str_mapping:
            SINGLE_RE = re.compile(rf"(?<={k}=)(.+)", re.M)
            single = SINGLE_RE.findall(content)
            # cleanup the result from enclosing single or double quotes
            res = single[0]
            res = res.strip('"').strip("'")

  • @vlorentz started a thread on the diff

        repository_path: Path, pkgbuild_path: Path
    ) -> List[Dict[str, Any]]:
        """Retrieve all previous versions of an Arch Linux package.

        Note that Arch Linux strives to maintain the latest stable release
        versions of its software. The git repository listing PKGBUILD files do not
        have an explicit list of previous released versions of a package, just the
        latest one.

        To be able to list all previous existing versions we need to introspect the
        history of the PKGBUILD file through git log to git patch command.
        """
        cmd = (
            rf"git log --pretty='+date=%cI' -p -L '^/pkgver=.*/,+1:{pkgbuild_path}'"
            rf" | grep '+pkgver\|+date'"
        )

    • Shell injection: pkgbuild_path is untrusted, and it is neither escaped nor validated.

      Try Dulwich; it's more reliable than parsing Git's output anyway.
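As a smaller step than switching to Dulwich, the injection can also be avoided by passing git an argument list (no shell) and doing the `grep` part in Python. This is a sketch of that alternative, not of the Dulwich approach suggested above; `git_log_cmd`, `parse_versions` and the sample output are hypothetical, while `pkgbuild_get_versions` keeps the signature shown in the diff.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

# Hypothetical excerpt of the git log output, for illustration only.
SAMPLE_OUTPUT = """\
+date=2022-02-01T16:50:17+01:00
diff --git a/PKGBUILD b/PKGBUILD
--- a/PKGBUILD
+++ b/PKGBUILD
-pkgver=2.3.0
+pkgver=2.3.1
 pkgrel=1
+date=2021-03-15T10:00:00+01:00
+pkgver=2.3.0
"""


def git_log_cmd(pkgbuild_path: Path) -> List[str]:
    """Build the argv list; the path is one argument and is never shell-parsed."""
    return [
        "git", "log", "--pretty=+date=%cI", "-p",
        "-L", f"^/pkgver=.*/,+1:{pkgbuild_path}",
    ]


def parse_versions(output: str) -> List[Dict[str, str]]:
    """Pair each added 'pkgver=' line with the date of the commit it belongs to."""
    versions = []
    date = ""
    for line in output.splitlines():
        if line.startswith("+date="):
            date = line[len("+date="):]
        elif line.startswith("+pkgver="):
            versions.append({"pkgver": line[len("+pkgver="):], "date": date})
    return versions


def pkgbuild_get_versions(
    repository_path: Path, pkgbuild_path: Path
) -> List[Dict[str, str]]:
    """Run git without a shell and filter its output in Python."""
    result = subprocess.run(
        git_log_cmd(pkgbuild_path),
        cwd=repository_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_versions(result.stdout)
```

This still parses Git's textual output, so the reliability concern raised in the review remains.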

  • @vlorentz started a thread on the diff

        Each file path corresponds to a PKGBUILD file that contains information
        about a Arch linux official package referenced as 'core', 'extra' or
        'community' repository.

        https://wiki.archlinux.org/title/Official_repositories
        """
        arch_index = sorted(
            path
            for path in self.DESTINATION_PATH.rglob("*")
            if not any(part.startswith(".") for part in path.parts)
            and path.is_file()
            and path.name == "PKGBUILD"
            and (
                "core" in path.parent.name
                or "extra" in path.parent.name
                or "community" in path.parent.name

  • @vlorentz started a thread on the diff

    page = []
    with arch.open("rb") as current_file:
        pkg = pkgbuild_parser(content=current_file.read().decode())
    versions = pkgbuild_get_versions(
        repository_path=self.DESTINATION_PATH, pkgbuild_path=arch
    )
    for version in versions:
        page.append(
            dict(
                pkgname=pkg["pkgname"],
                pkgver=version["pkgver"],
                arch=pkg["arch"][0],  # TODO: There can be more than arch
                repo=arch.parent.name.split("-")[0],
                pkg=pkg["source"][
                    0
                ],  # TODO: The can be more than one source

  • Some early comments before I forget, in case they are useful. Feel free to ignore them if you are going to change these parts of the code anyway.

  • mentioned in commit 1bf11aa2

  • Author Contributor

    In !423 (closed), @ardumont wrote: Well, go simple, create a new one?

    The new patch, which fetches archives instead of the git repository, is !424 (closed)

  • The new patch, which fetches archives instead of the git repository, is !424 (closed)

    Awesome, thanks.

    You can close this one then (it won't disappear).

  • Author Contributor

    Abandoned in favor of !424 (closed)

  • Author Contributor

    Merge request was abandoned

  • closed

  • mentioned in merge request !424 (closed)
