[WIP] Add arch lister module.
First stab at an Arch Linux lister.
Arch Linux provides several ways to discover packages, but no easy way to get the history of previously released versions of a package. After some discussion on the Arch Linux forum (https://bbs.archlinux.org/viewtopic.php?id=275574), I've gone the git repository way.
This lister fetches a git repository and lists origins by parsing PKGBUILD files.
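As a rough illustration of how version history can be recovered from git, here is a minimal sketch. It assumes the history of a PKGBUILD has already been filtered down to `+date=` and `+pkgver=` lines (the shape produced by the `git log ... | grep` pipeline quoted in the review further down). The function name `parse_version_history` and the sample log are hypothetical, not part of the patch.

```python
from typing import Dict, List


def parse_version_history(log_output: str) -> List[Dict[str, str]]:
    """Pair each '+pkgver=' line with the most recent '+date=' line before it."""
    versions: List[Dict[str, str]] = []
    current_date = ""
    for raw_line in log_output.splitlines():
        line = raw_line.strip()
        if line.startswith("+date="):
            # Commit header emitted by --pretty='+date=%cI'
            current_date = line[len("+date="):]
        elif line.startswith("+pkgver="):
            # Added diff line: the pkgver introduced by that commit
            pkgver = line[len("+pkgver="):].strip("\"'")
            versions.append({"pkgver": pkgver, "date": current_date})
    return versions


# Invented sample: the middle commit touched the tracked range without
# changing pkgver, so it yields no version entry.
sample = """\
+date=2022-04-02T10:00:00+00:00
+pkgver=6.1.1
+date=2022-03-20T09:00:00+00:00
+date=2022-03-01T08:00:00+00:00
+pkgver=6.1
"""
history = parse_version_history(sample)
```

Commits that only change the line after `pkgver` appear in the log without a `+pkgver=` line, which is why the sketch keeps the last seen date instead of assuming strict date/version pairs.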
The Arch Linux distribution is made of the 'core', 'extra' and 'community' repositories. Core and extra packages are listed in https://github.com/archlinux/svntogit-packages, and 'community' packages in https://github.com/archlinux/svntogit-community
For now it fetches only 'core' and 'extra' packages from the first repository (421.44 MiB at this time). I'll add the second one (1.58 GiB) if we are OK with this first implementation. Both git repositories receive several commits a day.
PKGBUILD files are executable bash files. The common way to build a package is to use makepkg, which has an internal PKGBUILD parser (https://gitlab.archlinux.org/pacman/pacman/blob/master/scripts/makepkg.sh.in). I did not find a PKGBUILD parser in Python on PyPI. There is one Python module on GitHub named 'parched' (https://github.com/sebnow/parched). I have written a naive parser, but it is not yet robust enough to handle all special cases.
Examples of PKGBUILD files I've found that can be really hard to parse:
- Firefox translations (xpi files): source and pkgname are dynamically generated, https://github.com/archlinux/svntogit-packages/blob/packages/firefox-i18n/repos/extra-any/PKGBUILD
- Licenses: a set of non-executable files, https://github.com/archlinux/svntogit-packages/blob/master/licenses/repos/core-any/PKGBUILD
- Bash: internal bash variables are sometimes written ${myvar}, sometimes $myvar, https://github.com/archlinux/svntogit-packages/blob/master/bash/repos/core-x86_64/PKGBUILD
- Libtool: DVCS sources, https://github.com/archlinux/svntogit-packages/blob/master/libtool/repos/core-x86_64/PKGBUILD
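To illustrate the `${myvar}` / `$myvar` case above, here is a minimal sketch of a naive scalar parser that expands both spellings. It is not the parser from this patch: the names `parse_scalars`, `ASSIGN_RE` and `VAR_RE` are hypothetical, the sample PKGBUILD is invented, and arrays, functions and command substitution are deliberately ignored.

```python
import re
from typing import Dict

# Top-level `name=value` assignments (indented lines inside functions are skipped)
ASSIGN_RE = re.compile(r"^(\w+)=(.+)$", re.MULTILINE)
# Matches both ${name} and $name
VAR_RE = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def parse_scalars(content: str) -> Dict[str, str]:
    """Collect scalar assignments and expand previously defined variables."""
    values: Dict[str, str] = {}

    def expand(match: "re.Match[str]") -> str:
        name = match.group(1) or match.group(2)
        # Unknown variables are left untouched rather than guessed
        return values.get(name, match.group(0))

    for name, raw in ASSIGN_RE.findall(content):
        if raw.startswith("("):
            continue  # bash arrays are out of scope for this sketch
        value = raw.strip().strip("\"'")
        values[name] = VAR_RE.sub(expand, value)
    return values


# Invented sample in the style of a PKGBUILD
SAMPLE = """\
pkgname=bash
_basever=5.1
_patchlevel=016
pkgver=${_basever}.$_patchlevel
url="https://www.gnu.org/software/$pkgname/"
source=(https://example.org/bash-$_basever.tar.gz)
"""
pkg = parse_scalars(SAMPLE)
```

This only covers variables defined earlier in the same file; dynamically generated values such as those in the firefox-i18n PKGBUILD would still need a real bash evaluation.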
Related to T4233
Test Plan
I will run this in Docker to see how long the first run takes and to evaluate the accuracy of the parsing results.
Migrated from D7812 (view on Phabricator)
Activity
Build has FAILED
Patch application report for D7812 (id=28215)
Rebasing onto aa8c8cb3...
Current branch diff-target is up to date.
Changes applied before test
commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date: Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.

    For now it fetch a git repository to list origin parsing PKGBUILD files.

Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/520/console
Build is green
Patch application report for D7812 (id=28217)
Could not rebase; Attempt merge onto aa8c8cb3...
Updating aa8c8cb..5a1dd24
Fast-forward
 CONTRIBUTORS | 1 +
 setup.py | 1 +
 swh/lister/arch/__init__.py | 12 +
 swh/lister/arch/lister.py | 303 +++++++++++++++++++++
 swh/lister/arch/tasks.py | 19 ++
 swh/lister/arch/tests/__init__.py | 31 +++
 .../fake-archlinux-svntogit-packages-index.tar.gz | Bin 0 -> 12173 bytes
 .../tests/data/fake_archlinux_repository_init.sh | 129 +++++++++
 swh/lister/arch/tests/test_lister.py | 131 +++++++++
 swh/lister/arch/tests/test_tasks.py | 19 ++
 10 files changed, 646 insertions(+)
 create mode 100644 swh/lister/arch/__init__.py
 create mode 100644 swh/lister/arch/lister.py
 create mode 100644 swh/lister/arch/tasks.py
 create mode 100644 swh/lister/arch/tests/__init__.py
 create mode 100644 swh/lister/arch/tests/data/fake-archlinux-svntogit-packages-index.tar.gz
 create mode 100755 swh/lister/arch/tests/data/fake_archlinux_repository_init.sh
 create mode 100644 swh/lister/arch/tests/test_lister.py
 create mode 100644 swh/lister/arch/tests/test_tasks.py
Changes applied before test
commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
Author: Franck Bret <franck.bret@octobus.net>
Date: Wed May 11 14:59:24 2022 +0200

    Mypy fix, Use Typing.List instead

commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date: Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.

    For now it fetch a git repository to list origin parsing PKGBUILD files.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/521/ for more details.
Updating !423 (closed): [WIP] Add arch lister module.
Build is green
Patch application report for D7812 (id=28218)
Rebasing onto aa8c8cb3...
Current branch diff-target is up to date.
Changes applied before test
commit 5a1dd245b5eedc6deb7e414a826710c3762c5770
Author: Franck Bret <franck.bret@octobus.net>
Date: Wed May 11 14:59:24 2022 +0200

    Mypy fix, Use Typing.List instead

commit b016519fc6cf9810be42364f140d294c96c9c7c2
Author: Franck Bret <franck.bret@octobus.net>
Date: Wed May 11 13:34:32 2022 +0200

    [WIP] Add arch lister module.

    For now it fetch a git repository to list origin parsing PKGBUILD files.
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/522/ for more details.
I've made several experiments in order to find a better way to list Arch Linux packages.
The most efficient way I've found is to download the tar.gz files, which contain one directory per package, each with a "desc" file holding easy-to-parse metadata. It works fine but retrieves only the latest version of a package.
Here are some execution time metrics for downloading the archives and parsing the desc files.
Found 266 packages from https://archive.archlinux.org/repos/last/core/os/x86_64/core.files.tar.gz in 1.4924319160054438 seconds
Found 3035 packages from https://archive.archlinux.org/repos/last/extra/os/x86_64/extra.files.tar.gz in 5.644616681995103 seconds
Found 9161 packages from https://archive.archlinux.org/repos/last/community/os/x86_64/community.files.tar.gz in 16.14458583202213 seconds
Example of retrieved package data after parsing:
{'arch': 'x86_64', 'repo': 'core', 'base': 'acl', 'builddate': '1643730617', 'conflicts': 'xfsacl', 'csize': '138970', 'desc': 'Access control list utilities, libraries and headers', 'filename': 'acl-2.3.1-2-x86_64.pkg.tar.zst', 'isize': '325349', 'license': 'LGPL', 'md5sum': '718c93159ce4dfc6f789ffe27ce276e8', 'name': 'acl', 'packager': 'Christian Hesse <eworm@archlinux.org>', 'pgpsig': 'iHUEABYIAB0WIQQEKYl95fO9rFN6MGltQr3RFuAGjwUCYflW2QAKCRBtQr3RFuAGj/waAP9U7gJZ0YRfftuGdc4shJdSIfspuWb3nZK+fj7My5z4zQD/SBpepSM3Cxr8Pw2LU5adq4UI0HWFZFsHrg3179XJqgI=', 'project_url': 'https://savannah.nongnu.org/projects/acl', 'replaces': 'xfsacl', 'sha256sum': '20873a994a0728de5b05857129c290e9a8c9bba2236cc30bcffa7b746ffe9218', 'url': 'https://archive.archlinux.org/packages/.all/acl-2.3.1-2-x86_64.pkg.tar.zst', 'version': '2.3.1-2'}
If we are OK with getting only the latest version, we can go this way.
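The archive approach described above can be sketched as follows. This is a minimal illustration, not the code behind the timings: it assumes each package directory holds a `desc` file made of `%KEY%` headers followed by their value and a blank line (the usual pacman database layout), and the helper names `parse_desc`, `list_packages` and `make_fake_archive` are hypothetical. A tiny in-memory archive stands in for the real download.

```python
import io
import tarfile
from typing import Dict, List


def parse_desc(text: str) -> Dict[str, str]:
    """Turn '%KEY%' / value blocks into a dict with lowercase keys."""
    data: Dict[str, str] = {}
    for block in text.strip().split("\n\n"):
        header, _, value = block.partition("\n")
        # Multi-line values (e.g. dependency lists) are kept as one string
        data[header.strip("%").lower()] = value.strip()
    return data


def list_packages(archive: bytes) -> List[Dict[str, str]]:
    """Parse every '<dir>/desc' member of a gzip-compressed tar archive."""
    packages = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith("/desc"):
                extracted = tar.extractfile(member)
                if extracted is not None:
                    packages.append(parse_desc(extracted.read().decode()))
    return packages


def make_fake_archive() -> bytes:
    """Build a one-package archive in memory (stand-in for core.files.tar.gz)."""
    desc = (
        "%FILENAME%\nacl-2.3.1-2-x86_64.pkg.tar.zst\n\n"
        "%NAME%\nacl\n\n"
        "%VERSION%\n2.3.1-2\n\n"
        "%LICENSE%\nLGPL\n"
    ).encode()
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="acl-2.3.1-2/desc")
        info.size = len(desc)
        tar.addfile(info, io.BytesIO(desc))
    return buffer.getvalue()


packages = list_packages(make_fake_archive())
```

Fields such as `repo`, `arch` or `url` in the example output above would have to be derived separately, for instance from the archive URL and the `filename` value.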
Nonetheless, it's possible to get other versions of a package through two different strategies, each with some pros and cons:
- Download the index (https://archive.archlinux.org/packages/.all/index.0.xz), which contains a file that lists several previous versions, for example:
mercurial-4.8.2-1-x86_64 mercurial-4.9-1-x86_64 mercurial-4.9.1-1-x86_64 mercurial-5.0-1-x86_64 mercurial-5.0.1-1-x86_64 mercurial-5.0.2-1-x86_64 mercurial-5.1-1-x86_64 mercurial-5.1.2-1-x86_64 mercurial-5.2-1-x86_64 mercurial-5.2.1-1-x86_64 mercurial-5.2.2-1-x86_64 mercurial-5.2.2-2-x86_64 mercurial-5.3-1-x86_64 mercurial-5.3.1-1-x86_64 mercurial-5.3.2-1-x86_64 mercurial-5.4-1-x86_64 mercurial-5.4.1-1-x86_64 mercurial-5.4-2-x86_64 mercurial-5.4.2-1-x86_64 mercurial-5.5-1-x86_64 mercurial-5.5.1-1-x86_64 mercurial-5.5.2-1-x86_64 mercurial-5.6-1-x86_64 mercurial-5.6.1-1-x86_64 mercurial-5.6-2-x86_64 mercurial-5.6-3-x86_64 mercurial-5.7-1-x86_64 mercurial-5.7.1-1-x86_64 mercurial-5.8-1-x86_64 mercurial-5.8.1-1-x86_64 mercurial-5.8-2-x86_64 mercurial-5.9.1-1-x86_64 mercurial-5.9.1-2-x86_64 mercurial-5.9.2-1-x86_64 mercurial-5.9.3-1-x86_64 mercurial-6.0-1-x86_64 mercurial-6.0.1-1-x86_64 mercurial-6.0-2-x86_64 mercurial-6.0.2-1-x86_64 mercurial-6.0-3-x86_64 mercurial-6.0.3-1-x86_64 mercurial-6.1-1-x86_64 mercurial-6.1.1-1-x86_64 mercurial-6.1-2-x86_64 mercurial-6.1.2-1-x86_64
  Pros: a single 500 kB file to download, and one dynamically built regex to find matches.
  Cons: we only get a filename, with no date and no metadata. The file has roughly 400,000 entries, and it took 16 minutes for the regex to find matches for roughly 15,000 packages.
- Scrape the server's directory listing to get the previous versions of a package with their release dates, for example https://archive.archlinux.org/packages/m/mercurial/
  Pros: easy to scrape, and a release date is associated with each version.
  Cons: scraping roughly 15,000 pages can be quite slow, and there is no metadata.
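As a possible alternative to running one regex per package over the index, the entries could be grouped in a single pass. The sketch below is only an illustration of that idea and has not been timed against the real file. It assumes every entry has the shape `name-pkgver-pkgrel-arch` and that `pkgver`, `pkgrel` and `arch` never contain a hyphen; the function name `group_index` and the `lib32-acl` entry are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_index(lines: List[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map package name -> [(version, arch), ...] in one pass over the index."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for entry in lines:
        # Split from the right so hyphens inside the package name survive
        parts = entry.strip().rsplit("-", 3)
        if len(parts) != 4:
            continue  # not a name-pkgver-pkgrel-arch entry
        name, pkgver, pkgrel, arch = parts
        grouped[name].append((f"{pkgver}-{pkgrel}", arch))
    return dict(grouped)


index_lines = [
    "mercurial-4.8.2-1-x86_64",
    "mercurial-4.9-1-x86_64",
    "lib32-acl-2.3.1-1-x86_64",  # invented entry with a hyphenated name
]
grouped = group_index(index_lines)
```

A dictionary lookup per package then replaces a scan of the whole file, though the result still carries no date or metadata.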
@vlorentz @ardumont @bchauvet what do you think, what do you prefer?
Also, should I cancel that issue and create a new one to go on?
Note: List the contents of pacman databases as JSON for web applications snippet [1]
If we are OK with getting only the latest version, we can go this way.
(as a data point) That's currently the way we are retrieving information for CRAN packages. CRAN (infra) only exposes the latest version of a package (it exposes archived versions with a dedicated instance we are not currently listing).
But our lister is listing them every day, so from the moment we started ingesting them, we should already have some versions for each package. At some point, we'll have to attend to the archived ones as well.
So I guess, given your current experiments reported here (through the description and this very comment), it'd be OK to do the same as for CRAN here.
@vlorentz @ardumont @bchauvet what do you think, what do you prefer?
As mentioned, I'd go for the simplest solution (the first one, which allows simpler metadata retrieval, for the latest version only).
Also, should I cancel that issue and create a new one to go on?
You can go either way. If you keep that one, it'd be easier to compare with your future version (and the future review will be simpler, no noisy old comments). If you keep it, we can still find its initial version through the history tab (within the web ui).
Well, go simple, create a new one? (yeah, the opposite of what i said to you on irc on friday ¯_(ツ)_/¯ ;)
Cheers,
- swh/lister/arch/lister.py 0 → 100644
        "b2sums",
    ]

    pkg: Dict = {}

    # For each mapping iterate over to match content
    for k, v in authors_mapping.items():
        AUTHORS_RE = re.compile(rf"{v}\s*(.*$)", re.MULTILINE)
        pkg[k] = AUTHORS_RE.findall(content)

    for k in str_mapping:
        SINGLE_RE = re.compile(rf"(?<={k}=)(.+)", re.M)
        single = SINGLE_RE.findall(content)
        # cleanup the result from enclosing single or double quotes
        res = single[0]
        res = res.strip('"').strip("'")

- swh/lister/arch/lister.py 0 → 100644
    repository_path: Path, pkgbuild_path: Path
) -> List[Dict[str, Any]]:
    """Retrieve all previous versions of an Arch Linux package.

    Note that Arch Linux strives to maintain the latest stable release
    versions of its software. The git repository listing PKGBUILD files do not
    have an explicit list of previous released versions of a package, just the
    latest one.

    To be able to list all previous existing versions we need to introspect the
    history of the PKGBUILD file through git log to git patch command.
    """
    cmd = (
        rf"git log --pretty='+date=%cI' -p -L '^/pkgver=.*/,+1:{pkgbuild_path}'"
        rf" | grep '+pkgver\|+date'"
    )

- swh/lister/arch/lister.py 0 → 100644
        Each file path corresponds to a PKGBUILD file that contains information
        about a Arch linux official package referenced as 'core', 'extra' or
        'community' repository.

        https://wiki.archlinux.org/title/Official_repositories
        """
        arch_index = sorted(
            path
            for path in self.DESTINATION_PATH.rglob("*")
            if not any(part.startswith(".") for part in path.parts)
            and path.is_file()
            and path.name == "PKGBUILD"
            and (
                "core" in path.parent.name
                or "extra" in path.parent.name
                or "community" in path.parent.name

- swh/lister/arch/lister.py 0 → 100644
            page = []
            with arch.open("rb") as current_file:
                pkg = pkgbuild_parser(content=current_file.read().decode())
                versions = pkgbuild_get_versions(
                    repository_path=self.DESTINATION_PATH, pkgbuild_path=arch
                )
                for version in versions:
                    page.append(
                        dict(
                            pkgname=pkg["pkgname"],
                            pkgver=version["pkgver"],
                            arch=pkg["arch"][0],  # TODO: There can be more than arch
                            repo=arch.parent.name.split("-")[0],
                            pkg=pkg["source"][
                                0
                            ],  # TODO: The can be more than one source

Pass the list of archs and source packages; the loader can deal with them by creating them as different releases in the same origin (kind of like this: https://forge.softwareheritage.org/source/swh-loader-core/browse/master/swh/loader/package/pypi/loader.py$121-126 )
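A minimal sketch of that suggestion, assuming the parser returns `arch` and `source` as lists. The function name `build_page` and the output keys `archs` and `sources` are hypothetical, chosen only for illustration, and the sample values are invented.

```python
from typing import Any, Dict, List


def build_page(
    pkg: Dict[str, Any], versions: List[Dict[str, str]], repo: str
) -> List[Dict[str, Any]]:
    """Build one entry per version, keeping every arch and source."""
    return [
        {
            "pkgname": pkg["pkgname"],
            "pkgver": version["pkgver"],
            "archs": list(pkg["arch"]),  # full list instead of pkg["arch"][0]
            "repo": repo,
            "sources": list(pkg["source"]),  # full list instead of pkg["source"][0]
        }
        for version in versions
    ]


# Invented sample data
pkg = {
    "pkgname": "acl",
    "arch": ["x86_64", "any"],
    "source": ["https://example.org/acl-a.tar.gz", "https://example.org/acl-b.patch"],
}
page = build_page(pkg, [{"pkgver": "2.3.1"}, {"pkgver": "2.3.0"}], repo="core")
```

How the lists are turned into releases is left to the loader, as the comment above suggests.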
mentioned in commit 1bf11aa2
The new patch, which fetches archives instead of a git repository, is !424 (closed)
Awesome, thanks.
You can close this one then (it won't disappear).
Abandoned in favor of !424 (closed)
mentioned in merge request !424 (closed)