Implement cran loader with package manager mechanism
The extrinsic metadata are provided by the lister as the loading task parameters.
The intrinsic metadata lie in the DESCRIPTION file at the root of the tarball's tree [1].
DESCRIPTION uses a simple file format called DCF (Debian Control File).
See [2] for the necessary parsing tools.
[2] python3-debian
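To illustrate the format, here is a minimal standard-library sketch of reading a DCF paragraph: each field is a `Key: value` line, and lines starting with whitespace continue the previous field. The sample DESCRIPTION content and the `parse_dcf` helper are hypothetical, made up for illustration; a real loader would more likely rely on the parsing tools from [2].

```python
from typing import Dict, Optional

# Hypothetical sample, not taken from a real CRAN package.
SAMPLE_DESCRIPTION = """\
Package: examplepkg
Version: 0.1.0
Title: An Example Package
Author: Jane Doe [aut, cre],
  John Roe [ctb]
Maintainer: Jane Doe <jane@example.org>
Date: 2019-10-03
"""


def parse_dcf(text: str) -> Dict[str, str]:
    """Parse a single DCF paragraph into a dict of field name to value."""
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines; a DESCRIPTION holds one paragraph
        if line[0] in " \t" and current is not None:
            # Continuation line: append to the previous field's value.
            fields[current] += "\n" + line.strip()
        else:
            # Split on the first colon only, values may contain colons.
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


metadata = parse_dcf(SAMPLE_DESCRIPTION)
```

Multi-line values such as `Author` keep their line breaks, which matters later when splitting the field into individual contributors.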
Migrated from T2026 (view on Phabricator)
Activity
- Antoine R. Dumont added Archive coverage Origin-CRAN priority:Normal labels
- Antoine R. Dumont changed the description
- Phabricator Migration user mentioned in commit swh/devel/swh-loader-core@f5ea782f
- Author Maintainer
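    import logging
    from typing import Dict

    from debian.deb822 import Deb822

    logger = logging.getLogger(__name__)


    def parse_debian_control(filepath: str) -> Dict:
        """Parse debian control at filepath"""
        metadata = {}
        logger.debug('Debian control file %s', filepath)
        # Use a context manager so the file is closed after parsing
        with open(filepath) as f:
            for paragraph in Deb822.iter_paragraphs(f):
                logger.debug('paragraph: %s', paragraph)
                metadata.update(**paragraph)
        logger.debug('metadata parsed: %s', metadata)
        return metadata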
seems to do the trick
- Phabricator Migration user marked this issue as related to #2029 (closed)
- Author Maintainer
To get an idea of the possible fields (for parsing dates and authors), here is a sample of the artifacts listed by the cran lister [1].
There are several date fields (Date, Published, etc.) and author fields (Author, Maintainer), and their values vary in format.
- Author Maintainer
Here are some sample analyses of the cran dataset for the publication "date" and "author" fields.
'Date' and 'Published' fields:
    $ python ./analysis.py --with-date-repartition --dataset ./list-all-packages.R.json.gz
    2019-10-03 19:46:04,095 24852 filepath: ./list-all-packages.R.json.gz
    2019-10-03 19:46:04,304 24852 len(data): 15008
    {'date_and_published': 9565, 'published': 5443}
Some extra work is needed to parse those:
    $ python ./analysis.py --with-pattern-date-repartition --dataset ./list-all-packages.R.json.gz
    2019-10-03 19:50:31,381 25448 filepath: ./list-all-packages.R.json.gz
    2019-10-03 19:50:31,592 25448 len(data): 15008
    2019-10-03 19:50:32,669 25448 Summary for 'Date' field
    {None: 5443,
     '%Y-%d-%m': 3854,
     '%Y-%m-%d': 9456,
     '%Y-%m-%d %H:%M:%S': 14,
     '%Y/%m/%d': 16,
     '%d %B %Y': 2,
     '%d %b %Y': 1,
     '%d.%m.%Y': 2,
     '%d.%m.%y': 1,
     '%d/%m/%Y': 7,
     'invalid': 49,
     'valid': 13353}
    2019-10-03 19:50:32,669 25448 Unknown date format for 'Date' field
    ['Tue Dec 27 15:06:08 PST 2011', 'Fabruary 21, 2012', '8-14-2013',
     '2019-05-28"', '2011-01', '04-12-2014', '2017-03-01 today',
     '2016-11-0110.1093/icesjms/fsw182', '2019-07-010', '2015-02.23',
     '2018-08-24, 10:40:10', '2013-October-16', 'Aug 23, 2013',
     'Apr 12, 2013', '27-11-2014', '19-02-2013', '20013-12-30',
     '2019-09-26,', '9/25/2014', 'Fri Jun 27 17:23:53 2014',
     '2016-08-017', '2019-02-07l', '2014-07', '28-04-2014', '2014-05',
     '2018-05-010', '04-14-2014', '2019-09-27 KST',
     '2019-05-08 14:17:31 UTC', '$Date$',
     'Wed May 21 13:50:39 CEST 2014', '2018-04-10 00:01:04 KST',
     '2019-09-27 KST', '2019-06-22 $Date$', '2014-11',
     '2019-08-25 10:45',
     '$Date: 2013-01-18 12:49:03 -0600 (Fri, 18 Jan 2013) $',
     '2015-7-013', 'March 9, 2015', '2018-05-023', 'Aug. 18, 2012',
     '2014-Dec-17', 'March 01, 2013', '2017-04-08.',
     "Check NEWS file for changes: news(package='simSummary')",
     '2014-Apr-22', 'Mon Jan 12 19:54:04 2015', 'May 22, 2014',
     '2014-08-12 09:55:10 EDT']
    2019-10-03 19:50:34,320 25448 Summary for 'Published' field
    {'%Y-%d-%m': 5880, '%Y-%m-%d': 15008, 'valid': 20888}
    2019-10-03 19:50:34,320 25448 Unknown date format for 'Published' field
    []
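One possible way to handle this variety is to try the date patterns from the summary above one after another and fall back to `None` when nothing matches. This is only a sketch: the `parse_date` helper, the `DATE_FORMATS` name and the ordering of the patterns are assumptions, not what the loader does. Note that `%Y-%m-%d` and `%Y-%d-%m` overlap for many values, so the sketch arbitrarily prefers the former and uses the latter only as a last resort.

```python
from datetime import datetime
from typing import Optional

# Patterns observed in the analysis, in an assumed order of preference.
DATE_FORMATS = [
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y/%m/%d",
    "%d %B %Y",  # month names depend on the current locale
    "%d %b %Y",
    "%d.%m.%Y",
    "%d.%m.%y",
    "%d/%m/%Y",
    "%Y-%d-%m",  # ambiguous with %Y-%m-%d, so tried last
]


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Return the first successful parse of value, or None if none match."""
    if not value:
        return None
    value = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next known pattern
    return None
```

Values from the "Unknown date format" list, such as `'Fabruary 21, 2012'` or `'2019-07-010'`, yield `None` with this approach, so the caller could then fall back to the 'Published' field, which the analysis shows is always parseable.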
About 'Author' and 'Maintainer' fields:
    $ python ./analysis.py --with-author-repartition --dataset ./list-all-packages.R.json.gz
    2019-10-03 19:43:29,309 24451 filepath: ./list-all-packages.R.json.gz
    2019-10-03 19:43:29,511 24451 len(data): 15008
    {'maintainer_and_author': 15008}

    $ python ./analysis.py --with-pattern-author-repartition --dataset ./list-all-packages.R.json.gz
    2019-10-03 21:28:37,223 3731 filepath: ./list-all-packages.R.json.gz
    2019-10-03 21:28:37,432 3731 len(data): 15008
    2019-10-03 21:28:37,493 3731 Summary for 'Maintainer' field
    {'ORPHANED': 62,
     "[\\'ØA-Za-z ].*": 14964,
     '[a-zA-Z].*\\n<[a-zA-Z0-9.@].*>': 19,
     '[Ø\\\'"a-zA-Z].*<[a-zA-Z0-9.@].*>.*': 14927,
     'valid': 29972}
    2019-10-03 21:28:37,493 3731 Unknown format for 'Maintainer' field
    []
    2019-10-03 21:28:37,553 3731 Summary for 'Author' field
    {"[\\'ØA-Za-z ].*": 14979,
     '[Ø\\\'"a-zA-Z].*<[a-zA-Z0-9.@].*>.*': 3024,
     '\\n': 13,
     'invalid': 2,
     'valid': 18016}
    2019-10-03 21:28:37,553 3731 Unknown format for 'Author' field
    ['"Shingo Yamamoto (gloops, Inc.)" [aut, cre],\n'
     ' RStudio, Inc. [cph],\n'
     ' Michael Bostock [ctb, cph] (D3.js library),\n'
     ' jQuery Foundation [cph] (jQuery library and jQuery UI library),\n'
     ' jQuery contributors [ctb, cph] (jQuery library; authors listed in '
     'inst/htmlwidgets/lib/jquery/jquery-AUTHORS.txt),\n'
     ' jQuery UI contributors [ctb, cph] (jQuery UI library; authors listed '
     'in inst/htmlwidgets/lib/jquery-ui/AUTHORS.txt)',
     '"DecisionPatterns [aut, cre]"']
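Based on the 'Maintainer' patterns above (a name followed by `<email>`, sometimes separated by a newline, or the literal `ORPHANED`), a parsing helper could look roughly like the following. The `parse_maintainer` function and its return shape are hypothetical, chosen only to illustrate the idea; values without an email are kept as a raw name.

```python
import re
from typing import Dict, Optional

# Name (anything before '<'), then an email in angle brackets.
# [^<] also matches newlines, covering the "name\n<email>" variant.
_NAME_EMAIL = re.compile(r"^\s*(?P<name>[^<]*?)\s*<(?P<email>[^>]+)>")


def parse_maintainer(value: str) -> Dict[str, Optional[str]]:
    """Split a Maintainer value into name and email, when possible."""
    value = value.strip()
    if value == "ORPHANED":
        # Orphaned packages have no maintainer to report.
        return {"name": None, "email": None}
    match = _NAME_EMAIL.match(value)
    if match:
        # Drop surrounding quotes some packages put around the name.
        name = match.group("name").strip("'\" \n") or None
        return {"name": name, "email": match.group("email").strip()}
    # No email found: keep the raw value as the name.
    return {"name": value or None, "email": None}
```

The free-form 'Author' field, with role markers such as `[aut, cre]` and parenthesized comments, would need more work than this and is deliberately left out of the sketch.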
- Phabricator Migration user mentioned in commit swh/devel/snippets@47dae37f
- Phabricator Migration user mentioned in commit swh/devel/snippets@0c3f0151
- Phabricator Migration user mentioned in commit swh/devel/snippets@b6980b15
- Author Maintainer
Removing that package loader implementation from the main task: it does not block closing the main task.
- Antoine R. Dumont changed the description
- Phabricator Migration user mentioned in commit swh/infra/puppet/puppet-swh-site@b844cf92
- Phabricator Migration user mentioned in commit swh/infra/puppet/puppet-swh-site@9f78a530
- Phabricator Migration user mentioned in commit swh/infra/puppet/puppet-swh-site@5a95d508
- Phabricator Migration user mentioned in commit swh/infra/puppet/puppet-swh-site@f835e023
- Phabricator Migration user mentioned in commit swh/infra/puppet/puppet-swh-site@b8e50e92
- Antoine R. Dumont closed