How to load Rubygems metadata?
The following discussion from !482 should be addressed:
@KShivendu started a discussion:
> How does the Rubygems repo get the dependency info? do they actually run the Ruby code?
Based on their source code, they extract dependencies from metadata.gz of the .gem file:

    --- !ruby/object:Gem::Specification
    name: rails
    version: !ruby/object:Gem::Version
      version: 7.0.4.3
    platform: ruby
    authors:
    - David Heinemeier Hansson
    autorequire:
    bindir: bin
    cert_chain: []
    date: 2023-03-13 00:00:00.000000000 Z
    dependencies:
    - !ruby/object:Gem::Dependency
      name: activesupport
      requirement: !ruby/object:Gem::Requirement
        requirements:
        - - '='
          - !ruby/object:Gem::Version
            version: 7.0.4.3
      type: :runtime
      prerelease: false
      version_requirements: !ruby/object:Gem::Requirement
        requirements:
        - - '='
          - !ruby/object:Gem::Version
            version: 7.0.4.3
    - !ruby/object:Gem::Dependency
      name: actionpack
      requirement: !ruby/object:Gem::Requirement
        requirements:
        - - '='
          - !ruby/object:Gem::Version
            version: 7.0.4.3
      type: :runtime
      prerelease: false
      version_requirements: !ruby/object:Gem::Requirement
        requirements:
        - - '='
          - !ruby/object:Gem::Version
            version: 7.0.4.3
    ...
This is a YAML-like file that we can also parse. Our rubygems loader also downloads this file. However, we just discarded this metadata file in this PR. So should I change that now?
Looking at this, I think we should be careful about discarding any files in the loader from now on (except checksum files).
Discussed on #swh-devel:
15:15:19 <+anlambert> KShivendu: you could also get the extrinsic metadata from the URL provided by the lister as it also provides the gem deps, check cpan loader for an example
15:24:34 < KShivendu> olasd: yes. got it.
15:25:43 < KShivendu> anlambert: good point. but will swh-indexer scan that for dependency parsing?
15:27:00 < KShivendu> afaik, at least for now, it doesn't do that.
15:28:18 <+vlorentz> anlambert: no need to do that, they are in the archive
15:29:02 <+vlorentz> KShivendu: it has the architecture to support it
15:29:33 < KShivendu> I see.
15:30:12 <+anlambert> vlorentz: what do you mean by they are in the archive ?
15:32:47 <+vlorentz> in the .gem file *
15:32:54 <+vlorentz> !482 (comment 135540)
15:35:31 <+anlambert> ack so we need to copy the metadata.gz file in the folder extracted from data.tar.gz file and the indexer will process it, am I right ?
15:37:44 <+anlambert> or we could create a branch with the raw gem file content (data.tar.gz, metadata.gz, checksums.yaml.gz) for each release
15:39:48 < KShivendu> > am I right ?
15:39:48 < KShivendu> yes. provided we uncompressed metadata.gz.
15:40:34 <+vlorentz> definitely not a branch, that would be a mess and specific to rubygems
15:41:27 <+anlambert> yes but usually we try to keep original tarball content so not sure what is best here
15:41:28 <+vlorentz> the options I see are: 1. extract the metadata file in the same folder 2. in a parent folder (meaning the source is no longer at the root) 3. as extrinsic metadata
15:49:22 < KShivendu> vlorentz: you gave me an interesting idea. We might be able to directly reuse ecosyste.ms packages repo Ecosystem module to get extrinsic metadata for different package managers. They have already implemented most of them. map_package_metadata function seems to return values in a consistent JSON format.... (full message at https://libera.ems.host/_matrix/media/v3/download/libera.chat/6dce60159faa5ea3f97004224be32216c2b4b234)
15:52:03 <+vlorentz> KShivendu: this is what we meant by "Refactor metadata indexers to re-use packages provided/used by ecosyste.ms"
15:52:11 <+vlorentz> so yes, definitely
15:53:33 < KShivendu> cool.
15:53:45 < KShivendu> < vlorentz> "the options I see are: 1..." <- just need to decide on this.
Discussion of the three options I listed:
- extract the metadata file in the same folder -> potential for name clashes + we don't faithfully preserve the original layout
- in a parent folder (meaning the source is no longer at the root) -> we mess with the original layout in a different way (not adding a dir entry, but instead we add a level)
- as extrinsic metadata -> a little harder to discover, and may interfere with Disarchive to rebuild tarballs (though Rubygems, RPMs, and Debian packages will probably need special handling to rebuild them no matter what)
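To make the trade-offs above concrete, here is a rough sketch of the directory tree each option would produce for a release. It only computes paths and does not touch the archive. The function name `plan_release_tree`, the file name `metadata.yml` and the directory name `data` are assumptions made for illustration, not names anyone agreed on.

```python
from typing import Iterable, List


def plan_release_tree(
    source_paths: Iterable[str],
    option: int,
    metadata_name: str = "metadata.yml",  # hypothetical file name
    source_dir: str = "data",  # hypothetical directory name
) -> List[str]:
    """Return the paths that would end up in the release directory."""
    paths = list(source_paths)
    if option == 1:
        # Same folder: the metadata file sits next to the sources, so it
        # clashes with any top-level entry of the same name.
        if any(p.split("/", 1)[0] == metadata_name for p in paths):
            raise ValueError(f"name clash on {metadata_name!r}")
        return sorted(paths + [metadata_name])
    if option == 2:
        # Parent folder: no clash, but every source path gains one level.
        return sorted([f"{source_dir}/{p}" for p in paths] + [metadata_name])
    if option == 3:
        # Extrinsic metadata: the tree is untouched and the metadata is
        # stored outside of it.
        return sorted(paths)
    raise ValueError(f"unknown option: {option}")


if __name__ == "__main__":
    sources = ["README.md", "lib/rails.rb"]
    for opt in (1, 2, 3):
        print(opt, plan_release_tree(sources, opt))
```

Only option 3 leaves the paths from data.tar.gz unchanged; option 1 can fail outright when a gem already ships a top-level entry with the chosen name.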