To build a RubyGems lister, we need the following:
A list of all the packages.
The source code URL and metadata for each package.
**Getting the list of all the packages**
There is no public API endpoint that lists all the packages, although there is a built-in API that can be used to list the packages and all the versions available for a particular package.
On further investigation, I found that rubygems.org provides data dumps: https://rubygems.org/pages/data
This could be used to get the list of all the packages.
I looked into the data dumps provided on rubygems.org.
RubyGems provides a bash script (link to the script) that downloads the most recent weekly dump listed on https://rubygems.org/pages/data and loads it into a PostgreSQL database.
Here is the list of tables present in the database:

    List of relations
     Schema |     Name      | Type  |  Owner   |  Size  | Description
    --------+---------------+-------+----------+--------+-------------
     public | dependencies  | table | postgres | 454 MB |
     public | gem_downloads | table | postgres | 62 MB  |
     public | linksets      | table | postgres | 17 MB  |
     public | rubygems      | table | postgres | 10 MB  |
     public | versions      | table | postgres | 436 MB |
    (5 rows)

The gem_downloads and dependencies tables are of no use for building the lister.
The rubygems table contains the name of each package, its corresponding id, and the date of its last update.
I investigated the data dumps a bit, and it seems they can serve the purpose well.
To get a package release, we mainly need the name and version of the package. The link can therefore be generated as follows:
Syntax of the link to the gem package source release:
http://rubygems.org/gems/<name>-<version>.gem
Example:
http://rubygems.org/gems/rails-3.2.1.gem
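To make the link syntax concrete, here is a minimal Python sketch. The function name gem_release_url and the constant GEM_BASE_URL are made up for illustration; only the URL pattern itself comes from the comment above.

```python
from urllib.parse import quote

# Base URL taken from the link syntax described above.
GEM_BASE_URL = "http://rubygems.org/gems"


def gem_release_url(name: str, version: str) -> str:
    """Build the download URL of a gem release from its name and version."""
    # quote() guards against characters that are not URL-safe; typical gem
    # names and version strings pass through unchanged.
    return f"{GEM_BASE_URL}/{quote(name)}-{quote(version)}.gem"


print(gem_release_url("rails", "3.2.1"))
# prints http://rubygems.org/gems/rails-3.2.1.gem
```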
The blueprint for building the lister:
To download and load the latest data dump, the script mentioned in the above comment can be used.
We can get the name and the time of the last update of every package from the rubygems table.
Then we can get all the versions associated with a package, with their respective metadata, from the versions table.
From this information, we can generate the link to the gem package source release as described above, and then create the loading task.
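The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the PostgreSQL database loaded from the dump, the column names (id, name, updated_at, rubygem_id, number) are assumed and simplified, and the sample rows are invented. The helper list_packages is hypothetical, not an existing lister function.

```python
import sqlite3

# Assumed, simplified schema: only the columns this sketch needs.
SCHEMA = """
CREATE TABLE rubygems (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
CREATE TABLE versions (id INTEGER PRIMARY KEY, rubygem_id INTEGER, number TEXT);
"""

# Join every package with all of its versions.
QUERY = """
SELECT r.name, r.updated_at, v.number
FROM rubygems AS r
JOIN versions AS v ON v.rubygem_id = r.id
ORDER BY r.name, v.id
"""


def list_packages(conn):
    """Group release URLs by package name, keeping the last-update time."""
    packages = {}
    for name, updated_at, number in conn.execute(QUERY):
        entry = packages.setdefault(
            name, {"last_update": updated_at, "releases": []}
        )
        entry["releases"].append(f"http://rubygems.org/gems/{name}-{number}.gem")
    return packages


# Stand-in database with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO rubygems VALUES (1, 'rails', '2020-01-01T00:00:00')")
conn.executemany(
    "INSERT INTO versions (rubygem_id, number) VALUES (?, ?)",
    [(1, "3.2.0"), (1, "3.2.1")],
)

for name, info in list_packages(conn).items():
    print(name, info["last_update"], info["releases"])
```

Each entry of the resulting dictionary holds what a loading task would need: the package name, its last update time, and one URL per release.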
Some statistics:
Number of packages listed by gem list -r --all: 151373
Number of packages present in the data dump: 162339
Some extra info
The package release that will be downloaded has a structure similar to the following:
[package_name]:
The main root directory of the Gem package.
/bin:
Location of the executable binaries if the package has any.
/lib:
Directory containing the main Ruby application code (inc. modules).
/test:
Location of test files.
Rakefile:
The Rake-file for libraries which use Rake for builds.
[packagename].gemspec:
The *.gemspec file, which is named after the main directory, contains all package metadata, e.g. name, version, directories, etc.
We mainly archive source code, but here I feel we should also ingest all the files in a package release (except the /bin folder), because the source code of a package is useless without files such as the *.gemspec file and the Rakefile. The package cannot be regenerated from the source code without these files, so I feel they are also essential to ingest.
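As a rough illustration of that proposal, the filter below keeps every file of an unpacked release except those under the top-level bin directory. The function name should_ingest and the file listing are hypothetical; this is a sketch of the idea, not how an actual loader decides what to ingest.

```python
from pathlib import PurePosixPath


def should_ingest(relative_path: str) -> bool:
    """Return True unless the path sits under the top-level bin/ directory."""
    parts = PurePosixPath(relative_path).parts
    return not (len(parts) > 1 and parts[0] == "bin")


# Hypothetical file listing of an unpacked release, relative to its root.
files = [
    "bin/mygem",
    "lib/mygem.rb",
    "test/test_mygem.rb",
    "Rakefile",
    "mygem.gemspec",
]
kept = [path for path in files if should_ingest(path)]
print(kept)
# prints ['lib/mygem.rb', 'test/test_mygem.rb', 'Rakefile', 'mygem.gemspec']
```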
Regarding the source code links in the linksets table:
Only 17306 (~10.7%) of the 162339 packages have links, and even those come without VCS information. So I guess they are not useful for the lister.