Skip to content
Snippets Groups Projects
Verified Commit 087c50f7 authored by Antoine R. Dumont's avatar Antoine R. Dumont
Browse files

README: Describes the PyPI loader

And add:
- a development section to explain how to make the loader run locally
- terminology section used
- known edge cases

Related T421
parent 17904b45
No related tags found
No related merge requests found
......@@ -2,3 +2,101 @@ swh-loader-pypi
====================
SWH PyPI loader's source code repository
# What does the loader do?
The PyPI loader visits and loads a PyPI project [1].
Each visit will result in:
- 1 snapshot (which targets n revisions ; 1 per release artifact)
- 1 revision (which targets 1 directory ; the release artifact uncompressed)
[1] https://pypi.org/help/#packages
## First visit
Given a PyPI project (origin), the loader, for the first visit:
- retrieves information for the given project (including releases)
- then for each associated release
- for each associated source distribution (type 'sdist') release
artifact (possibly many per release)
- retrieves the associated artifact archive (with checks)
- uncompresses locally the archive
- computes the hashes of the uncompressed directory
- then creates a revision (using PKG-INFO metadata file)
targetting such directory
- finally, creates a snapshot targetting all seen revisions
(uncompressed PyPI artifact and metadata).
## Next visit
The loader starts by checking if something changed since the last
visit. If nothing changed, the visit's snapshot is left
unchanged. The new visit targets the same snapshot.
If something changed, the already seen release artifacts are skipped.
Only the new ones are loaded. In the end, the loader creates a new
snapshot based on the previous one. Thus, the new snapshot targets
both the old and new PyPI release artifacts.
## Terminology
- 1 project: a PyPI project (used as swh origin). This is a collection
of releases.
- 1 release: a specific version of the (PyPi) project. It's a
collection of information and associated source release
artifacts (type 'sdist')
- 1 release artifact: a source release artifact (distributed by a PyPI
maintainer). In swh, we are specifically
interested by the 'sdist' type (source code).
## Edge cases
- If no release provides release artifacts, those are skipped
- If a release artifact holds no PKG-INFO file (root at the archive),
the release artifact is skipped.
- If a problem occurs during a fetch action (e.g. release artifact
download), the load fails and the visit is marked as 'partial'.
# Development
## Configuration file
### Location
Either:
- /etc/softwareheritage/loader/pypi.yml
- ~/.config/swh/loader/pypi.yml
- ~/.swh/loader/svn.pypi
### Configuration sample
```
storage:
cls: remote
args:
url: http://localhost:5002/
```
## Local run
PyPI loader expects as input:
- project: a pypi project name (ex: arrow)
- project_url: uri to the pypi project (html page)
- project_metadata_url: uri to the pypi metadata information (json page)
``` sh
$ python3
Python 3.6.6 (default, Jun 27 2018, 14:44:17)
[GCC 8.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging; logging.basicConfig(level=logging.DEBUG
>>> project='arrow; from swh.loader.pypi.tasks import LoadPyPI;
>>> LoadPyPI().run(project, 'https://pypi.org/pypi/%s/' % project, 'https://pypi.org/pypi/%s/json' % project)
```
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment