Skip to content
Snippets Groups Projects

Bootstrap pypi loader

Given a pypi origin, load its release artifacts (as synthetic revision).

The first time, the visit results in a snapshot (targetting created revisions). Further visits with no change results in the same latest snapshot. Visit with new release artifacts results in a new snapshot (new snapshot share the same branches as the last one, plus the new ones)

Functional note

  • Missing release artifact information are skipped

  • Release artifacts whose PKG-INFO file is missing are skipped

Modules

  • swh.loader.pypi.client: Client interface to query pypi.org (and also somewhat manipulating the artifact local representations). It's the PyPiProject's collaborator to fetch missing information on the project.

  • swh.loader.pypi.model: PyPiProject representation of a pypi origin. It's the loader's collaborator to manipulate the origin (filter releases, etc...)

  • swh.loader.pypi.loader: The main entry point for the loader (fetch_data fetches, store_data stores... ;)

Technical Note

The client cache is there for local runs and post data analysis (also, it has been used for tests). That setting should be set to off for production.

Edge cases expected

  • Pypi project can resolve in no longer existing origin (404)

  • Hash checksum divergences stop the loading (maybe this can be improved later)

Possible improvments

  • Add some retry around the fetch artifacts routine (client)

  • Add more tests

  • simplify PyPiProject class (it was created at first to do more than it does now)

Branch is remote on the repository: loader-pypi

Test Plan

make test


Migrated from D408 (view on Phabricator)

Merge request reports

Closed by Phabricator Migration userPhabricator Migration user 6 years ago (Sep 13, 2018 11:47pm UTC)

Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Nice work, thanks!

    Most of my comments have been put inline, however I have a few general comments :

    Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

    The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

    Here's what I understand : the PyPI data model has three levels

    • Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;
    • Releases, which are point in time snapshots of a Project, tagged with a version number (and only a version number);
    • Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

    In this loader (and maybe in the lister, I haven't looked :no_mouth:), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother.

    Ideally this terminology would be documented in the client class as well so we can keep it consistent.

    Can you explain why some methods in the loader class have an (override) comment?

    I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

    I think I understand the client / model split (client is generic, model is SWH-specific), but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?). I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

    The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

    Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

  • ! In !1 (closed), @olasd wrote: Nice work, thanks!

    :)

    Most of my comments have been put inline, however I have a few general comments :

    Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

    Yes.

    The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

    Here's what I understand : the PyPI data model has three levels

    • Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;
    • Releases, which are point in time snapshots of a Project, tagged with a version number (and only a version number);
    • Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

    In this loader (and maybe in the lister, I haven't looked :no_mouth:), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother. Ideally this terminology would be documented in the client class as well so we can keep it consistent. Ok.

    Can you explain why some methods in the loader class have an (override) comment?

    It's just for mentioning it's coming from the base class. To distinguish between those and the other ones being initialized in that class.

    I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

    Yes, well, that part is not for production. It was more for the development phase (as explained in the description of this diff ;). It helped me a lot to post-analyze the json and release artifacts.

    I did not plan on sharing the cache between the different workers.

    I think I understand the client / model split (client is generic, model is SWH-specific),

    Yes, that was the goal.

    but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?).

    Yes, i don't like it either. I did not find anything better then.

    I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

    I'm ok with that.

    The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

    Right.

    Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

    I agree.

    For the testing, I focused mainly on the loading part.

    I'll keep it in mind for later.

  • Taking care of some remarks

    • loader.pypi.client: Improve docstrings
    • pypi.client: Improve error message when failing to fetch artifacts
    • pypi.client: Extract PyPI instance's base url as instance attribute
    • pypi.client: Rename prepare_release_artifact method
    • pypi.client: Clarify the term from release to artifact
    • loader.pypi: Move converter functions in converters module
    • test_loader: Fix typo
    • Move PyPiProject from model to client module
    • *: Use "PyPI" terminology consistently everywhere
    • pypi.loader: Remove (override) comment
    • *: Use less _ in prefix variable names
    • pypi.client: Fix wrong verb use (allow instead of permit)
    • pypi.client: Dissociate artifact and release object
    • test_loader: Fix pep8 violation
    • pypi.client: Clarify filtering release artifacts method
    • loader.pypi: Clarify term from release to artifact

    Mainly remains the memory optimization alignment with the debian loader

    • pypi.loader: Make loader fully incremental in loading swh objects
    • pypi.client: Improve PyPIProject docstring
    • pypi.tasks: Rename to LoadPyPI task
    • pypi.client: Compute hashes and write the tarball in one roundtrip
  • Merge request was merged

  • /me raises an eyebrow...

    I did not intend to close this...

  • I'm on branch loader-pypi

    arc diff --update !1 (closed) origin/master

    To update the diff which is the way to go.

    My mistake seems to be me wanting to push the branch, that closed the diff. What's confusing is that i already did that (update the branch) but that did not close it before...

    And now i just don't understand what is displayed...

Please register or sign in to reply
Loading