Bootstrap pypi loader

Nice work, thanks!

Most of my comments have been put inline, however I have a few general comments :

Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

Here's what I understand : the PyPI data model has three levels

Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;
Releases, which are point in time snapshots of a Project, tagged with a version number (and only a version number);
Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

In this loader (and maybe in the lister, I haven't looked ), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother.

Ideally this terminology would be documented in the client class as well so we can keep it consistent.

Can you explain why some methods in the loader class have an (override) comment?

I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

I think I understand the client / model split (client is generic, model is SWH-specific), but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?). I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

! In !1 (closed), @olasd wrote: Nice work, thanks!

:)

Most of my comments have been put inline, however I have a few general comments :

Please consistently capitalize PyPI as PyPI: it's the Python Package Index.

Yes.

The use of terminology around PyPI releases in the model submodule is confusing to me. That's not helped by the PyPI API, which returns data of exactly the same shape whether you ask for full project information, or for information for a single release.

Here's what I understand : the PyPI data model has three levels

Projects, which have a set of metadata that are apparently pulled from the latest release, plus some index-specific metadata such as who's registered as an allowed uploader on PyPI;

Releases, which are point in time snapshots of a Project, tagged with a version number (and only a version number);

Files, which are the actual artifacts that are uploaded by the maintainer and materialize the release. Those have a name, a type (sdist, bdist_wheel, etc.), a date of upload, a size and a set of digests.

In this loader (and maybe in the lister, I haven't looked ), the "releases" term alternatively talks about a set of releases, or about a list of files associated to a release. I think we need to make the distinction crystal clear, which I think will help make the code smoother. Ideally this terminology would be documented in the client class as well so we can keep it consistent. Ok.

Can you explain why some methods in the loader class have an (override) comment?

It's just for mentioning it's coming from the base class. To distinguish between those and the other ones being initialized in that class.

I'm a bit worried about the PyPIClient cache mechanism only using filenames; I'd be more comfortable with a digest-based cache, which shouldn't be an issue as all the files we're downloading also have a digest provided by upstream.

Yes, well, that part is not for production. It was more for the development phase (as explained in the description of this diff ;). It helped me a lot to post-analyze the json and release artifacts.

I did not plan on sharing the cache between the different workers.

I think I understand the client / model split (client is generic, model is SWH-specific),

Yes, that was the goal.

but I'm not 100% convinced by the model submodule name (I'm guessing it's an analogy with swh.model?).

Yes, i don't like it either. I did not find anything better then.

I'd be tempted to move the PyPIProject class to client.py as they're fairly intertwined (I still think it's fair to keep the classes split for test mocking purposes).

I'm ok with that.

The "real world" -> "swh data model" conversion functionality could then be moved to a converters submodule analogous to other loaders.

Right.

Some tests for the client module would be nice but I understand that's not exactly easy to do, we can add that at a later point.

I agree.

For the testing, I focused mainly on the loading part.

I'll keep it in mind for later.

Taking care of some remarks

loader.pypi.client: Improve docstrings
pypi.client: Improve error message when failing to fetch artifacts
pypi.client: Extract PyPI instance's base url as instance attribute
pypi.client: Rename prepare_release_artifact method
pypi.client: Clarify the term from release to artifact
loader.pypi: Move converter functions in converters module
test_loader: Fix typo
Move PyPiProject from model to client module
*: Use "PyPI" terminology consistently everywhere
pypi.loader: Remove (override) comment
*: Use less _ in prefix variable names
pypi.client: Fix wrong verb use (allow instead of permit)
pypi.client: Dissociate artifact and release object
test_loader: Fix pep8 violation
pypi.client: Clarify filtering release artifacts method
loader.pypi: Clarify term from release to artifact

Mainly remains the memory optimization alignment with the debian loader

pypi.loader: Make loader fully incremental in loading swh objects
pypi.client: Improve PyPIProject docstring

Some references in the commit message have been migrated:

T421 is now swh-loader-core#421 (closed)

pypi.tasks: Rename to LoadPyPI task
pypi.client: Compute hashes and write the tarball in one roundtrip

Merge request was merged

closed

/me raises an eyebrow...

I did not intend to close this...

I'm on branch loader-pypi

arc diff --update !1 (closed) origin/master

To update the diff which is the way to go.

My mistake seems to be me wanting to push the branch, that closed the diff. What's confusing is that i already did that (update the branch) but that did not close it before...

And now i just don't understand what is displayed...

Bootstrap pypi loader

Functional note

Modules

Technical Note

Edge cases expected

Possible improvments

Test Plan

Closed by Phabricator Migration user 6 years ago (Sep 13, 2018 11:47pm UTC) 6 years ago

Activity

Bootstrap pypi loader

Functional note

Modules

Technical Note

Edge cases expected

Possible improvments

Test Plan

Merge request reports

Closed by Phabricator Migration user 6 years ago (Sep 13, 2018 11:47pm UTC) 6 years ago

Activity