Another known remaining problem with the pserver method is that it does not yet expand RCS keywords in checked out files.
The rsync access method already does this, which means we may end up with different content hashes depending on the access method.
Patches to add RCS keyword expansion are under review.
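For illustration, client-side RCS keyword expansion boils down to rewriting `$Keyword$` or `$Keyword: ...$` markers in file content. A minimal sketch (the keyword value shown is invented, and this is not the loader's actual implementation):

```python
import re

def expand_keywords(text, keyword, value):
    """Replace $Keyword$ or $Keyword: ...$ with $Keyword: value $.

    Illustrative sketch only; the real loader derives the value from
    RCS file metadata (revision, date, author, state).
    """
    pattern = re.compile(r'\$' + re.escape(keyword) + r'(:[^$\n]*)?\$')
    return pattern.sub('$' + keyword + ': ' + value + ' $', text)

line = 'static const char rcsid[] = "$Id$";'
expanded = expand_keywords(line, 'Id',
                           'file.c,v 1.1 2021/01/01 00:00:00 user Exp')
```

Note that already-expanded keywords (`$Id: ... $`) are re-expanded too, which is what lets a checkout reflect the revision it came from.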
I am still testing with the GNU dino CVS repository. There is (hopefully only) one more issue which needs to be addressed before this repository will be converted correctly over both pserver and rsync: the pserver access method currently ignores CVS commit IDs. This means it might merge CVS commits which contain the same log message and were made within 5 minutes of each other, even if their commit IDs could be used to tell them apart. The rsync method already takes commit IDs into account. The GNU dino CVS repository contains some such commits near the end of its history, which causes hash divergence between the pserver and rsync access methods.
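The grouping heuristic described above could look roughly like this; the record layout and field names below are hypothetical, not the loader's actual data model:

```python
from datetime import datetime, timedelta

def same_changeset(rev_a, rev_b, fuzz=timedelta(minutes=5)):
    """Decide whether two CVS file revisions belong to one changeset.

    Revisions with the same log message committed within the fuzz
    window are merged -- unless both carry commit IDs and those IDs
    differ, in which case the IDs tell the revisions apart.
    """
    if rev_a['log'] != rev_b['log']:
        return False
    if abs(rev_a['time'] - rev_b['time']) > fuzz:
        return False
    id_a, id_b = rev_a.get('commitid'), rev_b.get('commitid')
    if id_a and id_b and id_a != id_b:
        return False
    return True

a = {'log': 'fix bug', 'time': datetime(2021, 1, 1, 12, 0), 'commitid': 'abc'}
b = {'log': 'fix bug', 'time': datetime(2021, 1, 1, 12, 3), 'commitid': 'xyz'}
```

Here `a` and `b` would be wrongly merged by a loader that ignores commit IDs, since their log messages match and they are only 3 minutes apart.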
There is another problem related to keywords: some CVS-based projects use custom keywords instead of the standard `Id` keyword. This prevents wrong expansion of `Id` when code is imported from one project to another. Usually the project's name will be used as the custom keyword name, such as `OpenBSD` or `NetBSD`, instead of `Id`. At present, to expand keywords correctly in this case, we need to use the pserver access method to benefit from server-side keyword expansion. But we will end up with different hashes if rsync is used to import the same origin again. We might be able to auto-detect use of custom keywords if the rsync server allows access to the CVSROOT folder, but this is not always the case. If CVSROOT is hidden from rsync, the only reliable way to detect custom keywords would be a parameter that gets passed into the loader. We could, for example, allow passing the name of a custom keyword as a parameter embedded in the origin URL.
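If we went the origin-URL route, extracting the keyword could be as simple as reading a query parameter. A sketch under the assumption that the parameter is called `keyword` (both the parameter name and the URL below are invented, not an agreed-upon interface):

```python
from urllib.parse import urlparse, parse_qs

def custom_keyword_from_origin(origin_url):
    """Extract an optional custom RCS keyword embedded in the origin URL.

    Returns None when no such parameter is present, in which case only
    the standard keywords would be expanded.
    """
    query = parse_qs(urlparse(origin_url).query)
    values = query.get('keyword')
    return values[0] if values else None

url = 'rsync://anoncvs.example.org/cvs/src?keyword=OpenBSD'
```

One design consequence, discussed below: URLs with and without the parameter would be distinct origins producing distinct hashes.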
The above is the only currently known remaining issue with keywords.
Note again that this would only matter for rsync where keyword expansion is done locally. With pserver access, the CVS server already expands such keywords on our behalf.
Would anyone object to passing a project-specific keyword as part of the origin URL like this? Or would this break assumptions made elsewhere in the system? For a given origin that uses a custom keyword the conversion would produce different results depending on whether the custom keyword is expanded or not. An origin URL which causes the custom keyword to be expanded would represent a slightly different origin (and result in different commit hashes) compared to an origin URL which ignores the custom keyword.
Another problem with keyword expansion found during testing:
Keywords may contain the file path. During conversion over rsync we currently write out the absolute path of the local file we have on disk. In the pserver case the expanded keyword uses the server-side path instead.
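One possible fix is to rewrite the local on-disk path into its server-side equivalent before expansion, assuming we know both the local checkout root and the server-side repository root. A sketch (all paths below are invented):

```python
import posixpath

def server_side_path(local_path, local_root, server_root):
    """Rewrite a local RCS file path into the path a CVS server
    would report, so path-bearing keywords expand identically
    over rsync and pserver."""
    rel = posixpath.relpath(local_path, local_root)
    return posixpath.join(server_root, rel)

p = server_side_path('/tmp/checkout/src/lib/foo.c,v',
                     '/tmp/checkout', '/cvs')
```

The catch, as noted below for OpenBSD, is that different mirrors may expose the repository under different server-side roots, so the rewritten path still depends on which origin URL was used.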
I have started test conversions of the OpenBSD CVS repository.
Unfortunately, it will be impossible to match the existing conversion of this repository to Git which is published on GitHub, even though that conversion was created with the cvs2gitdump script, on which our own CVS loader is based.
The problem is again related to keyword expansion.
The OpenBSD history contains Header keywords which expand to server-side paths. The published conversion uses a repository at the path /home/cvs/src, and this path ends up in various files via Header keywords. We can cope with this by using an rsync origin which exposes a copy of this CVS repository at rsync://example.com/home/cvs/src. However, CVS repositories published on official mirrors via rsync use a different path (just "/cvs/src"), so we would end up with different hashes when loading history from such an official mirror.
While the above could be worked around, there is another problem: a small difference in keyword expansion which is already present in the very first commit:
Our CVS loader expands this path like a CVS server would do, preserving the 'Attic' path component.
The upstream cvs2gitdump script however strips 'Attic' from the path, which is incompatible with the behavior implemented by CVS.
Given this choice, I would rather keep compatibility with CVS than with cvs2gitdump's behavior.
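The two behaviors can be contrasted in a small sketch. CVS keeps the `Attic` path component when expanding path-bearing keywords for files that were removed at some point, while cvs2gitdump strips it (the path below is invented):

```python
import posixpath

def header_path(rcs_path, strip_attic=False):
    """Return the path a Header keyword would expand to.

    strip_attic=False mimics CVS (the behavior we keep);
    strip_attic=True mimics cvs2gitdump.
    """
    if not strip_attic:
        return rcs_path
    head, tail = posixpath.split(rcs_path)
    if posixpath.basename(head) == 'Attic':
        head = posixpath.dirname(head)
    return posixpath.join(head, tail)

cvs_style = header_path('/cvs/src/lib/Attic/foo.c,v')
dump_style = header_path('/cvs/src/lib/Attic/foo.c,v', strip_attic=True)
```

Since the two expansions differ byte-for-byte, any file containing such a keyword gets a different content hash under each behavior, which is why the conversions cannot match.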
There is one bug in our Log keyword handling which is also exposed by converting the very first OpenBSD CVS commit.
I will send a fix for this soon, with a corresponding test added.
However, beyond this, further experiments with converting the OpenBSD CVS repository are unlikely to be very useful since we cannot trivially compare our results to a known-good reference conversion. From an operational point of view our CVS loader is already up to the task, albeit quite slow.
Unless I have overlooked something, all currently known issues have now been addressed.
Awesome.
The next steps would be the newly created subtasks:
#3789 (closed): adapt the sourceforge lister to what the loader actually expects in terms of
cvs origins (more details in the task, it'd be neat if you could have a look?)
-> fixing the lister would allow us to actually try loading more origins (the
production way, but for staging), which would eventually surface some more issues (or
not, sentry [1] would tell us)
-> that's a prerequisite to actually deploy (this pulls in a pypi upload, a debian package and some
puppet work)
Could you please also open a diff with the necessary changes required for the docker
stack (swh-environment/docker changes you had to make to actually have the loader run
properly)?
[1] https://sentry.softwareheritage.org (I can invite you to the team there if you
create an account first; that way you will be able to see issues there)