Commits · af9d7b758580c6fc54093719edc82658a24837ba · Platform / Development / swh-loader-cvs

Jan 07, 2022
- d/changelog: Bump new release · af9d7b75
  Antoine R. Dumont authored 3 years ago
  
  Related to T3788
  Verified
  
  af9d7b75
- d/control: Add missing new test dependency · b720d229
  Antoine R. Dumont authored 3 years ago
  
  Without it, the build fails [1].
  
  [1] https://jenkins.softwareheritage.org/job/debian/job/packages/job/DLDCVS/job/gbp-buildpackage/5/console
  
  Related to T3788
  Verified
  
  b720d229
- Updated debian changelog for version 0.1.0 · eb78a1d4
  Jenkins for Software Heritage authored 3 years ago
  
  eb78a1d4
- Update upstream source from tag 'debian/upstream/0.1.0' · e7136f20
  Jenkins for Software Heritage authored 3 years ago
  
  Update to upstream version '0.1.0' with Debian dir 294d0d8fdc0c107d311e3015079dbea8b43fbf7e
  e7136f20
- New upstream version 0.1.0 · 04e8f6a5
  Jenkins for Software Heritage authored 3 years ago
  
  04e8f6a5
Jan 06, 2022

validate input paths in the CVS loader · 238c9c03

Stefan Sperling authored 3 years ago

The CVS loader creates files on the local file system based on
paths which were read from a local copy of a CVS repository or
sent by a CVS server as part of its "cvs rlog" response.

Ensure that such paths will not be able to escape the temporary
directory which stores checked out versions of files.
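
A minimal sketch of such a containment check, assuming a POSIX file system. The function name `is_inside_directory` is hypothetical and not taken from the loader; the actual validation in the commit may differ.

```python
import os


def is_inside_directory(base_dir: str, untrusted_path: str) -> bool:
    """Return True if untrusted_path, joined to base_dir, stays inside base_dir."""
    # Resolve symlinks and ".." components on both sides before comparing.
    base = os.path.realpath(base_dir)
    # os.path.join discards `base` when untrusted_path is absolute, so an
    # absolute path such as "/etc/passwd" is rejected by the check below.
    candidate = os.path.realpath(os.path.join(base, untrusted_path))
    # commonpath compares whole path components, so "/tmp/x-evil" is not
    # mistaken for a child of "/tmp/x".
    return os.path.commonpath([base, candidate]) == base
```

A caller would run this check on every path read from the repository or sent by the server, and refuse to write the file when it returns False.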

238c9c03

Dec 16, 2021

Pin mypy and drop type annotations which make mypy unhappy · cbde9812
Antoine R. Dumont authored 3 years ago

This also drops spurious copyright headers from those files if present.

Related to T3812

Verified

cbde9812

swh.loader.cvs.tasks: Fix parameter uses to the ones needed · f191158d

Antoine R. Dumont authored 3 years ago

The existing code was probably derived from the svn loader and never changed. This
drops the nonexistent parameters and keeps only the ones needed.

This also adds coverage to the module.

Related to T3788

Verified

f191158d

Dec 15, 2021
- Add missing dh_install override to avoid stomping on the namespace __init__.py · 38aaa5d8
  Nicolas Dandrimont authored 3 years ago
  
  38aaa5d8
- Import upstream version 0.0.2 · e07e8859
  Nicolas Dandrimont authored 3 years ago
  
  e07e8859
- Updated debian directory for version 0.0.2 · c3cc3c12
  Nicolas Dandrimont authored 3 years ago
  
  c3cc3c12
Dec 13, 2021
- setup.py: Use proper trove qualifier for AGPLv3 · ce656fde
  Nicolas Dandrimont authored 3 years ago
  
  v0.0.2
  
  ce656fde
Dec 09, 2021

fix Log keyword expansion with trailing whitespace in prefix · a66c6b49

Stefan Sperling authored 3 years ago

Our expansion of the Log keyword was slightly wrong. We need to
trim trailing whitespace from the "prefix" line content which
precedes the Log keyword when we write out line content which
followed the Log keyword. Update the Log expansion example given
in a comment to document this (see there for details; this behaviour
of CVS is hard to explain without illustration).

Found while testing conversion of the OpenBSD CVS repository.
Add a new test which uses an RCS file from this repository to
reproduce this problem.

a66c6b49

support custom keywords during rsync:// conversion · dcb895ca

Stefan Sperling authored 3 years ago

CVS supports the definition of custom keywords. A common use case
for custom keywords is to use the project name as a keyword. This
avoids confusion when files are copied between projects using CVS,
in case files contain a keyword that is in use by both projects.
In other words, a file will retain its expanded custom keyword from
project A, allowing the initial file version to be traced back to its
origin, after the file was copied into project B's CVS repository.

This feature is in active use by OpenBSD and NetBSD, for example.
Existing conversions of their CVS repositories to Git expand
the corresponding custom keywords as well, and so should we.
Historically, X11 and FreeBSD were also using custom keywords.

During conversion via rsync:// we copy the CVSROOT directory and the
desired CVS module from the rsync server. The file CVSROOT/config
contains directives which configure the use of custom keywords.
Parse this file and expand keywords accordingly when checking out
versions of files from our local copy of the CVS repository.

For now, we only support custom keywords which correspond to the
Id keyword since this is known to be in common use by projects.
The latest releases of CVS (1.12.x) have optional support for arbitrary
keyword aliases via custom keywords. Support for this could be added
later, should there be a need to do so. In any case, the pserver access
method already supports arbitrary custom keywords because such keywords
will be expanded by the CVS server when we check out files from it.
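
A rough sketch of extracting a custom keyword name from the contents of CVSROOT/config. It assumes the keyword is declared with a `tag=NAME` directive, as in some BSD variants of CVS; the directive name, the function name `parse_custom_keywords`, and the handling of `NAME=ALIAS` values are assumptions for illustration, not the loader's confirmed behaviour.

```python
from typing import List


def parse_custom_keywords(config_text: str) -> List[str]:
    """Collect keyword names declared via 'tag=NAME' lines (assumed syntax)."""
    keywords = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == "tag":
            # Accept "tag=NAME" and "tag=NAME=ALIAS"; keep only NAME.
            keyword = value.split("=", 1)[0].strip()
            if keyword:
                keywords.append(keyword)
    return keywords
```

Each returned name would then be expanded in the same way as the Id keyword when checking out file versions from the local copy.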

While here, optimize our use of rsync a bit.
Fetch only CVSROOT and the desired CVS module over rsync, rather
than fetching the entire CVS repository directory, which may contain
unrelated CVS modules that require disk space but will not be used.

dcb895ca

Dec 08, 2021

fix the top-level directory path of imported CVS modules · 965629d6

Stefan Sperling authored 3 years ago

CVS modules were imported with a top-level directory which
matched the module name. For a CVS origin such as
rsync://cvs.savannah.gnu.org/sources/dino/dino
the top-level directory contained a single directory called "dino"
with all expected files and directories residing inside this directory.
E.g. the dino project's top-level README file would be stored at
the path "dino/README" instead of just "/README".

Import project files directly into the top-level directory, as expected.
Adjust test expectations accordingly.

965629d6

Dec 07, 2021

update test suite documentation · 9e8f931e

Stefan Sperling authored 3 years ago

Mention that cvs is a required dependency for running the tests.

Document that some protocol schemes are not fully covered by
the test suite (as suggested by vlorentz in D6678).

9e8f931e

make CVS loader create one snapshot per visit · 5298a8f9

Stefan Sperling authored 3 years ago

The CVS loader used to create one snapshot per loaded revision.
As pointed out by ardumont in D6745, this is wrong; other loaders
create only one snapshot per visit.
Fix this issue and adjust test expectations accordingly.

While here, show SWH IDs of loaded revisions and snapshots in regular
"info" log output, rather than only in "debug" log output. Previously,
only CVS-related data was shown at the "info" log level. Showing both
CVS and SWH data in log output is more informative.

5298a8f9

fix expansion of the Log keyword with rsync origins · 099959bb

Stefan Sperling authored 3 years ago

Align our expansion of Log keywords with the behaviour of a real
CVS server. With this, such keywords expand the same way over
the pserver and rsync access methods.

This is the last change required to consistently ingest CVS's own
CVS repository over both pserver and rsync. Otherwise we get commit
hash mismatches due to differently expanded Log keywords.

099959bb

Dec 04, 2021

in cvs loader tests, use f-strings to build repository URLs · f36332c7

Stefan Sperling authored 3 years ago

Summary: Suggested by ardumont in D6566

Reviewers: #reviewers, vlorentz

Reviewed By: #reviewers, vlorentz

Differential Revision: https://forge.softwareheritage.org/D6585

f36332c7

Nov 29, 2021

fix expansion of multiple RCS keywords on a line via rsync · 939dd546

Stefan Sperling authored 3 years ago

The function RcsKeywords.expand_keyword() is used to expand keywords
when fetching an origin over rsync. This function failed to process
multiple keywords on a single line, even though the existing code
already keeps looping in an attempt to expand multiple keywords.

For example, consider this line from a file in the ccvs CVS repository:

  #ident	"@(#)cvs/contrib/pcl-cvs:$Name:  $Id$"

Here, a regular CVS server expands both keywords on this line.

The Name keyword is special; it expands only if an explicit tag name was
given on the CVS command line. This keyword always expands to an empty
string for now, until perhaps one day the CVS loader learns about tags.

Our regular expression which attempts to match keywords on a line splits
the above example into two match groups:

  1: #ident	"@(#)cvs/contrib/pcl-cvs:$Name:  $
  2: Id$

The Name keyword was then expanded as expected, but the Id keyword was missed.
To fix this, attempt another match starting from the terminating character of
the previous match, such that we match the following two strings:

  1: #ident	"@(#)cvs/contrib/pcl-cvs:$Name:  $
  2: $Id$

Now our CVS loader expands both keywords like the CVS server does.
Add new test data to confirm that this works as intended.
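
The idea of resuming the search at the terminating "$" of the previous match can be sketched as follows. This is an illustration only, not the code of RcsKeywords.expand_keyword(): the regular expression, the `values` mapping and the function name `expand_keywords` are assumptions.

```python
import re
from typing import Dict

# "$Keyword$" or "$Keyword: old value $" (simplified pattern for this sketch).
KEYWORD_RE = re.compile(r"\$(\w+)(?::[^$\n]*)?\$")


def expand_keywords(line: str, values: Dict[str, str]) -> str:
    """Expand every known keyword on a line, including adjacent ones."""
    out = []
    pos = 0
    while True:
        m = KEYWORD_RE.search(line, pos)
        if m is None:
            out.append(line[pos:])
            break
        out.append(line[pos:m.start()])
        name = m.group(1)
        if name in values:
            # Emit the expansion without its closing "$".
            out.append(f"${name}: {values[name]} ")
        else:
            # Unknown keyword: copy it unchanged, again without the closing "$".
            out.append(m.group(0)[:-1])
        # Resume at the closing "$" so it can also open the next keyword;
        # it is written out exactly once, by the next match or by the tail.
        pos = m.end() - 1
    return "".join(out)
```

With `values = {"Name": "", "Id": "x,v 1.1"}`, a line ending in `$Name:  $Id$"` comes out as `$Name:  $Id: x,v 1.1 $"`, so both keywords are processed.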

939dd546

Nov 26, 2021
- add a test for conversion of a file which contains a Header keyword · bc00d6b1
  Stefan Sperling authored 3 years ago
  
  bc00d6b1
Nov 23, 2021

attempt to avoid content differences due to paths in keywords · 5539ccb6

Stefan Sperling authored 3 years ago

Some RCS keywords, such as "Header", contain absolute file paths
derived from the on-disk filesystem path of the CVS repository.

When we fetch files over the pserver protocol such keywords are
expanded by the CVS server. But when using the rsync protocol we
will first copy the CVS repository to local disk and the path to
this local copy will correspond to some temporary directory.

Try to avoid file content differences between pserver and rsync
access methods by deriving a likely server-side path from path
information found in the rsync:// origin URL.
This will work as expected as long as the CVS server-side setup
exposes the same path to the CVS repository over both access
methods, which is the case for GNU savannah for example.

In general, we should recommend treating pserver and rsync as distinct
origins and not relying on them to be interchangeable and always produce
the same conversion result. But we can still try our best to avoid
needless differences in content hashes.
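
One way to picture the derivation, as a sketch under the assumption that the server exposes the repository at the same path over both access methods. The function name `guess_server_rcs_path` and the file name in the usage note are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def guess_server_rcs_path(origin_url: str, relative_rcs_file: str) -> str:
    """Guess the server-side absolute path of an RCS file from the origin URL."""
    # For "rsync://cvs.savannah.gnu.org/sources/dino/dino" the URL path is
    # "/sources/dino/dino"; use it in place of the local temporary directory.
    url_path = urlparse(origin_url).path
    return posixpath.join(url_path, relative_rcs_file)
```

For example, `guess_server_rcs_path("rsync://cvs.savannah.gnu.org/sources/dino/dino", "README,v")` yields `/sources/dino/dino/README,v`, which could then be used when expanding a keyword such as Header.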

5539ccb6

cvs.tasks: Fix type · 1f6580c4

Antoine R. Dumont authored 3 years ago

This fixes build [1]

[1] https://jenkins.softwareheritage.org/view/swh-draft/job/DLDSVN/job/tests/1304/console

Verified

1f6580c4

Nov 11, 2021

preserve empty lines in CVS log messages over pserver · 34f46486

Stefan Sperling authored 3 years ago

Empty lines sent by the CVS server in rlog output were being stripped
by our custom cvs client implementation. Unfortunately, this resulted
in empty lines being stripped from CVS log messages, which is fixed
with this commit. The rsync access method already preserved log
messages properly, and now the pserver access method does the same.

34f46486

Nov 09, 2021

add CVS commit ID support to rlog.py · f5b974a0

Stefan Sperling authored 3 years ago

Newer CVS clients tag commits with a commit ID which allows us to
correctly convert commits which changed several RCS files at once.
The rsync access method based on cvs2gitdump was already taking
advantage of this. To ensure that conversions over the pserver
protocol yield the same result as conversions over rsync we need
to add commit ID support to rlog.py.

Add two new test cases which convert the same repository over
rsync and pserver respectively, and ensure that they yield the
same result. Without commit ID support conversion over pserver
produces a different result for this particular test repository.

With feedback about coding style from vlorentz.
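
To illustrate why the commit ID helps, here is a small sketch that groups per-file revisions into changesets by commit ID. The `FileRevision` type and `group_by_commit_id` function are hypothetical; revisions without a commit ID are simply kept as single-file changesets here, whereas a real converter would typically fall back to heuristics such as author, log message and timestamp.

```python
from typing import Dict, List, NamedTuple, Optional


class FileRevision(NamedTuple):
    path: str
    revision: str
    author: str
    commitid: Optional[str]


def group_by_commit_id(revisions: List[FileRevision]) -> List[List[FileRevision]]:
    """Group file revisions that share a commit ID into one changeset each."""
    groups: Dict[str, List[FileRevision]] = {}  # dicts keep insertion order
    for rev in revisions:
        if rev.commitid is not None:
            key = "id:" + rev.commitid
        else:
            # No commit ID: treat the revision as its own changeset.
            key = f"file:{rev.path}@{rev.revision}"
        groups.setdefault(key, []).append(rev)
    return list(groups.values())
```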

f5b974a0

handle Attic-only RCS files over CVS pserver · d28a4b21

Stefan Sperling authored 3 years ago

CVS repositories may contain RCS history in file,v as well as
a corresponding Attic/file,v where each file contains separate
events that occurred in history. The Attic version of the file
results from file deletion events.

The rsync access method already uses history found in the Attic.
However, a CVS server will only return RCS files from the Attic
if we request them explicitly. If we do not request them then our
converted history may end up missing deletion events for some files.
Unfortunately, we cannot tell which RCS files have a corresponding
file in the Attic, so we need to search all Attic directories by
running the equivalent of 'cvs rlog' in each directory. This slows
down pserver access considerably (and it was already quite slow
compared to rsync). But we need to pay this price in order to
obtain a valid conversion result.

This patch contains related fixes to cvsroot path handling, which
was broken for the pserver case. Without these fixes we cannot
create the correct paths for Attic directories to search.

Problem found while comparing conversion results of rsync and
pserver access methods for the GNU dino CVS repository at
cvs.savannah.gnu.org/sources/dino
Add two new test cases based on RCS files from this repository.

Without this fix in place history would diverge at this commit:
  8891a63 | larsl | Removed the MIDIEvent class | 04 May 2006, 01:11 UTC
Because the files midievent.cpp and midievent.hpp would not get deleted
when converting this commit via the pserver protocol.

d28a4b21

improve test coverage of file additions and deletions · d72f15f2

Stefan Sperling authored 3 years ago

Make an existing test case run over pserver as well.
This access method uses a different way of detecting file
additions and deletions and should be tested separately.

Add new tests to cover the re-addition of a file after it
was deleted.

d72f15f2

display file state in progress logging output · ca23bc13
Stefan Sperling authored 3 years ago

ca23bc13

add support for RCS keyword expansion over pserver protocol · f52f0e45

Stefan Sperling authored 3 years ago

We can simply ask the CVS server to expand keywords for us, instead
of forcing binary file mode with the -kb option. The CVS repository
contains per-file keyword expansion defaults the server will use.
Files checked out by cvsclient.py should now match what a regular
CVS client would check out by default.

Add test cases which verify that we create the same snapshot ID
for a repository which uses the Id keyword in a file, regardless
of whether this repository is accessed via rsync or pserver.

f52f0e45

Nov 05, 2021
- Remove debug code · 7e2d8d89
  vlorentz authored 3 years ago
  
  7e2d8d89
Nov 03, 2021
- Add type annotations · bab0a5c6
  vlorentz authored 3 years ago
  
  bab0a5c6
- Fix DeprecationWarnings caused by invalid escape sequences. · 7cffdf70
  vlorentz authored 3 years ago
  
  7cffdf70
Oct 27, 2021

test checkout of file lacking trailing \n over pserver protocol · beb7fc8a

Stefan Sperling authored 3 years ago

This test reproduces the bug fixed in commit d3b3344b, where our custom
cvs client would fail to check out a file which lacks a trailing newline
from a remote CVS server.

The error triggered by the test without the fix in place is:

CVSProtocolError: Overlong response from CVS server:
b'delta with no trailing eolok\n'

beb7fc8a

rlog: fix loading of CVS commits which have a commit ID · 509ac801

Stefan Sperling authored 3 years ago

The CVS commit ID is an optional attribute which is only generated
by relatively recent releases of CVS clients. Our rlog parser was
skipping such commits because it failed to match on them due to an
error in a regular expression.
This resulted in an incomplete import of CVS revision history.

Here is a sample line from cvs rlog output which carries a
commit ID and was not matched because the regex lacked the
trailing semicolon:
date: 2007-07-17 15:02:50 +0200;  author: larsl;  state: Exp;  lines: +619 -285;  commitid: oju0x8tTc9aUB7qs;

Found while testing ingestion of the GNU dino repository from
cvs.savannah.gnu.org/sources/dino
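
A sketch of a pattern that accepts the sample line above, with the commit ID and its trailing semicolon treated as optional. The expression and the name `parse_revision_info` are written for illustration and are not copied from rlog.py.

```python
import re
from typing import Dict, Optional

REVISION_INFO_RE = re.compile(
    r"^date:\s+(?P<date>[^;]+);"
    r"\s+author:\s+(?P<author>[^;]+);"
    r"\s+state:\s+(?P<state>[^;]+);"
    r"(?:\s+lines:\s+(?P<lines>[^;]+);)?"
    # The commit ID is optional and, when present, ends with a semicolon.
    r"(?:\s+commitid:\s+(?P<commitid>[^;]+);)?"
    r"\s*$"
)


def parse_revision_info(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the fields of an rlog 'date: ...' line, or None if it does not match."""
    m = REVISION_INFO_RE.match(line)
    return m.groupdict() if m else None
```

Fields that are absent from the line, such as the commit ID in output from older CVS clients, are returned as None.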

509ac801

rlog: fix parsing of multiple file revisions · 0829dc33

Stefan Sperling authored 3 years ago

The rlog parser was only fetching a single file revision because
some lines of code had the wrong indentation. These lines were
supposed to be part of a loop body but were only executed once.

Also rename a function which had a misleading name and docstring.
This function does in fact process the entire RCS revision history
of a given file, as opposed to just one entry of RCS revision history.

Found while testing ingestion of the GNU dino repository from
cvs.savannah.gnu.org/sources/dino

0829dc33

apply style tweaks suggested by vlorentz and reformatted by black · 6ff0b447
Stefan Sperling authored 3 years ago

6ff0b447

cvsclient: handle additional responses sent by server · 3a2f06b3

Stefan Sperling authored 3 years ago

While checking out files the server sends messages to the CVS
client which provide information about the state of file paths.

Our custom CVS client implementation needs to recognize a few
additional responses the server may send while checking out a
different version of a file which was already checked out earlier.
Otherwise our client will error out. We can simply ignore these
messages (and their two path arguments separated by \n) because
we do not manage an actual CVS working copy.

Found while testing ingestion of the GNU dino repository at
cvs.savannah.gnu.org/sources/dino

3a2f06b3

cvsclient: handle files which lack a trailing newline · d3b3344b

Stefan Sperling authored 3 years ago

CVS uses \n as a protocol message separator, which forces us
to read protocol messages line by line. File content sent by
the server has a known length and is transmitted as bytes.
The server appends a final "ok\n" message (or perhaps an error
message) when it is done sending file contents.

Properly handle the case where this final message gets buffered
along with file contents and is not delimited from file contents
by \n because the file lacks a trailing newline. Previously, the
final protocol message ended up being written out to file contents
in this case.

Found while testing ingestion of the GNU dino CVS repository from
cvs.savannah.gnu.org/sources/dino.
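
The framing problem can be sketched like this, assuming the server sends a line with the content length followed by exactly that many bytes and then a status line. The function names are hypothetical and the stream is a plain in-memory buffer rather than the loader's socket handling.

```python
import io
from typing import BinaryIO, Tuple


def read_exact(stream: BinaryIO, length: int) -> bytes:
    """Read exactly `length` bytes, looping because read() may return fewer."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("stream ended before the announced length was read")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_file_response(stream: BinaryIO) -> Tuple[bytes, bytes]:
    """Return (file content, final status line) from a length-prefixed response."""
    length = int(stream.readline().strip())
    # Split by length, not by newline: the content may lack a trailing "\n",
    # in which case the status message follows it on the same "line".
    content = read_exact(stream, length)
    status = stream.readline()
    return content, status


# Demo: the content has no trailing newline, so "ok\n" directly follows it.
payload = b"delta with no trailing eol"
demo = io.BytesIO(b"%d\n" % len(payload) + payload + b"ok\n")
print(read_file_response(demo))
```

Reading line by line here would instead produce the single line `delta with no trailing eolok\n`, which mixes file content with the protocol message.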

d3b3344b

Oct 04, 2021
- apply re-formatting suggested by the black code formatter · 7f761b85
  Stefan Sperling authored 3 years ago
  
  7f761b85
Oct 01, 2021
- make the black code formatter skip the pserver scramble shift table · ea469457
  Stefan Sperling authored 3 years ago
  
  This table becomes unreadable after the black formatter inserts a newline between all the table entries.
  ea469457