Check integrity of directories, revisions, and releases

mentioned in issue swh-storage#3878 (closed)

mentioned in merge request swh-loader-git!72 (closed)

mentioned in merge request swh-loader-git!84 (closed)

added Archive content Post FOSDEM'17 consolidation priority:High labels

changed title from Check integrity of Revisions and Tags to Check integrity of Revisions and Releases

changed the description

I investigated this in light of T272. Bottom line: not good. Git is very proficient in the corner-case department.

On the bright side, setting aside the issue underlying T272, we get the right result most of the time.

Here are a few of the corner cases that I stumbled upon:

Git can store a positive or a negative zero offset when the timezone is UTC (that is, both +0000 and -0000 are valid offsets).
The author/committer line can be missing the timestamp and timezone. In that case, pygit2 defaults to (timestamp, offset) = (0, +0000).
This would all be fine, except that sometimes there is an explicit 0 +0000 at the end of the author line, so the two cases cannot be told apart once parsed.
Empty message storage is inconsistent. The internal git object format is: a newline-terminated header, followed by an empty line, followed by the message. When the message is empty, there are two newlines at the end of the object. Except that sometimes, the empty line is omitted.
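To make it concrete why these corner cases matter for integrity checks, here is a minimal sketch of how a git object ID is derived from the raw object bytes. The tag contents (tagger name, target ID, message) are made up for illustration; only the "type, length, NUL, body" hashing scheme is git's.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return sha1(b"<type> <length>\\0" + body), the ID git gives an object."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical tag body; %s is the end of the tagger line.
TAG_TEMPLATE = (
    b"object 0000000000000000000000000000000000000000\n"
    b"type commit\n"
    b"tag v1.0\n"
    b"tagger Jane Doe <jane@example.org>%s\n"
    b"\n"
    b"Release 1.0\n"
)

# The three variants a parser may collapse into (0, +0000).
TAGGER_SUFFIXES = {
    "explicit +0000": b" 0 +0000",
    "explicit -0000": b" 0 -0000",
    "no timestamp": b"",
}

ids = {
    name: git_object_id("tag", TAG_TEMPLATE % suffix)
    for name, suffix in TAGGER_SUFFIXES.items()
}

for name, object_id in ids.items():
    print(f"{name:>15}: {object_id}")
```

All three variants hash to different IDs, so any representation that merges them cannot be used to recompute the original ID.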

assigned to @olasd

added state:wip label

Some example releases:

Negative UTC offset: 97c8d2573a001f88e72d75f596cf86b12b82fd01 @ https://github.com/dankamongmen/sprezzos-installer
No timestamp in author line: c00ee9a43734e2a87b71ce5f6695ea7ede24bb98 @ https://github.com/zenczykowski/klibc
Explicit epoch timestamp in author line: 953d97ed27ad0ee5c288ed52f9589f0407b1eb05 @ https://github.com/antoniofabio/iDMC
Empty message, with empty line: dfcfdbdf877f252a44a8fd35930a0cf643c2d815 @ https://github.com/CroquetMagic/DownloadRankings
Empty message, without empty line: 953d97ed27ad0ee5c288ed52f9589f0407b1eb05 @ https://github.com/antoniofabio/iDMC (combo)

I just noticed that empty messages with empty lines are stored as an empty bytea, whereas empty messages without the empty line are stored as NULL. So there's that.
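A sketch of how the NULL versus empty distinction could be kept lossless when splitting a raw object, assuming the message is modelled as None when the separating empty line is absent and as empty bytes when it is present. The function names are hypothetical, not existing swh code.

```python
from typing import Optional, Tuple


def split_object(raw: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Split a raw commit/tag body into (header block, message).

    The message is None when there is no empty line after the headers,
    and b"" when the empty line is there but nothing follows it.
    """
    head, separator, message = raw.partition(b"\n\n")
    if separator:
        # Give the header block its terminating newline back.
        return head + b"\n", message
    return raw, None


def join_object(headers: bytes, message: Optional[bytes]) -> bytes:
    """Inverse of split_object: rebuild the exact original bytes."""
    if message is None:
        return headers
    return headers + b"\n" + message


headers = b"object 0000000000000000000000000000000000000000\ntype commit\ntag v1\n"
with_empty_line = headers + b"\n"
without_empty_line = headers

print(split_object(with_empty_line)[1])     # b''
print(split_object(without_empty_line)[1])  # None
```

With this convention, both forms round-trip byte for byte, which is what the checksum needs.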

Dulwich seems to handle some of those special cases just fine.

Negative UTC offset: t._tag_timezone_neg_utc
No timestamp in author line: t.tag_time is None
Explicit epoch timestamp: (t.tag_timestamp, t.tag_time) = (0, 0)

It currently breaks on completely empty messages, but the patch seems fairly simple.

Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initially expected.

A gpgsig attribute records information for signed commits;
a mergetag attribute records information about gpg signatures on tags merged in the commit.

Obviously nothing prevents git upstream from adding other such attributes in the future either.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.
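As a rough illustration of what collecting those attributes could look like, here is a sketch that parses a commit header block and separates the headers our model already covers from the rest. The set of "known" headers and the list-of-pairs shape of the extra headers are assumptions made for this example, not a decided schema. Multi-line values such as gpgsig are folded by git with a leading space on each continuation line.

```python
from typing import List, Tuple

# Headers the generic model already has a place for (assumption for this sketch).
KNOWN_HEADERS = {b"tree", b"parent", b"author", b"committer"}

Header = Tuple[bytes, bytes]


def parse_headers(header_block: bytes) -> List[Header]:
    """Parse header lines, unfolding continuation lines that start with a space."""
    headers: List[List[bytes]] = []
    for line in header_block.split(b"\n"):
        if line.startswith(b" ") and headers:
            # Continuation of the previous header value.
            headers[-1][1] += b"\n" + line[1:]
        elif line:
            key, _, value = line.partition(b" ")
            headers.append([key, value])
    return [(key, value) for key, value in headers]


def split_extra(headers: List[Header]) -> Tuple[List[Header], List[Header]]:
    """Return (known headers, extra headers), both in original order."""
    known = [(k, v) for k, v in headers if k in KNOWN_HEADERS]
    extra = [(k, v) for k, v in headers if k not in KNOWN_HEADERS]
    return known, extra


sample = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author Jane Doe <jane@example.org> 0 +0000\n"
    b"committer Jane Doe <jane@example.org> 0 +0000\n"
    b"gpgsig -----BEGIN PGP SIGNATURE-----\n"
    b" \n"
    b" abc\n"
    b" -----END PGP SIGNATURE-----\n"
)

known, extra = split_extra(parse_headers(sample))
print(extra)
```

Keeping the extra headers as an ordered list of pairs rather than a mapping preserves both their order and any repeated keys, which matters when the object has to be re-serialized to check its ID.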

! In swh/devel/experiments/swh-db-audit#75 (moved), @olasd wrote: Reading the dulwich code a bit further, it turns out that git commits can have more header attributes than we initially expected.

Nice catch.

As those things seem really git-specific I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.

Yes, absolutely. As they're not generic to our VCS model, they shouldn't be glorified as columns or anything of the sort. OTOH, we will need to look for them when checking for VCS object integrity, because otherwise the object ID wouldn't correspond to the actual object content (right?).

(Which also reminds me that we should document our "schemas" for arbitrary metadata somewhere, incrementally, every time we add new metadata to the DB. What would be the right place for this? Somewhere under swh-storage, I guess…)

! In swh/devel/experiments/swh-db-audit#75 (moved), @olasd wrote: It currently breaks on completely empty messages, but the patch seems fairly simple.

Submitted https://github.com/jelmer/dulwich/pull/413

The dulwich pull request has been merged, and the corresponding package has been added to the local swh archive for importers.

Some progress has been made on several fronts: swh.model has been updated for the latest schema modifications, and swh.loader.git properly populates the new attributes on updates.

The swh.model implementation has been exercised on the failing cases above and all seems well.

We have another snag: git supports an encoding header, which we have to support somehow. Two options: add it as an extra header, or as a new field on releases/revisions. The second option seems sensible, as I believe the field is general-purpose enough (even if only as an advisory setting).
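To illustrate what "advisory" could mean in practice: the raw message bytes stay the source of truth, and the encoding value is only consulted when a text rendering is needed. A minimal sketch, assuming UTF-8 as the default and as the fallback (the function name is hypothetical):

```python
from typing import Optional


def decode_message(message: bytes, encoding: Optional[bytes]) -> str:
    """Decode a message for display, treating the encoding header as a hint."""
    name = (encoding or b"utf-8").decode("ascii", "replace")
    try:
        return message.decode(name)
    except (LookupError, ValueError):
        # Unknown codec name, or bytes that do not match the declared encoding.
        return message.decode("utf-8", "replace")


print(decode_message("café".encode("latin-1"), b"ISO-8859-1"))
print(decode_message("café".encode("utf-8"), None))
```

The stored bytes are never rewritten, so a wrong or missing encoding value only affects display, not the checksum.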

Ok, one more reason why we must keep the original data from the dumps.

The scripts checking the integrity of revisions and fixing some of them have been committed to swh-storage (in a new utils directory).

By manually checking the original source of some existing revisions with bad checksums, we have uncovered a few bugs where libgit2 would happily eat some data.

Some fixes we're trying:

Missing newlines at the beginning or end of commit messages
Negative UTC offsets in author/committer timestamps
Trailing spaces in author/committer names
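The general idea behind trying such fixes can be sketched as a search: re-serialize the stored fields under each combination of candidate repairs and keep the one whose hash matches the recorded ID. This is only an illustration under simplifying assumptions (a commit with no parents and no extra headers, author/committer given as full header values); it is not the script committed to swh-storage.

```python
import hashlib
from itertools import product
from typing import List, Optional


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return sha1(b"<type> <length>\\0" + body)."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def person_variants(line: bytes) -> List[bytes]:
    """Candidate spellings: negative UTC offset, trailing space in the name."""
    variants = [line]
    if line.endswith(b" +0000"):
        variants.append(line[:-5] + b"-0000")
    # A trailing space in the name shows up as two spaces before the email.
    variants += [v.replace(b" <", b"  <", 1) for v in list(variants)]
    return variants


def message_variants(message: bytes) -> List[bytes]:
    """Candidate messages: newline restored at the start and/or the end."""
    return [message, message + b"\n", b"\n" + message, b"\n" + message + b"\n"]


def find_repair(expected_id: str, tree: bytes, author: bytes,
                committer: bytes, message: bytes) -> Optional[bytes]:
    """Return the first re-serialized body hashing to expected_id, else None."""
    candidates = product(person_variants(author),
                         person_variants(committer),
                         message_variants(message))
    for a, c, m in candidates:
        body = (b"tree " + tree + b"\nauthor " + a +
                b"\ncommitter " + c + b"\n\n" + m)
        if git_object_id("commit", body) == expected_id:
            return body
    return None
```

For example, a commit whose committer offset was -0000 and whose message lost its final newline on import would be recovered by the combination of the second committer variant and the second message variant.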

We still have some ways to go:

some commits have a completely empty author name (the line reads author <bla@bla> timestamp offset, with no name before the email)
some commits have trailing null bytes (eww.)
some commits have extra headers that were previously ignored by our importer

Once we get down to a reasonable number of commits left to fix (there are still a few million currently), we'll be able to track down the original source and dump it to see where we went wrong.
