Git commits and tags have been imported on the assumption that their authors' names and emails are encoded as UTF-8. This assumption is most likely wrong in some cases, and we need to figure out which repositories we need to reimport to fix them.
I have done some investigation on this in light of T272. Bottom line: not good. Git is very proficient in the corner cases department.
On the bright side, setting aside the issue underlying T272, we get the right result most of the time.
Here are a few of the corner cases that I stumbled upon:
git can store a positive or negative null offset when your timezone is UTC (that is, +0000 and -0000 are both valid offsets)
The author/committer line can be missing the timestamp and timezone. In that case, pygit2 defaults to (timestamp, offset) = (0, +0000)
This would all be fine, except that sometimes, there is an explicit 0 +0000 at the end of the author line.
Empty message storage is inconsistent. The internal git object format is: a newline-terminated header, followed by an empty line, followed by the message. When the message is empty, there are two newlines at the end of the object. Except that sometimes, the empty line is omitted.
I just noticed that empty messages with empty lines are stored as an empty bytea, whereas empty messages without the empty line are stored as NULL. So there's that.
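To make the author-line corner cases above concrete, here is a minimal sketch of a parser that keeps them distinguishable. It is not the swh.model or pygit2 API: parse_ident and the negative_utc key are hypothetical names chosen for illustration. The point is that a missing timestamp is reported as None rather than silently turned into 0 +0000, and that -0000 is remembered separately from +0000.

```python
import re

# Hypothetical helper (not an existing library API): the value of an
# author/committer header is "<fullname> <timestamp> <offset>", where the
# trailing timestamp and offset may be missing altogether.
_IDENT_RE = re.compile(
    rb"(?P<fullname>.*?)"
    rb"(?: (?P<timestamp>\d+) (?P<sign>[+-])(?P<hh>\d{2})(?P<mm>\d{2}))?",
    re.DOTALL,
)


def parse_ident(value: bytes) -> dict:
    """Split an author/committer value, preserving git's quirks."""
    m = _IDENT_RE.fullmatch(value)
    if m["timestamp"] is None:
        # No date at all: keep that fact instead of defaulting to (0, +0000).
        return {
            "fullname": m["fullname"],
            "timestamp": None,
            "offset_minutes": None,
            "negative_utc": False,
        }
    minutes = int(m["hh"]) * 60 + int(m["mm"])
    negative = m["sign"] == b"-"
    return {
        "fullname": m["fullname"],
        "timestamp": int(m["timestamp"]),
        "offset_minutes": -minutes if negative else minutes,
        # -0000 and +0000 are the same offset but serialize differently.
        "negative_utc": negative and minutes == 0,
    }
```

With this shape, an explicit 0 +0000 yields timestamp 0, while a line with no date yields timestamp None, so the two can be serialized back to their original bytes.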
As those things seem really git-specific, I suppose we should store them in some kind of "extra_headers" attribute in the current metadata field.
Yes, absolutely. As they're not generic to our VCS model, they shouldn't be glorified as columns or anything of the sort. OTOH, we will need to look for them when checking for VCS object integrity, because otherwise the object ID wouldn't correspond to the actual object content (right?).
(Which also reminds me of the fact that we should document somewhere our "schemas" for arbitrary metadata, incrementally, every time we add new metadata to the DB. Where would be a right place to do this? Somewhere under swh-storage I guess…)
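On the integrity point: a minimal sketch of why these details matter, assuming SHA-1 object IDs. git hashes the object type, the body length, a null byte, and then the raw body, so two serializations of the "same" empty-message commit get different IDs. The function name git_object_id and the sample author line are illustrative only.

```python
import hashlib


def git_object_id(obj_type: bytes, body: bytes) -> str:
    """Compute a git object ID: sha1 of "<type> <length>\\0" + body."""
    header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Made-up commit headers pointing at the well-known empty tree.
headers = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A <a@example.org> 0 +0000\n"
    b"committer A <a@example.org> 0 +0000\n"
)

# Empty message stored with the separating empty line...
with_blank_line = git_object_id(b"commit", headers + b"\n")
# ...and the variant where the empty line is omitted.
without_blank_line = git_object_id(b"commit", headers)

# A single missing byte is enough to change the ID.
print(with_blank_line != without_blank_line)
```

Any checker that recomputes IDs from our stored fields therefore has to reproduce the original bytes exactly, including the presence or absence of that empty line.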
The dulwich pull request has been merged, and the corresponding package has been added to the local swh archive for importers.
Some progress has been made on several fronts: swh.model has been updated for the latest schema modifications, and swh.loader.git properly populates the new attributes on updates.
The swh.model implementation has been exercised on the failing cases above and all seems well.
We have another snag: git supports an encoding header that we have to support somehow. Two options: add it as an extra header, or as a new field on releases/revisions. The second option seems sensible, as I believe the field is general-purpose enough (even if only as an advisory setting).
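To illustrate what "advisory" could mean in practice, here is a minimal sketch, assuming the raw message bytes remain the stored source of truth and the encoding value is only consulted when rendering text. decode_message is a hypothetical helper, not part of swh.model.

```python
from typing import Optional


def decode_message(message: bytes, encoding_header: Optional[bytes] = None) -> str:
    """Decode a commit message for display, treating the header as a hint."""
    # Without an encoding header, git's convention is UTF-8.
    encoding = (encoding_header or b"utf-8").decode("ascii", "replace")
    try:
        return message.decode(encoding)
    except (LookupError, ValueError):
        # Unknown codec name, or bytes that don't match the declared
        # encoding: fall back to a lossy UTF-8 decoding for display only.
        return message.decode("utf-8", "replace")
```

Because the fallback is lossy, the decoded string should never be used to recompute object IDs; only the original bytes are suitable for that.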
The scripts checking the integrity of revisions and fixing some of them have been committed to swh-storage (in a new utils directory).
By manually checking the original source of some existing revisions with bad checksums, we have uncovered a few bugs where libgit2 would happily eat some data.
Some fixes we're trying:
Missing newlines at the beginning or end of commit messages
Negative UTC offsets in author/committer timestamps
Trailing spaces in author/committer names
We still have some ways to go:
some commits have a completely empty author name, with only a single space between "author" and the email (author <bla@bla> timestamp offset) instead of the two spaces that an empty name would normally leave
some commits have trailing null bytes (eww.)
some commits have extra headers that were previously ignored by our importer
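For the extra headers case, a minimal sketch of how a raw commit could be split so that unrecognized headers are kept rather than dropped. split_commit and KNOWN_HEADERS are hypothetical names, and the set of known headers is an assumption for the example. Continuation lines (starting with a space) are folded into the preceding header, and a message of None means the empty line was absent.

```python
# Headers assumed to map onto dedicated fields in this example.
KNOWN_HEADERS = {b"tree", b"parent", b"author", b"committer", b"encoding"}


def split_commit(raw: bytes):
    """Return (headers, extra_headers, message) for a raw commit body."""
    head, sep, message = raw.partition(b"\n\n")
    if not sep:
        # No empty line at all: there is no message, not even an empty one.
        message = None

    headers = []
    for line in head.split(b"\n"):
        if not line:
            continue
        if line.startswith(b" ") and headers:
            # Continuation of a multi-line header value.
            key, value = headers[-1]
            headers[-1] = (key, value + b"\n" + line[1:])
        else:
            key, _, value = line.partition(b" ")
            headers.append((key, value))

    extra = [(k, v) for k, v in headers if k not in KNOWN_HEADERS]
    return headers, extra, message
```

Keeping the extra headers as an ordered list of pairs, rather than a mapping, preserves both their order and any duplicates, which matters when the object is serialized again for checksum verification.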
Once we get down to a reasonable number of commits left to fix (there are still a few million currently), we'll be able to track down the original source and dump it to see where we went wrong.