(Brain dump after a discussion with @zimoun at the 10 Years of Guix Event)
Guix (and Nix?) manifests store a checksum of the source/input files downloaded by the manifest. For tarballs, this is a hash of the tarball itself. For a Git repo, it's the commit or tree hash (IIRC). For other VCSs without their own intrinsic hashes (Subversion, CVS, ...), Guix hashes the directory tree of the tip commit by serializing it in the NAR (Nix Archive) format.
We should investigate this format. If it is simple enough, we should be able to recompute this NAR serialization and store its hash as an ExtID so that it is available to Guix (and Nix?) via the vault.
One of the issues is the lookup key: what information do we have at package time? Today, the only intrinsic information that Guix always keeps is the integrity value (computed by the NAR serializer and hashed with SHA-256). For tarballs (url-fetch), Disarchive effectively provides a dictionary from this Guix integrity value to SWH.
For Git (git-fetch), it is less clear. I would like to store the Git commit hash at package time in Guix, but there is no consensus inside the community. For now, in this case, Guix fetches back from SWH using the URL + Git tag.
For other VCSs, nothing is implemented. For instance, IIUC for Subversion [1], the fallback would be to use snapshots, which are not implemented yet.
Well, you will tell me to store the swhid at package time. :-) Maybe, but it means that 20k+ packages would have to be updated. Perhaps, instead, a map from various integrity computations to swhids (something similar to Disarchive, somehow) could help, especially for other VCSs such as Subversion, CVS, Bzr, etc. At least, NAR would help, sure! :-)
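To make the idea concrete, a toy sketch of such a map (every key and value below is an invented placeholder):

# Hypothetical dictionary from (algorithm, integrity value) to SWHID,
# in the spirit of what Disarchive provides for tarballs.
integrity_to_swhid = {
    ("nar-sha256", "<nar sha256 of the checkout>"): "swh:1:dir:<directory id>",
    ("svn-nar-sha256", "<nar sha256 of the svn tip>"): "swh:1:dir:<directory id>",
}

def lookup_swhid(algorithm, value):
    return integrity_to_swhid.get((algorithm, value))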
Guix provides guix hash, whose option named --recursive is confusing (at least, it confused me :-)). Considering a random Git repository, say this one, then
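presumably an invocation along these lines (a sketch: -r/--recursive selects the NAR serialization and -x/--exclude-vcs skips the .git directory; the printed value is the nix-base32 rendering of the hash computed below):

$ guix hash --recursive --exclude-vcs guix-modules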
and the naive Python equivalent -- closely following the Nix ARchive format described here, p. 100 and following (or p. 92, depending on how you are counting ;-), around section 5.2.2 -- looks like:
import os
import stat
import hashlib
from pathlib import Path


def str_(thing):  # named 'str' in Figure 5.2 p.93 (page 101 of the pdf)
    if isinstance(thing, str):
        byte_sequence = thing.encode()
    else:
        byte_sequence = thing
    l = len(byte_sequence)
    blen = l.to_bytes(8, byteorder='little')  # 64-bit little-endian length
    m = l % 8
    if m == 0:
        offset = 0
    else:
        offset = 8 - m
    # pad the byte sequence with NULs to a multiple of 8 bytes
    return blen + byte_sequence + bytearray(offset)


def serialise_prime_prime(fso):
    mode = os.lstat(fso).st_mode
    if stat.S_ISREG(mode):
        bstr = str_("type") + str_("regular")
        contents = b''
        with open(str(fso), "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                contents += chunk
        if os.access(fso, os.X_OK):
            bstr += str_("executable") + str_("")
        return bstr + str_("contents") + str_(contents)
    elif stat.S_ISLNK(mode):
        bstr = str_("type") + str_("symlink")
        target = os.readlink(fso).encode()
        return bstr + str_("target") + str_(target)
    elif stat.S_ISDIR(mode):
        bstr = str_("type") + str_("directory")
        for path in sorted(Path(fso).iterdir()):
            bstr += serialiseEntry(path.name.encode(), path)
        return bstr
    else:
        raise ValueError("unsupported file type")


def serialise_prime(fso):
    return str_("(") + serialise_prime_prime(fso) + str_(")")


def serialiseEntry(name, fso):
    return str_("entry") + str_("(") \
        + str_("name") + str_(name) \
        + str_("node") + serialise_prime(fso) \
        + str_(")")


def serialise(fso):
    return str_("nix-archive-1") + serialise_prime(fso)


# Entry point
def nar_hash(thing):
    h = hashlib.sha256()
    s = serialise(thing)
    h.update(s)
    return h.hexdigest()
Python 3.9.9 (main, Jan 1 1970, 00:00:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %run narhasher.py

In [2]: nar_hash("guix-modules")
Out[2]: '75b5d7448bffb10ecf367e424c0941e059bab1aea0036c5c45e6b337db0e097d'
Guix encodes the hash using nix-base32 rather than hex, but that does not really matter here.
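For reference, a sketch of that encoding, mirroring (as far as I understand it) Nix's printHash32 routine; the alphabet and the bit order are the parts to double-check:

# Nix's base32 alphabet omits e, o, t and u.
NIX_B32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"

def nix_base32(digest):
    # 5 bits per character, starting from the least significant bit of
    # the byte string (unlike RFC 4648 base32).
    n = len(digest)
    length = (n * 8 - 1) // 5 + 1
    chars = []
    for i in range(length - 1, -1, -1):
        byte, bit = divmod(i * 5, 8)
        c = digest[byte] >> bit
        if byte + 1 < n:
            c |= digest[byte + 1] << (8 - bit)
        chars.append(NIX_B32_ALPHABET[c & 0x1f])
    return "".join(chars)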
which is the same hash as in the link above. The .git directory is excluded from the serialization.
However, note that Guix does not use the NAR serializer for integrity when it comes to tarballs. For example, consider the tarball of the hello package: its checksum, 086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd, is a flat hash of the tarball itself.
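That flat computation, assuming the nix_base32 sketch above, boils down to:

def tarball_hash(path):
    # Flat SHA-256 over the raw tarball bytes: no NAR framing at all.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return nix_base32(h.digest())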
$ tree foo
foo
├── bar
│   └── exe
└── baz

1 directory, 2 files
serializes as
nix-archive-1
(
  type
  directory
  entry
  (
    name
    bar
    node
    (
      type
      directory
      entry
      (
        name
        exe
        node
        (
          type
          regular
          executable
          contents
          <_io.BufferedReader name='foo/bar/exe'>
        )
      )
    )
  )
  entry
  (
    name
    baz
    node
    (
      type
      regular
      contents
      <_io.BufferedReader name='foo/baz'>
    )
  )
)
where the files baz and exe are empty and exe was made executable with chmod u+x exe. The SHA-1 of this serialization, in hex format, is a81bdeb702409399f83ba91cf95cd6bcab800ad7. Note that the indentation is only added for readability.
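A quick sketch to reproduce that value with the serialise function above:

import os
import hashlib

os.makedirs("foo/bar", exist_ok=True)
open("foo/baz", "w").close()      # empty file
open("foo/bar/exe", "w").close()  # empty file
os.chmod("foo/bar/exe", 0o744)    # chmod u+x
print(hashlib.sha1(serialise("foo")).hexdigest())
# expected: a81bdeb702409399f83ba91cf95cd6bcab800ad7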
Now, let us consider a large directory containing many Git repositories, including the Guix and Emacs ones,
$ find ~/src/ -type f -print | wc -l
91419
$ find ~/src/ -type d -print | wc -l
5218
$ du -sh ~/src/
11G	/home/simon/src/
and compare guix hash with this Python implementation.
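A sketch of how to time that comparison (note that the naive nar_hash above does not skip .git directories, unlike guix hash with --exclude-vcs):

import os
import time

start = time.perf_counter()
digest = nar_hash(os.path.expanduser("~/src"))
elapsed = time.perf_counter() - start
print(f"python nar_hash: {elapsed:.1f}s ({digest})")
# shell side of the comparison:  time guix hash -rx ~/src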
FWIW, I've iterated a bit over @zimoun's code and pushed it into the snippets repository (see the commits above and their commit messages). It's also able to deal with git, hg and svn trees (ignoring their respective top-level metadata folders .git, .svn, ... without impacting performance).
The algorithm seems simple enough.
We may want to adapt, industrialize, and push this into swh-model to expose functions so loaders could compute and store that hash as an ExtID (as per the description above). That would also avoid depending on nix (or guix) binaries for the deployment of the new {Content|Directory}Loader [1] [2].
The only caveat I see in the current implementation is that the algorithm does its own recursion. So if loaders wanted to compute it, they would walk the directory tree a second time (one pass to hash our intrinsic identifiers, another to compute this one).
This may not be too much of a performance hit for the new {Content|Directory}Loader [1] [2], but it may be a blocker for VCS repository loaders (some can take their time, depending on the origin to ingest).
Another way would be to integrate this within one of the swh.model modules that deal with merkle/fs structures. I don't think plugging this directly inside swh.model.hashutil.MultiHash is possible, but maybe inside swh.model.merkle or swh.model.from_disk? But would that help?
The only caveat I see in the current implementation is that the algorithm does its own recursion. So if loaders wanted to compute it, they would walk the directory tree a second time (one pass to hash our intrinsic identifiers, another to compute this one).
Where are your intrinsic identifiers computed? Because one solution is to walk the tree once and feed various serializers (swhid, git, nar, etc.). Just to note that maybe in the future you would like to support other intrinsic serializers. Well, I do not know; maybe it is a premature optimization. :-)
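A rough sketch of that single-walk idea (the enter_dir/leave_dir/visit_file sink interface is hypothetical, just to show the shape):

from pathlib import Path

def walk_once(root, sinks):
    # 'sinks' is a list of serializer objects; each one accumulates its
    # own hash (swhid, git, nar, ...) during a single traversal instead
    # of re-walking the tree once per algorithm.
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            for sink in sinks:
                sink.enter_dir(entry)
            walk_once(entry, sinks)
            for sink in sinks:
                sink.leave_dir(entry)
        else:
            for sink in sinks:
                sink.visit_file(entry)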
The only caveat I see in the current implementation is that the algorithm does its own recursion. So if loaders wanted to compute it, they would walk the directory tree a second time (one pass to hash our intrinsic identifiers, another to compute this one).
Where are your intrinsic identifiers computed?
A loader is really a mapping "function" from reading the origin (vcs, tarball, package, ...) to swh.model.model Content, Directory, Revision, Release and Snapshot model objects [1]. Those integrate the intrinsic id computation methods; see for example git [2] and hg [3]. For package loaders, we directly walk the filesystem through the swh.model.from_disk.Directory.from_disk method [4], hence my previous question quoted below.
Another way would be to integrate this within one of the swh.model modules that deal with merkle/fs structures. I don't think plugging this directly inside swh.model.hashutil.MultiHash is possible, but maybe inside swh.model.merkle or swh.model.from_disk?
Because one solution is to walk the tree once and feed various serializers (swhid, git, nar, etc.).
Indeed, that's already the case in our MultiHash module [5] (which our model objects ultimately depend upon). But so far, the hashes in question were "flat" hash algorithms (without any extra framing like the one NAR defines).
Just to note that maybe in the future you would like to support other intrinsic serializers.
While we may be willing to make the extra effort for the NAR format to ease our integration with guix/nix, I don't think that's the way forward for us. We'd like the various package managers to start using standardized intrinsic identifiers, hence our effort on standardizing the SWHID format (which everybody can compute without the swh stack).
Well, I do not know; maybe it is a premature optimization. :-)
While we may be willing to make the extra effort for the NAR format to ease our integration with guix/nix, I don't think that's the way forward for us. We'd like the various package managers to start using standardized intrinsic identifiers, hence our effort on standardizing the SWHID format (which everybody can compute without the swh stack).
Are there package managers besides Nix and Guix that use file tree hashes as hard integrity checks? (Honest question; my guess is "no".)
Right now, Guix is doing great for tarballs (via Disarchive, which stores a swhid for the tarball content) and for Git (since we can directly look up Git commits by ID). For everything else (Mercurial, Subversion, etc.), there's no practical solution.
I understand and subscribe to the idea of standardizing around SWHID. However, until we get there, supporting nar hashes in SWH would make a big difference; it would instantly lead Guix to 99% archival coverage, which would justify having a party. :-)
Are there package managers besides Nix and Guix that use file tree hashes as
hard integrity checks? (Honest question; my guess is "no".)
I don't think so. I don't recall having seen such checks.
@vlorentz, @anlambert: since you did your fair amount of reviews and/or implementations on lister/package loaders, any opinion ^?
However, until we get there, supporting nar hashes in SWH would make a big
difference; it would instantly lead Guix to 99% archival coverage, which
would justify having a party. :-)
Point taken (beyond the party ;p)
I realized that I need to take a step back, as I've entangled 2 things in this discussion:
1. The implementation. What @zimoun started is a basic yet exhaustive implementation in Python, to avoid depending on runtime dependencies. I've massaged it a bit a while back and pushed it inside the snippets repository. It's currently untested, though, so it would need some more tests. The current Content and Directory loaders "used" by the new nixguix lister depend on a nix binary to actually check the integrity fields, which are NAR hashes, so they do not use that "native" implementation yet. There is thus a shot at simplifying deployment if we wanted to avoid that runtime dependency (at the expense of maintaining our own implementation).
2. Whether we want to support the NAR hash when we encounter it. Which means possibly lifting the ExtID mapping (as we did for the mercurial loader among other things [1]). I propose ExtID(extid_type=nar, extid_version=0, target_type={content|directory}); see the sketch after this list. Would that be enough? @vlorentz, thoughts?
For the implementation part (1.), @vlorentz, @olasd, @douardda: if we were to support NAR hashes "natively" (with our own implementation), where would that code go? A starting implementation (unused in swh yet, missing tests) is currently living inside the snippets repository. Could we move it inside a module in swh.model, under a third-party hash module or some such? Or, simpler, keep that code where it's currently needed, as it's not transverse yet (in swh.loader.core.{Content|Directory}Loader).
For part (2.), the core of the issue, we need to push the discussion forward, conclude, and design something. I've proposed something above.
Are there package managers besides Nix and Guix that use file tree hashes as hard integrity checks? (Honest question; my guess is "no".)
I don't think so. I don't recall having seen such checks.
@vlorentz, @anlambert: since you did your fair amount of reviews and/or implementations on lister/package loaders, any opinion ^?
If you squint hard enough, tarball hashes are hashes of a file tree. But otherwise, no.
Whether we want to support the NAR hash when we encounter it.
What do you mean by "encounter"?
Which means possibly lifting the ExtID mapping (as we did for the mercurial loader among other things [1]). I propose ExtID(extid_type=nar, extid_version=0, target_type={content|directory}). Would that be enough? @vlorentz, thoughts?
I think so, yes
For the implementation part (1.), @vlorentz, @olasd, @douardda: if we were to support NAR hashes "natively" (with our own implementation), where would that code go? A starting implementation (unused in swh yet, missing tests) is currently living inside the snippets repository. Could we move it inside a module in swh.model, under a third-party hash module or some such? Or, simpler, keep that code where it's currently needed, as it's not transverse yet (in swh.loader.core.{Content|Directory}Loader).
Whether we want to support the NAR hash when we encounter it.
What do you mean by "encounter"?
Currently, in the various listings of nixguix, it's not always present, hence "encounter".
Which means possibly lifting the ExtID mapping (as we did for the mercurial loader among other things [1]). I propose ExtID(extid_type=nar, extid_version=0, target_type={content|directory}). Would that be enough? @vlorentz, thoughts?

I think so, yes
Great, thanks.
In the core loader. Where else would it be used?
Nowhere that I could think of. I could have missed something, though, so I'd prefer having a second (or more) opinion on the matter.
So swh.loader.core it is! And we'll still be able to move it eventually if need be. In the meantime, it's YAGNI indeed.
So your plan is to make the Nixguix lister send the NAR hash to loaders, and make VCS, archive, and content loaders check it then write an extid, right?
So your plan is to make the Nixguix lister send the NAR hash to loaders, and
make VCS, archive, and content loaders check it then write an extid, right?
I was synthesizing in my head the plan to create task(s) about it (under the "extend archive coverage" milestone). But yes, something like this.
To be clear (/repeat ;), the NAR hash is already sent for the Content and Directory (archive) loaders. It's used for checks (implementation-wise, with the nix binary instead of some Python code) and discarded after that.
Note that, while I can foresee what to do for the archive and content loaders regarding adding the ExtID-writing step (maybe even reading it, to be faster?), it's still fuzzy for the VCS part for now.
I'll summarize my thoughts on the topic; thanks @civodul and @ardumont for following it up:
- I do not think we should compute and store NAR hashes for all objects.
- I think we should store the value of the NAR hash that we've received from guix or nix as an ExtID associated with the archived directory.
- I think we should validate NAR hashes, to the best of our ability, before storing them as ExtIDs (and record that information).
- I'd favor a reimplementation of NAR hashing in our own code to do this validation, instead of having one more non-Python dependency in our stack. If our own implementation fails in more than 10% of cases, I'd consider scrapping it (I'm mostly worried about our normalization of file and directory modes). I think it makes sense to have it in swh.loader.core.
- I also think we should store the NAR hashes that we could not validate, as a different class of ExtIDs, to ease last-resort lookups for nix and guix.
So, in my mind, we would have ExtIDs of the form (nar-sha256-base64-validated, 0, <nar sha256 base64ed>, directory, <directory id>) and (nar-sha256-base64-unvalidated-<guix/nix/...>, 0, <nar sha256 base64ed>, directory, <directory id>).
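To make the proposed naming concrete, a tiny sketch (the source suffix follows the <guix/nix/...> pattern above):

import base64

def nar_extid(digest, validated, source="guix"):
    # Encodes the scheme proposed above: the ExtID value is the
    # base64-encoded NAR sha256, and the extid type records whether
    # we managed to validate it ourselves.
    if validated:
        extid_type = "nar-sha256-base64-validated"
    else:
        extid_type = "nar-sha256-base64-unvalidated-" + source
    return (extid_type, 0, base64.b64encode(digest).decode(), "directory")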