(Brain dump after a discussion with @zimoun at the 10 Years of Guix Event)
Guix (and Nix?) manifests store a checksum of the source/input files downloaded by the manifest. For tarballs, this is a hash of the tarball itself. For a Git repo, it's the commit or tree hash (IIRC). For other VCSs without their own intrinsic hashes (Subversion, CVS, ...), Guix hashes the directory tree of the tip commit by serializing it in the NAR (Nix Archive) format.
We should investigate this format. If it is simple enough, we should be able to recompute this NAR serialization and store its hash as an ExtID so that it is available to Guix (and Nix?) via the vault.
One of the issues is the lookup key: what information do we have at package time? Today, the only intrinsic information that Guix always keeps is the integrity value (computed by the NAR serializer and hashed with SHA-256). For tarballs (url-fetch), Disarchive effectively provides a dictionary from this Guix integrity value to SWH.
For Git (git-fetch), it is less clear. I would like to store the Git commit hash at package time in Guix, but there is no consensus inside the community. For now, in this case, Guix fetches back from SWH using the URL + Git tag.
For other VCSs, nothing is implemented. For instance, IIUC for Subversion [1], the fallback would be to use snapshots, which are not implemented yet.
Well, you will tell me to store the swhid at package time. :-) Maybe, but it means that 20k+ packages would have to be updated. Perhaps, instead, a map from various integrity computations to swhids (something similar to Disarchive, somehow) could help, especially for other VCSs such as Subversion, CVS, Bzr, etc. At least, NAR would help, sure! :-)
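To make the idea concrete, a toy sketch of such a map (every key and value below is an invented placeholder):

# Hypothetical dictionary from (algorithm, integrity value) to SWHID,
# in the spirit of what Disarchive provides for tarballs.
integrity_to_swhid = {
    ("nar-sha256", "<nar sha256 of the checkout>"): "swh:1:dir:<directory id>",
    ("svn-nar-sha256", "<nar sha256 of the svn tip>"): "swh:1:dir:<directory id>",
}

def lookup_swhid(algorithm, value):
    return integrity_to_swhid.get((algorithm, value))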
Guix provides guix hash, whose option named --recursive is confusing (at least, it confused me :-)). Considering a random Git repository, say this one, then
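presumably an invocation along these lines (a sketch: -r/--recursive selects the NAR serialization and -x/--exclude-vcs skips the .git directory; the printed value is the nix-base32 rendering of the hash computed below):

$ guix hash --recursive --exclude-vcs guix-modules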
and the naive Python equivalent -- closely following the Nix ARchive format described here, p. 100 and following (or p. 92, depending on how you are counting ;-), around section 5.2.2 -- looks like:
import os
import stat
import hashlib
from pathlib import Path


def str_(thing):  # named 'str' in Figure 5.2 p.93 (page 101 of the pdf)
    if isinstance(thing, str):
        byte_sequence = thing.encode()
    else:
        byte_sequence = thing
    l = len(byte_sequence)
    blen = l.to_bytes(8, byteorder='little')  # 64-bit little-endian length
    m = l % 8
    if m == 0:
        offset = 0
    else:
        offset = 8 - m
    # pad the byte sequence with NULs to a multiple of 8 bytes
    return blen + byte_sequence + bytearray(offset)


def serialise_prime_prime(fso):
    mode = os.lstat(fso).st_mode
    if stat.S_ISREG(mode):
        bstr = str_("type") + str_("regular")
        contents = b''
        with open(str(fso), "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                contents += chunk
        if os.access(fso, os.X_OK):
            bstr += str_("executable") + str_("")
        return bstr + str_("contents") + str_(contents)
    elif stat.S_ISLNK(mode):
        bstr = str_("type") + str_("symlink")
        target = os.readlink(fso).encode()
        return bstr + str_("target") + str_(target)
    elif stat.S_ISDIR(mode):
        bstr = str_("type") + str_("directory")
        for path in sorted(Path(fso).iterdir()):
            bstr += serialiseEntry(path.name.encode(), path)
        return bstr
    else:
        raise ValueError("unsupported file type")


def serialise_prime(fso):
    return str_("(") + serialise_prime_prime(fso) + str_(")")


def serialiseEntry(name, fso):
    return str_("entry") + str_("(") \
        + str_("name") + str_(name) \
        + str_("node") + serialise_prime(fso) \
        + str_(")")


def serialise(fso):
    return str_("nix-archive-1") + serialise_prime(fso)


# Entry point
def nar_hash(thing):
    h = hashlib.sha256()
    s = serialise(thing)
    h.update(s)
    return h.hexdigest()
Python 3.9.9 (main, Jan 1 1970, 00:00:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %run narhasher.py

In [2]: nar_hash("guix-modules")
Out[2]: '75b5d7448bffb10ecf367e424c0941e059bab1aea0036c5c45e6b337db0e097d'
Guix encodes the hash using nix-base32 rather than hex, but that does not really matter here.
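For reference, a sketch of that encoding, mirroring (as far as I understand it) Nix's printHash32 routine; the alphabet and the bit order are the parts to double-check:

# Nix's base32 alphabet omits e, o, t and u.
NIX_B32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"

def nix_base32(digest):
    # 5 bits per character, starting from the least significant bit of
    # the byte string (unlike RFC 4648 base32).
    n = len(digest)
    length = (n * 8 - 1) // 5 + 1
    chars = []
    for i in range(length - 1, -1, -1):
        byte, bit = divmod(i * 5, 8)
        c = digest[byte] >> bit
        if byte + 1 < n:
            c |= digest[byte + 1] << (8 - bit)
        chars.append(NIX_B32_ALPHABET[c & 0x1f])
    return "".join(chars)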
which is the same hash as in the link above. The .git directory is excluded from the serialization.
However, note that Guix does not use the NAR serializer for integrity when it comes to tarballs. For example, consider the tarball of the hello package: its checksum, 086vqwk2wl8zfs47sq2xpjc9k066ilmb8z6dn0q6ymwjzlm196cd, is a flat hash of the tarball itself.
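That flat computation, assuming the nix_base32 sketch above, boils down to:

def tarball_hash(path):
    # Flat SHA-256 over the raw tarball bytes: no NAR framing at all.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return nix_base32(h.digest())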
$ tree foo
foo
├── bar
│   └── exe
└── baz

1 directory, 2 files
serializes as
nix-archive-1
(
  type
  directory
  entry
  (
    name
    bar
    node
    (
      type
      directory
      entry
      (
        name
        exe
        node
        (
          type
          regular
          executable
          contents
          <_io.BufferedReader name='foo/bar/exe'>
        )
      )
    )
  )
  entry
  (
    name
    baz
    node
    (
      type
      regular
      contents
      <_io.BufferedReader name='foo/baz'>
    )
  )
)
where the files baz and exe are empty and exe was made executable with chmod u+x exe. The SHA-1 of this serialization, in hex format, is a81bdeb702409399f83ba91cf95cd6bcab800ad7. Note that the indentation is only added for readability.
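A quick sketch to reproduce that value with the serialise function above:

import os
import hashlib

os.makedirs("foo/bar", exist_ok=True)
open("foo/baz", "w").close()      # empty file
open("foo/bar/exe", "w").close()  # empty file
os.chmod("foo/bar/exe", 0o744)    # chmod u+x
print(hashlib.sha1(serialise("foo")).hexdigest())
# expected: a81bdeb702409399f83ba91cf95cd6bcab800ad7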
Now, let us consider a large directory containing many Git repositories, including the Guix and Emacs ones,
$ find ~/src/ -type f -print | wc -l
91419
$ find ~/src/ -type d -print | wc -l
5218
$ du -sh ~/src/
11G	/home/simon/src/
and compare guix hash with this Python implementation.
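A sketch of how to time that comparison (note that the naive nar_hash above does not skip .git directories, unlike guix hash with --exclude-vcs):

import os
import time

start = time.perf_counter()
digest = nar_hash(os.path.expanduser("~/src"))
elapsed = time.perf_counter() - start
print(f"python nar_hash: {elapsed:.1f}s ({digest})")
# shell side of the comparison:  time guix hash -rx ~/src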
FWIW, I've iterated a bit over @zimoun's code and pushed it into the snippets repository (see the commits above and their commit messages). It's also able to deal with git, hg and svn trees (ignoring their respective top-level metadata folders .git, .svn, ... without impacting performance).
The algorithm seems simple enough.
We may want to adapt, industrialize, and push this into swh-model to expose functions so loaders could compute and store that hash as an ExtID (as per the description above). That would also avoid depending on nix (or guix) binaries for the deployment of the new {Content|Directory}Loader [1] [2].
The only caveat I see in the current implementation is that the algorithm does its own recursion. So if loaders wanted to compute it, they would walk the directory tree a second time (one pass to hash our intrinsic identifiers, another to compute this one).
This may not be too much of a performance hit for the new {Content|Directory}Loader [1] [2], but it may be a blocker for VCS repository loaders (some can take their time, depending on the origin to ingest).
Another way would be to integrate this within one of the swh.model modules that deal with merkle/fs structures. I don't think plugging this directly inside swh.model.hashutil.MultiHash is possible, but maybe inside swh.model.merkle or swh.model.from_disk? But would that help?
The only caveat I see in the current implementation is that the algorithm does its own recursion. So if loaders wanted to compute it, they would walk the directory tree a second time (one pass to hash our intrinsic identifiers, another to compute this one).
Where are your intrinsic identifiers computed? Because one solution is to walk the tree once and feed various serializers (swhid, git, nar, etc.). Just to note that maybe in the future you would like to support other intrinsic serializers. Well, I do not know; maybe it is a premature optimization. :-)
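A rough sketch of that single-walk idea (the enter_dir/leave_dir/visit_file sink interface is hypothetical, just to show the shape):

from pathlib import Path

def walk_once(root, sinks):
    # 'sinks' is a list of serializer objects; each one accumulates its
    # own hash (swhid, git, nar, ...) during a single traversal instead
    # of re-walking the tree once per algorithm.
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            for sink in sinks:
                sink.enter_dir(entry)
            walk_once(entry, sinks)
            for sink in sinks:
                sink.leave_dir(entry)
        else:
            for sink in sinks:
                sink.visit_file(entry)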
The only caveat I see in the current implementation is that the algorithm does its own recursion. So if loaders wanted to compute it, they would walk the directory tree a second time (one pass to hash our intrinsic identifiers, another to compute this one).
Where are your intrinsic identifiers computed?
A loader is really a mapping "function" from reading the origin (vcs, tarball, package, ...) to swh.model.model Content, Directory, Revision, Release and Snapshot model objects [1]. Those integrate the intrinsic id computation methods; see for example git [2] and hg [3]. For package loaders, we directly walk the filesystem through the swh.model.from_disk.Directory.from_disk method [4], hence my previous question quoted below.
Another way would be to integrate this within one of the swh.model modules that deal with merkle/fs structures. I don't think plugging this directly inside swh.model.hashutil.MultiHash is possible, but maybe inside swh.model.merkle or swh.model.from_disk?
Because one solution is to walk the tree once and feed various serializers (swhid, git, nar, etc.).
Indeed, that's already the case in our MultiHash module [5] (which our model objects ultimately depend upon). But so far, the hashes in question were "flat" hash algorithms (without any extra framing like the one NAR defines).
Just to note that maybe in the future you would like to support other intrinsic serializers.
While we may be willing to make the extra effort for the NAR format to ease our integration with guix/nix, I don't think that's the way forward for us. We'd like the various package managers to start using standardized intrinsic identifiers, hence our effort on standardizing the SWHID format (which everybody can compute without the swh stack).
Well, I do not know; maybe it is a premature optimization. :-)
While we may be willing to make the extra effort for the NAR format to ease our integration with guix/nix, I don't think that's the way forward for us. We'd like the various package managers to start using standardized intrinsic identifiers, hence our effort on standardizing the SWHID format (which everybody can compute without the swh stack).
Are there package managers besides Nix and Guix that use file tree hashes as hard integrity checks? (Honest question; my guess is "no".)
Right now, Guix is doing great for tarballs (via Disarchive, which stores a swhid for the tarball content) and for Git (since we can directly look up Git commits by ID). For everything else (Mercurial, Subversion, etc.), there's no practical solution.
I understand and subscribe to the idea of standardizing around SWHID. However, until we get there, supporting nar hashes in SWH would make a big difference; it would instantly lead Guix to 99% archival coverage, which would justify having a party. :-)
Are there package managers besides Nix and Guix that use file tree hashes as
hard integrity checks? (Honest question; my guess is "no".)
I don't think so. I don't recall having seen such checks.
@vlorentz, @anlambert: since you did your fair amount of reviews and/or implementations on lister/package loaders, any opinion ^?
However, until we get there, supporting nar hashes in SWH would make a big
difference; it would instantly lead Guix to 99% archival coverage, which
would justify having a party. :-)
Point taken (beyond the party ;p)
I realized that I need to take a step back, as I've entangled 2 things in this discussion:
1. The implementation. What @zimoun started is a basic yet exhaustive implementation in Python, to avoid depending on runtime dependencies. I've massaged it a bit a while back and pushed it inside the snippets repository. It's currently untested, though, so it would need some more tests. The current Content and Directory loaders "used" by the new nixguix lister depend on a nix binary to actually check the integrity fields, which are NAR hashes, so they do not use that "native" implementation yet. There is thus a shot at simplifying deployment if we wanted to avoid that runtime dependency (at the expense of maintaining our own implementation).
2. Whether we want to support the NAR hash when we encounter it. Which means possibly lifting the ExtID mapping (as we did for the mercurial loader among other things [1]). I propose ExtID(extid_type=nar, extid_version=0, target_type={content|directory}); see the sketch after this list. Would that be enough? @vlorentz, thoughts?
For the implementation part (1.), @vlorentz, @olasd, @douardda: if we were to support NAR hashes "natively" (with our own implementation), where would that code go? A starting implementation (unused in swh yet, missing tests) is currently living inside the snippets repository. Could we move it inside a module in swh.model, under a third-party hash module or some such? Or, simpler, keep that code where it's currently needed, as it's not transverse yet (in swh.loader.core.{Content|Directory}Loader).
For part (2.), the core of the issue, we need to push the discussion forward, conclude, and design something. I've proposed something above.
Are there package managers besides Nix and Guix that use file tree hashes as hard integrity checks? (Honest question; my guess is "no".)
I don't think so. I don't recall having seen such checks.
@vlorentz, @anlambert: since you did your fair amount of reviews and/or implementations on lister/package loaders, any opinion ^?
If you squint hard enough, tarball hashes are hashes of a file tree. But otherwise, no.
Whether we want to support the NAR hash when we encounter it.
What do you mean by "encounter"?
Which means possibly lifting the ExtID mapping (as we did for the mercurial loader among other things [1]). I propose ExtID(extid_type=nar, extid_version=0, target_type={content|directory}). Would that be enough? @vlorentz, thoughts?
I think so, yes
For the implementation part (1.), @vlorentz, @olasd, @douardda: if we were to support NAR hashes "natively" (with our own implementation), where would that code go? A starting implementation (unused in swh yet, missing tests) is currently living inside the snippets repository. Could we move it inside a module in swh.model, under a third-party hash module or some such? Or, simpler, keep that code where it's currently needed, as it's not transverse yet (in swh.loader.core.{Content|Directory}Loader).
Whether we want to support the NAR hash when we encounter it.
What do you mean by "encounter"?
Currently, in the various listings of nixguix, it's not always present, hence "encounter".
Which means possibly lifting the ExtID mapping (as we did for the mercurial loader among other things [1]). I propose ExtID(extid_type=nar, extid_version=0, target_type={content|directory}). Would that be enough? @vlorentz, thoughts?

I think so, yes
Great, thanks.
In the core loader. Where else would it be used?
Nowhere that I could think of. I could have missed something, though, so I'd prefer having a second (or more) opinion on the matter.
So swh.loader.core it is! And we'll still be able to move it eventually if need be. In the meantime, it's YAGNI indeed.
So your plan is to make the Nixguix lister send the NAR hash to loaders, and make VCS, archive, and content loaders check it then write an extid, right?
So your plan is to make the Nixguix lister send the NAR hash to loaders, and
make VCS, archive, and content loaders check it then write an extid, right?
I was synthesizing in my head the plan to create task(s) about it (under the "extend archive coverage" milestone). But yes, something like this.
To be clear (/repeat ;), the NAR hash is already sent for the Content and Directory (archive) loaders. It's used for checks (implementation-wise, with the nix binary instead of some Python code) and discarded after that.
Note that, while I can foresee what to do for the archive and content loaders regarding adding the ExtID-writing step (maybe even reading it, to be faster?), it's still fuzzy for the VCS part for now.
I'll summarize my thoughts on the topic; thanks @civodul and @ardumont for following it up:
- I do not think we should compute and store NAR hashes for all objects.
- I think we should store the value of the NAR hash that we've received from guix or nix as an ExtID associated with the archived directory.
- I think we should validate NAR hashes, to the best of our ability, before storing them as ExtIDs (and record that information).
- I'd favor a reimplementation of NAR hashing in our own code to do this validation, instead of having one more non-Python dependency in our stack. If our own implementation fails in more than 10% of cases, I'd consider scrapping it (I'm mostly worried about our normalization of file and directory modes). I think it makes sense to have it in swh.loader.core.
- I also think we should store the NAR hashes that we could not validate, as a different class of ExtIDs, to ease last-resort lookups for nix and guix.
So, in my mind, we would have ExtIDs of the form (nar-sha256-base64-validated, 0, <nar sha256 base64ed>, directory, <directory id>) and (nar-sha256-base64-unvalidated-<guix/nix/...>, 0, <nar sha256 base64ed>, directory, <directory id>).
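To make the proposed naming concrete, a tiny sketch (the source suffix follows the <guix/nix/...> pattern above):

import base64

def nar_extid(digest, validated, source="guix"):
    # Encodes the scheme proposed above: the ExtID value is the
    # base64-encoded NAR sha256, and the extid type records whether
    # we managed to validate it ourselves.
    if validated:
        extid_type = "nar-sha256-base64-validated"
    else:
        extid_type = "nar-sha256-base64-unvalidated-" + source
    return (extid_type, 0, base64.b64encode(digest).decode(), "directory")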