There seems to be a difference between what's listed in the upstream manifests and what's git 'checkout'ed ('switch'ed). [1]
The code and the CLIs (including guix's) agree with each other (see the last 2 CLI calls below) but diverge from the hash listed in the upstream manifest (the one the lister reads from the upstream manifest and passes as an argument to the loader).
```
$ git clone https://github.com/daattali/jquery-colourpicker --filter=tree:0
Cloning into 'jquery-colourpicker'...
...
$ cd jquery-colourpicker
$ git switch --detach 27c2a266d51e18a9fe6d7542264152b27c7d34e0
...
HEAD is now at 27c2a26 remove old version text from js
$ swh nar -x -f hex -H sha256 /var/tmp/git-tryouts/jquery-colourpicker 2>/dev/null
0b473e867ae80f408ccac98448c62ffb3449337c38a3210d11d7c08e6055e851
$ guix hash -x -S nar -f hex -H sha256 /var/tmp/git-tryouts/jquery-colourpicker
0b473e867ae80f408ccac98448c62ffb3449337c38a3210d11d7c08e6055e851
```
Ah... I do not know where this hash fa599... could come from. Well, the commit is from 2016 and the repository seems "simple". Therefore, I do not know what could be done differently in order to get fa599... instead of 0b473....
I do not have access to sentry; I have just sent a request. :-)
> I do not have access to sentry; I have just sent a request. :-)

i've added you (i think?!)
> Ah... I do not know where this hash fa599... could come from. Well, the commit is from 2016 and the repository seems "simple". Therefore, I do not know what could be done differently in order to get fa599... instead of 0b473....

I'd expect this was what the lister read from the guix manifest (at the time).
Yes, another check against the manifest [1] shows something is off between the listing and the actual checkout.
Hence the loader not being too happy about it (this issue).
The manifest lists a nar hash <7638c91fe9a49e639f95e5498529022ea0886842e1da08d9644ff174d66e3ed6>
to check, but it's not what's currently checked out <0cf5945a7158e25af4ab40589fda5ada81217ae01675df6248b9a52b88e48ea8> [2].
[1] Manifest extract:
```
$ curl -k https://guix.gnu.org/sources.json > /var/tmp/guix-tryout/$(today)-sources.json
$ file /var/tmp/guix-tryout/20230822-1534-sources.json
/var/tmp/guix-tryout/20230822-1534-sources.json: ASCII text, with very long lines, with no line terminators
$ ipython
In [1]: filepath = '/var/tmp/guix-tryout/20230822-1534-sources.json'

In [3]: import json

In [4]: with open(filepath, 'r') as f: guix_data = json.loads(f.read())['sources']

In [5]: origins_git = [origin for origin in guix_data if origin['type'] == 'git' ]

In [8]: loksh_origin = [o for o in origins_git if o['git_url'] == 'https://github.com/dimkr/loksh'][0]

In [9]: loksh_origin
Out[9]:
{'type': 'git',
 'git_url': 'https://github.com/dimkr/loksh',
 'integrity': 'sha256-djjJH+mknmOfleVJhSkCLqCIaELh2gjZZE/xdNZuPtY=',
 'outputHashAlgo': 'sha256',
 'outputHashMode': 'recursive',
 'git_ref': '7.3'}

In [10]: loksh_origin['integrity']
Out[10]: 'sha256-djjJH+mknmOfleVJhSkCLqCIaELh2gjZZE/xdNZuPtY='

In [11]: def decode(integrity: str):
    ...:     """Compute checksum hash out of an integrity field.
    ...:
    ...:     """
    ...:     checksum_algo, chksum_b64 = integrity.split("-")
    ...:     checksum = base64.decodebytes(chksum_b64.encode()).hex()
    ...:     return checksum
    ...:

In [13]: import base64

In [14]: decode(loksh_origin['integrity'])
Out[14]: '7638c91fe9a49e639f95e5498529022ea0886842e1da08d9644ff174d66e3ed6'
```
[2] Local checkout (from previous comment we already had a local clone so we only switch to the reference 7.3)
```
$ # In loksh repository
$ git switch --detach 7.3
Previous HEAD position was 85a4de6 Fix release name
HEAD is now at e88658f Update lolibc
$ swh nar -x -f hex -H sha256 /var/tmp/git-tryouts/loksh 2>/dev/null
0cf5945a7158e25af4ab40589fda5ada81217ae01675df6248b9a52b88e48ea8
$ guix hash -x -S nar -f hex -H sha256 /var/tmp/git-tryouts/loksh
0cf5945a7158e25af4ab40589fda5ada81217ae01675df6248b9a52b88e48ea8
```
I think the issue for loksh comes from the submodule.
The Guix definition reads,
```scheme
(version "7.3")
(source
 (origin
   (method git-fetch)
   (uri (git-reference
         (url "https://github.com/dimkr/loksh")
         (commit version)
         ;; Include the ‘lolibc’ submodule, a static compatibility library
         ;; created for and currently used only by loksh.
         (recursive? #t)))
   (file-name (git-file-name name version))
   (sha256
    (base32 "1miydvb79wagckchinp189l8i81f08lqajg5jngn77m4x4gwjf3n"))))
```
@civodul @zimoun just so you know, i had speed issues with the git ingestion of git tree checkouts, which i overcame by using treeless cloning (and some folder filtering too) [1].
But "now" i'm facing a hash discrepancy issue [2] between what's listed in the guix manifest and what's actually computed on disk (both the guix and the swh tools agree on the computed nar hash, so there might be an issue either in the tool that reads the manifest or in the tool that computes the nar hash).
About <git://git.linux-nfs.org/projects/bfields/nfs4-acl-tools.git>, the hash that is expected is 31c0d1a8a485332da20d42517aa44dafee2e48477aa390f47bdaf7a955eb0953 and I do not know what is the other one. Could you provide some context where this hash is computed?
Context is that the nixguix lister (deployed in staging) now lists more kinds of origins, including git tree checkouts (to store their nar in swh alongside their dag if the hash matches). But currently, some hash mismatches remain, which led to this open issue.
Thx to you, i've got one more insight, at least for the submodule part (the loader does nothing regarding submodules: no --recurse-submodules or submodule init step is currently done when cloning).
for the origin git://git.linux-nfs.org... [1], it's an old issue (the one i opened from sentry at the time). I don't have anything more than the following expected hash (i don't see the git reference required in the issue).
When cloning it locally though, i still have the same 31c0... nar hash from the current master head.
Checksum mismatched on <git://git.linux-nfs.org/projects/bfields/nfs4-acl-tools.git>: {'sha256': 'd665d94efccddaefa75b09416fa0488f8700211a350bca652c3d42365507cd21'} != {'sha256': '31c0d1a8a485332da20d42517aa44dafee2e48477aa390f47bdaf7a955eb0953'}
> Could you provide some context where this hash is computed?

The Git directory loader class [1] is in charge of loading such repositories.
It inherits its behavior from the base class BaseDirectoryLoader [2],
which in turn inherits the hash computation behavior from its base class NodeLoader [3].
The unfortunate inheritance design stems from the initial core loader, which we cannot easily move away from.
So it's still class-inheriting-from-class loop hell to stay DRY (but not KISS ¯\(ツ)/¯).
And assuming that .serialize() is somehow validated by cross-comparison with both CLIs, swh nar and guix hash, then it would mean that the hash mismatch would come from .fetch_artifact(). The method is maybe not doing the same as git clone ... && git switch --detach. Well, do you have access to repo_path from:
? I mean, is it possible to run something like swh nar -x -f hex -H sha256 <repo_path> where <repo_path> is the actual path on disk? If it is possible, it would help to reduce the scope of the potential issue, no?
> If it is possible, it would help to reduce the scope of the potential issue, no?

(talking about origins without any git submodule...)
For me, 'swh nar' and the actual loader render the same hash (out of the local
checkout) but it does not correspond to what's fed to the loader, hence the issue (again,
except for the origins you spotted that have submodules).
> And assuming that .serialize() is somehow validated by cross-comparison with both
> CLIs, swh nar and guix hash, then

It's actually doing the computation exactly like 'swh nar' does (what you initially
provided me with a while back). Then the hexdigest call simply renders the hash
results into a dict of {hash_algo: hash_result_in_hex_format, ...} with hash_algo in
{'sha256', ...}. That's how hashes are computed in the python world (so we kept the Nar object
compliant with this scheme).
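To make the comparison step concrete, here is a minimal sketch (not the actual swh-loader-core code) of rendering digests as a `{hash_algo: hex}` dict and checking them against the expected checksums. The names `hexdigests` and `check_checksums` are made up for illustration, and the error text only mimics the message reported in sentry; `data` stands for the serialized NAR bytes.

```python
import hashlib


def hexdigests(data: bytes, algos=("sha256",)) -> dict:
    """Render the digests of `data` as {hash_algo: hash_result_in_hex_format}."""
    return {algo: hashlib.new(algo, data).hexdigest() for algo in algos}


def check_checksums(url: str, data: bytes, expected: dict) -> None:
    """Raise ValueError when the computed digests differ from the expected ones.

    Hypothetical helper: the message format imitates the sentry error,
    with the computed value first and the expected one second.
    """
    actual = hexdigests(data, tuple(expected))
    if actual != expected:
        raise ValueError(f"Checksum mismatched on <{url}>: {actual} != {expected}")
```

With this shape, a mismatch can only come from the bytes being hashed or from the expected value handed to the loader, which is the distinction the discussion is trying to pin down.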
> it would mean that the hash mismatch would come from
> .fetch_artifact().

Fetch artifact (as per your 2nd quote) just clones the repository (see [1] for the
actual implementation) and provides the repository's local checkout path for hashing (and
when it needs to, the hash is actually a nar one; that depends on the lister's output,
which is the loader's input ;).
> The method is maybe not doing the same as git clone ... && git switch --detach.

For the submodule you identified, yes, it's missing --recurse-submodules (or whatever
the git flag to do that is [4]). But for the rest, it's ok.
Note that not all origins are failing. You only see in sentry the ones failing.
> Well, do you have access to repo_path from:

Yes, that's actually the artifact_path you read from [5] (the loop is described in the
[3] link in my last comment ;).
Rereading everything again.
I guess, you'd like to be able to add something like...
> (talking about origins without any git submodule...)
> For me, 'swh nar' and the actual loader renders the same hash (out of the local checkout) but it does not correspond to what's fed to the loader hence the issue

What does "it does not correspond to what's fed to the loader hence the issue" mean? I mean, the file sources.json contains hashes that are the same as the local checkout. Well, from my understanding.
IIUC, one mismatch is:
```
ValueError: Checksum mismatched on <git://git.linux-nfs.org/projects/bfields/nfs4-acl-tools.git>: {'sha256': 'd665d94efccddaefa75b09416fa0488f8700211a350bca652c3d42365507cd21'} != {'sha256': '31c0d1a8a485332da20d42517aa44dafee2e48477aa390f47bdaf7a955eb0953'}
```
Therefore, it appears to me that the error is in the computation of d665d94efccddaefa75b09416fa0488f8700211a350bca652c3d42365507cd21 by the loader. Am I missing something?
Somehow, assuming that the code of .serialize() is correct, the mismatch would come from the loader hashing different content.
Therefore, yes, it seems worthwhile to me to be able to double-check that the content living at repo_path is the expected one. Well, yes, running some Guix command (or swh nar) on repo_path, as in the code from your re-reading, appears helpful to me for debugging the mismatch. Even better, if it is possible, instead of running guix hash, I would run some kind of copy(repo_path, "/var/tmp/" + repo_path), or whatever other location. Then, having access to /var/tmp/xxxx, where xxxx is the repo_path of this problematic nfs4-acl-tools repo, would help, IMHO. For instance, run swh nar /var/tmp/xxxx and compare the hash. Then git clone and run some diff between the two repositories. Etc.
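The "diff between the two repositories" step could look like the sketch below. It is only an illustration under assumptions: `snapshot` and `tree_diff` are hypothetical helper names, `.git` entries are skipped at every depth, and empty directories and symlinks to directories are ignored for brevity. The executable bit is recorded because the NAR serialization takes it into account.

```python
import hashlib
import os
import stat


def snapshot(root: str, skip=(".git",)) -> dict:
    """Map each relative file path under `root` to (is_executable, digest)."""
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            if name in skip:  # e.g. the `.git` *file* found in submodule checkouts
                continue
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if os.path.islink(full):
                entries[rel] = (False, "symlink:" + os.readlink(full))
                continue
            mode = os.lstat(full).st_mode
            with open(full, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            entries[rel] = (bool(mode & stat.S_IXUSR), digest)
    return entries


def tree_diff(left: str, right: str) -> dict:
    """Report paths present on one side only, or differing in content or mode."""
    a, b = snapshot(left), snapshot(right)
    return {
        "only_left": sorted(a.keys() - b.keys()),
        "only_right": sorted(b.keys() - a.keys()),
        "changed": sorted(p for p in a.keys() & b.keys() if a[p] != b[p]),
    }
```

Running `tree_diff("/var/tmp/xxxx", "<fresh clone>")` on the copied repo_path and a manual clone would then show directly which files explain a different hash.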
> I mean, the file sources.json contains hashes that are the same as the local checkout. Well, from my understanding.

yes, i was having some doubts (because of my guix hash misuse).

> Therefore, it appears to me that the error is in the computation of d665d94efccddaefa75b09416fa0488f8700211a350bca652c3d42365507cd21 by the loader. Am I missing something?

yes.

> Somehow, assuming that the code of .serialize() is correct, the mismatch would come from the loader hashing different content.

hashing with both commands, guix (correctly!) and swh nar, should tell us whether swh nar is correct or not.

> Guix command (or swh nar)

I'd say both (swh nar should have the shortcomings the loader has, if any).
Yeah all the hashing parameters can be confusing. :-)
Guix package recipes store the hash as -S nar -H sha256 -f nix-base32, which is a modified Base-32 format. So this is the default in the Guix world.
The file sources.json lists -S nar -H sha256 -f base64, which is just a format conversion of the above.
The loader seems to use -S nar -H sha256 -f hex.
Well, since the file sources.json is generated specifically for SWH and since the format is "versioned", it would be possible to change the hash format from base64 to hex and update the format version.
Let me know how I can help to fix this checksum mismatch.
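Since the three formats are only different renderings of the same 32 bytes, the conversion can be sketched in a few lines of Python. This is an illustrative sketch, not code from Guix or SWH: the function names are made up, and the nix-base32 decoding assumes the usual Nix convention (a 32-character alphabet without e, o, t, u, read as a big number whose bytes are emitted least significant first).

```python
import base64

# Alphabet used by the nix-base32 encoding (no e, o, t, u).
NIX_BASE32_ALPHABET = "0123456789abcdfghijklmnpqrsvwxyz"


def nix_base32_to_hex(text: str, size: int = 32) -> str:
    """Decode a nix-base32 string into the hex form of a `size`-byte digest."""
    value = 0
    for char in text:
        # The first character is the most significant base-32 digit.
        value = value * 32 + NIX_BASE32_ALPHABET.index(char)
    # The digest bytes are the little-endian representation of that number.
    return value.to_bytes(size, "little").hex()


def integrity_to_hex(integrity: str) -> str:
    """Decode a sources.json 'integrity' field such as 'sha256-<base64>'."""
    _algo, b64_checksum = integrity.split("-", 1)
    return base64.b64decode(b64_checksum).hex()
```

For loksh 7.3, both `nix_base32_to_hex("1miydvb79wagckchinp189l8i81f08lqajg5jngn77m4x4gwjf3n")` from the package recipe and `integrity_to_hex("sha256-djjJH+mknmOfleVJhSkCLqCIaELh2gjZZE/xdNZuPtY=")` from sources.json give `7638c91fe9a49e639f95e5498529022ea0886842e1da08d9644ff174d66e3ed6`.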
I used the following script to run the git directory loading on these origin URLs, getting the loader parameters from guix_sources.json and dumping the loading errors to a file (requires swh-loader-core!49 (closed)):
```python
import base64
import json
from collections import defaultdict
from subprocess import run


def get_nar_hex_checksum(nar_hash):
    # "sha256-<base64>" -> '{"sha256": "<hex>"}'
    algo, b64_checksum = nar_hash.split("-")
    return json.dumps({algo: base64.b64decode(b64_checksum).hex()})


with open("guix_sources.json", "r") as guix_sources_file:
    guix_sources = json.load(guix_sources_file)

git_sources = defaultdict(list)

# keep only the git origins whose hash is a recursive (nar) one
for guix_source in guix_sources["sources"]:
    if guix_source["type"] != "git" or guix_source["outputHashMode"] != "recursive":
        continue
    git_sources[guix_source["git_url"]].append(
        {
            "ref": f"'{guix_source['git_ref']}'",
            "checksum_layout": "nar",
            "checksums": get_nar_hex_checksum(guix_source["integrity"]),
        }
    )

for origin_url in open("origin-urls-nar-mismatch", "r"):
    origin_url = origin_url.rstrip("\n")
    if origin_url not in git_sources:
        continue
    for params in git_sources[origin_url]:
        swh_command = [
            "swh",
            "-l",
            "DEBUG",
            "loader",
            "run",
            "git-checkout",
            origin_url,
        ] + [f"{k}={v}" for k, v in params.items()]
        completed_process = run(swh_command, capture_output=True)
        if completed_process.returncode != 0:
            # to ensure command can be executed in terminal after copy paste
            print(" ".join(swh_command).replace("{", "'{").replace("}", "}'"))
            print(completed_process.stderr.decode())
```
After analyzing the loading errors, almost all of them turned out to be related to submodule handling, and I noticed two issues:

1. The NAR hash computation code implemented in swh-loader-core was not excluding VCS directories or special paths recursively. This could result in a different directory hash, as git creates a .git file in each submodule folder and it should not be taken into account when computing the hash (as guix does). The fix for this can be found in swh-loader-core!493 (merged).
2. The loader was erroneously always fetching submodules, but it turned out guix needs them only for a couple of packages, so this was resulting in quite a lot of hash mismatches in the loader. I refined the way submodules are handled in !164 (merged).

After applying both fixes, almost all origin URLs from the list I extracted could be successfully loaded. Nevertheless, there remain some cases where the hashes still mismatch; I will give some examples in another comment.
Below you can find some remaining origins with hash mismatches.
First let's take an example where we can successfully recompute the NAR hash computed by guix: the python-send2trash package.
It is defined as follows in the guix package specification:
```scheme
(define-public python-send2trash
  (package
    (name "python-send2trash")
    (version "1.8.0")
    (source
     (origin
       (method git-fetch)
       ;; Source tarball on PyPI doesn't include tests.
       (uri (git-reference
             (url "https://github.com/arsenetar/send2trash")
             (commit version)))
       (file-name (git-file-name name version))
       (sha256
        (base32 "1k7dfypaaq4f36fbciaasv72j6wgjihw8d88axmz9c329bz8v5qx"))))
    (build-system python-build-system)
    (arguments
     '(#:phases
       (modify-phases %standard-phases
         (add-before 'check 'pre-check
           (lambda _
             (setenv "HOME" "/tmp")))
         (replace 'check
           (lambda* (#:key tests? #:allow-other-keys)
             (when tests?
               (invoke "pytest" "-vv")))))))
    (native-inputs
     (list python-pytest))
    (home-page "https://github.com/arsenetar/send2trash")
    (synopsis "Send files to the user's @file{~/Trash} directory")
    (description "This package provides a Python library to send files to the
user's @file{~/Trash} directory.")
    (license license:bsd-3)))
```
and its related entry in the guix_sources.json file is:
We can see the computed hashes are the same as in the package specification and the JSON data.
However, for some packages, the computed hash does not match the one provided by guix. I paste below the JSON metadata for those I found while working on that issue:
My guess is that there is a side effect, likely due to my environment, that prevents computing the same hash as guix. I dug into the guix code to check if some kind of special processing was done when fetching the code from git but did not find anything relevant. @zimoun maybe this rings a bell for you?
I read Guix commit 404df667e3b06f1e9a416c956e53c03ca3642140 from Jun 28, 2022. Then, if I read correctly, the tag v2.5.0-alpha was rewritten upstream on Jul 9, 2022.
About groovy, my guess is that it is about .gitattributes and the normalization of CRLF. IIUC, Guix applies this file before hashing and SWH does not. I am not sure; we discussed a discrepancy caused by that normalization on swh-devel in March.
About FastQC, hum I do not know. I need to investigate... it is weird.
Currently Guix stores in its source the hash 00y9drm0bkpxw8xfl8ysss18jmnhj8blgqgr6fpa58rkpfcbg8qk and this has not changed since February (Guix revision b6a4fbb488f1f539ae45ed7924c9af8905fa0d8b).
About pyxel, it seems to be a change of tag (a mistake on the Guix side or an upstream retag). Arf, when will people decide to switch to intrinsic identifiers for good? :-)
```
HEAD is now at be75b72 Merge branch 'develop'
r:sha256 hash mismatch for /gnu/store/m79d72fh3k4vypcqrsikrwikrscwqd6m-python-pyxel-1.4.3-checkout:
  expected hash: 0bwsgb5yq5s479cnf046v379zsn5ybp5195kbfvzr9l11qbaicm9
  actual hash:   03ch79cmh9fxvq6c2f3zc2snzczhqi2n01f254lsigckc7d5wz08
hash mismatch for store item '/gnu/store/m79d72fh3k4vypcqrsikrwikrscwqd6m-python-pyxel-1.4.3-checkout'
```
And there is a discrepancy between the command-line option -x of guix hash and what the fixed-output derivation computes. IIUC, only the top-level .git, .svn, .bzr, etc. should be excluded to match the fixed-output derivation checksum. Well, let's confirm on the Guix side. :-)
There is still a discrepancy with the nar hash currently computed by guix hash, but this should also be fixed on the guix side, if I understood the discussion in https://issues.guix.gnu.org/65979#6 correctly?
It reminds me of a question whose answer I have lost. :-) Some packages use what Guix calls "recursive? #true" in their origin, which concretely means Git submodules. And all the submodules are part of the hash.
I am raising the case just to be sure we have discussed it. :-)
> There is still a discrepancy with the nar hash currently computed by guix hash

Yes, but guix hash is a helper and is not used at all for producing sources.json. Therefore, it does not matter for the loader. Fixing it is a bonus. :-) Thanks for pointing out the bug.
I added a workaround in f9c18e78 to handle the submodules case.
Nevertheless, if it is possible on your side to add a new boolean field in the JSON file (submodules ?) indicating that submodules should be fetched to compute the hash, this would simplify the processing on our side.
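As a rough illustration of how such a flag could drive the checkout, the sketch below builds the git commands for one sources.json entry. It is not the loader's implementation: `checkout_commands` is a made-up helper, the commands mirror the manual `git clone` / `git switch --detach` steps used earlier in this thread, and the `submodule` key is hypothetical since the field does not exist in sources.json yet.

```python
def checkout_commands(source: dict, dest: str) -> list:
    """Build the git commands for one sources.json entry (nothing is executed)."""
    commands = [
        # Treeless clone, as used earlier in this thread to speed things up.
        ["git", "clone", "--filter=tree:0", source["git_url"], dest],
        ["git", "-C", dest, "switch", "--detach", source["git_ref"]],
    ]
    # HYPOTHETICAL field: "submodule" stands for the proposed boolean flag and
    # is not part of sources.json at the time of this discussion.
    if source.get("submodule", False):
        commands.append(
            ["git", "-C", dest, "submodule", "update", "--init", "--recursive"]
        )
    return commands
```

Each returned command could then be run with `subprocess.run(cmd, check=True)` before hashing the checkout, so submodules are fetched only for the origins that declare them.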
> I added a workaround in f9c18e78 to handle the submodules case.

Thanks! Aah, now I remember. :-) Thank you for fixing this case I have initially missed.
> Nevertheless, if it is possible on your side to add a new boolean field in the JSON file (submodules ?) indicating that submodules should be fetched to compute the hash, this would simplify the processing on our side.

Yes, for sure. Tracked by which keyword? submodule ? Like that,