Hello! As discussed on IRC, Timothy Sample extracted a list of unarchived Bioconductor sources that we'd like SWH to ingest via the `NixGuix` loader (that's a file in the usual JSON format) [1]. Could you schedule archival and let us know how it goes? (Timothy may be able to extract a larger list of historical source URLs from the PoG [2] database eventually.) Thanks! :-)
@samplet I have some scripts for generating sources.json files. However, it would be easier if you could use the PoG database to generate a sources.json containing all Bioconductor origins. Let me know.
@ardumont, I'm not sure I understand "remote JSON manifest". Do you mean one that has the original URLs as well as the content-addressed ones? Or do you mean a JSON file hosted on the Web? I'm happy to help, just confused!
If possible, please load this sources file. It contains all of the non-VCS Guix sources that I know of that are missing from SWH. It also has the original URLs. There are some false positives (sources I think are missing that are actually there), but I know the loader will ignore those. There are around 6K sources, including those from Bioconductor. I realize this is slightly beyond the scope of this issue, but we discussed loading these elsewhere, and I figured this would be the easiest way to solve both problems.
If not, let me know and I will prepare a Bioconductor version.
@ardumont, I'm not sure I understand "remote JSON manifest". ...Or do you mean a JSON file hosted on the Web?
^ json file hosted on the web (which i don't have to compute myself too ;)
If possible, please load this sources file.
yes, it's on my todo list for the week (i'll try to do it soon). Monday morning is meeting time, so asap after that.
As mentioned earlier, i'll do it on staging first so you can have a look too (prior to actually triggering it in production).
An issue has been identified and a fix is ongoing (both staging & prod) [1]
I'll trigger the fill-in-the-hole in production once the fix is landed and deployed (probably next week) [and fully ingested on staging first].
The fix has been deployed.
Listed origins which failed the ingestion (most occurrences being because of the issue I mentioned) have been rescheduled on staging btw.
I pointed my script at the staging instance, and it checked 5923 missing SWHIDs and found 2255 of them. This is an amazing improvement, but it's not quite what your numbers suggest.
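I don't know exactly which script was used here; a minimal sketch of one way to batch-check SWHIDs against an SWH instance, assuming the standard `/api/1/known/` endpoint (the staging base URL below is a guess, not taken from this thread):

```python
import json
import urllib.request

def build_known_request(swhids, base="https://webapp.staging.swh.network"):
    """Build a POST request for SWH's /api/1/known/ endpoint, which takes a
    JSON list of SWHIDs and reports which of them are present in the archive."""
    return urllib.request.Request(
        base + "/api/1/known/",
        data=json.dumps(swhids).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (requires network access):
# with urllib.request.urlopen(build_known_request(my_swhids)) as resp:
#     known = json.load(resp)   # maps each SWHID to {"known": bool}
```

Counting the entries reported as known, versus the total submitted, gives numbers of the "found 2255 out of 5923" kind quoted above.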
Some of the still "missing" sources are in the staging archive, just not in the way they usually are. Usually a tarball is extracted into a directory (as if running `tar -C DIR ...`) and then that directory is stored as a node in the SWH graph. The few "missing" tarballs I investigated have the single subdirectory of the tarball in the graph, but not the parent directory. Take jq-1.7 for example:
It's weird because some (about half, guessing by the numbers) work the usual way. An example is Lilypond 2.19.80, which has a parent directory node. Here are the origin pages: jq-1.7; lilypond-2.19.80. The snapshots for both point to the subdirectories. Personally, I like the consistency of having the top-level directory, even if everyone is too polite to pack multiple directories into a tarball. Whatever the design goal is, the fact that some are one way and some are the other is probably worth some investigation!
As always, thank you @ardumont for considering and sorting through these issues. Getting 5K new Guix-referenced sources into SWH will be a huge improvement.
I'm not sure how to reply ;)
And what to do next... So i'll just reply along the way.
The fact that the staging instance is an append-only test archive (with a messy dataset inside, since it has never been emptied since its start) does not help the analysis... (or the will to go and dig...). Fortunately, we have some tasks(/epic) to work on to try and have scratchable instances for, among other things, improved analysis of the ingestion stack... Hopefully, we'll start that work soon.
I pointed my script at the staging instance, and it checked 5923 missing SWHIDs and found 2255 of them.
What's the script you are using?
What kind of swhids are we discussing here?
For that last part, I gather it's directory (given the discussion below), but better ask ;)
This is an amazing improvement,
yes!
but it's not quite what your numbers suggest.
Those numbers are heuristics from the scheduler/ingestion side, with no knowledge of what the swhids (and/or ExtIDs) refer to.
This discrepancy though sounds like something is off somewhere.
Personally, I like the consistency of having the top-level directory even if everyone is too polite to pack multiple directories into a tarball. Whatever the design goal is, the fact that some are one way and some are the other is probably worth some investigation!
We do like consistency as well ;)
It seems the code [1] is currently not providing the top-level directory holding the extracted archive but directly the contents of the tarball.
And when it's one top-level directory, it starts the ingestion from there.
Now, how come some have a different structure?
That must be other package loaders, which can archive the same tarballs but differently; that does not sound too good.
It seems that's the issue ^ [2]. So we may need to adapt to be consistent with how the other package loaders ingest tarballs [2].
They seem to just pass along the uncompressed path directly (without doing an extra check like [1]).
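For illustration, a hypothetical sketch (not the actual loader code) of the behavior described above, where the ingestion starts from the tarball's unique top-level directory when there is one, instead of from the extraction root:

```python
import os
import tarfile

def old_ingestion_root(tarball_path, extract_dir):
    # Sketch of the old behavior: extract the tarball, and if it contains
    # exactly one top-level directory, start the ingestion from that
    # subdirectory rather than from the extraction root. This is what makes
    # the parent directory node absent from the graph for such tarballs.
    with tarfile.open(tarball_path) as tf:
        tf.extractall(extract_dir)
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        return os.path.join(extract_dir, entries[0])
    return extract_dir
```

Tarballs packing a single `pkg-x.y/` directory thus get ingested one level lower than tarballs with several top-level entries, which matches the "some one way, some the other" observation.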
As always, thank you @ardumont for considering and sorting through these issues. Getting 5K new Guix-referenced sources into SWH will be a huge improvement.
sure
@anlambert What do you think about the issue, and about adapting the tarball directory loader to ingest as the package loaders do?
It seems that's the issue ^ [2]. So we may need to adapt to be consistent with how the other package loaders ingest tarballs [2].
They seem to just pass along the uncompressed path directly (without doing an extra check like [1]).
Tentatively pushed [1]; we may also need to bump the extid version.
Although, we do have the nixpkgs dataset to contend with too...
I recall the development started with theirs, hence that strange if (see [1] from previous comment).
Because nix does not use the top-level directory to compute the tarball nar hash...
This one comes from fetchFromGitHub. Nix has a way of downloading an archive and extracting it as a fixed-output derivation (which means the hash is the nar hash of the output directory). It looks like they use fetchFromGitHub as an optimization: it will download the generated archive file and unpack it rather than cloning the repo. In Guix, we had trouble with those archives being unstable, but maybe if you hash the contents it doesn't matter. They also have the underlying fetchzip, which sometimes gets used for stable tarball/zip releases. I don't know if they have a policy for when to use fetchzip vs. fetchurl.
Well, I do not speak Nix language. :-) Hum, I am missing something:
(HTH) I think the nixpkgs-swh code is only about reading the nixpkgs repositories and extracting derivation information into a json manifest.
They don't recompute anything.
This one comes from fetchFromGitHub. Nix has a way of downloading an
archive and extracting it as a fixed-output derivation (which means the
hash is the nar hash of the output directory). It looks like they use
fetchFromGitHub as an optimization: it will download the generated archive
file and unpack it rather than cloning the repo. In Guix, we had trouble
with those archives being unstable, but maybe if you hash the contents it
doesn't matter. They also have the underlying fetchzip, which sometimes
gets used for stable tarball/zip releases. I don't know if they have a
policy for when to use fetchzip vs. fetchurl.
[3]
Implementation wise, for tarballs, the code was already consuming a path generator yielding one filepath: the unique top-level directory if any, or the top-level uncompressed archive path otherwise. Hence, probably, the source of the discrepancy for some tarballs @samplet mentioned. In the current mr referenced [1] (not yet deployed), that generator now provides both the top-level uncompressed tarball path first and then the unique top-level directory within it, if any.
The remaining part of the generator consumption stays the same: it checks the nar checksums on each path until the checksums match. If nothing matches, the ingestion fails. Otherwise, it goes on and ingests the tarball. So both the Guix (always the top-level uncompressed archive path) and Nixpkgs (top-level directory within the tarball, if any) datasets should be ingested appropriately.
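The new generator behavior described above can be sketched as follows (hypothetical helper names, not the actual MR code):

```python
import os

def candidate_paths(extract_dir):
    # New behavior per the MR described above: yield the top-level
    # uncompressed tarball path first, then the unique top-level directory
    # within it, if any. The consumer checks the nar checksums against each
    # candidate in turn until one matches.
    yield extract_dir
    entries = os.listdir(extract_dir)
    if len(entries) == 1 and os.path.isdir(os.path.join(extract_dir, entries[0])):
        yield os.path.join(extract_dir, entries[0])
```

Yielding the top-level path first means the Guix-style checksum (computed over the extraction root) matches before the Nix-style one (computed over the single subdirectory) is even tried.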
The identified impact is that we have to bump the extid_version (already done in mr [1]). Meaning, we'll have to reingest tarballs. It should be mostly a noop (we already have the source code), except for the visit snapshot (swhid) & extid mapping. That way, it should become complete, relevant and consistent, at least for the guix tarball dataset (& another consistency pattern for nixpkgs tarball datasets... [3]). You should then be able to find the 2.5k missing swhids mentioned.
On your side, you'll have to adapt the extid query api call to pass along an
extra '?extid_version=1' [1] [2] in your calls to find the right mapping.
[2] It's currently not set, hence defaulting to the version 0
[3] We cannot do anything about this unless we ask nix to change their ways... which does not sound like a sensible request... We might as well ask them (and you ;) to use swhids directly
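The extid query adaptation mentioned above amounts to adding one query parameter to the API calls; a small stdlib helper (hypothetical name) that does this for an existing extid API URL, whatever its exact path:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def with_extid_version(url, version=1):
    # Append (or override) the extid_version query parameter on an extid
    # API URL. Version 1 is the post-fix mapping; unset defaults to 0,
    # the old (incorrect for guix) mapping.
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["extid_version"] = str(version)
    return urlunparse(parts._replace(query=urlencode(query)))
```

Any other query parameters already present on the URL are preserved.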
17:18:19 Let me try to rephrase, because I still don't get why an
extid version bump is needed... Let's suppose tarball tarball.tar.gz is
referenced by both the guix and nix sources.json. Extracting tarball.tar.gz
yields a single directory a/, containing a bunch of files. The swhid of the
a subdirectory is S_a; The swhid of the toplevel directory is S_top (with a
single entry a pointing to the swhid S_a).
ok
The snapshot that we would generate for the tarball.tar.gz should always
have had a single branch that points to S_top (and a HEAD alias). Is that
correct?
no, before, with <= v5.16.1, that depended on the manifest & tarball structure [1]. The snapshot would have targeted the directory that matched the checksums [2] (for nar or flat checksums). I realized I probably answered you wrong this morning (orally)...
If I understand correctly, guix's nar computation for tarball.tar.gz (let's
call the value nar_guix) is a hash of the toplevel directory (which has a
single 'a' subdirectory); whereas nix's nar computation (let's call it
nar_nix) skips the toplevel and uses directly the contents of the 'a'
subdirectory. Is that correct?
yes.
Which nar extids did we store? I assume we would have tried to store
(nar_nix, S_top) and (nar_guix, S_top), is that right?
No, the old loader version (still in prod) ingested the directory matching the checksums (same goes for the extid mapping computed). So a combination of one of [(nar_nix, S_top), (nar_nix, S_a)] and one of [(nar_guix, S_top), (nar_guix, S_a)], depending on the structure of the tarball (because of the conditional [2] in there, pulled in by the nixpkgs dataset, iirc the train of thought at the time)
Which nar extids are we storing now?
My guess would be (nar_nix, S_a) and (nar_guix, S_top)?
yes, that's it.
---- to which @anlambert responded more succinctly & exhaustively. It's definitely worth quoting ;p
17:28:37 <anlambert> olasd: the tarball hashes referenced in the guix sources.json manifest are flat ones (checksums of the tarball bytes), not recursive ones (NAR hashes of the contained source trees), so the checksum verification took place after we downloaded a tarball and succeeded; but unfortunately the tarball loader that was deployed skipped the top-level directory, so we ended up having extids of type checksum-sha256 targeting a wrong directory SWHID
17:31:14 <anlambert> if we want to avoid bumping the extid version, we must remove the wrong checksum-sha256 extid mappings from the storage and reload the concerned origins, but I guess this is not something we want to do
17:33:31 <olasd> ah! OK, that makes a ton of sense
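A "flat" checksum in the sense used in the quote above is just a hash over the raw tarball bytes, as opposed to a recursive (NAR) hash over the extracted source tree. A minimal sketch:

```python
import hashlib

def flat_sha256(tarball_path):
    # Flat checksum: hash the archive file's bytes as-is, without
    # extracting it. This is what the guix sources.json manifest records,
    # so it verifies regardless of which directory level gets ingested.
    h = hashlib.sha256()
    with open(tarball_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because this check says nothing about the extracted directory, it succeeded even when the loader then stored an extid pointing at the wrong directory SWHID, which is exactly the failure mode described above.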
fwiw, i've been monitoring the process and it's steadily growing [1]
From early today to now, it's keeping up and should keep up with the old extid_version (0).
As mentioned earlier in the discussion, extid_version 0 represents the old mapping (with the issue on tarball ingested at the wrong level in regards to how guix expects the directory representation).
Data with extid_version 1 is the new version (with the correct directory representation).
We'll see how to deal with [2] (very few hg* origins, in regards to those mappings).
I checked my collection of Guix SWHIDs against the staging instance, and it found 5406 (out of 5923 that were missing as of a month ago). That's a much better result! (I assume the extra 319 found their way into the archive by other means over the last month.) As far as I can tell, the new loader is working well. The new coverage numbers for Guix are very exciting! I'm looking forward to these landing in the main archive.