model: Add payload to ExtID class
This revision adds payload and payload type fields to the ExtID class. The intent is to use these fields to store Disarchive specifications to support recovering source tarballs.
See https://sympa.inria.fr/sympa/arc/swh-devel/2022-02/msg00022.html where Stefano suggests using a generic payload mechanism for ExtIDs and https://sympa.inria.fr/sympa/arc/swh-devel/2022-05/msg00027.html where we decided on using object storage for the payload.
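For readers who want a concrete picture of the change, here is a minimal sketch of an ExtID-like record with the two optional fields. It is not the swh.model definition: the class name `ExtIDSketch`, the field names other than payload and payload type, and the rule that both payload fields must be set together are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtIDSketch:
    """Hypothetical stand-in for the ExtID class; not the swh.model code."""

    extid_type: str          # name of the relationship type (assumed field name)
    extid: bytes             # external identifier, opaque bytes
    target: str              # a core SWHID, kept as a plain string here
    extid_version: int = 0   # version of the relationship type (assumed)
    payload_type: Optional[str] = None
    payload: Optional[bytes] = None  # identifier of the payload content

    def __post_init__(self) -> None:
        # Assumption: a payload without a type (or the reverse) is rejected.
        if (self.payload_type is None) != (self.payload is None):
            raise ValueError("payload_type and payload must be set together")


# An extid without a payload, and one pointing at a payload by a 20-byte id.
plain = ExtIDSketch("example-type", b"\x01\x02", "swh:1:dir:" + "0" * 40)
with_payload = ExtIDSketch(
    "example-type",
    b"\x01\x02",
    "swh:1:dir:" + "0" * 40,
    payload_type="disarchive",
    payload=bytes(20),
)
```

Keeping both fields optional means existing extids without a payload remain valid in this sketch.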
Migrated from D8759 (view on Phabricator)
Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/541/ See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/541/console
Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/542/ See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/542/console
Hi, thanks for the diffs
! In !267 (closed), @samplet wrote: It looks like using the --draft flag when sending this with arc has confused the CI system. The build failed because it couldn't access the updated code:
fatal: Couldn't find remote ref refs/tags/phabricator/diff/31582
It was because you hadn't signed the CLA when submitting the diff
Sorry! Can the CI job be retried?
Done
! In !267 (closed), @swh-public-ci wrote: Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/542/ See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/542/console
That's weird. I don't see why it would fail this time.
Anyway, I'm guessing it will fix itself when you send any update to the diff so let's ignore that issue for now.
This looks fine to me, though we might want to use something other than SHA1 for this.
We are currently in the process of moving the objstorage to multi-hashes, so eventually we will want to use multi-hashes here too. I wonder if we should do that before landing this diff (and swh-storage!849 (closed)). It would be less effort overall, but will delay your work a bit.
E.g., here it could be done by making payload a dictionary (with algos as keys and digests as values), and serializing the manifest as:
[payload_type $StrWithoutSpaces]
[payload blake2s256 $Hexdigest]
[payload sha1 $Hexdigest]
[payload sha1_git $Hexdigest]
[payload sha256 $Hexdigest]
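The multi-hash suggestion above could be sketched roughly as follows. This is only an illustration: the helper names `multi_hash` and `payload_manifest_lines` are hypothetical, and the exact line layout (one line per algorithm, sorted by algorithm name, newline-terminated) is an assumption read off the bracketed template, not the manifest format used by swh-model.

```python
import hashlib
from typing import Dict


def multi_hash(data: bytes) -> Dict[str, str]:
    """Compute the four digests named in the comment, as hex strings."""
    # sha1_git is the SHA-1 of the data prefixed with a Git blob header.
    git_header = b"blob %d\x00" % len(data)
    return {
        "blake2s256": hashlib.blake2s(data, digest_size=32).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha1_git": hashlib.sha1(git_header + data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def payload_manifest_lines(payload_type: str, payload: Dict[str, str]) -> bytes:
    """Serialize the payload part of a manifest (assumed layout)."""
    if " " in payload_type:
        raise ValueError("payload_type must not contain spaces")
    lines = [f"payload_type {payload_type}"]
    # Sorting keeps the output deterministic regardless of dict order.
    for algo in sorted(payload):
        lines.append(f"payload {algo} {payload[algo]}")
    return "".join(line + "\n" for line in lines).encode("ascii")


manifest = payload_manifest_lines("disarchive", multi_hash(b""))
print(manifest.decode("ascii"), end="")
```

Sorting the algorithm names happens to give the same order as in the template (blake2s256, sha1, sha1_git, sha256), which keeps the serialized form stable for hashing.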
! In !267 (closed), @vlorentz wrote: Hi, thanks for the diffs
! In !267 (closed), @samplet wrote: It looks like using the --draft flag when sending this with arc has confused the CI system. The build failed because it couldn't access the updated code:
fatal: Couldn't find remote ref refs/tags/phabricator/diff/31582
It was because you hadn't signed the CLA when submitting the diff
Nope, that's because the commit hasn't been pushed to the staging repo. I assume either you used --skip-staging when running arc diff, or the push failed for some reason.
You need to go through these instructions: https://docs.softwareheritage.org/devel/contributing/phabricator.html#enabling-git-push-to-our-forge to ensure that you're able to push to https://forge.softwareheritage.org/source/staging.git
! In !267 (closed), @vlorentz wrote: Anyway, I'm guessing it will fix itself when you send any update to the diff so let's ignore that issue for now.
This looks fine to me, though we might want to use something other than SHA1 for this.
We are currently in the process of moving the objstorage to multi-hashes, so eventually we will want to use multi-hashes here too. I wonder if we should do that before landing this diff (and swh-storage!849 (closed)). It would be less effort overall, but will delay your work a bit.
E.g., here it could be done by making payload a dictionary (with algos as keys and digests as values), and serializing the manifest as:
[payload_type $StrWithoutSpaces]
[payload blake2s256 $Hexdigest]
[payload sha1 $Hexdigest]
[payload sha1_git $Hexdigest]
[payload sha256 $Hexdigest]
So, in essence, the payload would be an outbound edge from the extid node to a content node (and only a content node?), which needs to be recorded in the SWH archive separately, correct?
Rather than bodge together some multi-hash edge definition that would only be used for extid manifests (and which may or may not be what we end up retaining for SWHID v2), I think it would make more sense to use a definition consistent with the existing edges in our data model, that is, use the sha1_git of the content to refer to it (like directory entries or snapshot branches would, today). We can make the schema of extids (and the hash used to identify edges) evolve concurrently with the schema for the rest of the SWH object identifiers once we have a consistent viewpoint on how to do that.
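For reference, the sha1_git of a content is the identifier Git would give the same bytes as a blob: the SHA-1 of a `blob <length>\0` header followed by the data. A minimal sketch, with the function name `sha1_git` chosen here for illustration:

```python
import hashlib


def sha1_git(data: bytes) -> bytes:
    """Return the Git blob SHA-1 of data as raw bytes."""
    # Git hashes a short header ("blob", the length, a NUL byte) plus the data.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).digest()


# The plain SHA-1 and the Git SHA-1 of the same bytes are different values.
data = b""
print(hashlib.sha1(data).hexdigest())  # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(sha1_git(data).hex())            # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the two digests differ, an edge recorded as a sha1_git cannot be passed directly to an interface keyed by the plain SHA-1.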
! In !267 (closed), @vlorentz wrote: Hi, thanks for the diffs
My pleasure! Sorry for fumbling with Phabricator so much.
I've set up pushing to the forge now, so hopefully things will work after an update (as suggested).
! In !267 (closed), @olasd wrote: [...] I think it would make more sense to use a definition consistent with the existing edges in our data model, that is, use the sha1_git of the content to refer to it (like directory entries or snapshot branches would, today).
This makes sense to me, too. I had it this way, but was tempted away by the ease of using storage.content_get_data with the plain SHA-1. If we store the Git SHA-1, my understanding is that I will have to use storage.content_get to get the metadata and proceed as before with the SHA-1. (In light of your comment, which I agree with, it's weird that storage.content_get_data uses a plain SHA-1 in the first place.)
Use a Git SHA-1 for the payload.
This update switches from using a bare SHA-1 to a Git SHA-1 for the payload.
It also includes minor corrections suggested by @vlorentz.
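The two-step lookup discussed above (resolve the Git SHA-1 to the content's metadata, then fetch the data by plain SHA-1) can be pictured with a small in-memory stand-in. Everything in this sketch is hypothetical: `InMemoryContentStore` and its methods only play the roles that storage.content_get and storage.content_get_data have in the discussion, and do not reproduce the swh-storage signatures.

```python
import hashlib
from typing import Dict, Optional


class InMemoryContentStore:
    """Hypothetical stand-in for a content storage; not the swh-storage API."""

    def __init__(self) -> None:
        self._sha1_by_sha1_git: Dict[bytes, bytes] = {}
        self._data_by_sha1: Dict[bytes, bytes] = {}

    def add(self, data: bytes) -> bytes:
        """Store data and return its Git SHA-1."""
        sha1 = hashlib.sha1(data).digest()
        git_id = hashlib.sha1(b"blob %d\x00" % len(data) + data).digest()
        self._sha1_by_sha1_git[git_id] = sha1
        self._data_by_sha1[sha1] = data
        return git_id

    def sha1_for(self, sha1_git: bytes) -> Optional[bytes]:
        """Metadata lookup: map a Git SHA-1 to the plain SHA-1."""
        return self._sha1_by_sha1_git.get(sha1_git)

    def data_for(self, sha1: bytes) -> Optional[bytes]:
        """Data lookup, keyed by the plain SHA-1."""
        return self._data_by_sha1.get(sha1)


def fetch_payload(store: InMemoryContentStore, payload: bytes) -> Optional[bytes]:
    """Resolve an extid payload (a Git SHA-1) to the content's bytes."""
    sha1 = store.sha1_for(payload)
    if sha1 is None:
        return None  # the payload content is not in the store
    return store.data_for(sha1)


store = InMemoryContentStore()
payload_id = store.add(b"disarchive specification bytes")
print(fetch_payload(store, payload_id))
```

The extra hop costs one metadata lookup per payload, in exchange for the extid edge using the same identifier as the other edges in the data model.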
Build has FAILED
Link to build: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/543/ See console output for more information: https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/543/console
Build is green
Patch application report for D8759 (id=31659)
Rebasing onto fe8d5558...
Current branch diff-target is up to date.
Changes applied before test
commit 49475e98028f2b6095fe5972cc51cc88919dfe3a Author: Timothy Sample <samplet@ngyro.com> Date: Tue Sep 27 15:29:23 2022 -0400 model: Add payload to ExtID class
See https://jenkins.softwareheritage.org/job/DMOD/job/tests-on-diff/544/ for more details.
 a relationship between an original identifier of an artifact, in its
 native/upstream environment, and a `core SWHID <persistent-identifiers>`,
-which is specific to Software Heritage. As such, it is a triple made of:
+which is specific to Software Heritage. As such, it includes:
 
 * the external identifier, stored as bytes whose format is opaque to the
   data model
 * a type (a simple name and a version), to identify the type of relationship
 * the "target", which is a core SWHID
 
+An extid may also include a "payload", which is arbitrary data about
+the relationship. For example, an extid might link a directory to
+the cryptographic hash of the tarball that originally contained it.
+In this case, the payload could include data useful for
+reconstructing the original tarball from the directory.
+
Mentioned in merge request swh-storage!849 (closed)