Specify the Vitam archiving format

SWH in Vitam

Proposal for archiving SWH in Vitam

It seems unrealistic, and probably unnecessary, to apply the SWH data model directly in Vitam: the main goal of this archiving project is to be able to recover a working SWH dataset, and possibly to find and retrieve a single SWH object from its SWHID with relative ease.

With this main goal in mind, the simplest solution would probably be to pack SWH objects in "packfiles" holding a given number of objects, and to keep all the SWHIDs stored in a packfile in an indexable metadata file attached to it. Together, these 2 files would form an SWH Archive Unit.

This should make it relatively easy to handle incremental archiving of the SWH Archive, by assembling these packfiles while consuming the SWH Journal (Kafka-based).
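The packing scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the names `ArchiveUnit`, `build_archive_units` and `find_unit` are hypothetical, JSON lines stand in for the real packfile serialization, and the input is any iterable of `(swhid, object)` pairs rather than an actual Journal consumer.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class ArchiveUnit(NamedTuple):
    """Hypothetical SWH Archive Unit: a packfile plus its SWHID index."""
    packfile: bytes  # serialized objects (JSON lines here, as a placeholder)
    index: str       # one SWHID per line, in packfile order


def build_archive_units(
    objects: Iterable[Tuple[str, dict]], max_objects: int
) -> Iterator[ArchiveUnit]:
    """Group a stream of (swhid, object) pairs into fixed-size archive units."""
    stream = iter(objects)
    while True:
        batch = list(islice(stream, max_objects))
        if not batch:
            return
        packfile = b"".join(
            json.dumps(obj, sort_keys=True).encode() + b"\n" for _, obj in batch
        )
        index = "".join(f"{swhid}\n" for swhid, _ in batch)
        yield ArchiveUnit(packfile, index)


def find_unit(units: List[ArchiveUnit], swhid: str) -> Optional[int]:
    """Return the position of the unit whose index lists `swhid`, if any."""
    for position, unit in enumerate(units):
        if swhid in unit.index.splitlines():
            return position
    return None
```

Because the function only pulls from an iterator, the same loop could in principle be fed from a stream of journal messages, emitting a new archive unit each time `max_objects` objects have been collected.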

These packfiles could use one of several low-level serialization formats, such as msgpack, which is already used in several places in the Software Heritage code base. However, the existing work on swh-dataset gives the opportunity to consider the Apache ORC file format instead.

ORC has the advantage of being a known and well-defined file format (although the export data model remains to be defined), and it provides very good compression as well as tooling to read, parse, filter, etc. the data.

Depending on whether we can add a custom indexer in Vitam to deal with the ORC files produced by swh-dataset, we may be able to get rid of the text-based index file storing the list of SWHIDs included in a given ORC file.

OAIS -- Open Archival Information System

OAIS organizes "information" into "packages" of different kinds, depending on the stage they capture:

when producing the information
when archiving it
when communicating it

OAIS proposes 3 kinds of packages:

Submission Information Packages (SIP), crafted by the producer for the archiving system (Vitam here)
Archival Information Packages (AIP), resulting from the processing of the SIP by the archiving service (Vitam)
Dissemination Information Packages (DIP), resulting from the processing of one or more AIPs by the archiving service, aimed at the publication of said information.

SEDA -- Standard d’Échange de Données pour l’Archivage

SEDA, along with the MEDONA norm, is the official standard for exchanging transactions between archiving services.

Transferring an archive unit to a remote archive service following the SEDA protocol consists in sending a SIP.

Vitam

Archiving platform implementing (among others) OAIS/SEDA protocols.

https://www.programmevitam.fr/pages/documentation/

Data model

:::info Vitam is based on MongoDB, which has an impact on how the data model is specified. :::

The data model consists of several "collections" organized in "bases":

Identity Base: stores user and application certificates (x509)
- Certificate Collection
- PersonalCertificate Collection
Logbook Base: operation and lifecycle log of archive units and objects
- LogbookOperation Collection
- LogbookLifeCycle Collection
- LogbookLifeCycleUnit Collection
- LogbookLifeCycleObjectGroup Collection
- Offset Collection
MetaData Base: metadata on archive units (Unit) and objects (ObjectGroup)
- Unit Collection
- ObjectGroup Collection
- Offset Collection
MasterData Base
- AccessContract Collection
- AccessionRegisterDetail Collection
- AccessionRegisterSummary Collection
- AccessionRegisterSymbolic Collection
- ArchiveUnitProfile Collection: describe archive unit profiles
- Agencies Collection
- Context Collection: describe application contexts(?)
- FileFormat Collection: describe file formats (filled from the PRONOM base provided by the UK National Archives)
- FileRules Collection: management rules to compute life cycle and deadline events for archive units
- Griffin Collection: a kind of plugin used to process archived binary objects
- IngestContract Collection: ingestion contracts
- ManagementContract Collection: management contracts
- Ontology Collection
- PreservationScenario Collection: "script" (aka list of griffins) to be executed on archived units (eg. check formats and generate PDFs from documents)
- Profile Collection: archiving profiles
- SecurityProfile Collection
- VitamSequence Collection: used to generate internal IDs
- Offset Collection
Report Base
- AuditObjectGroup Collection
- EliminationActionUnit Collection
- EliminationActionObjectGroup Collection
- PreservationReport Collection

SIP

Consists of a zip or tgz file with:

a transfer manifest file (_manifest.xml) with information and metadata describing the digital objects and archive units being transferred,
a content directory containing said objects.

A SIP must not be larger than 1GB.

A SIP must not have more than 100k objects.
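The SIP layout and the two limits above can be sketched as a small packaging helper. This is a minimal sketch, not Vitam tooling: `build_sip` is a hypothetical name, the manifest file name and the `content` directory name are passed in as parameters because the exact names expected by a given Vitam instance are an assumption here, and the size limit is checked against the resulting zip file (whether the 1GB limit applies to the compressed or uncompressed size is also an assumption).

```python
import io
import zipfile
from typing import Mapping

MAX_SIP_BYTES = 1_000_000_000  # "1GB" from the text, read here as 10**9 bytes (assumption)
MAX_SIP_OBJECTS = 100_000      # "100k objects" from the text


def build_sip(
    manifest_name: str,
    manifest_xml: bytes,
    objects: Mapping[str, bytes],
    content_dir: str = "content",  # directory name is an assumption
) -> bytes:
    """Build a zip SIP holding a manifest and a content directory."""
    if len(objects) > MAX_SIP_OBJECTS:
        raise ValueError(f"too many objects for one SIP: {len(objects)}")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(manifest_name, manifest_xml)
        for name, data in objects.items():
            archive.writestr(f"{content_dir}/{name}", data)
    sip = buffer.getvalue()
    if len(sip) > MAX_SIP_BYTES:
        raise ValueError(f"SIP too large: {len(sip)} bytes")
    return sip
```

A caller that hits either limit would need to split its objects over several SIPs, which fits the packfile-based approach proposed earlier: one or a few packfiles per SIP.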

manifest file

Consists of:

a header: identifies the archive lot and the transfer agreement,
a list of the binary objects in the SIP,
archiving metadata:
- ManagementMetadata: metadata for the whole archiving lot,
- DescriptiveMetadata: logical tree of included ArchiveUnits
declarations of the source and destination services.

A "group of objects" in the SEDA norm is used to model the idea that an original artifact (e.g. a photograph) can be transferred as several objects (e.g. different file formats or resolutions). The Object Group is the unit that encompasses the archived object, even when it is made of multiple artifacts.

An object group is declared using DataObjectGroup. It is mandatory to declare such an object group if the archived object consists of several actual objects; otherwise, it is not recommended.
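The manifest structure described above can be sketched with the standard library XML module. This is a rough skeleton, not a schema-valid manifest: apart from `DataObjectGroup`, `ManagementMetadata`, `DescriptiveMetadata` and `ArchiveUnit`, which are named in the text, the element names (`ArchiveTransfer`, `MessageIdentifier`, `ArchivalAgreement`, `DataObjectPackage`, `BinaryDataObject`, `DataObjectReference`, and so on) follow my reading of SEDA 2.x and should be checked against the actual schema. The XML namespace, checksums and format information are left out, and `build_manifest` and its identifier scheme are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Sequence, Tuple


def build_manifest(
    message_id: str,
    agreement: str,
    archival_agency: str,
    transferring_agency: str,
    units: Sequence[Tuple[str, List[str]]],  # (title, object URIs) per archive unit
) -> bytes:
    """Build a skeletal transfer manifest (element names are assumptions)."""
    root = ET.Element("ArchiveTransfer")
    # Header: identifies the lot and the transfer agreement.
    ET.SubElement(root, "MessageIdentifier").text = message_id
    ET.SubElement(root, "ArchivalAgreement").text = agreement

    package = ET.SubElement(root, "DataObjectPackage")
    descriptive = ET.Element("DescriptiveMetadata")
    for n, (title, uris) in enumerate(units, start=1):
        unit = ET.SubElement(descriptive, "ArchiveUnit", id=f"UNIT{n}")
        ET.SubElement(ET.SubElement(unit, "Content"), "Title").text = title
        reference = ET.SubElement(unit, "DataObjectReference")
        if len(uris) > 1:
            # Several actual objects: an object group is mandatory.
            parent = ET.SubElement(package, "DataObjectGroup", id=f"GROUP{n}")
            ET.SubElement(reference, "DataObjectGroupReferenceId").text = f"GROUP{n}"
        else:
            # Single object: declared directly, without a group.
            parent = package
            ET.SubElement(reference, "DataObjectReferenceId").text = f"OBJ{n}_1"
        for m, uri in enumerate(uris, start=1):
            obj = ET.SubElement(parent, "BinaryDataObject", id=f"OBJ{n}_{m}")
            ET.SubElement(obj, "Uri").text = uri
    # Archiving metadata comes after the list of binary objects.
    package.append(descriptive)
    ET.SubElement(package, "ManagementMetadata")

    # Declarations of the destination and source services.
    for tag, identifier in (
        ("ArchivalAgency", archival_agency),
        ("TransferringAgency", transferring_agency),
    ):
        ET.SubElement(ET.SubElement(root, tag), "Identifier").text = identifier
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

For the SWH use case, each archive unit would presumably reference one packfile and its index file, so the choice between a grouped and an ungrouped declaration depends on whether those two files are modelled as one archived object or two.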

It is possible to add custom fields in the manifest of a SIP, but these need to be declared beforehand in an Ontology.

Migrated from T3415 (view on Phabricator)

Edited Jan 08, 2023 by Phabricator Migration user
