- Mar 28, 2025
-
-
Nicolas Dandrimont authored
-
- Mar 26, 2025
-
-
Nicolas Dandrimont authored
-
Pierre-Yves David authored
We needed to change the way we handle transactions, and there was some impact on content ordering, but nothing too worrisome.
-
- Mar 13, 2025
-
-
Nicolas Dandrimont authored
inventory: when using `storage.revision_log`, a known revision doesn't mean all further revisions are known.
`revision_log` linearizes the history, so in case of merges, if a revision is known, further revisions in the log can be from other branches that haven't been processed yet.
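A minimal sketch of the consequence, assuming a linearized log represented as a plain list of revision identifiers (the function name and the sample identifiers are hypothetical, not taken from the code base): the walk keeps going after a known revision instead of stopping there.

```python
from typing import Iterable, List, Set


def unknown_revisions(log: Iterable[str], known: Set[str]) -> List[str]:
    """Return every revision of a linearized log that is not known yet.

    The whole log is scanned: meeting a known revision does not end the
    walk, because later entries may belong to a merged branch that has
    not been processed.
    """
    return [rev for rev in log if rev not in known]


# Hypothetical log of a merge commit "m" with parents "a" and "b".
# "a" was already processed, "b" comes from the other branch.
log = ["m", "a", "b", "root"]
pending = unknown_revisions(log, known={"a", "root"})
```

Stopping at `"a"` would have left `"b"` out of the inventory, which is the situation described above.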
-
- Feb 17, 2025
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Bump development tools: mypy, codespell, isort, ...
-
- Dec 16, 2024
-
-
David Douard authored
masked_state argument was actually ignored...
-
David Douard authored
Make these 2 'handle-removal-notification' subcommands work with a config file without config entries for graph, journal or search backends, since these are unnecessary for these 2 subcommands.
-
David Douard authored
Instead of hardcoding values.
-
David Douard authored
-
David Douard authored
This makes it possible to apply a removal request received via the notification system without locally recomputing the set of SWHIDs to remove, by using the list sent along with the notification message. A mirror can thus apply the removal notification exactly as it was applied on the main archive and, more importantly, apply it without needing an accessible swh-graph instance.
-
David Douard authored
-
David Douard authored
There are cases (e.g. mirrors or tests) where the data silos from which content should be removed are legitimately unavailable, so make it possible not to have them.
-
David Douard authored
-
- Dec 13, 2024
-
-
Guillaume Samson authored
-
- Nov 22, 2024
-
-
David Douard authored
Make the 2 CLI tools performing actual removal from the archive (remove and resume_removal) send notifications on the 'removal_notification' Kafka topic on completion. Address #20
-
David Douard authored
Otherwise these tests can fail if such a file exists in the working directory of the user.
-
David Douard authored
-
David Douard authored
Fix black/codespell etc.
-
- Nov 21, 2024
-
-
On removals, we will publish a notification on a dedicated Kafka topic. This new notification watcher, which should be run on mirrors, will listen to the topic and, in order:

1. Create a new masking request in the masking proxy database and add a “decision pending” entry for every SWHID listed in the notification.
2. Send an email to the mirror operator with commands to either propagate the removal, mask permanently, or unmask all objects.

See the newly added section in `usage.rst` for a description of the new commands.

Implementation-wise, the passing of information between the mirror notification watcher and the `handle-removal-notification` command group is done via the masking database. Useful information from the removal notification is serialized in YAML to the masking request “reason” field. It is then retrieved when handling the removal notification, along with the list of objects that have been masked.

While this is a bit of an abuse, it feels way simpler than creating a whole new database just for this purpose. We surely will want to revisit this design once the [takedown request processing workflow] has been implemented.

[takedown request processing workflow]: https://gitlab.softwareheritage.org/product-management/swh-archive-website/-/issues/1

Address #20
-
- Jul 18, 2024
-
-
Jérémy Bobbio (Lunar) authored
-
Jérémy Bobbio (Lunar) authored
-
Jérémy Bobbio (Lunar) authored
We used to have two fixtures named `sample_populated_storage`: one with objects coming from `swh.graph.example_dataset`, the other with objects from `swh.storage.tests.storage_data`. The former are easier to reason about and have real relations between objects, but their hashes are synthetic and do not match the objects' actual data. Therefore, they are not fit for use when creating recovery bundles. To avoid ambiguities and help with further reuse, we rename the `sample_populated_storage` used in `test_recovery_bundle.py` to `sample_populated_storage_with_matching_hash` and we move its definition to `conftest.py`. Also add a bunch of assertions to ensure the storage added what was expected.
-
- Jul 11, 2024
-
-
Jérémy Bobbio (Lunar) authored
While 99eed4c0 improved the situation when objects given to `swh alter remove` were missing from the storage, it would display one missing origin or object at a time. In case of extensive takedown requests, this creates a cumbersome trial-and-error process. Instead, we now ensure that all objects requested on the command line exist in storage before performing the inventory step.
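The idea can be sketched as follows. This is an illustration only: `find_missing`, `check_all_present` and the `exists` callback are hypothetical names, and the real lookup against the storage is not shown.

```python
from typing import Callable, Iterable, List


def find_missing(requested: Iterable[str], exists: Callable[[str], bool]) -> List[str]:
    """Collect every requested object that the storage does not know."""
    return [obj for obj in requested if not exists(obj)]


def check_all_present(requested: Iterable[str], exists: Callable[[str], bool]) -> None:
    """Fail once, listing all missing objects, instead of one at a time."""
    missing = find_missing(requested, exists)
    if missing:
        raise ValueError("missing from storage: " + ", ".join(missing))


# Hypothetical stand-in for the storage: a set of known identifiers.
stored = {"swh:1:rev:01", "swh:1:dir:02"}
requested = ["swh:1:rev:01", "swh:1:cnt:03", "swh:1:snp:04"]
missing = find_missing(requested, stored.__contains__)
```

Reporting the full list up front lets the operator fix the whole command line in one pass.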
-
Jérémy Bobbio (Lunar) authored
-
- Jul 10, 2024
-
-
Jérémy Bobbio (Lunar) authored
Some objstorages perform network accesses which sometimes result in transient errors. Instead of giving up the whole removal for a single timeout, we now retry 3 times to delete an object from an objstorage before raising the most recent exception.
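A minimal sketch of such a retry loop, assuming "3 times" means three attempts in total (the helper name, the `delay` parameter and the broad `except Exception` are assumptions made for illustration; the actual code may filter on specific exception types):

```python
import time
from typing import Callable, Optional


def delete_with_retries(delete: Callable[[], None], attempts: int = 3, delay: float = 0.0) -> None:
    """Call `delete` up to `attempts` times, re-raising the most recent error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            delete()
            return  # success: stop retrying
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # optional pause before the next attempt
    assert last_exc is not None
    raise last_exc
```

A transient timeout on the first or second attempt is absorbed; only a failure on the last attempt surfaces to the caller.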
-
Jérémy Bobbio (Lunar) authored
When a Content was missing from the objstorage – because it had been manually erased before or in a case of corruption – we used to crash with a ValueError. Instead, we now display a proper error message. The message suggests using the `--allow-empty-content-objects` flag. When specified, getting no data will display a warning and a Content object with no data will be recorded in the recovery bundle. Closes #22
-
Jérémy Bobbio (Lunar) authored
The time it takes to back up objects varies widely from one object type to the next. Lumping all of them in a single progress bar results in a very inaccurate ETA. Instead, we display a progress bar for each object type we back up.

This required moving progress bar management to `RecoveryBundleCreator.backup_swhids()`, leading to a refactor of `Remover.create_recovery_bundle()`, `iter_swhids_grouped_by_type()` and its other users. The result feels more symmetric with `RecoveryBundle.restore()`, which was already managing its progress bar.

We take the opportunity to use different chunk sizes depending on the type of objects we are backing up. As Content objects have to retrieve data from objstorage, they tend to take longer to back up. We use smaller chunks there so we can update the progress more often.

We also handle a hidden issue where RawExtrinsicMetadata would be wrongly numbered in a RecoveryBundle if they were added in multiple calls to `RecoveryBundleCreator.backup_swhids()`. Trying to do so will now raise an exception.

Finally, objects being backed up are now sorted by SWHID to ease reproducibility.

Closes #23
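The sorting and per-type chunking can be sketched like this. It is not the project's `iter_swhids_grouped_by_type()`: the function name, the chunk sizes and the shortened identifiers below are all hypothetical.

```python
from itertools import groupby, islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical sizes: smaller chunks for Content ("cnt") so that the
# progress display can be refreshed more often.
CHUNK_SIZES = {"cnt": 2}
DEFAULT_CHUNK_SIZE = 3


def swhid_type(swhid: str) -> str:
    """Return the object type, the third colon-separated field of a SWHID."""
    return swhid.split(":")[2]


def chunked_by_type(swhids: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (object type, chunk) pairs, with SWHIDs sorted inside each type."""
    ordered = sorted(swhids, key=lambda s: (swhid_type(s), s))
    for object_type, group in groupby(ordered, key=swhid_type):
        size = CHUNK_SIZES.get(object_type, DEFAULT_CHUNK_SIZE)
        iterator = iter(group)
        while chunk := list(islice(iterator, size)):
            yield object_type, chunk


swhids = ["swh:1:cnt:03", "swh:1:dir:01", "swh:1:cnt:01", "swh:1:cnt:02"]
chunks = list(chunked_by_type(swhids))
```

Because the input is sorted before grouping, two runs over the same set of SWHIDs produce the same chunks in the same order.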
-
Jérémy Bobbio (Lunar) authored
Let me have the `match` statement, pretty please: https://docs.python.org/dev/reference/compound_stmts.html#the-match-statement
-
Jérémy Bobbio (Lunar) authored
-
- Jul 08, 2024
-
-
Jérémy Bobbio (Lunar) authored
Instead of asking “Proceed with removing 98 SWHIDs?” which turns out to be hard to interpret, we now display the number of objects to remove for each type having a SWHID. We also mention that some objects do not have a SWHID (as they will add to the counts shown later). We also now prefer to display these numbers rather than a list of SWHIDs when `--dry-run=stop-before-recovery-bundle` is specified. Getting the list of SWHIDs can be done using `swh alter list-candidates`.
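A rough sketch of per-type counting, assuming SWHIDs are handled as plain strings (the helper names and the summary wording are hypothetical and do not reproduce the actual prompt):

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_type(swhids: Iterable[str]) -> Dict[str, int]:
    """Count SWHIDs per object type (the third colon-separated field)."""
    return dict(Counter(swhid.split(":")[2] for swhid in swhids))


def format_summary(counts: Dict[str, int]) -> str:
    """Render the counts in a stable, alphabetical order of type."""
    return ", ".join(f"{number} {object_type}" for object_type, number in sorted(counts.items()))


counts = count_by_type(["swh:1:cnt:01", "swh:1:cnt:02", "swh:1:rev:01"])
summary = format_summary(counts)
```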
-
Jérémy Bobbio (Lunar) authored
Enumerating the RawExtrinsicMetadata objects that need to be removed can be quite a lengthy operation. We now display a progress bar (or several if some RawExtrinsicMetadata objects reference other RawExtrinsicMetadata objects).
-
Jérémy Bobbio (Lunar) authored
We forgot that `Lister._iter_inventory_candidates()` would end the iteration when 0 objects remain to be looked up. This would result in moving to the removable step while still showing a progressbar displaying “4 left to look up”. We now always update the progress bar saying that we are done after leaving the loop in `Lister.inventory_candidates()`.
-
- Jul 03, 2024
-
-
Guillaume Samson authored
-
- Jun 27, 2024
-
-
Jérémy Bobbio (Lunar) authored
Once in a while, objects are missing from the storage. It should not happen but it does: previous removals, replayer errors, corruptions…

`swh alter remove` and `swh alter recovery-bundle resume-removal` now accept the new options `--known-missing` and `--known-missing-file` to handle missing objects. `--known-missing` is followed by a SWHID and can appear multiple times. `--known-missing-file` is followed by the path (or `-` for stdin) of a file with one SWHID per line (skipping empty lines or lines starting with `#`).

Known missing objects will be skipped during the “inventory” phase and marked as unremovable during the “removable” phase. Due to the latter, they will not be used when creating the recovery bundle, and are therefore skipped from removal as well.
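The file format described above can be read with a few lines of Python. This is a sketch, not the project's parser: the function name is hypothetical, and stripping surrounding whitespace before testing for `#` is an assumption.

```python
import io
from typing import Iterable, List


def read_known_missing(lines: Iterable[str]) -> List[str]:
    """Return the SWHIDs listed one per line, ignoring blanks and comments."""
    swhids = []
    for line in lines:
        line = line.strip()  # assumption: surrounding whitespace is ignored
        if not line or line.startswith("#"):
            continue
        swhids.append(line)
    return swhids


# Hypothetical file content, with shortened identifiers.
sample = io.StringIO(
    "# lost during an earlier cleanup\n"
    "swh:1:cnt:01\n"
    "\n"
    "swh:1:rev:02\n"
)
known_missing = read_known_missing(sample)
```

Any iterable of lines works here, so the same function covers both an opened file and `sys.stdin`.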
-
Jérémy Bobbio (Lunar) authored
`StorageInterface.revision_log()` returns an `Iterable` of `Optional`s to make room for missing objects in a revision log. We were not properly handling None values in `Lister._add_edges_using_storage_for_revision()`.
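A minimal sketch of the kind of guard involved, assuming each non-`None` log entry is a dictionary with `id` and `parents` keys (the function name and the sample data are hypothetical):

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def iter_parent_edges(
    log: Iterable[Optional[Dict[str, Any]]]
) -> Iterator[Tuple[str, str]]:
    """Yield (revision, parent) edges, skipping entries the storage lacks."""
    for entry in log:
        if entry is None:
            # A missing revision: nothing to add, and indexing it would fail.
            continue
        for parent in entry["parents"]:
            yield entry["id"], parent


# Hypothetical log where the second revision is missing from the storage.
log = [
    {"id": "r3", "parents": ["r2"]},
    None,
    {"id": "r1", "parents": []},
]
edges = list(iter_parent_edges(log))
```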
-
Jérémy Bobbio (Lunar) authored
-
Jérémy Bobbio (Lunar) authored
When using `swh alter remove` or `swh alter list-candidates`, we used to only display a warning when an origin could not be found in the storage. But in both cases, it is more likely to be an error on the command line than anything else. So instead of offering to pursue the removal operation, let’s display an error and exit.
-
Jérémy Bobbio (Lunar) authored
We now display progressbars on stderr so their messages will not appear before the list of SWHIDs.
-
Jérémy Bobbio (Lunar) authored
Recovery bundle manifests now contain a list of objects outside the bundle that are referenced by objects in the bundle. We make use of this list before restoring a bundle to see if any of these objects are missing from the storage. If this is the case, a confirmation is required before proceeding as it will create “dangling pointers” in the Merkle DAG.
-