Introduce object masking functionality

MR co-authored with @vlorentz

The two main commits for this change are:

Introduce a PostgreSQL storage schema and API for object masking information

This is a simple database of the SWHIDs of objects for which we have made a policy decision to restrict the diffusion without removing them from the archive, and a lightweight history structure for the associated object masking requests.

Doing this as an overlay, instead of modifying the storage schema for all objects, lets us start separating the concerns of archiving origins (which requires a full view of all the unmodified objects stored in the archive) from the concerns of disseminating those archived objects.

To avoid interfering with archival, the masking policy will only be applied on full object retrieval. It will be implemented as a new proxy storage, placed in front of all public-facing storages.

Introduce the object masking proxy storage

This new masking proxy storage intercepts all information retrieval from the underlying storage, and matches the SWHIDs of returned objects to the contents of the masking database.

For simplicity, when any of the returned objects matches the masking database, a non-retryable MaskedObjectException is raised. It carries a dict mapping the masked SWHIDs to information about the masking request, including an opaque id and a masking state (temporary or permanent). It is up to the client to process this exception and display the information in a useful manner. If necessary, a client fetching a batch that mixes masked and non-masked objects can extract the ids of the masked objects from the exception and retry the request for the non-masked objects only. If this usage becomes prevalent, it could be implemented as one more proxy.
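The client-side retry described above can be sketched as follows. This is a minimal illustration, not the code in this MR: the `MaskedStatus` fields, the `masked` attribute and the `fetch_unmasked` helper are hypothetical names, and the sketch assumes the exception lists every masked SWHID of the batch, so a single retry is enough.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class MaskedStatus:
    """Hypothetical shape of the per-object masking information."""

    request_id: str  # opaque id of the masking request
    state: str  # "temporary" or "permanent"


class MaskedObjectException(Exception):
    """Non-retryable error carrying a dict of masked SWHID -> status."""

    def __init__(self, masked: Dict[str, MaskedStatus]) -> None:
        super().__init__(f"{len(masked)} masked object(s) in response")
        self.masked = masked


def fetch_unmasked(
    fetch: Callable[[List[str]], list], swhids: Sequence[str]
) -> Tuple[list, Dict[str, MaskedStatus]]:
    """Fetch a batch; if some objects are masked, retry once without them.

    Returns the objects that could be fetched and the masking information
    for the ones that could not. `fetch` stands in for any storage call
    that takes a list of SWHIDs (an assumption of this sketch).
    """
    try:
        return fetch(list(swhids)), {}
    except MaskedObjectException as exc:
        remaining = [s for s in swhids if s not in exc.masked]
        objects = fetch(remaining) if remaining else []
        return objects, exc.masked
```

A caller would then display the returned masking information next to the objects it did manage to fetch, instead of failing the whole batch.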

When an object's SWHID (or a list thereof) is passed as an argument to a storage function, we first call the underlying function to check that the object exists, and only then attempt to match it against the masking database. This avoids leaking information out of the masking database until it's absolutely needed, which prevents potential issues after a content removal has been processed.
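The ordering of the two checks can be illustrated with a toy proxy. The class name, the `get_objects` method and the `query_masked` callable are assumptions made for this sketch and do not reflect the actual storage API; missing objects are represented by `None`.

```python
from typing import Callable, Dict, List, Optional


class MaskedObjectException(Exception):
    """Raised when a returned object matches the masking database."""

    def __init__(self, masked: Dict[str, str]) -> None:
        super().__init__("masked objects in response")
        self.masked = masked


class MaskingProxySketch:
    """Toy proxy: call the backend first, consult the masking db second."""

    def __init__(
        self, storage, query_masked: Callable[[List[str]], Dict[str, str]]
    ) -> None:
        self._storage = storage  # hypothetical backend exposing get_objects()
        self._query_masked = query_masked  # hypothetical masking db lookup

    def get_objects(self, swhids: List[str]) -> List[Optional[bytes]]:
        # 1. Ask the underlying storage; absent objects come back as None.
        results = self._storage.get_objects(swhids)
        # 2. Only SWHIDs of objects that actually exist reach the masking
        #    database, so it reveals nothing about objects that are gone.
        existing = [s for s, obj in zip(swhids, results) if obj is not None]
        masked = self._query_masked(existing) if existing else {}
        if masked:
            raise MaskedObjectException(masked)
        return results
```

With this ordering, a request for an object that has been removed from the archive looks the same as a request for an object that never existed, whatever the masking database still records about it.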

For now, our implementation does not consider that the SWHIDs of masked objects themselves need to be masked. For instance, an unmasked Directory containing masked Contents can still be listed. Only accessing the data of the masked Content object itself raises a MaskedObjectException. This choice was made to limit the impact of masked objects on the overall archive navigation experience.
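A toy model of this behavior, under the assumption that only the object named in the request is matched against the masking database. `ToyArchive`, `list_directory` and `read_content` are hypothetical names used for illustration only.

```python
from typing import Dict, List, Set


class MaskedObjectException(Exception):
    """Raised when the requested object itself is masked."""

    def __init__(self, masked: Set[str]) -> None:
        super().__init__("masked objects in response")
        self.masked = masked


class ToyArchive:
    """In-memory stand-in for a storage sitting behind the masking proxy."""

    def __init__(
        self,
        directories: Dict[str, List[str]],
        contents: Dict[str, bytes],
        masked: Set[str],
    ) -> None:
        self.directories = directories
        self.contents = contents
        self.masked = masked

    def _check(self, swhid: str) -> None:
        if swhid in self.masked:
            raise MaskedObjectException({swhid})

    def list_directory(self, directory_swhid: str) -> List[str]:
        entries = self.directories[directory_swhid]
        # Only the directory itself is matched; the SWHIDs of its entries
        # are returned even when some of those entries are masked.
        self._check(directory_swhid)
        return entries

    def read_content(self, content_swhid: str) -> bytes:
        data = self.contents[content_swhid]
        # Accessing the data of a masked Content raises.
        self._check(content_swhid)
        return data
```

Navigation through the unmasked directory keeps working, and the exception only surfaces once a client asks for the masked content's data.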

TODO

  • more bikeshedding of the schema / more docs of decisions made
  • actual testing of the object masking behavior (for now we've only gone as far as testing that all the "normal" storage tests go through when no objects are masked, and that some exceptions happen when we mask some objects)
  • cli for performing masking operations (!1112 (merged))
  • (mild) performance testing
    • we've replaced the __getattr__ lru cache with something smarter, at least...
    • add overhead measurement
  • caching / bloom filtering (will happen in a follow up MR)
    • extend the query interface to prime and refresh a local cache
      • we could remember the greatest date in the history table as a proxy for the freshness of the data
    • obvious win in terms of performance if we can avoid 99.9% of negative queries using a few MB of RAM, even if it's for each read-only storage RPC server thread.
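As a rough illustration of the caching idea in the last TODO item, here is a sketch of a Bloom filter placed in front of the masking database. Everything in it is an assumption for discussion purposes: the three callables (`list_masked`, `query_masked`, `latest_change`) stand in for a query interface that does not exist yet, and the filter size and hash count are arbitrary defaults.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


class BloomFilter:
    """Fixed-size Bloom filter; the default of 8 Mbit uses 1 MiB of RAM."""

    def __init__(self, size_bits: int = 8 * 1024 * 1024, num_hashes: int = 7) -> None:
        self._size = size_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key: str) -> List[int]:
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self._size for i in range(self._num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self._bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key: str) -> bool:
        return all(
            self._bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key)
        )


class CachedMaskLookup:
    """Skip the masking database for SWHIDs the filter has never seen."""

    def __init__(
        self,
        list_masked: Callable[[], Iterable[str]],  # hypothetical: all masked SWHIDs
        query_masked: Callable[[List[str]], Dict[str, str]],  # hypothetical db query
        latest_change: Callable[[], object],  # hypothetical: greatest history date
    ) -> None:
        self._list_masked = list_masked
        self._query_masked = query_masked
        self._latest_change = latest_change
        self._primed = False
        self._seen: object = None
        self._filter = BloomFilter()
        self.refresh()

    def refresh(self) -> bool:
        """Rebuild the filter if the history table changed; report whether it did."""
        latest = self._latest_change()
        if self._primed and latest == self._seen:
            return False
        fresh = BloomFilter()
        for swhid in self._list_masked():
            fresh.add(swhid)
        self._filter, self._seen, self._primed = fresh, latest, True
        return True

    def __call__(self, swhids: List[str]) -> Dict[str, str]:
        # The filter has no false negatives, so anything it rejects is
        # certainly not masked. Possible false positives are resolved by
        # the real database query.
        candidates = [s for s in swhids if s in self._filter]
        return self._query_masked(candidates) if candidates else {}
```

Because a Bloom filter cannot express removals, this sketch rebuilds the filter from scratch whenever the freshness marker changes. Whether that is acceptable depends on how often masking requests are updated.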
Edited by Nicolas Dandrimont
