    Test removal and restore in Kafka · 848f89cc
    Jérémy Bobbio (Lunar) authored
    Removing an object from Kafka requires writing a new message with
    the same key as previously used and an empty value. These tombstones
    are later “compacted” away, depending on the topic settings
    `cleanup.policy`, `max.compaction.lag.ms` and `delete.retention.ms`.
    
    To test whether an object is present in or absent from Kafka, we thus
    need to find which message is the most recent for its key: a tombstone
    or a value. To do so, we parse all messages into a single dict, mapping
    each SWHID to the latest message timestamp and to whether the object
    should be considered present or absent. This sadly costs some time and
    memory, but at least we get accurate results.
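
The inventory described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the `Message` type, its fields and the `inventory()` function are hypothetical names, and keys are assumed to be SWHIDs already decoded to strings.

```python
from typing import Iterable, NamedTuple, Optional


class Message(NamedTuple):
    """Hypothetical, simplified view of a Kafka message."""

    key: str                # assumed to be a SWHID string
    value: Optional[bytes]  # None or b"" marks a tombstone
    timestamp: int          # message timestamp in milliseconds


def inventory(messages: Iterable[Message]) -> dict:
    """Map each key to (latest timestamp, present?)."""
    state = {}
    for msg in messages:
        # An empty value means the object has been removed.
        present = bool(msg.value)
        previous = state.get(msg.key)
        # Keep only the most recent message seen for each key;
        # on equal timestamps, the message read last wins.
        if previous is None or msg.timestamp >= previous[0]:
            state[msg.key] = (msg.timestamp, present)
    return state
```

Every message has to be read before an object can be declared present or absent, which is where the time and memory cost comes from.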
    
    While not strictly necessary, we now use a topic configuration in
    Kafka that will aggressively try to remove “dead” messages. It
    should slightly reduce the time needed to inventory objects as
    described above.
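
As an illustration only, such a configuration could be expressed like this. The three setting names are the ones mentioned earlier in this message; the values, the `AGGRESSIVE_COMPACTION` name and the `as_config_entries()` helper are assumptions made for the example, not the values used by this commit.

```python
# Hypothetical values, chosen only to illustrate "aggressive" compaction.
AGGRESSIVE_COMPACTION = {
    # Compact the log instead of only deleting old segments.
    "cleanup.policy": "compact",
    # Upper bound on how long a message may stay uncompacted.
    "max.compaction.lag.ms": "5000",
    # How long tombstones are retained once compacted.
    "delete.retention.ms": "1000",
}


def as_config_entries(config):
    """Render settings as sorted `key=value` strings."""
    return [f"{key}={value}" for key, value in sorted(config.items())]
```

Entries in this `key=value` form are what Kafka's topic tooling typically takes for per-topic overrides.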
    
    We use the match syntax introduced in Python 3.10 in `handle_message()`,
    so we bump black compatibility settings to Python 3.11.
    
    Depends on swh/devel/swh-alter!7 (and a new release thereafter)