  1. Nov 09, 2021
    • Add support for a redis-based reporting for invalid mirrored objects · 850a7553
      David Douard authored
      The idea is that we check the BaseModel validity at journal
      deserialization time so that we still have access to the raw object from
      kafka for complete reporting (object id plus raw message from kafka).
      
      This uses a new ModelObjectDeserializer class that is responsible for
      deserializing the kafka message (still using kafka_to_value) and then
      immediately creating the BaseModel object from that dict. Its `convert`
      method is then passed as the `value_deserializer` argument of the
      `JournalClient`.
      
      Then, for each deserialized object from kafka, if it's a HashableObject,
      check its validity by comparing the computed hash with its id.
      
      If it's invalid, report the error in logs, and if configured, register the
      invalid object via the `reporter` callback.
      
      In the cli code, a `Redis.set()` is used as such a callback (if configured).
      So it simply stores invalid objects using the object id as key (typically its
      swhid), and the raw kafka message value as value.
      
      Related to T3693.
      v0.40.0
      850a7553
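The flow described in this commit message can be sketched as follows. This is a minimal illustration, not the actual swh.storage code: `HashableThing` and `SketchDeserializer` are hypothetical names, JSON stands in for `kafka_to_value`, a plain dict stands in for Redis, and returning `None` for an invalid object is an assumption of this sketch.

```python
import hashlib
import json
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class HashableThing:
    """Hypothetical stand-in for a hashable model object."""

    id: bytes
    payload: str

    def compute_hash(self) -> bytes:
        # The id is expected to be the SHA-1 of the payload in this sketch.
        return hashlib.sha1(self.payload.encode()).digest()


class SketchDeserializer:
    """Deserialize a raw message, build the object, and check its validity."""

    def __init__(self, reporter: Optional[Callable[[str, bytes], None]] = None):
        # reporter(key, raw_message) is called for each invalid object, if set.
        self.reporter = reporter

    def convert(self, object_type: str, msg: bytes) -> Optional[HashableThing]:
        data = json.loads(msg)  # stand-in for kafka_to_value
        obj = HashableThing(id=bytes.fromhex(data["id"]), payload=data["payload"])
        if obj.compute_hash() != obj.id:
            key = obj.id.hex()
            logger.error("Invalid %s object: %s", object_type, key)
            if self.reporter is not None:
                # The raw message is still available here, so it can be reported.
                self.reporter(key, msg)
            return None  # assumption: invalid objects are dropped
        return obj


# A dict's __setitem__ has the same (key, value) shape as a set() callback.
invalid_store = {}
deserializer = SketchDeserializer(reporter=invalid_store.__setitem__)
```

Doing the check inside the deserializer is what keeps the raw message at hand; once the message has been turned into a model object and passed downstream, the original bytes are no longer available for reporting.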
    • Refactor fixer.fix_objects() to extract the inner object_fixers dict · 04bd15a0
      David Douard authored
      so that this dict can be used independently of the fix_objects() function.
      04bd15a0
    • Remove now useless fixers · d655c858
      David Douard authored
      keep the fix_objects() function for bw compat for now.
      d655c858
    • Add a --type option to 'swh storage replay' · 55eed77b
      David Douard authored
      allows choosing the replayed object types from the cli.
      55eed77b
    • Update extrinsic-metadata-specification.rst to match the current implementation · 0262f1c1
      vlorentz authored
      * merged origin and artifact metadata
      * added metametadata
      * uses structures instead of dict
      * removed raw_extrinsic_metadata_get_latest
      0262f1c1
  2. Oct 28, 2021
    • interface: Add origin_snapshot_get_all method · a5bfe5b5
      Antoine Lambert authored
      It efficiently returns the list of unique snapshot identifiers
      resulting from the visits of an origin.
      
      Previously it was required to query all visits of an origin then query
      all visit statuses for each visit to extract such information.
      
      The new method extracts origin snapshot information in
      a single database query.
      
      Related to T3631
      v0.39.0
      a5bfe5b5
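The difference between the per-visit approach and a single query can be sketched with an in-memory SQLite database. The table layout, column names and the helper `snapshot_ids_for_origin` are hypothetical and only illustrate the idea; they are not the swh.storage schema or API.

```python
import sqlite3

# Hypothetical, simplified layout: one row per visit status.
ROWS = [
    ("https://example.org/repo", 1, "snp1"),
    ("https://example.org/repo", 1, "snp1"),  # same snapshot seen twice
    ("https://example.org/repo", 2, "snp2"),
    ("https://example.org/repo", 3, None),    # visit without snapshot
    ("https://example.org/other", 1, "snp9"),
]


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE origin_visit_status (origin TEXT, visit INTEGER, snapshot TEXT)"
    )
    conn.executemany("INSERT INTO origin_visit_status VALUES (?, ?, ?)", ROWS)
    return conn


def snapshot_ids_for_origin(conn: sqlite3.Connection, origin: str) -> list:
    """Return the unique snapshot ids of an origin with one query."""
    cur = conn.execute(
        "SELECT DISTINCT snapshot FROM origin_visit_status "
        "WHERE origin = ? AND snapshot IS NOT NULL",
        (origin,),
    )
    return sorted(row[0] for row in cur)


conn = make_db()
unique_snapshots = snapshot_ids_for_origin(conn, "https://example.org/repo")
```

Letting the database deduplicate avoids one round trip per visit, which is the cost the commit message attributes to the previous approach.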
  3. Oct 22, 2021
  4. Oct 21, 2021
  5. Oct 18, 2021
  6. Oct 11, 2021
  7. Oct 08, 2021
    • buffer: add a threshold for the estimated size of revision and release batches · b6040142
      Nicolas Dandrimont authored
      The size of individual revisions and releases is essentially unbounded.
      This means that, when the buffer storage is used as a way of limiting
      memory use for an ingestion process, it is still possible to go beyond
      the expected memory use when adding a batch of revisions or releases
      with large messages or other metadata.
      
      The duration of the database operations for revision_add or release_add is also
      commensurate with the size of the objects added in a batch, so
      using the buffer proxy to limit the time individual database operations
      take was not effective.
      
      Adding a threshold on estimated sizes for batches of revision and
      release objects makes this overuse of memory and of database transaction
      time much less likely.
      b6040142
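The idea of flushing on an estimated size as well as on an object count can be sketched as below. `SizeBoundedBuffer` and its parameters are hypothetical names chosen for this illustration, and `len` on bytes is only a placeholder size estimate; this is not the buffer proxy's actual implementation.

```python
from typing import Callable, List


class SizeBoundedBuffer:
    """Collect objects and flush when a count or size threshold is reached."""

    def __init__(
        self,
        flush_batch: Callable[[List[bytes]], None],
        max_count: int,
        max_bytes: int,
        estimate_size: Callable[[bytes], int] = len,
    ):
        self._flush_batch = flush_batch
        self._max_count = max_count
        self._max_bytes = max_bytes
        self._estimate_size = estimate_size
        self._items: List[bytes] = []
        self._estimated_bytes = 0

    def add(self, obj: bytes) -> None:
        self._items.append(obj)
        self._estimated_bytes += self._estimate_size(obj)
        # Either threshold triggers a flush, so a few huge objects
        # cannot accumulate just because the count is still low.
        if (
            len(self._items) >= self._max_count
            or self._estimated_bytes >= self._max_bytes
        ):
            self.flush()

    def flush(self) -> None:
        if not self._items:
            return  # nothing buffered, nothing sent
        batch, self._items = self._items, []
        self._estimated_bytes = 0
        self._flush_batch(batch)


sent_batches: List[List[bytes]] = []
buf = SizeBoundedBuffer(sent_batches.append, max_count=100, max_bytes=10)
for message in (b"abc", b"defgh", b"ijkl"):
    buf.add(message)
```

In this usage the third `add` brings the estimated size to 12 bytes, above the 10-byte threshold, so one batch of three objects is sent even though the count threshold of 100 is far from reached.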
    • buffer: add a threshold for the number of revision parents in one batch · 7c5b0ec1
      Nicolas Dandrimont authored
      The size of individual revisions is essentially unbounded. This means
      that, when the buffer storage is used as a way of limiting memory use
      for an ingestion process, it is still possible to go beyond the expected
      memory use when adding a batch of revisions with extensive histories.
      
      The duration of the database operation for revision_add is also
      commensurate with the number of revision parents added in a batch, so
      using the buffer proxy to limit the time individual database operations
      take was not effective.
      
      Adding a threshold on the cumulative number of revision parents per batch
      makes this overuse of memory and of database transaction time much less
      likely.
      7c5b0ec1
    • buffer: add a threshold for the number of directory entries in one batch · 5edc0ba7
      Nicolas Dandrimont authored
      The size of individual directories is essentially unbounded. This means
      that, when the buffer storage is used as a way of limiting memory use
      for an ingestion process, it is still possible to go beyond the expected
      memory use when adding a batch of (very) large directories.
      
      The duration of the database operation for directory_add is also
      commensurate with the number of directory entries added in a batch, so
      using the buffer proxy to limit the time individual database operations
      take was not effective.
      
      Adding a threshold on the cumulative number of directory entries per batch
      makes this overuse of memory and of database transaction time much less
      likely.
      5edc0ba7
    • abe95b34
    • buffer: Ensure that we don't send data from empty buffers · 5d5d4c94
      Nicolas Dandrimont authored
      This was already the case (as grouper called on an empty iterator just
      returns no batches), but add a test to enforce it.
      5d5d4c94
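The property relied upon here, that a grouper called on an empty iterator yields no batches, can be illustrated with a small stand-alone version. This `grouper` is a sketch written for this example with `itertools.islice`; it is not necessarily how the helper used by the buffer storage is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouper(iterable: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield successive batches of at most n items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            # Exhausted input: stop without yielding an empty batch.
            return
        yield batch


def send_all(objects: Iterable[T], n: int, send) -> int:
    """Send objects in batches and return the number of send() calls."""
    calls = 0
    for batch in grouper(objects, n):
        send(batch)
        calls += 1
    return calls


sent: List[List[int]] = []
calls_for_empty = send_all([], 3, sent.append)
calls_for_five = send_all(range(5), 2, sent.append)
```

Because the loop body never runs for an empty input, no call reaches the backend in that case, which is the behaviour the added test pins down.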
  8. Sep 29, 2021
  9. Sep 28, 2021
  10. Sep 27, 2021
  11. Sep 23, 2021
  12. Sep 21, 2021
  13. Sep 16, 2021
    • postgresql: Fix get_snapshot_branches return value for empty search · 9465054e
      Antoine Lambert authored
      When searching for branches in an existing snapshot, a PartialBranches
      object must be returned regardless of the number of found branches.
      
      None should only be returned when a snapshot does not exist.
      
      This fixes an inconsistency between the postgresql and cassandra
      backends.
      
      Related to T3413
      9465054e
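The return-value contract described in this commit can be sketched as follows. `PartialBranchesSketch`, `SNAPSHOTS` and `get_branches` are hypothetical names for this illustration, and substring matching is a simplification; none of this is the storage backend's actual code.

```python
from typing import Dict, Optional, TypedDict


class PartialBranchesSketch(TypedDict):
    """Hypothetical, reduced stand-in for a PartialBranches object."""

    id: str
    branches: Dict[str, str]


# Hypothetical in-memory snapshot store: snapshot id -> branch name -> target.
SNAPSHOTS: Dict[str, Dict[str, str]] = {
    "snp1": {"refs/heads/main": "rev1", "refs/tags/v1": "rel1"},
}


def get_branches(snapshot_id: str, pattern: str) -> Optional[PartialBranchesSketch]:
    branches = SNAPSHOTS.get(snapshot_id)
    if branches is None:
        # Only a missing snapshot yields None.
        return None
    matched = {name: target for name, target in branches.items() if pattern in name}
    # An existing snapshot always yields an object, even with no match.
    return {"id": snapshot_id, "branches": matched}


no_match = get_branches("snp1", "does-not-exist")
missing = get_branches("unknown", "main")
```

Keeping the two cases distinct lets a caller tell "the snapshot exists but nothing matched" apart from "there is no such snapshot", whichever backend answers.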
  14. Sep 15, 2021
  15. Sep 14, 2021
  16. Sep 10, 2021
  17. Sep 09, 2021
  18. Sep 08, 2021
  19. Sep 06, 2021
  20. Sep 03, 2021
  21. Aug 31, 2021