Fix crash when indexing the same directory twice with non-deterministic order

persist_index_computations deduplicated row entries based on the entire content of the row, but PostgreSQL enforces that 'id' must be unique.

This was not an issue in older versions of swh-indexer, because all operations were deterministic for a given input directory.

The recent switch to rdflib introduced non-determinism, so different outputs may be returned for the same directory id. As a result, whole-row deduplication was no longer enough to avoid duplicate ids.

With this commit, deduplication is now done on 'id', as expected.

As a side-effect, persist_index_computations is now more efficient because:

  1. it runs in linear time instead of quadratic in the number of metadata items
  2. it only compares dir ids, instead of the content of indexed metadata (which is arbitrarily large JSON-like data)
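The difference between the two strategies can be sketched as follows. This is a minimal illustration, not the code from this merge request: the function names `dedup_by_content` and `dedup_by_id`, the dict-based row shape, and the "last row wins" policy are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical row shape: a dict with an 'id' key and arbitrary metadata.
Row = Dict[str, Any]


def dedup_by_content(rows: Iterable[Row]) -> List[Row]:
    """Whole-row deduplication (illustrative): quadratic, and two rows
    with the same 'id' but different metadata are both kept."""
    unique: List[Row] = []
    for row in rows:
        if row not in unique:  # compares the full, possibly large, content
            unique.append(row)
    return unique


def dedup_by_id(rows: Iterable[Row]) -> List[Row]:
    """Deduplication on 'id' (illustrative): linear, one row per id.
    Assumption: a later row replaces an earlier one with the same id."""
    by_id: Dict[bytes, Row] = {}
    for row in rows:
        by_id[row["id"]] = row  # only the id is hashed and compared
    return list(by_id.values())


# Same directory id indexed twice, with list items in a different order.
rows = [
    {"id": b"\x01", "metadata": {"name": ["a", "b"]}},
    {"id": b"\x01", "metadata": {"name": ["b", "a"]}},
    {"id": b"\x02", "metadata": {}},
]
```

With these rows, `dedup_by_content` keeps both entries for `b"\x01"`, which is the situation that would violate a unique constraint on 'id', while `dedup_by_id` keeps exactly one row per id.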

Migrated from D8417 (view on Phabricator)

Activity

  • Build is green

    Patch application report for D8417 (id=30363)

    Could not rebase; Attempt merge onto 44879ab5...

    Updating 44879ab..b6385ce
    Fast-forward
     swh/indexer/metadata.py                            | 52 +++++++++++++-------
     swh/indexer/metadata_dictionary/github.py          |  6 ++-
     .../tests/metadata_dictionary/test_github.py       | 20 ++++++--
     swh/indexer/tests/test_metadata.py                 | 29 ++++++++++++
     swh/indexer/tests/test_origin_metadata.py          | 55 +++++++++++++++++++++-
     5 files changed, 139 insertions(+), 23 deletions(-)
    Changes applied before test
    commit b6385cec11e291e4aa617c83f98d8c2118f3325c
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Thu Sep 8 11:35:46 2022 +0200
    
        Fix crash when indexing the same directory twice with non-deterministic order
        
        persist_index_computations deduplicated row entries based on the entire
        content of the row; but postgresql enforces the 'id' should be unique.
        
        This was not an issue in older version of swh-indexer, because all
        operations were deterministic, given a specific directory as input.
        
        The recent switch to rdflib introduced non-determinism, so different
        outputs may be returned for the same directory id; causing the
        deduplication to not be good enough to avoid duplicate ids.
        
        With this commit, deduplication is now done on 'id', as expected.
        
        As a side-effect, persist_index_computations is now more efficient
        because:
        
        1. it runs in linear time instead of quadratic in the number of
           metadata items
        2. it only compares dir ids, instead of the content of indexed metadata
           (which is arbitrarily large JSON-like data)
    
    commit dd0274193f5228ded709fb150c1685ddaafeed73
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Thu Sep 8 11:12:56 2022 +0200
    
        github: Add support for 'topics'
    
    commit befdbd7efd46dd052b8728215deeeb3f775c34d0
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Mon Sep 5 15:48:40 2022 +0200
    
        Fix crash when RawExtrinsicMetadata target new origins
        
        RawExtrinsicMetadata contain a swh:1:ori: identifier of the origin,
        which the indexer needs to resolve, by querying its storage replica.
        
        Because RawExtrinsicMetadata are created by loaders, they are often
        created shortly after the origin is created by the corresponding lister,
        so the origin may not be known to the storage replica used by the
        indexer, causing this function to crash.
        
        Waiting 10s seems to be good enough when run on my computer with
        production data and moma's replica; so I set it to 60s just to be safe.
    
    commit 68940cfccfed258620cc116bedd6598fd9b28df4
    Author: Valentin Lorentz <vlorentz@softwareheritage.org>
    Date:   Mon Sep 5 14:21:50 2022 +0200
    
        Fix crash when RawExtrinsicMetadata objects have the same target
        
        ... and they are processed in the same batch.
        
        The last one received takes precedence, as it is likely to be more
        up-to-date

    See https://jenkins.softwareheritage.org/job/DCIDX/job/tests-on-diff/483/ for more details.

  • Merge request was accepted

  • Antoine R. Dumont approved this merge request

  • Author Maintainer

    Merge request was merged

  • closed
