Skip to content
Snippets Groups Projects

retry: Make content_add endpoints maximize content writes to storage

In effect, allows to write all but the colliding contents to the storage. This is the same behavior currently existing in the journal replayer (swh-journal!124 (closed)). This shares the behavior within the retry proxy as this is a common need, for example in loaders as well (swh-loader-core!379 (closed)).

And the colliding hashes are stored in the log in a formatted way so we do not lose that information.

Related to swh-journal!124 (closed) Related to swh-loader-core!379 (closed)

Note: It could go in an entirely different proxy. That would maximize composition through configuration and not entangle the retry and that new behavior (which is somehow a retry, thus why it's proposed here in the first place ;)

Test Plan

tox


Migrated from D3012 (view on Phabricator)

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Build is green

    Patch application report for D3012 (id=10681)

    Rebasing onto ddac3d27...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 15c1caa21862d5f7973a37599e338c4e7186a40f
    Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
    Date:   Fri Apr 10 12:03:04 2020 +0200
    
        retry: Make content_add endpoints maximize content writes
        
        In effect, allow to write all but the colliding contents to the storage. This
        is the same behavior currently existing in the journal replayer. This shares
        the behavior within the retry proxy as this is a common need, for example in
        loaders.
        
        And the colliding hashes are stored in the log in a formatted way so we do not
        lose that information.
        
        Related to swh/devel/swh-journal!124
        Related to swh/devel/swh-loader-core!379

    See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/89/ for more details.

  • Rebase to latest master

  • Nice. I always felt this was needed, but never actually did it.

    You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up. Is this intentional, and did you discuss it with the others?

  • Merge request was returned for changes

  • Build is green

    Patch application report for D3012 (id=10709)

    Rebasing onto e5e59435...

    Current branch diff-target is up to date.
    Changes applied before test
    commit cdd8fac799a29fe5370063a47fcb6ae8e9558eb3
    Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
    Date:   Fri Apr 10 12:03:04 2020 +0200
    
        retry: Make content_add endpoints maximize content writes
        
        In effect, allow to write all but the colliding contents to the storage. This
        is the same behavior currently existing in the journal replayer. This shares
        the behavior within the retry proxy as this is a common need, for example in
        loaders.
        
        And the colliding hashes are stored in the log in a formatted way so we do not
        lose that information.
        
        Related to swh/devel/swh-journal!124
        Related to swh/devel/swh-loader-core!379

    See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/94/ for more details.

  • You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up. Is this intentional,

    Well, yes and no.

    The exception is written through logs now so it's not lost (thus why it's tested with caplog).

    I'm not completely sure it should be in the retry proxy, thus why i proposed another one dedicated to this (in the main diff description).

    and did you discuss it with the others?

    Yes, the proposition appeared in the replayer swh-journal!124 (closed). Now the rest of the discussion can happen here :D

    • Rebase on latest master
    • Adapt according to review

    TODO:

    • Add another test on hash collision scenario
  • Build has FAILED

    Patch application report for D3012 (id=10716)

    Rebasing onto 2cc263da...

    Current branch diff-target is up to date.
    Changes applied before test
    commit bb2799a312d3583ba4e3f45188b570db6e327942
    Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
    Date:   Fri Apr 10 12:03:04 2020 +0200
    
        retry: Make content_add endpoints maximize content writes
        
        In effect, allow to write all but the colliding contents to the storage. This
        is the same behavior currently existing in the journal replayer. This shares
        the behavior within the retry proxy as this is a common need, for example in
        loaders.
        
        And the colliding hashes are stored in the log in a formatted way so we do not
        lose that information.
        
        Related to swh/devel/swh-journal!124
        Related to swh/devel/swh-loader-core!379

    Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/96/ See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/96/console

  • Fix counter mistyping

  • Build is green

    Patch application report for D3012 (id=10717)

    Rebasing onto 2cc263da...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 039de11c1fdbda64e58424fa9214dd478e683ae2
    Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
    Date:   Fri Apr 10 12:03:04 2020 +0200
    
        retry: Make content_add endpoints maximize content writes
        
        In effect, allow to write all but the colliding contents to the storage. This
        is the same behavior currently existing in the journal replayer. This shares
        the behavior within the retry proxy as this is a common need, for example in
        loaders.
        
        And the colliding hashes are stored in the log in a formatted way so we do not
        lose that information.
        
        Related to swh/devel/swh-journal!124
        Related to swh/devel/swh-loader-core!379

    See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/97/ for more details.

  • Fix off-by-1 check

  • Build is green

    Patch application report for D3012 (id=10718)

    Rebasing onto 2cc263da...

    Current branch diff-target is up to date.
    Changes applied before test
    commit 4cee13fc8d08afbe250eb20a2c99fd0f2ecbbabd
    Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com>
    Date:   Fri Apr 10 12:03:04 2020 +0200
    
        retry: Make content_add endpoints maximize content writes
        
        In effect, allow to write all but the colliding contents to the storage. This
        is the same behavior currently existing in the journal replayer. This shares
        the behavior within the retry proxy as this is a common need, for example in
        loaders.
        
        And the colliding hashes are stored in the log in a formatted way so we do not
        lose that information.
        
        Related to swh/devel/swh-journal!124
        Related to swh/devel/swh-loader-core!379

    See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/98/ for more details.

  • ! In !924 (closed), @ardumont wrote: You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up. Is this intentional,

    Well, yes and no.

    The exception is written through logs now so it's not lost (thus why it's tested with caplog).

    I'm not completely sure it should be in the retry proxy, thus why i proposed another one dedicated to this (in the main diff description).

    and did you discuss it with the others?

    Yes, the proposition appeared in the replayer swh-journal!124 (closed). Now the rest of the discussion can happen here :D

    This diff is trying to conflate two different cases, with two different tradeoffs, where we need to handle colliding contents. The choice it makes is one of these tradeoffs, but it introduces a critical issue in the other.

    (c/p from the argument I wrote in swh-loader-core!379 (closed))

    Replayers need to handle colliding contents in the journal, which we know happens in two situations:

    • before the compaction, when two identical contents have been added with a different ctime (which is a bug that's now been fixed and should happen less often)
    • after compaction, when there's actually two colliding contents in the journal (that have been rejected when being added by postgres)

    In both of these cases, we want the replayer to keep on working, to get as close a copy as possible of the main archive in the mirror.

    Loaders need to write consistent data to the storage, from the snapshot object down. Handling colliding contents in loaders is only //incidental//, because we're starting to have loaders which write an enormous amount of data, and where the probability of loading colliding contents is 1. For these loaders, we want to "maximize" the size of the partial snapshots loaded. But handling these colliding contents must not in any way result in inconsistent, hard-to-recover-from data being loaded into the archive.

    The choice made in this diff is to make the "log hash collisions, keep on going without error" behavior of the journal replayer the default. If loaders start using this, which they will as they all have the retry proxy in their storage pipeline, this will break the integrity of the archive, and lose data forever!

    At this point, I'm not even sure that I agree that this behavior needs to be lifted in the storage (considering that the journal replayer will be integrated in swh.storage soon). If we decide to do so, it absolutely needs to be opt-in (a new add_content_metadata_ignoring_hash_collisions endpoint, with a big fat warning in the documentation). The default behavior of content_add/content_add_metadata needs to stay a hard failure, and specific users can opt-in to the lossy behavior if they've made this choice explicitly.

  • Merge request was returned for changes

  • Merge request was abandoned

  • mentioned in merge request swh-journal!250 (closed)

Please register or sign in to reply
Loading