retry: Make content_add endpoints maximize content writes to storage
In effect, allows to write all but the colliding contents to the storage. This is the same behavior currently existing in the journal replayer (swh-journal!124 (closed)). This shares the behavior within the retry proxy as this is a common need, for example in loaders as well (swh-loader-core!379 (closed)).
And the colliding hashes are stored in the log in a formatted way so we do not lose that information.
Related to swh-journal!124 (closed) Related to swh-loader-core!379 (closed)
Note: It could go in an entirely different proxy. That would maximize composition through configuration and not entangle the retry and that new behavior (which is somehow a retry, thus why it's proposed here in the first place ;)
Test Plan
tox
Migrated from D3012 (view on Phabricator)
Merge request reports
Activity
Build is green
Patch application report for D3012 (id=10681)
Rebasing onto ddac3d27...
Current branch diff-target is up to date.
Changes applied before test
commit 15c1caa21862d5f7973a37599e338c4e7186a40f Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Date: Fri Apr 10 12:03:04 2020 +0200 retry: Make content_add endpoints maximize content writes In effect, allow to write all but the colliding contents to the storage. This is the same behavior currently existing in the journal replayer. This shares the behavior within the retry proxy as this is a common need, for example in loaders. And the colliding hashes are stored in the log in a formatted way so we do not lose that information. Related to swh/devel/swh-journal!124 Related to swh/devel/swh-loader-core!379
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/89/ for more details.
Build is green
Patch application report for D3012 (id=10709)
Rebasing onto e5e59435...
Current branch diff-target is up to date.
Changes applied before test
commit cdd8fac799a29fe5370063a47fcb6ae8e9558eb3 Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Date: Fri Apr 10 12:03:04 2020 +0200 retry: Make content_add endpoints maximize content writes In effect, allow to write all but the colliding contents to the storage. This is the same behavior currently existing in the journal replayer. This shares the behavior within the retry proxy as this is a common need, for example in loaders. And the colliding hashes are stored in the log in a formatted way so we do not lose that information. Related to swh/devel/swh-journal!124 Related to swh/devel/swh-loader-core!379
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/94/ for more details.
You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up. Is this intentional,
Well, yes and no.
The exception is written through logs now so it's not lost (thus why it's tested with caplog).
I'm not completely sure it should be in the retry proxy, thus why i proposed another one dedicated to this (in the main diff description).
and did you discuss it with the others?
Yes, the proposition appeared in the replayer swh-journal!124 (closed). Now the rest of the discussion can happen here :D
Build has FAILED
Patch application report for D3012 (id=10716)
Rebasing onto 2cc263da...
Current branch diff-target is up to date.
Changes applied before test
commit bb2799a312d3583ba4e3f45188b570db6e327942 Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Date: Fri Apr 10 12:03:04 2020 +0200 retry: Make content_add endpoints maximize content writes In effect, allow to write all but the colliding contents to the storage. This is the same behavior currently existing in the journal replayer. This shares the behavior within the retry proxy as this is a common need, for example in loaders. And the colliding hashes are stored in the log in a formatted way so we do not lose that information. Related to swh/devel/swh-journal!124 Related to swh/devel/swh-loader-core!379
Link to build: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/96/ See console output for more information: https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/96/console
Build is green
Patch application report for D3012 (id=10717)
Rebasing onto 2cc263da...
Current branch diff-target is up to date.
Changes applied before test
commit 039de11c1fdbda64e58424fa9214dd478e683ae2 Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Date: Fri Apr 10 12:03:04 2020 +0200 retry: Make content_add endpoints maximize content writes In effect, allow to write all but the colliding contents to the storage. This is the same behavior currently existing in the journal replayer. This shares the behavior within the retry proxy as this is a common need, for example in loaders. And the colliding hashes are stored in the log in a formatted way so we do not lose that information. Related to swh/devel/swh-journal!124 Related to swh/devel/swh-loader-core!379
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/97/ for more details.
Build is green
Patch application report for D3012 (id=10718)
Rebasing onto 2cc263da...
Current branch diff-target is up to date.
Changes applied before test
commit 4cee13fc8d08afbe250eb20a2c99fd0f2ecbbabd Author: Antoine R. Dumont (@ardumont) <antoine.romain.dumont@gmail.com> Date: Fri Apr 10 12:03:04 2020 +0200 retry: Make content_add endpoints maximize content writes In effect, allow to write all but the colliding contents to the storage. This is the same behavior currently existing in the journal replayer. This shares the behavior within the retry proxy as this is a common need, for example in loaders. And the colliding hashes are stored in the log in a formatted way so we do not lose that information. Related to swh/devel/swh-journal!124 Related to swh/devel/swh-loader-core!379
See https://jenkins.softwareheritage.org/job/DSTO/job/tests-on-diff/98/ for more details.
! In !924 (closed), @ardumont wrote: You significantly changed the behavior in case of hash collisions, as the exceptions no longer bubble up. Is this intentional,
Well, yes and no.
The exception is written through logs now so it's not lost (thus why it's tested with caplog).
I'm not completely sure it should be in the retry proxy, thus why i proposed another one dedicated to this (in the main diff description).
and did you discuss it with the others?
Yes, the proposition appeared in the replayer swh-journal!124 (closed). Now the rest of the discussion can happen here :D
This diff is trying to conflate two different cases, with two different tradeoffs, where we need to handle colliding contents. The choice it makes is one of these tradeoffs, but it introduces a critical issue in the other.
(c/p from the argument I wrote in swh-loader-core!379 (closed))
Replayers need to handle colliding contents in the journal, which we know happens in two situations:
- before the compaction, when two identical contents have been added with a different ctime (which is a bug that's now been fixed and should happen less often)
- after compaction, when there's actually two colliding contents in the journal (that have been rejected when being added by postgres)
In both of these cases, we want the replayer to keep on working, to get as close a copy as possible of the main archive in the mirror.
Loaders need to write consistent data to the storage, from the snapshot object down. Handling colliding contents in loaders is only //incidental//, because we're starting to have loaders which write an enormous amount of data, and where the probability of loading colliding contents is 1. For these loaders, we want to "maximize" the size of the partial snapshots loaded. But handling these colliding contents must not in any way result in inconsistent, hard-to-recover-from data being loaded into the archive.
The choice made in this diff is to make the "log hash collisions, keep on going without error" behavior of the journal replayer the default. If loaders start using this, which they will as they all have the retry proxy in their storage pipeline, this will break the integrity of the archive, and lose data forever!
At this point, I'm not even sure that I agree that this behavior needs to be lifted in the storage (considering that the journal replayer will be integrated in swh.storage soon). If we decide to do so, it absolutely needs to be opt-in (a new
add_content_metadata_ignoring_hash_collisions
endpoint, with a big fat warning in the documentation). The default behavior ofcontent_add
/content_add_metadata
needs to stay a hard failure, and specific users can opt-in to the lossy behavior if they've made this choice explicitly.mentioned in merge request swh-loader-core!380 (closed)
mentioned in merge request swh-journal!250 (closed)
mentioned in merge request swh-loader-core!379 (closed)