Skip to content
Snippets Groups Projects

postgresql: improve db.origin_visit_status_add to fix possible replayer issues

Open David Douard requested to merge douardda/swh-storage:fix-replay-ovs into master
3 unresolved threads

In case of an existing OriginVisitStatus entry (typically with the 'created' status), adding a new version of this entry should do an update instead of ignoring it. This could happen during a replayer session where the insertion of the OriginVisit used to automatically create an OriginVisitStatus in "created" status. This would then prevent the following replayer to actually insert the correct OriginVisitStatus.

Closes #4704 (closed)

Edited by David Douard

Merge request reports

Members who can merge are allowed to add commits.

Pipeline #6182 passed

Pipeline passed for 35c6339a on douardda:fix-replay-ovs

Approval is optional
Merge blocked: 1 check failed
Merge request must be rebased, because a fast-forward merge is not possible.

Merge details

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
994 999 f"(origin, {', '.join(cols)}, metadata) "
995 1000 f"VALUES ((select id from origin_id), "
996 1001 f"{', '.join(['%s']*len(cols))}, %s) "
997 f"ON CONFLICT (origin, visit, date) do nothing",
1002 f"ON CONFLICT (origin, visit, date) "
1003 "do update "
1004 " set "
1005 " type=excluded.type, "
1006 " status=excluded.status, "
1007 " snapshot=excluded.snapshot, "
1008 " metadata=excluded.metadata "
1009 " where origin_visit_status.origin=excluded.origin "
1010 " and origin_visit_status.visit=excluded.visit "
1011 " and origin_visit_status.date=excluded.date "
1012 " and origin_visit_status.status='created' "
1013 " and origin_visit_status.status!=excluded.status",
  • Comment on lines +1002 to +1013
    Suggested change
    1002 f"ON CONFLICT (origin, visit, date) "
    1003 "do update "
    1004 " set "
    1005 " type=excluded.type, "
    1006 " status=excluded.status, "
    1007 " snapshot=excluded.snapshot, "
    1008 " metadata=excluded.metadata "
    1009 " where origin_visit_status.origin=excluded.origin "
    1010 " and origin_visit_status.visit=excluded.visit "
    1011 " and origin_visit_status.date=excluded.date "
    1012 " and origin_visit_status.status='created' "
    1013 " and origin_visit_status.status!=excluded.status",
    1002 f"ON CONFLICT (origin, visit, date) "
    1003 "DO UPDATE "
    1004 " SET "
    1005 " type=excluded.type, "
    1006 " status=excluded.status, "
    1007 " snapshot=excluded.snapshot, "
    1008 " metadata=excluded.metadata "
    1009 " WHERE origin_visit_status.origin=excluded.origin "
    1010 " AND origin_visit_status.visit=excluded.visit "
    1011 " AND origin_visit_status.date=excluded.date "
    1012 " AND origin_visit_status.status='created' "
    1013 " AND origin_visit_status.status!=excluded.status",
  • Please register or sign in to reply
  • Could you rename the commit to "origin_visit_status_add: Update instead of ignore on conflict"? The current one is overly generic.

    Could you also add a non-replay test for this behavior? This way it will check that Cassandra also behaves this way (it should already)

  • 992 997 f"WITH origin_id as (select id from origin where url=%s) "
    993 998 f"INSERT INTO origin_visit_status "
    994 999 f"(origin, {', '.join(cols)}, metadata) "
    995 1000 f"VALUES ((select id from origin_id), "
    996 1001 f"{', '.join(['%s']*len(cols))}, %s) "
    997 f"ON CONFLICT (origin, visit, date) do nothing",
    1002 f"ON CONFLICT (origin, visit, date) "
    1003 "do update "
    1004 " set "
    1005 " type=excluded.type, "
    1006 " status=excluded.status, "
    1007 " snapshot=excluded.snapshot, "
    1008 " metadata=excluded.metadata "
    1009 " where origin_visit_status.origin=excluded.origin "
    1010 " and origin_visit_status.visit=excluded.visit "
    1011 " and origin_visit_status.date=excluded.date "
    • Even though I proposed it, I'm not too fond of the idea of adding back an UPDATE statement in the PostgreSQL, rather than rely on the idempotence of the contents of the table, especially as a workaround for a bug, and even more so on a table that we've created specifically to stop having to do UPDATEs.

      If anything I'd prefer moving towards never creating origin_visit_status implicitly when creating an origin_visit, to make sure this never occurs again.

      I think we should consider the option of pruning the table of entries with origin_visit_status.date = origin_visit.date and status=created and relaunching the replayer?

    • Author Maintainer

      Even though I proposed it, I'm not too fond of the idea of adding back an UPDATE statement in the PostgreSQL, rather than rely on the idempotence of the contents of the table, especially as a workaround for a bug, and even more so on a table that we've created specifically to stop having to do UPDATEs.

      Good remark, makes sense indeed...

      If anything I'd prefer moving towards never creating origin_visit_status implicitly when creating an origin_visit, to make sure this never occurs again.

      the question is: when/why do we need this implicitly created entry? what problem does it actually solve (apart from some idea of consistency maybe?)

      I think we should consider the option of pruning the table of entries with origin_visit_status.date = origin_visit.date and status=created and relaunching the replayer?

      Let me play with this idea, but as we discussed previously, we probably can ditch all the 'created' OVS entries (without the filtering on dates). Don't see why it would be an issue to do so.

    • If anything I'd prefer moving towards never creating origin_visit_status implicitly when creating an origin_visit, to make sure this never occurs again.

      the question is: when/why do we need this implicitly created entry? what problem does it actually solve (apart from some idea of consistency maybe?)

      It records that an origin visit was attempted, even if it failed so hard that it couldn't record anything else. So basically it guarantees that at least one entry in origin_visit_status exists for each origin_visit.

      I think we should consider the option of pruning the table of entries with origin_visit_status.date = origin_visit.date and status=created and relaunching the replayer?

      Let me play with this idea, but as we discussed previously, we probably can ditch all the 'created' OVS entries (without the filtering on dates). Don't see why it would be an issue to do so.

      Yeah, that sounds fine to me.

      Edited by vlorentz
    • Author Maintainer

      It records that an origin visit was attempted, even if it failed so hard that it couldn't record anything else. So basically it guarantees that at least one entry in origin_visit_status exists for each origin_visit.

      I get that, but do we really need a "created" row in the OVS table to carry that information? Why cannot this state be carried by an OV row with no corresponding OVS entry?

      I see that currently, at least Db.origin_visit_get() depends on this (but it can be refactored), maybe there are other places depending on this behavior?

    • Author Maintainer

      Or even one step further: do we really need the OV table at all? Here also: what problem does it actually solve?

    • Please register or sign in to reply
  • mentioned in issue swh/meta#4370

  • Please register or sign in to reply
    Loading