Refactor the origin visit data model (aka get rid of the OriginVisit model object)
This model object represents the notion of "one visit of an origin". The current model looks like:
+--------+     +--------------+     +-------------------+
| Origin |     | OriginVisit  |     | OriginVisitStatus |
+--------+     +--------------+     +-------------------+
| - id <-|-----|-- origin     |     | - origin (same)   |
| - url  |     | - visit <----|-----|-- visit           |
|        |     | - type       |     | - type (same)     |
|        |     | - date       |     | - date            |
|        |     |              |     | - status          |
|        |     |              |     | - metadata        |
|        |     |              |     | - snapshot        |
+--------+     +--------------+     +-------------------+
In this model, there can be several OriginVisit (OV) objects for a given Origin (O) (cardinality is Origin <-(*-1)- OriginVisit). Then, for a given visit, there can be several OriginVisitStatus (OVS) objects (so a similar cardinality).
The only attribute of an OriginVisit that is not duplicated in the OriginVisitStatus objects related to this visit is the date. However, the current implementation packs together the creation of the first OriginVisitStatus object with the OriginVisit it is related to. So in practice, the OriginVisit object does not carry any useful information.
In this model, the OV only holds a local counter of visits for the origin, whose main purpose is probably to ease pagination in the origin visit API.
It seems clear that this OriginVisit object is not very useful. Moreover, it can make the replayer process (mirror) more difficult to implement properly.
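The current model described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `VisitStore` class, its `add_visit` helper and the "created" status value are hypothetical names chosen for the example, not the actual storage API. It shows the two points made above: the visit id is a per-origin counter that has to be maintained somewhere, and the first OriginVisitStatus is created together with the OriginVisit, duplicating its fields.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class OriginVisit:
    origin: int      # Origin id (pkey), as in the diagram
    visit: int       # local, per-origin visit counter
    type: str
    date: datetime


@dataclass
class OriginVisitStatus:
    origin: int      # same as OriginVisit.origin
    visit: int       # references OriginVisit.visit
    type: str        # same as OriginVisit.type
    date: datetime
    status: str
    metadata: Optional[dict] = None
    snapshot: Optional[bytes] = None


class VisitStore:
    """Hypothetical in-memory store, for illustration only."""

    def __init__(self) -> None:
        # The counter state that must be kept reliable for each origin.
        self._last_visit: Dict[int, int] = {}

    def add_visit(
        self, origin: int, type: str, date: datetime
    ) -> Tuple[OriginVisit, OriginVisitStatus]:
        # Allocate the next visit number for this origin.
        visit_id = self._last_visit.get(origin, 0) + 1
        self._last_visit[origin] = visit_id
        visit = OriginVisit(origin, visit_id, type, date)
        # The first status is created together with the visit, so every
        # field of the visit (including its date) is repeated here.
        # "created" is an assumed status value.
        first_status = OriginVisitStatus(origin, visit_id, type, date, "created")
        return visit, first_status
```

In this sketch, nothing in the returned OriginVisit is missing from the first OriginVisitStatus, and a mirror replaying these objects would have to reproduce the same counter values.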
Possible evolution (from olasd)
A possible model would be to get rid of the OriginVisit object, use the origin url directly (instead of the origin pkey id), and replace the visit id by a UUID so that there is no longer any need to keep a (reliable) counter.
A simple solution would look like:
OriginVisitStatus:
    origin: str            # origin url
    status_date: datetime
    type: enum
    visit_id: UUID
    status: enum
    snapshot_id: sha1
where the primary key is (origin, status_date, type, visit_id).
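As a rough illustration of the proposal above, here is a minimal Python sketch. The enum values, the `new_visit` and `update_status` helpers and the optional `snapshot_id` are assumptions made for the example; only the field names and the primary key come from the proposal.

```python
import uuid
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional, Tuple


class VisitType(Enum):      # assumed values, for illustration
    GIT = "git"
    SVN = "svn"


class VisitStatus(Enum):    # assumed values, for illustration
    CREATED = "created"
    ONGOING = "ongoing"
    FULL = "full"
    PARTIAL = "partial"


@dataclass(frozen=True)
class OriginVisitStatus:
    origin: str                          # origin url
    status_date: datetime
    type: VisitType
    visit_id: uuid.UUID
    status: VisitStatus
    snapshot_id: Optional[bytes] = None  # sha1; optional here is an assumption

    def primary_key(self) -> Tuple[str, datetime, VisitType, uuid.UUID]:
        return (self.origin, self.status_date, self.type, self.visit_id)


def new_visit(origin: str, type: VisitType, date: datetime) -> OriginVisitStatus:
    # A random UUID identifies the visit: no per-origin counter is needed.
    return OriginVisitStatus(origin, date, type, uuid.uuid4(), VisitStatus.CREATED)


def update_status(
    previous: OriginVisitStatus,
    date: datetime,
    status: VisitStatus,
    snapshot_id: Optional[bytes] = None,
) -> OriginVisitStatus:
    # A later status of the same visit keeps the visit_id; only the
    # status_date, status and snapshot_id change.
    return replace(previous, status_date=date, status=status, snapshot_id=snapshot_id)
```

With this shape, two visits of the same origin started independently (for instance on two different nodes) get distinct `visit_id` values without any coordination, and all statuses of one visit can be grouped by their shared `visit_id`.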
Migrated from T4370 (view on Phabricator)