
Make origin visits immutable

Current problem

Currently, origin visits are the only mutable object type in our data model. A visit is first created with "ongoing" status, then optionally updated multiple times with a partial snapshot, and finally has its status set to either "full" or "partial" with an optional snapshot.

This causes an issue with messages pushed to Kafka, which does not guarantee the last update of a message will be the one preserved when compacting; so the journal may end up with an outdated version of a visit.

Solution

We should split a visit object from its updates in the journal: one topic for the visits themselves (identified by (origin_url, visit_id)), and one for the successive updates (identified by (origin_url, visit_id, update_id), where update_id must be totally ordered). This way, all visit updates would be stored in the journal, and the replayer would preserve both the updates and their order, even though they are replayed in an arbitrary order.
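As a rough illustration of why a totally ordered update_id makes the replay order irrelevant, here is a minimal sketch. The message shape, the `replay` helper, and the example URL are assumptions made for this example; they are not the actual journal or replayer code.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical message shape: ((origin_url, visit_id, update_id), payload).
VisitKey = Tuple[str, int]
UpdateKey = Tuple[str, int, int]


def replay(messages: List[Tuple[UpdateKey, dict]]) -> Dict[VisitKey, List[dict]]:
    """Store every update, then list each visit's updates sorted by update_id."""
    store: Dict[UpdateKey, dict] = {}
    for key, payload in messages:
        # Each update has its own key, so no update overwrites another one.
        store[key] = payload

    by_visit: Dict[VisitKey, List[dict]] = {}
    for (origin_url, visit_id, _update_id), payload in sorted(
        store.items(), key=lambda item: item[0]
    ):
        by_visit.setdefault((origin_url, visit_id), []).append(payload)
    return by_visit


url = "https://example.org/repo.git"  # placeholder origin
updates = [
    ((url, 1, 1), {"status": "ongoing"}),
    ((url, 1, 2), {"status": "ongoing"}),
    ((url, 1, 3), {"status": "full"}),
]
shuffled = updates[:]
random.shuffle(shuffled)

# Whatever the arrival order, the reconstructed history is the same.
assert replay(shuffled) == replay(updates)
```

Because the ordering is carried by the key rather than by arrival time, the consumer never needs to see the messages in order.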

The current fields of origin visits would be split this way:

  • the new "visit" objects would get the type (git/tar/...), maybe (?) metadata, and maybe a new start_date field
  • the "visit_update" objects would get the other fields (date, status, metadata (?), snapshot).

There are two ways to make the replayer work with this:

  1. simply add a "version" field to origin visits; the replayer passes this version to origin_visit_upsert, and swh-storage discards the upsert if it is older than what is currently in the DB
  2. make visit updates a first class citizen of the data model and swh-storage, and always keep all updates
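The first option could look roughly like the sketch below. It uses a plain dict as a stand-in for the database, and the "version" field and the boolean return value are assumptions for illustration; this is not the swh-storage implementation.

```python
from typing import Dict, Tuple

# Stand-in for the origin visit table, keyed by (origin_url, visit_id).
db: Dict[Tuple[str, int], dict] = {}


def origin_visit_upsert(visit: dict) -> bool:
    """Write the visit unless the DB already holds a newer version.

    Returns True if the row was written, False if the upsert was discarded.
    """
    key = (visit["origin"], visit["visit"])
    current = db.get(key)
    if current is not None and current["version"] > visit["version"]:
        return False  # incoming message is older than what is stored
    db[key] = visit
    return True


url = "https://example.org/repo.git"  # placeholder origin
# The newer version arrives first, the outdated one second.
origin_visit_upsert({"origin": url, "visit": 1, "version": 2, "status": "full"})
origin_visit_upsert({"origin": url, "visit": 1, "version": 1, "status": "ongoing"})
assert db[(url, 1)]["status"] == "full"
```

With this approach only the latest state of each visit survives in storage; the second option keeps the whole history instead.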

New data model for the visits

An origin-visit represents a run of a loader. It carries the following information:

  • origin url
  • visit id: unique for a given origin
  • type (git, hg, ...)
  • start_date: when the loader was started, shortly before it created the origin visit
  • (snapshot): snapshot of all the branches already/currently loaded
  • (metadata): associated metadata (unused)

Note: a property name wrapped in parentheses is optional.

An origin-visit-status represents a snapshot of a visit's loader at a point in time (sent from time to time by the loader, like a heartbeat). It has the following fields:

  • origin url
  • visit id
  • date: the timestamp of the snapshot of the loader task
  • status: Status of the visit (possible values: created, ongoing, full, partial)
  • (snapshot): snapshot of all the branches already/currently loaded
  • (metadata): associated metadata (not used, kept for future update)
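The two object types above could be sketched as immutable Python dataclasses. This is only an illustration of the field lists, not the swh-model classes: the concrete types (for example bytes for the snapshot identifier) are assumptions, and the visit id is left optional on OriginVisit on the assumption that the storage assigns it at creation time.

```python
import datetime
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional

VISIT_STATUSES = ("created", "ongoing", "full", "partial")


@dataclass(frozen=True)
class OriginVisit:
    origin: str                       # origin url
    type: str                         # git, hg, ...
    start_date: datetime.datetime     # when the loader was started
    visit: Optional[int] = None       # assumed to be assigned by the storage
    snapshot: Optional[bytes] = None  # optional
    metadata: Optional[dict] = None   # optional, unused


@dataclass(frozen=True)
class OriginVisitStatus:
    origin: str
    visit: int
    date: datetime.datetime           # timestamp of this status
    status: str                       # one of VISIT_STATUSES
    snapshot: Optional[bytes] = None  # optional
    metadata: Optional[dict] = None   # optional, kept for future use

    def __post_init__(self) -> None:
        if self.status not in VISIT_STATUSES:
            raise ValueError(f"invalid visit status: {self.status!r}")


status = OriginVisitStatus(
    origin="https://example.org/repo.git",  # placeholder origin
    visit=1,
    date=datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc),
    status="ongoing",
)
try:
    status.status = "full"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("statuses are immutable; add a new OriginVisitStatus instead")
```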

The following operations are supported on origin visits:

  • creating a visit (to get a unique id from the storage)
  • getting a visit from its origin url + id
  • listing visits of an origin (with filters, order, etc.)
  • upserting a visit (to add a visit with a predetermined id, needed for the replayer)

(Note that there is no equivalent of, and no need for, the current "origin_visit_update()" endpoint, as origin visits are now immutable.)

and on origin visit updates:

  • adding a new status (using the origin url and visit id; there must be an override to allow the replayer to add an update for a visit that doesn't exist yet)
  • getting the last status of a visit (so one can know if it's completed and get the id of its snapshot)
  • listing all statuses of a visit (?) (there is no need for it yet)
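To make these operations concrete, here is a toy in-memory sketch. It is deliberately simplified (plain dicts instead of model objects), and the class name, the `allow_missing_visit` flag, and the choice of "latest by date" are assumptions for illustration rather than the swh-storage API.

```python
import datetime
from typing import Dict, List, Optional, Tuple

VisitKey = Tuple[str, int]


class InMemoryVisitStorage:
    """Toy storage: visits are written once, statuses are append-only."""

    def __init__(self) -> None:
        self._visits: Dict[VisitKey, dict] = {}
        self._statuses: Dict[VisitKey, List[dict]] = {}

    def origin_visit_add(self, origin: str, type: str, start_date) -> int:
        """Create a visit and return an id that is unique for this origin."""
        last_id = max((v for (o, v) in self._visits if o == origin), default=0)
        visit_id = last_id + 1
        self._visits[(origin, visit_id)] = {
            "origin": origin, "visit": visit_id,
            "type": type, "start_date": start_date,
        }
        return visit_id

    def origin_visit_get(self, origin: str, visit_id: int) -> Optional[dict]:
        return self._visits.get((origin, visit_id))

    def origin_visit_status_add(self, status: dict, allow_missing_visit: bool = False) -> None:
        """Append a status; the flag is the override needed by the replayer."""
        key = (status["origin"], status["visit"])
        if key not in self._visits and not allow_missing_visit:
            raise ValueError(f"unknown visit {key}")
        self._statuses.setdefault(key, []).append(status)

    def origin_visit_status_get_latest(self, origin: str, visit_id: int) -> Optional[dict]:
        """Return the status with the greatest date, or None if there is none."""
        statuses = self._statuses.get((origin, visit_id), [])
        return max(statuses, key=lambda s: s["date"], default=None)


storage = InMemoryVisitStorage()
url = "https://example.org/repo.git"  # placeholder origin
visit_id = storage.origin_visit_add(url, "git", datetime.datetime(2020, 1, 1))
# Statuses may arrive out of order; the latest one is still found by date.
storage.origin_visit_status_add(
    {"origin": url, "visit": visit_id, "date": datetime.datetime(2020, 1, 3), "status": "full"})
storage.origin_visit_status_add(
    {"origin": url, "visit": visit_id, "date": datetime.datetime(2020, 1, 2), "status": "ongoing"})
assert storage.origin_visit_status_get_latest(url, visit_id)["status"] == "full"
```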

Example

Loaders will use the storage API like this:

# start
id = storage.origin_visit_add(OriginVisit(origin=origin_url, type=..., start_date=...))
# load some stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# load more stuff
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='ongoing', snapshot=..., metadata=None))
# finish loading everything
storage.origin_visit_status_add(OriginVisitStatus(origin=origin_url, visit=id, date=..., status='full', snapshot=..., metadata=None))

and readers (mostly just swh-web) would get this from the API:

visit = storage.origin_visit_get_latest(origin_url)
# visit == {'origin': origin_url, 'visit': id, 'start_date': ...}
visit_status = storage.origin_visit_status_get_latest(origin_url, id)
# visit_status == {'origin': origin_url, 'visit': id, 'date': ..., 'status': ..., 'snapshot': ..., 'metadata': ...}

Work plan


Migrated from T2310 (view on Phabricator)
