Consider switching timestamp offset storage to strings/byte arrays
Our current TimestampWithTimezone data type, which has three fields:
- a timestamp
- a timezone offset in minutes
- a boolean to support "negative utc" -0000 timezone offsets
is a recurrent cause of grief:
- The latest example of such grief is the discussion around swh/devel/swh-model!115 (closed).
- Some (legacy) timezone offsets are not a whole number of minutes and can't be stored.
- Some buggy data overflows the capacity of the current (smallint) field and is rejected artificially.
- Some objects we're importing have no timezone information at all, which forces us to add bogus data.
- Analysis of timezone-related data is more of a curiosity, and wouldn't be hampered much by relaxing the constraints on the field.
I propose that we turn the timezone offset and negative utc boolean fields into a unified, nullable, free-form bytestring.
The recommended format for the bytestring would be ASCII [+-]HHMM, where HH and MM are 0-padded integers for the hours and minutes of the timezone offset. Other values would be supported, with their interpretation left to end users; this allows SWHID-preserving imports of data from VCSs with lax validation, such as git.
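As a rough illustration of the recommended format, here is a minimal Python sketch. The function names `offset_to_bytes` and `parse_offset` are hypothetical and are not part of swh-model; the sketch assumes the legacy pair is an offset in minutes plus the negative-utc boolean, and that anything not matching [+-]HHMM is simply left uninterpreted.

```python
import re
from typing import Optional

# Recommended format only: sign, two hour digits, two minute digits.
_OFFSET_RE = re.compile(rb"([+-])(\d{2})(\d{2})")


def offset_to_bytes(offset_minutes: int, negative_utc: bool = False) -> bytes:
    """Render a legacy (minutes, negative_utc) pair as [+-]HHMM (hypothetical helper)."""
    if offset_minutes == 0 and negative_utc:
        return b"-0000"
    sign = b"-" if offset_minutes < 0 else b"+"
    hours, minutes = divmod(abs(offset_minutes), 60)
    return sign + b"%02d%02d" % (hours, minutes)


def parse_offset(raw: Optional[bytes]) -> Optional[int]:
    """Return the offset in minutes if raw follows [+-]HHMM, else None.

    None means "no interpretation offered": the raw bytes would still be
    stored untouched; this helper just declines to guess their meaning.
    """
    if raw is None:
        return None
    match = _OFFSET_RE.fullmatch(raw)
    if match is None:
        return None
    sign, hours, minutes = match.groups()
    total = int(hours) * 60 + int(minutes)
    return -total if sign == b"-" else total
```

Note that `parse_offset(b"-0000")` and `parse_offset(b"+0000")` both yield 0 minutes: the distinction lives only in the stored bytes, which is why the separate boolean would no longer be needed.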
This is fully backwards-compatible with the current SWHID computation, which turns the boolean/int combination into a string in the format above when computing identifiers. The computation of SWHIDs would be modified so that, when the field is null, the "authorship" line ends right after the timestamp, with no trailing space. Objects generated from a single timestamp with no timezone data would be stored as such.
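The null-handling rule could look roughly like the sketch below. This is not the swh-model implementation: `format_authorship` and its parameters are hypothetical, and the layout assumed here is the git-style `<role> <fullname> <timestamp> <offset>` line.

```python
from typing import Optional


def format_authorship(
    role: bytes, fullname: bytes, seconds: int, offset: Optional[bytes]
) -> bytes:
    """Build an authorship line for identifier computation (hypothetical helper).

    The offset bytes are emitted verbatim, whatever they contain. When the
    offset is None, the line ends right after the timestamp: neither the
    offset nor the separating space is written.
    """
    line = role + b" " + fullname + b" " + str(seconds).encode("ascii")
    if offset is not None:
        line += b" " + offset
    return line
```

With a value in the recommended format, the output is byte-for-byte what the boolean/int pair produces today, so existing identifiers would not change under this sketch; only objects with a null offset get the shorter line.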
Migrated from T2449 (view on Phabricator)