java: Writing node properties with a non-UTC timezone shifts timestamps backwards
Because our ORC exports use the timestamp type instead of timestamp with timezone, reader and writer need to agree out of band on the timezone used in the files they exchange.
However, we don't do this:
- swh-dataset uses pyorc, which uses the C++ ORC library, which assumes users (us) always write in GMT
- swh-graph uses the Java ORC library, which assumes the system timezone (or $TZ if set)
So when reading with a non-UTC timezone, the Java ORC library interprets timestamps in the dataset as being in the local timezone, and converts them to UNIX timestamps (number of seconds since epoch); we then use these converted timestamps and write them to .property.author_timestamp.bin and .property.committer_timestamp.bin.
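To illustrate the size and direction of the shift, here is a minimal Python sketch that simulates the effect described above. It is not the Java ORC library's code: the function name `misread_as_local` is made up for illustration, and the sketch only assumes that the reader takes the UTC wall-clock time stored in the file and treats it as local time.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def misread_as_local(true_utc_ts: int, tz_name: str) -> int:
    """Simulate a reader that treats a UTC wall-clock time as local time.

    Hypothetical helper for illustration only; it mimics the effect,
    not the actual implementation of any ORC library.
    """
    # Wall-clock time as written (in UTC), with the timezone information dropped.
    wall_clock = datetime.fromtimestamp(true_utc_ts, tz=timezone.utc).replace(tzinfo=None)
    # The reader wrongly attaches its own timezone to that wall-clock time.
    misread = wall_clock.replace(tzinfo=ZoneInfo(tz_name))
    return int(misread.timestamp())


winter = int(datetime(2022, 1, 15, 12, 0, tzinfo=timezone.utc).timestamp())
summer = int(datetime(2022, 7, 15, 12, 0, tzinfo=timezone.utc).timestamp())

print(misread_as_local(winter, "Europe/Paris") - winter)  # -3600
print(misread_as_local(summer, "Europe/Paris") - summer)  # -7200
print(misread_as_local(winter, "UTC") - winter)           # 0
```

With a UTC reader the timestamp is unchanged; with Europe/Paris it moves back by one hour in winter and two hours in summer.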
This affects both the 24-node example dataset and the 2022-12-07 compressed graphs (both the complete graph and the "history-hosting" subgraph), which were compressed with Europe/Paris as the timezone.
This means that users of these graphs must apply a reverse conversion to these timestamps, by interpreting them as Europe/Paris dates and converting them back to UTC. In practice: add one hour for timestamps in winter time, and two hours for timestamps in summer time. This is unfortunately ambiguous around the fall DST change.
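A minimal Python sketch of this reverse conversion follows. The function name `undo_paris_shift` is hypothetical, and the sketch assumes the shifted value has already been read from the property file as an integer number of seconds since epoch. It applies the Europe/Paris offset in effect at the stored instant, so values close to a DST change should not be trusted.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")


def undo_paris_shift(shifted_ts: int) -> int:
    """Map a shifted timestamp back to the intended UTC timestamp.

    Hypothetical helper: reads the stored instant as a Europe/Paris
    wall-clock time, then reinterprets that wall-clock time as UTC.
    This adds 3600 s in winter time and 7200 s in summer time.
    """
    paris_wall_clock = datetime.fromtimestamp(shifted_ts, tz=PARIS)
    return int(paris_wall_clock.replace(tzinfo=timezone.utc).timestamp())


# Example: a revision authored at 2022-01-15 12:00:00 UTC was stored
# one hour too early; the reverse conversion recovers the original value.
intended = int(datetime(2022, 1, 15, 12, 0, tzinfo=timezone.utc).timestamp())
stored = intended - 3600
print(undo_paris_shift(stored) == intended)  # True
```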
Another option is to regenerate the property files, by downloading the .orc files for revisions and releases and using this command:
TZ=UTC java -Xmx1T -XX:PretenureSizeThreshold=512M -XX:MaxNewSize=4G -XX:+UseLargePages -XX:+UseTransparentHugePages -XX:+UseNUMA -XX:+UseTLAB -XX:+ResizeTLAB -cp java/target/swh-graph-*.jar 2022-12-07/orc/ 2022-12-07/compressed/
(This will fail with an error because some ORC files are missing, but timestamps are written first, so this is good enough.)