Up-to-date objstorage mirror on S3
The S3 bucket containing objects is very outdated. We need to fill the gap and keep it up to date. This will be done with the content-replayer, which reads new object identifiers from Kafka and then copies the corresponding objects from the object storages on Banco and Uffizi.
To speed up the replay, it will use a 60GB file containing hashes of files that are already on S3. (It's a sorted list of hashes, so the replayer will do random short reads).
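For illustration, here is a minimal sketch of such a membership test, assuming the inventory file is a sorted, densely packed array of raw 20-byte SHA1 digests (the exact on-disk format expected by --exclude-sha1-file is an assumption here):

import os

HASH_SIZE = 20  # assumed record size: one raw SHA1 digest per entry

def sha1_in_inventory(f, sha1):
    """Binary search for a raw 20-byte sha1 in a sorted file of packed digests.

    Each probe is a short read at a computed offset, which is why the 60GB
    inventory can stay on disk instead of being loaded into RAM.
    """
    nb_records = os.fstat(f.fileno()).st_size // HASH_SIZE
    lo, hi = 0, nb_records
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * HASH_SIZE)
        candidate = f.read(HASH_SIZE)
        if candidate == sha1:
            return True
        elif candidate < sha1:
            lo = mid + 1
        else:
            hi = mid
    return False

# Hypothetical usage:
# with open("/srv/softwareheritage/cassandra-test-0/scratch/sorted_inventory.bin", "rb") as f:
#     already_on_s3 = sha1_in_inventory(f, bytes.fromhex("94a9ed024d3859793618152ea559a168bbcbb5e2"))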
The content replayer is single-threaded and has high latency, so we should run many instances of it, at least to fill the initial gap (I tried up to 100 on my desktop; the speedup is linear).
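As a rough, conceptual sketch of what each instance does (this is not the actual swh.journal implementation; the topic name, the message encoding, and the three placeholder functions are assumptions made for illustration):

import msgpack
from confluent_kafka import Consumer

# Hypothetical placeholders; the real replayer builds these from the
# objstorage configuration shown further down.
def sha1_in_inventory(sha1):
    """Membership test against the sorted inventory file (see sketch above)."""
    raise NotImplementedError

def fetch_from_sources(sha1):
    """Read the object from the banco objstorage, falling back to uffizi."""
    raise NotImplementedError

def put_to_s3(sha1, data):
    """Write the object to the destination S3 bucket."""
    raise NotImplementedError

consumer = Consumer({
    "bootstrap.servers": "esnode1.internal.softwareheritage.org",
    "group.id": "vlorentz-test-replay-rocq-to-s3",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["swh.journal.objects.content"])  # assumed topic name

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    content = msgpack.unpackb(msg.value(), raw=False)  # assumed encoding
    sha1 = content["sha1"]
    if sha1_in_inventory(sha1):
        continue  # already on S3, nothing to copy
    data = fetch_from_sources(sha1)
    if data is not None:
        put_to_s3(sha1, data)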
Example systemd unit to run it:
[Unit]
Description=Content replayer Rocq to S3 (service %i)
After=network.target
[Service]
Type=simple
ExecStart=/bin/bash -c 'sleep $(( RANDOM % 60 )); /home/dev/.local/bin/swh --log-level=INFO journal --config-file ~/replay_content_rocq_to_s3.yml content-replay --exclude-sha1-file /srv/softwareheritage/cassandra-test-0/scratch/sorted_inventory.bin'
Restart=on-failure
SyslogIdentifier=content-replayer-%i
Nice=10
[Install]
WantedBy=default.target
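Since this is a template unit (note the %i specifier), it can be installed as e.g. content-replayer@.service and started in bulk, along the lines of systemctl start content-replayer@{1..50}; the unit name and instance count are only examples.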
(The random sleep at the beginning is to work around a crash that happens when too many Kafka clients start at the same time.)
Example config file:
objstorage_src:
  cls: multiplexer
  args:
    objstorages:
      - cls: filtered
        args:
          storage_conf:
            cls: remote
            args:
              url: http://banco.internal.softwareheritage.org:5003/
          filters_conf:
            - type: readonly
      - cls: filtered
        args:
          storage_conf:
            cls: remote
            args:
              url: http://uffizi.internal.softwareheritage.org:5003/
          filters_conf:
            - type: readonly
objstorage_dst:
  cls: s3
  args:
    container_name: NAME_OF_THE_S3_BUCKET
    key: KEY_OF_THE_S3_USER
    secret: SECRET_OF_THE_S3_USER
journal:
  brokers:
    - esnode1.internal.softwareheritage.org
    - esnode2.internal.softwareheritage.org
    - esnode3.internal.softwareheritage.org
  group_id: vlorentz-test-replay-rocq-to-s3
  max_poll_records: 100
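To make the source side of this configuration concrete: the multiplexer tries each wrapped objstorage in turn until one of them has the object, and the readonly filter ensures the replayer can never write back to banco or uffizi. A stand-alone illustration of that read path (purely conceptual, not the swh.objstorage implementation; the class names and get()/add() signatures are made up):

class ReadOnlyError(Exception):
    """Raised when a write is attempted on a read-only objstorage."""

class ReadOnlyFilter:
    """Wraps an objstorage: reads pass through, writes are rejected."""
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def get(self, sha1):
        return self.wrapped.get(sha1)

    def add(self, sha1, data):
        raise ReadOnlyError("this objstorage is read-only")

class Multiplexer:
    """Returns the object from the first backend that has it."""
    def __init__(self, backends):
        self.backends = backends

    def get(self, sha1):
        for backend in self.backends:
            try:
                return backend.get(sha1)
            except KeyError:
                continue
        raise KeyError(sha1)

# Hypothetical wiring, mirroring objstorage_src above:
# objstorage_src = Multiplexer([ReadOnlyFilter(banco), ReadOnlyFilter(uffizi)])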
Migrated from T1954 (view on Phabricator)