document DB encoding requirements
In a freshly created SWH DB with SQL_ASCII encoding and C ctype/collate, Git loading failed for me at the first revision ingestion, like this:
2018-01-06 19:19:35,719 9439 Sending 100000 revisions
Exception in thread Thread-2417:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-storage/swh/storage/db.py", line 185, in writer
    tblname, ', '.join(columns)), f)
psycopg2.DataError: unsupported Unicode escape sequence
DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
CONTEXT: JSON data, line 1: {"extra_headers": [["mergetag",...
COPY tmp_revision, line 540, column metadata: "{"extra_headers": [["mergetag", "object 7333b5aca412d6ad02667b5a513485838a91b136\ntype commit\ntag p..."
2018-01-06 19:19:40,757 9439 Loading failure, updating to `partial` status
Traceback (most recent call last):
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 896, in load
    self.store_data()
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 1001, in store_data
    self.send_batch_revisions(self.get_revisions())
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 681, in send_batch_revisions
    send_in_packets(revisions, self.send_revisions, packet_size)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 42, in send_in_packets
    sender(formatted_objects)
  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-loader-core/swh/loader/core/loader.py", line 450, in send_revisions
    self.storage.revision_add(revision_list)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-storage/swh/storage/storage.py", line 550, in revision_add
    db.revision_add_from_temp(cur)
  File "/home/zack/dati/projects/sw-heritage/git/swh-environment/swh-storage/swh/storage/db.py", line 38, in _meth
    self._cursor(cur).execute('SELECT %s()' % stored_proc)
psycopg2.InternalError: current transaction is aborted, commands ignored until end of transaction block
2018-01-06 19:19:40,766 9439 Updating origin_visit for origin 1 with status partial
2018-01-06 19:19:40,768 9439 Done updating origin_visit for origin 1 with status partial
{'status': 'failed'}
For comparison, the in-production DB has encoding UTF8 and C.UTF8 ctype/collate.
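As far as I can tell from the DETAIL line in the error, the trigger is a \uXXXX escape above 007F inside the JSON metadata column, which PostgreSQL only accepts when the server encoding is UTF8. Below is a minimal sketch for spotting such escapes in serialized JSON. It assumes the metadata is serialized with the defaults of Python's json.dumps (which escape non-ASCII characters as \uXXXX); I have not verified that this is what swh-storage does. non_ascii_escapes is a hypothetical helper, not an existing function.

```python
import json


def non_ascii_escapes(json_text):
    """Return the code points of all \\uXXXX escapes above 0x7F in json_text.

    Hypothetical helper; assumes json_text is valid JSON, so backslashes
    only occur inside string literals.
    """
    found = []
    i = 0
    while i < len(json_text):
        if json_text[i] == "\\" and i + 1 < len(json_text):
            if json_text[i + 1] == "u":
                # \uXXXX: four hex digits follow the 'u'.
                code = int(json_text[i + 2:i + 6], 16)
                if code > 0x7F:
                    found.append(code)
                i += 6
            else:
                # Any other two-character escape, including an
                # escaped backslash, is skipped as a unit.
                i += 2
        else:
            i += 1
    return found


# Made-up metadata for illustration, not the revision from the log above.
metadata = json.dumps({"extra_headers": [["mergetag", "tagger José"]]})
offending = non_ascii_escapes(metadata)
```

Escapes at or below 007F (and plain escapes such as \n) are not reported, which matches the limit named in the error message.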
Do we actually require a UTF8-encoded DB or, at least, a non-ASCII one?
If so, I'd like to update sql/bin/db-init accordingly and document this requirement.
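If UTF8 does turn out to be required, a check along these lines could make the loader fail early with a clear message instead of at the first COPY. This is only a sketch under assumptions: check_server_encoding is a hypothetical helper, not something that exists in swh-storage or in sql/bin/db-init, and cur is assumed to be a DB-API cursor (for example one obtained from psycopg2) connected to the target DB. SHOW server_encoding is a standard PostgreSQL command.

```python
def check_server_encoding(cur, required="UTF8"):
    """Raise RuntimeError unless the server encoding equals `required`.

    Hypothetical helper: `cur` is any DB-API cursor connected to the
    PostgreSQL database to check.
    """
    cur.execute("SHOW server_encoding")
    (encoding,) = cur.fetchone()
    if encoding != required:
        raise RuntimeError(
            "database encoding is %s, but %s is required" % (encoding, required)
        )
    return encoding
```

With a DB created as SQL_ASCII, as in the failing setup described above, this would raise before any revision is sent.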
Migrated from T918 (view on Phabricator)