Indexer - Retrieval error when a content is too big
When a content is large, supposedly above the 100 * 1024 * 1024 bytes (100 MiB) limit imposed on our loaders, the objstorage retrieval fails:
Oct 10 12:49:26 worker01.euwest.azure python3[15204]: [2017-10-10 12:49:26,600: INFO/Worker-1] sha1: b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
Oct 10 12:51:06 worker01.euwest.azure python3[15204]: [2017-10-10 12:51:06,094: ERROR/Worker-1] Problem when reading contents metadata.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/swh/indexer/indexer.py", line 216, in run
    raw_content = self.objstorage.get(sha1)
  File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/multiplexer_objstorage.py", line 134, in get
    return storage.get(obj_id)
  File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/filter.py", line 69, in get
    return self.storage.get(obj_id, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/multiplexer_objstorage.py", line 134, in get
    return storage.get(obj_id)
  File "/usr/lib/python3/dist-packages/swh/objstorage/multiplexer/filter/id_filter.py", line 56, in get
    return self.storage.get(*args, obj_id=obj_id, **kwargs)
  File "/usr/lib/python3/dist-packages/swh/objstorage/cloud/objstorage_azure.py", line 105, in get
    return gzip.decompress(blob.content)
  File "/usr/lib/python3.4/gzip.py", line 632, in decompress
    return f.read()
  File "/usr/lib/python3.4/gzip.py", line 360, in read
    while self._read(readsize):
  File "/usr/lib/python3.4/gzip.py", line 454, in _read
    self._add_read_data( uncompress )
  File "/usr/lib/python3.4/gzip.py", line 472, in _add_read_data
    self.extrabuf = self.extrabuf[offset:] + data
MemoryError
Oct 10 12:51:06 worker01.euwest.azure python3[15204]: [2017-10-10 12:51:06,099: WARNING/Worker-1] Rescheduling batch
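The traceback shows that the Azure objstorage backend decompresses the whole blob in one call, gzip.decompress(blob.content), so the entire decompressed content has to fit in the worker's memory. Below is a minimal sketch of an incremental alternative based on zlib; the function name and the chunked input are assumptions for illustration, not the actual objstorage API, and this only helps consumers that can process the data in chunks:

import zlib

def iter_gzip_decompress(chunks):
    """Decompress a gzip stream chunk by chunk, never holding the whole
    decompressed content in memory at once (unlike gzip.decompress)."""
    # 31 = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(31)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # flush whatever remains in zlib's internal buffers
    tail = decompressor.flush()
    if tail:
        yield tail

Feeding this generator the blob downloaded in, say, 4 MiB slices would let a consumer hash or analyse the content without ever materialising the whole decompressed file in memory.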
Here, the failing content is the one with sha1 b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'.
Converting that hash to its readable hexadecimal form:
$ python3
>>> h = b'\r5~\xe6\xb9\r\x86\nz\xb1\xa7S\x04\x03\xb3+\xbc\x97\x7f`'
>>> from swh.model import hashutil
>>> hashutil.hash_to_hex(h)
'0d357ee6b90d860a7ab1a7530403b32bbc977f60'
Checking its length in the storage, we see that it is indeed a very large file:
curl https://archive.softwareheritage.org/api/1/content/0d357ee6b90d860a7ab1a7530403b32bbc977f60/?fields=length
{"length":1707673600}
Migrated from T803 (view on Phabricator)