package.loader: Handle tarball download erroneously marked as gzipped
There are cases where a tarball to download is marked as gzipped in the Content-Encoding HTTP response header while in fact it is not.
So handle the ContentDecodingError exception that can be raised by the download function: try to download the tarball's raw bytes again without attempting to decompress the input stream.
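The fallback described above can be sketched roughly as follows. This is a minimal illustration and not the loader's actual code: the `fetch_to_file` name, the `CHUNK_SIZE` value and the optional `session` parameter are assumptions made for the example. It relies only on `requests`' `iter_content` and on the `decode_content` argument of urllib3's `stream`.

```python
import requests
from requests.exceptions import ContentDecodingError

CHUNK_SIZE = 32768  # hypothetical value, chosen only for this sketch


def fetch_to_file(url, filepath, session=None):
    """Download `url` into `filepath` and return the number of bytes written.

    Hypothetical helper: if the body cannot be decoded according to its
    Content-Encoding header, request it again and store the raw bytes.
    """
    if session is None:
        session = requests.Session()

    def _write(chunks):
        written = 0
        # Mode "wb" truncates the file, discarding any partial first attempt.
        with open(filepath, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
                written += len(chunk)
        return written

    response = session.get(url, stream=True)
    response.raise_for_status()
    try:
        # iter_content() decodes the stream according to Content-Encoding.
        return _write(response.iter_content(chunk_size=CHUNK_SIZE))
    except ContentDecodingError:
        # The header claimed gzip but the payload is not gzip-encoded:
        # fetch again and read the raw stream without decoding it.
        response.close()
        response = session.get(url, stream=True)
        response.raise_for_status()
        return _write(response.raw.stream(CHUNK_SIZE, decode_content=False))
```

Issuing a second request, rather than resuming the first one, keeps the sketch simple: part of the first stream has already been consumed by the time decoding fails.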
Real-world example encountered:
swh-loader_1 | [2021-06-10 09:18:08,876: DEBUG/ForkPoolWorker-1] package_info: ArchivePackageInfo(url='http://www.columbia.edu/kermit/ftp/archives/cpm80.tar.gz', filename='cpm80.tar.gz', directory_extrinsic_metadata=[], raw_info={'url': 'http://www.columbia.edu/kermit/ftp/archives/cpm80.tar.gz', 'time': '2011-08-13T23:05:09', 'length': 1894400, 'version': 'cpm80'}, length=1894400, time='2011-08-13T23:05:09', version='cpm80')
swh-loader_1 | [2021-06-10 09:18:09,039: DEBUG/ForkPoolWorker-1] filename: cpm80.tar.gz
swh-loader_1 | [2021-06-10 09:18:09,039: DEBUG/ForkPoolWorker-1] filepath: /tmp/tmpqydd_7xw/cpm80.tar.gz
swh-loader_1 | [2021-06-10 09:18:09,044: ERROR/ForkPoolWorker-1] Failed loading branch releases/cpm80 for https://www.kermitproject.org/archive.html
swh-loader_1 | Traceback (most recent call last):
swh-loader_1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 401, in _decode
swh-loader_1 | data = self._decoder.decompress(data)
swh-loader_1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 88, in decompress
swh-loader_1 | ret += self._obj.decompress(data)
swh-loader_1 | zlib.error: Error -3 while decompressing data: incorrect header check
swh-loader_1 |
swh-loader_1 | During handling of the above exception, another exception occurred:
swh-loader_1 |
swh-loader_1 | Traceback (most recent call last):
swh-loader_1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 753, in generate
swh-loader_1 | for chunk in self.raw.stream(chunk_size, decode_content=True):
swh-loader_1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 576, in stream
swh-loader_1 | data = self.read(amt=amt, decode_content=decode_content)
swh-loader_1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 548, in read
swh-loader_1 | data = self._decode(data, decode_content, flush_decoder)
swh-loader_1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 407, in _decode
swh-loader_1 | e,
swh-loader_1 | urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
swh-loader_1 |
swh-loader_1 | During handling of the above exception, another exception occurred:
swh-loader_1 |
swh-loader_1 | Traceback (most recent call last):
swh-loader_1 | File "/src/swh-loader-core/swh/loader/package/loader.py", line 576, in load
swh-loader_1 | res = self._load_revision(p_info, origin)
swh-loader_1 | File "/src/swh-loader-core/swh/loader/package/loader.py", line 713, in _load_revision
swh-loader_1 | dl_artifacts = self.download_package(p_info, tmpdir)
swh-loader_1 | File "/src/swh-loader-core/swh/loader/package/loader.py", line 364, in download_package
swh-loader_1 | return [download(p_info.url, dest=tmpdir, filename=p_info.filename)]
swh-loader_1 | File "/src/swh-loader-core/swh/loader/package/utils.py", line 93, in download
swh-loader_1 | for chunk in response.iter_content(chunk_size=HASH_BLOCK_SIZE):
swh-loader_1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 758, in generate
swh-loader_1 | raise ContentDecodingError(e)
swh-loader_1 | requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
Migrated from D5852 (view on Phabricator)