package.loader: Handle tarball download erroneously marked as gzipped

There are cases where a tarball to download is marked as gzipped in the Content-Encoding HTTP response header when in fact it is not.

So handle the ContentDecodingError exception that can be raised by the download method: try to download the tarball's raw bytes again, without attempting to decompress the input stream.
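A minimal sketch of that retry, not the actual change made in swh/loader/package/utils.py. The function name `download_with_raw_fallback`, the `get` parameter (there only so the HTTP call can be swapped out) and the `CHUNK_SIZE` value are assumptions made for illustration. It relies on `requests` streaming plus urllib3's `raw.stream(..., decode_content=False)`, which yields the body bytes without applying the advertised Content-Encoding.

```python
import requests
from requests.exceptions import ContentDecodingError

CHUNK_SIZE = 32768  # hypothetical value, stands in for the loader's block size


def _write_chunks(chunks, dest_path):
    """Write an iterable of byte chunks to dest_path, truncating any old content."""
    with open(dest_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)


def download_with_raw_fallback(url, dest_path, get=requests.get):
    """Download url to dest_path, retrying without decoding on a bogus gzip header.

    Hypothetical helper: `get` defaults to requests.get and is only a parameter
    so that the HTTP layer can be replaced in tests.
    """
    response = get(url, stream=True)
    response.raise_for_status()
    try:
        # First attempt: let requests decode according to Content-Encoding.
        _write_chunks(response.iter_content(chunk_size=CHUNK_SIZE), dest_path)
    except ContentDecodingError:
        # The server claimed gzip but the body is not gzipped: fetch it again
        # and copy the raw bytes, skipping any decompression.
        response.close()
        response = get(url, stream=True)
        response.raise_for_status()
        _write_chunks(
            response.raw.stream(CHUNK_SIZE, decode_content=False), dest_path
        )
    return dest_path
```

Reopening the destination in `"wb"` mode on the second attempt discards whatever partial data the failed first attempt left behind. Any hashing done while streaming would likewise need to be restarted on retry.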

Real-world example encountered:

swh-loader_1                    | [2021-06-10 09:18:08,876: DEBUG/ForkPoolWorker-1] package_info: ArchivePackageInfo(url='http://www.columbia.edu/kermit/ftp/archives/cpm80.tar.gz', filename='cpm80.tar.gz', directory_extrinsic_metadata=[], raw_info={'url': 'http://www.columbia.edu/kermit/ftp/archives/cpm80.tar.gz', 'time': '2011-08-13T23:05:09', 'length': 1894400, 'version': 'cpm80'}, length=1894400, time='2011-08-13T23:05:09', version='cpm80')
swh-loader_1                    | [2021-06-10 09:18:09,039: DEBUG/ForkPoolWorker-1] filename: cpm80.tar.gz
swh-loader_1                    | [2021-06-10 09:18:09,039: DEBUG/ForkPoolWorker-1] filepath: /tmp/tmpqydd_7xw/cpm80.tar.gz
swh-loader_1                    | [2021-06-10 09:18:09,044: ERROR/ForkPoolWorker-1] Failed loading branch releases/cpm80 for https://www.kermitproject.org/archive.html
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 401, in _decode
swh-loader_1                    |     data = self._decoder.decompress(data)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 88, in decompress
swh-loader_1                    |     ret += self._obj.decompress(data)
swh-loader_1                    | zlib.error: Error -3 while decompressing data: incorrect header check
swh-loader_1                    | 
swh-loader_1                    | During handling of the above exception, another exception occurred:
swh-loader_1                    | 
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 753, in generate
swh-loader_1                    |     for chunk in self.raw.stream(chunk_size, decode_content=True):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 576, in stream
swh-loader_1                    |     data = self.read(amt=amt, decode_content=decode_content)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 548, in read
swh-loader_1                    |     data = self._decode(data, decode_content, flush_decoder)
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/urllib3/response.py", line 407, in _decode
swh-loader_1                    |     e,
swh-loader_1                    | urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
swh-loader_1                    | 
swh-loader_1                    | During handling of the above exception, another exception occurred:
swh-loader_1                    | 
swh-loader_1                    | Traceback (most recent call last):
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 576, in load
swh-loader_1                    |     res = self._load_revision(p_info, origin)
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 713, in _load_revision
swh-loader_1                    |     dl_artifacts = self.download_package(p_info, tmpdir)
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 364, in download_package
swh-loader_1                    |     return [download(p_info.url, dest=tmpdir, filename=p_info.filename)]
swh-loader_1                    |   File "/src/swh-loader-core/swh/loader/package/utils.py", line 93, in download
swh-loader_1                    |     for chunk in response.iter_content(chunk_size=HASH_BLOCK_SIZE):
swh-loader_1                    |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/requests/models.py", line 758, in generate
swh-loader_1                    |     raise ContentDecodingError(e)
swh-loader_1                    | requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))
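The innermost `zlib.error` in this traceback can be reproduced with the standard library alone. The sketch below builds a small uncompressed tar archive in memory and feeds it to a gzip-mode zlib decompressor, which is roughly what happens when a plain tarball is served with `Content-Encoding: gzip`. The helper names and the archive contents are made up for illustration and are unrelated to the cpm80 archive.

```python
import io
import tarfile
import zlib


def make_plain_tar() -> bytes:
    """Build a tiny, uncompressed tar archive in memory (illustrative data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"hello\n"
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def try_gunzip(data: bytes) -> bytes:
    """Decode data as gzip; 16 + MAX_WBITS tells zlib to expect a gzip header."""
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decoder.decompress(data)


plain_tar = make_plain_tar()

# A real gzip stream starts with the magic bytes 1f 8b; a plain tar does not.
is_gzip = plain_tar[:2] == b"\x1f\x8b"

try:
    try_gunzip(plain_tar)
    decode_error = None
except zlib.error as exc:
    # Same failure class as the innermost exception in the traceback above.
    decode_error = exc

print(is_gzip, decode_error)
```

Checking the two magic bytes is a cheap way to confirm, after the fact, whether a downloaded file is really gzipped regardless of what the response headers or the `.tar.gz` extension claim.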

Migrated from D5852 (view on Phabricator)
