
tarball: Prefer to rely on mime type first to guess archive format

Instead of guessing the archive format from the file extension first and falling back to the MIME type, do the opposite: the MIME type is more reliable, especially when an archive file has an incorrect extension.

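The CRAN loader hits this edge case with an archive named `bats_0.1-1.tar.gz` whose content is actually `compress` data (`application/x-compress`), as the log and shell session below show: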

docker-swh-loader-1  | [2023-09-06 10:24:08,480: DEBUG/ForkPoolWorker-14] package_info: CRANPackageInfo(url='https://cran.r-project.org/src/contrib/Archive/bats/bats_0.1-1.tar.gz', filename='bats_0.1-1.tar.gz', version='0.1-1', directory_extrinsic_metadata=[], checksums={'length': 10240}, raw_info={'url': 'https://cran.r-project.org/src/contrib/Archive/bats/bats_0.1-1.tar.gz', 'package': 'bats', 'version': '0.1-1', 'checksums': {'length': 10240}}, name='bats')
docker-swh-loader-1  | [2023-09-06 10:24:08,640: DEBUG/ForkPoolWorker-14] filename: bats_0.1-1.tar.gz
docker-swh-loader-1  | [2023-09-06 10:24:08,640: DEBUG/ForkPoolWorker-14] filepath: /tmp/tmp99d_52t0/bats_0.1-1.tar.gz
docker-swh-loader-1  | [2023-09-06 10:24:08,641: DEBUG/ForkPoolWorker-14] extrinsic_metadata
docker-swh-loader-1  | [2023-09-06 10:24:08,651: ERROR/ForkPoolWorker-14] Failed to load branch releases/0.1-1 for https://cran.r-project.org/package=bats
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/shutil.py", line 932, in _unpack_tarfile
docker-swh-loader-1  |     tarobj = tarfile.open(filename)
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/tarfile.py", line 1580, in open
docker-swh-loader-1  |     raise ReadError("file could not be opened successfully")
docker-swh-loader-1  | tarfile.ReadError: file could not be opened successfully
docker-swh-loader-1  | 
docker-swh-loader-1  | During handling of the above exception, another exception occurred:
docker-swh-loader-1  | 
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/tarball.py", line 161, in uncompress
docker-swh-loader-1  |     shutil.unpack_archive(tarpath, extract_dir=dest, format=format)
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/shutil.py", line 993, in unpack_archive
docker-swh-loader-1  |     func(filename, extract_dir, **dict(format_info[2]))
docker-swh-loader-1  |   File "/usr/local/lib/python3.7/shutil.py", line 935, in _unpack_tarfile
docker-swh-loader-1  |     "%s is not a compressed or uncompressed tar file" % filename)
docker-swh-loader-1  | shutil.ReadError: /tmp/tmp99d_52t0/bats_0.1-1.tar.gz is not a compressed or uncompressed tar file
docker-swh-loader-1  | 
docker-swh-loader-1  | During handling of the above exception, another exception occurred:
docker-swh-loader-1  | 
docker-swh-loader-1  | Traceback (most recent call last):
docker-swh-loader-1  |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 691, in load
docker-swh-loader-1  |     res = self._load_release(p_info, origin)
docker-swh-loader-1  |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 878, in _load_release
docker-swh-loader-1  |     (uncompressed_path, directory) = self._load_directory(dl_artifacts, tmpdir)
docker-swh-loader-1  |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 829, in _load_directory
docker-swh-loader-1  |     uncompressed_path = self.uncompress(dl_artifacts, dest=tmpdir)
docker-swh-loader-1  |   File "/src/swh-loader-core/swh/loader/package/loader.py", line 452, in uncompress
docker-swh-loader-1  |     uncompress(a_path, dest=uncompressed_path)
docker-swh-loader-1  |   File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/tarball.py", line 163, in uncompress
docker-swh-loader-1  |     raise ValueError(f"Problem during unpacking {tarpath}. Reason: {e}")
docker-swh-loader-1  | ValueError: Problem during unpacking /tmp/tmp99d_52t0/bats_0.1-1.tar.gz. Reason: /tmp/tmp99d_52t0/bats_0.1-1.tar.gz is not a compressed or uncompressed tar file
(swh) anlambert@carnavalet:/tmp$ curl -LO https://cran.r-project.org/src/contrib/Archive/bats/bats_0.1-1.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10240  100 10240    0     0   103k      0 --:--:-- --:--:-- --:--:--  104k
(swh) anlambert@carnavalet:/tmp$ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shutil
>>> shutil.unpack_archive("bats_0.1-1.tar.gz", extract_dir="/tmp")
Traceback (most recent call last):
  File "/usr/lib/python3.11/shutil.py", line 1239, in _unpack_tarfile
    tarobj = tarfile.open(filename)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/tarfile.py", line 1639, in open
    raise ReadError(f"file could not be opened successfully:\n{error_msgs_summary}")
tarfile.ReadError: file could not be opened successfully:
- method gz: ReadError('not a gzip file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('invalid header')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/shutil.py", line 1316, in unpack_archive
    func(filename, extract_dir, **kwargs)
  File "/usr/lib/python3.11/shutil.py", line 1241, in _unpack_tarfile
    raise ReadError(
shutil.ReadError: bats_0.1-1.tar.gz is not a compressed or uncompressed tar file
>>> 
(swh) anlambert@carnavalet:/tmp$ file --mime bats_0.1-1.tar.gz 
bats_0.1-1.tar.gz: application/x-compress; charset=binary
(swh) anlambert@carnavalet:/tmp$ tar -xvzf bats_0.1-1.tar.gz 
bats/
bats/R/
bats/R/ar.R
bats/R/fixes.R
bats/R/spectrum.R
bats/R/acf.R
bats/DESCRIPTION
bats/COMPAT
bats/README
bats/INDEX
bats/TITLE
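
The detection order described above can be illustrated with a minimal, stdlib-only sketch. It is not the code from this merge request: the function name `guess_archive_format`, the two lookup tables, and the `tar.Z` label are assumptions made for illustration, and it sniffs leading magic bytes directly rather than querying a MIME type library. It also assumes that a compressed stream wraps a tar archive.

```python
# Leading magic bytes -> format label. "gztar", "bztar", "xztar", "zip" and
# "tar" are shutil's built-in format names; "tar.Z" is a hypothetical label
# used only in this sketch.
MAGIC_TO_FORMAT = [
    (b"\x1f\x8b", "gztar"),       # gzip
    (b"\x1f\x9d", "tar.Z"),       # Unix compress (application/x-compress)
    (b"BZh", "bztar"),            # bzip2
    (b"\xfd7zXZ\x00", "xztar"),   # xz
    (b"PK\x03\x04", "zip"),       # zip local file header
]

# Fallback table, consulted only when the content is not recognized.
EXTENSION_TO_FORMAT = {
    ".tar.gz": "gztar",
    ".tgz": "gztar",
    ".tar.Z": "tar.Z",
    ".tar.bz2": "bztar",
    ".tar.xz": "xztar",
    ".tar": "tar",
    ".zip": "zip",
}


def guess_archive_format(path):
    """Return a format label for path, trusting content over extension."""
    with open(path, "rb") as f:
        header = f.read(512)

    # 1. Content first: compare the leading bytes with known signatures.
    for magic, fmt in MAGIC_TO_FORMAT:
        if header.startswith(magic):
            return fmt

    # An uncompressed POSIX tar has "ustar" at offset 257 of its first block.
    if header[257:262] == b"ustar":
        return "tar"

    # 2. Extension second: only used when the content gave no answer.
    for ext, fmt in EXTENSION_TO_FORMAT.items():
        if path.endswith(ext):
            return fmt

    return None
```

With this ordering, a file named `bats_0.1-1.tar.gz` that starts with the `compress` signature is labelled `tar.Z` instead of `gztar`. Passing such a label as `format=` to `shutil.unpack_archive` would additionally require an unpacker registered under that name, since `shutil` has no built-in support for `compress` data.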
