tarball: Prefer to rely on mime type first to guess archive format
Instead of guessing the archive format from the file extension first and falling back to the mime type, do the opposite: the mime type is more reliable, especially when an archive file has an incorrect extension.
The CRAN loader hits this edge case, as the log and shell session below show:
docker-swh-loader-1 | [2023-09-06 10:24:08,480: DEBUG/ForkPoolWorker-14] package_info: CRANPackageInfo(url='https://cran.r-project.org/src/contrib/Archive/bats/bats_0.1-1.tar.gz', filename='bats_0.1-1.tar.gz', version='0.1-1', directory_extrinsic_metadata=[], checksums={'length': 10240}, raw_info={'url': 'https://cran.r-project.org/src/contrib/Archive/bats/bats_0.1-1.tar.gz', 'package': 'bats', 'version': '0.1-1', 'checksums': {'length': 10240}}, name='bats')
docker-swh-loader-1 | [2023-09-06 10:24:08,640: DEBUG/ForkPoolWorker-14] filename: bats_0.1-1.tar.gz
docker-swh-loader-1 | [2023-09-06 10:24:08,640: DEBUG/ForkPoolWorker-14] filepath: /tmp/tmp99d_52t0/bats_0.1-1.tar.gz
docker-swh-loader-1 | [2023-09-06 10:24:08,641: DEBUG/ForkPoolWorker-14] extrinsic_metadata
docker-swh-loader-1 | [2023-09-06 10:24:08,651: ERROR/ForkPoolWorker-14] Failed to load branch releases/0.1-1 for https://cran.r-project.org/package=bats
docker-swh-loader-1 | Traceback (most recent call last):
docker-swh-loader-1 | File "/usr/local/lib/python3.7/shutil.py", line 932, in _unpack_tarfile
docker-swh-loader-1 | tarobj = tarfile.open(filename)
docker-swh-loader-1 | File "/usr/local/lib/python3.7/tarfile.py", line 1580, in open
docker-swh-loader-1 | raise ReadError("file could not be opened successfully")
docker-swh-loader-1 | tarfile.ReadError: file could not be opened successfully
docker-swh-loader-1 |
docker-swh-loader-1 | During handling of the above exception, another exception occurred:
docker-swh-loader-1 |
docker-swh-loader-1 | Traceback (most recent call last):
docker-swh-loader-1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/tarball.py", line 161, in uncompress
docker-swh-loader-1 | shutil.unpack_archive(tarpath, extract_dir=dest, format=format)
docker-swh-loader-1 | File "/usr/local/lib/python3.7/shutil.py", line 993, in unpack_archive
docker-swh-loader-1 | func(filename, extract_dir, **dict(format_info[2]))
docker-swh-loader-1 | File "/usr/local/lib/python3.7/shutil.py", line 935, in _unpack_tarfile
docker-swh-loader-1 | "%s is not a compressed or uncompressed tar file" % filename)
docker-swh-loader-1 | shutil.ReadError: /tmp/tmp99d_52t0/bats_0.1-1.tar.gz is not a compressed or uncompressed tar file
docker-swh-loader-1 |
docker-swh-loader-1 | During handling of the above exception, another exception occurred:
docker-swh-loader-1 |
docker-swh-loader-1 | Traceback (most recent call last):
docker-swh-loader-1 | File "/src/swh-loader-core/swh/loader/package/loader.py", line 691, in load
docker-swh-loader-1 | res = self._load_release(p_info, origin)
docker-swh-loader-1 | File "/src/swh-loader-core/swh/loader/package/loader.py", line 878, in _load_release
docker-swh-loader-1 | (uncompressed_path, directory) = self._load_directory(dl_artifacts, tmpdir)
docker-swh-loader-1 | File "/src/swh-loader-core/swh/loader/package/loader.py", line 829, in _load_directory
docker-swh-loader-1 | uncompressed_path = self.uncompress(dl_artifacts, dest=tmpdir)
docker-swh-loader-1 | File "/src/swh-loader-core/swh/loader/package/loader.py", line 452, in uncompress
docker-swh-loader-1 | uncompress(a_path, dest=uncompressed_path)
docker-swh-loader-1 | File "/srv/softwareheritage/venv/lib/python3.7/site-packages/swh/core/tarball.py", line 163, in uncompress
docker-swh-loader-1 | raise ValueError(f"Problem during unpacking {tarpath}. Reason: {e}")
docker-swh-loader-1 | ValueError: Problem during unpacking /tmp/tmp99d_52t0/bats_0.1-1.tar.gz. Reason: /tmp/tmp99d_52t0/bats_0.1-1.tar.gz is not a compressed or uncompressed tar file
(swh) anlambert@carnavalet:/tmp$ curl -LO https://cran.r-project.org/src/contrib/Archive/bats/bats_0.1-1.tar.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 10240 100 10240 0 0 103k 0 --:--:-- --:--:-- --:--:-- 104k
(swh) anlambert@carnavalet:/tmp$ python
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shutil
>>> shutil.unpack_archive("bats_0.1-1.tar.gz", extract_dir="/tmp")
Traceback (most recent call last):
File "/usr/lib/python3.11/shutil.py", line 1239, in _unpack_tarfile
tarobj = tarfile.open(filename)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/tarfile.py", line 1639, in open
raise ReadError(f"file could not be opened successfully:\n{error_msgs_summary}")
tarfile.ReadError: file could not be opened successfully:
- method gz: ReadError('not a gzip file')
- method bz2: ReadError('not a bzip2 file')
- method xz: ReadError('not an lzma file')
- method tar: ReadError('invalid header')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.11/shutil.py", line 1316, in unpack_archive
func(filename, extract_dir, **kwargs)
File "/usr/lib/python3.11/shutil.py", line 1241, in _unpack_tarfile
raise ReadError(
shutil.ReadError: bats_0.1-1.tar.gz is not a compressed or uncompressed tar file
>>>
(swh) anlambert@carnavalet:/tmp$ file --mime bats_0.1-1.tar.gz
bats_0.1-1.tar.gz: application/x-compress; charset=binary
(swh) anlambert@carnavalet:/tmp$ tar -xvzf bats_0.1-1.tar.gz
bats/
bats/R/
bats/R/ar.R
bats/R/fixes.R
bats/R/spectrum.R
bats/R/acf.R
bats/DESCRIPTION
bats/COMPAT
bats/README
bats/INDEX
bats/TITLE
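The session above shows a file named `.tar.gz` whose content is actually `compress` data (`application/x-compress`), which `tar` handles but Python's `tarfile` does not. As a rough illustration of the "content first, extension second" ordering, here is a minimal sketch. It is not the code from this merge request: the function names and returned labels are hypothetical, and sniffing the leading signature bytes with the standard library stands in for a real mime type lookup.

```python
import os

# Well-known leading signature bytes of common compression containers.
_SIGNATURES = (
    (b"\x1f\x8b", "gzip"),
    (b"\x1f\x9d", "compress"),  # Unix compress (.Z), i.e. application/x-compress
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
)

# Fallback mapping, used only when the content is not recognized.
_EXTENSIONS = (
    (".gz", "gzip"),
    (".tgz", "gzip"),
    (".bz2", "bzip2"),
    (".xz", "xz"),
    (".Z", "compress"),
    (".zip", "zip"),
    (".tar", "tar"),
)


def _format_from_content(path):
    """Return a format label from the file's first bytes, or None."""
    with open(path, "rb") as f:
        head = f.read(512)
    for magic, label in _SIGNATURES:
        if head.startswith(magic):
            return label
    # Uncompressed POSIX tar archives carry "ustar" at offset 257.
    if head[257:262] == b"ustar":
        return "tar"
    return None


def _format_from_extension(path):
    """Return a format label from the file name, or None."""
    name = os.path.basename(path)
    for suffix, label in _EXTENSIONS:
        if name.endswith(suffix):
            return label
    return None


def guess_compression(path):
    """Trust the content first; use the extension only as a fallback."""
    return _format_from_content(path) or _format_from_extension(path)
```

With this ordering, a file like `bats_0.1-1.tar.gz` that starts with the `compress` signature is labeled `compress` despite its `.gz` suffix, so the caller can pick an unpacker that actually supports it.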
Jenkins job DCORE/gitlab-builds #115 succeeded.
See Console Output and Coverage Report for more details.