Some contents on S3 are full of null bytes
While working on a dataset of README files, I found that many (all?) objects with hashes from somewhere around 0004fd to somewhere around 00050f are missing.
Here are all missing README files from 000000 to 00f000:
00050e57f152c98df806abe5f4c9df16c0384357 has null bytes instead of magic value
00050e94a3aab5b6ed2ac2f61f32bdb8e0e49c34 has null bytes instead of magic value
00050f766a8878e80debad446ef9332db340b24c has null bytes instead of magic value
0004fd7b735d8f8bacc02f11e82561493e87180e has null bytes instead of magic value
0004fdab20c8c65b4b50d51e97e44d57c9873e04 has null bytes instead of magic value
0004fdbcbafdd89e08fb08a6a00f32af99cc8a00 has null bytes instead of magic value
0004fdc1cb527ec449a006f281b6938203df1caa has null bytes instead of magic value
0004fe31617a71fc8c47cab6d63ae82d2fe24ae3 has null bytes instead of magic value
0004fe7a30089658b273a8d3b41dc0c633dbc2f3 has null bytes instead of magic value
0004feec3b1b8081f6b08514cb21587139003ea6 has null bytes instead of magic value
0004ff0ab8bbf403d89baebfd0d828faf636a94d has null bytes instead of magic value
0004ff4e2f9503b2a1318bc604d88de1ded01daf has null bytes instead of magic value
0004ff858a116ee7be07a9b9a673a3d9d406839a has null bytes instead of magic value
0004ffd59cfda07fd5aa89f22fbeb44db33aac0e has null bytes instead of magic value
0004ffdc7591f37f89430ca3d18288e5da46d703 has null bytes instead of magic value
00050d22ca288a955bb2838cd0c6691436595222 has null bytes instead of magic value
00050dac0925c07ac6383bdb2a864fe9d13a65d3 has null bytes instead of magic value
00050daf0cfd07e7c3df89e1b5abb59d93aa58c0 has null bytes instead of magic value
00050daff45590ad389173396ca4f5f950f3e38d has null bytes instead of magic value
For example:
$ curl https://softwareheritage.s3.amazonaws.com/content/00050e57f152c98df806abe5f4c9df16c0384357 | hexdump
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    58  100    58    0     0     94      0 --:--:-- --:--:-- --:--:--    94
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000030 0000 0000 0000 0000 0000
000003a
We see here that the object is 58 null bytes (the final hexdump offset, 0x3a, is 58), instead of the gzipped content of https://archive.softwareheritage.org/browse/content/sha1:00050e57f152c98df806abe5f4c9df16c0384357/raw/
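To make the same check without reading a hexdump by eye, here is a minimal sketch that classifies an already-downloaded object. It assumes the "magic value" in the messages above is the gzip magic number (bytes 1f 8b), since the objects are expected to be gzipped; the function name classify_object is made up for this example.

```shell
# classify_object FILE
# Prints "null" if FILE is non-empty and contains only null bytes,
# "gzip" if it starts with the gzip magic number (1f 8b), "other" otherwise.
classify_object() {
    file=$1
    # First two bytes as lowercase hex, e.g. "1f8b" for a gzip stream.
    magic=$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')
    # Number of bytes left once every null byte has been removed.
    nonnull=$(tr -d '\000' < "$file" | wc -c)
    if [ -s "$file" ] && [ "$nonnull" -eq 0 ]; then
        echo null
    elif [ "$magic" = "1f8b" ]; then
        echo gzip
    else
        echo other
    fi
}
```

For the object above, saving it with curl -s -o obj followed by classify_object obj should print "null" for as long as the object is in this state.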
Interestingly, 58 bytes is also the size of that content when compressed with GNU gzip and the default options:
$ curl -s https://archive.softwareheritage.org/browse/content/sha1:00050e57f152c98df806abe5f4c9df16c0384357/raw/ | gzip | wc -c
58
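To see whether the same size coincidence holds for the other affected hashes, the comparison can be wrapped in a small helper. This is only a sketch: matches_gzip_size is a hypothetical name, and it works on two local files (the raw content and the S3 object) so that the downloads stay separate from the check.

```shell
# matches_gzip_size RAW_FILE S3_FILE
# Succeeds (exit status 0) when S3_FILE is exactly as large as RAW_FILE
# compressed with gzip and its default options.
matches_gzip_size() {
    expected=$(gzip -c < "$1" | wc -c)
    actual=$(wc -c < "$2")
    [ "$expected" -eq "$actual" ]
}

# Example usage, with the two URLs shown above:
#   h=00050e57f152c98df806abe5f4c9df16c0384357
#   curl -s -o raw "https://archive.softwareheritage.org/browse/content/sha1:$h/raw/"
#   curl -s -o obj "https://softwareheritage.s3.amazonaws.com/content/$h"
#   matches_gzip_size raw obj && echo "$h: same size as gzip output"
```

If the sizes match for every affected object, that would suggest the objects were written with the right length but without their data. That is a guess, not something the output above proves.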