analyze bogus mimetype values in content_mimetype table
I've extracted some mimetype stats from content_mimetypes: mimetype-stats.txt
Some obviously bogus values stand out, most likely due to bugs in the relevant swh-indexer component, e.g.:
$ grep '\[' mimetype-stats.txt
[ [application/octet-stream|27805
[ [ [application/octet-stream|5068
[application/x-archive [ [ [|4496
[application/x-archive [|2843
[application/x-archive [ [|2610
[ [ [ [application/octet-stream|2602
[application/octet-stream|1122
[application/x-archive [application/x-archive|677
[application/x-archive [application/x-archive [ [|245
[application/x-archive [application/x-archive [application/x-archive [application/x-archive|176
[application/x-archive [application/x-archive [|153
[application/x-archive [application/x-archive [application/x-archive|127
[application/x-archive [application/x-archive [application/x-archive [|63
[application/x-archive|53
[application/x-archive [ [ [application/x-archive|50
[application/x-archive [application/x-archive [ [application/x-archive|33
[ [application/x-archive|25
[ [ [ [application/x-archive|25
[ [ [application/x-archive|14
[application/x-archive [ [application/x-archive|3
The pipe '|' here is the separator between the mimetype value and the count of contents with that mimetype. All the '[' characters, however, are part of the mimetype value itself, which looks wrong.
There might be other bogus values in the stats that I haven't noticed.
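To look for other bogus values beyond the ones containing '[', one option is to check every value in the stats file against the expected "type/subtype" shape. The following is only a rough sketch: it assumes each line has the form "value|count" as in the grep output above, and the names MIMETYPE_RE, parse_line and find_bogus are hypothetical helpers, not part of swh-indexer. The pattern uses the restricted-name characters from RFC 6838, so a value carrying parameters (e.g. "; charset=...") would also be flagged.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a well-formed value is a bare "type/subtype" built from
# RFC 6838 restricted-name characters, with no parameters attached.
_NAME = r"[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*"
MIMETYPE_RE = re.compile(rf"^{_NAME}/{_NAME}$")


def parse_line(line: str) -> Tuple[str, int]:
    """Split a 'value|count' line on the last pipe."""
    value, _, count = line.rstrip("\n").rpartition("|")
    return value, int(count)


def find_bogus(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (value, count) for values that do not look like type/subtype."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        value, count = parse_line(line)
        if not MIMETYPE_RE.match(value):
            yield value, count


if __name__ == "__main__":
    # A few lines copied from the grep output above, plus one valid value.
    sample = [
        "[ [application/octet-stream|27805",
        "[application/x-archive [|2843",
        "text/plain|10",
    ]
    for value, count in find_bogus(sample):
        print(f"{count:>8}  {value!r}")
```

Running find_bogus over the lines of mimetype-stats.txt instead of the sample would list every value that fails the pattern, which could then be reviewed by hand.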
Migrated from T817 (view on Phabricator)