Using file extensions to exclude commits
We can already compute the most popular name of each content (overwhelmingly often, there is only one). That name easily gives us an extension, which is a proxy for the programming language (or a small set of programming languages).
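To make the name-to-extension step concrete, here is a minimal sketch. It assumes the names of a content are available as a plain list of strings; the function name `most_popular_extension` is hypothetical and not part of any existing Software Heritage API.

```python
from collections import Counter
from pathlib import PurePosixPath


def most_popular_extension(names):
    """Return the lowercased extension of the most frequent name, or None.

    `names` is assumed to be a list with one entry per occurrence of the
    content (hypothetical input shape, for illustration only).
    """
    if not names:
        return None
    # Pick the name that occurs most often for this content.
    name, _count = Counter(names).most_common(1)[0]
    # Names without an extension (e.g. "Makefile") yield an empty suffix.
    suffix = PurePosixPath(name).suffix
    return suffix.lower() or None
```

Names with no extension return `None`, so such contents would simply be left out of any extension-based filtering.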
As part of the provenance index, we compute the date of first occurrence of each content, but some of these dates are incorrect, for example because of buggy import scripts.
@rdicosmo just pointed out that we could exclude all revisions with a timestamp before the creation date of the corresponding programming language, e.g. obtained through GitHub Linguist's database (to map extensions to languages) and http://hopl.info/ (to map languages to creation dates).
Or, as a less radical approach, we could use this to flag surprising dates, and then a human can take a look and decide whether to exclude them.
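A rough sketch of the flagging variant could look like the following. The two lookup tables are hand-written placeholders with illustrative values only; real ones would have to be derived from GitHub Linguist and hopl.info. The function names are hypothetical as well. When an extension maps to several languages, the sketch uses the oldest one, so it only flags dates that are implausible for every candidate language.

```python
from datetime import datetime, timezone

# Placeholder data (assumption): illustrative values, not authoritative.
EXTENSION_TO_LANGUAGES = {".rs": ["Rust"], ".go": ["Go"], ".h": ["C", "C++"]}
LANGUAGE_CREATION_YEAR = {"Rust": 2010, "Go": 2009, "C": 1972, "C++": 1985}


def earliest_plausible_date(extension):
    """Return the creation date of the oldest language using `extension`."""
    languages = EXTENSION_TO_LANGUAGES.get(extension, [])
    years = [
        LANGUAGE_CREATION_YEAR[lang]
        for lang in languages
        if lang in LANGUAGE_CREATION_YEAR
    ]
    if not years:
        return None  # unknown extension or language: nothing to compare to
    return datetime(min(years), 1, 1, tzinfo=timezone.utc)


def is_suspicious(extension, first_occurrence):
    """Flag a timezone-aware date that predates every candidate language."""
    threshold = earliest_plausible_date(extension)
    return threshold is not None and first_occurrence < threshold
```

For example, `is_suspicious(".rs", datetime(1970, 1, 1, tzinfo=timezone.utc))` is flagged, while an unknown extension is never flagged because there is no creation date to compare against.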
Another option would be to use the histogram described in swh/meta#5089 and detect outliers in it.