
Programming Language Identification

Description

The goal here is to identify programming languages used in each project and file...

One candidate tool: the possible good news is that AboutCode's ScanCode Toolkit (https://github.com/aboutcode-org/scancode-toolkit) knows how to identify licenses and apparently can also recognize programming languages. We could sync up with its maintainers (especially https://github.com/pombredanne) and see whether it scales, because scale is part of the challenge too.

I wonder if we could go beyond just detecting "Python" or "Go" and actually drill down to detect the libraries used (per file). But that seems to be more of a "research" topic.

Another thing: be careful with "low-resource" languages (less well known, with less training data). An intermediate use case is detecting whether a project contains UML/SysML models (apparently CEA wants to do that). I think there is also value in identifying "legacy" languages (a PhD student working on software modernization noticed that LLMs are not trained on many languages) or, say, domain-specific languages.

Side-note: heuristics such as file extensions and/or aggregating information included in README.md are worth looking at.
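To make the file-extension heuristic concrete, here is a minimal Python sketch. It is only an illustration, not a proposed implementation: the mapping table, the function names `guess_language` and `project_languages`, and the choice of extensions (including `.uml` and `.sysml` for models) are all assumptions made for this example. A real table would be far larger and would need to handle ambiguous extensions.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny mapping used only for illustration.
# The extensions chosen here (including the model ones) are assumptions.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".go": "Go",
    ".c": "C",
    ".h": "C",  # ambiguous in practice: could also be C++ or Objective-C
    ".cob": "COBOL",
    ".f90": "Fortran",
    ".uml": "UML model",
    ".sysml": "SysML model",
}


def guess_language(path):
    """Return a language guess based on the file extension, or None if unknown."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


def project_languages(paths):
    """Aggregate per-file guesses into per-project counts."""
    counts = Counter()
    for path in paths:
        language = guess_language(path)
        counts[language if language is not None else "unknown"] += 1
    return counts


if __name__ == "__main__":
    files = ["src/main.py", "src/util.py", "cmd/tool.go", "README.md"]
    for language, count in project_languages(files).most_common():
        print(f"{language}: {count}")
```

The "unknown" bucket is kept on purpose: files that the extension table cannot classify are the ones where content-based detection (or README information) would have to take over, which is likely where low-resource and legacy languages end up.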

Actors

DiverSE (Inria)

Specifications

Specifications and documentation related to Programming Language Identification are to be centralized in this directory:

https://gitlab.softwareheritage.org/teams/codecommons/cc-public-resources/-/tree/main/specifications/Programming%20Language%20Identification?ref_type=heads

Edited by Benoit Chauvet