Programming Language Identification
Description
The goal here is to identify programming languages used in each project and file...
We can use:
- GitHub tools (TODO: I haven't checked exactly which tools GitHub uses)
- StarCoder used https://github.com/yoeo/guesslang
The possible good news is that AboutCode, which knows how to identify licenses, can apparently also recognize programming languages: https://github.com/aboutcode-org/scancode-toolkit. We could sync up with them (especially https://github.com/pombredanne) and see whether it scales, since scale is part of the challenge too.
I wonder if we could go beyond just detecting "Python" or "Go" and actually drill down to detect the libraries used (per file). But that seems to be more of a "research" topic.
Another thing: be careful with "low-resource" languages (less well known, with less training data). An intermediate use case is detecting whether a project contains UML/SysML models (apparently CEA wants to do that...). I think there’s value in identifying "legacy" languages (a PhD student working on software modernization noticed that there are a lot of languages LLMs aren't trained on...) or, say, domain-specific languages.
Side note: heuristics such as the file extension and/or aggregating information included in README.md are worth looking at.
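To make the file-extension heuristic concrete, here is a minimal Python sketch that guesses a language per file and aggregates the counts per project. The mapping table, the function names `guess_language` and `project_languages`, and the `.uml` and `.cob` entries are all hypothetical and chosen for illustration only; none of them come from an existing tool.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny mapping (an assumption for illustration).
# A real table would be much larger and would have to handle ambiguous
# extensions such as ".h" (C or C++) or ".m" (Objective-C or MATLAB).
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".go": "Go",
    ".java": "Java",
    ".c": "C",
    ".rs": "Rust",
    ".cob": "COBOL",
    ".uml": "UML",
}


def guess_language(path):
    """Return a language name based on the file extension, or None if unknown."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


def project_languages(paths):
    """Count files per guessed language; unrecognized files are counted under None."""
    return Counter(guess_language(p) for p in paths)
```

Files counted under `None` are exactly those where the extension is not enough, which is where a content-based detector or README information would have to take over.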
Actors
DiverSE (Inria)
Specifications
Specifications and documentation related to Programming Language Identification are to be centralized in this directory: