leverage Shodan scans to find and ingest the "penumbra" of FOSS
A recent paper about //"the penumbra of open source"// uses an interesting approach to find Git repositories outside of the popular code hosting platforms. The authors leverage Shodan port scans and HTML fingerprints of code hosting web interfaces (e.g., <meta content="GitLab" property="og:site_name">) to identify self-hosted public Git repositories in the wild.
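To make the fingerprinting step concrete, here is a minimal sketch of how a fetched landing page could be checked for the `og:site_name` marker quoted above. It is not the paper's implementation: the `detect_forge` helper and the `KNOWN_SITE_NAMES` table are hypothetical names introduced for illustration, only the GitLab entry comes from the example in this task, and fetching candidate hosts (from Shodan or elsewhere) is left out entirely.

```python
from html.parser import HTMLParser

# Hypothetical fingerprint table: lowercase og:site_name value -> forge label.
# Only the GitLab entry is taken from the example in this task; a production
# table would need one verified entry per supported forge.
KNOWN_SITE_NAMES = {"gitlab": "gitlab"}


class _MetaCollector(HTMLParser):
    """Collects the content of every <meta property="og:site_name"> tag."""

    def __init__(self):
        super().__init__()
        self.site_names = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:site_name":
            content = attributes.get("content")
            if content:
                self.site_names.append(content)


def detect_forge(html):
    """Return a forge label if the page matches a known fingerprint, else None."""
    collector = _MetaCollector()
    collector.feed(html)
    for name in collector.site_names:
        forge = KNOWN_SITE_NAMES.get(name.strip().lower())
        if forge is not None:
            return forge
    return None


if __name__ == "__main__":
    page = '<html><head><meta content="GitLab" property="og:site_name"></head></html>'
    print(detect_forge(page))
```

A single meta tag is a weak signal, since instance operators can customize it, so a real lister would probably combine several fingerprints and then confirm a match through the forge's own API before scheduling any ingestion.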
Their results are promising, and they also show that the repositories found are valuable: they include several repositories, research ones among others, that are not found on either GitHub or Software Heritage.
We should explore the feasibility of replicating the same approach in production to increase our "long tail" of archived projects.
Migrated from T3475 (view on Phabricator)