Improve language indexer performance
The language indexer is slow due to the tool used underneath (pygments) and possibly the size of the contents.
To give some details, pygments is used for language detection since it's the tool that detects the most languages. The problem is that its API works only on text, not on bytes (and we deal with bytes), so we first need to detect the content's encoding and then decode it appropriately.
It has already been improved recently to detect the encoding incrementally (no task referencing this), but it's not enough.
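
For illustration, a minimal sketch of that flow (the function name and the use of chardet here are assumptions, not the actual indexer code): the whole content gets decoded and handed to pygments, so the cost grows with content size.

```python
import chardet
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

def detect_language(raw: bytes):
    # detect the encoding of the raw bytes, then decode the whole content
    encoding = chardet.detect(raw)["encoding"] or "utf-8"
    text = raw.decode(encoding, errors="replace")
    try:
        # pygments only accepts text, never bytes
        return guess_lexer(text).name
    except ClassNotFound:
        return None
```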
Hints (a combined sketch follows the list):
- take only the first 10 kB of the raw content (as a possible configuration option).
- take only a configurable percentage of the content (also a possible configuration option).
- reuse the encoding already detected by the mimetype indexer and pass that optional information along.
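
A rough sketch combining the three hints above; the parameter names (`max_bytes`, `ratio`, `encoding`) are hypothetical, not existing indexer options:

```python
import chardet
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound

def detect_language(raw: bytes,
                    encoding: str = None,    # hint 3: encoding from the mimetype indexer, if known
                    max_bytes: int = 10_000, # hint 1: cap on the raw content size
                    ratio: float = None):    # hint 2: optional fraction of the content to keep
    if ratio is not None:
        raw = raw[: max(1, int(len(raw) * ratio))]
    raw = raw[:max_bytes]
    if encoding is None:
        # fall back to detection only when the mimetype indexer gave nothing
        encoding = chardet.detect(raw)["encoding"] or "utf-8"
    text = raw.decode(encoding, errors="replace")
    try:
        return guess_lexer(text).name
    except ClassNotFound:
        return None
```

Truncating before decoding keeps both the encoding detection and the pygments call bounded regardless of the content's size.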
Migrated from T722 (view on Phabricator)