Simplify indexer design: move away from the pipeline approach

The pipeline approach does not scale well with regard to scheduling. We cannot easily schedule the indexer (it is actually scheduled through a fork of the main scheduler, but that is not a complete solution: the input still comes from a db extract).

By moving towards a range approach, we will be able to schedule a finite set of ranges (for content at least). Adding a new indexer will then just be a matter of adding one more task type and the same number of finite ranges.
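To make the "finite ranges" idea concrete, here is a minimal sketch of how the content id space could be cut into a fixed number of ranges, with one task per (task type, range) pair. Everything named here is hypothetical: `id_ranges`, `make_tasks`, the task dict layout and the task type names are illustrations, not the scheduler's actual API, and the sketch assumes content ids are 20-byte SHA1 digests.

```python
from typing import Dict, Iterator, List, Tuple

ID_BITS = 160  # assumption: content ids are 20-byte SHA1 digests


def id_ranges(nb_partitions: int, id_bits: int = ID_BITS) -> Iterator[Tuple[bytes, bytes]]:
    """Yield inclusive (start, end) bounds that together cover the whole id space."""
    total = 1 << id_bits
    if not 1 <= nb_partitions <= total:
        raise ValueError("nb_partitions must be between 1 and the size of the id space")
    size = id_bits // 8
    for i in range(nb_partitions):
        start = i * total // nb_partitions
        end = (i + 1) * total // nb_partitions - 1
        yield start.to_bytes(size, "big"), end.to_bytes(size, "big")


def make_tasks(task_types: List[str], nb_partitions: int, id_bits: int = ID_BITS) -> List[Dict]:
    """Build one task description per (task type, range) pair (hypothetical layout)."""
    return [
        {"type": task_type, "arguments": {"start": start.hex(), "end": end.hex()}}
        for task_type in task_types
        for start, end in id_ranges(nb_partitions, id_bits)
    ]
```

With this layout, adding a new indexer only adds one entry to `task_types`; the set of ranges stays the same.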

This implies, though:

  • changing the indexer's input from an arbitrary list of ids to a range of ids (swh/devel/swh-indexer#991 (closed))
  • removing the orchestrator approach
  • moving some logic into the indexers themselves (for example, the language, ctags and license indexers will each need to filter for textual content on their own).
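The points above could look roughly like the following sketch: an indexer that takes a range of ids as input and decides by itself which contents to process, with no orchestrator in front of it. The class names, the in-memory `contents` dict standing in for the storage, and the `looks_textual` heuristic are all assumptions made for illustration; they are not the actual swh-indexer classes.

```python
from typing import Dict, Iterator


def looks_textual(data: bytes) -> bool:
    """Crude heuristic, for illustration only: no NUL byte and valid UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


class RangeIndexer:
    """Hypothetical base class: index every content whose id lies in [start, end]."""

    def __init__(self, contents: Dict[bytes, bytes]):
        # Stand-in for the storage: maps content id -> raw data.
        self.contents = contents

    def filter(self, content_id: bytes, data: bytes) -> bool:
        """Each indexer overrides this to skip contents it cannot handle."""
        return True

    def index(self, content_id: bytes, data: bytes) -> dict:
        raise NotImplementedError

    def run(self, start: bytes, end: bytes) -> Iterator[dict]:
        """Index the contents of the inclusive range, in id order."""
        for content_id in sorted(self.contents):
            if start <= content_id <= end:
                data = self.contents[content_id]
                if self.filter(content_id, data):
                    yield self.index(content_id, data)


class LineCountIndexer(RangeIndexer):
    """Toy text-only indexer: the filtering lives in the indexer itself."""

    def filter(self, content_id: bytes, data: bytes) -> bool:
        return looks_textual(data)

    def index(self, content_id: bytes, data: bytes) -> dict:
        return {"id": content_id.hex(), "lines": data.count(b"\n")}
```

A text-based indexer such as language, ctags or license detection would override `filter` in the same way, so binary contents inside the range are simply skipped.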

Migrated from T1310 (view on Phabricator)
