Antoine Lambert authored
pyld is not correctly loading json-ld content from http*://schema.org
(see https://github.com/digitalbazaar/pyld/issues/154) and raises an
exception when attempting to compact a codemeta document having
schema.org in its @context list.

As a workaround, remove schema.org from the @context list of a codemeta
document before compacting it.
a8618d05
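
The workaround described in the commit message above can be sketched as follows. This is a minimal illustration, not the code from the commit: the function name `strip_schema_org_context` and the exact set of URL spellings in `SCHEMA_ORG_URLS` are assumptions made for the example.

```python
from typing import Any, Dict

# Assumed spellings of the schema.org context URL (http/https, with or
# without a trailing slash). The real commit may match a different set.
SCHEMA_ORG_URLS = {
    "http://schema.org",
    "http://schema.org/",
    "https://schema.org",
    "https://schema.org/",
}


def strip_schema_org_context(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of doc without schema.org in its @context list."""
    result = dict(doc)
    context = doc.get("@context")
    if isinstance(context, list):
        # Keep every entry except plain-string schema.org URLs.
        result["@context"] = [
            entry
            for entry in context
            if not (isinstance(entry, str) and entry in SCHEMA_ORG_URLS)
        ]
    return result


codemeta_doc = {
    "@context": ["https://doi.org/10.5063/schema/codemeta-2.0", "http://schema.org"],
    "name": "example",
}
cleaned_doc = strip_schema_org_context(codemeta_doc)
```

The cleaned document would then be the one handed to pyld for compaction; the original document is left unmodified.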

Software Heritage - Indexer

Tools to compute multiple indexes on SWH's raw contents:

  • content:
    • mimetype
    • fossology-license
    • metadata
  • origin:
    • metadata (intrinsic, using the content indexer; and extrinsic)

An indexer is in charge of:

  • looking up objects
  • extracting information from those objects
  • storing that information in the swh-indexer db

There are multiple indexers working on different object types:

  • content indexer: works with content sha1 hashes
  • revision indexer: works with revision sha1 hashes
  • origin indexer: works with origin identifiers

Indexation procedure:

  • receive a batch of ids
  • retrieve the associated data, depending on the object type
  • compute an index for each object
  • store the results in swh's storage
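
The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the swh-indexer API: `index_batch` and its `fetch`, `compute` and `store` callables are hypothetical names, and an in-memory dict stands in for the real object storage.

```python
import hashlib
from typing import Any, Callable, Dict, Iterable, List, Optional


def index_batch(
    ids: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
    compute: Callable[[str, bytes], Dict[str, Any]],
    store: Callable[[List[Dict[str, Any]]], None],
) -> List[Dict[str, Any]]:
    """Hypothetical indexing loop: fetch, compute, then store the results."""
    results = []
    for obj_id in ids:
        data = fetch(obj_id)  # retrieve the data associated with the id
        if data is None:
            continue  # skip objects that cannot be found
        results.append({"id": obj_id, **compute(obj_id, data)})
    store(results)  # persist all results of the batch in one call
    return results


# In-memory stand-ins for the object storage and the indexer storage.
contents = {hashlib.sha1(raw).hexdigest(): raw for raw in (b"hello", b"#!/bin/sh\n")}
stored: List[Dict[str, Any]] = []

results = index_batch(
    list(contents) + ["missing-id"],
    fetch=contents.get,
    compute=lambda _id, data: {"length": len(data)},
    store=stored.extend,
)
```

Here the computed "index" is just the content length; a real indexer would plug in mimetype detection, license detection or metadata translation at the `compute` step.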

Current content indexers:

  • mimetype (queue swh_indexer_content_mimetype): detect the encoding and mimetype
  • fossology-license (queue swh_indexer_fossology_license): compute the license
  • metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta vocabulary)
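
To give an idea of what the metadata translation involves, here is a simplified sketch that maps a few fields of an npm `package.json` to CodeMeta terms. The function `translate_package_json` and the `NPM_TO_CODEMETA` table are assumptions made for this example and cover only a handful of fields; the mappings used by the indexer itself are more complete.

```python
import json
from typing import Any, Dict

CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Illustrative subset of an npm-to-CodeMeta field mapping (assumed for
# this example; values are copied as-is, without any normalization).
NPM_TO_CODEMETA = {
    "name": "name",
    "version": "version",
    "description": "description",
    "license": "license",
    "homepage": "url",
    "keywords": "keywords",
}


def translate_package_json(raw: bytes) -> Dict[str, Any]:
    """Translate the content of a package.json file to a JSON-LD document."""
    package = json.loads(raw)
    doc: Dict[str, Any] = {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
    }
    for npm_key, codemeta_key in NPM_TO_CODEMETA.items():
        if npm_key in package:
            doc[codemeta_key] = package[npm_key]
    return doc


sample = b'{"name": "left-pad", "version": "1.3.0", "homepage": "https://example.org", "private": true}'
translated = translate_package_json(sample)
```

Fields without an entry in the mapping table (such as `private` in the sample) are dropped from the output.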

Current origin indexers:

  • metadata: translate files from ecosystem-specific formats to JSON-LD (using the schema.org/CodeMeta and ForgeFed vocabularies)