Commits · 1ece6d46de32cbaf9b91faa855225ca58c648291 · Platform / Development / swh-indexer

Dec 05, 2023
- Add latest blackify to git-blame-ignore-revs · 1ece6d46
  David Douard authored 1 year ago
  
  1ece6d46
Dec 04, 2023
- python: Fix black formatting after bump to 23.1.0 in pre-commit · 2e9f1d3e
  David Douard authored 1 year ago
  
  And replace comment type annotations by explicit ones.
  2e9f1d3e
Dec 03, 2023
- Apply swh-py-template 0.1.6 · b3221364
  David Douard authored 1 year ago
  
  Tag: v2.12.0
  
  b3221364
Nov 29, 2023
- Migrate to copier-based swh-py-template and normalize the README file · 5355d8f8
  David Douard authored 1 year ago
  
  5355d8f8
Nov 22, 2023

tasks: Fix registration in scheduler database · 9084a5fb

A docstring is mandatory for a Celery task function, as its value
is inserted in the required description column of the task_type
table in the scheduler database.

The Celery task function name is also used as the task name (with underscores
replaced by dashes), so ensure task function names match the task types
registered in the production scheduler database.

9084a5fb
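
The naming rule described in 9084a5fb can be sketched as follows. This is a minimal illustration, not the swh-scheduler or swh-indexer code: `task_type_from_function` and `index_extrinsic_metadata` are hypothetical names, and only the two rules stated in the commit message (dashes for underscores, docstring as description) are assumed.

```python
def task_type_from_function(func):
    """Return (task type name, description) for a task function.

    Hypothetical helper: both rules come from the commit message,
    not from the scheduler's actual registration code.
    """
    description = (func.__doc__ or "").strip()
    if not description:
        # The description column is required, so a missing docstring is an error.
        raise ValueError(f"{func.__name__} needs a docstring")
    # Underscores in the function name become dashes in the task name.
    return func.__name__.replace("_", "-"), description


# Hypothetical task function, used only to exercise the helper.
def index_extrinsic_metadata():
    """Index extrinsic metadata objects."""


name, description = task_type_from_function(index_extrinsic_metadata)
print(name, "|", description)
```

A function without a docstring is rejected up front, which mirrors why the commit makes docstrings mandatory.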

Nov 21, 2023
- Add module to convert codemeta documents to bibtex entries · ef086410
  vlorentz authored 1 year ago
  
  Tag: v2.11.0
  
  ef086410
Nov 08, 2023
- setup: declare indexer's tasks using the scheduler's plugin system · 8f3747b3
  David Douard authored 1 year ago
  
  So that we can get rid of indexer task types being created by swh-scheduler's SQL init scripts.
  8f3747b3
Oct 26, 2023
- Fix crash when detected metadata files are dangling directory entries · 10f015ea
  vlorentz authored 1 year ago
  
  10f015ea
Oct 18, 2023

metadata_dictionary/base: Compare metadata filenames in lowercase · c60a72d5

Antoine Lambert authored 1 year ago

SingleFileIntrinsicMapping.detect_metadata_files was comparing lowercase
versions of filenames with SingleFileIntrinsicMapping.filename variable
value to detect metadata files.

But as SingleFileIntrinsicMapping.filename holds the canonical name of a
metadata file it can contain uppercase characters.

So compare lowercase versions of both filenames to avoid metadata
files going undetected.

This fixes indexing of Python intrinsic metadata.

c60a72d5
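
The comparison fix in c60a72d5 can be illustrated with a small sketch. The function below is a simplified stand-in, not the real `SingleFileIntrinsicMapping.detect_metadata_files`; the entry layout (`name` and `sha1` keys holding bytes) and the sample filenames are assumptions made for the example.

```python
def detect_metadata_files(file_entries, canonical_filename):
    """Return the sha1 of entries whose name matches, ignoring case.

    Hypothetical simplification: both sides are lowercased, so a canonical
    name containing uppercase characters still matches.
    """
    wanted = canonical_filename.lower()
    return [
        entry["sha1"]
        for entry in file_entries
        if entry["name"].lower() == wanted
    ]


# Assumed entry layout, for illustration only.
entries = [
    {"name": b"pkg-info", "sha1": b"\x01"},
    {"name": b"README.md", "sha1": b"\x02"},
]
matches = detect_metadata_files(entries, b"PKG-INFO")
print(matches)
```

Lowercasing only the directory entry name, as described for the earlier behaviour, would compare `b"pkg-info"` with `b"PKG-INFO"` and miss the file.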

Jul 07, 2023
- Fix mypy/click: add swh.core[testing] in requirements-test.txt · 461b71a4
  David Douard authored 1 year ago
  
  It now needs types-click which is indeed a dependency of swh.core[testing].
  461b71a4
Jun 20, 2023
- Add diagram of metadata flow · f2f9250b
  vlorentz authored 1 year ago and vlorentz committed 1 year ago
  
  Tag: v2.10.0
  
  f2f9250b
Jun 07, 2023
- docs: Fix incorrect namespace in swhpkg example · cab2c9b6
  vlorentz authored 1 year ago
  
  cab2c9b6
May 15, 2023
- pubspec: Fix crash on invalid yaml tag · 42c4fc8e
  vlorentz authored 1 year ago
  
  42c4fc8e
Apr 25, 2023
- Drop partition indexer related code which is no longer used · 29f48cc1
  Antoine R. Dumont authored 1 year ago
  
  Refs. #4733
  29f48cc1
Apr 18, 2023
- docs: Update list of supported metadata formats · 23444eb6
  vlorentz authored 1 year ago
  
  23444eb6
Apr 17, 2023
- docs: Add workflow for extrinsic metadata + mention storage on the path from loader to journal · 0012765f
  vlorentz authored 1 year ago
  
  View commit 0012765f 2 tags
  
  0012765f
Mar 21, 2023
- docs: Define new vocabulary for software dependencies · 0bb9928c
  Kumar Shivendu authored 2 years ago and vlorentz committed 2 years ago
  
  0bb9928c
Mar 13, 2023
- dart: Fix crash on yaml parser error · 418720e6
  vlorentz authored 2 years ago and vlorentz committed 2 years ago
  
  418720e6
- test_dart: Simplify code · c33bb27e
  vlorentz authored 2 years ago and vlorentz committed 2 years ago
  
  c33bb27e
Feb 21, 2023
- metadata_dictionary: Deduplicate extension-based file detection · 65602df6
  Kumar Shivendu authored 2 years ago and vlorentz committed 2 years ago
  
  65602df6
Feb 17, 2023
- mypy: Bump to 1.0 · 0cc2a326
  Antoine Lambert authored 2 years ago
  
  Related to swh/meta#4960
  0cc2a326
Feb 16, 2023
- Update and clean tox configuration for version 4 · b92df151
  Jérémy Bobbio (Lunar) authored 2 years ago
  
  Related to swh/meta#4959
  b92df151
Feb 13, 2023
- Disable reports of OperationalError to Sentry · e89c4996
  vlorentz authored 2 years ago and vlorentz committed 2 years ago
  
  (except QueryCancelled). Also return 503/TransientRemoteException instead of 500/RemoteException, which should have no effect on clients.
  Tag: v2.9.3
  
  e89c4996
Feb 02, 2023

pre-commit: Bump isort from 5.10.1 to 5.11.5 · 9f304353

Antoine Lambert authored 2 years ago

This fixes Python 3.7 support, which broke because poetry, a dependency of
isort, removed support for that Python version in a recent release.

9f304353

Dec 19, 2022

docs: Include module indices only when building standalone package doc · 56cbcc96

Antoine Lambert authored 2 years ago

In order to remove warnings about /apidoc/*.rst files being included
multiple times in toc when building full swh documentation, prefer to
include module indices only when building standalone package documentation.

Also include them the proper sphinx way.

Related to T4496

56cbcc96

Dec 07, 2022

Remove tool ids from Kafka messages · e8549400

vlorentz authored 2 years ago

1. they are internal to the DB so they do not belong in Kafka
2. on unrelated errors, they cause swh.journal to crash because it does
not know how to handle integers in the output of unique_key()

e8549400

Nov 30, 2022
- Fix crash when indexing two REMD objects from the same deposit · f74b47bc
  vlorentz authored 2 years ago
  
  The deduplication code assumed `remd.target` matches the id of results, but this is no longer true, as we started using REMD objects whose `origin` context was used as result id, when `remd.target` is a directory (221d48e2).
  Tag: v2.9.1
  
  f74b47bc
- metadata_dictionary: Fix 'Invalid IPv6 URL' crash · b2d8afff
  vlorentz authored 2 years ago
  
  b2d8afff
Nov 29, 2022
- README: Update list of indexers · f44e14b1
  vlorentz authored 2 years ago
  
  f44e14b1
- Fix ordering and idempotence in the 136 -> 137 upgrade script · d4d4c59c
  Nicolas Dandrimont authored 2 years ago
  
  d4d4c59c
- docs: Remove remaining references to ctags and content_language · 3faeac6c
  vlorentz authored 2 years ago
  
  3faeac6c
Nov 28, 2022

Drop content_language and content_ctags tables and related SQL functions · a5ee54ae
vlorentz authored 2 years ago

Tag: v2.9.0

a5ee54ae

storage: Insert from temporary tables in consistent order · f7833b7e

vlorentz authored 2 years ago

This avoids having a transaction inserting row A then B, while another
inserts row B then A; which (probably) leads to deadlocks like this:

```
DeadlockDetected: deadlock detected
DETAIL:  Process 1842336 waits for ShareLock on transaction 1051957280; blocked by process 64261.
Process 64261 waits for ShareLock on transaction 1051957281; blocked by process 1842336.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (1972253,5) in relation "origin_extrinsic_metadata"
SQL statement "insert into origin_extrinsic_metadata (id, metadata, indexer_configuration_id, from_remd_id, metadata_tsvector, mappings)
```

https://sentry.softwareheritage.org/share/issue/52b06caae89f4235a758887fd6817656/

This was already mitigated by sorting before inserting in temporary
tables, then expecting PostgreSQL to read from temporary tables in the
same order rows were inserted. This is often true, but not guaranteed.

No test for this, because I do not see a way to replicate this more than
existing deadlock tests do.

f7833b7e
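
The idea behind f7833b7e, making the copy out of the temporary table use an explicit order instead of relying on read order, can be sketched as below. This is an illustration under assumptions: SQLite stands in for PostgreSQL (it does not reproduce the deadlock), and the table and column names are hypothetical rather than the indexer storage schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table tmp_metadata (id text, metadata text);
    create table metadata (id text primary key, metadata text);
    """
)

# Rows land in the temporary table in arbitrary order.
conn.executemany(
    "insert into tmp_metadata (id, metadata) values (?, ?)",
    [("c", "{}"), ("a", "{}"), ("b", "{}")],
)

# The explicit ORDER BY fixes the insertion order, so two concurrent
# transactions would touch rows in the same sequence.
conn.execute(
    "insert into metadata (id, metadata) "
    "select id, metadata from tmp_metadata order by id"
)

inserted_order = [
    row[0] for row in conn.execute("select id from metadata order by rowid")
]
print(inserted_order)
```

When every writer inserts in the same key order, the "A then B" versus "B then A" interleaving described in the commit message cannot occur between them.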

Nov 21, 2022

ExtrinsicMetadataIndexer: Add support for metadata with origin in context · 221d48e2

vlorentz authored 2 years ago

REMD from deposits target a directory, with an origin in its context,
so this workaround allows indexing deposits easily, without significantly
changing swh-search.

221d48e2

origin_head: Do not fetch complete snapshots for non-FTP visits · 03b4bb00

vlorentz authored 2 years ago

Some snapshots are really large. Rather than fetching them entirely only to
discard most of the branches, this commit only fetches some branches (to
check existence and to use fewer queries on small snapshots), then requests
specific branches as needed (usually only 2).

This should improve performance and reduce timeout exceptions from the
storage.

03b4bb00

Nov 03, 2022

journal writer: only flush kafka once per batch · b7f04dd9

Nicolas Dandrimont authored 2 years ago

This code was flushing kafka messages and waiting for the brokers on
every message, instead of just doing it once per batch.

b7f04dd9
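
The change in b7f04dd9 amounts to moving the flush out of the per-message loop. A minimal sketch, assuming a producer object with `produce` and `flush` methods; `CountingProducer`, `write_batch`, and the topic name are hypothetical and are not the swh.journal writer.

```python
class CountingProducer:
    """Hypothetical stand-in for a Kafka producer that counts flushes."""

    def __init__(self):
        self.messages = []
        self.flush_count = 0

    def produce(self, topic, key, value):
        # Queue the message locally; nothing is waited on here.
        self.messages.append((topic, key, value))

    def flush(self):
        # A real producer would block here until brokers acknowledge.
        self.flush_count += 1


def write_batch(producer, topic, items):
    """Queue every message of the batch, then flush a single time."""
    for key, value in items:
        producer.produce(topic, key=key, value=value)
    producer.flush()


producer = CountingProducer()
write_batch(
    producer,
    "hypothetical.topic",
    [(b"k1", b"v1"), (b"k2", b"v2"), (b"k3", b"v3")],
)
print(len(producer.messages), producer.flush_count)
```

With the flush inside the loop, the same batch would wait on the brokers once per message instead of once in total.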

Nov 02, 2022
- codemeta: Fix crash on SWORD documents that specify an id · 41e90e4a
  vlorentz authored 2 years ago
  
  Tag: v2.7.3
  
  41e90e4a
Oct 26, 2022

codemeta: Fix malformed dates that used to be allowed by the deposit · 3bad4148
vlorentz authored 2 years ago

Tag: v2.7.2

3bad4148

codemeta: Fix incorrect output namespace for dates and URLs · c0052f8e

vlorentz authored 2 years ago

Codemeta reexports schema:url, schema:dateCreated, ... with
`"@type": "@id"` and `"@type": "schema:Date"` so that

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}
```

expands to:

```
{
    "http://schema.org/url": {
        "@type": "@id",
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}
```

However, our translation tried to translate directly to a partially expanded
form, like this:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}
```

or expanded to:

```
{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which are not what we want.

This commit replaces the hack for `@type` with the right solution that
works for all properties.

c0052f8e

Oct 25, 2022
- metadata_dictionary: Systematically check input URLs before adding to graph · a66d5b24
  vlorentz authored 2 years ago
  
  This is hopefully the definitive workaround for the PyLD issue.
  a66d5b24
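
Checking URLs before they reach the graph, as a66d5b24 describes, can be sketched with the standard library alone. This is not the project's actual check: `is_parsable_url` is a hypothetical helper, and requiring both a scheme and a host is an assumption made for the example.

```python
from urllib.parse import urlparse


def is_parsable_url(url):
    """Return True if urlparse accepts the string and finds a scheme and host.

    Hypothetical helper; the acceptance rule is an assumption.
    """
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse raises ValueError for malformed hosts, for example an
        # unbalanced "[" (reported as "Invalid IPv6 URL").
        return False
    return bool(parsed.scheme and parsed.netloc)


candidates = ["http://example.org", "http://[invalid", "not a url"]
accepted = [url for url in candidates if is_parsable_url(url)]
print(accepted)
```

Rejecting such values up front means a later processing step never sees a string it cannot parse.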