Commits · v2.9.1 · vlorentz / Metadata indexer

Nov 30, 2022

Fix crash when indexing two REMD objects from the same deposit · f74b47bc

vlorentz authored 2 years ago

The deduplication code assumed that `remd.target` matches the id of results,
but this is no longer true: since 221d48e2, when `remd.target` is a
directory, the `origin` in the REMD object's context is used as the result
id instead.

f74b47bc

metadata_dictionary: Fix 'Invalid IPv6 URL' crash · b2d8afff
vlorentz authored 2 years ago

b2d8afff

Nov 29, 2022
- README: Update list of indexers · f44e14b1
  vlorentz authored 2 years ago
  
  f44e14b1
- Fix ordering and idempotence in the 136 -> 137 upgrade script · d4d4c59c
  Nicolas Dandrimont authored 2 years ago
  
  d4d4c59c
- docs: Remove remaining references to ctags and content_language · 3faeac6c
  vlorentz authored 2 years ago
  
  3faeac6c
Nov 28, 2022

Drop content_language and content_ctags tables and related SQL functions · a5ee54ae
vlorentz authored 2 years ago

v2.9.0

a5ee54ae

storage: Insert from temporary tables in consistent order · f7833b7e

vlorentz authored 2 years ago

This avoids having a transaction inserting row A then B, while another
inserts row B then A; which (probably) leads to deadlocks like this:

```
DeadlockDetected: deadlock detected
DETAIL:  Process 1842336 waits for ShareLock on transaction 1051957280; blocked by process 64261.
Process 64261 waits for ShareLock on transaction 1051957281; blocked by process 1842336.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (1972253,5) in relation "origin_extrinsic_metadata"
SQL statement "insert into origin_extrinsic_metadata (id, metadata, indexer_configuration_id, from_remd_id, metadata_tsvector, mappings)
```

https://sentry.softwareheritage.org/share/issue/52b06caae89f4235a758887fd6817656/

This was already mitigated by sorting before inserting into temporary
tables, then expecting postgresql to read from temporary tables in the
same order rows were inserted. This is often true, but not guaranteed.

No test for this, because I do not see a way to replicate this more than
existing deadlock tests do.

f7833b7e
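
The idea behind f7833b7e can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite from the Python standard library as a stand-in for PostgreSQL, and the table names `metadata` and `tmp_metadata` as well as the helper `insert_from_temp_table` are hypothetical. What it shows is the explicit `ORDER BY` when copying from the temporary table, so the insertion order no longer depends on how rows were staged.

```python
import sqlite3


def insert_from_temp_table(conn, rows):
    """Stage rows in a temporary table, then copy them in a fixed order."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE tmp_metadata (id INTEGER, metadata TEXT)")
    cur.executemany("INSERT INTO tmp_metadata (id, metadata) VALUES (?, ?)", rows)
    # The explicit ORDER BY is the point: every transaction touches the
    # target rows in ascending id order, whatever order they were staged in.
    cur.execute(
        "INSERT INTO metadata (id, metadata) "
        "SELECT id, metadata FROM tmp_metadata ORDER BY id"
    )
    cur.execute("DROP TABLE tmp_metadata")
    conn.commit()


# Hypothetical target table, filled from rows staged in arbitrary order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (id INTEGER, metadata TEXT)")
insert_from_temp_table(conn, [(3, "c"), (1, "a"), (2, "b")])
```

SQLite cannot reproduce the deadlock itself; the sketch only demonstrates that the copy step fixes the order on its own instead of relying on the read order of the temporary table.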

Nov 21, 2022

ExtrinsicMetadataIndexer: Add support for metadata with origin in context · 221d48e2

vlorentz authored 2 years ago

REMD from deposits target a directory, with an origin in its context,
so this workaround allows indexing deposits easily, without significantly
changing swh-search.

221d48e2

origin_head: Do not fetch complete snapshots for non-FTP visits · 03b4bb00

vlorentz authored 2 years ago

Some snapshots are really large. Rather than fetching them entirely only to
discard most of the branches, this commit only fetches some branches (to
check existence + to use fewer queries on small snapshots), then requests
specific branches as needed (usually only 2).

This should improve performance and reduce timeout exceptions from the
storage.

03b4bb00

Nov 03, 2022

journal writer: only flush kafka once per batch · b7f04dd9

Nicolas Dandrimont authored 2 years ago

This code was flushing kafka messages and waiting for the brokers on
every message, instead of just doing it once per batch.

b7f04dd9
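
A minimal sketch of the pattern described in b7f04dd9, assuming a producer object that exposes `produce()` and `flush()`. The `RecordingProducer` class and the `write_batch` helper are hypothetical stand-ins written for this example, not the journal writer's actual code; the stand-in only counts calls so the difference is visible without a Kafka broker.

```python
class RecordingProducer:
    """Hypothetical stand-in for a Kafka producer; it only counts calls."""

    def __init__(self):
        self.produced = []
        self.flush_calls = 0

    def produce(self, topic, key, value):
        self.produced.append((topic, key, value))

    def flush(self):
        # In a real client this would block until the brokers acknowledge.
        self.flush_calls += 1


def write_batch(producer, topic, messages):
    """Queue every message, then wait for the brokers a single time."""
    for key, value in messages:
        producer.produce(topic, key=key, value=value)
    # One flush per batch, instead of one flush inside the loop.
    producer.flush()


producer = RecordingProducer()
write_batch(
    producer,
    "example-topic",
    [(b"k1", b"v1"), (b"k2", b"v2"), (b"k3", b"v3")],
)
```

With the flush inside the loop, the writer would wait for a broker round trip on every message; moving it after the loop keeps the same delivery guarantee at the end of the batch with a single wait.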

Nov 02, 2022
- codemeta: Fix crash on SWORD documents that specify an id · 41e90e4a
  vlorentz authored 2 years ago
  
  v2.7.3
  
  41e90e4a
Oct 26, 2022

codemeta: Fix malformed dates that used to be allowed by the deposit · 3bad4148
vlorentz authored 2 years ago

v2.7.2

3bad4148

codemeta: Fix incorrect output namespace for dates and URLs · c0052f8e

vlorentz authored 2 years ago

Codemeta reexports schema:url, schema:dateCreated, ... with
`"@type": "@id"` and `"@type": "schema:Date"` so that

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}
```

expands to:

```
{
    "http://schema.org/url": {
        "@type": "@id",
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}
```

However, our translation tried to translate directly to a partially expanded
form, like this:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}
```

or expanded to:

```
{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which are not what we want.

This commit replaces the hack for `@type` with the right solution that
works for all properties.

c0052f8e

Oct 25, 2022
- metadata_dictionary: Systematically check input URLs before adding to graph · a66d5b24
  vlorentz authored 2 years ago
```
This is hopefully the definitive workaround for the PyLD issue.
```
  a66d5b24
- metadata: Make default tool configuration follow swh.indexer versions · a51cbf39
  vlorentz authored 2 years ago
```
It will allow invalidating cache after changes to mappings,
without changing the puppet manifest every time.
```
  a51cbf39
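
The kind of check described in a66d5b24 (validating input URLs before they reach the graph) might look like the sketch below. It is an assumption-laden illustration, not the indexer's actual check: the helper name `is_valid_url` and the exact acceptance criteria are hypothetical. It relies on `urllib.parse.urlsplit`, which raises `ValueError` with the message "Invalid IPv6 URL" on unbalanced brackets in the host part.

```python
from urllib.parse import urlsplit


def is_valid_url(url):
    """Return True if `url` parses cleanly and has a scheme and a host."""
    if not isinstance(url, str) or " " in url:
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError("Invalid IPv6 URL") on unbalanced brackets.
        return False
    return bool(parts.scheme and parts.netloc)


# Only URLs that pass the check would be added to the graph.
candidates = ["http://example.org", "http://[example.org", "", "not a url"]
valid = [url for url in candidates if is_valid_url(url)]
```

Rejecting the value up front means a malformed URL is dropped from the output instead of raising deep inside the JSON-LD processing.
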
Oct 24, 2022
- Reset Sentry tags when leaving an object's context · dc48e9eb
  vlorentz authored 2 years ago
```
Without this, some Sentry issues were tagged with the wrong object,
which can be very confusing
```
  dc48e9eb
Oct 18, 2022

pre-commit, tox: Bump pre-commit, codespell, black and flake8 · e6895ba5

David Douard authored 2 years ago

- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror
being outdated).

e6895ba5

Oct 07, 2022
- npm: Fix crash on invalid URLs in 'bugs' field. · 43125ac8
  vlorentz authored 2 years ago
  
  v2.7.1
  
  43125ac8
Sep 28, 2022
- Index extrinsic metadata from the deposit · 74867bf3
  vlorentz authored 2 years ago
```
It was added to the metadata_dictionary a while ago, but was excluded
by the indexer itself so far...
```
  2.7.0
  
  74867bf3
- codemeta: Fix crash when translating PropertyValue objects from codemeta-in-SWORD · fc6fea5d
  vlorentz authored 2 years ago
  
  fc6fea5d
- github: Ignore archive_url and issues_url; use custom codemeta:issueTracker · 6e4c67be
  vlorentz authored 2 years ago
```
The codemeta crosswalk maps archive_url and issues_url to codemeta
properties, but they are not actually useable because they are
link templates to API endpoints.
```
  6e4c67be
Sep 27, 2022

Make read_crosstable public and document it. · cdbf090b
vlorentz authored 2 years ago

cdbf090b

npm: Add test for 'author' value that used to crash · 9b741f2f
vlorentz authored 2 years ago
```
It was only fixed as a side-effect of other changes, but it's good
to have a regression test
```
9b741f2f

github and gitea: Use html_url as @id and clone_url as codeRepository · ac0e263b

vlorentz authored 2 years ago

These are closer semantically, as 'html_url' is the main page of the
repository, so it is the best way to identify it; and 'clone_url' is the
URL that should be given to 'git clone', as documented by
https://schema.org/codeRepository

Additionally, that property was missing so far; but a future commit will
need to use it to identify fork relationships (node ids are required to
represent relationships between documents as we cannot use blank
nodes for that)

ac0e263b

Add Gitea metadata mapping · cb435e59
vlorentz authored 2 years ago

cb435e59

GitHub: use correct JSON-LD types for URLs and dates · 20becf4a
vlorentz authored 2 years ago

20becf4a

Sep 12, 2022

tests/conftest: Remove sentry fixtures · e25a2f4e

Antoine Lambert authored 2 years ago

They have been moved into a swh-core pytest plugin to share them with
other swh packages that might need them.

e25a2f4e

Sep 08, 2022

npm: Do not generate URIs with spaces in them · 6d7efad9
vlorentz authored 2 years ago
```
It makes rdflib complain, and is invalid anyway
```
v2.6.0

6d7efad9

Convert SWHID to str before passing to sentry_sdk.set_tag · f4e08f95
vlorentz authored 2 years ago
```
Sentry uses repr() by default, which does not look good in a UI
```
f4e08f95

Fix crash when indexing the same directory twice with non-deterministic order · b6385cec

vlorentz authored 2 years ago

persist_index_computations deduplicated row entries based on the entire
content of the row; but postgresql enforces the 'id' should be unique.

This was not an issue in older versions of swh-indexer, because all
operations were deterministic, given a specific directory as input.

The recent switch to rdflib introduced non-determinism, so different
outputs may be returned for the same directory id; causing the
deduplication to not be good enough to avoid duplicate ids.

With this commit, deduplication is now done on 'id', as expected.

As a side-effect, persist_index_computations is now more efficient
because:

1. it runs in linear time instead of quadratic in the number of
   metadata items
2. it only compares dir ids, instead of the content of indexed metadata
   (which is arbitrarily large JSON-like data)

b6385cec
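
The deduplication on 'id' described in b6385cec can be sketched as a single pass over the rows with a dictionary keyed by id. This is a simplified illustration: the function name `deduplicate_by_id` and the row layout are hypothetical, and keeping the last row seen for a given id is an assumption of this sketch (the commit message does not say which duplicate is kept).

```python
def deduplicate_by_id(rows):
    """Keep one row per 'id' in a single pass; the last row seen wins."""
    by_id = {}
    for row in rows:
        # Only the id is compared, never the (arbitrarily large) metadata.
        by_id[row["id"]] = row
    return list(by_id.values())


# Two different outputs for the same directory id, as non-deterministic
# translation could produce.
rows = [
    {"id": "dir1", "metadata": {"name": "a"}},
    {"id": "dir2", "metadata": {"name": "b"}},
    {"id": "dir1", "metadata": {"name": "a", "license": "MIT"}},
]
unique_rows = deduplicate_by_id(rows)
```

Each row costs one dictionary lookup, so the pass is linear in the number of rows, and two rows with the same id but different metadata can no longer both survive.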

github: Add support for 'topics' · dd027419
vlorentz authored 2 years ago

dd027419

Sep 05, 2022

Fix crash when RawExtrinsicMetadata target new origins · befdbd7e

vlorentz authored 2 years ago

RawExtrinsicMetadata contain a swh:1:ori: identifier of the origin,
which the indexer needs to resolve, by querying its storage replica.

Because RawExtrinsicMetadata are created by loaders, they are often
created shortly after the origin is created by the corresponding lister,
so the origin may not be known to the storage replica used by the
indexer, causing this function to crash.

Waiting 10s seems to be good enough when run on my computer with
production data and moma's replica; so I set it to 60s just to be safe.

befdbd7e
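
A rough sketch of the waiting strategy described in befdbd7e: poll the replica until the origin shows up or a timeout expires. The names `wait_for_origin`, `lookup` and `fake_lookup`, the polling interval, and the return of `None` on timeout are all assumptions made for this example; only the 60s timeout comes from the commit message.

```python
import time


def wait_for_origin(lookup, origin_id, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll `lookup` until it returns an origin or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        origin = lookup(origin_id)
        if origin is not None:
            return origin
        if time.monotonic() >= deadline:
            # Give up: the origin never reached the replica in time.
            return None
        sleep(interval)


# Simulated replica that only learns about the origin on the third query.
calls = {"count": 0}


def fake_lookup(origin_id):
    calls["count"] += 1
    return {"url": "https://example.org/repo"} if calls["count"] >= 3 else None


# The sleep function is replaced by a no-op so the example runs instantly.
origin = wait_for_origin(fake_lookup, "example-origin-id", sleep=lambda seconds: None)
```

Passing the sleep function as a parameter is only a convenience for testing; a real caller would keep the default `time.sleep`.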

Fix crash when RawExtrinsicMetadata objects have the same target · 68940cfc

vlorentz authored 2 years ago

... and they are processed in the same batch.

The last one received takes precedence, as it is likely to be more
up-to-date

68940cfc

Sep 02, 2022
- npm, maven: ignore blatantly invalid licenses · 44879ab5
  vlorentz authored 2 years ago
```
They cause noisy logs
```
  44879ab5
Sep 01, 2022
- cli: Pass all journal_client config keys to the JournalClient · b056d431
  vlorentz authored 2 years ago
  
  b056d431
- Filter out more invalid URIs that make PyLD crash · 2ebd7ee8
  vlorentz authored 2 years ago
  
  2ebd7ee8
- base: Filter out empty URIs so PyLD does not crash · 25e709c8
  vlorentz authored 2 years ago
  
  25e709c8
Aug 31, 2022
- indexer.cli: Allow batch_size configuration on journal client · 752e5d3f
  Antoine R. Dumont authored 2 years ago
```
Related to T4477
```
  v2.5.0
  
  752e5d3f
- Revert "metadata: Drop unsupported key 'type'" · 83abf119
  vlorentz authored 2 years ago
```
This reverts commit 85b675fd.

Support for these old objects is fixed in swh-model v6.4.1.
```
  v2.4.4
  
  83abf119
Aug 30, 2022
- rehash: Call objstorage.content_get() with a HashDict instead of single hash · 42cb3776
  vlorentz authored 2 years ago
```
Hash dicts are now preferred by swh-objstorage, in order to support
individual hash collisions.
```
  42cb3776