- Dec 07, 2022
-
- Nov 30, 2022
-
-
vlorentz authored
- Nov 29, 2022
-
-
vlorentz authored
-
Nicolas Dandrimont authored
-
vlorentz authored
-
- Nov 28, 2022
-
-
vlorentz authored
This avoids having one transaction insert row A then B while another inserts row B then A, which (probably) leads to deadlocks like this:

```
DeadlockDetected: deadlock detected
DETAIL: Process 1842336 waits for ShareLock on transaction 1051957280; blocked by process 64261.
Process 64261 waits for ShareLock on transaction 1051957281; blocked by process 1842336.
HINT: See server log for query details.
CONTEXT: while inserting index tuple (1972253,5) in relation "origin_extrinsic_metadata"
SQL statement "insert into origin_extrinsic_metadata (id, metadata, indexer_configuration_id, from_remd_id, metadata_tsvector, mappings)
```

https://sentry.softwareheritage.org/share/issue/52b06caae89f4235a758887fd6817656/

This was already mitigated by sorting rows before inserting them into temporary tables, then expecting PostgreSQL to read from the temporary tables in the same order the rows were inserted. This is often true, but not guaranteed.

No test for this, because I do not see a way to replicate this more than the existing deadlock tests do.
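The idea of not relying on the staging table's physical order can be sketched as follows. This is a minimal illustration using the standard-library `sqlite3` module rather than PostgreSQL; the table names `metadata` and `tmp_metadata` and the helper `copy_in_key_order` are hypothetical and are not taken from the swh-indexer code base.

```python
import sqlite3


def copy_in_key_order(conn, rows):
    """Stage rows in a temporary table, then insert them in ascending id order."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE tmp_metadata (id TEXT, metadata TEXT)")
    cur.executemany("INSERT INTO tmp_metadata VALUES (?, ?)", rows)
    # Explicit ORDER BY: every writer touches rows in the same order,
    # whatever order the staging table happens to return them in.
    cur.execute(
        "INSERT INTO metadata (id, metadata) "
        "SELECT id, metadata FROM tmp_metadata ORDER BY id"
    )
    cur.execute("DROP TABLE tmp_metadata")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (id TEXT PRIMARY KEY, metadata TEXT)")
copy_in_key_order(conn, [("b", "{}"), ("a", "{}"), ("c", "{}")])
# Read back in insertion order (rowid) to see the order rows were written in.
inserted = [row[0] for row in conn.execute("SELECT id FROM metadata ORDER BY rowid")]
```

When two concurrent writers both process overlapping rows in ascending key order, neither can hold a lock on a later row while waiting for an earlier one, which is what removes the A-then-B versus B-then-A cycle.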
- Nov 21, 2022
-
-
vlorentz authored
Some snapshots are really large. Rather than fetching them entirely only to discard most of the branches, this commit fetches only some branches first (to check existence, and to use fewer queries on small snapshots), then requests specific branches as needed (usually only 2). This should improve performance and reduce timeout exceptions from the storage.
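A rough sketch of this access pattern, under stated assumptions: `FakeStorage`, its `get_branches` method, and `find_head` are hypothetical stand-ins written for illustration, not the actual swh-storage API or the indexer's code, and the page size of 100 is arbitrary.

```python
from typing import Dict, Optional, Sequence


class FakeStorage:
    """Hypothetical in-memory stand-in for a storage client (not the real API)."""

    def __init__(self, branches: Dict[bytes, bytes]):
        self.branches = dict(sorted(branches.items()))
        self.queries = 0

    def get_branches(self, start: bytes = b"", count: int = 100) -> Dict[bytes, bytes]:
        """Return up to `count` branches whose name is >= `start`, in name order."""
        self.queries += 1
        names = [name for name in self.branches if name >= start][:count]
        return {name: self.branches[name] for name in names}


def find_head(
    storage: FakeStorage,
    wanted: Sequence[bytes] = (b"HEAD", b"refs/heads/main"),
) -> Optional[bytes]:
    # One small query checks that the snapshot has branches at all, and is
    # enough on its own for small snapshots.
    first_page = storage.get_branches(count=100)
    if not first_page:
        return None
    for name in wanted:
        if name in first_page:
            return first_page[name]
        # Large snapshot: ask for this one branch instead of fetching everything.
        hit = storage.get_branches(start=name, count=1)
        if name in hit:
            return hit[name]
    return None
```

With this shape, the number of queries depends on how many candidate branch names are tried, not on the size of the snapshot.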
- Nov 03, 2022
-
-
Nicolas Dandrimont authored
This code was flushing kafka messages and waiting for the brokers on every message, instead of just doing it once per batch.
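The difference can be sketched with a counting stand-in for a producer. `CountingProducer` and `send_batch` are hypothetical names used only for this illustration; a real Kafka client exposes similar `produce` and `flush` operations, but the sketch makes no claim about the journal writer's actual code.

```python
class CountingProducer:
    """Hypothetical stand-in for a Kafka producer that counts flushes."""

    def __init__(self):
        self.sent = []
        self.flushes = 0

    def produce(self, topic, value):
        # Queue the message locally; nothing waits on the brokers here.
        self.sent.append((topic, value))

    def flush(self):
        # In a real client this blocks until the brokers acknowledge delivery.
        self.flushes += 1


def send_batch(producer, topic, messages):
    """Queue every message, then wait for the brokers once for the whole batch."""
    for message in messages:
        producer.produce(topic, message)
    producer.flush()


producer = CountingProducer()
send_batch(producer, "example-topic", [b"m1", b"m2", b"m3"])
```

Flushing inside the loop would pay one broker round trip per message; flushing after the loop pays it once per batch.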
-
- Nov 02, 2022
-
- Oct 26, 2022
-
-
vlorentz authored
Codemeta reexports schema:url, schema:dateCreated, ... with `"@type": "@id"` and `"@type": "schema:Date"` so that

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}
```

expands to:

```
{
    "http://schema.org/url": {
        "@type": "@id",
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}
```

However, our translation tried to translate directly to a partially expanded form, like this:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": { "@value": "http://example.org" },
    "dateCreated": { "@value": "2022-10-26" }
}
```

which prevents the compaction and expansion algorithms from adding a type themselves, causing the document to be compacted to:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}
```

or expanded to:

```
{
    "http://schema.org/url": { "@value": "http://example.org" },
    "http://schema.org/dateCreated": { "@value": "2022-10-26" }
}
```

neither of which is what we want.

This commit replaces the hack for `@type` with the right solution that works for all properties.
- Oct 25, 2022
- Oct 24, 2022
-
-
vlorentz authored
Without this, some Sentry issues were tagged with the wrong object, which can be very confusing.
-
- Oct 18, 2022
-
-
David Douard authored
pre-commit from 4.1.0 to 4.3.0, codespell from 2.2.1 to 2.2.2, black from 22.3.0 to 22.10.0, and flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies, and change flake8's repo config to GitHub (the GitLab mirror being outdated).
-
- Oct 07, 2022
-
- Sep 28, 2022
- Sep 27, 2022
-
-
vlorentz authored
-
vlorentz authored
It was only fixed as a side effect of other changes, but it's good to have a regression test.
-
vlorentz authored
They are closer semantically: 'html_url' is the main page of the repository, so it is the best way to identify it, and 'clone_url' is the URL that should be given to 'git clone', as documented by https://schema.org/codeRepository. Additionally, that property was missing so far, but a future commit will need it to identify fork relationships (node ids are required to represent relationships between documents, as we cannot use blank nodes for that).
-
vlorentz authored
-
vlorentz authored
-
- Sep 12, 2022
-
-
Antoine Lambert authored
They have been moved into a swh-core pytest plugin to share them with other swh packages that might need them.
-
- Sep 08, 2022
-
-
vlorentz authored
Sentry uses repr() by default, which does not look good in a UI.
-
vlorentz authored
persist_index_computations deduplicated row entries based on the entire content of the row, but PostgreSQL enforces that 'id' be unique. This was not an issue in older versions of swh-indexer, because all operations were deterministic given a specific directory as input. The recent switch to rdflib introduced non-determinism, so different outputs may be returned for the same directory id, making the deduplication insufficient to avoid duplicate ids.

With this commit, deduplication is now done on 'id', as expected.

As a side effect, persist_index_computations is now more efficient because:

1. it runs in linear time instead of quadratic in the number of metadata items
2. it only compares dir ids, instead of the content of indexed metadata (which is arbitrarily large JSON-like data)
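Deduplicating on 'id' in linear time can be sketched with a dictionary keyed by id. `deduplicate_by_id` is a hypothetical helper, rows are shown as plain dicts for simplicity, and keeping the last row seen for each id is an assumption of this sketch; the commit message does not say which duplicate is kept.

```python
def deduplicate_by_id(rows):
    """Keep one row per 'id' in a single pass over the input."""
    by_id = {}
    for row in rows:
        # Only the id is hashed and compared, never the (possibly large)
        # metadata. Assumption: a later row replaces an earlier one.
        by_id[row["id"]] = row
    return list(by_id.values())


rows = [
    {"id": "dir1", "metadata": {"name": "foo"}},
    {"id": "dir2", "metadata": {"name": "bar"}},
    # Same id as the first row, but a different (non-deterministic) output.
    {"id": "dir1", "metadata": {"name": "foo", "version": "1.0"}},
]
unique_rows = deduplicate_by_id(rows)
```

Comparing whole rows would treat the first and third entries as distinct and let both reach the database, where the uniqueness constraint on 'id' rejects them.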
-
vlorentz authored
- Sep 05, 2022
-
-
vlorentz authored
RawExtrinsicMetadata objects contain a swh:1:ori: identifier of the origin, which the indexer needs to resolve by querying its storage replica. Because RawExtrinsicMetadata objects are created by loaders, they are often created shortly after the origin is created by the corresponding lister, so the origin may not yet be known to the storage replica used by the indexer, causing this function to crash. Waiting 10s seems to be good enough when run on my computer with production data and moma's replica, so I set it to 60s just to be safe.
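One way to wait for a lagging replica is a bounded polling loop, sketched below. `wait_for` and its parameters are hypothetical, and whether the indexer polls repeatedly or waits in some other way is an assumption of this sketch; only the 60s upper bound comes from the message above.

```python
import time


def wait_for(fetch, timeout=60.0, interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns a non-None value or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            # Give up: the caller decides how to handle a still-unknown origin.
            return None
        sleep(interval)


# Usage with a stand-in lookup that only succeeds on the third attempt,
# as if the replica caught up in the meantime.
attempts = []


def lookup_origin():
    attempts.append(1)
    return "https://example.org/repo" if len(attempts) >= 3 else None


origin = wait_for(lookup_origin, timeout=60.0, interval=0.0)
```

The `sleep` and `clock` parameters are injectable so the loop can be exercised in tests without actually waiting.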
-
vlorentz authored
... and they are processed in the same batch. The last one received takes precedence, as it is likely to be more up-to-date
-
- Sep 02, 2022
-
-
vlorentz authored
They cause noisy logs
-
- Sep 01, 2022
- Aug 31, 2022
-
-
Antoine R. Dumont authored
Related to T4477
-