Commits · debian/2.8.0-1_swh1 · vlorentz / Metadata indexer

Nov 23, 2022
- Updated debian changelog for version 2.8.0 · 35f93997
  Jenkins for Software Heritage authored 2 years ago
  
  debian/2.8.0-1_swh1
  
  35f93997
- Update upstream source from tag 'debian/upstream/2.8.0' · 9086224d
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '2.8.0'
with Debian dir 8b4ffcb03cd177c9149fcfd09e3a4cf499a1b201
```
  9086224d
- New upstream version 2.8.0 · f13cc05d
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/2.8.0
  
  f13cc05d
Nov 21, 2022

ExtrinsicMetadataIndexer: Add support for metadata with origin in context · 221d48e2

vlorentz authored 2 years ago

REMD from deposits target a directory, with an origin in its context,
so this workaround allows indexing deposits easily, without significantly
changing swh-search.

221d48e2

origin_head: Do not fetch complete snapshots for non-FTP visits · 03b4bb00

vlorentz authored 2 years ago

Some snapshots are really large. Rather than fetching them entirely only to
discard most of the branches, this commit fetches only some branches first (to
check existence and to use fewer queries on small snapshots), then requests
specific branches as needed (usually only 2).

This should improve performance and reduce timeout exceptions from the
storage.

03b4bb00
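
The two-step lookup described in this commit can be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeStorage`, `get_branches`, `CANDIDATES` and `find_head` are hypothetical names invented for the example, not the swh.storage API or the actual origin_head code.

```python
from typing import Dict, Iterable, Optional

# Hypothetical list of branch names worth checking, in priority order.
CANDIDATES = (b"HEAD", b"refs/heads/main", b"refs/heads/master")


class FakeStorage:
    """Hypothetical in-memory stand-in for a storage client (assumption)."""

    def __init__(self, branches: Dict[bytes, str]) -> None:
        self.branches = dict(branches)
        self.calls = 0  # number of simulated storage queries

    def get_branches(
        self,
        names: Optional[Iterable[bytes]] = None,
        limit: Optional[int] = None,
    ) -> Dict[bytes, str]:
        self.calls += 1
        if names is not None:
            # Targeted request: only the named branches are returned.
            return {n: self.branches[n] for n in names if n in self.branches}
        # Paged request: the first `limit` branches in name order.
        return dict(sorted(self.branches.items())[:limit])


def find_head(storage: FakeStorage, first_page: int = 10) -> Optional[str]:
    # Step 1: fetch a small page of branches instead of the whole snapshot.
    page = storage.get_branches(limit=first_page)
    if not page:
        return None  # nothing to index
    for name in CANDIDATES:
        if name in page:
            return page[name]
    if len(page) < first_page:
        return None  # small snapshot fully seen: no second query needed
    # Step 2: request only the specific branches we care about.
    wanted = storage.get_branches(names=CANDIDATES)
    for name in CANDIDATES:
        if name in wanted:
            return wanted[name]
    return None
```

With this shape, a snapshot with thousands of branches costs two small queries, and a small snapshot costs one.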

Nov 03, 2022

journal writer: only flush kafka once per batch · b7f04dd9

Nicolas Dandrimont authored 2 years ago

This code was flushing kafka messages and waiting for the brokers on
every message, instead of just doing it once per batch.

b7f04dd9
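
As an illustration of the difference, here is a minimal sketch using a hypothetical `CountingProducer` in place of a real Kafka client. The class and both writer functions are assumptions made for the example, not the journal writer's actual code.

```python
from typing import List, Tuple


class CountingProducer:
    """Hypothetical producer that only records calls (assumption)."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, bytes]] = []
        self.flushes = 0

    def produce(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))

    def flush(self) -> None:
        # A real client would block here until the brokers acknowledge.
        self.flushes += 1


def write_flushing_each(producer: CountingProducer, topic: str, messages: List[bytes]) -> None:
    # Old behaviour: one blocking round trip per message.
    for message in messages:
        producer.produce(topic, message)
        producer.flush()


def write_flushing_once(producer: CountingProducer, topic: str, messages: List[bytes]) -> None:
    # New behaviour: queue the whole batch, then wait for the brokers once.
    for message in messages:
        producer.produce(topic, message)
    producer.flush()
```

Both functions send the same messages; only the number of blocking flushes changes.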

Nov 02, 2022
- debian: Fix package builds by adding pybuild.testfiles · c96aa584
  Antoine Lambert authored 2 years ago
  
  debian/2.7.3-2_swh1
  
  c96aa584
- Updated debian changelog for version 2.7.3 · fd11d25c
  Jenkins for Software Heritage authored 2 years ago
  
  debian/2.7.3-1_swh1
  
  fd11d25c
- Update upstream source from tag 'debian/upstream/2.7.3' · 213dc2da
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '2.7.3'
with Debian dir 7da9a21feb7239b589c6d53d33ca7baf0dc3504f
```
  213dc2da
- New upstream version 2.7.3 · 53e30c89
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/2.7.3
  
  53e30c89
- codemeta: Fix crash on SWORD documents that specify an id · 41e90e4a
  vlorentz authored 2 years ago
  
  v2.7.3
  
  41e90e4a
Oct 27, 2022
- Updated debian changelog for version 2.7.2 · 81c343b0
  Jenkins for Software Heritage authored 2 years ago
  
  debian/2.7.2-1_swh1
  
  81c343b0
- Update upstream source from tag 'debian/upstream/2.7.2' · afcf4a26
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '2.7.2'
with Debian dir 0a3e9b68e3a9a078d1a53cb19a27f9c8e117938e
```
  afcf4a26
- New upstream version 2.7.2 · 7a995c0a
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/2.7.2
  
  7a995c0a
Oct 26, 2022

codemeta: Fix malformed dates that used to be allowed by the deposit · 3bad4148
vlorentz authored 2 years ago

v2.7.2

3bad4148

codemeta: Fix incorrect output namespace for dates and URLs · c0052f8e

vlorentz authored 2 years ago

Codemeta reexports schema:url, schema:dateCreated, ... with
`"@type": "@id"` and `"@type": "schema:Date"` so that

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": "http://example.org",
    "dateCreated": "2022-10-26"
}
```

expands to:

```
{
    "http://schema.org/url": {
        "@type": "@id",
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@type": "http://schema.org/Date",
        "@value": "2022-10-26"
    }
}
```

However, our translation tried to translate directly to a partially expanded
form, like this:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "url": {
        "@value": "http://example.org"
    },
    "dateCreated": {
        "@value": "2022-10-26"
    }
}
```

which prevents the compaction and expansion algorithms from adding a
type themselves, causing the document to be compacted to:

```
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "schema:url": "http://example.org",
    "schema:dateCreated": "2022-10-26"
}
```

or expanded to:

```
{
    "http://schema.org/url": {
        "@value": "http://example.org"
    },
    "http://schema.org/dateCreated": {
        "@value": "2022-10-26"
    }
}
```

neither of which is what we want.

This commit replaces the hack for `@type` with the right solution that
works for all properties.

c0052f8e

Oct 25, 2022
- metadata_dictionary: Systematically check input URLs before adding to graph · a66d5b24
  vlorentz authored 2 years ago
```
This is hopefully the definitive workaround for the PyLD issue.
```
  a66d5b24
- metadata: Make default tool configuration follow swh.indexer versions · a51cbf39
  vlorentz authored 2 years ago
```
It will allow invalidating cache after changes to mappings,
without changing the puppet manifest every time.
```
  a51cbf39
Oct 24, 2022
- Reset Sentry tags when leaving an object's context · dc48e9eb
  vlorentz authored 2 years ago
```
Without this, some Sentry issues were tagged with the wrong object,
which can be very confusing
```
  dc48e9eb
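
A minimal sketch of the idea, assuming a plain dictionary `TAGS` as a stand-in for Sentry's tag storage. `TAGS` and `object_tag` are hypothetical names, and this is not the sentry_sdk API.

```python
from contextlib import contextmanager
from typing import Dict, Iterator

# Hypothetical stand-in for the tags attached to reported issues.
TAGS: Dict[str, str] = {}


@contextmanager
def object_tag(key: str, value: str) -> Iterator[None]:
    """Set a tag while an object is being processed, and reset it afterwards."""
    TAGS[key] = value
    try:
        yield
    finally:
        # Without this reset, a later error would still carry the old object.
        TAGS.pop(key, None)
```

Using `try`/`finally` means the tag is also cleared when processing the object raises.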
Oct 18, 2022

pre-commit, tox: Bump pre-commit, codespell, black and flake8 · e6895ba5

David Douard authored 2 years ago

- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror
being outdated).

e6895ba5

Oct 07, 2022
- Updated debian changelog for version 2.7.1 · 45673018
  Jenkins for Software Heritage authored 2 years ago
  
  debian/2.7.1-1_swh1
  
  45673018
- Update upstream source from tag 'debian/upstream/2.7.1' · 1d529cc1
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '2.7.1'
with Debian dir 3e6c8e43699958132eba9d7c620d7871d989a731
```
  1d529cc1
- New upstream version 2.7.1 · 57822414
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/2.7.1
  
  57822414
- npm: Fix crash on invalid URLs in 'bugs' field. · 43125ac8
  vlorentz authored 2 years ago
  
  v2.7.1
  
  43125ac8
Sep 28, 2022
- Index extrinsic metadata from the deposit · 74867bf3
  vlorentz authored 2 years ago
```
It was added to the metadata_dictionary a while ago, but was excluded
by the indexer itself so far...
```
  2.7.0
  
  74867bf3
- codemeta: Fix crash when translating PropertyValue objects from codemeta-in-SWORD · fc6fea5d
  vlorentz authored 2 years ago
  
  fc6fea5d
- github: Ignore archive_url and issues_url; use custom codemeta:issueTracker · 6e4c67be
  vlorentz authored 2 years ago
```
The codemeta crosswalk maps archive_url and issues_url to codemeta
properties, but they are not actually useable because they are
link templates to API endpoints.
```
  6e4c67be
Sep 27, 2022

Make read_crosstable public and document it. · cdbf090b
vlorentz authored 2 years ago

cdbf090b

npm: Add test for 'author' value that used to crash · 9b741f2f
vlorentz authored 2 years ago
```
It was only fixed as a side-effect of other changes, but it's good
to have a regression test
```
9b741f2f

github and gitea: Use html_url as @id and clone_url as codeRepository · ac0e263b

vlorentz authored 2 years ago

These are closer in semantics: 'html_url' is the main page of the repository,
so it is the best way to identify it, and 'clone_url' is the URL that should
be given to 'git clone', as documented by https://schema.org/codeRepository

Additionally, that property was missing so far, but a future commit will
need to use it to identify fork relationships (node ids are required to
represent relationships between documents, as we cannot use blank
nodes for that).

ac0e263b
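
The resulting mapping can be pictured with a small sketch. The function name `translate_repository` and the reduced output are assumptions for illustration; the real mapping handles many more fields.

```python
from typing import Any, Dict


def translate_repository(api_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical, heavily simplified GitHub/Gitea API to codemeta translation."""
    output: Dict[str, Any] = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    }
    if "html_url" in api_doc:
        # The repository's main page identifies the node.
        output["@id"] = api_doc["html_url"]
    if "clone_url" in api_doc:
        # The URL one would pass to 'git clone'.
        output["codeRepository"] = api_doc["clone_url"]
    return output
```

For example, a document with `html_url` set to `https://example.org/user/repo` and `clone_url` set to `https://example.org/user/repo.git` yields those two values under `@id` and `codeRepository` respectively.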

Add Gitea metadata mapping · cb435e59
vlorentz authored 2 years ago

cb435e59

GitHub: use correct JSON-LD types for URLs and dates · 20becf4a
vlorentz authored 2 years ago

20becf4a

Sep 12, 2022
- tests/conftest: Remove sentry fixtures · e25a2f4e
  Antoine Lambert authored 2 years ago
```
They have been moved into a swh-core pytest plugin to share them with
other swh packages that might need them.
```
  e25a2f4e
- Updated debian changelog for version 2.6.0 · 762a238b
  Jenkins for Software Heritage authored 2 years ago
  
  debian/2.6.0-1_swh1
  
  762a238b
- Update upstream source from tag 'debian/upstream/2.6.0' · 912ad29c
  Jenkins for Software Heritage authored 2 years ago
```
Update to upstream version '2.6.0'
with Debian dir 132f86a3595679ad6ca88ad2ca01b29bc4fc100b
```
  912ad29c
- New upstream version 2.6.0 · 2094e1a6
  Jenkins for Software Heritage authored 2 years ago
  
  debian/upstream/2.6.0
  
  2094e1a6
Sep 08, 2022

npm: Do not generate URIs with spaces in them · 6d7efad9
vlorentz authored 2 years ago
```
It makes rdflib complain, and is invalid anyway
```
v2.6.0

6d7efad9

Convert SWHID to str before passing to sentry_sdk.set_tag · f4e08f95
vlorentz authored 2 years ago
```
Sentry uses repr() by default, which does not look good in a UI
```
f4e08f95
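
To show why the conversion matters, here is a sketch with a hypothetical, simplified `Swhid` dataclass. It is not the class from swh.model; it only mimics having a compact `str()` form and a verbose `repr()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Swhid:
    """Hypothetical simplified identifier type (assumption, not swh.model)."""

    object_type: str
    object_id: str

    def __str__(self) -> str:
        return f"swh:1:{self.object_type}:{self.object_id}"


swhid = Swhid("dir", "0" * 40)
tag_value_before = repr(swhid)  # verbose dataclass repr, what a UI would show
tag_value_after = str(swhid)    # compact identifier string
```

Passing `str(swhid)` explicitly keeps the tag value readable regardless of how the receiving library formats unknown objects.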

Fix crash when indexing the same directory twice with non-deterministic order · b6385cec

vlorentz authored 2 years ago

persist_index_computations deduplicated row entries based on the entire
content of the row, but PostgreSQL enforces that 'id' is unique.

This was not an issue in older versions of swh-indexer, because all
operations were deterministic, given a specific directory as input.

The recent switch to rdflib introduced non-determinism, so different
outputs may be returned for the same directory id, and whole-row
deduplication no longer prevents duplicate ids.

With this commit, deduplication is now done on 'id', as expected.

As a side-effect, persist_index_computations is now more efficient
because:

1. it runs in linear time instead of quadratic in the number of
   metadata items
2. it only compares dir ids, instead of the content of indexed metadata
   (which is arbitrarily large JSON-like data)

b6385cec
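
Deduplicating on 'id' in linear time can be sketched as below. The row shape and the choice to keep the last row seen for each id are assumptions made for the example; the commit message does not say which duplicate is kept.

```python
from typing import Any, Dict, List


def deduplicate_by_id(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep one row per 'id' with a single pass over the input.

    Assumption: when several rows share an id, the last one wins.
    """
    by_id: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        # Only the id is hashed and compared, never the metadata payload.
        by_id[row["id"]] = row
    return list(by_id.values())


rows = [
    {"id": "dir1", "metadata": {"name": "a", "keywords": ["x", "y"]}},
    {"id": "dir1", "metadata": {"name": "a", "keywords": ["y", "x"]}},
    {"id": "dir2", "metadata": {"name": "b"}},
]
unique_rows = deduplicate_by_id(rows)
```

The first two rows differ only in the order of their keywords, so comparing whole rows would keep both and violate the uniqueness constraint on 'id', while comparing ids keeps one.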

github: Add support for 'topics' · dd027419
vlorentz authored 2 years ago

dd027419