- Mar 23, 2022
-
-
vlorentz authored
committer=None happens on some malformed commits generated by an old dgit version, and author=None can happen for the same reason. swh-model does not support this yet, so the tests temporarily disable the attrs checks that swh-model relies on.
-
- Mar 22, 2022
-
-
Antoine Lambert authored
Because setuptools copies test modules into subdirectories of the build directory, pytest fails with ImportPathMismatchError exceptions when invoked from the root directory of the module. So ignore the build folder when discovering tests.
-
- Mar 16, 2022
-
-
vlorentz authored
This was not covered by tests so far, because swh.model.tests.swh_model_data.TEST_OBJECTS did not contain any object with a raw_manifest. But it will in swh-model > 5.0.0.
-
- Mar 15, 2022
- Mar 11, 2022
-
-
vlorentz authored
This allows both the postgresql and cassandra backends to make efficient queries by using an index (resp. clustering key) instead of scanning all visits of the given origin then sorting by date. This does not affect the results in the vast majority of cases, as ids are always in increasing chronological order, unless an origin was re-loaded from an old archive.
-
vlorentz authored
postgresql's query planner does not understand the origin is unique, so it performs a partial index scan on origin_visit_pkey, which is inefficient on origins with many visits. This commit itself is not enough to make it use the proper index, but provides this necessary change that will be used by a future commit.
-
vlorentz authored
According to Sentry, in the last 30 days:
* directory_entry_get_by_path: 958 events, https://sentry.softwareheritage.org/share/issue/c4c2124953a145b2bd325f6f6b7df5a6/
* revision_get: 841 events, https://sentry.softwareheritage.org/share/issue/55fbe01c6f4d4c9bbf684c7608a62ad9/
* release_get: 14 events, https://sentry.softwareheritage.org/share/issue/37c53354541b4c4eaa1faf4e20a68418/
* origin_visit_find_by_date: 114 events, https://sentry.softwareheritage.org/share/issue/a674c12049a941968a717661a0226559/
* origin_get: 79 events, https://sentry.softwareheritage.org/share/issue/bf21d6bc7b24442eb18643d80d936d27/ ; 67 events, https://sentry.softwareheritage.org/share/issue/010a4b1e085a4e2089ba4897c6de6038/
-
- Mar 08, 2022
-
-
David Douard authored
it's not used by swh.storage.
-
- Mar 02, 2022
-
-
vlorentz authored
Motivation: replaces the code duplicated across the backends with a single implementation, to be consistent with the objstorage (which has many more backends). This also fixes the issue of metrics from 'extid_add' being missing when using the postgresql storage.
-
- Feb 24, 2022
-
-
David Douard authored
- Add expected entry points for the new swh.core 2 db handling features:
  - add a ``swh.storage.get_datastore()`` function
  - add a ``swh.storage.postgresql.storage.Storage.get_current_version()`` method
  - move sql migration scripts to ``swh/storage/sql/upgrades``
  - modify sql initialization scripts to match swh.core 2 (remove dbversion management code).
- Update tests to use the new template-based database handling; this should have only minimal impact on test execution performance.
-
David Douard authored
-
- Feb 10, 2022
-
-
Antoine Lambert authored
To install the new hook: $ pre-commit install -t commit-msg
-
- Feb 04, 2022
- Feb 01, 2022
-
-
Nicolas Dandrimont authored
Extend the APIs for Revisions and Releases to honor the field by default, unless the new `ignore_displayname` argument is set.
-
Nicolas Dandrimont authored
It was made flaky by d4ddd415.
-
- Jan 31, 2022
-
-
Nicolas Dandrimont authored
This opens up the possibility of eventually ignoring the `name` and `email` fields stored in the database in favor of parsing them again from the fullname field (and therefore of updating our parsing logic without having to affect stored data).
-
Nicolas Dandrimont authored
This allows us to populate sensible name and email values out of the new displayname field, without having to store them.
-
- Jan 25, 2022
-
-
vlorentz authored
I don't expect directory_get_raw_manifest to be used, but it is needed for tests, so why not.
-
- Jan 21, 2022
-
-
vlorentz authored
It will be replaced by what is currently called 'offset_bytes'
-
vlorentz authored
This only keeps 'offset_bytes' to store the timezone, to support swh-model v5.0.0. However, this keeps writing 'offset' and 'negative_utc' to the postgresql database, just in case we need to roll back this change. But they are not read anymore.
-
- Jan 18, 2022
-
-
vlorentz authored
They should be a rare occurrence, so adding these indices allows us to count and enumerate them without expensive full table scans.
-
vlorentz authored
[2022-01-17T16:03:27.448Z] /var/lib/jenkins/workspace/DSTO/tests-on-diff@2/docs/index.rst:25:hardcoded link 'https://archive.softwareheritage.org/api/' could be replaced by an extlink (try using ':swh_web:`api/`' instead)
-
- Jan 12, 2022
-
-
vlorentz authored
Assuming all contents passed to content_missing() have (at least) a missing algo, the function used to iterate over the size of the arg squared in the worst case (when all contents are found). With this commit, it starts with bucketing them by hash, so it does not need to iterate over *all* found contents for each content passed as arg.
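
The bucketing idea can be sketched as follows. This is a minimal illustration, not the code from swh-storage: the function name `missing_contents`, the `key_hash` parameter, and the dict-based content representation are assumptions made for the example, and every queried content is assumed to carry the `key_hash` field.

```python
from collections import defaultdict

# Hash algorithms considered when comparing two contents (illustrative list).
HASH_ALGOS = ("sha1", "sha1_git", "sha256", "blake2s256")


def missing_contents(queried, found, key_hash="sha1"):
    """Yield the key_hash of each queried content that has no match in `found`.

    Hypothetical helper: `queried` and `found` are iterables of dicts mapping
    hash algorithm names to hash values.
    """
    # Bucket the found contents once, so each queried content is only
    # compared against the rows that share its key hash.
    buckets = defaultdict(list)
    for row in found:
        buckets[row[key_hash]].append(row)

    for content in queried:
        candidates = buckets.get(content[key_hash], [])
        matched = any(
            all(
                row.get(algo) == value
                for algo, value in content.items()
                if algo in HASH_ALGOS
            )
            for row in candidates
        )
        if not matched:
            yield content[key_hash]
```

With this layout the cost is roughly the number of found rows plus, for each queried content, the size of its bucket, instead of the product of both list sizes.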
-
vlorentz authored
This is twice as fast, according to https://forge.softwareheritage.org/T3577#72791
-
- Jan 06, 2022
-
-
vlorentz authored
Instead of grouping ids in queries in arbitrary batches (which forces the server node to coordinate with other nodes to complete the query), this sends queries with one id each, directly to the right node. This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791 which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.
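
The shape of the change, one query per id issued concurrently rather than one query per batch, can be sketched like this. It is only an illustration under stated assumptions: a thread pool stands in for the database driver's concurrency, and `fetch_one` and `fetch_concurrently` are hypothetical names, not functions from swh-storage or the Cassandra driver.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_one(object_id):
    """Hypothetical stand-in for a single-id query routed to one node."""
    return {"id": object_id}


def fetch_concurrently(ids, fetch=fetch_one, max_workers=8):
    """Issue one query per id, concurrently, and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `ids` in its results.
        return list(pool.map(fetch, ids))
```

For example, `fetch_concurrently([1, 2, 3])` calls `fetch_one` once per id and collects the three results in order.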
-
- Jan 04, 2022
-
-
David Douard authored
- Dec 22, 2021
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also: - drops spurious copyright headers from those files if present. - fixes a type issue revealed by the new mypy. Related to T3812
-
- Dec 15, 2021
- Dec 13, 2021
-
-
vlorentz authored
Using `int()` on `date.timestamp()` rounded it up (toward zero), but the semantics of `model.Timestamp` is that the actual time is `ts.seconds + ts.microseconds/1000000`, so all negative dates were shifted one second up. In particular, this causes dates from `1969-12-31T23:59:59.000001` to `1969-12-31T23:59:59.999999` (inclusive) to smash into dates from `1970-01-01T00:00:00.000001` to `1970-01-01T00:00:00.999999`, which is how I discovered the issue.
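
The difference can be reproduced with the standard library alone. The sketch below is not the fix from the commit; `truncated_seconds` and `split_timestamp` are hypothetical helpers that contrast truncation with a split that keeps `seconds + microseconds/1000000` equal to the actual time.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def truncated_seconds(date):
    """Buggy variant: int() truncates toward zero, shifting negative dates up."""
    return int(date.timestamp())


def split_timestamp(date):
    """Return (seconds, microseconds) with 0 <= microseconds < 1000000.

    timedelta normalizes its fields so that only `days` can be negative,
    which gives floor semantics for dates before the epoch.
    """
    delta = date - EPOCH
    seconds = delta.days * 86400 + delta.seconds
    return seconds, delta.microseconds


before = datetime.datetime(1969, 12, 31, 23, 59, 59, 500000, tzinfo=datetime.timezone.utc)
after = datetime.datetime(1970, 1, 1, 0, 0, 0, 500000, tzinfo=datetime.timezone.utc)
```

Here `truncated_seconds` returns the same value for `before` and `after`, so pairing it with the microseconds field makes the two dates indistinguishable, while `split_timestamp` keeps them apart.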
-
- Dec 09, 2021
-
-
vlorentz authored
-
- Dec 08, 2021
-
-
Antoine Lambert authored
Now that we have packaged tenacity 6.2 for debian buster and use it in production, we can remove the workarounds to support tenacity < 5.
-
- Dec 07, 2021
-
-
Antoine Lambert authored
Directory entries are now checked for name duplicates in swh-model so we must ensure the CrashyEntry class is properly initialized. Closes T3776
-
- Nov 09, 2021
-
-
David Douard authored
The idea is that we check the BaseModel validity at journal deserialization time so that we still have access to the raw object from kafka for complete reporting (object id plus raw message from kafka). This uses a new ModelObjectDeserializer class that is responsible for deserializing the kafka message (still using kafka_to_value) and then immediately creating the BaseModel object from that dict. Its `convert` method is then passed as the `value_deserializer` argument of the `JournalClient`. Then, for each deserialized object from kafka, if it's a HashableObject, check its validity by comparing the computed hash with its id. If it's invalid, report the error in logs and, if configured, register the invalid object via the `reporter` callback. In the cli code, a `Redis.set()` is used as such a callback (if configured). So it simply stores invalid objects using the object id as key (typically its swhid), and the raw kafka message value as value. Related to T3693.
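
The flow described above can be sketched with toy classes. Everything here is an assumption made for illustration: `ToyObject` and `ToyDeserializer` are hypothetical stand-ins for the model object and for ModelObjectDeserializer, JSON replaces `kafka_to_value`, and returning `None` for an invalid object is a choice of this sketch rather than documented behavior.

```python
import hashlib
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class ToyObject:
    """Hypothetical hashable object: its id must equal the hash of its payload."""

    id: str
    payload: str

    def compute_hash(self):
        return hashlib.sha1(self.payload.encode()).hexdigest()


class ToyDeserializer:
    """Hypothetical deserializer that validates objects while the raw message is at hand."""

    def __init__(self, reporter=None):
        # reporter(key, raw_message) is called for each invalid object, if set.
        self.reporter = reporter

    def convert(self, object_type, msg):
        obj = ToyObject(**json.loads(msg))
        if obj.compute_hash() != obj.id:
            logger.error("Invalid %s object: %s", object_type, obj.id)
            if self.reporter is not None:
                # Store the raw message under the object id for later analysis.
                self.reporter(obj.id, msg)
            return None
        return obj
```

Any callable taking a key and a value works as `reporter`; a dict's `__setitem__` is handy in tests, and a Redis client's `set` method has the same shape.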
-