- Feb 04, 2022
- Feb 01, 2022
-
-
Nicolas Dandrimont authored
Extend the APIs for Revisions and Releases to honor the displayname field by default, unless the new `ignore_displayname` argument is set.
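A hedged usage sketch of the new argument (the argument name comes from the commit message; the exact endpoint signatures and the memory backend setup are assumptions):

```python
from swh.storage import get_storage

storage = get_storage("memory")  # any swh storage backend exposes the same interface
some_revision_id = b"\x00" * 20  # placeholder sha1_git, for illustration only

# Default: author/committer honor the curated displayname, when one is set.
revisions = storage.revision_get([some_revision_id])

# Opt out: return author/committer exactly as stored at ingestion time.
raw_revisions = storage.revision_get([some_revision_id], ignore_displayname=True)
```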
-
Nicolas Dandrimont authored
It was made flaky by d4ddd415.
-
- Jan 31, 2022
-
-
Nicolas Dandrimont authored
This opens up the possibility of eventually ignoring the `name` and `email` fields stored in the database in favor of parsing them again from the `fullname` field (and therefore updating our parsing logic without having to affect stored data).
-
Nicolas Dandrimont authored
This allows us to populate sensible name and email values out of the new displayname field, without having to store them.
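A toy illustration of the kind of parsing this enables, assuming a git-style `Name <email>` fullname; the real logic lives in swh.model and handles many more edge cases:

```python
from typing import Optional, Tuple

def split_fullname(fullname: bytes) -> Tuple[Optional[bytes], Optional[bytes]]:
    # Derive (name, email) from b"Name <email>"; illustration only.
    open_bracket = fullname.find(b"<")
    if open_bracket == -1:
        return (fullname.strip() or None, None)
    name = fullname[:open_bracket].strip() or None
    email = fullname[open_bracket + 1:].rstrip(b">").strip() or None
    return (name, email)

assert split_fullname(b"Jane Doe <jane@example.org>") == (
    b"Jane Doe",
    b"jane@example.org",
)
```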
-
- Jan 25, 2022
-
-
vlorentz authored
I don't expect directory_get_raw_manifest to be used, but it is needed for tests, so why not.
-
- Jan 21, 2022
-
-
vlorentz authored
It will be replaced by what is currently called 'offset_bytes'
-
vlorentz authored
Only 'offset_bytes' is kept to store the timezone, to support swh-model v5.0.0. However, 'offset' and 'negative_utc' are still written to the postgresql database, just in case we need to roll back this change; they are not read anymore.
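For illustration, this is roughly how the legacy columns can be derived from a well-formed `offset_bytes` value at write time (a simplified sketch, not the actual swh code; real `offset_bytes` values may be arbitrary bytes and the real code handles malformed input):

```python
def legacy_offset_fields(offset_bytes: bytes):
    # Derive the legacy 'offset' (minutes) and 'negative_utc' columns from an
    # offset_bytes value such as b"+0200" or b"-0000".
    sign = -1 if offset_bytes.startswith(b"-") else 1
    hours, minutes = int(offset_bytes[1:3]), int(offset_bytes[3:5])
    negative_utc = offset_bytes == b"-0000"
    return sign * (hours * 60 + minutes), negative_utc

assert legacy_offset_fields(b"+0200") == (120, False)
assert legacy_offset_fields(b"-0000") == (0, True)
```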
-
- Jan 18, 2022
-
-
vlorentz authored
They should be a rare occurrence, so adding these indices allows us to count and enumerate them without expensive full table scans.
-
vlorentz authored
[2022-01-17T16:03:27.448Z] /var/lib/jenkins/workspace/DSTO/tests-on-diff@2/docs/index.rst:25:hardcoded link 'https://archive.softwareheritage.org/api/' could be replaced by an extlink (try using ':swh_web:`api/`' instead)
-
- Jan 12, 2022
-
-
vlorentz authored
Assuming all contents passed to content_missing() have (at least) one missing algo, the function used to perform a number of iterations quadratic in the size of its argument in the worst case (when all contents are found). With this commit, it starts by bucketing them by hash, so it does not need to iterate over *all* found contents for each content passed as argument.
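An illustrative sketch of the bucketing idea (not the actual swh.storage code, and with simplified semantics): index the found rows by hash once, so each queried content is checked in constant time instead of rescanning every found content.

```python
from collections import defaultdict

def missing_contents(queried, found, algos=("sha1", "sha1_git", "sha256")):
    # Bucket every hash of every found content, once.
    found_hashes = defaultdict(set)
    for content in found:
        for algo in algos:
            if content.get(algo):
                found_hashes[algo].add(content[algo])
    # Report a queried content as missing if none of its hashes was found.
    for content in queried:
        if not any(
            content.get(algo) in found_hashes[algo]
            for algo in algos
            if content.get(algo)
        ):
            yield content
```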
-
vlorentz authored
This is twice as fast, according to https://forge.softwareheritage.org/T3577#72791
-
- Jan 06, 2022
-
-
vlorentz authored
Instead of grouping ids in queries in arbitrary batches (which forces the server node to coordinate with other nodes to complete the query), this sends queries with one id each, directly to the right node. This is the 'concurrent' algorithm from https://forge.softwareheritage.org/T3577#72791 which gives a >=2x speed-up on directories, and a >=8x speed-up on revisions.
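A rough sketch of the 'concurrent' approach (not the actual swh.storage code, which may rely on the Cassandra driver's own concurrency helpers and prepared statements): send one single-partition query per id, concurrently, so each request goes straight to the node owning that token instead of being fanned out by a coordinator.

```python
from concurrent.futures import ThreadPoolExecutor

def missing_ids(session, table, ids, max_workers=20):
    query = f"SELECT id FROM {table} WHERE id = %s"  # single-partition lookup

    def exists(id_):
        return id_, session.execute(query, (id_,)).one() is not None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return [id_ for id_, found in executor.map(exists, ids) if not found]
```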
-
- Jan 04, 2022
-
-
David Douard authored
- Dec 22, 2021
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also:
- drops spurious copyright headers from those files, if present
- fixes a type issue revealed by the new mypy

Related to T3812
-
- Dec 15, 2021
- Dec 13, 2021
-
-
vlorentz authored
Using `int()` on `date.timestamp()` rounded it up (toward zero), but the semantics of `model.Timestamp` is that the actual time is `ts.seconds + ts.microseconds/1000000`, so all negative dates with a non-zero microseconds component were shifted one second up. In particular, this caused dates from `1969-12-31T23:59:59.000001` to `1969-12-31T23:59:59.999999` (inclusive) to collide with dates from `1970-01-01T00:00:00.000001` to `1970-01-01T00:00:00.999999`, which is how I discovered the issue.
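A minimal reproduction of the off-by-one-second effect in plain Python (not the actual swh code):

```python
import datetime
import math

d = datetime.datetime(
    1969, 12, 31, 23, 59, 59, 500000, tzinfo=datetime.timezone.utc
)
ts = d.timestamp()  # -0.5

# Old behaviour: int() truncates toward zero, shifting negative dates up a second.
assert int(ts) == 0  # collides with 1970-01-01T00:00:00.500000

# Correct behaviour: floor the seconds and keep a non-negative microsecond part.
seconds = math.floor(ts)       # -1
microseconds = d.microsecond   # 500000
assert seconds + microseconds / 1_000_000 == ts
```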
-
- Dec 09, 2021
-
-
vlorentz authored
-
- Dec 08, 2021
-
-
Antoine Lambert authored
Now that we have packaged tenacity 6.2 for debian buster and use it in production, we can remove the workarounds to support tenacity < 5.
-
- Dec 07, 2021
-
-
Antoine Lambert authored
Directory entries are now checked for name duplicates in swh-model so we must ensure the CrashyEntry class is properly initialized. Closes T3776
-
- Nov 09, 2021
-
-
David Douard authored
The idea is that we check the BaseModel validity at journal deserialization time, so that we still have access to the raw object from kafka for complete reporting (object id plus raw message from kafka). This uses a new ModelObjectDeserializer class that is responsible for deserializing the kafka message (still using kafka_to_value), then immediately creating the BaseModel object from that dict. Its `convert` method is then passed as the `value_deserializer` argument of the `JournalClient`. Then, for each object deserialized from kafka, if it is a HashableObject, check its validity by comparing the computed hash with its id. If it is invalid, report the error in logs and, if configured, register the invalid object via the `reporter` callback. In the cli code, a `Redis.set()` is used as such a callback (if configured), so it simply stores invalid objects using the object id as key (typically its swhid) and the raw kafka message value as value. Related to T3693.
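A condensed sketch of the flow described above (illustrative only; `MODEL_CLASSES` is a placeholder name, and the real class lives in swh.storage.replay with possibly different signatures):

```python
import logging

from swh.journal.serializers import kafka_to_value
from swh.model.model import Directory, Release, Revision, Snapshot

logger = logging.getLogger(__name__)

# Placeholder mapping from journal object type to model class.
MODEL_CLASSES = {
    "directory": Directory,
    "release": Release,
    "revision": Revision,
    "snapshot": Snapshot,
}

class ModelObjectDeserializer:
    def __init__(self, reporter=None):
        # reporter: callable taking (key, raw_message), e.g. a Redis.set
        self.reporter = reporter

    def convert(self, object_type, raw_message):
        d = kafka_to_value(raw_message)                 # kafka message -> dict
        obj = MODEL_CLASSES[object_type].from_dict(d)   # dict -> BaseModel
        # HashableObjects can be checked by recomputing their intrinsic id.
        if hasattr(obj, "compute_hash") and obj.compute_hash() != obj.id:
            logger.error("Invalid object: %s", obj.swhid())
            if self.reporter is not None:
                self.reporter(str(obj.swhid()), raw_message)
            return None
        return obj
```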
-
David Douard authored
allowing this dict to be used independently of the fix_objects() function.
-
David Douard authored
keep the fix_objects() function for backward compatibility for now.
-
David Douard authored
allows choosing replayed object types from the cli.
-
vlorentz authored
* merged origin and artifact metadata
* added metametadata
* uses structures instead of dict
* removed raw_extrinsic_metadata_get_latest
-
- Oct 28, 2021
-
-
Antoine Lambert authored
It enables returning, in an efficient way, the list of unique snapshot identifiers resulting from the visits of an origin. Previously it was required to query all visits of an origin, then query all visit statuses for each visit, to extract such information. The introduced method extracts origin snapshot information in a single database query. Related to T3631
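A hedged usage sketch, assuming the new endpoint is a `origin_snapshot_get_all` method taking an origin URL (the name and signature are assumptions based on the description):

```python
from swh.storage import get_storage

storage = get_storage("memory")  # any backend exposes the same interface

# One call instead of iterating over every visit and then every visit status:
snapshot_ids = storage.origin_snapshot_get_all("https://github.com/example/repo")
```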
-
- Oct 22, 2021
-
-
Antoine Lambert authored
Some revisions in the archive do not have a committer date, so work around it to avoid errors when walking such revisions with the CommitterDateRevisionsWalker class.
-
- Oct 21, 2021
-
-
Antoine Lambert authored
-
- Oct 18, 2021
-
-
vlorentz authored
content_missing_by_sha1_git only checks the index and not the main table. This is incorrect, because contents should not be considered written before an entry is written to the main table, even if an entry exists in one of the indexes.
-
- Oct 11, 2021
- Oct 08, 2021
-
-
Nicolas Dandrimont authored
The size of individual revisions and releases is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions or releases with large messages or other metadata. The duration of the database operations for revision_add or release_add is also commensurate with the size of the objects added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on estimated sizes for batches of revision and release objects makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual revisions is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions with extensive histories. The duration of the database operation for revision_add is also commensurate with the number of revision parents added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on the cumulated number of revision parents per batch makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual directories is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of (very) large directories. The duration of the database operation for directory_add is also commensurate to the number of directory entries added in a batch, so using the buffer proxy to limit the time individual database operations takes was not effective. Adding a threshold on cumulated number of directory entries per batch makes this overuse of memory and of database transaction time much less likely.