- Nov 09, 2021
-
-
David Douard authored
The idea is that we check the BaseModel validity at journal deserialization time so that we still have access to the raw object from kafka for complete reporting (object id plus raw message from kafka). This uses a new ModelObjectDeserializer class that is responsible for deserializing the kafka message (still using kafka_to_value) and then immediately creating the BaseModel object from that dict. Its `convert` method is then passed as the `value_deserializer` argument of the `JournalClient`. Then, for each object deserialized from kafka, if it is a HashableObject, check its validity by comparing the computed hash with its id. If it is invalid, report the error in logs and, if configured, register the invalid object via the `reporter` callback. In the cli code, a `Redis.set()` is used as such a callback (if configured): it simply stores invalid objects using the object id as key (typically its swhid) and the raw kafka message value as value. Related to T3693.
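A minimal sketch of how this could be wired up, assuming the class lives in swh.storage.replay, that its constructor takes a `reporter` callable, and that `get_journal_client` forwards `value_deserializer` to `JournalClient` (module paths, argument names and values below are illustrative):

```python
import redis

from swh.journal.client import get_journal_client
from swh.storage.replay import ModelObjectDeserializer  # assumed import path

# Reporter callback: store each invalid object in Redis, keyed by its object id
# (typically its swhid), with the raw kafka message as value.
redis_client = redis.Redis(host="localhost", port=6379)


def report_invalid_object(obj_id: str, raw_message: bytes) -> None:
    redis_client.set(obj_id, raw_message)


deserializer = ModelObjectDeserializer(reporter=report_invalid_object)

# convert replaces the plain kafka_to_value deserializer, so each message is
# turned into a BaseModel instance (and hash-checked) as soon as it is read.
client = get_journal_client(
    "kafka",
    brokers=["kafka1:9092"],
    group_id="swh-replayer",
    object_types=["revision", "release"],
    value_deserializer=deserializer.convert,
)
```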
-
David Douard authored
allowing this dict to be used independently of the fix_objects() function.
-
David Douard authored
keep the fix_objects() function for bw compat for now.
-
David Douard authored
allows choosing replayed object types from the cli.
-
vlorentz authored
* merged origin and artifact metadata
* added metametadata
* uses structures instead of dict
* removed raw_extrinsic_metadata_get_latest
-
- Oct 28, 2021
-
-
Antoine Lambert authored
It enables returning in an efficient way the list of unique snapshot identifiers resulting from the visits of an origin. Previously it was required to query all visits of an origin, then query all visit statuses for each visit, to extract such information. The introduced method enables extracting origin snapshots information in a single database query. Related to T3631
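A sketch of the difference at the storage API level; the method name used below for the new endpoint, `origin_snapshot_get_all`, is illustrative, and pagination is omitted for brevity:

```python
# Before: query all visits, then all visit statuses of each visit, and collect
# the snapshot ids by hand.
snapshots = set()
for visit in storage.origin_visit_get(origin_url).results:
    for status in storage.origin_visit_status_get(origin_url, visit.visit).results:
        if status.snapshot is not None:
            snapshots.add(status.snapshot)

# After: a single call, backed by a single database query, returns the unique
# snapshot identifiers reached from the origin's visits.
snapshots = set(storage.origin_snapshot_get_all(origin_url))
```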
-
- Oct 22, 2021
-
-
Antoine Lambert authored
Some revisions in the archive do not have a committer date, so work around it to avoid errors when walking on such revisions with the CommitterDateRevisionsWalker class.
-
- Oct 21, 2021
-
-
Antoine Lambert authored
-
- Oct 18, 2021
-
-
vlorentz authored
content_missing_by_sha1_git only checks the index and not the main table. This is incorrect, because contents should not be considered written before an entry is written to the main table, even if an entry exists in one of the indexes.
-
- Oct 11, 2021
- Oct 08, 2021
-
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
The size of individual revisions and releases is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions or releases with large messages or other metadata. The duration of the database operations for revision_add or release_add is also commensurate with the size of the objects added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on estimated sizes for batches of revision and release objects makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual revisions is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of revisions with extensive histories. The duration of the database operation for revision_add is also commensurate with the number of revision parents added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on the cumulative number of revision parents per batch makes this overuse of memory and of database transaction time much less likely.
-
Nicolas Dandrimont authored
The size of individual directories is essentially unbounded. This means that, when the buffer storage is used as a way of limiting memory use for an ingestion process, it is still possible to go beyond the expected memory use when adding a batch of (very) large directories. The duration of the database operation for directory_add is also commensurate with the number of directory entries added in a batch, so using the buffer proxy to limit the time individual database operations take was not effective. Adding a threshold on the cumulative number of directory entries per batch makes this overuse of memory and of database transaction time much less likely.
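A hedged configuration sketch covering the three entries above, instantiating the buffer proxy with the new per-batch thresholds; the threshold key names and values shown here are assumptions for illustration, not taken from the entries:

```python
from swh.storage import get_storage

storage = get_storage(
    "buffer",
    storage={"cls": "remote", "url": "http://storage:5002/"},
    min_batch_size={
        "content": 10_000,
        "content_bytes": 100 * 1024 * 1024,
        "directory": 25_000,
        "directory_entries": 100_000,  # cap on cumulative directory entries per batch
        "revision": 100_000,
        "revision_parents": 100_000,  # cap on cumulative revision parents per batch
        "revision_bytes": 100 * 1024 * 1024,  # cap on estimated revision batch size
        "release": 100_000,
        "release_bytes": 100 * 1024 * 1024,  # cap on estimated release batch size
    },
)
```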
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
This was already the case (as grouper called on an empty iterator just returns no batches), but add a test to enforce it.
-
- Sep 29, 2021
-
-
David Douard authored
-
David Douard authored
due to a missing type annotation on the storage argument of _insert_objects(), we missed a bug in the processing of raw_extrinsic_metadata objects: a set() was passed as argument to the storage add methods.
-
David Douard authored
now the callable is expected to return a dict.
-
- Sep 28, 2021
-
-
vlorentz authored
-
- Sep 27, 2021
-
-
vlorentz authored
-
- Sep 23, 2021
-
-
Raphaël Gomès authored
This has a corresponding change in swh.model
-
- Sep 21, 2021
-
-
vlorentz authored
This is used by directory_ls and content_get.
-
Antoine Lambert authored
Methods snapshot_get and snapshot_get_branches should return None if the snapshot does not exist in the archive. Add missing tests to cover that case.
-
- Sep 16, 2021
-
-
Antoine Lambert authored
When searching for branches in an existing snapshot, a PartialBranches object must be returned regardless of the number of found branches. None should only be returned when a snapshot does not exist. This fixes an inconsistency between the postgresql and cassandra backends. Related to T3413
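A short illustration of the behavior this enforces (the snapshot id and branch prefix below are placeholders):

```python
branches = storage.snapshot_get_branches(
    snapshot_id, branches_from=b"refs/heads/does-not-exist"
)

if branches is None:
    ...  # the snapshot itself is missing from the archive
else:
    # The snapshot exists: a PartialBranches object is returned even when no
    # branch matches, with an empty branches mapping.
    for name, target in branches.branches.items():
        ...
```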
-
- Sep 15, 2021
-
-
Antoine R. Dumont authored
This impacts both the `extid_get_from_extid` and `extid_get_from_target` endpoints. When extid_version/extid_type are not provided, this keeps the existing behavior of returning all matching extids. Related to T3567
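A hedged sketch of calls using the new filters; the keyword names (`version`, `extid_type`, `extid_version`) and the `ObjectType` import path are assumptions to be checked against the actual interface:

```python
from swh.model.swhids import ObjectType  # assumed import path

# Resolve from the external id side, keeping only version 1 extids.
extids = storage.extid_get_from_extid("hg-nodeid", [hg_nodeid], version=1)

# Resolve from the archived object side, filtering on type and version;
# omitting the filters keeps the previous behavior of returning all matches.
extids = storage.extid_get_from_target(
    ObjectType.REVISION,
    [revision_id],
    extid_type="hg-nodeid",
    extid_version=1,
)
```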
-
- Sep 14, 2021
-
-
vlorentz authored
-
- Sep 10, 2021
- Sep 09, 2021
-
-
vlorentz authored
This should make it run up to 100 times faster, even on average directories.
-
vlorentz authored
Instead of fetching them one-by-one, with the very high latency this entails. This is preliminary work to make `directory_ls` less painfully slow.
-
vlorentz authored
And fall back to concurrent insertion.
-
- Sep 08, 2021
-
-
vlorentz authored
By reusing the 'steady state' main statement (which is quite large) across calls.
-
vlorentz authored
This adds a new config option for the cassandra backend, 'directory_entries_insert_algo', with three possible values:
* 'one-per-one' is the default, and preserves the current naive behavior
* 'concurrent' and 'batch' are attempts at being more efficient
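A sketch of selecting the new algorithm when instantiating the cassandra backend; the backend options other than 'directory_entries_insert_algo' are the usual ones, shown here with placeholder values:

```python
from swh.storage import get_storage

storage = get_storage(
    "cassandra",
    hosts=["cassandra-seed.internal"],
    keyspace="swh",
    objstorage={"cls": "memory"},
    # one of 'one-per-one' (default), 'concurrent' or 'batch'
    directory_entries_insert_algo="batch",
)
```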
-
- Sep 06, 2021
-
-
vlorentz authored
This will be used as a second pass on objects that failed with older versions of the script.
-
- Sep 03, 2021
-
-
vlorentz authored
-
- Aug 31, 2021
-
-
vlorentz authored
They were inaccurate and a performance bottleneck. We can/should use swh-counters instead, now.
-