- Oct 24, 2023
-
-
Jérémy Bobbio (Lunar) authored
To detect instances of corruption, we now verify that the key found in the index during lookups matches the one given for the query.
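A minimal sketch of that check, assuming the index can be modeled as a list of `(key, offset)` slots. The name `checked_lookup` and the slot layout are hypothetical, chosen for illustration only. A perfect hash function maps any input to some slot, so comparing the stored key with the queried one is what reveals a corrupted index or an unknown key.

```python
from typing import List, Tuple


def checked_lookup(index: List[Tuple[bytes, int]], slot: int, key: bytes) -> int:
    """Return the object offset stored in `slot`, verifying the key first."""
    stored_key, offset = index[slot]
    if stored_key != key:
        # The slot holds another key: corrupted index or unknown key.
        raise KeyError(f"key mismatch in index slot {slot}")
    return offset
```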
-
By setting only the soft limit, we get the expected behavior but we are still able to reset it to the original value. This allows us not to use pytest-forked, while still avoiding pollution of other tests.
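The idea can be sketched as a small context manager around `resource.setrlimit()`. The helper name `soft_file_size_limit` is hypothetical and the actual test code may be organized differently. Only the soft limit is lowered, so an unprivileged process can raise it back to its original value afterwards.

```python
import resource
from contextlib import contextmanager


@contextmanager
def soft_file_size_limit(max_bytes: int):
    """Temporarily lower the soft RLIMIT_FSIZE, leaving the hard limit alone."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, hard))
    try:
        yield
    finally:
        # The hard limit was never touched, so restoring is always allowed.
        resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))
```

Inside the `with` block, writing a regular file past `max_bytes` is expected to fail; once the block exits, other tests see the original limit again.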
-
Jérémy Bobbio (Lunar) authored
We used to build the perfect hash function with the CHD_PH algorithm and the default load factor, which was 0.5. We thus ended up with a hash function which itself took the minimum amount of space, but which resulted in an index twice as large as the number of objects in the shard. We now set the load factor to the maximum value (0.99) using the ill-named `cmph_config_set_graphsize()` function, to lose the minimum amount of space in the index. The 1% lost will still amount to 11-12 MiB for a shard of 30 million objects.
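The figures above can be checked with a short calculation. This is a sketch under two assumptions: the number of index slots is the object count divided by the load factor, rounded up, and each index entry takes 40 bytes (a 32-byte key plus an 8-byte position). The helper names are hypothetical.

```python
import math

INDEX_ENTRY_SIZE = 40  # bytes per index entry (assumed: 32-byte key + 8-byte position)


def index_slots(object_count: int, load_factor: float) -> int:
    """Number of index slots for a given load factor (assumed: ceil(n / c))."""
    return math.ceil(object_count / load_factor)


def wasted_bytes(object_count: int, load_factor: float) -> int:
    """Space taken by index slots that hold no object."""
    empty_slots = index_slots(object_count, load_factor) - object_count
    return empty_slots * INDEX_ENTRY_SIZE


objects = 30_000_000
print(index_slots(objects, 0.5) / objects)  # index size relative to object count
print(wasted_bytes(objects, 0.99) / 2**20)  # wasted space in MiB
```

With a load factor of 0.5 the index has twice as many slots as objects; with 0.99 the unused slots come to a little under 12 MiB under these assumptions.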
-
Jérémy Bobbio (Lunar) authored
The keys associated with each object are currently only used to create and later to query the perfect-hash function. This means there is no easy way to learn about all keys present in a given shard. As long as the keys are hashes, we could recover them by computing the hash of each object, but this would be a fairly expensive operation given the target size.

To make the shard self-contained, we write the key (32 bytes) of each object to the index, before the position of the object entry. Each index entry is now 40 bytes. With shards of 100 GiB and 3 KiB per object, the index will use about 1.23 GiB of extra space.

Writing the keys in the index enables us to do a linear scan to get all keys, or to retrieve an object if the perfect-hash function is broken in any way. Another bonus of keeping the key in the index is that the index can be rewritten to change the key size, without having to rewrite the shard entirely.
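A minimal sketch of such a linear scan, assuming the index is available as raw bytes made of consecutive 40-byte entries. The big-endian byte order and the name `scan_index` are assumptions for illustration; the real on-disk encoding may differ.

```python
import struct
from typing import Iterator, Tuple

# Assumed entry layout: 32-byte key followed by an 8-byte unsigned position.
# Big-endian byte order is an assumption, not taken from the commit message.
INDEX_ENTRY = struct.Struct(">32sQ")


def scan_index(raw_index: bytes) -> Iterator[Tuple[bytes, int]]:
    """Yield (key, position) for every 40-byte entry, in file order."""
    yield from INDEX_ENTRY.iter_unpack(raw_index)
```

Because every entry carries its own key, this scan needs neither the perfect-hash function nor the object data.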
-
Jérémy Bobbio (Lunar) authored
`cmph_new()` will return NULL when there is an issue while creating the hash function. This was previously unchecked, leading to a nice segfault when trying to save the index. We now properly handle the case and throw an exception on the Python side, hinting about possible duplicate keys.
-
Jérémy Bobbio (Lunar) authored
- `ShardCreator.create()` → `.prepare()`

  To create a shard, you actually need to write the header, objects, index and perfect hash function. This method only initializes an incomplete header, so let’s call it `prepare()` instead. Update the C symbol accordingly.

- `ShardCreator.save()` → `.finalize()`

  When calling `write()`, objects are already written to the file. Using `save()` is misleading, as one can easily think that everything is buffered until `save()` is called. Let’s call this `finalize()`, as this is what makes the shard usable by writing the index and perfect hash function. Update the C symbol accordingly.

- `shard_lookup_object_size` → `shard_find_object()` and `shard_lookup_object` → `shard_read_object()`

  These two functions must always be called in sequence. The first seeks to the position of the object entry and reads the size. What follows immediately after is the object data, and this is what the second function retrieves. We rename these functions to make it clearer that a lookup actually needs both of them.
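The find-then-read sequence can be illustrated in Python. This is a sketch, not the C implementation: `find_object` and `read_object` are hypothetical stand-ins for the renamed C functions, and the 8-byte big-endian size prefix is an assumption about the entry format.

```python
import io
import struct
from typing import BinaryIO

SIZE_FIELD = struct.Struct(">Q")  # assumed: 8-byte big-endian size prefix


def find_object(shard: BinaryIO, position: int) -> int:
    """Seek to the object entry and return the size stored at its start."""
    shard.seek(position)
    (size,) = SIZE_FIELD.unpack(shard.read(SIZE_FIELD.size))
    return size


def read_object(shard: BinaryIO, size: int) -> bytes:
    """Read the object data, which immediately follows the size field."""
    return shard.read(size)


payload = b"hello"
shard = io.BytesIO(b"\x00" * 16 + SIZE_FIELD.pack(len(payload)) + payload)
size = find_object(shard, 16)
data = read_object(shard, size)
```

The second call only works because the first one left the file position right after the size field, which is why the two must be called in sequence.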
-
Jérémy Bobbio (Lunar) authored
The HashObject type does not bring any value. The code treats object content as opaque bytes without assuming any properties, imposing any restriction or having specific behaviors. In order to simplify the API, let’s just say `swh-perfecthash` stores `bytes` without making them special.
-
Jérémy Bobbio (Lunar) authored
We now properly raise `OSError` (and subclasses) by looking at the value of errno. Several tests have been added, triggering various error conditions. `resource.setrlimit()` is used to limit the maximum size of a file. This will make `write()` fail. As this limit would spill over into other tests, we use `pytest-forked` to run these tests in a different process.
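For illustration, this is how an errno value turns into the matching `OSError` subclass in plain Python. The helper name `error_from_errno` is hypothetical; the package itself does this from its C binding layer.

```python
import errno
import os


def error_from_errno(code: int, filename: str) -> OSError:
    """Build the OSError (or subclass) that corresponds to an errno value."""
    # The OSError constructor picks the subclass from the errno value,
    # e.g. ENOENT gives FileNotFoundError and EACCES gives PermissionError.
    return OSError(code, os.strerror(code), filename)
```

Error codes without a dedicated subclass, such as `EFBIG` for a file that grows past the size limit, come back as a plain `OSError` with its `errno` attribute set.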
-
Jérémy Bobbio (Lunar) authored
Creating and consulting a Shard are two different operations that are always done separately. Having both in a single class meant there were multiple ways to be in an illegal state.

Instead, we now have a `ShardCreator` that should be used as a context manager, like so:

    with ShardCreator("shard", object_count=len(objects)) as shard:
        for key, object in objects.items():
            shard.write(key, object)

The `Shard` class can now only be used to perform `lookup()`. The `load()` function was removed and inlined in initialization. A `close()` method is available, but more importantly, the class can now act as a context manager:

    with Shard("shard") as shard:
        return shard.lookup(key)
-
- Oct 13, 2023
-
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
Instead of an assertion failure, at least return the path of the shard that failed to load.
-
Nicolas Dandrimont authored
The main change is that self.ffi.buffer is correctly detected as incompatible with HashObject. Use self.ffi.unpack (which generates a bytes object) instead.
-
- Oct 12, 2023
-
-
Nicolas Dandrimont authored
This increases the probability that data is flushed to the shard before we claim that it's ready for use.
-
- Jun 23, 2023
-
-
Nicolas Dandrimont authored
This adds a script to build cmph locally, and hooks into the ffibuilder to use the local copy of cmph if it's available. This also configures cibuildwheel to properly generate manylinux wheels.
-
- Feb 17, 2023
-
-
Antoine Lambert authored
Related to swh/meta#4960
-
- Feb 16, 2023
-
-
-
Jérémy Bobbio (Lunar) authored
Related to swh/meta#4959
-
- Feb 02, 2023
-
-
Antoine Lambert authored
This fixes Python 3.7 support, which broke because poetry, a dependency of isort, removed support for that Python version in a recent release.
-
- Dec 19, 2022
-
-
Antoine Lambert authored
In order to remove warnings about /apidoc/*.rst files being included multiple times in toc when building full swh documentation, prefer to include module indices only when building standalone package documentation. Also include them the proper sphinx way. Related to T4496
-
- Oct 18, 2022
-
-
David Douard authored
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4.

Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror being outdated).
-
- May 09, 2022
-
-
Pratyush authored
-
- Apr 26, 2022
-
-
vlorentz authored
-
- Apr 21, 2022
-
-
Antoine Lambert authored
That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
-
- Apr 08, 2022
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Related to T3922
-
Antoine Lambert authored
black has been considered stable since release 22.1.0, and the version we are currently using is quite outdated and not compatible with click 8.1.0, so it is time to bump it to its latest stable release. Please note that the E501 pycodestyle warning related to line length is replaced by the B950 one from flake8-bugbear, as recommended by black. https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length Related to T3922
-
- Apr 06, 2022
-
-
Antoine Lambert authored
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for pytest >= 7 so we can now drop the pytest pinning.
-
- Mar 22, 2022
-
-
Antoine Lambert authored
Because setuptools copies test modules into subdirectories of the build directory, pytest fails with ImportPathMismatchError exceptions when invoked from the root directory of the module. So ignore the build folder when discovering tests.
-
- Mar 03, 2022
-
-
Nicolas Dandrimont authored
-
- Feb 10, 2022
-
-
Antoine Lambert authored
To install the new hook:

    $ pre-commit install -t commit-msg
-
- Feb 07, 2022
-
-
Antoine R. Dumont authored
Related to T3916
-
- Jan 25, 2022
-
-
Loïc Dachary authored
-
Loïc Dachary authored
and add a safeguard in case the caller provides a key with the wrong size
-
- Dec 16, 2021
-
-
Antoine R. Dumont authored
This also drops spurious copyright headers from those files if present. Related to T3812
-
- Dec 08, 2021
-
-
Loïc Dachary authored
-
- Nov 10, 2021
-
-
Loïc Dachary authored
This package is intended to be used by the new object storage, as a low-level dependency to create and look up a Read Shard. It is implemented in C and based on the cmph library for better performance.

It will be used when a Read Shard must be created with around fifty million objects, totaling around 100GB. The objects and their key (their cryptographic signature) will be retrieved, in Python, from the postgres database where the Write Shard lives. One after the other, they will be inserted in the Read Shard using the **write** method. In the end, the **save** method will create the perfect hash table using the cmph library and store it in the file (it typically takes a few seconds).

There is no write amplification during the creation of the Read Shard: each byte is written exactly once, sequentially. There is no read operation. The memory footprint is 2*n*32 where n is the number of inserted keys.

The **lookup** method relies on the hash function, which is loaded in memory when the **load** function is called. It obtains the offset of the object in the file from an index which may be up to 2x the number of keys (it is not minimal).

Signed-off-by: Loïc Dachary <loic@dachary.org>
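As a worked example of the memory footprint formula, assuming the result is in bytes (32 being the key size) and taking the fifty million objects mentioned above. The function name is hypothetical.

```python
KEY_SIZE = 32  # bytes per key (assumed unit for the formula's constant)


def creation_memory_bytes(key_count: int) -> int:
    """Memory footprint while creating a Read Shard: 2 * n * 32."""
    return 2 * key_count * KEY_SIZE


print(creation_memory_bytes(50_000_000) / 2**30)  # footprint in GiB
```

Under these assumptions, creating a shard of fifty million objects needs a little under 3 GiB of memory.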
-
- Oct 06, 2021
-
-
Loïc Dachary authored
-
Nicolas Dandrimont authored
-