- Feb 17, 2025
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Bump development tools: mypy, codespell, isort, ... Move all tool configuration into pyproject.toml. Remove mypy overrides that are no longer needed.
-
- Aug 27, 2024
-
-
David Douard authored
-
- Jun 13, 2024
-
-
Antoine Lambert authored
-
- Jun 10, 2024
-
-
Antoine Lambert authored
On macOS, libcmph can be easily installed using Homebrew or MacPorts, so ensure the Python extension can be built on that platform by checking libcmph paths and updating compilation flags accordingly.
- Mar 29, 2024
-
-
David Douard authored
-
- Feb 05, 2024
-
-
Antoine Lambert authored
Related to swh/meta#5075.
-
- Dec 20, 2023
-
-
Jérémy Bobbio (Lunar) authored
In order to be able to remove objects from objstorage (in the case of takedown notices), we add a new `Shard.delete()` method. As Shard files use a perfect hash function computed on creation, and fixed offsets, completely removing an object would amount to recreating a new Shard from scratch. As these files are meant to be quite large and removals should be rare, we just overwrite the object size and data with zeros. The object position in the hash table is also replaced with UINT64_MAX in order to signal that the object has been removed. `Shard.lookup()` has been updated accordingly and will throw a `KeyError` if the object matching a key has been deleted. The interface is not ideal but it is due to a more general problem with the design of the API. The caller must be careful not to run `delete()` on a “created” or “loaded” Shard, as the method will take care of opening the Shard file in read/write mode, overwriting the right bytes and closing the file again.
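The overwrite-in-place approach can be illustrated with a minimal Python sketch. It is not the actual C implementation: the byte layout (a big-endian uint64 position in the index slot, and an object entry made of a big-endian uint64 size followed by the data) and the function names are assumptions made for the example.

```python
import struct

UINT64_MAX = 2**64 - 1  # sentinel marking a deleted index slot


def tombstone(shard: bytearray, index_slot: int, object_offset: int) -> None:
    """Zero an object entry in place and mark its index slot as deleted.

    Hypothetical layout: the index slot stores the object position as a
    big-endian uint64; the object entry is a big-endian uint64 size
    followed by the object data.
    """
    (size,) = struct.unpack_from(">Q", shard, object_offset)
    # Overwrite the size field and the data with zeros (same length, no shift).
    shard[object_offset : object_offset + 8 + size] = bytes(8 + size)
    # Replace the position in the index with the sentinel value.
    struct.pack_into(">Q", shard, index_slot, UINT64_MAX)


def read_object(shard: bytearray, index_slot: int) -> bytes:
    """Return the object referenced by an index slot, or raise KeyError."""
    (offset,) = struct.unpack_from(">Q", shard, index_slot)
    if offset == UINT64_MAX:
        raise KeyError("object has been deleted")
    (size,) = struct.unpack_from(">Q", shard, offset)
    return bytes(shard[offset + 8 : offset + 8 + size])


# Toy "shard": one index slot at byte 0, one object entry at byte 8.
data = b"secret"
toy = bytearray(struct.pack(">Q", 8) + struct.pack(">Q", len(data)) + data)
```

Because every other offset stays where it was, the rest of the file and the perfect hash function remain valid after the deletion.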
-
Jérémy Bobbio (Lunar) authored
We will need to reuse a populated Shard to implement deletion. Let’s refactor the lookup test so the Shard creation is done separately in a fixture.
-
Jérémy Bobbio (Lunar) authored
The FILE pointer was not NULL’ed after calling fclose() in shard_close(). This meant calling shard_close() and then shard_destroy() (which also calls shard_close()) would call fclose() twice.
-
Jérémy Bobbio (Lunar) authored
Commit ee9a7ae9 added keys to the index but the documentation was not updated at the time. Take the opportunity to do some reformatting.
-
- Dec 04, 2023
-
-
David Douard authored
-
- Dec 03, 2023
-
- Nov 29, 2023
-
-
Antoine Lambert authored
When building package documentation outside tox by calling make in the docs directory, the include of Makefile.sphinx inside the docs Makefile was failing as its relative path was invalid. So adapt this relative path depending on whether the SWH_PACKAGE_DOC_TOX_BUILD environment variable is set.
-
- Nov 28, 2023
-
-
David Douard authored
- (re)add pkg_resources in mypy's ignored list
- remove sphinx-dev target from tox
- update sphinx target to use non-editable install of swh.docs
- remove MANIFEST.in
- exclude .clang-format from package data files
-
- Nov 21, 2023
-
-
David Douard authored
-
David Douard authored
-
- Nov 16, 2023
-
- Oct 24, 2023
-
-
Jérémy Bobbio (Lunar) authored
To detect instances of corruption, we now verify that the key found in the index during lookups matches the one given for the query.
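A perfect hash function maps any input to some slot, including keys that were never stored, so comparing the stored key with the queried one is what tells a real hit from a stray or corrupted one. A minimal sketch of that check follows; the list-of-tuples index, the function name and the choice of `KeyError` are assumptions made for the example.

```python
def checked_lookup(index, slot, key):
    """Return the object position stored in `slot` only if its key matches.

    `index` is a hypothetical in-memory index: a list of (key, position)
    tuples, where `slot` is whatever the perfect hash function returned.
    """
    stored_key, position = index[slot]
    if stored_key != key:
        # Either the key was never in the shard or the index is corrupted.
        raise KeyError(key)
    return position


# Two 32-byte keys with made-up positions.
index = [(b"a" * 32, 512), (b"b" * 32, 1024)]
```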
-
By setting only the soft limit, we get the expected behavior but we are still able to reset it to the original value. This allows us not to use pytest-forked, while still avoiding pollution of other tests.
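A sketch of how a test helper could lower only the soft limit and restore it afterwards, using the standard `resource` module (Unix only). The helper name is hypothetical and this is not necessarily how the test suite is written.

```python
import contextlib
import resource


@contextlib.contextmanager
def file_size_limit(max_bytes):
    """Temporarily lower the soft RLIMIT_FSIZE, leaving the hard limit alone."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    # Only the soft limit changes, so an unprivileged process can raise it
    # back to its previous value afterwards.
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, hard))
    try:
        yield
    finally:
        resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))
```

Inside the `with` block, a write that would grow a file beyond `max_bytes` fails with `OSError` (errno `EFBIG`), since CPython ignores `SIGXFSZ` by default. Lowering the hard limit instead could not be undone by an unprivileged process, which is why the limit would otherwise leak into later tests.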
-
Jérémy Bobbio (Lunar) authored
We used to build the perfect hash function with the CHD_PH algorithm with the default load factor value, which was 0.5. We thus ended up with a hash function which itself took the minimum amount of space, but would result in an index twice as large as the number of objects in the shard. We now set the load factor to the maximum value (0.99) using the ill-named `cmph_config_set_graphsize()` function to lose the minimum amount of space in the index. The 1% lost will still amount to 11-12 MiB for a shard of 30 million objects.
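The space figures can be checked with a little arithmetic. The sketch below assumes 40-byte index entries (a 32-byte key plus an 8-byte position) and assumes that the number of index slots is the object count divided by the load factor, rounded up; the function name is hypothetical.

```python
import math

ENTRY_SIZE = 40  # bytes per index entry: 32-byte key + 8-byte position (assumed)


def wasted_index_bytes(object_count: int, load_factor: float) -> int:
    """Bytes taken by index slots that hold no object."""
    slots = math.ceil(object_count / load_factor)  # assumed sizing rule
    return (slots - object_count) * ENTRY_SIZE


for load_factor in (0.5, 0.99):
    wasted = wasted_index_bytes(30_000_000, load_factor)
    print(f"load factor {load_factor}: {wasted / 2**20:.1f} MiB of empty slots")
```

Under these assumptions, a load factor of 0.99 leaves roughly 11.6 MiB of empty slots for 30 million objects, while 0.5 leaves as many empty slots as there are objects.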
-
Jérémy Bobbio (Lunar) authored
The keys associated with each object are currently only used to create and later to query the perfect-hash function. This means there is no way to easily learn about all keys present in a given shard. As long as the keys are hashes, we could recover them by computing the hash of each object, but this would be a fairly expensive operation given the target size. To make the shard self-contained, we write the key (32 bytes) for each object to the index, before the position of the object entry. Each index entry is now 40 bytes. With shards of 100 GiB, with 3 KiB for each object, the index will use about 1.23 GiB of extra space. Writing the keys in the index enables us to do a linear scan to get all keys or to retrieve an object if the perfect-hash function is broken in any way. Another bonus of keeping the key in the index is that the index can be rewritten to change the key size, without having to rewrite the shard entirely.
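The linear scan this makes possible can be sketched with the standard `struct` module. The entry layout (32-byte key followed by an 8-byte position) comes from the description above; the big-endian byte order and the function name are assumptions made for the example.

```python
import struct

# One index entry: 32-byte key followed by a uint64 object position.
# Big-endian byte order is an assumption made for this sketch.
ENTRY = struct.Struct(">32sQ")


def iter_keys(index: bytes):
    """Yield every key stored in a raw index, without using the hash function."""
    for key, _position in ENTRY.iter_unpack(index):
        yield key


# Toy index holding two entries with made-up positions.
keys = [b"\x01" * 32, b"\x02" * 32]
raw_index = b"".join(ENTRY.pack(key, 512 * i) for i, key in enumerate(keys))
```

A real index may also contain empty slots, which such a scan would have to skip.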
-
Jérémy Bobbio (Lunar) authored
`cmph_new()` will return NULL when there is an issue while creating the hash function. This was previously unchecked, leading to a nice segfault when trying to save the index. We now properly handle the case and throw an exception on the Python side, hinting about possible duplicate keys.
-
Jérémy Bobbio (Lunar) authored
- `ShardCreator.create()` → `.prepare()`
  To create a shard, you actually need to write header, objects, index and perfect hash function. This method only initializes an incomplete header, so let’s call it `prepare()` instead. Update the C symbol accordingly.
- `ShardCreator.save()` → `.finalize()`
  When calling `write()`, objects are already written to the file. Using `save()` is misleading as one can easily think that everything is buffered until `save()` is called. Let’s call this `finalize()` as this is what makes the shard usable by writing the index and perfect hash function. Update the C symbol accordingly.
- `shard_lookup_object_size` → `shard_find_object()` and `shard_lookup_object` → `shard_read_object()`
  These two functions must always be called in sequence. The first seeks to the position of the object entry and reads the size. What follows immediately after is the object data, and this is what the second function retrieves. We rename these functions to make it clearer that a lookup actually needs both of them.
-
Jérémy Bobbio (Lunar) authored
The HashObject type does not bring any value. The code treats object content as opaque bytes without assuming any properties, imposing any restriction or having specific behaviors. In order to simplify the API, let’s just say `swh-perfecthash` stores `bytes` without making them special.
-
Jérémy Bobbio (Lunar) authored
We now properly raise `OSError` (and subclasses) by looking at the value of errno. Several tests have been added triggering various error conditions. `resource.setrlimit()` is used to limit the maximum size of a file. This will make `write()` fail. As this limit would spill over to other tests, we use `pytest-forked` to run these tests in a different process.
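In Python, constructing `OSError` with an errno value already selects the matching subclass where one exists, which is one way to get "`OSError` (and subclasses)" from a C error code. The helper below is a hypothetical sketch, not the binding's actual code.

```python
import errno
import os


def error_from_errno(code: int, filename: str) -> OSError:
    """Build the exception matching a C errno value.

    Calling OSError(errno, strerror, filename) returns the matching
    subclass when one exists (for example FileNotFoundError for ENOENT)
    and a plain OSError otherwise (for example for EFBIG).
    """
    return OSError(code, os.strerror(code), filename)


missing = error_from_errno(errno.ENOENT, "shard")
too_big = error_from_errno(errno.EFBIG, "shard")
```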
-
Jérémy Bobbio (Lunar) authored
Creating and consulting a Shard are two different operations that are always done separately. Having both in a single class meant there were multiple ways to be in an illegal state. Instead, we now have a `ShardCreator` that should be used as a context manager, like so:

    with ShardCreator("shard", object_count=len(objects)) as shard:
        for key, object in objects.items():
            shard.write(key, object)

The `Shard` class can now only be used to perform `lookup()`. The `load()` function was removed and inlined in initialization. A `close()` method is available, but more importantly, the class can now act as a context manager:

    with Shard("shard") as shard:
        return shard.lookup(key)
-
- Oct 13, 2023
-
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
Instead of an assertion failure, at least return the path of the shard that failed to load.
-
Nicolas Dandrimont authored
The main change is that self.ffi.buffer is correctly detected as incompatible with HashObject. Use self.ffi.unpack (which generates a bytes object) instead.
-
- Oct 12, 2023
-
-
Nicolas Dandrimont authored
This increases the probability that data is flushed to the shard before we claim that it's ready for use.
-
- Jun 23, 2023
-
-
Nicolas Dandrimont authored
This adds a script to build cmph locally, and hooks into the ffibuilder to use the local copy of cmph if it's available. This also configures cibuildwheel to properly generate manylinux wheels.
-
- Feb 17, 2023
-
-
Antoine Lambert authored
Related to swh/meta#4960
-
- Feb 16, 2023
-
-
Jérémy Bobbio (Lunar) authored
Related to swh/meta#4959
- Feb 02, 2023
-
-
Antoine Lambert authored
This fixes Python 3.7 support, which broke because poetry, a dependency of isort, removed support for that Python version in a recent release.
-
- Dec 19, 2022
-
-
Antoine Lambert authored
In order to remove warnings about /apidoc/*.rst files being included multiple times in toc when building full swh documentation, prefer to include module indices only when building standalone package documentation. Also include them the proper sphinx way. Related to T4496
-
- Oct 18, 2022
-
-
David Douard authored
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4.

Also freeze flake8 dependencies.

Also change flake8's repo config to GitHub (the GitLab mirror being outdated).
-
- May 09, 2022
-
-
Pratyush authored
-