- Dec 04, 2023
-
-
Jenkins for Software Heritage authored
Update to upstream version '1.2.0' with Debian dir a5bc8fdabad9c39956352379b14fda673da9e2b4
- Dec 03, 2023
-
-
David Douard authored
-
- Nov 29, 2023
-
-
Antoine Lambert authored
When building package documentation outside tox by calling make in the docs directory, the include of Makefile.sphinx inside the docs Makefile was failing because its relative path was invalid. So adapt this relative path according to whether the SWH_PACKAGE_DOC_TOX_BUILD environment variable is set or not.
-
- Nov 28, 2023
-
-
David Douard authored
- (re)add pkg_resources in mypy's ignored list
- remove sphinx-dev target from tox
- update sphinx target to use non-editable install of swh.docs
- remove MANIFEST.in
- exclude .clang-format from package data files
-
- Nov 21, 2023
-
-
David Douard authored
-
David Douard authored
-
- Nov 20, 2023
-
-
Jenkins for Software Heritage authored
Update to upstream version '1.1.0' with Debian dir fd6c19aa8223a2c638bf2b4c7d05eeab66c5fe6f
- Nov 16, 2023
-
-
David Douard authored
-
- Oct 24, 2023
-
-
Jérémy Bobbio (Lunar) authored
To detect instances of corruption, we now verify that the key found in the index during lookups matches the one given for the query.
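The check described above can be sketched as follows. This is a minimal illustration, not the project's code: `checked_lookup`, `CorruptedIndexError`, the in-memory `index` list and the `slot_of` callable are all hypothetical stand-ins for the on-disk index and the perfect hash function.

```python
class CorruptedIndexError(Exception):
    """Hypothetical error: the index entry does not match the queried key."""


def checked_lookup(index, slot_of, key):
    """Return the object position for `key`, verifying the stored key.

    index   -- list of (stored_key, position) tuples (stand-in for the index)
    slot_of -- callable mapping a key to a slot (stand-in for the hash function)
    """
    stored_key, position = index[slot_of(key)]
    # A perfect hash function returns some slot even for a key it was not
    # built with, so comparing keys is what exposes a mismatch.
    if stored_key != key:
        raise CorruptedIndexError(f"key mismatch in index for {key.hex()}")
    return position
```

Comparing the stored key with the queried one catches both a damaged index and a query for a key that was never written to the shard.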
-
By setting only the soft limit, we get the expected behavior but we are still able to reset it to the original value. This allows us not to use pytest-forked, while still avoiding pollution of other tests.
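A minimal sketch of that idea, assuming a POSIX system: the helper name `limited_file_size` is hypothetical, while `resource.getrlimit()`, `resource.setrlimit()` and `RLIMIT_FSIZE` are standard library names.

```python
import contextlib
import resource


@contextlib.contextmanager
def limited_file_size(max_bytes):
    """Temporarily lower the soft RLIMIT_FSIZE, then restore it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    # Only the soft limit changes. The hard limit is left untouched, so an
    # unprivileged process is still allowed to raise the soft limit back.
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, hard))
    try:
        yield
    finally:
        resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))
```

A test could wrap only the write that is expected to fail in `with limited_file_size(...)`, leaving the limit unchanged for every other test in the same process.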
-
Jérémy Bobbio (Lunar) authored
We used to build the perfect hash function with the CHD_PH algorithm with the default load factor value, which was 0.5. We thus ended up with a hash function which itself took the minimum amount of space, but would result in an index twice as large as the number of objects in the shard. We now set the load factor to the maximum value (0.99) using the ill-named `cmph_config_set_graphsize()` function to lose the minimum amount of space in the index. The 1% lost will still amount to 11-12 MiB for a 30 million object shard.
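The arithmetic behind these figures can be checked with a short sketch. It assumes that the index holds roughly `object_count / load_factor` slots and that each slot takes 40 bytes (the entry size given elsewhere in this log); `index_size` is a hypothetical helper, not part of the package.

```python
import math

ENTRY_SIZE = 40  # bytes per index entry, as stated elsewhere in this log


def index_size(object_count, load_factor):
    """Approximate index size in bytes for a given load factor (assumed model)."""
    slots = math.ceil(object_count / load_factor)
    return slots * ENTRY_SIZE


objects = 30_000_000
# Space used by slots that hold no object at the maximum load factor.
wasted = index_size(objects, 0.99) - objects * ENTRY_SIZE
print(f"{wasted / 2**20:.2f} MiB unused at load factor 0.99")
```

Under these assumptions the unused space lands between 11 and 12 MiB, consistent with the figure quoted above, while a load factor of 0.5 doubles the number of slots.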
-
Jérémy Bobbio (Lunar) authored
The keys associated with each object are currently only used to create and later to query the perfect-hash function. This means there is no way to easily learn about all keys present in a given shard. As long as the keys are hashes, we could recover them by computing the hash of each object, but this would be a fairly expensive operation given the target size. To make the shard self-contained, we write the keys (32 bytes) for each object to the index, before the position of the object entry. Each index entry is now 40 bytes. With shards of 100 GiB, with 3 KiB for each object, the index will use about 1.23 GiB of extra space. Writing the keys in the index enables us to do a linear scan to get all keys or to retrieve an object if the perfect-hash function is broken in any way. Another bonus of keeping the key in the index is that the index can be rewritten to change the key size, without having to rewrite the shard entirely.
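A 40-byte entry of this shape, and the linear scan it enables, can be sketched as below. The 8-byte width of the position follows from 40 minus 32; the big-endian byte order and the names `ENTRY` and `iter_entries` are assumptions made for this illustration only, and the sketch ignores how unused slots are represented.

```python
import struct

KEY_SIZE = 32
# Assumed layout: raw key followed by an unsigned 64-bit position.
# Big-endian is an assumption of this sketch, not a documented fact.
ENTRY = struct.Struct(f">{KEY_SIZE}sQ")


def iter_entries(index_bytes):
    """Linearly scan an index buffer, yielding (key, position) pairs."""
    for key, position in ENTRY.iter_unpack(index_bytes):
        yield key, position


# Build a tiny two-entry index and read every key back without any hashing.
index_bytes = ENTRY.pack(b"\x01" * KEY_SIZE, 512) + ENTRY.pack(b"\x02" * KEY_SIZE, 4096)
all_keys = [key for key, _ in iter_entries(index_bytes)]
```

Because every entry carries its own key, such a scan needs neither the perfect hash function nor the object data.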
-
Jérémy Bobbio (Lunar) authored
`cmph_new()` will return NULL when there is an issue while creating the hash function. This was previously unchecked, leading to a nice segfault when trying to save the index. We now properly handle the case and throw an exception on the Python side, hinting about possible duplicate keys.
-
Jérémy Bobbio (Lunar) authored
- `ShardCreator.create()` → `.prepare()`
  To create a shard, you actually need to write header, objects, index and perfect hash function. This method only initializes an incomplete header, so let’s call it `prepare()` instead. Update the C symbol accordingly.
- `ShardCreator.save()` → `.finalize()`
  When calling `write()`, objects are already written to the file. Using `save()` is misleading as one can easily think that everything is buffered until `save()` is called. Let’s call this `finalize()` as this is what makes the shard usable by writing the index and perfect hash function. Update the C symbol accordingly.
- `shard_lookup_object_size` → `shard_find_object()` and `shard_lookup_object` → `shard_read_object()`
  These two functions must always be called in sequence. The first will seek to the position of the object entry and read the size. What follows immediately after is the object data, and this is what the second function retrieves. We rename these functions to make it clearer that a lookup actually needs both of them.
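The find-then-read sequence described in the last item can be mimicked in pure Python. This is only an illustration of the two-step pattern: the functions below are hypothetical analogues of the C functions, and the 8-byte big-endian size prefix is an assumption of this sketch.

```python
import io
import struct

SIZE = struct.Struct(">Q")  # assumed: 64-bit size prefix before the data


def find_object(f, position):
    """Seek to the object entry and read its size."""
    f.seek(position)
    (size,) = SIZE.unpack(f.read(SIZE.size))
    return size


def read_object(f, size):
    """Read the object data that immediately follows the size."""
    data = f.read(size)
    if len(data) != size:
        raise EOFError("truncated object entry")
    return data


# The second call relies on the file position left behind by the first one.
shard = io.BytesIO(b"xx" + SIZE.pack(5) + b"hello" + b"tail")
size = find_object(shard, 2)
content = read_object(shard, size)
```

The second function takes no position argument, which is why the two calls only make sense as a pair.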
-
Jérémy Bobbio (Lunar) authored
The HashObject type does not bring any value. The code treats object content as opaque bytes without assuming any properties, imposing any restriction or having specific behaviors. In order to simplify the API, let’s just say `swh-perfecthash` stores `bytes` without making them special.
-
Jérémy Bobbio (Lunar) authored
We now properly raise `OSError` (and subclasses) by looking at the value of errno. Several tests have been added triggering various error conditions. `resource.setrlimit()` is used to limit the maximum size of a file. This will make `write()` fail. As this limit would spill over to other tests, we use `pytest-forked` to run these tests in a different process.
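In Python, building an `OSError` from an errno value selects the matching subclass automatically, which is one way to get the behavior described above. A minimal sketch follows; `error_from_errno` is a hypothetical helper, and in cffi-based code the errno value would typically be read from `ffi.errno` right after the failing C call.

```python
import errno
import os


def error_from_errno(code, filename=None):
    """Build the OSError (or subclass) matching an errno value."""
    # The OSError constructor maps known errno values to subclasses,
    # for example ENOENT to FileNotFoundError and EACCES to PermissionError.
    return OSError(code, os.strerror(code), filename)


missing = error_from_errno(errno.ENOENT, "shard")
too_big = error_from_errno(errno.EFBIG, "shard")
```

Errno values without a dedicated subclass, such as `EFBIG`, yield a plain `OSError` whose `errno` attribute still identifies the condition.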
-
Jérémy Bobbio (Lunar) authored
Creating and consulting a Shard are two different operations that are always done separately. Having both in a single class meant there were multiple ways to be in an illegal state. Instead, we now have a `ShardCreator` that should be used as a context manager, like so:

    with ShardCreator("shard", object_count=len(objects)) as shard:
        for key, object in objects.items():
            shard.write(key, object)

The `Shard` class can now only be used to perform `lookup()`. The `load()` function was removed and inlined in initialization. A `close()` method is available, but more importantly, the class can now act as a context manager:

    with Shard("shard") as shard:
        return shard.lookup(key)
-
- Oct 13, 2023
-
-
Nicolas Dandrimont authored
-
Nicolas Dandrimont authored
Instead of an assertion failure, at least return the path of the shard that failed to load.
-
Nicolas Dandrimont authored
The main change is that self.ffi.buffer is correctly detected as incompatible with HashObject. Use self.ffi.unpack (which generates a bytes object) instead.
-
- Oct 12, 2023
-
-
Nicolas Dandrimont authored
This increases the probability that data is flushed to the shard before we claim that it's ready for use.
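The log does not say how this was achieved. One common way to push written data towards the disk before announcing that a file is ready is to flush the user-space buffer and then call `os.fsync()`. The sketch below shows that general technique only; the helper name `write_and_sync` is hypothetical and this is not necessarily what the commit does.

```python
import os


def write_and_sync(path, data):
    """Write data, then ask the OS to flush it to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # empty Python's own buffer into the OS
        os.fsync(f.fileno())  # ask the OS to write its cache to the device
    return len(data)
```

Even with `os.fsync()`, durability still depends on the filesystem and the hardware, which fits the cautious wording of the message above.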
-
- Aug 07, 2023
-
-
Jenkins for Software Heritage authored
Update to upstream version '1.0.0' with Debian dir 16f97fd124eb1cf7038287f4264600685341f9c3
- Jun 23, 2023
-
-
Nicolas Dandrimont authored
This adds a script to build cmph locally, and hooks into the ffibuilder to use the local copy of cmph if it's available. This also configures cibuildwheel to properly generate manylinux wheels.
-
- Feb 17, 2023
-
-
Antoine Lambert authored
Related to swh/meta#4960
-
- Feb 16, 2023
-
-
-
Jérémy Bobbio (Lunar) authored
Related to swh/meta#4959
-
- Feb 02, 2023
-
-
Antoine Lambert authored
This fixes Python 3.7 support, which broke because poetry, a dependency of isort, removed support for that Python version in a recent release.
-
- Dec 19, 2022
-
-
Antoine Lambert authored
In order to remove warnings about /apidoc/*.rst files being included multiple times in toc when building full swh documentation, prefer to include module indices only when building standalone package documentation. Also include them the proper sphinx way. Related to T4496
-
- Oct 18, 2022
-
-
David Douard authored
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4.

Also freeze flake8 dependencies.
Also change flake8's repo config to github (the gitlab mirror being outdated).
-
- May 09, 2022
-
-
Pratyush authored
-
- Apr 26, 2022
-
-
vlorentz authored
-
- Apr 21, 2022
-
-
Antoine Lambert authored
That hook can be frustrating, as it can discard a long commit message if it finds a typo in it, so it is better to remove it.
-
- Apr 08, 2022
-
-
Antoine Lambert authored
-
Antoine Lambert authored
Related to T3922
-