Commits · master · David Douard / swh-perfecthash

Oct 24, 2023

Ensure the key found in the index matches the query · fe2f6f73

Jérémy Bobbio (Lunar) authored 1 year ago

To detect instances of corruption, we now verify that the key found in
the index during lookups matches the one given for the query.
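
A perfect hash function maps any input to some slot, including keys that were never stored, so comparing the stored key with the queried one is what catches both unknown keys and corrupted entries. The sketch below illustrates the check under stated assumptions: the entry layout (32-byte key followed by a big-endian 64-bit offset), the `checked_offset` name and the use of `KeyError` are hypothetical and not taken from the actual C code.

```python
import struct

KEY_SIZE = 32  # key length in bytes
# Assumed entry layout: the key, then a big-endian 64-bit object offset.
ENTRY = struct.Struct(f">{KEY_SIZE}sQ")


def checked_offset(index: bytes, slot: int, key: bytes) -> int:
    """Return the offset stored at `slot`, after verifying the stored key.

    `slot` stands for the value returned by the perfect hash function,
    which is not computed in this sketch.
    """
    stored_key, offset = ENTRY.unpack_from(index, slot * ENTRY.size)
    if stored_key != key:
        # Either the key was never in the shard or the index is corrupted.
        raise KeyError(f"key mismatch in index slot {slot}")
    return offset
```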

fe2f6f73

Use our own setrlimit fixture instead of pytest-fork · 57415793

Nicolas Dandrimont authored 1 year ago and

Jérémy Bobbio (Lunar) committed 1 year ago

By setting only the soft limit, we get the expected behavior but we are
still able to reset it to the original value. This allows us not to use
pytest-forked, while still avoiding pollution of other tests.
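
As a rough illustration of the soft-limit approach, the following context manager lowers only the soft `RLIMIT_FSIZE` and restores it afterwards. The name `soft_file_size_limit` is hypothetical and this is not the fixture from the repository, only a sketch of the same idea using the standard `resource` module (Unix only).

```python
import contextlib
import resource


@contextlib.contextmanager
def soft_file_size_limit(max_bytes: int):
    """Temporarily lower the soft RLIMIT_FSIZE, leaving the hard limit alone."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, hard))
    try:
        yield
    finally:
        # The hard limit was never lowered, so an unprivileged process
        # is still allowed to raise the soft limit back to its old value.
        resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))
```

A generator-based pytest fixture could wrap its `yield` in `with soft_file_size_limit(...)` to get the same setup and teardown behavior.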

57415793

Set the maximum load factor when building the perfect hash function · 6edde4c8

Jérémy Bobbio (Lunar) authored 1 year ago

We used to build the perfect hash function with the CHD_PH algorithm
and its default load factor of 0.5. We thus ended up with a hash
function that itself took the minimum amount of space, but resulted
in an index twice as large as the number of objects in the shard.

We now set the load factor to the maximum value (0.99) using the
ill-named `cmph_config_set_graphsize()` function to lose the
minimum amount of space in the index.

The 1% lost will still amount to 11-12 MiB for a 30-million-object
shard.
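
The figures above can be checked with a little arithmetic. This sketch assumes that the index holds `ceil(n / load_factor)` slots and that each slot is a 40-byte entry (32-byte key plus 8-byte position); the function name is hypothetical.

```python
import math

ENTRY_SIZE = 40  # bytes per index entry: 32-byte key + 8-byte position


def wasted_index_bytes(object_count: int, load_factor: float) -> int:
    """Bytes used by index slots that do not hold any object."""
    # Assumption: the index has ceil(n / load_factor) slots.
    slots = math.ceil(object_count / load_factor)
    return (slots - object_count) * ENTRY_SIZE


for load_factor in (0.5, 0.99):
    wasted = wasted_index_bytes(30_000_000, load_factor)
    print(f"load factor {load_factor}: {wasted / 2**20:.1f} MiB unused")
```

Under these assumptions, the 0.99 case lands between 11 and 12 MiB, while the 0.5 case wastes as much space as the useful entries themselves.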

6edde4c8

Write keys to the index to make the shard self-contained · ee9a7ae9

Jérémy Bobbio (Lunar) authored 1 year ago

The keys associated with each object are currently only used to create
and later to query the perfect-hash function. This means there is no way
to easily learn about all keys present in a given shard. As long as the
keys are hashes, we could recover them by computing the hash of each
object, but this would be a fairly costly operation given the target size.

To make the shard self-contained, we write the keys (32 bytes) for each
object to the index, before the position of the object entry. Each index
entry is now 40 bytes. With shards of 100 GiB and 3 KiB for each
object, the index will use about 1.23 GiB of extra space.

Writing the keys in the index enables us to do a linear scan to
get all keys or to retrieve an object if the perfect-hash
function is broken in any way.

Another bonus of keeping the key in the index is that the index can
be rewritten to change the key size, without having to rewrite the shard
entirely.
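
A linear scan over such an index could look like the sketch below. The byte order and the 64-bit width of the position are assumptions, the function names are hypothetical, and the sketch does not try to recognize unused slots, whose representation is not described here.

```python
import struct
from typing import Iterator, Optional, Tuple

KEY_SIZE = 32
# Assumed entry layout: 32-byte key, then a big-endian 64-bit position (40 bytes).
ENTRY = struct.Struct(f">{KEY_SIZE}sQ")


def scan_index(index: bytes) -> Iterator[Tuple[bytes, int]]:
    """Yield every (key, position) pair without using the hash function."""
    for key, position in ENTRY.iter_unpack(index):
        yield key, position


def find_by_scan(index: bytes, wanted: bytes) -> Optional[int]:
    """Fallback lookup: walk the index until the wanted key is found."""
    for key, position in scan_index(index):
        if key == wanted:
            return position
    return None
```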

ee9a7ae9

Properly handle errors when creating the hash function · a052b42d

Jérémy Bobbio (Lunar) authored 1 year ago

`cmph_new()` will return NULL when there is an issue while
creating the hash function. This was previously unchecked,
leading to a nice segfault when trying to save the index.

We now properly handle the case and throw an exception on the Python
side, hinting about possible duplicate keys.

a052b42d

Rename some functions and methods for clarity · 6923a2b3

Jérémy Bobbio (Lunar) authored 1 year ago

- `ShardCreator.create()` → `.prepare()`

  To create a shard, you actually need to write header, objects,
  index and perfect hash function. This method only initializes
  an incomplete header, so let’s call it `prepare()` instead.
  Update the C symbol accordingly.

- `ShardCreator.save()` → `.finalize()`

  When calling `write()`, objects are already written to the
  file. Using `save()` is misleading as one can easily think
  that everything is buffered until `save()` is called.
  Let’s call this `finalize()` as this is what makes the shard
  usable by writing the index and perfect hash function.
  Update the C symbol accordingly.

- `shard_lookup_object_size` → `shard_find_object()` and
  `shard_lookup_object` → `shard_read_object()`

  These two functions must always be called in sequence.
  The first will seek to the position of the object entry
  and read the size. What follows immediately after is the object data,
  and this is what the second function retrieves.
  We rename these functions to make it clearer that a lookup
  actually needs both of them.

6923a2b3

Use bytes directly and remove HashObject · 782e9896

Jérémy Bobbio (Lunar) authored 1 year ago

The HashObject type does not bring any value. The code treats object
content as opaque bytes without assuming any properties, imposing any
restriction or having specific behaviors.

In order to simplify the API, let’s just say `swh-perfecthash` stores
`bytes` without making them special.

782e9896

Improve error handling · ab1c9f6e

Jérémy Bobbio (Lunar) authored 1 year ago

We now properly raise `OSError` (and subclasses) by looking
at the value of errno.

Several tests have been added triggering various error conditions.
`resource.setrlimit()` is used to limit the maximum size of a file.
This will make `write()` fail. As this limit would spill on other
tests, we use `pytest-forked` to run these tests in a different process.
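
For reference, Python picks the matching `OSError` subclass on its own when the exception is constructed from an errno value and a message. The helper below is a hypothetical sketch of that mechanism, not the code used in the package.

```python
import errno
import os


def oserror_from_errno(err: int, path: str) -> OSError:
    """Build the exception matching an errno value reported by C code."""
    # With (errno, strerror[, filename]) arguments, OSError returns the
    # matching subclass, e.g. FileNotFoundError for ENOENT.
    return OSError(err, os.strerror(err), path)
```

Errno values without a dedicated subclass, such as `EFBIG` when a write exceeds the file size limit, come back as a plain `OSError` with its `errno` attribute set.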

ab1c9f6e

Split creating and consulting a Shard into different classes · 455e8262

Jérémy Bobbio (Lunar) authored 1 year ago

Creating and consulting a Shard are two different operations that
are always done separately. Having both in a single class meant
there were multiple ways to be in an illegal state.

Instead, we now have a `ShardCreator` that should be used as
a context manager, like so:

    with ShardCreator("shard", object_count=len(objects)) as shard:
        for key, object in objects.items():
            shard.write(key, object)

The `Shard` class can now only be used to perform `lookup()`.
The `load()` function was removed and inlined in initialization.
A `close()` method is available, but more importantly, the class
can now act as a context manager:

    with Shard("shard") as shard:
        return shard.lookup(key)

455e8262

Oct 13, 2023
- Add clang-format to pre-commit config · e9cf5fab
  Nicolas Dandrimont authored 1 year ago
  
  e9cf5fab
- create, load: Provide more descriptive error messages · f335225b
  Nicolas Dandrimont authored 1 year ago
  
  Instead of an assertion failure, at least return the path of the shard that failed to load.
  f335225b
- Add typing for cffi calls · 470d0271
  Nicolas Dandrimont authored 1 year ago
  
  The main change is that self.ffi.buffer is correctly detected as incompatible with HashObject. Use self.ffi.unpack (which generates a bytes object) instead.
  470d0271
Oct 12, 2023

Add fdatasync call to shard_save() · 13c140da

Nicolas Dandrimont authored 1 year ago

This increases the probability that data is flushed to the shard before
we claim that it's ready for use.
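
The change itself lives in the C code, but the same idea can be sketched in Python with `os.fdatasync()`. The `write_durably` helper is hypothetical; it falls back to `os.fsync()` on platforms where `fdatasync` is not available.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data and ask the kernel to flush it to storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # push Python's userspace buffer to the kernel
        # fdatasync is missing on some platforms; fsync is a stricter fallback.
        sync = getattr(os, "fdatasync", os.fsync)
        sync(f.fileno())
```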

13c140da

Jun 23, 2023

Add configuration for building binary wheels · 951bae4f

Nicolas Dandrimont authored 1 year ago

This adds a script to build cmph locally, and hooks into the ffibuilder
to use the local copy of cmph if it's available.

This also configures cibuildwheel to properly generate manylinux wheels.

951bae4f

Feb 17, 2023
- mypy: Bump to 1.0.1 · 22ec91f3
  Antoine Lambert authored 2 years ago
  
  Related to swh/meta#4960
  22ec91f3
Feb 16, 2023
- Update documentation to use tox 4 command line syntax · 8135902b
  Jérémy Bobbio (Lunar) authored 2 years ago
  
  See: https://tox.wiki/en/latest/upgrading.html#updating-usage-with-e
  8135902b
- Update and clean tox configuration for version 4 · f24ba300
  Jérémy Bobbio (Lunar) authored 2 years ago
  
  Related to swh/meta#4959
  f24ba300
Feb 02, 2023

pre-commit: Bump isort from 5.10.1 to 5.11.5 · cbe85b8b

Antoine Lambert authored 2 years ago

This fixes Python 3.7 support, which broke because poetry, a dependency
of isort, removed support for that Python version in a recent release.

cbe85b8b

Dec 19, 2022

docs: Include module indices only when building standalone package doc · 286661de

Antoine Lambert authored 2 years ago

In order to remove warnings about /apidoc/*.rst files being included
multiple times in toc when building full swh documentation, prefer to
include module indices only when building standalone package documentation.

Also include them the proper sphinx way.

Related to T4496

286661de

Oct 18, 2022

pre-commit, tox: Bump pre-commit, codespell, black and flake8 · 8142df4b

David Douard authored 2 years ago

- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror
being outdated).

8142df4b

May 09, 2022
- add strict asyncio_mode in pytest.ini · e5e39fc6
  Pratyush authored 2 years ago
  
  e5e39fc6
Apr 26, 2022
- Bump mypy to v0.942 · 01e993f1
  vlorentz authored 2 years ago
  
  01e993f1
Apr 21, 2022

pre-commit: Remove codespell commit-msg hook · 239fd900

Antoine Lambert authored 2 years ago

That hook can be frustrating as it can discard a long commit message
if it finds a typo in it, so it is better to remove it.

239fd900

Apr 08, 2022

Add .git-blame-ignore-revs file with automatic reformatting commits · 05b03447
Antoine Lambert authored 2 years ago

05b03447
python: Reformat code with black 22.3.0 · d9654140
Antoine Lambert authored 2 years ago
Related to T3922
d9654140

pre-commit, tox: Bump black from 19.10b0 to 22.3.0 · 79f90bff

Antoine Lambert authored 2 years ago

black is considered stable since release 22.1.0 and the version
we are currently using is quite outdated and not compatible with
click 8.1.0, so it is time to bump it to its latest stable release.

Please note that E501 pycodestyle warning related to line length
is replaced by B950 one from flake8-bugbear as recommended by black.
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length

Related to T3922

79f90bff

Apr 06, 2022

requirements-test: Remove pytest pinning to < 7 · 38e39aa3

Antoine Lambert authored 2 years ago

pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for
pytest >= 7 so we can now drop the pytest pinning.

38e39aa3

Mar 22, 2022

pytest: Exclude build directory for tests discovery · 1e4e63af

Antoine Lambert authored 2 years ago

Because setuptools copies test modules into subdirectories of the
build directory, pytest fails with ImportPathMismatchError
exceptions when invoked from the root directory of the module.

So ignore the build folder when discovering tests.

1e4e63af

Mar 03, 2022
- Make sure _hash_cffi.so is gitignored · 0c39e35b
  Nicolas Dandrimont authored 2 years ago
  
  0c39e35b
Feb 10, 2022
- pre-commit: Bump hooks and add new one to check commit message spelling · e9938d67
  Antoine Lambert authored 3 years ago
  
  To install the new hook: $ pre-commit install -t commit-msg
  e9938d67
Feb 07, 2022
- requirements-test: Pin pytest to < 7.0.0 · f1d98719
  Antoine R. Dumont authored 3 years ago
  
  Related to T3916
  f1d98719
Jan 25, 2022
- the desired key len is 32 for sha256 · 3b5a0b09
  Loïc Dachary authored 3 years ago
  
  v0.1.2
  
  3b5a0b09
- the key has a fixed len: do not hardcode it · d79745c5
  Loïc Dachary authored 3 years ago
  
  and add a safeguard in case the caller provides a key with the wrong size
  d79745c5
Dec 16, 2021
- Pin mypy and drop type annotations which makes mypy unhappy · 7fd6dbac
  Antoine R. Dumont authored 3 years ago
  
  This also drops spurious copyright headers from those files if present. Related to T3812
  7fd6dbac
Dec 08, 2021
- doc: add a description line · 6aa71e1d
  Loïc Dachary authored 3 years ago
  
  v0.1.1
  
  6aa71e1d
Nov 10, 2021

create and lookup a Read Shard with a perfect hash · 9266eaa6

Loïc Dachary authored 3 years ago


This package is intended to be used by the new object storage, as
a low level dependency to create and lookup a Read Shard.

It is implemented in C and based on the cmph library for better
performance. It will be used when a Read Shard must be created with
around fifty million objects, totaling around 100GB.

The objects and their key (their cryptographic signature) will be
retrieved in Python from the PostgreSQL database where the Write Shard
lives. One after the other they will be inserted in the Read Shard
using the **write** method. In the end the **save** method will create
the perfect hash table using the cmph library and store it in the
file (it typically takes a few seconds).

There is no write amplification during the creation of the Read Shard:
each byte is written exactly once, sequentially. There is no read
operation. The memory footprint is 2*n*32 where n is the number of
inserted keys.
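
To give an order of magnitude, the formula can be evaluated for the shard size mentioned above. This sketch assumes the result is expressed in bytes and that 32 is the key size; the function name is hypothetical.

```python
KEY_SIZE = 32  # bytes per key (assumed to be the "32" in the formula)


def creation_memory_bytes(key_count: int) -> int:
    """Memory footprint while creating a shard, following 2*n*32."""
    return 2 * key_count * KEY_SIZE


footprint = creation_memory_bytes(50_000_000)
print(f"{footprint / 2**30:.2f} GiB for fifty million keys")
```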

The **lookup** method relies on the hash function which is loaded in
memory when the **load** function is called. It obtains the offset of
the object in the file by looking it up in an index which may have up
to 2x as many entries as there are keys (it is not minimal).

Signed-off-by: Loïc Dachary <loic@dachary.org>

9266eaa6

Oct 06, 2021
- C stub compiled via cffi and tested in python · ed6bad56
  Loïc Dachary authored 3 years ago
  
  ed6bad56
- import template from swh-py-template (init-py-repo) · 8229acfd
  Nicolas Dandrimont authored 3 years ago
  
  8229acfd