Commits · debian/1.2.0-1_swh1 · Prateek Jain / swh-perfecthash

Dec 04, 2023
- Updated debian changelog for version 1.2.0 · b2160369
  Jenkins for Software Heritage authored 1 year ago
  
  debian/1.2.0-1_swh1
  
  b2160369
- Update upstream source from tag 'debian/upstream/1.2.0' · 14f72afd
  Jenkins for Software Heritage authored 1 year ago
```
Update to upstream version '1.2.0'
with Debian dir a5bc8fdabad9c39956352379b14fda673da9e2b4
```
  14f72afd
- New upstream version 1.2.0 · 839301a0
  Jenkins for Software Heritage authored 1 year ago
  
  debian/upstream/1.2.0
  
  839301a0
Dec 03, 2023
- Apply swh-py-template 0.1.6 · 65e6afd9
  David Douard authored 1 year ago
  
  v1.2.0
  
  65e6afd9
Nov 29, 2023

docs/Makefile: Fix doc build outside tox when using make command · bf3250b0

When building package documentation outside tox by calling make in the
docs directory, the include of Makefile.sphinx inside the docs Makefile
was failing as its relative path was invalid.

So adapt this relative path according to whether the
SWH_PACKAGE_DOC_TOX_BUILD environment variable is set.

bf3250b0

Nov 28, 2023

Update to swh-py-template v0.1.5 · 0cdc6b12

David Douard authored 1 year ago

- (re)add pkg_resources in mypy's ignored list
- remove sphinx-dev target from tox
- update sphinx target to use non-editable install of swh.docs
- remove MANIFEST.in
- exclude .clang-format from package data files

0cdc6b12

Nov 21, 2023
- Add a "Build dependencies" section in the README file · 6fd1ae1e
  David Douard authored 1 year ago
  
  6fd1ae1e
- Migrate to copier-based template · 68cb6785
  David Douard authored 1 year ago
  
  68cb6785
Nov 20, 2023
- Updated debian changelog for version 1.1.0 · 201020c2
  Jenkins for Software Heritage authored 1 year ago
  
  debian/1.1.0-1_swh1
  
  201020c2
- New upstream version 1.1.0 · 154baf0d
  Jenkins for Software Heritage authored 1 year ago
  
  debian/upstream/1.1.0
  
  154baf0d
- Update upstream source from tag 'debian/upstream/1.1.0' · 77818fda
  Jenkins for Software Heritage authored 1 year ago
```
Update to upstream version '1.1.0'
with Debian dir fd6c19aa8223a2c638bf2b4c7d05eeab66c5fe6f
```
  77818fda
Nov 16, 2023
- docs: include the README file in the main index page · 1c1eb348
  David Douard authored 1 year ago
  
  v1.1.0
  
  1c1eb348
Oct 24, 2023

Ensure the key found in the index matches the query · fe2f6f73

Jérémy Bobbio (Lunar) authored 1 year ago

To detect instances of corruption, we now verify that the key found in
the index during lookups matches the one given for the query.
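
A minimal sketch of why this check matters, assuming a toy in-memory index. A perfect hash function maps every input to some slot, including keys that were never stored, so comparing the stored key with the queried one is what reveals a wrong or corrupted entry. The names `checked_lookup` and `slot_for`, and the tuple layout, are hypothetical and not the library's API.

```python
def slot_for(key):
    # Hypothetical stand-in for the perfect hash function: maps any key
    # to slot 0 or 1, whether or not the key was ever stored.
    return key[0] % 2


def checked_lookup(index, key):
    """Return the offset stored for `key`, verifying the stored key first."""
    stored_key, offset = index[slot_for(key)]
    if stored_key != key:
        # The slot belongs to another key: the queried key was never
        # stored, or the index is corrupted.
        raise KeyError(f"key mismatch in index for {key.hex()}")
    return offset


# Build a two-entry toy index, placing each entry in its hashed slot.
index = [None, None]
for key, offset in [(b"a" * 32, 512), (b"b" * 32, 4096)]:
    index[slot_for(key)] = (key, offset)
```

Looking up `b"a" * 32` returns its offset, while looking up `b"c" * 32` lands in an occupied slot and raises `KeyError` instead of silently returning another object's position.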

fe2f6f73

Use our own setrlimit fixture instead of pytest-forked · 57415793

Nicolas Dandrimont authored 1 year ago and Jérémy Bobbio (Lunar) committed 1 year ago

By setting only the soft limit, we get the expected behavior but we are
still able to reset it to the original value. This allows us not to use
pytest-forked, while still avoiding pollution of other tests.
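
The idea can be sketched as a context manager, which a pytest fixture could wrap. This is an illustration under assumptions, not the fixture from the commit: the name `soft_file_size_limit` is hypothetical, and the sketch assumes a Unix platform where the `resource` module is available. Because only the soft limit is lowered and the hard limit is left untouched, an unprivileged process can raise the soft limit back afterwards.

```python
import contextlib
import resource


@contextlib.contextmanager
def soft_file_size_limit(max_bytes):
    """Temporarily lower the soft RLIMIT_FSIZE, then restore it."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    # Keep the hard limit as-is so the soft limit can be raised again.
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, hard))
    try:
        yield
    finally:
        # Restore the original soft limit so other tests are unaffected.
        resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))
```

Inside the `with soft_file_size_limit(4096):` block, writing a file past 4096 bytes is expected to fail; once the block exits, the original limit applies again.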

57415793

Set the maximum load factor when building the perfect hash function · 6edde4c8

Jérémy Bobbio (Lunar) authored 1 year ago

We used to build the perfect hash function with the CHD_PH algorithm
using the default load factor, which was 0.5. We thus ended up with
a hash function which itself took the minimum amount of space, but
would result in an index twice as large as the number of
objects in the shard.

We now set the load factor to the maximum value (0.99) using the
ill-named `cmph_config_set_graphsize()` function to lose the
minimum amount of space in the index.

The 1% lost will still amount to 11-12 MiB for a shard of 30 million
objects.
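
The figures above can be checked with a short calculation. This is a sketch under two assumptions: the index has roughly `object_count / load_factor` slots, and each slot is a 40-byte entry (the entry size given in the "Write keys to the index" commit). The function name `index_overhead_bytes` is made up for this example.

```python
import math

ENTRY_SIZE = 40  # bytes per index entry (assumed, from the 40-byte entry commit)


def index_overhead_bytes(object_count, load_factor, entry_size=ENTRY_SIZE):
    """Bytes spent on index slots that hold no object."""
    slots = math.ceil(object_count / load_factor)
    return (slots - object_count) * entry_size


wasted_at_099 = index_overhead_bytes(30_000_000, 0.99)
wasted_at_050 = index_overhead_bytes(30_000_000, 0.5)
print(f"load factor 0.99: {wasted_at_099 / 2**20:.1f} MiB unused")
print(f"load factor 0.50: {wasted_at_050 / 2**20:.1f} MiB unused")
```

With a load factor of 0.5 the unused slots take as much space as the used ones, which is the doubling described in the first paragraph.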

6edde4c8

Write keys to the index to make the shard self-contained · ee9a7ae9

Jérémy Bobbio (Lunar) authored 1 year ago

The keys associated with each object are currently only used to create
and later to query the perfect-hash function. This means there is no way
to easily learn about all keys present in a given shard. As long as the
keys are hashes, we could recover them by computing the hash of each
object, but this would be a fairly expensive operation given the target size.

To make the shard self-contained, we write the keys (32 bytes) for each
object to the index, before the position of the object entry. Each index
entry is now 40 bytes. With shards of 100 GiB, with 3 kiB for each
object, the index will use about 1.23 GiB of extra space.

Writing the keys in the index enables us to do a linear scan to
get all keys or to retrieve an object if the perfect-hash
function is broken in any way.

Another bonus of keeping the key in the index is that the index can
be rewritten to change the key size, without having to rewrite the shard
entirely.
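
A rough sketch of what a linear scan over such an index could look like. The entry layout used here (32-byte key followed by an 8-byte unsigned position, 40 bytes in total) follows the sizes given above, but the byte order and the helper names `scan_index` and `find_by_scan` are assumptions made for this example, not the on-disk format or API of the library.

```python
import struct

# Assumed layout: 32-byte key, then an 8-byte unsigned position (40 bytes).
# Big-endian byte order is a guess made for this sketch.
INDEX_ENTRY = struct.Struct(">32sQ")


def scan_index(index_bytes):
    """Yield (key, position) for every entry, without any hash function."""
    yield from INDEX_ENTRY.iter_unpack(index_bytes)


def find_by_scan(index_bytes, wanted_key):
    """Fallback lookup that works even if the hash function is unusable."""
    for key, position in scan_index(index_bytes):
        if key == wanted_key:
            return position
    return None


# Two fake entries to exercise the scan.
raw_index = INDEX_ENTRY.pack(b"\x01" * 32, 512) + INDEX_ENTRY.pack(b"\x02" * 32, 4096)
all_keys = [key for key, _ in scan_index(raw_index)]
```

The scan is linear in the number of entries, so it is a recovery and listing tool rather than a replacement for the hashed lookup.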

ee9a7ae9

Properly handle errors when creating the hash function · a052b42d

Jérémy Bobbio (Lunar) authored 1 year ago

`cmph_new()` will return NULL when there is an issue while
creating the hash function. This was previously unchecked,
leading to a nice segfault when trying to save the index.

We now properly handle the case and throw an exception on the Python
side, hinting about possible duplicate keys.

a052b42d

Rename some functions and methods for clarity · 6923a2b3

Jérémy Bobbio (Lunar) authored 1 year ago

- `ShardCreator.create()` → `.prepare()`

  To create a shard, you actually need to write header, objects,
  index and perfect hash function. This method only initializes
  an incomplete header, so let’s call it `prepare()` instead.
  Update the C symbol accordingly.

- `ShardCreator.save()` → `.finalize()`

  When calling `write()`, objects are already written to the
  file. Using `save()` is misleading as one can easily think
  that everything is buffered until `save()` is called.
  Let’s call this `finalize()` as this is what makes the shard
  usable by writing the index and perfect hash function.
  Update the C symbol accordingly.

- `shard_lookup_object_size` → `shard_find_object()` and
  `shard_lookup_object` → `shard_read_object()`

  These two functions must always be called in sequence.
  The first will seek to the position of the object entry and
  read the size. What follows immediately after is the object data,
  and this is what the second function retrieves.
  We rename these functions to make it clearer that a lookup
  actually needs both of them.
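
The two-step lookup described in the last item can be illustrated in Python. This is only a sketch: the real functions are C symbols, the Python names `find_object` and `read_object` are stand-ins, and the 8-byte big-endian size prefix is an assumption made for the example.

```python
import io
import struct

SIZE_PREFIX = struct.Struct(">Q")  # assumed: 8-byte big-endian object size


def find_object(stream, position):
    """Seek to the object entry and return the size stored there."""
    stream.seek(position)
    (size,) = SIZE_PREFIX.unpack(stream.read(SIZE_PREFIX.size))
    return size


def read_object(stream, size):
    """Read the object data that immediately follows the size field."""
    return stream.read(size)


# A fake shard: 16 bytes of padding, then one size-prefixed object.
payload = b"hello shard"
stream = io.BytesIO(b"\x00" * 16 + SIZE_PREFIX.pack(len(payload)) + payload)
size = find_object(stream, 16)
data = read_object(stream, size)
```

`read_object` relies on the stream position left by `find_object`, which is why the two calls only make sense in that order.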

6923a2b3

Use bytes directly and remove HashObject · 782e9896

Jérémy Bobbio (Lunar) authored 1 year ago

The HashObject type does not bring any value. The code treats object
content as opaque bytes without assuming any properties, imposing any
restriction or having specific behaviors.

In order to simplify the API, let’s just say `swh-perfecthash` stores
`bytes` without making them special.

782e9896

Improve error handling · ab1c9f6e

Jérémy Bobbio (Lunar) authored 1 year ago

We now properly raise `OSError` (and subclasses) by looking
at the value of errno.

Several tests have been added triggering various error conditions.
`resource.setrlimit()` is used to limit the maximum size of a file.
This will make `write()` fail. As this limit would spill on other
tests, we use `pytest-forked` to run these tests in a different process.
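
As a small illustration of the errno-to-exception mapping, in plain Python and independent of the library's own binding code: constructing `OSError` with an errno value yields the matching subclass when one exists. The helper name `error_from_errno` is hypothetical.

```python
import errno
import os


def error_from_errno(code, filename=None):
    """Build the exception matching an errno value reported by C code."""
    # The OSError constructor picks a subclass such as FileNotFoundError
    # when the errno value has a dedicated one.
    return OSError(code, os.strerror(code), filename)


missing = error_from_errno(errno.ENOENT, "shard")
too_big = error_from_errno(errno.EFBIG, "shard")
```

Here `missing` is a `FileNotFoundError`, while `too_big` (the error expected when a write exceeds the file size limit) stays a plain `OSError` with `errno` set to `EFBIG`.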

ab1c9f6e

Split creating and consulting a Shard into different classes · 455e8262

Jérémy Bobbio (Lunar) authored 1 year ago

Creating and consulting a Shard are two different operations that
are always done separately. Having both in a single class meant
there were multiple ways to be in an illegal state.

Instead, we now have a `ShardCreator` that should be used as
a context manager, like so:

    with ShardCreator("shard", object_count=len(objects)) as shard:
        for key, object in objects.items():
            shard.write(key, object)

The `Shard` class can now only be used to perform `lookup()`.
The `load()` function was removed and inlined in initialization.
A `close()` method is available, but more importantly, the class
can now act as a context manager:

    with Shard("shard") as shard:
        return shard.lookup(key)

455e8262

Oct 13, 2023
- Add clang-format to pre-commit config · e9cf5fab
  Nicolas Dandrimont authored 1 year ago
  
  e9cf5fab
- create, load: Provide more descriptive error messages · f335225b
  Nicolas Dandrimont authored 1 year ago
```
Instead of an assertion failure, at least return the path of the shard
that failed to load.
```
  f335225b
- Add typing for cffi calls · 470d0271
  Nicolas Dandrimont authored 1 year ago
```
The main change is that self.ffi.buffer is correctly detected as
incompatible with HashObject. Use self.ffi.unpack (which generates a
bytes object) instead.
```
  470d0271
Oct 12, 2023

Add fdatasync call to shard_save() · 13c140da

Nicolas Dandrimont authored 1 year ago

This increases the probability that data is flushed to the shard before
we claim that it's ready for use.
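
The same pattern, sketched in Python rather than in the C code the commit touches: flush user-space buffers, then ask the kernel to push file data to storage before reporting success. The function name `write_durably` is hypothetical, and `os.fdatasync()` is only available on some Unix platforms (not on macOS).

```python
import os


def write_durably(path, data):
    """Write `data` to `path` and flush it to storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's buffer to the kernel
        os.fdatasync(f.fileno())  # ask the kernel to flush file data to disk
```

`fdatasync` skips flushing metadata that is not needed to read the data back, which makes it somewhat cheaper than `fsync`. It raises the odds that the data is on disk but, as the commit message says, it is not an absolute guarantee.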

13c140da

Aug 07, 2023
- Updated debian changelog for version 1.0.0 · c822355f
  Jenkins for Software Heritage authored 1 year ago
  
  debian/1.0.0-1_swh1
  
  c822355f
- New upstream version 1.0.0 · 6d781c7d
  Jenkins for Software Heritage authored 1 year ago
  
  debian/upstream/1.0.0
  
  6d781c7d
- Update upstream source from tag 'debian/upstream/1.0.0' · 54b4c262
  Jenkins for Software Heritage authored 1 year ago
```
Update to upstream version '1.0.0'
with Debian dir 16f97fd124eb1cf7038287f4264600685341f9c3
```
  54b4c262
Jun 23, 2023

Add configuration for building binary wheels · 951bae4f

Nicolas Dandrimont authored 1 year ago

This adds a script to build cmph locally, and hooks into the ffibuilder
to use the local copy of cmph if it's available.

This also configures cibuildwheel to properly generate manylinux wheels.

951bae4f

Feb 17, 2023
- mypy: Bump to 1.0.1 · 22ec91f3
  Antoine Lambert authored 1 year ago
```
Related to swh/meta#4960
```
  22ec91f3
Feb 16, 2023
- Update documentation to use tox 4 command line syntax · 8135902b
  Jérémy Bobbio (Lunar) authored 1 year ago
```
See: https://tox.wiki/en/latest/upgrading.html#updating-usage-with-e
```
  8135902b
- Update and clean tox configuration for version 4 · f24ba300
  Jérémy Bobbio (Lunar) authored 1 year ago
```
Related to swh/meta#4959
```
  f24ba300
Feb 02, 2023

pre-commit: Bump isort from 5.10.1 to 5.11.5 · cbe85b8b

Antoine Lambert authored 1 year ago

This fixes Python 3.7 support: poetry, a dependency of isort, removed
support for that Python version in a recent release.

cbe85b8b

Dec 19, 2022

docs: Include module indices only when building standalone package doc · 286661de

Antoine Lambert authored 2 years ago

In order to remove warnings about /apidoc/*.rst files being included
multiple times in toc when building full swh documentation, prefer to
include module indices only when building standalone package documentation.

Also include them the proper sphinx way.

Related to T4496

286661de

Oct 18, 2022

pre-commit, tox: Bump pre-commit, codespell, black and flake8 · 8142df4b

David Douard authored 2 years ago

- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror
being outdated).

8142df4b

May 09, 2022
- add strict asyncio_mode in pytest.ini · e5e39fc6
  Pratyush authored 2 years ago
  
  e5e39fc6
Apr 26, 2022
- Bump mypy to v0.942 · 01e993f1
  vlorentz authored 2 years ago
  
  01e993f1
Apr 21, 2022

pre-commit: Remove codespell commit-msg hook · 239fd900

Antoine Lambert authored 2 years ago

That hook can be frustrating as it can discard a long commit message
if it finds a typo in it, so it is better to remove it.

239fd900

Apr 08, 2022
- Add .git-blame-ignore-revs file with automatic reformatting commits · 05b03447
  Antoine Lambert authored 2 years ago
  
  05b03447
- python: Reformat code with black 22.3.0 · d9654140
  Antoine Lambert authored 2 years ago
```
Related to T3922
```
  d9654140