  1. Dec 04, 2023
  2. Dec 03, 2023
  3. Nov 29, 2023
    • Antoine Lambert's avatar
      docs/Makefile: Fix doc build outside tox when using make command · bf3250b0
      Antoine Lambert authored
      When building package documentation outside tox by calling make in the
      docs directory, the include of Makefile.sphinx inside the docs Makefile
      was failing as its relative path was invalid.
      
      So adapt this relative path according to whether the
      SWH_PACKAGE_DOC_TOX_BUILD environment variable is set or not.
      bf3250b0
  4. Nov 28, 2023
    • David Douard's avatar
      Update to swh-py-template v0.1.5 · 0cdc6b12
      David Douard authored
      - (re)add pkg_resources in mypy's ignored list
      - remove sphinx-dev target from tox
      - update sphinx target to use non-editable install of swh.docs
      - remove MANIFEST.in
      - exclude .clang-format from package data files
      0cdc6b12
  5. Nov 21, 2023
  6. Nov 20, 2023
  7. Nov 16, 2023
  8. Oct 24, 2023
    • Jérémy Bobbio (Lunar)'s avatar
      Ensure the key found in the index matches the query · fe2f6f73
      Jérémy Bobbio (Lunar) authored
      To detect instances of corruption, we now verify that the key found in
      the index during lookups matches the one given for the query.
      fe2f6f73
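The check described in this commit can be sketched in pure Python. This is only an illustration: the 40-byte entry (32-byte key followed by the object position) comes from the "Write keys to the index" commit, while the big-endian encoding, the `lookup_offset` helper, the `slot` argument (standing in for the result of the perfect hash function) and the choice of `KeyError` are assumptions, not the library's API.

```python
import struct

KEY_LEN = 32
# Assumed layout: 32-byte key, then a big-endian unsigned 64-bit offset.
ENTRY = struct.Struct(f">{KEY_LEN}sQ")


def lookup_offset(index: bytes, slot: int, key: bytes) -> int:
    """Return the object offset stored in `slot`, verifying the stored key."""
    stored_key, offset = ENTRY.unpack_from(index, slot * ENTRY.size)
    if stored_key != key:
        # A mismatch means the index entry does not belong to the queried
        # key, for instance because the index is corrupted.
        raise KeyError(f"key mismatch in index slot {slot}")
    return offset


# Hypothetical two-entry index used for illustration.
key_a, key_b = b"a" * KEY_LEN, b"b" * KEY_LEN
index = ENTRY.pack(key_a, 512) + ENTRY.pack(key_b, 4096)
```

Without the comparison, a query for a key that was never stored would silently return whatever object the hash function happens to point at.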
    • Nicolas Dandrimont's avatar
      Use our own setrlimit fixture instead of pytest-forked · 57415793
      Nicolas Dandrimont authored and Jérémy Bobbio (Lunar) committed
      By setting only the soft limit, we get the expected behavior but we are
      still able to reset it to the original value. This allows us not to use
      pytest-forked, while still avoiding pollution of other tests.
      57415793
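A minimal sketch of the soft-limit approach, written as a plain context manager that a pytest fixture could wrap. The names `soft_file_size_limit` and `errno_when_writing` are hypothetical and this is not the fixture from the repository; it only relies on the standard `resource` module.

```python
import contextlib
import errno
import os
import resource
import tempfile


@contextlib.contextmanager
def soft_file_size_limit(max_bytes: int):
    """Temporarily lower the soft RLIMIT_FSIZE, then restore the original."""
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    # Only the soft limit changes. Leaving the hard limit untouched is what
    # lets an unprivileged process put the soft limit back afterwards.
    resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, hard))
    try:
        yield
    finally:
        resource.setrlimit(resource.RLIMIT_FSIZE, (soft, hard))


def errno_when_writing(size: int, limit: int):
    """Write `size` bytes to a temp file under `limit`; return errno or None."""
    with tempfile.TemporaryDirectory() as tmp, soft_file_size_limit(limit):
        fd = os.open(os.path.join(tmp, "data"), os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            written = 0
            while written < size:
                # The write that would grow the file past the limit fails
                # with EFBIG (CPython ignores SIGXFSZ, so no signal kills us).
                written += os.write(fd, b"x" * (size - written))
        except OSError as exc:
            return exc.errno
        finally:
            os.close(fd)
    return None
```

Because the limit is restored in the `finally` clause, tests running later in the same process see the original value.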
    • Jérémy Bobbio (Lunar)'s avatar
      Set the maximum load factor when building the perfect hash function · 6edde4c8
      Jérémy Bobbio (Lunar) authored
      We used to build the perfect hash function with the CHD_PH algorithm
      with the default load factor value, which was 0.5. We thus ended up
      with a hash function which itself took the minimum amount of space,
      but would result in an index twice as large as the number of
      objects in the shard.
      
      We now set the load factor to the maximum value (0.99) using the
      ill-named `cmph_config_set_graphsize()` function to lose the
      minimum amount of space in the index.
      
      The 1% lost will still amount to 11-12 MiB for a shard of 30 million
      objects.
      6edde4c8
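The space figures in this commit message can be checked with a rough model. It assumes the index holds `object_count / load_factor` slots and that each slot is 40 bytes (the entry size given in the "Write keys to the index" commit); the library may round the table size differently, so treat the result as an estimate.

```python
import math


def index_slots(object_count: int, load_factor: float) -> int:
    """Approximate number of index slots for a given load factor."""
    return math.ceil(object_count / load_factor)


def wasted_bytes(object_count: int, load_factor: float, entry_size: int = 40) -> int:
    """Bytes taken by index slots that hold no object."""
    empty_slots = index_slots(object_count, load_factor) - object_count
    return empty_slots * entry_size


objects = 30_000_000
wasted_at_default = wasted_bytes(objects, 0.5)   # half of the index is empty
wasted_at_maximum = wasted_bytes(objects, 0.99)  # about 1% of the index is empty
print(f"{wasted_at_default / 2**20:.1f} MiB vs {wasted_at_maximum / 2**20:.1f} MiB")
```

Under these assumptions the 0.99 load factor wastes roughly 11.6 MiB, in line with the 11-12 MiB quoted above, whereas a load factor of 0.5 leaves as many empty slots as there are objects.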
    • Jérémy Bobbio (Lunar)'s avatar
      Write keys to the index to make the shard self-contained · ee9a7ae9
      Jérémy Bobbio (Lunar) authored
      The keys associated with each object are currently only used to create
      and later to query the perfect-hash function. This means there is no way
      to easily learn about all keys present in a given shard. As long as the
      keys are hashes, we could recover them by computing the hash of each
      object, but this would be a fairly expensive operation given the target size.
      
      To make the shard self-contained, we write the keys (32 bytes) for each
      object to the index, before the position of the object entry. Each index
      entry is now 40 bytes. With shards of 100 GiB and 3 KiB for each
      object, the index will use about 1.23 GiB of extra space.
      
      Writing the keys in the index enables us to do a linear scan to
      get all keys or to retrieve an object if the perfect-hash
      function is broken in any way.
      
      Another bonus of keeping the key in the index is that the index can
      be rewritten to change the keysize, without having to rewrite the shard
      entirely.
      ee9a7ae9
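The linear scan mentioned in this commit can be illustrated as follows. The 32-byte key followed by an 8-byte position is taken from the message; the big-endian encoding and the helper names `iter_index` and `find_by_scan` are assumptions made for this sketch, and empty slots are not modelled.

```python
import struct

KEY_LEN = 32
# Assumed encoding: 32-byte key, then a big-endian unsigned 64-bit offset.
ENTRY = struct.Struct(f">{KEY_LEN}sQ")


def iter_index(index: bytes):
    """Yield (key, offset) for every 40-byte entry, in storage order."""
    yield from ENTRY.iter_unpack(index)


def find_by_scan(index: bytes, wanted: bytes):
    """Fallback lookup that never consults the perfect hash function."""
    for key, offset in iter_index(index):
        if key == wanted:
            return offset
    return None


# Hypothetical index with two entries.
key_a, key_b = b"\x01" * KEY_LEN, b"\x02" * KEY_LEN
index = ENTRY.pack(key_a, 64) + ENTRY.pack(key_b, 3136)
all_keys = [key for key, _ in iter_index(index)]
```

The scan is linear in the number of entries, so it is only a recovery path; normal lookups would still go through the perfect hash function.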
    • Jérémy Bobbio (Lunar)'s avatar
      Properly handle errors when creating the hash function · a052b42d
      Jérémy Bobbio (Lunar) authored
      `cmph_new()` will return NULL when there is an issue while
      creating the hash function. This was previously unchecked,
      leading to a nice segfault when trying to save the index.
      
      We now properly handle the case and throw an exception on the Python
      side, hinting about possible duplicate keys.
      a052b42d
    • Jérémy Bobbio (Lunar)'s avatar
      Rename some functions and methods for clarity · 6923a2b3
      Jérémy Bobbio (Lunar) authored
      - `ShardCreator.create()` → `.prepare()`
      
        To create a shard, you actually need to write header, objects,
        index and perfect hash function. This method only initializes
        an incomplete header, so let’s call it `prepare()` instead.
        Update the C symbol accordingly.
      
      - `ShardCreator.save()` → `.finalize()`
      
        When calling `write()`, objects are already written to the
        file. Using `save()` is misleading as one can easily think
        that everything is buffered until `save()` is called.
        Let’s call this `finalize()` as this is what makes the shard
        usable by writing the index and perfect hash function.
        Update the C symbol accordingly.
      
      - `shard_lookup_object_size` → `shard_find_object()` and
        `shard_lookup_object` → `shard_read_object()`
      
        These two functions must always be called in sequence.
        The first seeks to the position of the object entry and
        reads the size. What follows immediately after is the object data,
        and this is what the second function retrieves.
        We rename these functions to make it clearer that a lookup
        actually needs both of them.
      6923a2b3
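The find-then-read sequence can be mimicked in Python on an in-memory stream. `ToyShard`, `make_entry` and the 8-byte big-endian size field are assumptions for illustration; the real functions are C symbols operating on the shard file.

```python
import io
import struct

SIZE_FIELD = struct.Struct(">Q")  # assumed width and byte order of the size


def make_entry(data: bytes) -> bytes:
    """Serialize one object entry: its size, immediately followed by its data."""
    return SIZE_FIELD.pack(len(data)) + data


class ToyShard:
    def __init__(self, stream):
        self._stream = stream

    def find_object(self, offset: int) -> int:
        """Seek to the object entry and return the size stored there."""
        self._stream.seek(offset)
        (size,) = SIZE_FIELD.unpack(self._stream.read(SIZE_FIELD.size))
        return size

    def read_object(self, size: int) -> bytes:
        """Read the object data; only meaningful right after find_object()."""
        return self._stream.read(size)

    def lookup(self, offset: int) -> bytes:
        """A complete lookup needs both steps, in this order."""
        return self.read_object(self.find_object(offset))


shard = ToyShard(io.BytesIO(make_entry(b"hello") + make_entry(b"world!")))
```

`read_object()` depends on the stream position left behind by `find_object()`, which is why the two calls cannot be reordered or used independently.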
    • Jérémy Bobbio (Lunar)'s avatar
      Use bytes directly and remove HashObject · 782e9896
      Jérémy Bobbio (Lunar) authored
      The HashObject type does not bring any value. The code treats object
      content as opaque bytes without assuming any properties, imposing any
      restriction or having specific behaviors.
      
      In order to simplify the API, let’s just say `swh-perfecthash` stores
      `bytes` without making them special.
      782e9896
    • Jérémy Bobbio (Lunar)'s avatar
      Improve error handling · ab1c9f6e
      Jérémy Bobbio (Lunar) authored
      We now properly raise `OSError` (and subclasses) by looking
      at the value of errno.
      
      Several tests have been added triggering various error conditions.
      `resource.setrlimit()` is used to limit the maximum size of a file.
      This will make `write()` fail. As this limit would spill over to other
      tests, we use `pytest-forked` to run these tests in a different process.
      ab1c9f6e
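The phrase "`OSError` (and subclasses)" refers to a standard Python behavior: constructing `OSError` with an errno value selects the matching subclass. The snippet below is a Python-side illustration only, with a hypothetical helper name; how the C extension raises the exception is not shown in the commit message.

```python
import errno
import os


def error_from_errno(code: int, filename=None) -> OSError:
    """Build the exception matching an errno value."""
    # The OSError constructor picks the subclass from the errno, e.g.
    # ENOENT -> FileNotFoundError and EACCES -> PermissionError. Codes
    # without a dedicated subclass, such as EFBIG, stay a plain OSError.
    return OSError(code, os.strerror(code), filename)


missing = error_from_errno(errno.ENOENT, "shard")
too_big = error_from_errno(errno.EFBIG, "shard")
```

Callers can therefore catch `FileNotFoundError` or `PermissionError` directly, and fall back to inspecting `exc.errno` for cases such as a file size limit being exceeded.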
    • Jérémy Bobbio (Lunar)'s avatar
      Split creating and consulting a Shard into different classes · 455e8262
      Jérémy Bobbio (Lunar) authored
      Creating and consulting a Shard are two different operations that
      are always done separately. Having both in a single class meant
      there were multiple ways to end up in an illegal state.
      
      Instead, we now have a `ShardCreator` that should be used as
      a context manager, like so:
      
          with ShardCreator("shard", object_count=len(objects)) as shard:
              for key, object in objects.items():
                  shard.write(key, object)
      
      The `Shard` class can now only be used to perform `lookup()`.
      The `load()` function was removed and inlined in initialization.
      A `close()` method is available, but more importantly, the class
      can now act as a context manager:
      
          with Shard("shard") as shard:
              return shard.lookup(key)
      455e8262
  9. Oct 13, 2023
  10. Oct 12, 2023
  11. Aug 07, 2023
  12. Jun 23, 2023
  13. Feb 17, 2023
  14. Feb 16, 2023
  15. Feb 02, 2023
  16. Dec 19, 2022
  17. Oct 18, 2022
  18. May 09, 2022
  19. Apr 26, 2022
  20. Apr 21, 2022
  21. Apr 08, 2022