
Rename the project as swh-shard, migrate to pybind11 and add cli tools

Open David Douard requested to merge douardda/swh-perfecthash:pybind11 into master

Calling this package perfecthash no longer makes much sense: its purpose is to create, read and manipulate shard files. Shard files do use cmph to speed up extracting content objects, but that is really an implementation detail.

Use pybind11 instead of cffi to wrap the cmph and shard manipulation code; this makes it a bit easier to add (C/C++) features to the extension.

Add a cli tool to manipulate shard files. Currently, it can:

  • read the header of the shard file
  • list entries in the shard file (as a list of {key: length})
  • get an object from a shard file
  • create a shard file from a list of files
  • delete one or more entries from a shard
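
To make the {key: length} listing and the deduplication step concrete, here is a minimal sketch in plain Python. It assumes that an object's key is the SHA-256 digest of its content, which the `get ... | sha256sum` output further down this page suggests. The `index_entries` helper is hypothetical and is not part of the swh.shard API.

```python
import hashlib


def index_entries(blobs):
    """Map each distinct blob to {hex key: length}.

    Assumption: the key is the SHA-256 digest of the content.
    Identical contents hash to the same key, so they are stored once.
    """
    entries = {}
    for blob in blobs:
        key = hashlib.sha256(blob).hexdigest()
        entries.setdefault(key, len(blob))
    return entries


# Three inputs, two of which have the same content.
blobs = [b"hello\n", b"world\n", b"hello\n"]
entries = index_entries(blobs)

print(f"{len(blobs)} inputs, {len(entries)} distinct objects")
for key, length in entries.items():
    print(f"{key}: {length} bytes")
```

Keying objects by a digest of their content is what makes deduplication fall out of a plain dictionary insert here; the real tool may well do this differently.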

The extension source files have been moved to src/_shard, and the python source files for the swh.shard package have been moved to src/swh/shard. This departs from the structure used by all other swh packages. It is required because a local 'swh' directory in the developer's working directory ends up in sys.path (by default), which breaks the dark magic involved in loading the package when it is installed in editable mode. In practice, this keeps pytest working when it is run directly from the source tree with the package installed in editable mode.
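
The shadowing effect behind this layout change can be illustrated with a small, self-contained sketch. It does not reproduce the editable-install machinery or the swh namespace package; it only shows the general rule that the first matching directory in sys.path wins. The package name `demo_ns` and the directory names are hypothetical.

```python
import importlib
import importlib.util
import pathlib
import sys
import tempfile


def make_pkg(root, name, marker):
    """Create a regular package `name` under `root` (hypothetical layout)."""
    pkg = pathlib.Path(root) / name
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text(f"ORIGIN = {marker!r}\n")


def resolve(name, search_path):
    """Return the file `import name` would pick for the given sys.path."""
    saved = sys.path[:]
    sys.path[:] = search_path
    try:
        importlib.invalidate_caches()
        return importlib.util.find_spec(name).origin
    finally:
        sys.path[:] = saved


with tempfile.TemporaryDirectory() as tmp:
    checkout = pathlib.Path(tmp) / "checkout"         # stands in for the working directory
    installed = pathlib.Path(tmp) / "site-packages"   # stands in for the installed copy
    make_pkg(checkout, "demo_ns", "checkout")
    make_pkg(installed, "demo_ns", "installed")

    # The working directory comes first: the local copy shadows the installed one.
    first = resolve("demo_ns", [str(checkout), str(installed)])
    # Without a top-level package in the working directory, the installed copy is found.
    second = resolve("demo_ns", [str(installed)])

print(first)
print(second)
```

With the sources under src/, the working directory no longer contains a top-level package directory, so this kind of shadowing cannot occur by accident.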

Edited by David Douard

Activity

  • Jenkins job DOPH/gitlab-builds #68 failed in 49 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • Author Maintainer

    This is a proposal for improvement of the shard file manipulation library.

    It should be 100% compatible with existing usage in winery (winery tests are OK with it).

    It provides a few cli tools to manipulate shard files:

    $ swh-shard info toto.shard
    Shard toto.shard
    ├─version:    1
    ├─objects:    1996
    │ ├─position: 512
    │ └─size:     18843173
    ├─index
    │ ├─position: 18843685
    │ └─size:     80680
    └─hash
      └─position: 18924365
    
    $ swh-shard ls toto.shard
    1641c1829c716fefe077aaf51639cd85f30ecc0518c97a17289e9a6e28df7055: 132 bytes
    c62ef6626212fa5123c8ad773cdff7aa4186b61038c86a0ad02eaa5b29f21eb1: 682 bytes
    05f43789f9bb3464e5287a98bdace0696a037fa183462a83c0071d3209d9d1ca: 3243 bytes
    a1dec2c7e652871138a9eefa41d9d302c42c5202886bd48b3a8a0ebc7404ff20: 2876 bytes
    [...]
    
    $ swh-shard get toto.shard 1641c1829c716fefe077aaf51639cd85f30ecc0518c97a17289e9a6e28df7055 | sha256sum
    1641c1829c716fefe077aaf51639cd85f30ecc0518c97a17289e9a6e28df7055  -
    
    $ swh-shard create tutu.shard src/swh/shard/*.py
    There are 3 entries
    after deduplication: 3 entries
    Done
    Edited by David Douard
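
The positions and sizes printed by `swh-shard info` above can be cross-checked with a little arithmetic. The sketch below only works on the numbers copied from that output; it does not read a shard file and makes no claim about the format beyond what these values show. The dictionary layout and the `end_of` helper are hypothetical.

```python
# Values copied from the `swh-shard info toto.shard` output above.
header = {
    "objects": {"position": 512, "size": 18843173},
    "index": {"position": 18843685, "size": 80680},
    "hash": {"position": 18924365},
}


def end_of(section):
    """Offset of the first byte after a section."""
    return section["position"] + section["size"]


objects_end = end_of(header["objects"])
index_end = end_of(header["index"])

# In this example each section starts exactly where the previous one ends.
print(objects_end == header["index"]["position"])
print(index_end == header["hash"]["position"])
```

For this particular file, the objects, index and hash sections are laid out back to back, with the objects section starting at offset 512.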
  • Nicolas Dandrimont
    • Resolved by David Douard

      This definitely goes in the right direction, thanks!

      I'm uncomfortable with having this all bundled into one commit. At least the changes are mostly done in separate files, so it should be fairly easy to split them into logical commits.

      Generally, I think too many details of the shard index serialization format are leaking into the python binding module. This was probably already the case with the cffi binding, but it might be worth cleaning this up now? I've left a few comments to that effect.

  • David Douard mentioned in merge request swh-docs!473 (merged)

  • David Douard added 9 commits

    • d8ba4d01 - Do not use -std=c++17 when compiling C code for test_hash
    • 3e8a30c3 - extension: Replace %ld by %lu in string format for unsigned long
    • 353b0e5a - extension: initialize index entries as "deleted" entries
    • b4efc6bc - docs: give more details on the shard file format
    • 2591bbc8 - extension: rename extension and C files hash.{ch} as shard.{ch}
    • 9b0fc2b9 - Rename the package as swh.shard
    • b61b266c - Migrate to pybind11 and restructure the source code directory
    • 0785cc82 - Add a cli tool to manipulate shard files
    • 1d4938ea - Update the README file with a "Quick Start" section

  • Jenkins job DOPH/gitlab-builds #69 failed in 48 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • David Douard added 3 commits

    • 56faf78e - Migrate to pybind11 and restructure the source code directory
    • 774fcef3 - Add a cli tool to manipulate shard files
    • 8f361662 - Update the README file with a "Quick Start" section

  • Jenkins job DOPH/gitlab-builds #70 failed in 45 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • David Douard added 1 commit

    • 4462d74e - Update the README file with a "Quick Start" section

  • Jenkins job DOPH/gitlab-builds #71 succeeded in 43 sec.
    See Console Output, Blue Ocean and Coverage Report for more details.

  • Author Maintainer

    Ok so I tried to improve things a bit... Need to recheck but it should be a step forward...

  • David Douard added 3 commits

    • cdef077e - Migrate to pybind11 and restructure the source code directory
    • 437ea4c9 - Add a cli tool to manipulate shard files
    • e80c7fcb - Update the README file with a "Quick Start" section
