A Python module implemented in C that ingests key/value pairs, writes them to a file and creates a hash table, stored in the same file, to be used as an index. The file is a Shard in the Read Storage.
Format:
- Format version
- An index which is a hash table
- The content of the objects
Let's assume the shard stores the following objects:
- (the empty object, len=0, sha256=e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855)

What would:
- the write operation for all objects
- the resulting on-disk format
- a read operation for an object of a given hash (let's say 053057fda9a935f2d4fa8c7bc62a411a26926e00b491c07c1b2ec1909078a0a2)

look like?
My naive proposal would have been to store the fixed-length (sha256, offset, length) tuples, sorted, at the beginning of the file, and to do the read of an index entry as a bisection of this list of tuples. This would mean that the write operation can stream all the objects, but needs to store the list of hash/offset/length to sort it and write it to the file eventually.
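For what it's worth, a minimal sketch of what that bisection read could look like, assuming 48-byte (sha256, offset, length) entries sorted at the beginning of the file; the entry layout, the `index_start` parameter and the function name are illustrative, not an actual format:

```python
import struct

# Layout assumed for this sketch only: a sorted array of fixed-size
# (sha256, offset, length) entries at the beginning of the file.
ENTRY = struct.Struct("<32sQQ")  # 32-byte sha256, 8-byte offset, 8-byte length

def naive_lookup(f, key, n_entries, index_start=0):
    """Bisect the sorted index: O(log N) seeks and reads per lookup."""
    lo, hi = 0, n_entries
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(index_start + mid * ENTRY.size)
        k, offset, length = ENTRY.unpack(f.read(ENTRY.size))
        if k == key:
            f.seek(offset)
            return f.read(length)      # the object's content
        elif k < key:
            lo = mid + 1
        else:
            hi = mid
    return None                        # key not in this shard
```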
Hopefully that clarifies how the file is created, but I can go into details if that's still unclear.
Reading fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9 is:

- hash(fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9) => 1 # entry one
- 1 * size of an entry in the hash table (32 + 8 + 8) + size of the header == I1
- seek I1
- read the offset O1 and the length L1 from I1
- seek O1
- read SHA256, read L1 bytes
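For illustration, a minimal sketch of that read path in Python, assuming a 16-byte header and 48-byte entries; the hash(key) step is left out and its result is passed in as `position`:

```python
import struct

HEADER_SIZE = 16                  # assumed header size, for illustration only
ENTRY = struct.Struct("<32sQQ")   # sha256 (32) + offset (8) + length (8) == 48 bytes

def read_object(f, key, position):
    """Read one object, `position` being the value returned by hash(key)."""
    i1 = HEADER_SIZE + position * ENTRY.size   # I1
    f.seek(i1)                                 # seek I1
    stored_key, o1, l1 = ENTRY.unpack(f.read(ENTRY.size))
    assert stored_key == key                   # sanity check
    f.seek(o1)                                 # seek O1
    return f.read(l1)                          # read L1 bytes
```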
My naive proposal would have been to store the fixed-length (sha256, offset, length) tuples, ...
Which the index is.
do the read of an index entry as a bisection of this list of tuples.
This would be O(log(N)) whereas the proposed format is O(1), and that will have a significant impact as the storage grows. Reading the object will require only one additional read in the index instead of several.
Then for the reading, I don't understand this item "hash(fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9) => 1"
What is the hash() function in this snippet? Is it something else than a lookup in the index part of the file?
what are "the parameters to the perfect hash functions"? what are the possible formats?
Given a set of keys, calculating the perfect hash function that will avoid any collision yields a set of parameters. The lookup function needs these parameters in addition to the key to figure out the position of the entry in the hash table. That is a few bits per entry (see https://homepages.dcc.ufmg.br/~nivio/papers/cikm07.pdf for instance). The possible formats depend on the actual function being implemented.
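To make "a set of parameters" concrete, here is a toy hash-and-displace construction in the spirit of CHD (this is not the cmph implementation, just an illustration): the per-bucket displacements are the parameters the lookup function needs in addition to the key.

```python
import hashlib

def _h(key, seed, n):
    """Deterministic hash of `key` (bytes) with an integer `seed`, into [0, n)."""
    d = hashlib.sha256(seed.to_bytes(8, "big") + key).digest()
    return int.from_bytes(d[:8], "big") % n

def build(keys):
    """Return the parameters: one displacement per bucket, plus the sizes.
    Keys must be distinct (duplicates would make the search loop forever)."""
    n = len(keys)                       # table size == number of keys
    n_buckets = max(1, n // 4)
    buckets = [[] for _ in range(n_buckets)]
    for k in keys:
        buckets[_h(k, 0, n_buckets)].append(k)
    displacements = [0] * n_buckets
    occupied = [False] * n
    # Place the largest buckets first; for each, try displacements until
    # every key in the bucket lands on a free, distinct slot.
    for b in sorted(range(n_buckets), key=lambda b: -len(buckets[b])):
        d = 1
        while True:
            slots = [_h(k, d, n) for k in buckets[b]]
            if len(set(slots)) == len(slots) and not any(occupied[s] for s in slots):
                break
            d += 1
        displacements[b] = d
        for s in slots:
            occupied[s] = True
    return displacements, n, n_buckets

def lookup(key, displacements, n, n_buckets):
    """Collision-free position of `key`, given the stored parameters."""
    return _h(key, displacements[_h(key, 0, n_buckets)], n)

# Example: the positions of 1000 distinct keys are all distinct.
keys = [i.to_bytes(8, "big") for i in range(1000)]
params = build(keys)
assert len({lookup(k, *params) for k in keys}) == len(keys)
```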
what are O, l and L in this example? are they positions of actual objects?
Let's forget about that and use @olasd's example and your very nice table drawn from it. I thought it would be useful to have some redundancy of the SHA256 so that both the index and the content are self-contained, but it is a detail.
What is the hash() function in this snippet? Is it something else than a lookup in the index part of the file?
It is the hash function that, given the SHA256 of an object, will return the position in the index of the entry that has the offset of the corresponding object.
I just realized that since a perfect hash function needs parameters that may require additional sequential reads at the beginning of the file, it would actually make more sense to have a regular hash function with a format that allows for collisions. Even if collisions are relatively frequent, the colliding entries may be stored adjacent to each other and will not require an additional read: they are likely to be in the same block most of the time. That would save the trouble of implementing a perfect hash function.
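A rough sketch of what the read path could look like with such a format, assuming open addressing with linear probing so that a probe chain is stored contiguously; the entry layout, header size and empty-slot marker are assumptions:

```python
import struct

ENTRY = struct.Struct("<32sQQ")   # sha256 (32), offset (8), length (8)
EMPTY = b"\x00" * 32              # assumed marker for an unused slot
BLOCK = 4096                      # typical filesystem block size
HEADER_SIZE = 16                  # assumed header size

def lookup(f, key, bucket):
    """Scan the probe chain starting at `bucket`: colliding entries are
    stored in the slots that follow, so the whole chain usually fits in
    the single block read below (wrap-around at the end of the table is
    omitted for brevity)."""
    pos = HEADER_SIZE + bucket * ENTRY.size
    while True:
        f.seek(pos)
        block = f.read(BLOCK)                    # one I/O covers many entries
        if len(block) < ENTRY.size:
            return None                          # end of index
        for i in range(0, len(block) - ENTRY.size + 1, ENTRY.size):
            k, offset, length = ENTRY.unpack_from(block, i)
            if k == key:
                return offset, length
            if k == EMPTY:
                return None                      # free slot: key is absent
        pos += (len(block) // ENTRY.size) * ENTRY.size  # rare: chain spills over
```

The write side would have to place colliding entries in the next free slots (classic open addressing) for the "same block most of the time" property to hold.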
```
$ dd if=/dev/urandom bs=32 count=25000000 | base64 --wrap 32 > /tmp/r
$ ls -lh /tmp/r
-rw-rw-r-- 1 loic loic 1,1G juil. 19 19:11 /tmp/r
$ sudo apt install cmph-tools
$ cmph -a chd_ph -t 2 -g /tmp/r
space lower bound is 0.000 bits per key
Starting mapping step for mph creation of 33333334 keys with 33333347 bins
Starting ordering step
Starting searching step
Starting compressing step
Successfully generated minimal perfect hash function

real	0m16,081s
user	0m15,256s
sys	0m0,816s

$ ls -lh /tmp/r.mph
-rw-rw-r-- 1 loic loic 3,0M juil. 19 20:39 /tmp/r.mph
```
Pre-loading all of them in memory would cost at most ~4MB for each 100GB Shard, i.e. 40GB RAM for 1PB (10,000 shards × 4MB). It may not even be necessary to explicitly cache them since they will be read frequently.
In the global read index, I would consider storing, for each object, alongside the shard id, the length and offset of the object (which are comparatively cheap to store). This way, the per-shard index only gets used for standalone operation, which would (probably?) be an edge case. As I don't really grasp the specifics of the hashing algorithm, I have a hard time understanding what the actual performance will look like. log2(number objects in bucket) can still be a fairly small number of reads compared to 2 reads and a potentially expensive hashing algorithm?
In the global read index, I would consider storing, for each object, alongside the shard id, the length and offset of the object (which are comparatively cheap to store)
This is tempting, I agree. The only problem is that the global index does not scale out. There may be a solution involving a different method to store the global index in the Write Storage (Postgres) and the Read Storage (hash table). Or a mix of both. The goal would be to make it so the probability that looking up an object in the Read Storage requires a lookup in the global index of the Write Storage decreases as the size of the Read Storage increases.
log2(number objects in bucket) can still be a fairly small number of reads compared to 2 reads and a potentially expensive hashing algorithm
The CPU cost of the hash function is marginal for lookups, and the build is quite fast (seconds within a global process that takes minutes). Even a single additional read to retrieve one object is expensive because I/O is the bottleneck. The index of a 100GB Shard is about 1GB, and 2+ lookups in the index will require 2+ I/Os: they are unlikely to be in cache.
! In #3104 (closed), @douardda wrote:
Ideally, since the perfecthash feature will be needed only for a specific objstorage backend, it should be an optional dependency.
Wouldn't it make sense to put the cffi-based cmph wrapper in a dedicated python module/project (not necessarily under the swh namespace)?
Wouldn't it make sense to put the cffi-based cmph wrapper in a dedicated python module/project (not necessarily under the swh namespace)?
It would, but who would maintain it in the long run?
Humm not sure what the deal is there:
Yeah, there was some kind of split over five years ago, which is why I was not very keen on re-using these. Moreover, they include the sources of cmph and it seems easier to link against the packaged version.
! In #3104 (closed), @dachary wrote:
Wouldn't it make sense to put the cffi-based cmph wrapper in a dedicated python module/project (not necessarily under the swh namespace)?
It would, but who would maintain it in the long run?
SWH I guess: I don't see the difference whether it's embedded in swh-objstorage, winery or a dedicated package.
Humm not sure what the deal is there:
Yeah, there was some kind of split over five years ago, which is why I was not very keen on re-using these. Moreover, they include the sources of cmph and it seems easier to link against the packaged version.
So does it make sense to use this package instead of reimplementing one? What's the catch?
SWH I guess: I don't see the difference whether it's embedded in swh-objstorage, winery or a dedicated package.
Ok.
So does it make sense to use this package instead of reimplementing one? What's the catch?
I was concerned about adding a dependency on a 5+ year old unmaintained PyPI binary package. I did not even try to use it. If your recommendation is to go for it, that's what I'll do.
So does it make sense to use this package instead of reimplementing one? What's the catch?
In addition to being unmaintained, the current packages do not implement in C everything SWH needs, and they expose interfaces that are not useful. The goal is to use the CHD algorithm and no other: there is no need to expose the others. Even if the decision is made to use another one at a later time, it will be a change internal to the module and not exposed via the API. The API of the module will be something similar to the existing cmph modules but with different semantics: the list of keys must not be read from the shard, converted to Python objects, provided as arguments to the generate_hash function, only to be converted back to C objects to compute the hash function. Instead, the generate_hash function should be given a file descriptor to read the ~30 million keys and a file descriptor to write the hash function.
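Something along these lines, to sketch the intent (names and signatures are hypothetical, not a final API):

```python
# Hypothetical interface of the C-backed module, for illustration only.

def build_hash(keys_fd: int, mph_fd: int) -> int:
    """Read the fixed-size keys directly from `keys_fd`, build the CHD
    function entirely in C (no round trip through Python objects), write
    its parameters to `mph_fd` and return the number of keys."""
    raise NotImplementedError  # would be implemented in C

def load_hash(mph_fd: int) -> object:
    """Load the hash function parameters (a few MB per shard) once."""
    raise NotImplementedError  # would be implemented in C

def hash_position(h: object, key: bytes) -> int:
    """Position in the shard's hash table of the entry for `key`."""
    raise NotImplementedError  # would be implemented in C
```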
So does it make sense to use this package instead of reimplementing one? What's the catch?
In addition to being unmaintained,
this could be addressed by asking authors to be in charge of the package
the current packages do not implement in C everything SWH needs, and they expose interfaces that are not useful. The goal is to use the CHD algorithm and no other: there is no need to expose the others. Even if the decision is made to use another one at a later time, it will be a change internal to the module and not exposed via the API. The API of the module will be something similar to the existing cmph modules but with different semantics: the list of keys must not be read from the shard, converted to Python objects, provided as arguments to the generate_hash function, only to be converted back to C objects to compute the hash function. Instead, the generate_hash function should be given a file descriptor to read the ~30 million keys and a file descriptor to write the hash function.
That I did not know, so indeed, if we need a specific wrapper for our needs, then it makes sense to create a dedicated swh-perfecthash package. In any case, I would prefer this package to be standalone rather than embedded in swh-objstorage (at least for packaging/distribution reasons).
In addition to being unmaintained,
this could be addressed by asking authors to be in charge of the package
I'm not sure how to do that? Do you mean, for instance, asking Greg Bowyer, author of https://github.com/GregBowyer/cmph-cffi to be in charge of which package? I'm probably slow today :-)
That I did not know, so indeed, if we need a specific wrapper for our needs, ...
I should have clarified that even if there was a well-maintained cmph module, there would still be a need for a Software Heritage specific C package. The goal is to optimize the creation of the perfect hash table. During benchmarking I noticed that creating, writing and reading millions of binary entries from a file in pure Python takes time, which is why I decided to go for a C implementation.