Documentation effort magma 0.0.20070911 released magma 0.0.20070829 released magma 0.0.20070806 released on SVN All the news
DOCUMENTATION :: How to solve hash key collisions
Magma storage backend, the flare system, is extensively based on SHA1 hashes. Each flare (see names to know more about flares) is actually saved in a 3 element format on disk.
- a directory named as the SHA1 of the full flare path.
- a contents file, inside previous directory, which stores flare contents where applicable
- a metadata companion file, inside same directory, which
stores flare metadata like full path,
struct statand hash rappresentation
If two flare paths lead to same hash, both will operate on same file contents. That's even worst you can think. Suppose that a regular file and a directory can be referred by the same hash. What will happen if file is scanned as a directory?
Proposal for a clean solution
The simplest possible approach is to add a middle layer in storage backend, adding a second hash inside main directory. That's the pattern.
Suppose that a file
f1 and a directory
compute the same hash 6d67fa810e3b294e7204ff5a72f98023f34a238f. To
distinguish them, the structure can be changed in:
+ 6d67fa810e3b294e7204ff5a72f98023f34a238f (h1) + 328a2764b4c543d28e3f0a8b73fc9d3e6f824522 (h2) + metadata + contents + 8a1b6c4d09e754fa57b6c28d532648ee243f6a38 (h3) + metadata + contents
How are calculated h2 and h3 hashes? Second level hashes are the result of sha1(strcat(h1,flare->path)). The probability that two colliding paths generate a second collision when concatenated to collided hash is simply not relevant.
That change should be reflected even inside the b-tree cache. A linear linked list should be enough to manage collisions. Each flare should be provided of a link to next flare with same hash. And cache operations should investigate a little bit more while scanning or modifying the b-tree, not accepting the first position (flare) found but matching also flare path to see if flare is exactly what was requested.
Which code should be changed to reflect this situation?
The short answer is a lot of code, but with minor changes.
struct magma_flare should be provided of a magma_flare_t
pointer to build linked list of colliding flares.
magma_search() should double check flare path too before
returning a flare. If flare path don't match and colliding list is not
empty, all linked elements should be scanned until valid one is found or
NULL pointer is reached.
magma_add_to_cache_after() should add a flare to collision
list if a flare with same hash exists in b-tree but is not the same flare.
magma_remove_from_cache() should check if flare has non-NULL
collision list. If has one, insted of performing normal element removal,
flare should be simply substituted by its following colliding flare.
First picture show how cache removal is managed when flare (2) has no collisions. Flare (3) has been choosen for its load, higher than load of flare (4).
Next picture show how cache removal is managed when removing a flare (2) which has a colliding flare (2b) with same hash and pointed to by flare (2) collision list.
should also be modified to handle new storage backend schema.