HEADS UP - HAMMER work

Simon 'corecode' Schubert corecode at fs.ei.tum.de
Thu Nov 13 10:44:45 PST 2008


Matthew Dillon wrote:
>     Sigh.  OK, fine.  Here it is.  This is subject to change.  The whole
>     purpose of doing a semi-sorted directory hash encoding is that many
>     standard system functions, like ls -la, sort the contents of the
>     directory.  For very large directories (e.g. hundreds of thousands of
>     files, which blow out system caches) there is no locality of reference
>     for sorted lookups, and each lookup essentially has to do a random
>     seek.  In HAMMER these are still direct seeks since we have the B-Tree.
>     (In UFS the operation devolves down to O(N^3) because the entire directory
>     must be scanned for each lookup if the caches are blown out).
> 
>     With a semi-sorted hash key the random seeks HAMMER has to do can be
>     reduced by up to a factor of 100.
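
(To check my understanding: I guess a semi-sorted key puts a sortable prefix
of the name in the high bits and hash bits below it, keeping the lsb clear -
something like the sketch below, which is purely illustrative and surely not
the actual HAMMER encoding.)

    #include <stdint.h>

    /*
     * Purely illustrative semi-sorted directory key: the top bits sort
     * roughly like the filename, the low bits are a hash of the whole
     * name for uniqueness, and the lsb stays 0.  Not the real encoding.
     */
    static uint64_t
    semi_sorted_key(const char *name)
    {
        uint64_t key = 0;
        uint32_t hash = 0;
        int i;

        /* sortable prefix: first four bytes of the name in bits 63..32 */
        for (i = 0; i < 4 && name[i] != '\0'; ++i)
            key |= (uint64_t)(unsigned char)name[i] << (56 - 8 * i);

        /* simple string hash of the full name in bits 31..1 */
        for (i = 0; name[i] != '\0'; ++i)
            hash = hash * 33 + (unsigned char)name[i];
        key |= (uint64_t)(hash & 0x7fffffffU) << 1;

        return (key);
    }

Names that sort near each other would then land on nearby B-Tree keys, which
is where the locality comes from.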

While optimizing this is a worthwhile goal, I wonder who is doing sorted lookups in the first place.  Shouldn't programs do loop{readdir(); stat();}; sort; instead of loop{readdir();}; sort; loop{stat();}?  I'm thinking of NFS shares using the readdirplus NFS RPC - wouldn't they be rather unhappy if people didn't stat in the right order?
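
Spelled out, the two patterns I mean are roughly this (error handling mostly
omitted, sketch only):

    #include <sys/stat.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Pattern A: loop{readdir(); stat();} - stat each entry as readdir()
     * returns it, i.e. in readdir (hash) order. */
    static void
    scan_interleaved(const char *dir)
    {
        DIR *dp = opendir(dir);
        struct dirent *de;
        struct stat st;
        char path[1024];

        if (dp == NULL)
            return;
        while ((de = readdir(dp)) != NULL) {
            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
            if (lstat(path, &st) == 0)
                printf("%-20s %lld\n", de->d_name, (long long)st.st_size);
        }
        closedir(dp);
    }

    /* Pattern B: loop{readdir();}; sort; loop{stat();} - collect all
     * names, sort them, then stat in sorted order. */
    static int
    cmpname(const void *a, const void *b)
    {
        return (strcmp(*(const char * const *)a, *(const char * const *)b));
    }

    static void
    scan_sorted(const char *dir)
    {
        DIR *dp = opendir(dir);
        struct dirent *de;
        struct stat st;
        char **names = NULL;
        char path[1024];
        size_t n = 0, i;

        if (dp == NULL)
            return;
        while ((de = readdir(dp)) != NULL) {
            names = realloc(names, (n + 1) * sizeof(*names));
            names[n++] = strdup(de->d_name);
        }
        closedir(dp);
        qsort(names, n, sizeof(*names), cmpname);
        for (i = 0; i < n; ++i) {
            snprintf(path, sizeof(path), "%s/%s", dir, names[i]);
            if (lstat(path, &st) == 0)
                printf("%-20s %lld\n", names[i], (long long)st.st_size);
            free(names[i]);
        }
        free(names);
    }

Pattern B is where the sorted B-Tree lookups come from; with pattern A the
stats arrive in hash order to begin with.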

>     lsb bit is 0 to provide an iteration space.  The iteration limit is set
>     to 2^24 (for dealing with collisions).  That means the absolute guarantee
>     is 16 million files per directory but, realistically, one could probably
>     put in excess of 2 billion files in a directory, or more, before
>     hitting the iteration limit.

What do you mean by "iteration limit"?  I feel somewhat uncomfortable arbitrarily limiting the number of files in a directory based on statistical multiplexing, because in this case we're not dealing with slow lookups but with seemingly unreasonable EEXIST errors when hashes collide.  Maybe I'm not understanding correctly, though.
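
Just so we're talking about the same mechanism, this is how I read the
collision handling; key_in_use() and store_key() are made-up stand-ins for
the B-Tree operations, certainly not actual HAMMER code:

    #include <errno.h>
    #include <stdint.h>

    #define ITERATION_LIMIT  (1 << 24)  /* from the mail above */

    int key_in_use(uint64_t key);                   /* hypothetical B-Tree lookup */
    int store_key(uint64_t key, const char *name);  /* hypothetical B-Tree insert */

    static int
    insert_dirent(uint64_t base_key, const char *name)
    {
        uint64_t key = base_key;    /* lsb assumed 0, per the quote above */
        int i;

        /*
         * Bump the key through the iteration space until a free slot
         * turns up.  The exact key arithmetic is a guess; the point is
         * that once 2^24 consecutive keys are taken, the create has to
         * fail even though the directory is nowhere near full in any
         * other sense.
         */
        for (i = 0; i < ITERATION_LIMIT; ++i, ++key) {
            if (!key_in_use(key))
                return (store_key(key, name));
        }
        return (EEXIST);    /* the failure mode I am worried about */
    }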

cheers
  simon




