HEADS UP - HAMMER work
Dennis Melentyev
dennis.melentyev at gmail.com
Fri Nov 14 13:03:35 PST 2008
Hi!
2008/11/14 Oliver Fromme <check+kabwj800rsns8lml at fromme.com>:
> Matthew Dillon wrote:
> > 64-bit directory hash encoding (for smaller filenames out of bounds
> > indices just store a 0).
> >
> > aaaaa name[0] & 0x1F
> > bbbbb name[1] & 0x1F
> > ccccc name[2] & 0x1F
> > mmmmmm crc32(name + 3, len - 5) and some xor magic -> 6 bits
> > yyyyy name[len-2] & 0x1F
> > zzzzz name[len-1] & 0x1F
> > h[31] crc32(name, len) (entire filename)
> >
> > 0aaaaabbbbbccccc mmmmmmyyyyyzzzzz hhhhhhhhhhhhhhhh hhhhhhhhhhhhhhh0
> > [...]
. ....
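To make the quoted layout concrete, here is a minimal sketch of how the fields could be packed into one 64-bit key. The function name pack_dirhash is hypothetical, the "xor magic" for the 6-bit m field is not specified above so m is taken as an already-computed input, and which 31 bits of the CRC are kept is my assumption (the low 31):

```c
#include <stdint.h>

/*
 * Pack the fields of the quoted layout into a 64-bit key:
 *   bit 63      0
 *   bits 62-58  a (5 bits)    bits 57-53  b    bits 52-48  c
 *   bits 47-42  m (6 bits)    bits 41-37  y    bits 36-32  z
 *   bits 31-1   h (31 bits of the full-name CRC)
 *   bit 0       0
 * Keeping the LOW 31 bits of the CRC is an assumption, not from the post.
 */
uint64_t pack_dirhash(unsigned a, unsigned b, unsigned c,
                      unsigned m, unsigned y, unsigned z, uint32_t h)
{
    uint64_t key = 0;

    key |= (uint64_t)(a & 0x1F) << 58;
    key |= (uint64_t)(b & 0x1F) << 53;
    key |= (uint64_t)(c & 0x1F) << 48;
    key |= (uint64_t)(m & 0x3F) << 42;
    key |= (uint64_t)(y & 0x1F) << 37;
    key |= (uint64_t)(z & 0x1F) << 32;
    key |= (uint64_t)(h & 0x7FFFFFFFu) << 1;
    return key;
}
```

With this packing, names sharing a, b and c land next to each other in key order, which is exactly why identical prefixes and extensions waste those bits.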
>
> You already mentioned it. That's exactly the problem
> that I'm seeing ... I'm not sure whether a[], b[], c[],
> y[] and z[] buy you anything in practice.
>
> If a single directory contains a huge number of files,
> it is likely they are all of the same type, e.g. it could
> be a collection of images or whatever. That means they
> all have the same extension (e.g. .jpg), so y[] and z[]
> are useless.
>
> Furthermore, it isn't completely unlikely that they even
> begin with the same prefix. For example, all of my
> digital camera pics are named "img%05d.jpg". Admittedly
> those aren't millions (but more than 10k anyway), and
> I'm not stupid enough to collect them in a single
> directory. ;-)
>
> Another example: The cache directory of my Opera browser.
> It contains several thousands of files all beginning
> with "opr*".
>
> It might be a good idea to make a small survey, i.e. find
> people who actually _do_ have directories with a huge
> number of files in them (and I mean more than just a few
> thousands), and ask them what the filenames typically look
> like.
>
> An obvious improvement would be to store name[d-2] and
> name[d-1] in y[] and z[], respectively, where d is the
> location of the last dot in the filename, if any, or the
> location of the terminating zero if there is no dot.
> In other words: Ignore the extension when identifying
> y[] and z[]. Finding the last dot shouldn't be more
> computationally expensive than strlen(name), so this
> shouldn't be a problem.
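Oliver's rule could look roughly like the sketch below. The names stem_end and stem_yz are made up for illustration, and out-of-bounds positions store 0 as in Matt's description:

```c
#include <stddef.h>
#include <string.h>

/*
 * Return d: the index of the last '.' in name, or strlen(name)
 * if the name contains no dot.
 */
size_t stem_end(const char *name)
{
    const char *dot = strrchr(name, '.');

    return dot ? (size_t)(dot - name) : strlen(name);
}

/*
 * Store name[d-2] & 0x1F in *y and name[d-1] & 0x1F in *z,
 * i.e. the last two bytes before the extension.
 * Positions that fall before the start of the name store 0.
 */
void stem_yz(const char *name, unsigned *y, unsigned *z)
{
    size_t d = stem_end(name);

    *y = (d >= 2) ? ((unsigned char)name[d - 2] & 0x1F) : 0;
    *z = (d >= 1) ? ((unsigned char)name[d - 1] & 0x1F) : 0;
}
```

Note that a dotfile such as ".profile" gives d = 0 under the literal rule, so both fields end up 0. strrchr() walks the whole string once, so the cost is comparable to strlen(), as Oliver says.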
I do agree with Oliver, but I have another proposal.
I doubt that there are usually more than one or two affected
directories per host, and the file names in them usually follow a
very similar pattern.
A sysctl (or some other tunable) with some kind of mask would be great
for fine-tuning (and just useless for 90% of users).
Something like:
sysctl.hammer.dirhash.hashmask.prefix=1
    (starting at the first filename byte, fixed length of 3 bytes)
sysctl.hammer.dirhash.hashmask.suffix=-1
    (starting at the last byte, length of 2 bytes)
This way, admins would be able to re-tune it to their particular needs.
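A rough sketch of what I mean, in userland C. Nothing like this exists in HAMMER; struct hashmask, pick and masked_fields are hypothetical names, and I am assuming that prefix is a 1-based position from the start and that suffix is a negative offset from the end, with -1 meaning the 2-byte window ends at the last byte:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tunables mirroring the two sysctls proposed above. */
struct hashmask {
    int prefix;   /* 1-based position of the first of 3 prefix bytes */
    int suffix;   /* negative; -1 means the window ends at the last byte */
};

/* Return name[i] & 0x1F, or 0 if i is outside the name. */
static unsigned pick(const char *name, size_t len, long i)
{
    if (i < 0 || (size_t)i >= len)
        return 0;
    return (unsigned char)name[i] & 0x1F;
}

/* Fill out[0..2] with the prefix fields and out[3..4] with the suffix fields. */
void masked_fields(const char *name, const struct hashmask *hm,
                   unsigned out[5])
{
    size_t len = strlen(name);
    long p = (long)hm->prefix - 1;     /* 0-based start of prefix window */
    long e = (long)len + hm->suffix;   /* 0-based index of last suffix byte */

    out[0] = pick(name, len, p);
    out[1] = pick(name, len, p + 1);
    out[2] = pick(name, len, p + 2);
    out[3] = pick(name, len, e - 1);
    out[4] = pick(name, len, e);
}
```

With prefix=1 and suffix=-1 this gives the same five bytes as the current scheme. For the "img%05d.jpg" case, prefix=4 and suffix=-5 would skip both the common "img" and the common ".jpg".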
Just my 0.02UAH
--
Dennis Melentyev