one usage of up to a million files/directory

David Tweed david.tweed at gmail.com
Mon Nov 17 06:59:31 PST 2008


Hi,

I've been vaguely following DFly for a while, and in response to the
comment about finding out about use cases for huge directories in the
HAMMER update thread, I thought I'd share my experience.

I do research in image processing. Although there are many formats for
video as a stream in one big file, in the kind of research I do it's
often easier to work with one file per frame, partly because many
video formats don't have good support for random access,
particularly fast random access, and partly because it generally
simplifies the code to have the input in the same format as the output
(eg, doing frame comparisons, running shell scripts, etc, that you can't
do on a full video stream). The "results" are often written out as one
image/file as well. Obviously this wastes a significant amount of disc
space since there's no temporal compression, but it's better for the
particular work we do. (We're working on Linux currently because it's
a mainstream Unix-y option.)

We go up to about a million files/directory, deliberately splitting
stuff up at larger sizes, partly because of OS directory limitations
but partly because lots of other things, like tar-ing up stuff, get
really tricky. (There's also the problem that we are image researchers
rather than unix gurus, so we tend to do simple things like passing
arguments from shell globs (which runs into the kernel's limit on
argument-list size) rather than sophisticated xargs pipelines, and many
programs can't deal with excessively large numbers of arguments anyhow,
which again argues against ultra-huge directories.)
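
(The usual way around the glob limit, other than xargs, is to have the
program walk the directory itself. A minimal Python sketch, assuming a
hypothetical `frames` helper name and `.jpg` files, just to show the idea:

```python
import os

def frames(directory, ext='.jpg'):
    """Yield matching files one at a time via os.scandir, so nothing
    ever expands a million names into a single argument list the way
    a shell glob would.  (Hypothetical helper, not the actual tooling.)"""
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file() and entry.name.endswith(ext):
                yield entry.path
```

Since it yields lazily, a million-file directory costs one entry of
memory at a time rather than one giant argument vector.)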

As regards names, the original data is often in the form of
base_xxxxxx.ext, where xxxxxx is a frame number that increases
sequentially and ext is the file extension, with generally just one
sequence per directory. In the output directories there can be several
sequences, which often have names like some choice from
base1_xxxxxx.ext, base2_xxxxxx_yyyyyy.ext,
base3_xxxxxx_yyyyyy_zzzzzz.ext. Some choices for the xxxxxx, yyyyyy,
zzzzzz are the pid (when running the same stochastic method on the
same dataset at the same time on SMP to check they converge to the
same thing), the frame number, and a "sub-frame iteration number" (so
you get frame_000001_000001.jpg, ...., frame_000001_000100.jpg for
frame 1 before moving on to frame 2). I don't think we've ever had a
directory with many files whose names weren't numerically ordered by
one or more indexes, with every name completely unique.
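
(As a concrete illustration of the convention, a hypothetical helper —
my sketch, not the authors' actual code — that builds such names:

```python
def frame_name(base, *indexes, width=6, ext='jpg'):
    """Build names like frame_000001_000100.jpg from the convention
    described above: a base plus one or more zero-padded numeric fields.
    (Hypothetical helper for illustration only.)"""
    fields = ''.join('_%0*d' % (width, i) for i in indexes)
    return '%s%s.%s' % (base, fields, ext)
```

Zero-padding to a fixed width is what makes plain lexicographic sorting
agree with numeric order.)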

(Incidentally, lots of unix tools become really annoying with large
numbers of files. Eg, I've got an inefficient but user-centred script
that replaces ls so that it gives output like

base1_[000000-0097832].jpg
base1_[0097835-010000].jpg
base2_[000000-0000999]_[000000-0000010].jpg

where human beings can spot problems, rather than screenfuls of stuff
that's essentially useless for reading. And tab completion really
should learn that if there are more than 50 possibilities it shouldn't
ask whether to show me all of them, just tell me there are too
many.)
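
(The core of such a script can be sketched in a few lines of Python.
This is my simplified reconstruction, not the author's script: it only
collapses the last numeric field of each name, where the real one
evidently handles several:

```python
import re
from itertools import groupby

def collapse(names):
    """Summarise runs of consecutive frame numbers as stem[start-end].ext,
    so gaps in a sequence stand out instead of scrolling past as a
    screenful of near-identical names."""
    pat = re.compile(r'^(.*?)(\d+)(\.[^.]*)$')  # stem, last numeric field, ext
    groups, other = {}, []
    for name in names:
        m = pat.match(name)
        if m:
            stem, num, ext = m.groups()
            groups.setdefault((stem, ext), []).append(num)
        else:
            other.append(name)  # no trailing frame number: pass through
    lines = []
    for (stem, ext), nums in sorted(groups.items()):
        nums.sort(key=int)
        # on a run of consecutive numbers, value minus position is
        # constant, so grouping on that difference splits at each gap
        for _, run in groupby(enumerate(nums), key=lambda t: int(t[1]) - t[0]):
            run = [n for _, n in run]
            if len(run) == 1:
                lines.append(stem + run[0] + ext)
            else:
                lines.append('%s[%s-%s]%s' % (stem, run[0], run[-1], ext))
    return lines + other
```

So a directory missing a few frames prints as two or three range lines
instead of a million filenames.)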

I don't know how all this affects the directory hashing scheme, and
it's not a finished product, but hopefully it's a data point about how
large numbers of files get named.

-- 
cheers, dave tweed__________________________
david.tweed at gmail.com
Rm 124, School of Systems Engineering, University of Reading.
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot

More information about the Kernel mailing list