HAMMER update 06-Feb-2008

Wed Feb 6 17:19:04 PST 2008

:How will this affect parallel IO (reads, but especially writes)? Would
:having such a global structure serialize it? (I'm assuming, possibly
:wrongly, that having trees per-cluster allowed you to lock individual
:clusters).

    Reads will not be affected at all... the locking occurs at the B-Tree
    node layer.

    Writes will not be serialized and will still be asynchronous so the
    most typical striping setups on multi-disk filesystems should still
    yield very high performance.  Writes WILL be far more likely to be
    sequential which should actually improve write performance.  Also
    keep in mind that writes are buffered by the buffer cache, so there
    is a caching layer between userland and the physical disk.

    Mixed data writes (parallel write operations by multiple processes in
    different parts of the filesystem) will generally lay down new
    information sequentially on disk, which can be detrimental for read
    performance since the individual files will not be entirely sequential.
    I seem to recall a paper at a USENIX long ago where someone tested
    locality of reference for reads after laying down writes from 
    parallel sources sequentially, and it was no worse than trying to zone
    the disparate writes, so I'm not really worried about this case.
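
    To make the mixed-write case concrete, here is a minimal sketch in
    Python. It is a toy model, not HAMMER code: the allocator, the
    function names (lay_down_sequentially, count_runs) and the block
    counts are all hypothetical. Each incoming write simply takes the
    next free offset, so the disk is written sequentially while each
    individual file ends up split into several extents.

```python
from collections import defaultdict

def lay_down_sequentially(writes):
    """Assign each write the next free disk offset, in arrival order.

    writes: iterable of (file_name, length_in_blocks) tuples.
    Returns {file_name: [(start_offset, length), ...]}.
    Hypothetical toy allocator, not HAMMER's on-disk allocation logic.
    """
    extents = defaultdict(list)
    next_free = 0
    for name, length in writes:
        extents[name].append((next_free, length))
        next_free += length
    return dict(extents)

def count_runs(file_extents):
    """Count the contiguous runs in one file's extent list."""
    runs = 0
    prev_end = None
    for start, length in file_extents:
        if start != prev_end:
            runs += 1
        prev_end = start + length
    return runs

# Two hypothetical processes, A and B, writing 4-block chunks alternately.
interleaved = [("A", 4), ("B", 4), ("A", 4), ("B", 4)]
layout = lay_down_sequentially(interleaved)

for name, file_extents in sorted(layout.items()):
    print(name, file_extents, "runs:", count_runs(file_extents))
```

    The write stream covers blocks 0 through 15 with no gaps, but each
    file is broken into two runs, which is the read-side cost described
    above.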

    Also, once you get over a track or two's worth of data, it costs about
    the same to seek 3 tracks as it does to seek 10 tracks, so as long as
    writes are not *completely* strewn about due to lots of parallel write
    activity occurring, it shouldn't be a problem.  They won't be, because
    writes are cached in the buffer cache prior to being flushed out.  We
    should get nice long bursts of sequentially ordered data on disk.
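
    The effect of buffering on burst length can be sketched the same
    way. This is a rough illustration in Python under a loudly stated
    assumption: that buffered writes get grouped per file when they are
    flushed. The real buffer cache's flush ordering is not described
    here, and flush_in_bursts, burst_lengths and cache_blocks are
    made-up names for this toy model.

```python
from itertools import groupby

def flush_in_bursts(writes, cache_blocks):
    """Buffer writes until cache_blocks is reached, then flush them
    grouped by file (an assumed policy, not the real buffer cache).

    Returns a list of (file_name, start_offset, length) in on-disk order.
    """
    layout = []
    pending = []
    pending_blocks = 0
    next_free = 0

    def flush():
        nonlocal next_free, pending, pending_blocks
        # sorted() is stable, so each file's writes keep arrival order.
        for name, length in sorted(pending, key=lambda w: w[0]):
            layout.append((name, next_free, length))
            next_free += length
        pending = []
        pending_blocks = 0

    for name, length in writes:
        pending.append((name, length))
        pending_blocks += length
        if pending_blocks >= cache_blocks:
            flush()
    if pending:
        flush()
    return layout

def burst_lengths(layout):
    """Lengths, in blocks, of consecutive same-file bursts on disk."""
    return [sum(length for _, _, length in group)
            for _, group in groupby(layout, key=lambda e: e[0])]

# Two hypothetical writers alternating 1-block writes.
writes = [("A", 1), ("B", 1)] * 4

unbuffered = burst_lengths(flush_in_bursts(writes, cache_blocks=1))
buffered = burst_lengths(flush_in_bursts(writes, cache_blocks=8))
print("unbuffered bursts:", unbuffered)
print("buffered bursts:  ", buffered)
```

    With no effective buffering every burst is a single block; with an
    8-block buffer the same writes land as two 4-block bursts.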

    --

    I don't like to think that I wasted a ton of time building the 
    cluster mechanism, and it's kinda sad to see so much code removed.  But
    most of the work over the last few months has been B-Tree centric,
    implementing the inode cache, high level VOPs, record structures, etc...
    and those parts of the codebase remain intact.

    It really got to the point where implementing the last bits was starting
    to take way, way too much time.  When things start to take that much time
    to do, I know I've made a mistake somewhere in the design.  Better to
    fix it now than to try to slog through the complexity later on.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>