Blogbench RAID benchmarks

Fri Jul 22 09:27:43 PDT 2011

    Ok, after much experimentation I've figured out what is going on.

    First, why is UFS skewed towards writing to the extreme detriment of
    reads while HAMMER is skewed towards reading to the extreme detriment
    of writes?  In short: flushing meta-data out in UFS doesn't require
    as many locks to be held as flushing meta-data out in HAMMER does.

    The issue in UFS can be somewhat controlled by an I/O scheduler
    but it isn't straightforward due to the way disk drives handle write
    I/O's versus read I/O's.  Write I/O's tend to get acknowledged instantly
    by the hard drive up until the point where the hard drive's own ram
    cache fills up with dirty data, and there is no way to gauge and control
    the backlog.  One must also ensure that some hardware protocol tags are
    reserved for reading and some are reserved for writing so read I/O isn't
    able to completely stall out write I/O or vice versa.  DragonFly does
    this in its CAM layer (I don't know about FreeBSD; it is something I
    added recently).  It's very difficult to control write bandwidth in an
    I/O scheduler without simulating/calculating probable seek times for
    random vs linear write I/O.
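
    A minimal sketch of the tag reservation idea, in C.  This is not the
    DragonFly CAM code: the structure, field and function names are all
    hypothetical, and it assumes the caller already holds whatever lock
    protects the pool.  Each direction is capped so that it can never
    consume the tags set aside for the other direction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tag pool; none of these names come from CAM. */
struct tag_pool {
    int total;        /* tags the device accepts at once */
    int rd_reserved;  /* tags that writes may never consume */
    int wr_reserved;  /* tags that reads may never consume */
    int rd_inflight;  /* reads currently holding a tag */
    int wr_inflight;  /* writes currently holding a tag */
};

/* Try to take a tag for one I/O.  Returns false if the I/O must wait. */
static bool
tag_acquire(struct tag_pool *tp, bool is_write)
{
    assert(tp->rd_reserved + tp->wr_reserved <= tp->total);

    /* No free tags at all. */
    if (tp->rd_inflight + tp->wr_inflight >= tp->total)
        return false;

    if (is_write) {
        /* Writes may not eat into the read reservation. */
        if (tp->wr_inflight >= tp->total - tp->rd_reserved)
            return false;
        tp->wr_inflight++;
    } else {
        /* Reads may not eat into the write reservation. */
        if (tp->rd_inflight >= tp->total - tp->wr_reserved)
            return false;
        tp->rd_inflight++;
    }
    return true;
}

/* Give the tag back when the I/O completes. */
static void
tag_release(struct tag_pool *tp, bool is_write)
{
    if (is_write) {
        assert(tp->wr_inflight > 0);
        tp->wr_inflight--;
    } else {
        assert(tp->rd_inflight > 0);
        tp->rd_inflight--;
    }
}
```

    With 8 tags and 2 reserved in each direction, a burst of writes can
    hold at most 6 tags, so a read arriving behind the burst can still be
    issued to the drive.  This does nothing about the drive's own write
    cache filling up, which is the harder half of the problem.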

    For HAMMER the problem is that HAMMER's flusher threads are constantly
    getting stalled out by B-Tree locks being held by the ~100 reader
    threads (in the blogbench test).  Fixing this in HAMMER cannot be done
    in the I/O scheduler, because stalling out read I/O's in the I/O
    scheduler (in order to try to make more bw available for writing)
    will simply cause the related B-Tree locks to be held even longer and
    cause write activity to actually go down.  The fix has to be in HAMMER
    itself.  NOTE: I cannot solve this by giving the flusher's exclusive
    locks priority over the frontend's shared locks without creating
    major 3-thread deadlock chains, and using exclusive locks in the readers
    results in reduced read concurrency.

    --

    So, I am going to commit some experimental code to HAMMER which tries
    to manage the locking conflicts between the frontend reader threads and
    the backend flusher threads.  I am going to do this by creating a
    pulse-width modulated time-domain multiplexer in HAMMER which tries
    to 'slot in' reads and writes based on the number of inodes backlogged
    in the flusher.

    Basically the idea of using a PWM is this:  You take a fixed period of
    time, say 1/5 of a second:

    [----------------------------------------------]

    You allot a portion of the time slice to the backend flusher and the
    remainder to the frontend.

    [wwwwwwrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr]	Flusher lightly loaded
    [wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwrrrrrrr]	Flusher heavily loaded

    Read I/O operations in a heavily loaded system can stall for much
    more than 1/5 of a second.  Even so, forcing read operations to
    delay a certain number of ticks before being initiated gives the
    flusher a chance to win locking conflicts, and thus the flusher is
    able to gain performance over the frontend reads.
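
    The time slicing described above can be sketched roughly as follows.
    This is not the code committed to HAMMER.  The names, the linear
    scaling of the write slot with the flusher backlog, and the min/max
    clamps are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical PWM parameters; not taken from the HAMMER source. */
struct pwm {
    int period_ticks;   /* one full period, e.g. hz / 5 for 1/5 second */
    int min_write;      /* write slot when the flusher is idle, in ticks */
    int max_write;      /* write slot when the flusher is saturated */
    int backlog_limit;  /* inode backlog that earns the full max_write */
};

/* Width of the flusher's slot for a given number of backlogged inodes. */
static int
pwm_write_ticks(const struct pwm *p, int backlog)
{
    assert(p->period_ticks > 0 && p->backlog_limit > 0);
    assert(p->min_write >= 0 && p->min_write <= p->max_write);
    assert(p->max_write <= p->period_ticks);

    if (backlog < 0)
        backlog = 0;
    if (backlog > p->backlog_limit)
        backlog = p->backlog_limit;

    /* Scale linearly between min_write and max_write (an assumption). */
    return p->min_write +
        (int)((long)(p->max_write - p->min_write) * backlog /
              p->backlog_limit);
}

/*
 * Ticks a frontend read arriving at time 'now' should wait before it is
 * initiated: the rest of the write slot if it lands inside it, else 0.
 */
static int
pwm_read_delay(const struct pwm *p, int now, int backlog)
{
    int phase, wticks;

    assert(now >= 0);
    phase = now % p->period_ticks;
    wticks = pwm_write_ticks(p, backlog);
    return (phase < wticks) ? wticks - phase : 0;
}
```

    With a 20-tick period, a write slot between 2 and 18 ticks and a
    backlog at half of backlog_limit, the write slot is 10 ticks wide: a
    read arriving at phase 3 waits 7 ticks, while a read arriving at phase
    15 starts immediately.  Reads are only delayed, never refused, so the
    flusher gets a window in which it is more likely to win the B-Tree
    locks.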

    --

    This change isn't just to help blogbench out.  It also appears to solve
    some major issues with namecache stalls that occur when HAMMER is
    heavily write-loaded, and issues with things like vi ':wq' operations
    (which fsync()) seem to be improved.  My commit message also mentions
    it helping with 'ls' and 'find' but I think the 'ls' and 'find' issue
    needs a bit more work.

    The effect on the blogbench tests is basically to improve write
    performance a little at the cost of read performance.  This tradeoff
    is due to hard drive seek times and is unavoidable.

    For blogbench in stage 2, after the system caches are blown out.
    Approximate values only.  R articles vs W articles.

                        read    write
    UFS:                600     4000    (freebsd)
    HAMMER BEFORE:      20000   50      (dragonfly)
    HAMMER AFTER:       2500    150     (dragonfly) <-- this is an improvement
                                                        even though it may
                                                        not seem that way.


    As you can see HAMMER still prioritizes reads, and that is precisely
    what I want to have happen... reads are far more important than writes.
    We don't want writes to stall out completely but neither do we want
    writes to be able to stall reads out completely.  In the blogbench test
    one basically has ~100 threads issuing random reads, but the read issued
    by each thread is for a whole file and is thus linear.  In other words,
    increasing the write activity by a little decreases the disk bandwidth
    (due to spindles/seeks) by a lot.
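
    A back-of-the-envelope model of why this happens, as a small C
    function.  The model (one seek per linear run, then a transfer at the
    drive's linear rate) and every number used with it are illustrative
    assumptions, not measurements from these tests.

```c
#include <assert.h>

/*
 * Effective throughput in MB/s when the drive has to seek once for
 * every 'run_kb' kilobytes it transfers linearly.  Simplified model:
 * time per run = seek time + transfer time at the linear rate.
 */
static double
effective_mbps(double linear_mbps, double seek_ms, double run_kb)
{
    assert(linear_mbps > 0.0 && seek_ms >= 0.0 && run_kb > 0.0);

    double run_mb = run_kb / 1024.0;
    double seconds = seek_ms / 1000.0 + run_mb / linear_mbps;

    return run_mb / seconds;
}
```

    Assuming a 100 MB/s linear rate and a 10 ms seek, 1 MB linear runs
    yield about 50 MB/s, while 64 KB runs yield only about 5.9 MB/s.  Any
    extra write activity that breaks the readers' linear runs into shorter
    pieces pushes the drive toward the second case.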

    So, Francois, let's see how the stuff I committed works out.  You need
    to remove the temporary patches I forwarded to you on IRC.  What I
    committed is the final version.  I dunno if the graphs will look any
    better since they are so badly skewed towards the pre-system-cache-
    blowout numbers, but things should run more smoothly.

						-Matt