SMP (Was: Why did you choose DragonFly?)
Matthew Dillon
dillon at apollo.backplane.com
Tue Sep 21 10:50:42 PDT 2010
:That explains the noticeable performance difference just logging in... I
:always just thought avalon was getting used for something else I didn't
:know about...
Yah, the bulk build runs Avalon out of memory faster than it can swap
pages out because the bulk build is also loading the disk heavily
with reads. The pageout daemon just can't retire the data quickly
enough. That causes the VM system to stall on low real memory
for a few seconds every so often while the bulk build is running.
It shows the very real limitations of a single disk drive when no
swapcache/SSD is present.
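    (If you want to watch that kind of memory pressure on your own box
    the standard tools are enough -- roughly something like this, exact
    flags may vary:

	vmstat 1            # free pages, page-out rate, faults
	systat -vmstat 1    # same data, full-screen view
	swapinfo            # how much swap is actually in use

    During one of those stalls you'd expect to see free memory pinned
    near its minimum while the pageout rate can't keep up with the
    build's dirty pages.)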
Pkgbox64 does exactly the same bulk build as Avalon, with exactly
the same single-drive setup for the filesystem, but also has a 40G
SSD stuffed in, with swapcache enabled.  Ok, Pkgbox64 also has 4G of ram
instead of 3G (since it's running 64-bit), but that isn't why it
performs better. It performs better because the swapcache offloads
100% of the main disk's meta-data and the swap-based TMPFS is
entirely in the SSD.
Single-drive limitations are still present, but pushed way out on the
performance curve. The key fact here is that the SSD doesn't need
to be very large.  It's barely 40G (versus the 750G main disk) and yet
has a huge positive effect on the machine's performance.
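    (For anyone who hasn't tried it, the setup is basically just swap on
    the SSD plus a couple of sysctls.  A rough sketch -- the device name
    here is made up, and swapcache(8) is the authoritative reference for
    the knob names:

	# /etc/fstab: swap partition on the SSD, plus a swap-backed tmpfs
	/dev/da1s1b   none     swap    sw   0  0
	tmpfs         /build   tmpfs   rw   0  0

	# /etc/sysctl.conf: enable swapcache reads and meta-data caching
	vm.swapcache.read_enable=1
	vm.swapcache.meta_enable=1

    Data caching can be turned on as well, but the meta-data caching is
    where most of the win described above comes from.)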
:Is there benchmarks around for swapcache? i.e. same hardware, same
:software task, with and without swapcache?
:
Hmm. It's a bit hard to benchmark a machine under that sort of load
but I could do some blogbench tests.  It comes down to the filesystem
meta-data essentially only having to be read from disk once; from then
on, until the machine reboots, it is available in either system ram or
the swapcache SSD regardless of what else is going on in the system.
On the practical side the swapcache does not allow continuous
high-bandwidth writing to the SSD, since the SSD has to last a
reasonable period of time (10 years) before wearing out.  That
works well in real life but benchmarks compress the time scale so
for the benchmark to be accurate the write bandwidth limitations have
to be turned off.
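    (The rate limiting lives in the vm.swapcache burst sysctls --
    roughly an accumulation rate plus a burst cap.  From memory it
    looks something like this; check swapcache(8) for the real names
    and sane values:

	sysctl vm.swapcache.accrate    # long-term average write rate to the SSD
	sysctl vm.swapcache.maxburst   # cap on the short-term write burst
	sysctl vm.swapcache.curburst   # burst allowance accumulated so far

	# for a benchmark you'd crank the rate way up so the write
	# limiter doesn't dominate the results, e.g.
	sysctl vm.swapcache.accrate=100000000

    In normal operation the defaults are what keep the SSD from being
    written to death.)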
For example, on leaf, once the meta-data is read once after boot,
things like 'find' on local disk run very fast until the next reboot.
A find using meta-data cached in memory can run around 82000+ files
per second at the moment. With a data set large enough that main
memory cannot hold the meta-data, a find running through the
swapcached data on the SSD can run around 53000 fps.
leaf:/root# /usr/bin/time find /build | wc -l
32.81 real 2.57 user 28.69 sys
2701733 <---- purely from ram 82344 fps
leaf:/root# /usr/bin/time find /build /home | wc -l
89.13 real 4.49 user 64.22 sys
4775413 <---- doesn't fit in ram, SSD used 53578 fps
(fresh reboot, swapcache disabled)
leaf:/root# /usr/bin/time find /build /home | wc -l
916.00 real 5.92 user 62.48 sys
4775170 5213 fps
(repeat, swapcache disabled.. now depending on ram caching)
leaf:/root# /usr/bin/time find /build /home | wc -l
449.39 real 5.06 user 59.16 sys
4775175 10625 fps
(repeat, third run, swapcache disabled)
leaf:/root# /usr/bin/time find /build /home | wc -l
402.09 real 5.30 user 60.78 sys
4775177 11875 fps
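    (For reference, the fps figures are just the name count divided by
    wall-clock time:

	2701733 names /  32.81 sec  ~=  82,000/sec   (all from ram)
	4775413 names /  89.13 sec  ~=  53,600/sec   (ram + SSD swapcache)
	4775170 names / 916.00 sec  ~=   5,200/sec   (cold, swapcache disabled)
    )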
So the grand result in the case of leaf is that a nominally running
system with swapcache can do directory operations 5 times faster.
Short-term caching of smaller directory subsets will of course run
at full ram bandwidth, but on a machine like leaf there are always
a few things going on and it often just takes leaving your xterm for
a few minutes before your cached meta-data starts getting thrown
away.  Someone working on the machine, doing regular git pulls or
source tree searches, will also be regularly annoyed without
swapcache.
As you can see there is a huge difference in nominal name lookup
performance with a swapcache/SSD present when the filesystem(s)
are large enough such that normal ram caching is unable to hold
the data set.
Even in smaller systems where the filesystems are not so large
both normal and overnight activities (using firefox, overnight
locate.db, etc) tend to blow away what meta-data might have been
cached previously, not to mention cause active but idle programs
to get paged out. Even a smaller system such as a workstation can
seriously benefit from a swapcache/SSD setup.
Similarly when one is talking about a server running web services,
rsync services, mail, etc... those services tend to have large
meta-data footprints. rsync will scan the entire directory tree even for
incremental syncs. git clients and cvs servers and clients are also
heavy meta-data users. Someone running a large mail server can wind
up with a huge backlog of mail queue files. Swapcache greatly improves
the sustainable performance of those services.
-Matt
Matthew Dillon
<dillon at backplane.com>