SMP (Was: Why did you choose DragonFly?)

Matthew Dillon dillon at apollo.backplane.com
Mon Sep 20 19:18:07 PDT 2010


    I think our ability to advertise our features has been a bit lacking.
    We're programmers more than we are salesmen.

    Take device serial numbers in devfs for example.  A simple feature that
    gives one a guaranteed device path to access a physical hard drive,
    no matter where it is mounted.  Soft-labeling schemes such as those LVM
    or GEOM use have their place (and of course we support LVM's labeling now),
    but personally speaking I don't think anything could be simpler than
    simply accessing the device by its serial number, and the non-duplication
    guarantee is important in a world where one can 'dd' partitions back
    and forth almost at will.  But mostly it's the sheer simplicity of
    stuffing the serial number in /etc/fstab, /boot/loader.conf, and
    /etc/rc.conf, and then not caring how the drive is actually attached
    to the system, that makes the feature worth its weight in gold.
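    To make that concrete, here is a hypothetical /etc/fstab fragment using
    devfs serial-number paths.  The serial number below is made up; on a real
    system 'ls /dev/serno' shows the actual entries.

```
# Reference the disk by serial number instead of ad0/da0, so the
# entries keep working no matter which port the drive is attached to.
# (Serial number and partition letters here are examples only.)
/dev/serno/WD-WCAU43257491.s1a    /boot    ufs       rw    1 1
/dev/serno/WD-WCAU43257491.s1d    /        hammer    rw    1 1
```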

    Implementation details often get lost in the noise as well.  The coding
    model we use for SMP is far, far less complex than the coding model
    other OSs use, and just as fast.  It doesn't just make coding easier,
    it makes maintaining the codebase easier and the result tends to be
    more stable in the long run IMHO.  We don't have any significant
    cross-subsystem pollution.  We don't have a serious problem with
    deadlocks (because tokens CAN'T deadlock).  Problems tend to be
    localized.  Our LWKT token abstraction is a very big deal in more
    ways than one.
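    For the curious, token-serialized kernel code looks roughly like the
    sketch below.  lwkt_gettoken()/lwkt_reltoken() are the real entry points
    from sys/thread.h; the token and the function around them are invented
    for illustration only.

```c
/*
 * Hypothetical sketch of the LWKT token coding model (kernel code,
 * not a standalone program).  Unlike a lock, a held token is released
 * automatically whenever the thread blocks or is switched out, and
 * reacquired before the thread resumes.  That is why tokens cannot
 * deadlock against each other the way nested locks can.
 */
#include <sys/thread.h>

static struct lwkt_token frob_token;   /* init with lwkt_token_init() */

static void
frob_subsystem(void)
{
        lwkt_gettoken(&frob_token);
        /* ... access the subsystem's structures, serialized ... */
        lwkt_reltoken(&frob_token);
}
```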

    It is hard to describe the level of integration some of our
    reimplemented subsystems have and the additional capabilities they
    provide.  When was the last time you as a DragonFly user worried
    about NULLFS ever not doing the right thing, even from inside a NFS
    mount?  It just works, so well that the base install depends on it
    heavily for HAMMER PFS mounts.  Or the namecache... as invisible a
    subsystem to the end-user as it is possible to get, yet our
    reengineered namecache makes things like 'fstat' work incredibly well,
    able to report the actual full file paths for open file descriptors.
    Or VN for that matter.  It just works.

    This stuff doesn't 'just work' in other OSs.  There are aliasing
    problems, system resource issues, directory recursion issues and
    limitations related to recursive mount points, the potential for
    system deadlocks, and numerous other issues.

    We are lacking in a few areas, but frankly I don't consider other
    open systems to be all that much ahead of us.  People talk up soft
    mirroring and soft raid all the time (and I want to get them into DFly
    too), but I have yet to see an implementation on an open system which
    actually has any significant robustness.  A friend of mine has a
    hot fail-over setup running on Linux which works fine up until the
    point something goes wrong, or he makes a mistake.  Then it is history.
    At best current setups save you from an instant crash but you have
    to swap out drives and reboot anyway if you want to be safe.  At
    worst something goes wrong and the system decides to start rebuilding
    a 2TB drive, which itself takes 2 days, and you can say goodbye to your
    high-performance production system in the meantime (or let it copy more
    slowly over 5 days instead of 2, which is just as bad).  Or
    the setup requires the use of a complex set of subsystems which
    are as likely to blow up when a disk detaches as they are to return
    the BIOs with an EIO.

    I want these features in DragonFly, but I want them done right.

    Another similar example would be crypted disks.  Alex recently brought 
    in cryptsetup along with LVM and spent a good deal of time getting all
    the crypto algorithms working.  But what's the point if the in-kernel
    software crypto is only single-threaded on your SMP system?  Apparently
    the expectation was that one would have to buy a crypto card or a
    motherboard with a built-in crypto accelerator.  So I went and fixed our
    in-kernel
    software crypto and now our crypted disk implementation runs almost
    as fast as unencrypted on a quad cpu box.  THAT I would consider
    production-viable.  What we originally inherited I would not.
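    For anyone who hasn't tried it, the setup is a short command sketch
    along these lines.  The device names and label are examples only; see
    the cryptsetup man page for the real options.

```
# Hypothetical crypted-disk setup (device names are examples only).
cryptsetup luksFormat /dev/serno/EXAMPLE.s1d       # write the LUKS header
cryptsetup luksOpen   /dev/serno/EXAMPLE.s1d secret # map /dev/mapper/secret
newfs_hammer -L SECRET /dev/mapper/secret           # filesystem on the mapping
```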

    --

    In terms of differentiation from other systems HAMMER is a very big
    deal.  I think in many ways HAMMER has saved the project from the
    userbase-bleeding effect that is often endemic with projects like ours.
    I wish I had done it earlier.  Once someone starts using HAMMER they
    find it really difficult to move away to anything else.  If there's
    an issue at all it is simply that people looking in from the outside
    have no idea just how flexible HAMMER's fine-grained history and
    quasi-real-time mirroring/backup/retention capabilities are.

    Swapcache is also underrated.  And the use of swap as well.  Swap
    has gone out of favor in recent years as the gulf between cpu/memory
    and disk performance has widened.  But as storage densities continue
    to rise into the multi-terabyte range even for entry-level consumer
    systems, even throwing in a ton of ram isn't enough to cache the
    active 'overnight' dataset.  find/locatedb, web services, large
    repositories, rsyncs, you name it.  They all take their toll.

    Swapcache pretty much fixes that whole mess at the cost of a small
    40-80G SSD.  $100-$200.  Not only does it 'refresh' older systems
    with less ram by making swap viable again, it also caches filesystem
    meta-data generally across the whole system and is capable of caching
    file data as well, making it extremely useful even on well-endowed
    systems.  The 'overnight' meta-data set will easily fit in a 40G SSD.
    And swapcache is designed with SSDs in mind.  It clusters large I/Os
    and has very low write multiplication effects when used with a SSD.
    Normal filesystems tend to have larger write multiplication effects
    and wear the SSD out faster.
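    A minimal swapcache setup is just a swap partition on the SSD plus a
    couple of sysctls, something like the fragment below.  The serial number
    is a placeholder; the knobs are documented in swapcache(8).

```
# /etc/fstab: put swap on the SSD (serial number is a placeholder)
/dev/serno/SSDSERIAL.s1b    none    swap    sw    0 0

# /etc/sysctl.conf: turn swapcache on
vm.swapcache.meta_enable=1    # cache filesystem meta-data on the SSD
vm.swapcache.data_enable=1    # optionally cache file data as well
```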

    TMPFS becomes more viable with swapcache too.  We have two bulk
    pkgsrc building boxes.  More than two even, but two that I regularly
    use.  One has a swapcache SSD (Pkgbox64) and one does not (Avalon).
    There is a gulf of difference in overall system performance between
    the two.  Without TMPFS/swapcache the system requirements would be much
    higher; we would even need more disk spindles for reasonable performance.
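    The TMPFS side of this is a one-line mount; dirty tmpfs pages overflow
    to swap, where swapcache moves them onto the SSD.  The mount point below
    is an example only.

```
# /etc/fstab: build work areas on TMPFS (mount point is an example)
tmpfs    /usr/obj    tmpfs    rw    0 0
```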

    Nearly all of DragonFly's own production systems use swapcache now.
    Only Avalon sitting in its remote colo facility doesn't have a small
    SSD swapcache setup.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>




