Software Transactional Memory?

Wed Dec 28 10:43:26 PST 2005

:Someone used a "transactions system" that is found in modern databases to 
:handle SMP. Quoting:
:..

    I did a quick browse of the paper.  I believe the concept is quite valid.
    In fact, I am a big fan of opportunistic locking (which they are using, or
    at least partially using).  What they are describing would not be useful
    in all situations considering that there is still a lock involved.  The
    pure DragonFly model is to avoid having to use the lock in the first
    place, but even so I recognize the fact that there are still planty of
    cases where we don't have a choice.

    I would argue that a pure opportunistic model might be an even better
    solution then the one described in the paper.  With a pure opportunistic
    model the structures are basically just guarenteed not to go away from
    the point of view of SMP access and a transaction involves creating a
    new copy of the structure containing the changes.  Synchronization
    occurs at the very end of the transaction and either gets a thumbs up
    or a thumbs down.  If it succeeds, the new structures replace the old
    by changing the originating pointer to the structure.  The old structure
    remains around only as long as is necessary to satisfy other accessors
    (who will then fail when they try to commit their transaction).

    Clearly this model is not useful in all circumstances.  To be efficient
    the structures involved have to be fairly small.  So, for example,
    the model would work for a route table entry or a VM page, but would
    not work for TCP state.

    Another model that works very well is a replication model, which is
    what Jeff's route table code does (that is going in after the release).
    With a replication model the data is physically replicated on each
    cpu involved in accessing or modifying it, giving lock-free read-only
    access.  Modifications to the data are then chained through the cpu's
    involved (that is, modifications are made locally in the context of
    each cpu).  A simple 9 nS critical section is all that is needed.

    But the best model, in my view, is one that is inherently contention-free.
    Our TCP implementation is a good example of this.  Only one cpu manages
    any particular connection so there is no contention and no locking
    required at all.  The trade-off is that one winds up doing more context
    switches but this also has the effect of concentrating the bad news in
    one place where it can be optimized well.

    This is what the focus of my BayLisa talk was... getting actual hard
    numbers from some of our algorithms:

	http://www.dragonflybsd.org/docs/LISA200512/

    What I found, basically, was that overheads for certain MP operations 
    were far lower then I originally thought they would be, and that it is
    possible for algorithms to naturally aggregate operations into very
    efficient batches to not only reduce or remove the context switching
    overhead from the performance equation, but also to take maximum advantage
    of the cpu's code and data caches.  

    If you look at the end of the slide show there are some numbers from
    naturally aggregated work initiated by received packets (multiple packets
    per interrupt) and flowing through to the TCP stack running lock-free 
    on another cpu.

    Earlier in the slide show there are slides showing the positive cache
    effects of work batching, such as when batching multiple cross-cpu
    free() operations.  There are actually two slides.  The 'take 1' slide
    had bad numbers for the target cpu due to KTR logging overhead.  The
    'take 2' slide has correct numbers (after I removed the extra arguments
    from the KTR calls I was making).

						-Matt