Lockless algorithms [was Re: splxxx replacements?]

Bill Huey (hui) billh at gnuppy.monkey.org
Fri Nov 17 12:56:45 PST 2006


On Fri, Nov 17, 2006 at 09:04:53AM -0800, Matthew Dillon wrote:
>     The route table work is rather significant, though it won't really
>     shine until Giant is removed from the protocol threads.  Basically
>     the route table is now replicated across all cpus and thus can be 
>     accessed by any given cpu with only a critical section for 
>     protection.
> 
>     We still use tokens in several major areas, but we have been shifting
>     the ones that were only being used to lock short codepaths over to
>     spinlocks.

[lock categories]

> 
>     RCU is interesting but I personally prefer replication and
>     cpu-localization instead of RCU.  So far nothing has come up that really
>     needs an RCU implementation.  RCU suffers from a level of complexity
>     that makes it somewhat difficult to understand the code using it, and
>     I don't like that aspect of it.

RCU is a very specific and powerful algorithm that is used in place of traditional
reader/writer locks in the Linux kernel. Many of the algorithms there (hash tables
and the like) are highly parallelized because of it. The algorithm depends on
memory barriers to guarantee a certain kind of ordering with regard to data
structure pointers; read and write barriers guarantee that ordering down to the
specific cache line. Replication has a different semantic from what I understand,
and RCU shouldn't be dismissed.
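
For anyone not steeped in it, the usual pattern looks roughly like this (just a
sketch from memory using the stock Linux primitives, with the struct and variable
names made up for illustration):

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
    int a;
};

static struct foo *gp;                  /* shared pointer protected by RCU */
static DEFINE_SPINLOCK(gp_lock);        /* serializes updaters only */

/* read side: no lock at all, just a critical-section marker */
int read_foo(void)
{
    struct foo *p;
    int val = -1;

    rcu_read_lock();
    p = rcu_dereference(gp);            /* ordered load of the pointer */
    if (p)
        val = p->a;
    rcu_read_unlock();
    return val;
}

/* write side: publish a new copy, free the old one after a grace period */
void update_foo(int a)
{
    struct foo *new, *old;

    new = kmalloc(sizeof(*new), GFP_KERNEL);
    if (!new)
        return;
    new->a = a;

    spin_lock(&gp_lock);
    old = gp;
    rcu_assign_pointer(gp, new);        /* write barrier before publication */
    spin_unlock(&gp_lock);

    synchronize_rcu();                  /* wait for pre-existing readers */
    kfree(old);
}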

You should know that a paper was presented at OLS 2006 by Nick Piggin describing
a lockless page cache that uses RCU and other lockless techniques to get a highly
parallelized page cache. They get a nearly linear increase in performance across
the CPU set they tested, which is an impressive feat. I don't see how something
like (data ?) replication can handle that case. RCU read-side acquires are down
in the nanoseconds, far faster than an atomic operation.

The traditionally locked page cache basically gets clogged at about 3 CPUs, but
with the lockless version the increase in performance stays nearly linear as
CPUs are added.

	http://www.linuxsymposium.org/2006/linuxsymposium_procv2.pdf

RCU has different data guarantees than a traditionally locked data structure, so
the user of the data via the API must have a means of dealing with that, but it
will definitely deliver some serious performance on shared memory systems if you
do. The Linux directory cache also uses RCU, apparently because it had been a
performance problem. There are many examples in Linux regarding RCU, and it would
be a good thing to look at for ideas.
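
As a rough sketch of what that means for the consumer of such an API (again, the
names are illustrative and this is not code lifted from the Linux tree): lookups
run with no lock at all and can race with removal, so the free is deferred and
callers have to cope with entries that are already logically dead:

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct entry {
    int                 key;
    int                 value;
    int                 dead;           /* set once the entry is unlinked */
    struct list_head    node;
    struct rcu_head     rcu;
};

static LIST_HEAD(table_list);
static DEFINE_SPINLOCK(table_lock);

static void free_entry(struct rcu_head *head)
{
    kfree(container_of(head, struct entry, rcu));
}

/* removal: unlink now, but free only after a grace period */
void remove_entry(struct entry *e)
{
    spin_lock(&table_lock);
    list_del_rcu(&e->node);             /* readers may still see it briefly */
    e->dead = 1;
    spin_unlock(&table_lock);
    call_rcu(&e->rcu, free_entry);      /* deferred free */
}

/* lookup: runs locklessly; must tolerate concurrently removed entries */
int lookup_value(int key, int *value)
{
    struct entry *e;
    int found = 0;

    rcu_read_lock();
    list_for_each_entry_rcu(e, &table_list, node) {
        if (e->key == key && !e->dead) {
            *value = e->value;          /* only safe inside the read section */
            found = 1;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}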
 
There are patent issues and the GPL license to consider, but this is just too
powerful an algorithm to ignore. In many ways it brings out the ultimate in what
a shared memory system can do.

>     The use of tokens in the system is primarily being relegated to situations
>     that actually need the above characteristics.  A very good example of
>     this is the traversal of system structures such as the system mountlist.
>     An example of incorrect use would be in, say, kern_objcache.c, where we
>     really ought to be using a spinlock there instead.

Ok, so you're still using tokens for larger subsystems that have long execution
paths, if I understand you correctly?

One of the claims about dfBSD that I found interesting was that your use of
tokens to break up long kernel paths was an incremental way of making the kernel
MP-safe, in contrast to completely removing Giant from the lock path of a kernel
syscall. The question I've been wondering about is whether tokens are living up
to that claim.

That's really the main question I have about using them as a general mechanism
for MP work in a legacy kernel (I'm thinking about using something similar for
another system that is already using a code path protection scheme).
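
Just so we're talking about the same semantic, here is roughly what I have in
mind; the function and type names below are made up for illustration and are not
the actual DragonFly token API:

struct mytoken;
struct mylist;
struct myitem;

/* hypothetical primitives, declared only to make the sketch self-contained */
void token_acquire(struct mytoken *tok);
void token_release(struct mytoken *tok);
struct myitem *list_first_item(struct mylist *list);
void process_item(struct myitem *item);  /* may block; takes item off the list */

void scan_list(struct mylist *list, struct mytoken *tok)
{
    struct myitem *item;

    token_acquire(tok);     /* serializes this path only while we are running */
    while ((item = list_first_item(list)) != NULL) {
        /*
         * If process_item() blocks, the token is conceptually released
         * and re-acquired when the thread resumes, so unlike a mutex
         * nothing is held across the sleep and the list state has to be
         * treated as stale after every potential blocking point.
         */
        process_item(item);
    }
    token_release(tok);
}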

bill





