git: kernel - Initial commit for Existential structural tracking

Michael Neumann mneumann at ntecs.de
Fri Oct 16 02:24:51 PDT 2020


On Wed, Oct 14, 2020 at 11:13:23AM -0900, Matthew Dillon wrote:
> So while cache line ping-ponging has been a known issue for decades, it
> didn't actually impact performance all that much compared to other issues.
> Primarily lock stalls, expensive data topologies, slower CPUs, lower IPC
> throughput, slower memory subsystems, and fewer cores.
> 
> In modern day, the lock stalls are basically gone, and this has brought
> cache line ping-ponging to the forefront.  Also in modern day, the number
> of cpu cores (as well as hardware topologies for many-core systems) has
> greatly increased.  IPC throughput is far higher, cpus are far faster, but
> memory subsystems are still relatively slow.   *ANYTHING* which disrupts
> the instruction pipeline in a modern cpu seriously impacts performance in
> ways that it just didn't in older cpus.  Even cpus as recently as just a
> decade ago were not impacted nearly as badly.
> 
> Also, *anything* that causes a cache line ping-pong can seriously impact
> performance on a modern cpu.  It does NOT have to be a locked operation...
> it doesn't matter whether it is an atomic op or not.  If you modify data on
> a cache line from one core, then access it from another core, it will
> degrade performance.   So it isn't just locking that has/had to go through
> major improvements, but also data structures.

Is the degraded performance coming from the cache misses that occur on the
other cores once a cache line has been modified/invalidated, or are there
other effects at play, like cache-invalidation traffic? Are they still using
MOESI for cache coherency and snooping on the "memory bus"?
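
Just to be sure I picture the non-atomic case correctly, here is a toy sketch
(made-up names, nothing DragonFly-specific): two counters that are each only
written by one core, but happen to live on the same 64-byte line, will still
bounce that line between the cores, while padding them onto separate lines
avoids it.

    /*
     * Toy illustration only, not kernel code: 'a' is only written by
     * core 0 and 'b' only by core 1, yet because they share one
     * 64-byte cache line, every write forces the line to change
     * ownership between the two cores.
     */
    #include <stdint.h>

    struct counters_pingpong {
        uint64_t a;     /* written by core 0 */
        uint64_t b;     /* written by core 1, same cache line as 'a' */
    };

    /* Aligning each counter to its own cache line removes the false sharing. */
    struct counters_padded {
        uint64_t a __attribute__((aligned(64)));
        uint64_t b __attribute__((aligned(64)));
    };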

> A good example of this would be the 'reference count' field found in many
> structures.  Just having to increment and decrement that field... for
> example, when a VM page is heavily shared or a VNODE is heavily shared, can
> seriously degrade performance.  It takes decades to adjust the algorithms
> across the entire kernel to get rid of the old algorithms and replace them
> with new ones.

Maybe one should rethink the concept of sharing pages or sharing memory
altogether, especially if that memory is mutated, and instead invest even more
into message passing and shared-nothing architectures. Of course, Unix requires
sharing for an efficient fork and for shared libraries, but then, these
concepts are probably also the ugliest ones to implement and use correctly.
And of course we can't change the applications running on top of the kernel :/.

I wonder if it would make sense to work towards giving up the 1:1 relationship
between a user-level process and its related kernel thread that executes the
system call, and instead let syscalls send messages to specific kernel threads
that then process them (in some cases, I assume DragonFly is already doing that
in the syscall handler). Async syscalls, if you will, with some integration
into the scheduler to make them appear blocking to the calling process.
Similar to how io_uring in Linux works, if I understand that correctly.
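
Very roughly what I have in mind, as a data-structure sketch (all names are
made up; this is neither DragonFly's syscall path nor io_uring's actual ABI):

    /*
     * Hypothetical sketch: a syscall becomes a message that user space
     * enqueues on a submission ring; a dedicated kernel thread dequeues
     * it, runs the handler and posts the result back.  The scheduler
     * parks the caller until the matching completion arrives, so the
     * call still appears blocking to the process.
     */
    #include <stdint.h>

    struct syscall_msg {
        uint32_t sm_number;     /* syscall number */
        uint32_t sm_cookie;     /* matches a completion to its caller */
        uint64_t sm_args[6];    /* raw syscall arguments */
        int64_t  sm_result;     /* filled in by the kernel thread */
    };

    struct syscall_ring {
        uint32_t           sr_head;         /* consumer (kernel) index */
        uint32_t           sr_tail;         /* producer (user) index */
        struct syscall_msg sr_msgs[256];    /* power-of-two sized queue */
    };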

I know that you and others (e.g. sephe) did a lot of work on using
message-passing approaches for certain subsystems in the kernel to make them
scale better and to make them correct. That took a decade, but it was really
worth it! For the remaining subsystems it's probably very difficult to
accomplish, and in the case of sharing pages, that's something indirectly
dictated by the applications via the use of POSIX APIs.

As a side note, the Mill CPU allows memory regions to be shared safely between
protection domains (on the same core) and as such gets rid of any need for
shared libraries, which, by the way, lead to rather insecure applications, as a
lot of security-relevant code runs in the same process. The Mill has a global
address space and does away with having the TLBs in the hot path *before* the
actual caches. Caches in the Mill are virtually addressed and the
virtual->physical address translation is done right before a miss hits the slow
memory, so it doesn't matter if the TLB is slow, too. That saves a lot of power
and silicon. Instead it has "Protection" lookaside buffers (PLBs) that operate
in parallel with the cache access; AFAIK they are much smaller than a
traditional TLB and, IIRC, due to speculation, not in the hot path anyway.

> So in modern times, even a simple operation such as ++fubar->refs (atomic
> or not) can represent a problem if it is used in a data structure that is
> heavily shared AND ALSO in critical code paths.  Let alone
> atomic-compare-and-set or atomic-swap ... they are all 'bad' in terms of
> causing cache mastership changes.
> 
> This prompted Intel to try to introduce transactional memory (really
> transactional memory buffers in the cpu cores, not actual transactional
> memory).  But it basically failed as a concept.  It was developed in the
> Haswell era when processors and algorithms were a lot slower.  In modern
> times processors and algorithms are now much faster and Intel's
> transactional instructions basically don't improve performance (they
> actually slow things down) in most of the use cases that we would want to
> use them in.

After having seen what they do with the Mill CPU, I strongly believe that the
whole OoO approach to computing is a dead end. All they do is add hacks on top
of hacks. Even worse, they still make microkernels suck, which affects OS
design from the ground up, as well as the way we still structure our
applications into shared libraries instead of a more service-oriented approach
(the OpenBSD guys are doing a lot of privilege-separation work, but that's not
something that comes for free or always looks nice).

> --
> 
> So cache line ping-ponging is far more important a problem now than it was
> in prior years.  Fixing it essentially requires finding ways to work with
> data structures without *ANY* modifying memory ops in the critical path.

I see. Lockless algorithms alone just don't cut it, as they don't try to avoid
or minimize mutation.
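
For the reference-count example, I could imagine something like a per-CPU
counter (just a sketch with made-up names), where the hot path only ever
writes a cache line owned by the local CPU and the expensive summation is
reserved for the rare teardown path:

    /*
     * Sketch, made-up names: the reference count is spread over per-CPU
     * slots, each on its own cache line.  get/put only touch the local
     * CPU's line, so the hot path causes no ping-pong; the "did it drop
     * to zero?" check sums all slots and is only done on teardown.
     */
    #include <stdint.h>

    #define NCPUS      64
    #define CACHELINE  64

    struct percpu_ref {
        struct {
            int64_t count;
        } __attribute__((aligned(CACHELINE))) slot[NCPUS];
    };

    static inline void
    percpu_ref_get(struct percpu_ref *r, int cpu)
    {
        r->slot[cpu].count++;   /* local cache line only */
    }

    static inline void
    percpu_ref_put(struct percpu_ref *r, int cpu)
    {
        r->slot[cpu].count--;   /* a slot may go negative; only the sum matters */
    }

    static inline int64_t
    percpu_ref_total(const struct percpu_ref *r)
    {
        int64_t sum = 0;
        for (int cpu = 0; cpu < NCPUS; cpu++)
            sum += r->slot[cpu].count;
        return sum;             /* slow path, e.g. on final teardown */
    }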

> One way to do this is to give shared locks or existence locks an expiration
> (which you can think of as an expiration in time ticks but the 'ticks'
> aren't really time).  By doing so, the structure is guaranteed to remain in
> its current state (or with some level of usability) in the critical path
> without having to modify it for every access.  Obviously this only works
> under certain circumstances, but those tend to be the low-hanging fruit on
> a modern system.
> 
> For example, 'libc' is heavily shared and basically never changes, so
> having a time-expired shared lock to access that file and its VM pages
> would be perfectly acceptable.  Another example would be path lookups and
> vnode accesses.  Most components in a path lookup are relatively static,
> read-only entities.  Like "/a/b/c/d/e" ... "/", "a", "b", "c", and "d" are
> probably not going to be renamed out from under the lookup.  You still have
> to deal with the case of course, but you can skew performance to the
> read-only case and not worry so much about the exclusive/modifying case.

Wouldn't it be feasible to make this lock adaptive? That is, it starts out as a
"normal" lock, and once it detects that the protected structure is rarely
modified, it turns itself into a time-expired shared lock?
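
Purely as a thought experiment, a sketch of what I mean (made-up names, and
hand-waving over all the hard memory-ordering and fallback details):

    /*
     * Hypothetical sketch: the lock keeps a little bit of statistics
     * and, once exclusive (modifying) acquisitions turn out to be rare,
     * flips into a time-expired shared mode in which readers do not
     * write to the lock at all until the current epoch runs out.
     */
    #include <stdint.h>

    enum adaptive_mode {
        AL_NORMAL,          /* classic shared/exclusive lock */
        AL_TIME_EXPIRED     /* readers rely on the epoch, no writes */
    };

    struct adaptive_lock {
        enum adaptive_mode al_mode;
        uint64_t           al_expire_epoch;   /* readers are safe until then */
        uint32_t           al_excl_recent;    /* recent exclusive acquisitions */
        uint32_t           al_shared_recent;  /* recent shared acquisitions */
    };

    /*
     * Reader fast path in AL_TIME_EXPIRED mode: compare the global epoch
     * against al_expire_epoch and proceed read-only if it has not passed;
     * otherwise fall back to the normal locking slow path.
     */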

> Same for heavily shared VM pages.  If we remove the ref-count from the
> vm_page structure, and instead just have a 'this page is in one or more
> pmaps' flag, then a simple time-limited shared lock can be used to map and
> unmap the page without having to modify its contents at all (the contents
> of the vm_page structure, that is), and without having to obtain any locks.
> 
> This is now the low-hanging fruit in a modern BSD or Linux operating system
> (well, at least for FreeBSD, DragonFly, and Linux).  That said, performance
> is already very high in this regard.  At best in a heavily concurrent
> compile environment we may be able to squeeze out another 5-7% or so
> (reduce system% overhead from 15% to around 8%), but it is still important
> as the number of cores in computers continues to rise and as improving
> performance is predicated on avoiding atomic ops (which disrupt the
> instruction pipeline) entirely.
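
If I picture the vm_page case correctly, it would be roughly this (again just
a sketch with made-up names, not the real vm_page): the per-mapping reference
count goes away, and mapping an already-mapped page under a time-limited
shared hold never writes to the structure, since the flag only changes on the
rare transitions.

    /*
     * Sketch, made-up names: instead of bumping a ref count on every
     * map/unmap, the page structure only carries a "this page is in one
     * or more pmaps" flag.  The map fast path merely reads it under a
     * time-limited shared hold.
     */
    #include <stdint.h>

    #define PGF_MAPPED  0x0001u     /* page is in one or more pmaps */

    struct page_sketch {
        uint32_t pg_flags;
        /* ... other fields; the map fast path writes none of them ... */
    };

    static inline int
    page_is_mapped(const struct page_sketch *pg)
    {
        return (pg->pg_flags & PGF_MAPPED) != 0;    /* read-only fast path */
    }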

Thanks for writing down all this highly interesting stuff!

Regards,

  Michael