MESI Caching work start (was Re: caching and userapi)
Matthew Dillon
dillon at apollo.backplane.com
Thu Jul 17 22:58:48 PDT 2003
:>
:> It depends what you mean by 'the caching'. I would say, in general,
:
:http://www.dragonflybsd.org/Status/index.cgi
:
:By caching and userapi, I was referring to two of the items marked "not
:started" under Status->Index. They seemed like they might be
:partially implementable without having all of the messaging and VFS
:work done.
:
: -Kip
Ah. Oh. *THAT* caching. *THAT* caching is the MESI caching model
for object data! It's pretty sophisticated, but the first incremental
step is to implement shared/exclusive ranged offset 'locks' on
VM objects, like this (this is just my guess at the initial API;
all comments, ideas, and discussion are welcome!):
    struct vm_object_lock lobj;
    struct vm_object *object;
    off_t base;
    off_t bytes;

    shlock_range(&lobj, object, base, bytes);
Obtain a read lock on the specified unaligned byte range within the
VM object. The lobj structure will be filled with information about
the lock and must remain intact until unlock_range() is called.
While the range is locked the VM object may instantiate new pages
within the specified range but may not modify or delete existing
pages, or truncate the object below base+bytes.
You do not have to zero out lobj prior to use.
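For example, a read path might look something like this. This is a
minimal sketch: only shlock_range()/unlock_range() and the
vm_object_lock structure come from the proposed API; the helper
copy_object_range() and the surrounding declarations are made up
for illustration.

    /*
     * Hypothetical read path.  While the shared lock is held, pages
     * in [base, base+bytes) may be instantiated but not modified,
     * deleted, or truncated away, so the copy sees a stable range.
     */
    static int
    sample_read(struct vm_object *object, off_t base, off_t bytes,
                char *buf)
    {
        struct vm_object_lock lobj;   /* stack storage, no zeroing */
        int error;

        shlock_range(&lobj, object, base, bytes);
        error = copy_object_range(object, base, bytes, buf);
        unlock_range(&lobj);
        return (error);
    }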
    exlock_range(&lobj, object, base, bytes);
Obtain an exclusive (write) lock on the specified unaligned byte range
within the VM object. The lobj structure will be filled with
information about the lock and must remain intact until unlock_range()
is called.
While the range is locked the VM object may instantiate new pages
within the specified range but only YOU can modify or delete pages
within the specified range. You can truncate the object to any
byte offset within the range as long as base+bytes is greater than
or equal to the current size of the object (i.e. you own it to EOF).
You do not have to zero out lobj prior to use.
    unlock_range(&lobj);

Release a previously obtained byte range lock. Once released you may
throw away (or reuse) your lobj structure.
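Putting the exclusive lock and the truncation rule together, a
truncation path might look like this. obj_size() and obj_set_size()
are assumed helpers; only the locking calls are part of the proposal.

    /*
     * Hypothetical truncation path.  base+bytes equals the current
     * object size here, so the rule above (you own it to EOF) is
     * satisfied and the truncation is legal.
     */
    static void
    sample_truncate(struct vm_object *object, off_t newsize)
    {
        struct vm_object_lock lobj;
        off_t size = obj_size(object);

        if (newsize >= size)
            return;                 /* nothing to shrink */

        /* Lock exclusively from the new EOF through the current EOF. */
        exlock_range(&lobj, object, newsize, size - newsize);
        obj_set_size(object, newsize);
        unlock_range(&lobj);
    }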
This is intended to replace the piecemeal VM page and BUF locking we
do now, and note that it is intended to work on BYTE boundaries (i.e.
to be completely architecture independent).
My idea is that the functions would actually construct the
contents of the lobj structure and link it into a list of compatible
locks already held by the VM object (based at the VM object), so no
malloc()s are required for management of the ranged locks. Note that
for synchronous system calls, or perhaps even for syscalls in general
(one could embed an lobj structure in the syscall message for kernel
use), lobj can be declared on the kernel stack; in DragonFly the kernel
stack is NOT normally swappable. Generally speaking, lobj structures
could be embedded in many system structures throughout the kernel.
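For concreteness, the structure might look something like this.
Every field here is a guess; nothing about the layout is settled.

    #include <sys/queue.h>      /* TAILQ_ENTRY */

    /*
     * Possible layout.  The caller supplies the storage (stack,
     * syscall message, or an embedding structure), so shlock_range()
     * and exlock_range() only fill in the fields and link the lock
     * into a list based at the VM object; no allocation occurs.
     */
    struct vm_object_lock {
        TAILQ_ENTRY(vm_object_lock) lk_entry; /* list at the object */
        struct vm_object *lk_object;          /* object being locked */
        off_t            lk_base;             /* start of byte range */
        off_t            lk_bytes;            /* length of byte range */
        int              lk_type;             /* shared or exclusive */
    };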
These range locks protect against things like truncation occurring
while someone else is in the middle of a read() or write(), protect
the atomicity of a read() or write(), and protect valid pages within
an object from being manipulated at unexpected times. They would
replace the I/O-in-progress junk currently maintained in the vm_page
structure, replace the vfs_busy_pages() and vfs_unbusy_pages()
functions, and so on and so forth.
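As a sketch of that last replacement (the conversion is only
proposed, not implemented, and the struct buf usage is schematic):

    /*
     * Hypothetical I/O path: one range lock covers every page under
     * the buffer, where vfs_busy_pages()/vfs_unbusy_pages() would
     * have busied and unbusied each vm_page individually.
     */
    static void
    sample_io(struct vm_object *object, struct buf *bp)
    {
        struct vm_object_lock lobj;

        exlock_range(&lobj, object, bp->b_offset, bp->b_bcount);
        /* ... issue the I/O and wait for it to complete ... */
        unlock_range(&lobj);
    }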
Down the line, as another incremental stage, these functions would be
extended with a call-back or message-back feature which would allow
a more sophisticated MESI subsystem to request that holders of a lock
downgrade or invalidate their lock. This would be used to implement
fully coherent data mapping in a multi-machine clustered environment
and perhaps even as a means to implement 'smart' TLB invalidation
within VM spaces shared between cpus. I don't know exactly how far
we could extend the concept; it will depend on how many features we
can pack in without making the API and structures unreasonably complex.
Your imagination is the limit!
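Purely as speculation, the call-back flavor of the API might look
something like this; the constants, typedef, and function name are
all invented for illustration.

    #define VMLK_DOWNGRADE   1  /* weaken exclusive to shared */
    #define VMLK_INVALIDATE  2  /* give up the range entirely */

    typedef int (*vm_lock_callback_t)(struct vm_object_lock *lobj,
                                      int req);

    /*
     * As exlock_range(), but registers a call-back through which a
     * MESI layer (possibly on another cpu, or another machine in a
     * cluster) can ask this holder to downgrade or invalidate its
     * range lock.
     */
    void exlock_range_cb(struct vm_object_lock *lobj,
                         struct vm_object *object, off_t base,
                         off_t bytes, vm_lock_callback_t callback);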
The immediate benefit is that a single range lock can govern any number
of VM pages without scaling the overhead... an extremely important
concept for all sorts of subsystems that should radically improve I/O
throughput on high speed systems that currently bog down in buffer
cache overheads. One range lock can govern a 64K I/O instead of 8
separate vm_page_t 'locks', for example.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>