MESI Caching work start (was Re: caching and userapi)
Matthew Dillon
dillon at apollo.backplane.com
Thu Jul 17 22:58:48 PDT 2003
:>
:> It depends what you mean by 'the caching'. I would say, in general,
:
:http://www.dragonflybsd.org/Status/index.cgi
:
:By caching and userapi, I was referring to two of the items marked "not
:started" under Status->Index. They seemed like they might be
:partially implementable without having all of the messaging and VFS
:work done.
:
: -Kip
Ah. Oh. *THAT* caching. *THAT* caching is the MESI caching model
for object data! It's pretty sophisticated, but the first incremental
step is to implement shared/exclusive ranged offset 'locks' on
VM objects, like this (this is just my guess at the initial API;
all comments, ideas, and discussion are welcome!):
    struct vm_object_lock lobj;
    struct vm_object *object;
    off_t base;
    off_t bytes;

    shlock_range(&lobj, object, base, bytes);
Obtain a read lock on the specified unaligned byte range within the
VM object. The lobj structure will be filled with information about
the lock and must remain intact until unlock_range() is called.
While the range is locked the VM object may instantiate new pages
within the specified range but may not modify or delete existing
pages, or truncate the object below base+bytes.
You do not have to zero out lobj prior to use.
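For example, a read path might look something like this. This is a
minimal sketch: only shlock_range()/unlock_range() and the
vm_object_lock structure come from the proposed API; the helper
copy_object_range() and the surrounding declarations are made up
for illustration.

    /*
     * Hypothetical read path.  While the shared lock is held, pages
     * in [base, base+bytes) may be instantiated but not modified,
     * deleted, or truncated away, so the copy sees a stable range.
     */
    static int
    sample_read(struct vm_object *object, off_t base, off_t bytes,
                char *buf)
    {
        struct vm_object_lock lobj;   /* stack storage, no zeroing */
        int error;

        shlock_range(&lobj, object, base, bytes);
        error = copy_object_range(object, base, bytes, buf);
        unlock_range(&lobj);
        return (error);
    }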
    exlock_range(&lobj, object, base, bytes);
Obtain an exclusive (write) lock on the specified unaligned byte range
within the VM object. The lobj structure will be filled with
information about the lock and must remain intact until unlock_range()
is called.
While the range is locked the VM object may instantiate new pages
within the specified range but only YOU can modify or delete pages
within the specified range. You can truncate the object to any
byte offset within the range as long as base+bytes is greater than
or equal to the current size of the object (i.e. you own it to EOF).
You do not have to zero out lobj prior to use.
    unlock_range(&lobj);

Release a previously obtained byte range lock. Once released you may
throw away (or reuse) your lobj structure.
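Putting the exclusive lock and the truncation rule together, a
truncation path might look like this. obj_size() and obj_set_size()
are assumed helpers; only the locking calls are part of the proposal.

    /*
     * Hypothetical truncation path.  base+bytes equals the current
     * object size here, so the rule above (you own it to EOF) is
     * satisfied and the truncation is legal.
     */
    static void
    sample_truncate(struct vm_object *object, off_t newsize)
    {
        struct vm_object_lock lobj;
        off_t size = obj_size(object);

        if (newsize >= size)
            return;                 /* nothing to shrink */

        /* Lock exclusively from the new EOF through the current EOF. */
        exlock_range(&lobj, object, newsize, size - newsize);
        obj_set_size(object, newsize);
        unlock_range(&lobj);
    }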
This is intended to replace the piecemeal VM page and BUF locking we
do now, and note that it is intended to work on BYTE boundaries (i.e.
to be completely architecture independent).
My idea is that the functions would actually construct the
contents of the lobj structure and link it into a list of compatible
locks already held by the VM object (based at the VM object), so no
malloc()s are required for management of the ranged locks. Note that
for synchronous system calls, or perhaps even for syscalls in general
(one could embed an lobj structure in the syscall message for kernel
use), lobj can be declared on the kernel stack; in DragonFly the kernel
stack is NOT normally swappable. Generally speaking, lobj structures
could be embedded in many system structures throughout the kernel.
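For concreteness, the structure might look something like this.
Every field here is a guess; nothing about the layout is settled.

    #include <sys/queue.h>      /* TAILQ_ENTRY */

    /*
     * Possible layout.  The caller supplies the storage (stack,
     * syscall message, or an embedding structure), so shlock_range()
     * and exlock_range() only fill in the fields and link the lock
     * into a list based at the VM object; no allocation occurs.
     */
    struct vm_object_lock {
        TAILQ_ENTRY(vm_object_lock) lk_entry; /* list at the object */
        struct vm_object *lk_object;          /* object being locked */
        off_t            lk_base;             /* start of byte range */
        off_t            lk_bytes;            /* length of byte range */
        int              lk_type;             /* shared or exclusive */
    };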
These range locks protect against things like truncation occurring
while someone else is in the middle of a read() or write(), protect
the atomicity of a read() or write(), and protect valid pages within
an object from being manipulated at unexpected times. They would
replace the I/O-in-progress junk currently maintained in the vm_page
structure, replace the vfs_busy_pages() and vfs_unbusy_pages()
functions, and so on and so forth.
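As a sketch of that last replacement (the conversion is only
proposed, not implemented, and the struct buf usage is schematic):

    /*
     * Hypothetical I/O path: one range lock covers every page under
     * the buffer, where vfs_busy_pages()/vfs_unbusy_pages() would
     * have busied and unbusied each vm_page individually.
     */
    static void
    sample_io(struct vm_object *object, struct buf *bp)
    {
        struct vm_object_lock lobj;

        exlock_range(&lobj, object, bp->b_offset, bp->b_bcount);
        /* ... issue the I/O and wait for it to complete ... */
        unlock_range(&lobj);
    }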
Down the line, as another incremental stage, these functions would be
extended with a call-back or message-back feature which would allow
a more sophisticated MESI subsystem to request that holders of a lock
downgrade or invalidate their lock. This would be used to implement
fully coherent data mapping in a multi-machine clustered environment
and perhaps even as a means to implement 'smart' TLB invalidation
within VM spaces shared between cpus. I don't know exactly how far
we could extend the concept; it will depend on how many features we
can pack in without making the API and structures unreasonably complex.
Your imagination is the limit!
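Purely as speculation, the call-back flavor of the API might look
something like this; the constants, typedef, and function name are
all invented for illustration.

    #define VMLK_DOWNGRADE   1  /* weaken exclusive to shared */
    #define VMLK_INVALIDATE  2  /* give up the range entirely */

    typedef int (*vm_lock_callback_t)(struct vm_object_lock *lobj,
                                      int req);

    /*
     * As exlock_range(), but registers a call-back through which a
     * MESI layer (possibly on another cpu, or another machine in a
     * cluster) can ask this holder to downgrade or invalidate its
     * range lock.
     */
    void exlock_range_cb(struct vm_object_lock *lobj,
                         struct vm_object *object, off_t base,
                         off_t bytes, vm_lock_callback_t callback);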
The immediate benefit is that a single range lock can govern any number
of VM pages without scaling the overhead... an extremely important
concept for all sorts of subsystems that should radically improve I/O
throughput on high speed systems that currently bog down in buffer
cache overheads. One range lock can govern a 64K I/O instead of 8
separate vm_page_t 'locks', for example.
-Matt
Matthew Dillon
<dillon at xxxxxxxxxxxxx>