Plans for 1.5

Tue Dec 20 12:44:08 PST 2005

:
:Matthew Dillon wrote:
:> 			    STAGE 2 - I/O Subsystem
:> 			    Starting early February!
:
:What are your thoughts on also adding a new flag to madvise to prefault 
:pages, so as to avoid faults through the VM system?
:
:Perhaps a flag named MADV_PREFAULT, or to be consistent with 
:MADV_WILLNEED, MADV_NEEDNOW. Of course, such a call would be limited to 
:a maximum size per call, which is queryable, etc.
:
:As I recall, this is one of your main points for using buffered 
:read/write calls instead of mmap. This could be even more significant 
:with faults causing page-ins from cross-system boundaries.

    Well, adding a MADV_PREFAULT flag would round-out the madvise features
    list but I don't think it would be all that useful on a production 
    system because the current page-faulting heuristics already do 
    some degree of pre-faulting.

:Also, how about extending sendfile to behave as a general copyfile, or 
:simply adding a general copyfile syscall?
:
:With a framework like CCMS it would seem that such operations could 
:easily be turned into a simple message passed to the file's source 
:system for direct copying. Such a call could perhapes even be an async. 
:operation with a method for querying progress via select/poll/kqueue.
:
:- Jason

    This will be trivial with CCMS.   In fact, with CCMS it will be 
    possible to implement transactional operations by 'locking' the
    cache state for arbitrary offset ranges without having to actually
    instantiate the related buffers.  The ability to predispose the cache
    state is already integrated into my plans for CCMS.  It is an absolute
    requirement in order to be able to be cache coherent across many
    entities (aka multiple machines linked by a network) without having to
    incur a communication between the machines for every little operation.

    So, for example, lets take the file read heuristic.  Each vnode would
    have a CCMS tree and buffers would be managed within that tree.

    read()
    {
	...
	buf = getblk(&vp->v_ccms_tree, offset, bytes, CCMS_STATE_SHARED);
	uiomove(....)
    }

    Without predisposition each getblk() call would not only have to 
    access the local CCMS tree, it would also have to access other cache
    coherent CCMS trees (such as the local VM page cache for this vnode,
    or CCMS trees representing other physical machines in the cluster).  It
    would have to do it for every getblk() call.  That would be rather
    expensive to say the least.

    But with predisposition the CCMS subsystem would be able to create
    a predisposition node 'above' the buffer that covers a far greater
    range of data.  As long as the cache state in the predisposed node
    is at least the state we want to obtain for our buffers, we can then
    obtain additional buffers within the larger range without incuring
    any additional communication with other coherent CCMS trees.

    If we take the whole-file case in a cluster of 4 machines, predisposition
    of the 'root' node would give one machine EXCLUSIVE access and the other
    three machines INVALID access, or might give all 4 machines SHARED access.
    etc.  Communication would only be required when a machine needs a cache
    state that has not been predisposed.  So if all machines are predisposed
    to SHARED and one wants to write to the file, a communication would have
    to be made to change the other machines to INVALID (for whole or part
    of the tree).

    Now it just so happens that we have a nice place to store this sort of
    predisposed state... we can store it in the internal nodes of the tree
    (say, if I were to use a RADIX tree).  So, e.g. if a program is reading
    data from a file in 8K chunks and CCMS is using a base-16 radix tree,
    the immediate parent node would be able to predispose a 64K chunk of
    the data range and the parent above that would be able to predispose
    a 1MB chunk of the data range... all the way up to the root node which
    would be able to predispose the entire file.

    Another alternative would be to actually have a 'predisposed node'
    entity which covers a particular range and is under the direct control
    of various kernel subsystems instead of internalized within CCMS.  
    That would imply using a BTree or Red-Black tree instead of a radix tree.

    I haven't decided which is the best approach to take.  At the moment I
    am leaning towards a radix tree because the vast majority of operations
    will be both non-contended and tend to operate on power-of-2 boundaries.
    Even entities such as inode and bitmap blocks (~6144 byte buffers) would
    be fairly optimal in a radix tree, the entity would simply have two
    references from the tree instead of one.  I can even short-cut radix
    tree nodes by collapsing 'empty' internal nodes.  So e.g. if the root
    represents a 2^64 bit range and we want to lock several contiguous
    4K chunks, the next level down could be short-cutted to 2^16 (skipping
    the nodes for 2^60, 2^56, 2^52 .... 2^20).  The radix tree can thus be
    made very, very efficient.

    Where the radix tree breaks down is when you have major *CONFLICTING*
    entities which are on byte-aligned boundaries.  So, for example, you
    have one entity locking the range 0-1111 and another locking the range
    1112-2000.  The problem only occurs when there is a conflict at the
    boundary because otherwise the storage of the node in the radix tree
    would simply be in the 'smallest container that fits the whole entity'.
    But when you have a conflict at a byte boundary the radix tree would
    have to represent each entity down to the byte boundary.  That could
    wind up eating up to (64 bits divided by 4 bits for base-16) = 16
    internal nodes in the radix tree.

    But, honestly, I can't think of any situation where programs would 
    actually compete on byte boundaries short of the program doing something
    really stupid.  If it comes to it, the degenerate memory use can be
    controlled simply by blocking one of the conflicting entities until
    the other has finished.

						-Matt