Plans for 1.5
dillon at apollo.backplane.com
Tue Dec 20 12:44:08 PST 2005
:Matthew Dillon wrote:
:> STAGE 2 - I/O Subsystem
:> Starting early February!
:What are your thoughts on also adding a new flag to madvise to prefault
:pages, so as to avoid faults through the VM system?
:Perhaps a flag named MADV_PREFAULT, or to be consistent with
:MADV_WILLNEED, MADV_NEEDNOW. Of course, such a call would be limited to
:a maximum size per call, which is queryable, etc.
:As I recall, this is one of your main points for using buffered
:read/write calls instead of mmap. This could be even more significant
:with faults causing page-ins from cross-system boundaries.
Well, adding a MADV_PREFAULT flag would round-out the madvise features
list but I don't think it would be all that useful on a production
system because the current page-faulting heuristics already do
some degree of pre-faulting.
:Also, how about extending sendfile to behave as a general copyfile, or
:simply adding a general copyfile syscall?
:With a framework like CCMS it would seem that such operations could
:easily be turned into a simple message passed to the file's source
:system for direct copying. Such a call could perhapes even be an async.
:operation with a method for querying progress via select/poll/kqueue.
This will be trivial with CCMS. In fact, with CCMS it will be
possible to implement transactional operations by 'locking' the
cache state for arbitrary offset ranges without having to actually
instantiate the related buffers. The ability to predispose the cache
state is already integrated into my plans for CCMS. It is an absolute
requirement in order to be able to be cache coherent across many
entities (aka multiple machines linked by a network) without having to
incur a communication between the machines for every little operation.
So, for example, lets take the file read heuristic. Each vnode would
have a CCMS tree and buffers would be managed within that tree.
buf = getblk(&vp->v_ccms_tree, offset, bytes, CCMS_STATE_SHARED);
Without predisposition each getblk() call would not only have to
access the local CCMS tree, it would also have to access other cache
coherent CCMS trees (such as the local VM page cache for this vnode,
or CCMS trees representing other physical machines in the cluster). It
would have to do it for every getblk() call. That would be rather
expensive to say the least.
But with predisposition the CCMS subsystem would be able to create
a predisposition node 'above' the buffer that covers a far greater
range of data. As long as the cache state in the predisposed node
is at least the state we want to obtain for our buffers, we can then
obtain additional buffers within the larger range without incuring
any additional communication with other coherent CCMS trees.
If we take the whole-file case in a cluster of 4 machines, predisposition
of the 'root' node would give one machine EXCLUSIVE access and the other
three machines INVALID access, or might give all 4 machines SHARED access.
etc. Communication would only be required when a machine needs a cache
state that has not been predisposed. So if all machines are predisposed
to SHARED and one wants to write to the file, a communication would have
to be made to change the other machines to INVALID (for whole or part
of the tree).
Now it just so happens that we have a nice place to store this sort of
predisposed state... we can store it in the internal nodes of the tree
(say, if I were to use a RADIX tree). So, e.g. if a program is reading
data from a file in 8K chunks and CCMS is using a base-16 radix tree,
the immediate parent node would be able to predispose a 64K chunk of
the data range and the parent above that would be able to predispose
a 1MB chunk of the data range... all the way up to the root node which
would be able to predispose the entire file.
Another alternative would be to actually have a 'predisposed node'
entity which covers a particular range and is under the direct control
of various kernel subsystems instead of internalized within CCMS.
That would imply using a BTree or Red-Black tree instead of a radix tree.
I haven't decided which is the best approach to take. At the moment I
am leaning towards a radix tree because the vast majority of operations
will be both non-contended and tend to operate on power-of-2 boundaries.
Even entities such as inode and bitmap blocks (~6144 byte buffers) would
be fairly optimal in a radix tree, the entity would simply have two
references from the tree instead of one. I can even short-cut radix
tree nodes by collapsing 'empty' internal nodes. So e.g. if the root
represents a 2^64 bit range and we want to lock several contiguous
4K chunks, the next level down could be short-cutted to 2^16 (skipping
the nodes for 2^60, 2^56, 2^52 .... 2^20). The radix tree can thus be
made very, very efficient.
Where the radix tree breaks down is when you have major *CONFLICTING*
entities which are on byte-aligned boundaries. So, for example, you
have one entity locking the range 0-1111 and another locking the range
1112-2000. The problem only occurs when there is a conflict at the
boundary because otherwise the storage of the node in the radix tree
would simply be in the 'smallest container that fits the whole entity'.
But when you have a conflict at a byte boundary the radix tree would
have to represent each entity down to the byte boundary. That could
wind up eating up to (64 bits divided by 4 bits for base-16) = 16
internal nodes in the radix tree.
But, honestly, I can't think of any situation where programs would
actually compete on byte boundaries short of the program doing something
really stupid. If it comes to it, the degenerate memory use can be
controlled simply by blocking one of the conflicting entities until
the other has finished.
More information about the Kernel