You could do worse than Mach ports
Terry Lambert
tlambert2 at mindspring.com
Thu Jul 17 00:50:39 PDT 2003
Matthew Dillon wrote:
> Well, I used mach messaging long ago on the NeXT machine. The basic
> problem with mach messages is that they are 'heavy weight'. The
> messaging system has far too much knowledge about the information
> being sent, and it presumes fairly expensive memory mapping operations
> which I believe can be avoided.
The NeXTStep implementation was on top of Mach 2 (actually "2.5"
is what people call it, though there was never any such release).
The Mach 3.0 messaging is somewhat cleaner about this.
At the very least, you could do worse than to steal some ideas
about the problem space that you will have to cover with the idea
of messaging (yes, I'm aware of your Amiga and VMS experience ;^)).
I admit that the Mach VM primitives are fairly expensive to deal
with, and I would avoid them too, if I could get away with it. I
don't think the messaging implementation is married to them, though.
> I far prefer to convert the I/O subsystem to pass VM Object ranges in the
> iovec instead of user address space ranges. This provides a way to
> reference the data without anyone having to map it at all... the DMA
> subsystem for example would be able to work directly from the physical
> pages pulled from the vm_page_t's in the object. And, more importantly,
> the messaging system would not have to have any deep knowledge of the
> data being passed. It would also be possible to pass VM Object references
> (or their logical equivalent: a file descriptor) into and out of user
> space and only actually map the ones associated with filesystem meta
> data. File data would not have to be mapped, making a userspace VFS
> stack potentially almost as efficient as a kernelspace one.
It wouldn't have to be mapped in the pass-through case, which is
currently handled in FreeBSD by the VOP_GETVOBJ (or whatever),
which I had originally envisioned as VOP_GETFINALVP.
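For what it's worth, here is a rough sketch of why the pass-through case needs no mapping. It assumes a hypothetical vector entry that names a range of a backing object instead of a user address; the names obj_range and obj_range_sub are made up for illustration and are not FreeBSD or DragonFly interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical I/O vector entry: names a byte range of a backing object
 * instead of a mapped user address. Illustration only, not a real API.
 */
struct obj_range {
    void    *obj;   /* opaque handle to the backing object; never dereferenced here */
    uint64_t off;   /* byte offset into the object */
    size_t   len;   /* length of the range in bytes */
};

/*
 * Pass-through layer: narrow a request to a sub-range by adjusting the
 * bookkeeping only. No byte of file data is read, copied, or mapped.
 */
static struct obj_range
obj_range_sub(struct obj_range r, uint64_t skip, size_t len)
{
    assert(skip <= r.len && len <= r.len - skip);
    r.off += skip;
    r.len = len;
    return r;
}
```

A layer that only forwards or trims ranges like this never needs the pages in its own address space, which is the case the backing-object approach covers.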
> If you think about it virtually all data references in an I/O operation
> do not actually have to be touched or accessed by intermediate VFS layers.
> Not even UFS needs to touch the file data in a read() or write(). It's
> only the client at one end and the physical block device (via DMA usually)
> at the other that ever needs to touch the file data.
I really disagree with this (big surprise 8-)).
The problem with this model is that you can only represent a small
subset of the filesystem types, and almost none of the interesting
ones.
For anything that does anything interesting, you are going to have
to map file data as well as metadata pages into the address space
of whoever handles operations on the vp; specifically, here are the
classes of manipulations, with a few examples of each:
1) File folding, where you store metadata in a file on the FS,
   and hide it from the upper layer, presenting the data in
   the file as metadata to the upper layer.
     QUOTAFS        Implement quotas for all FS's
     UMSDOSFS       Implement UNIX permissions, etc., on top of
                    FS's which don't support them
2) Transformation, where each page of data goes through some
   transformation from the lower layer representation to the
   upper layer representation, in order to normalize it.
     ISOUTF8FS      Convert data on a legacy FS (e.g. NFSv2,
                    NFSv3, UFS) from an ISO character set, such
                    as 8-bit ISO8859-1, into a UTF-8
                    representation for the upper layer code,
                    which expects all directory data to be
                    stored as UTF-8
     CRYPTFS        An FS that implements per-file encryption
                    using a restartable streaming crypto
                    algorithm that XORs the pages with the key
                    data to present unencrypted data to the
                    upper layer on a file-by-file basis, instead
                    of restricting you to a single, global key
                    per device.
3) Directory folding, where you obtain multiple forks for files and
   implicit association of metadata, with the ability to back up
   and restore the resulting information, and have it functional
   afterwards.
     EXATTRFS       Implements extended attributes on any
                    underlying FS by converting the file
                    reference to a directory reference, and the
                    "data fork" into a reference to a file in
                    that directory, with the ability to store an
                    arbitrary number of other files there as
                    well, calling them "extended attribute
                    streams" instead.
     ACLFS          Implements access control lists; this could
                    be via an "ACL fork" in EXATTRFS, or via some
                    other means.
     VERSIONFS      Implements file versioning by storing the
                    versions themselves in the underlying
                    directory that represents the file. Uses
                    POSIX namespace escapes to select specific
                    versions other than "the most recent".
4) Semantic layering, where explicit semantics are implemented on
   top of the underlying FS, without needing to store specific
   additional metadata.
     UMAPFS         Same as in Ficus/BSD
     TRANSLUCENTFS  Same as in Solaris
     UNIONFS        Same as BSD/MacOS X "-union"/"unionfs"
     etc.
In all these cases, except UMAPFS, you are manipulating data pages other
than those representing strictly metadata. And those are all basically
a heck of a lot more interesting than just "NULLFS", which is what getting
the backing object buys you (at least when crossing a protection domain,
that is all it buys you).
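To make the transformation case concrete, here is a minimal sketch of what a CRYPTFS-style layer has to do to every page. It is a toy XOR with a repeating key, not real cryptography, and the names xor_range and xor_roundtrip_ok are hypothetical. The point is only that the layer reads and writes each byte, so an object reference by itself does it no good; it needs the data mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical CRYPTFS-style transform (toy XOR, NOT real cryptography).
 * The key byte is chosen from the absolute file offset, so the stream can
 * be restarted at any page without processing the pages before it.
 */
static void
xor_range(uint8_t *dst, const uint8_t *src, size_t len,
          uint64_t file_off, const uint8_t *key, size_t keylen)
{
    assert(keylen > 0);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i] ^ key[(file_off + i) % keylen];
}

/*
 * Round trip: transform 8 bytes as one range, then undo the transform as
 * two separately restarted 4-byte ranges. Returns 1 if the data survives.
 */
static int
xor_roundtrip_ok(void)
{
    static const uint8_t key[3] = { 0x5a, 0xc3, 0x11 };
    static const uint8_t plain[8] = { 'l', 'o', 'w', 'e', 'r', 'f', 's', '!' };
    uint8_t lower[8], upper[8];

    xor_range(lower, plain, 8, 0, key, 3);          /* what the lower FS stores */
    xor_range(upper, lower, 4, 0, key, 3);          /* first "page" */
    xor_range(upper + 4, lower + 4, 4, 4, key, 3);  /* second, restarted at offset 4 */
    return memcmp(lower, plain, 8) != 0 && memcmp(upper, plain, 8) == 0;
}
```

Every byte passes through the loop in xor_range, in both directions, which is exactly the work a pass-through layer gets to skip.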
Whatever you call your message port, and however you end up
implementing it, you are probably going to have to treat it as a
unit with the interface for mapping across protection domains, in
order to avoid races between mapping objects and then operating on
them (given your stated out-of-order execution model).
-- Terry