You could do worse than Mach ports

Terry Lambert tlambert2 at mindspring.com
Thu Jul 17 00:50:39 PDT 2003


Matthew Dillon wrote:
>     Well, I used mach messaging long ago on the NeXT machine.  The basic
>     problem with mach messages is that they are 'heavy weight'.  The
>     messaging system has far too much knowledge about the information
>     being sent, and it presumes fairly expensive memory mapping operations
>     which I believe can be avoided.

The NeXTStep implementation was on top of Mach 2 (actually "2.5"
is what people call it, though there was never any such release).
The Mach 3.0 messaging is somewhat cleaner about this.

At the very least, you could do worse than to steal some ideas
about the problem space that you will have to cover with the idea
of messaging (yes, I'm aware of your Amiga and VMS experience ;^)).

I admit that the Mach VM primitives are fairly expensive to deal
with, and I would avoid them too, if I could get away with it.  I
don't think the messaging implementation is married to them, though.


>     I far prefer to convert the I/O subsystem to pass VM Object ranges in the
>     iovec instead of user address space ranges.  This provides a way to
>     reference the data without anyone having to map it at all... the DMA
>     subsystem for example would be able to work directly from the physical
>     pages pulled from the vm_page_t's in the object.  And, more importantly,
>     the messaging system would not have to have any deep knowledge of the
>     data being passed.  It would also be possible to pass VM Object references
>     (or their logical equivalent: a file descriptor) into and out of user
>     space and only actually map the ones associated with filesystem meta
>     data.  File data would not have to be mapped, making a userspace VFS
>     stack potentially almost as efficient as a kernelspace one.

It wouldn't have to be mapped in the pass-through case, which is
currently handled in FreeBSD by the VOP_GETVOBJ (or whatever),
which I had originally envisioned as VOP_GETFINALVP.
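
As a rough sketch of what "pass-through" means here: a layer that only
forwards a descriptor naming an object range never has to map the data.
Everything in the block below (struct obj_range, struct layer,
passthru_io, bottom_io) is a hypothetical name made up for illustration;
none of it is the FreeBSD VOP interface or anyone's proposed API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: names an object range, not a user address. */
struct obj_range {
    uint64_t obj_id;   /* opaque stand-in for a VM object reference */
    uint64_t offset;   /* byte offset within the object */
    size_t   length;   /* number of bytes in the range */
};

/* Hypothetical stacking layer with a single I/O entry point. */
struct layer {
    struct layer *lower;                                /* next layer down */
    int (*io)(struct layer *self, const struct obj_range *r);
    unsigned long calls;                                /* times io() ran */
};

/* Pass-through layer: forwards the descriptor and never touches the data. */
int passthru_io(struct layer *self, const struct obj_range *r)
{
    self->calls++;
    if (self->lower == NULL)
        return -1;
    return self->lower->io(self->lower, r);
}

/*
 * Bottom layer stand-in: a real one would hand the range to the block
 * device; this one only rejects empty ranges.
 */
int bottom_io(struct layer *self, const struct obj_range *r)
{
    self->calls++;
    return r->length > 0 ? 0 : -1;
}
```

Stacking passthru_io over bottom_io and issuing one request runs each
layer once, and no layer dereferences a data pointer, because the
descriptor does not contain one.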


>     If you think about it virtually all data references in an I/O operation
>     do not actually have to be touched or accessed by intermediate VFS layers.
>     Not even UFS needs to touch the file data in a read() or write().  It's
>     only the client at one end and the physical block device (via DMA usually)
>     at the other that ever needs to touch the file data.

I really disagree with this (big surprise 8-)).

The problem with this model is that you can only represent a small
subset of the filesystem types, and almost none of the interesting
ones.

For anything that does anything interesting, you are going to have
to map file data as well as metadata pages into the address space
of whoever handles operations on the vp; specifically, here are the
classes of manipulations, with a few examples of each:

1)	File folding, where you store metadata in a file on the FS,
	and hide it from the upper layer, presenting the data in
	the file as metadata to the upper layer.

	QUOTAFS		Implement quotas for all FS's

	UMSDOSFS	Implement UNIX permissions, etc., on top of
			FS's which don't support them

2)	Transformation, where each page of data goes through some
	transformation from the lower layer representation to the
	upper layer representation, in order to normalize it.

	ISOUTF8FS	Implement conversion of data from a legacy
			(e.g. NFSv2, NFSv3, UFS) FS from an ISO
			character set, such as ISO8859-1 8-bit data
			into UTF-8 representation for the upper
			layer code, which expects all directory data
			to be stored as UTF-8

	CRYPTFS		An FS that implements per file encryption using
			a restartable streaming crypto algorithm that
			XORs the pages with the key data to present
			unencrypted data to the upper layer on a file
			by file basis, instead of restricting you to a
			single, global key per device.

3)	Directory folding, where you obtain multiple forks for files and
	implicit association of metadata, with the ability to back up
	and restore the resulting information, and have it functional
	afterwards.

	EXATTRFS	Implements extended attributes on any underlying
			FS by converting the file reference to a directory
			reference, and the "data fork" into a reference to
			a file in that directory, with the ability to store
			an arbitrary number of other files there, as well,
			calling them "extended attribute streams" instead.

	ACLFS		Implement access control lists; this could be via
			an "ACL fork" in EXATTRFS, or via some other means.

	VERSIONFS	Implements file versioning by way of storing the
			versions themselves in the underlying directory
			that represents the file.  Utilizes POSIX namespace
			escapes in order to select specific versions other
			than "the most recent".

4)	Semantic layering, where explicit semantics are implemented on
	top of the underlying FS, without needing to store specific
	additional metadata.

	UMAPFS		Same as in Ficus/BSD

	TRANSLUCENTFS	Same as in Solaris

	UNIONFS		Same as BSD/MacOS X "-union"/"unionfs"

etc.

In all these cases, except UMAPFS, you are manipulating data pages other
than those representing strictly metadata.  And those are all basically
a heck of a lot more interesting than just "NULLFS", which is what getting
the backing object buys you (at least when crossing a protection domain,
that is all it buys you).
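
To make the "transformation" class concrete, here is a minimal sketch of
the CRYPTFS-style idea: the layer has to read and rewrite every data
byte, so it needs the pages mapped.  The keystream below (a repeating
key indexed by file offset) is a toy assumption chosen only to show
restartability; it is not a secure cipher, and xor_range is a
hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy restartable XOR transform.  The keystream byte depends only on the
 * absolute file offset, so any page can be transformed on its own without
 * processing the pages before it.  Applying it twice restores the input.
 */
void xor_range(uint8_t *buf, size_t len, uint64_t file_off,
               const uint8_t *key, size_t keylen)
{
    size_t i;

    if (keylen == 0)
        return;                 /* no key material: leave data untouched */
    for (i = 0; i < len; i++)
        buf[i] ^= key[(file_off + i) % keylen];
}
```

Calling xor_range on a whole buffer at offset 0 gives the same bytes as
calling it on two pieces with their own offsets.  That property is what
lets a layer work page by page, and it is also why the layer cannot get
by with just a reference to the backing object.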

Probably, whatever you call your message port or how you end up
implementing it, you are going to have to treat it as a unit with
the mapping across protection domains interface, in order to avoid
any races between mapping and then operating on the mapped objects
(given your stated out-of-order execution model).

-- Terry




