Device layering work patch #1

Fri May 14 22:37:39 PDT 2004

    This patch is of 'alpha' quality.  I expect to commit it (with further
    work) in about a week.  I would appreciate testing so I can be confident
    that I'm not going to blow up the system when I commit this stage,
    but nobody should install this alpha patch on a production machine.

	fetch http://apollo.backplane.com/DFlyMisc/dev01.patch

    This patch starts to move our device subsystem to a reference-counted
    model and accomplishes certain prerequisites for later work.

    This patch:

    * Simplifies device instantiation and resolution.  makedev() has been
      removed in favor of make_dev(), and udev2dev() will only create
      new minor numbers, not new major numbers.  The port dispatch is
      now installed directly in the dev_t and the mess that figured out
      what port to use has been removed.  Finally, a d_clone function has
      been added (replacing d_autoq which was never used).  This function
      is called whenever a new device structure is created, allowing
      the device to populate device-specific fields when the device
      structure is created rather than populating these fields at
      device-open.  This function will also (soon) allow the creation of
      a device node to be vetoed.

      Device fields are now sufficiently initialized when the device
      structure is allocated such that we can remove NULL checks and other
      special cases from the rest of the device path.

    * Devices are created and searched for based on (devsw,major,minor)
      instead of (major,minor).  This allows us to overload device major
      numbers.  All such devices are accessible, but only the devices
      registered via cdevsw_add() are directly accessible from userland.

      All devices in the system that failed to call cdevsw_add() now call
      it.  If you fail to call cdevsw_add(), your device will not be
      'visible' (hooked in) to userland via /dev.

    * The disk subsystem, which provides partition table management and 
      translation, now overloads the device major supplied to it by the
      raw disk device and creates its own device which is distinct from
      the raw block device.  The raw block device's devsw is hidden from
      userland (not registered with cdevsw_add()).

      This means that the disk subsystem no longer needs to overload
      fields in dev_t's owned by the underlying raw block device which
      in turn means that, theoretically, we can stack the partition manager
      on top of any device that supports block operations.

      This is a huge simplification over the 'override' mechanism that the
      disk subsystem used before, which was not stackable.

    * I've started ref-counting the dev_t and cdevsw structures.  This
      needs a lot more work, but it is a good start.

    * struct buf's b_dev field is now allowed to track the device through
      translations.  That is, when the disk layer translates an I/O for
      execution by the underlying raw device it now sets b_dev to the
      underlying raw device.  This requires that b_dev always be initialized
      prior to the initial dev_dstrategy() call.  To enforce this behavior,
      biodone() now sets b_dev to NODEV.
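
    To make the items above concrete, here is a small userland model of
    the behavior they describe.  It is only a sketch, not code from the
    patch: every model_* name, the 'registered' flag standing in for
    cdevsw_add(), the 'lower' and 'offset' fields, and the partition
    offset of 100 are assumptions made for illustration.  It shows a
    d_clone-style callback running at creation time, lookup keyed on
    (devsw,major,minor) so that two devsw's can share a major, and a
    strategy call that retargets b_dev to the underlying device before
    biodone() resets it to NODEV.

```c
#include <assert.h>
#include <stddef.h>

/* Userland model only; none of these are the kernel's real definitions. */
struct model_dev;

struct model_devsw {
    const char *name;
    int         maj;
    int         registered;                    /* stands in for cdevsw_add() */
    void      (*d_clone)(struct model_dev *);  /* run once, at creation */
};

struct model_dev {
    struct model_devsw *devsw;
    int                 maj;
    int                 minor;
    struct model_dev   *lower;   /* device this layer forwards I/O to */
    long                offset;  /* block offset applied when forwarding */
};

struct model_buf {
    struct model_dev *b_dev;     /* must be set before model_strategy() */
    long              b_blkno;   /* block number as issued */
    long              b_pblkno;  /* block number after translation */
};

#define MODEL_NODEV  ((struct model_dev *)NULL)
#define MODEL_MAXDEV 8

static struct model_dev  model_devs[MODEL_MAXDEV];
static int               model_ndevs;
static struct model_dev *model_last_serviced;  /* device that did the I/O */

/* Hypothetical d_clone callback: a partition starting at block 100. */
static void
model_part_clone(struct model_dev *d)
{
    d->offset = 100;
}

/* Find a device by the full (devsw, major, minor) key. */
static struct model_dev *
model_find_dev(struct model_devsw *sw, int maj, int minor)
{
    for (int i = 0; i < model_ndevs; i++) {
        struct model_dev *d = &model_devs[i];
        if (d->devsw == sw && d->maj == maj && d->minor == minor)
            return d;
    }
    return NULL;
}

/* Return the existing device or create it; d_clone finishes the setup. */
static struct model_dev *
model_make_dev(struct model_devsw *sw, int minor)
{
    struct model_dev *d = model_find_dev(sw, sw->maj, minor);

    if (d != NULL || model_ndevs == MODEL_MAXDEV)
        return d;
    d = &model_devs[model_ndevs++];
    d->devsw = sw;
    d->maj = sw->maj;
    d->minor = minor;
    d->lower = NULL;
    d->offset = 0;
    if (sw->d_clone != NULL)
        sw->d_clone(d);
    return d;
}

/* Userland names a device by (major, minor) alone, so it can reach
 * only devices whose devsw was registered. */
static struct model_dev *
model_userland_lookup(int maj, int minor)
{
    for (int i = 0; i < model_ndevs; i++) {
        struct model_dev *d = &model_devs[i];
        if (d->devsw->registered && d->maj == maj && d->minor == minor)
            return d;
    }
    return NULL;
}

/* Completion resets b_dev, so a stale value cannot be reused. */
static void
model_biodone(struct model_buf *bp)
{
    bp->b_dev = MODEL_NODEV;
}

/* Walk down the stack, retargeting b_dev at each translation. */
static void
model_strategy(struct model_buf *bp)
{
    assert(bp->b_dev != MODEL_NODEV);
    bp->b_pblkno = bp->b_blkno;
    while (bp->b_dev->lower != NULL) {
        bp->b_pblkno += bp->b_dev->offset;
        bp->b_dev = bp->b_dev->lower;
    }
    model_last_serviced = bp->b_dev;
    model_biodone(bp);
}
```

    In this model a hidden raw devsw and a registered disk devsw can
    both use major 13.  model_userland_lookup(13, 0) then resolves to
    the disk-layer device, while the raw device stays reachable only
    through model_find_dev() with its own devsw.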

			    ------- Future stages --------

    The following work is intended for future stages and not present in this
    patch:

    * Separate b_blkno and b_pblkno from the struct buf and place them in
      an attached chain of structures which associate a device layer
      with a cached block number.  i.e. a chain of (dev_t, blkno) pairs.

      This will allow us to arbitrarily layer devices and remove filesystem
      block number special casing within the struct buf structure.
      For example, we would be able to emplace a crypto layer in between
      the disk layer and the raw device, and we would theoretically be
      able to use a normal file (without the VN device) as backing store
      for a 'disk' and even be able to cache the block translations for
      the backing file itself along with everything else.

      The caching of such translations through multiple layers will then be
      possible, but not mandatory.  A layer will be allowed to 'replace'
      an existing cache translation (by overwriting it with a new dev_t and
      blkno) rather than add a new structure, and in fact will if chaining
      structures are not readily available.

    * Implement a userland block device facility.  This would work much like
      pty's in that a userland process would be able to attach to one side
      of the device while the kernel attaches to the other as a block device,
      and requests will be passed back and forth to the userland process.

      This facility would also be capable of 'glueing' a generic file
      descriptor, such as a socket, regular file, or anything else capable
      of stream or block I/O, to the backend side of the device.

      Once we have a userland interface we will be able to write a crypto
      layer, serializing log layer, snapshot layer, network backing store,
      etc etc etc.
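
    As a rough illustration of the (dev_t, blkno) chain from the first
    item above, here is one way the 'add or replace' rule could look in
    a userland model.  This is a sketch under assumptions, not a design
    commitment: the model_* names are invented, an int stands in for
    dev_t, and the caller-supplied 'spare' link stands in for whatever
    allocator the real code would use.

```c
#include <assert.h>
#include <stddef.h>

/* Userland model only: an int stands in for dev_t, and none of these
 * names exist in the kernel. */
struct model_xlate {
    int                 dev_id;  /* device layer this entry belongs to */
    long                blkno;   /* cached block number at that layer */
    struct model_xlate *next;    /* entry for the layer above, or NULL */
};

struct model_buf {
    struct model_xlate  first;   /* embedded entry, always available */
    struct model_xlate *top;     /* most recent translation */
};

/* Start the chain with the block number the top layer issued. */
static void
model_buf_init(struct model_buf *bp, int dev_id, long blkno)
{
    bp->first.dev_id = dev_id;
    bp->first.blkno = blkno;
    bp->first.next = NULL;
    bp->top = &bp->first;
}

/* Record a translation.  With a spare link the new pair is chained on
 * top of the old one; without one the newest pair is overwritten. */
static void
model_cache_xlate(struct model_buf *bp, struct model_xlate *spare,
                  int dev_id, long blkno)
{
    assert(bp->top != NULL);
    if (spare != NULL) {
        spare->next = bp->top;
        bp->top = spare;
    }
    bp->top->dev_id = dev_id;
    bp->top->blkno = blkno;
}

/* Number of cached translations currently attached to the buffer. */
static int
model_chain_depth(const struct model_buf *bp)
{
    int n = 0;

    for (const struct model_xlate *x = bp->top; x != NULL; x = x->next)
        n++;
    return n;
}
```

    Passing a spare link grows the chain by one entry and keeps the
    upper layer's block number cached.  Passing NULL keeps the depth
    the same and replaces the newest pair, which is the fallback the
    item describes when chaining structures are not readily available.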

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>