New brainfart for threaded VFS and data passing between threads.

Matthew Dillon dillon at apollo.backplane.com
Tue Mar 30 15:00:40 PST 2004


    The recent PIPE work adapted from Alan Cox's work in FreeBSD-5 has really
    lit a fire under my seat.  It's amazing how such a simple concept can
    change the world as we know it :-)

    Originally the writer side of the PIPE code was mapping the supplied
    user data into KVM and then signalling the reader side.  The reader
    side would then copy the data out of KVM.

    The concept Alan codified is quite different:  Instead of having the
    originator map the data into KVM, simply supply an array of vm_page_t's
    to the target and let the target map the data into KVM.  In the case 
    of the PIPE code, Alan used the SF_BUF API (which was originally developed
    by David Greenman for the sendfile() implementation) on the target side
    to handle the KVA mappings.
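
    Roughly, the target (reader) side then ends up doing something like
    the following.  This is just an illustration, not the actual pipe
    code; the sf_buf signatures are approximate (they differ slightly
    between FreeBSD and DragonFly) and error handling / page wiring is
    omitted:

	/*
	 * Sketch only: copy data out of a writer-supplied vm_page_t
	 * array on the reader's cpu.  sf_buf_alloc() hands back a
	 * cached KVA mapping for the page, sf_buf_kva() returns its
	 * address, and sf_buf_free() just drops the reference (the
	 * mapping stays cached for later reuse).
	 */
	static int
	pipe_read_pages(vm_page_t *pages, int npages, size_t offset,
			size_t len, struct uio *uio)
	{
		struct sf_buf *sf;
		size_t chunk;
		int i, error = 0;

		for (i = 0; i < npages && len > 0 && error == 0; ++i) {
			chunk = PAGE_SIZE - offset;
			if (chunk > len)
				chunk = len;
			sf = sf_buf_alloc(pages[i]);
			error = uiomove((char *)sf_buf_kva(sf) + offset,
					chunk, uio);
			sf_buf_free(sf);
			len -= chunk;
			offset = 0;
		}
		return (error);
	}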

    Seems simple, eh?  But Alan got an unexpectedly huge boost in performance
    on IA32 when he did this.  The performance boost turned out to be due
    to several factors:

	* Avoiding the full-blown KVM mappings, and the kernel_object
	  manipulations they require, saves a lot of cpu cycles when all
	  you really need is a quick, temporary mapping into KVM.

	* On SMP, KVM mappings generated IPIs to all cpus in order to 
	  invalidate the TLB.  By avoiding KVM mappings all of those IPIs
	  go away.

	* When the target maps the page, it can often get away with doing
	  a simple localized cpu_invlpg().  Most targets will NEVER HAVE TO
	  SEND IPIs TO OTHER CPUS.  The current SF_BUF implementation still
	  sends IPIs in the uncached case, but I had an idea to fix that and
	  Alan agrees that it is sound: store a cpumask in the sf_buf so a
	  user of the sf_buf only invalidates the cached KVM mapping if it
	  has not yet been accessed on that particular cpu (a rough sketch
	  follows this list).

	* For PIPEs, the fact that SF_BUF's cached their KVM mappings
	  reduced the mapping overhead almost to zero.
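
    Here is a minimal sketch of the cpumask idea mentioned above.  It
    assumes a new cpumask field (and an exposed kva field) in struct
    sf_buf, and all of the names are made up for illustration; the point
    is only to show the shape of the check:

	/*
	 * Hypothetical: return the cached KVA for an sf_buf,
	 * invalidating the local TLB entry only the first time this
	 * cpu touches the mapping.  No cross-cpu IPIs are ever needed.
	 */
	static void *
	sf_buf_kva_cached(struct sf_buf *sf)
	{
		u_int mask = 1 << mycpu->gd_cpuid;

		if ((sf->cpumask & mask) == 0) {
			cpu_invlpg((void *)sf->kva);	/* local cpu only */
			atomic_set_int(&sf->cpumask, mask);
		}
		return ((void *)sf->kva);
	}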

    Now when I heard about this huge performance increase I of course
    immediately decided that DragonFly needed this feature too, and so we
    now have it for DFly pipes.

			Light Bulb goes off in head

    But it also got me thinking about a number of other sticky issues we
    face, especially in our desire to thread major subsystems (such as
    Jeff's threading of the network stack and my plan to thread VFS), and
    about how to efficiently pass data between threads and down through
    the I/O subsystem.

    Until now, I think everyone here and in FreeBSD land was stuck on the
    concept of the originator mapping the data into KVM instead of the
    target for most things.  But Alan's work has changed all that.

    This idea of using SF_BUF's and making the target responsible for mapping
    the data has changed everything.  Consider what this can be used for:

    * For threaded VFS we can change the UIO API to a new API (I'll call it
      XIO) which passes an array of vm_page_t's instead of a user process
      pointer and a userspace buffer pointer.

      So 'XIO' would basically be our implementation of target-side mappings
      with SF_BUF capabilities (a rough sketch of what such a descriptor
      might look like follows this list).

    * We can do away with KVM mappings in the buffer cache for the most
      prevalent buffers we cache... those representing file data blocks.
      We would still need them for meta-data and a few other circumstances,
      but the KVM load on the system from the buffer cache would drop by
      something like 90%.

    * We can use the new XIO interface for all block data references from
      userland and get rid of the whole UIO_USERSPACE / UIO_SYSSPACE mess.
      (I'm gunning to get rid of UIO entirely, in fact.)

    * We can use the new XIO interface for the entire I/O path all the way
      down to busdma, yet still retain the option to map the data if/when
      we need to.  I never liked the BIO code in FreeBSD-5; this new XIO
      concept is far superior and will solve the problem neatly in DragonFly.

    * We can eventually use XIO and SF_BUF's to codify copy-on-write at
      the vm_page_t level and no longer stall memory modifications to I/O
      buffers during I/O writes.

    * I will be able to use XIO for our message passing IPC (our CAPS code),
      making it much, much faster than it currently is.  I may do that as
      a second step to prove out the first step (which is for me to create
      the XIO API).

    * Once we have vm_page_t copy-on-write we can recode zero-copy TCP
      to use XIO, and it won't be a hack any more.

    * XIO fits perfectly into the eventual pie-in-the-sky goal of
      implementing SSI/Clustering, because it means we can pass data
      references (vm_page_t equivalents) between machines instead of 
      passing the data itself, and only actually copy the data across
      on the final target.  e.g. if on an SSI system you were to do
      'cp file1 file2', and both file1 and file2 are on the same filesystem,
      the actual *data* transfer might only occur on the machine housing
      the physical filesystem and not on the machine doing the 'cp'.  Not
      one byte.  Can you imagine how fast that would be?
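
    As promised above, here is a rough sketch of what an XIO descriptor
    and its entry points might look like.  All of these names are
    illustrative only, not a committed design; the shape of the idea is
    that the originator pins the backing pages and hands over vm_page_t's,
    and the target maps only what it needs, via SF_BUF's, on its own cpu:

	/* Sketch of a hypothetical XIO descriptor (names not final). */
	#define XIO_INTERNAL_PAGES	btoc(MAXPHYS)

	struct xio {
		vm_page_t xio_pages[XIO_INTERNAL_PAGES]; /* pinned pages */
		int	  xio_npages;	/* number of valid pages */
		int	  xio_offset;	/* byte offset into first page */
		int	  xio_bytes;	/* total byte count */
		int	  xio_error;	/* accumulated error, if any */
	};

	/* Originator side: pin a user buffer's pages, no KVM mapping. */
	int	xio_init_ubuf(struct xio *xio, void *ubase, size_t ubytes,
			      int flags);

	/*
	 * Target side: copy to/from the descriptor, mapping pages via
	 * sf_buf only as needed, on the target's own cpu.
	 */
	int	xio_copy_xtou(struct xio *xio, void *uptr, size_t ubytes);
	int	xio_copy_utox(struct xio *xio, void *uptr, size_t ubytes);

	/* Release the pinned pages when the operation completes. */
	void	xio_release(struct xio *xio);

    A write() through a threaded VFS would then look something like this:
    the syscall layer pins the user buffer with xio_init_ubuf(), the xio
    rides in the message to the filesystem thread, the filesystem (or the
    driver below it) copies the data or hands the pages to busdma, and
    whoever completes the I/O calls xio_release().  No KVM mapping ever
    has to be made unless someone actually needs to look at the data.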

    And XIO can be used for many other things besides.  XIO is the
    nutcracker, and the nut is virtually all the remaining big-ticket
    items we still need to cover in DragonFly.

    This is very exciting to me.

						    -Matt





