just curious

Matthew Dillon dillon at apollo.backplane.com
Mon Jul 21 11:36:16 PDT 2003

:> 	copyfile(int sfd, int dfd, off_t soffset, off_t doffset, off_t bytes)
:I assume this operation would end up being implemented by changing the 
:state of VM buffers, yes?

    "As optimally as possible", whatever that turns out to be.  This type
    of function would have extreme flexibility.  It could copy the data
    as a catch-all, but consider these cases:

    * The target descriptor/object could potentially share the source's
      pages until they are otherwise modified by either side (kinda like how
      fork() works).

      The shared data can be unshared when the target flushes the data to
      backing store, or left shared, depending on what we want to accomplish.

    * In a clustered operating system the backing store would not even have
      to *EXIST* on the machine doing the operation.

      machine 1: cp -r directory1 directory2
	 (directory1 resides on machine2)
	 (directory2 resides on machine3)

      The cp program could potentially issue normal read() and write() calls.
      machine 2 would provide handles backing the buffer mappings, but if
      the cp program does not actually touch the buffer then the write() could
      simply pass the handle to the target (machine 3), which would then
      talk DIRECTLY to machine 2 to obtain the data.

      result: The backing store is copied directly to the machine that needs
	      it, from machine 2 to machine 3.

    * In a clustered operating system consider the case where the two 
      directories reside on machine 2:

      machine 1: cp -r directory1 directory2
	 (directory1 and directory2 reside on machine2)

      In this case the result is that we can potentially optimize the operation
      to a direct SCSI->SCSI DMA op, with no file data touched by any machine's

    * A sophisticated filesystem could be made aware of possible file data
      page sharing.  e.g. when you copy data from one file to another there
      is no reason why both files cannot share those data pages which are
      the same.

      In that case the cp -r operation could 'copy' gigabytes of data without
      actually copying anything other than the meta-data.

      Can you say "Complete filesystem copy in each jail"?  I knew you could!
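
The "share pages until modified" idea from the first case above has a rough user-space analogue in a private file mapping. This is only an illustration under that assumption, not the kernel mechanism the post proposes: the helper name map_shared_until_written is hypothetical, and only the mapping side's writes are isolated here.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: map `len` bytes of fd copy-on-write.
 * Reads are served from the file's pages; the first write to a page
 * gives this mapping its own private copy and leaves the file untouched.
 * Returns NULL on failure.
 */
static char *
map_shared_until_written(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    return (p == MAP_FAILED) ? NULL : (char *)p;
}
```

Writing through the returned pointer changes what the mapping sees but not the file, which is the same deferred-copy behaviour fork() gives a child process.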

    All of this has MAJOR implications for clustering systems because, when
    fully implemented years down the road, it means that you can do a lot of
    work local to a box which does not necessarily have a fast network 
    connection, but which nevertheless is able to cause massive amounts of
    data to be moved about the cluster.

    Of course this stuff can get quite complex.  The base implementation will
    simply be to copy, but having the flexibility to do this sort of thing
    is important and passing around VM objects makes it all possible.
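
A minimal sketch of what the "simply copy" base case could look like, written as a user-space loop over pread() and pwrite(). The argument list follows the copyfile() prototype quoted at the top; the name copyfile_fallback, the 64 KB buffer, and the return convention are assumptions made for illustration, not part of the proposal.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical catch-all path: copy `bytes` bytes from sfd at soffset
 * to dfd at doffset.  Returns the number of bytes copied (short if the
 * source ends early), or -1 on error with errno set by the failing call.
 */
static off_t
copyfile_fallback(int sfd, int dfd, off_t soffset, off_t doffset, off_t bytes)
{
    char buf[65536];
    off_t done = 0;

    while (done < bytes) {
        size_t want = sizeof(buf);
        if ((off_t)want > bytes - done)
            want = (size_t)(bytes - done);

        ssize_t n = pread(sfd, buf, want, soffset + done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the read */
            return -1;
        }
        if (n == 0)             /* source ended before `bytes` */
            break;

        /* pwrite() may be partial; keep going until the chunk is out. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = pwrite(dfd, buf + off, (size_t)(n - off),
                               doffset + done + off);
            if (w < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted, retry the write */
                return -1;
            }
            off += w;
        }
        done += n;
    }
    return done;
}
```

Because the caller only names descriptors, offsets, and a length, an implementation is free to replace this loop with page sharing or a remote transfer without changing the interface.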

    (And now I think Hiten will see why we shouldn't bother porting over
    the zero-copy socket code from 5.x, because we will be able to do it
    trivially once all of these features are put in place).

					    Matthew Dillon 
					    <dillon at xxxxxxxxxxxxx>
