cvs commit: src/sys/i386/i386 bcopy.s bzero.s genassym.c globals.s support.s swtch.s src/sys/i386/include globaldata.h md_var.h src/sys/i386/isa npx.c src/sys/conf files.i386

Thu Apr 29 10:25:59 PDT 2004

dillon      2004/04/29 10:25:03 PDT

DragonFly src repository

  Modified files:
    sys/i386/i386        genassym.c globals.s support.s swtch.s 
    sys/i386/include     globaldata.h md_var.h 
    sys/i386/isa         npx.c 
    sys/conf             files.i386 
  Added files:
    sys/i386/i386        bcopy.s bzero.s 
  Log:
  Rewrite the optimized memcpy/bcopy/bzero support subsystem.  Rip out the
  old FreeBSD code almost entirely.

  * Add support for stacked ONFAULT routines, allowing copyin and copyout to
    call the general memcpy entry point instead of rolling their own.

  * Split memcpy/bcopy and bzero into their own files

  * Add support for XMM (128 bit) and MMX (64 bit) media instruction copies

  * Rewrite the integer code.  Also note that most of the previous integer
    and FP special case support had effectively been ripped out of
    DragonFly long ago, in that the assembly was no longer being
    referenced.  It doesn't make sense to have a dozen different
    zeroing/copying routines, so focus on the ones that work well with
    recent (last ~5 years) CPUs.

  * Rewrite the FP state handling code.  Instead of restoring the FP state,
    let it hang, which allows userland to make multiple syscalls and/or for
    the system to make multiple bcopy()/memcpy() calls without having to
    save/restore the FP state on each call.  Userland will take a fault when
    it needs the FP again.

    Note that FP optimized copies only occur for block sizes >= 2048 bytes,
    so this is not something that userland, or the kernel, will trip up on
    every time it tries to do a bcopy().

  * LWKT threads need to be able to save the FP state, so add the simple
    conditional and 5 lines of assembly required to do that.
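
  The FP state policy described in the list above (media-instruction copies
  only at >= 2048 bytes, and a saved FP state that is left hanging rather
  than restored after each copy) can be modeled roughly as follows.  This is
  a userland sketch in C, not the committed assembly: struct fp_model,
  model_copy() and the other names are hypothetical, and memmove() stands in
  for the real integer/MMX/XMM copy loops.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the policy in the log; not the kernel code. */
#define FP_COPY_THRESHOLD 2048  /* log: FP copies only for blocks >= 2048 */

struct fp_model {
    int user_fp_live;  /* 1 while userland's FP state is loaded in the FPU */
    int fp_saves;      /* number of times that state had to be saved */
    int fp_copies;     /* copies that took the media-instruction path */
    int int_copies;    /* copies that took the integer path */
};

static void model_init(struct fp_model *m)
{
    m->user_fp_live = 1;
    m->fp_saves = 0;
    m->fp_copies = 0;
    m->int_copies = 0;
}

/* Userland needs the FP unit again: it takes a fault and reloads its state. */
static void model_user_fp_fault(struct fp_model *m)
{
    m->user_fp_live = 1;
}

static void *model_copy(struct fp_model *m, void *dst, const void *src,
                        size_t n)
{
    assert(m != NULL);
    if (n >= FP_COPY_THRESHOLD) {
        if (m->user_fp_live) {
            m->fp_saves++;        /* save the user FP state once ...      */
            m->user_fp_live = 0;  /* ... and let it hang, no restore here */
        }
        m->fp_copies++;
    } else {
        m->int_copies++;          /* small blocks never touch the FP unit */
    }
    return memmove(dst, src, n);  /* stand-in for the actual copy loop */
}
```

  In this model, back-to-back large copies cost a single save; a second
  save happens only after model_user_fp_fault() marks the user state as
  live again.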

  AMD Athlon notes: 64 bit media instructions will get us 90% of the way
  there.  It is possible to squeeze out slightly more memory bandwidth from
  the 128 bit XMM instructions (SSE2).  Although they do not exist in this
  commit, there are two additional features that can be used:  prefetching
  and non-temporal writes.  Prefetching is a 3DNow! instruction and can
  squeeze out significant additional performance if you fetch ~128 bytes
  ahead of the game, but I believe it is AMD-only.  Non-temporal writes can
  double UNCACHED memory bandwidth, but they have a horrible effect on L1/L2
  performance and you can't mix non-temporal writes with normal writes
  without completely destroying memory performance (e.g. multiple GB/s ->
  less than 100 MBytes/sec).

  Neither prefetching nor non-temporal writes are implemented in this commit.
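
  For reference, the prefetching idea from the Athlon notes might look like
  the sketch below if it were tried later.  This is not code from this
  commit: copy_with_prefetch() is a hypothetical name, the 64 byte chunk
  size is an assumption, and it uses the GCC/Clang __builtin_prefetch()
  builtin in portable C rather than the 3DNow! instruction itself.  It only
  handles non-overlapping buffers.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_AHEAD 128  /* the "~128 bytes ahead" figure from the notes */
#define CHUNK          64   /* assumed chunk size for this sketch */

/* Hypothetical forward copy that hints the source ~128 bytes ahead. */
static void copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));
    while (n - i >= CHUNK) {
#if defined(__GNUC__) || defined(__clang__)
        /* Only hint addresses that are still inside the source buffer. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(s + i + PREFETCH_AHEAD, 0, 3);
#endif
        for (size_t j = 0; j < CHUNK; j++)
            d[i + j] = s[i + j];
        i += CHUNK;
    }
    for (; i < n; i++)  /* tail shorter than one chunk */
        d[i] = s[i];
}
```

  The prefetch is only a hint, so the copied data is the same with or
  without it; whether it helps on a given CPU would have to be measured.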

  Revision  Changes    Path
  1.37      +1 -0      src/sys/i386/i386/genassym.c
  1.21      +3 -0      src/sys/i386/i386/globals.s
  1.11      +84 -889   src/sys/i386/i386/support.s
  1.32      +24 -3     src/sys/i386/i386/swtch.s
  1.24      +1 -0      src/sys/i386/include/globaldata.h
  1.13      +11 -2     src/sys/i386/include/md_var.h
  1.14      +37 -36    src/sys/i386/isa/npx.c
  1.24      +2 -0      src/sys/conf/files.i386

http://www.dragonflybsd.org/cvsweb/src/sys/i386/i386/genassym.c.diff?r1=1.36&r2=1.37&f=h
http://www.dragonflybsd.org/cvsweb/src/sys/i386/i386/globals.s.diff?r1=1.20&r2=1.21&f=h
http://www.dragonflybsd.org/cvsweb/src/sys/i386/i386/support.s.diff?r1=1.10&r2=1.11&f=h
http://www.dragonflybsd.org/cvsweb/src/sys/i386/i386/swtch.s.diff?r1=1.31&r2=1.32&f=h
http://www.dragonflybsd.org/cvsweb/src/sys/i386/include/globaldata.h.diff?r1=1.23&r2=1.24&f=h
http://www.dragonflybsd.org/cvsweb/src/sys/i386/include/md_var.h.diff?r1=1.12&r2=1.13&f=h
http://www.dragonflybsd.org/cvsweb/src/sys/i386/isa/npx.c.diff?r1=1.13&r2=1.14&f=h
http://www.dragonflybsd.org/cvsweb/src/sys/conf/files.i386.diff?r1=1.23&r2=1.24&f=h