DragonflyBSD fast syscall support and x86 improvements

Fri Jun 23 12:38:17 PDT 2006

:Hi all,
:first of all, I have to say I'm not very familiar with the
:DragonFlyBSD kernel (so I have no idea if this is completely new for
:the project), but I have contributed to the FreeBSD kernel and I'm
:rather experienced with IA32.
:
:What I would like to work on are some x86 'improvements' to the DFLY
:kernel. By improvements I mean adding different versions of some
:critical functions for different versions of the CPU (e.g. on a P4 you
:could implement an atomic memory barrier using mov + *fence
:instructions instead of xchg, or stubs like these).
:
:First of all, a good discussion point is how the optimizations are
:activated: do you prefer compile-time stubs or run-time patching?
:Actually, I'm planning to improve a run-time patching concept
:inherited from Linux and add it to FreeBSD, and maybe you would be
:interested in a port. Run-time patching is very useful, but on the
:other hand it faces a lot of problems (inlined functions don't fit
:well and there are size problems) if we don't want to lose performance.
:Compile-time stubs are simpler and possibly quicker, but the kernel
:needs to be recompiled in order to get the benefits, which is not so nice.
:So, some feedback about this would be very appreciated.

    Well, insofar as performance goes I only worry about 'modern' cpus,
    meaning relatively recent cpus (last few years).  So, for example,
    I would consider trying to optimize for a P4 a waste of time.

    I do have some concerns re: getting reasonably equitable performance
    across cpu vendors.  My main experience is with AMD parts, and
    generally speaking all synchronizing instructions cost about the same.
    An SFENCE operation, for example, costs exactly the same as a locked
    xchgl when the cpu has data it needs to flush (which is always if you
    are trying to integrate a memory barrier with a locking mechanism of
    some sort).

    Procedural patching has some merit, especially since both the CALL and
    RET can be subsumed by the branch prediction cache, but my preference
    is to inline a reasonably optimal operation and have it conditionally
    call a procedure if it 'fails'.  For example, take a look at our
    spinlock functions in sys/spinlock2.h.
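
    The inline-then-call pattern described above can be sketched roughly
    as follows.  This is a minimal illustration using C11 atomics, not the
    code in sys/spinlock2.h; the demo_* names and the lock layout are
    hypothetical.  The inline part is a single atomic exchange (typically
    a locked xchg on x86), and the out-of-line procedure is only called
    when that first attempt fails.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type; not the layout used in sys/spinlock2.h. */
struct demo_spinlock {
	atomic_int locked;		/* 0 = free, 1 = held */
};

/* Out-of-line slow path: only reached when the inline attempt fails. */
void
demo_spin_lock_contested(struct demo_spinlock *sp)
{
	for (;;) {
		/* Spin on plain loads so waiters do not keep issuing locked ops. */
		while (atomic_load_explicit(&sp->locked,
		    memory_order_relaxed) != 0)
			;
		if (atomic_exchange_explicit(&sp->locked, 1,
		    memory_order_acquire) == 0)
			return;
	}
}

/* Inline fast path: one atomic exchange, then a conditional call. */
static inline void
demo_spin_lock(struct demo_spinlock *sp)
{
	if (atomic_exchange_explicit(&sp->locked, 1,
	    memory_order_acquire) != 0)
		demo_spin_lock_contested(sp);
}

/* Release: a store with release ordering publishes the protected data. */
static inline void
demo_spin_unlock(struct demo_spinlock *sp)
{
	atomic_store_explicit(&sp->locked, 0, memory_order_release);
}
```

    In the uncontested case the caller pays for the exchange and one
    well-predicted branch, with no CALL/RET at all.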

    Procedural patching makes more sense for more complex procedures such
    as bcopy(), and we use it for that.
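
    For a larger routine such as bcopy(), the idea of choosing an
    implementation per cpu can be sketched with a function pointer that
    is set once at startup.  Note that this is a stand-in: an indirect
    call through a pointer is not the same mechanism as patching the call
    target, and every demo_* name and the cpu_has_fast_copy flag below
    are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bcopy-style signature: source first, then destination. */
typedef void (*demo_bcopy_fn)(const void *src, void *dst, size_t len);

/* Portable byte-wise copy that tolerates overlapping regions. */
static void
demo_bcopy_generic(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	size_t i;

	if ((uintptr_t)d <= (uintptr_t)s || (uintptr_t)d >= (uintptr_t)s + len) {
		for (i = 0; i < len; i++)	/* forward copy is safe */
			d[i] = s[i];
	} else {
		for (i = len; i > 0; i--)	/* overlap: copy backwards */
			d[i - 1] = s[i - 1];
	}
}

/* Placeholder for a cpu-specific routine; here it just uses memmove(). */
static void
demo_bcopy_tuned(const void *src, void *dst, size_t len)
{
	memmove(dst, src, len);
}

/* The active implementation; defaults to the generic one. */
static demo_bcopy_fn demo_bcopy_vector = demo_bcopy_generic;

/* Called once at startup with a (hypothetical) cpu feature flag. */
void
demo_bcopy_select(int cpu_has_fast_copy)
{
	demo_bcopy_vector = cpu_has_fast_copy ?
	    demo_bcopy_tuned : demo_bcopy_generic;
}

/* Callers always go through the vector. */
void
demo_bcopy(const void *src, void *dst, size_t len)
{
	demo_bcopy_vector(src, dst, len);
}
```

    The per-call cost is one indirect call, which matters little for a
    routine that does as much work as a block copy.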

:Once we have chosen a method for applying changes, the
:first thing I would like to add (BTW, I don't know if it exists
:already) is sysenter/sysexit support replacing interrupt 0x80 (I have
:an item in the FreeBSD list for volunteers about it, since I think I
:would like to add it there too)

    Hmm.  Well, I'm not keen on the idea unless a significant savings
    in time, at least 50ns, can be demonstrated.  Even with
    SYSENTER/SYSEXIT we still have to save most registers.

    However, I do think this might be viable if combined with argument
    registerization as you describe below.  I am *NOT* keen on using
    FP registers for this, though.  If a system call has too many arguments
    to fit into normal registers I'd rather just leave them on the stack.

:and possibly evaluating the usage of
:FPU/MMX/XMM registers in order to gather syscall parameters.
:Feedback about it is very appreciated too. I have other ongoing
:projects we can discuss later, but they are very architecture
:dependent, so while they may fit in FreeBSD, they may not fit in
:DragonFlyBSD.
:
:Thanks for your time,
:Attilio

    I think some degree of registerization might be beneficial, but with
    the proviso that all we are doing here is passing the system call
    arguments themselves in registers, not attempting to reimplement
    the libc functions as registerized calls.

    So for example the libc 'read' system call would look something like
    this:

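    read:	(libc)
	pushl	%ebx			# %ebx is callee-saved, preserve it
	movl	8(%esp),%edx		# arg 1 (fd)
	movl	12(%esp),%ecx		# arg 2 (buf)
	movl	16(%esp),%ebx		# arg 3 (nbytes)
	movl	$SYS_read,%eax		# system call number
	sysenter/syscall/whatever
	popl	%ebx			# restore %ebx saved above
	ret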

    The kernel entry point would save the register set as per normal,
    using 'pushal', but instead of then doing a copyin() of the arguments
    the kernel syscall2() function would simply map them from the pushed
    registers.

    However, I have a proviso on doing things this way:  the system
    call arguments MUST match existing syscall argument structures.  That
    is, whatever order 'pushal' pushes registers onto the stack, mapping
    them into memory, must match the argument order that we already
    specify in our system call argument structure for any given system
    call.
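
    To make the ordering constraint concrete: pushal pushes EAX, ECX,
    EDX, EBX, the original ESP, EBP, ESI and EDI in that order, so
    reading upward from the final stack pointer the image is EDI, ESI,
    EBP, ESP, EBX, EDX, ECX, EAX.  The sketch below models that image as
    a C struct and pulls three argument words out of it.  The demo_*
    names and the three-word argument structure are hypothetical, and
    which register carries which argument is an assumption that simply
    follows the pushal layout; it is not a settled ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Register image left on the stack by pushal, lowest address first. */
struct demo_pushal_frame {
	uint32_t edi, esi, ebp, esp, ebx, edx, ecx, eax;
};

/* Hypothetical argument structure for read(fd, buf, nbyte), 32-bit slots. */
struct demo_read_args {
	uint32_t fd;
	uint32_t buf;
	uint32_t nbyte;
};

/* The three slots starting at ebx must line up with the argument struct. */
static_assert(offsetof(struct demo_pushal_frame, edx) ==
    offsetof(struct demo_pushal_frame, ebx) + sizeof(uint32_t),
    "edx must directly follow ebx");
static_assert(offsetof(struct demo_pushal_frame, ecx) ==
    offsetof(struct demo_pushal_frame, ebx) + 2 * sizeof(uint32_t),
    "ecx must directly follow edx");

/*
 * Map the arguments from the saved registers instead of doing a copyin().
 * Under this layout arg 1 arrives in ebx, arg 2 in edx and arg 3 in ecx.
 */
static struct demo_read_args
demo_args_from_frame(const struct demo_pushal_frame *f)
{
	struct demo_read_args args;

	memcpy(&args,
	    (const unsigned char *)f + offsetof(struct demo_pushal_frame, ebx),
	    sizeof(args));
	return args;
}
```

    A real kernel could point at the saved registers in place; the copy
    here only keeps the example well-defined as portable C.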

    The exception code would almost certainly also have to reserve space
    on the stack for the rest of the system call structure... see
    sys/sysproto.h for an example.  The 'header' (struct sysmsg) needs
    space reserved so the exception code can provide a pointer to the
    '*_args' structure directly to syscall2() (or whatever).
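
    A rough sketch of why the header space matters, assuming the
    argument structure begins with the header and the arguments follow
    it.  The fields of demo_sysmsg, the demo_* function names and the
    error value are all made up for illustration; see sys/sysproto.h for
    the real structures.

```c
#include <assert.h>
#include <stdint.h>

enum { DEMO_EINVAL = 22 };		/* illustrative error number */

/* Hypothetical header; the real struct sysmsg has different fields. */
struct demo_sysmsg {
	int32_t result;
	int32_t error;
};

/* Header first, then the arguments in their existing order. */
struct demo_read_msg {
	struct demo_sysmsg sysmsg;	/* space reserved by the entry code */
	uint32_t fd;
	uint32_t buf;
	uint32_t nbyte;
};

/* Stand-in for a system call handler: takes one pointer to the struct. */
static int
demo_sys_read(struct demo_read_msg *uap)
{
	if (uap->nbyte > INT32_MAX)
		return DEMO_EINVAL;
	/* Pretend the whole request was satisfied. */
	uap->sysmsg.result = (int32_t)uap->nbyte;
	return 0;
}

/* Stand-in for the dispatch step performed after the registers are saved. */
static int
demo_dispatch_read(uint32_t fd, uint32_t buf, uint32_t nbyte, int32_t *result)
{
	struct demo_read_msg msg = {
		.sysmsg = { 0, 0 },
		.fd = fd, .buf = buf, .nbyte = nbyte
	};

	msg.sysmsg.error = demo_sys_read(&msg);
	*result = msg.sysmsg.result;
	return msg.sysmsg.error;
}
```

    Because the header and the arguments are one contiguous object, the
    handler receives a single pointer and has a place to store its result.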

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>