DragonFlyBSD fast syscall support and x86 improvements
dillon at apollo.backplane.com
Fri Jun 23 12:38:17 PDT 2006
:First of all, I have to say I'm not very familiar with the
:DragonFlyBSD kernel (so I have no idea if this is completely new for
:the project), but I have contributed some to the FreeBSD kernel and I'm
:rather experienced with IA32.
:What I would like to work on are some x86 'improvements' to the DFLY
:kernel. By improvements I mean adding different versions of some
:critical functions for different versions of the CPU (i.e.: in P4 you
:could just implement an atomic memory barrier using mov + *fence
:instructions against using xchg or stubs like these).
:First of all, a good discussion point is how optimizations are
:activated: do you prefer compile-time stubs or run-time patching?
:Actually, I'm planning to improve and add a run-time patching concept
:inherited from Linux to FreeBSD, and maybe you would be interested in
:its port. Run-time patching is very useful, but on the other hand it
:faces a lot of problems (inlined functions can't fit and it has to deal
:with size problems) if we don't want to lose performance.
:Compile-time stubs are simpler and possibly quicker, but the kernel
:needs recompilation in order to benefit, which is not so nice.
:So, some feedback about this would be much appreciated.
Well, insofar as performance goes I only worry about 'modern' cpus,
meaning relatively recent cpus (last few years). So, for example,
I would consider trying to optimize for a P4 a waste of time.
I do have some concerns re: getting reasonably equitable performance
across cpu vendors. My main experience is with AMD parts, and
generally speaking all synchronizing instructions cost about the same.
An SFENCE operation, for example, costs exactly the same as a locked
xchgl when the cpu has data it needs to flush (which is always if you
are trying to integrate a memory barrier with a locking mechanism of
Procedural patching has some merit, especially since both the CALL and
RET can be subsumed by the branch prediction cache, but my preference
is to inline a reasonably optimal operation and have it conditionally
call a procedure if it 'fails'. For example, take a look at our
spinlock functions in sys/spinlock2.h.
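
A minimal sketch of the "inline fast path, call a procedure on failure"
pattern described above, written with portable C11 atomics. The demo_*
names and the lock layout are hypothetical and are not the code in
sys/spinlock2.h; the point is only that the uncontested case is a single
inlined atomic exchange, and the out-of-line function is reached only
when that exchange fails.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type for illustration; not the sys/spinlock2.h layout. */
struct demo_spinlock {
    atomic_int locked;          /* 0 = free, 1 = held */
};

/* Out-of-line slow path: only reached when the inline attempt fails. */
static void
demo_spin_lock_contested(struct demo_spinlock *spin)
{
    while (atomic_exchange_explicit(&spin->locked, 1,
                                    memory_order_acquire) != 0)
        ;                       /* spin until the holder releases the lock */
}

/* Inline fast path: one atomic exchange when the lock is uncontested. */
static inline void
demo_spin_lock(struct demo_spinlock *spin)
{
    if (atomic_exchange_explicit(&spin->locked, 1,
                                 memory_order_acquire) != 0)
        demo_spin_lock_contested(spin);
}

/* Non-blocking attempt: returns 1 if the lock was obtained, 0 otherwise. */
static inline int
demo_spin_trylock(struct demo_spinlock *spin)
{
    return atomic_exchange_explicit(&spin->locked, 1,
                                    memory_order_acquire) == 0;
}

/* Release; the release ordering publishes writes made under the lock. */
static inline void
demo_spin_unlock(struct demo_spinlock *spin)
{
    assert(atomic_load_explicit(&spin->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&spin->locked, 0, memory_order_release);
}
```

On x86 a compiler would typically lower the exchange to a locked xchg,
so the barrier and the lock acquisition are the same instruction.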
Procedural patching makes more sense for more complex procedures such
as bcopy(), and we use it for that.
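
As an illustration of selecting a bcopy() variant at run time, here is a
sketch that uses a function-pointer vector chosen once at startup. This
is a simpler stand-in for true code patching, not a description of how
the kernel actually does it. All demo_* names and the
cpu_has_fast_copy flag are hypothetical; the "fast" variant just
forwards to memmove() as a placeholder for a CPU-specific routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* bcopy() argument order: source first, then destination. */
typedef void (*demo_bcopy_fn)(const void *src, void *dst, size_t len);

/* Generic byte-at-a-time copy that tolerates overlapping regions. */
static void
demo_bcopy_generic(const void *src, void *dst, size_t len)
{
    const unsigned char *s = src;
    unsigned char *d = dst;

    if ((uintptr_t)d < (uintptr_t)s) {
        while (len--)
            *d++ = *s++;        /* copy forward */
    } else {
        while (len--)
            d[len] = s[len];    /* copy backward */
    }
}

/* Placeholder for a CPU-specific routine (assumption, not real code). */
static void
demo_bcopy_fast(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

/* The vector callers go through; starts out pointing at the generic copy. */
static demo_bcopy_fn demo_bcopy_vector = demo_bcopy_generic;

/* Called once at startup with a (hypothetical) CPU feature flag. */
static void
demo_bcopy_init(int cpu_has_fast_copy)
{
    demo_bcopy_vector = cpu_has_fast_copy ? demo_bcopy_fast
                                          : demo_bcopy_generic;
}

static inline void
demo_bcopy(const void *src, void *dst, size_t len)
{
    assert(demo_bcopy_vector != NULL);
    demo_bcopy_vector(src, dst, len);
}
```

The indirect call costs more than a patched direct call would, which is
why this kind of dispatch suits larger procedures rather than small
inlined primitives.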
:Once we have chosen a method for applying changes, the
:first thing I would like to add (BTW, I don't know if it exists
:already) is sysenter/sysexit support replacing interrupt 0x80 (I have
:an item in the FreeBSD list for volunteers about it, since I think I
:would like to add it there too)
Hmm. Well, I'm not keen on the idea unless a significant savings
in time, at least 50ns, can be demonstrated. Even with
SYSENTER/SYSEXIT we still have to save most registers.
However, I do think this might be viable if combined with argument
registerization as you describe below. I am *NOT* keen on using
FP registers for this, though. If a system call has too many arguments
to fit into normal registers I'd rather just leave them on the stack.
:and possibly evaluating the usage of
:FPU/MMX/XMM registers in order to gather syscalls parameters.
:Feedback about it is much appreciated too. I have other ongoing
:projects we can discuss later, but they are very architecture
:dependent so, since maybe they can fit in FreeBSD, they cannot in
:Thanks for your time,
I think some degree of registerization might be beneficial, but with
the proviso that all we are doing here is passing the system call
arguments themselves in registers, not attempting to reimplement
the libc functions as registerized calls.
So for example the libc 'read' system call would look something like
The kernel entry point would save the register set as per normal,
using 'pushal', but instead of then doing a copyin() of the arguments
the kernel syscall2() function would simply map them from the pushed
However, I have a proviso on doing things this way: the system
call arguments MUST match existing syscall argument structures. That
is, whatever order 'pushal' pushes registers onto the stack, mapping
them into memory, must match the argument order that we already
specify in our system call argument structure for any given system
The exception code would almost certainly also have to reserve space
on the stack for the rest of the system call structure... see
sys/sysproto.h for an example. The 'header' (struct sysmsg) needs
space reserved so the exception code can provide a pointer to the
'*_args' structure directly to syscall2() (or whatever).
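
To make the layout constraint concrete, here is a hedged sketch of how
the registers saved by a 32-bit 'pushal' sit in memory and how an
argument structure could be overlaid on them without a copy. 'pushal'
pushes EAX, ECX, EDX, EBX, the original ESP, EBP, ESI and EDI in that
order, so in ascending memory order the saved values are EDI, ESI, EBP,
ESP, EBX, EDX, ECX, EAX. The demo_* types and the choice of EBX/EDX/ECX
for read()'s three arguments are assumptions for illustration only; the
real '*_args' structures in sys/sysproto.h also carry a struct sysmsg
header, which is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saved registers in ascending memory order after a 32-bit 'pushal'. */
struct demo_pushal_frame {
    uint32_t edi, esi, ebp, esp, ebx, edx, ecx, eax;
};

/* Hypothetical 32-bit argument block for read(fd, buf, nbyte). */
struct demo_read_args {
    uint32_t fd;
    uint32_t buf;       /* user-space address, kept as an integer here */
    uint32_t nbyte;
};

/* The overlay only works if field order matches register order exactly. */
static_assert(offsetof(struct demo_pushal_frame, edx) ==
              offsetof(struct demo_pushal_frame, ebx) +
              offsetof(struct demo_read_args, buf),
              "edx must line up with buf");
static_assert(offsetof(struct demo_pushal_frame, ecx) ==
              offsetof(struct demo_pushal_frame, ebx) +
              offsetof(struct demo_read_args, nbyte),
              "ecx must line up with nbyte");

/* Map the argument block onto the saved registers instead of copying. */
static inline const struct demo_read_args *
demo_read_args_from_frame(const struct demo_pushal_frame *frame)
{
    const char *base = (const char *)frame;

    return (const struct demo_read_args *)
        (const void *)(base + offsetof(struct demo_pushal_frame, ebx));
}
```

Under this assumed convention a user-side stub would have to load fd
into EBX, buf into EDX and nbyte into ECX, because the register order
is fixed by 'pushal' and the structure order is fixed by the existing
syscall definitions.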
<dillon at xxxxxxxxxxxxx>