DragonflyBSD fast syscall support and x86 improvements

Fri Jun 23 14:14:32 PDT 2006

2006/6/23, Matthew Dillon <dillon at xxxxxxxxxxxxxxxxxxxx>:
   Well, insofar as performance goes I only worry about 'modern' cpus,
   meaning relatively recent cpus (last few years).  So, for example,
   I would consider trying to optimize for a P4 a waste of time.
I tend to agree with you. Experience, more than the technical
documents, teaches us that optimizations for earlier CPUs don't
provide enough improvement to justify the amount of work.
   I do have some concerns re: getting reasonably equitable performance
   across cpu vendors.  My main experience is with AMD parts, and
   generally speaking all synchronizing instructions cost about the same.
   An SFENCE operation, for example, costs exactly the same as a locked
   xchgl when the cpu has data it needs to flush (which is always if you
   are trying to integrate a memory barrier with a locking mechanism of
   some sort).
Yes, this is true for Intel CPUs too (I have some patches for FreeBSD
that replace the memory barrier code with a P4-suitable version, and
my benchmarks show practically the same timings for both kernels).
However, this is just an example I gave to better show what I mean. I
think one of the most interesting applications of the AMD/Intel
extensions is using non-temporal hints on SMP architectures, in order
to reduce cache/memory snoop traffic when they are used heavily.
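
As a rough illustration of what "non-temporal hints" means in
practice, here is a minimal sketch using the SSE2 streaming-store
intrinsic. The helper name nt_copy and its alignment requirements are
assumptions made for this example, not code from either kernel, and on
a build without SSE2 it simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: copy len bytes using non-temporal stores.
 * Assumes dst is 16-byte aligned and len is a multiple of 16.
 */
static void nt_copy(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < len / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i); /* ordinary (cached) load */
        _mm_stream_si128(d + i, v);         /* store with non-temporal hint */
    }
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    memcpy(dst, src, len); /* no SSE2 available: plain copy */
#endif
}
```

Whether this actually reduces bus traffic depends on the workload and
the CPU, so it would need to be benchmarked rather than assumed.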
   Procedural patching has some merit, especially since both the CALL and
   RET can be subsumed by the branch prediction cache, but my preference
   is to inline a reasonably optimal operation and have it conditionally
   call a procedure if it 'fails'.  For example, take a look at our
   spinlock functions in sys/spinlock2.h.
   Procedural patching makes more sense for more complex procedures such
   as bcopy(), and we use it for that.
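
The "inline the fast path, call a procedure only on failure" pattern
described above can be sketched as follows. This is not the code from
sys/spinlock2.h; the sketch_* names are hypothetical and the sketch
uses C11 atomics purely for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

struct sketch_spinlock {
    atomic_int locked; /* 0 = free, 1 = held */
};

static void sketch_spin_init(struct sketch_spinlock *l)
{
    atomic_init(&l->locked, 0);
}

/*
 * Slow path, meant to live out of line. A kernel would typically mark
 * it noinline; here the compiler is free to decide.
 */
static void sketch_spin_lock_contested(struct sketch_spinlock *l)
{
    while (atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) != 0) {
        /* Spin on plain loads until the lock looks free again. */
        while (atomic_load_explicit(&l->locked,
                                    memory_order_relaxed) != 0)
            ;
    }
}

/* Fast path: one inlined atomic exchange, a call only if it fails. */
static inline void sketch_spin_lock(struct sketch_spinlock *l)
{
    if (atomic_exchange_explicit(&l->locked, 1,
                                 memory_order_acquire) != 0)
        sketch_spin_lock_contested(l);
}

/* Returns 1 if the lock was acquired, 0 if it was already held. */
static inline int sketch_spin_trylock(struct sketch_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1,
                                    memory_order_acquire) == 0;
}

static inline void sketch_spin_unlock(struct sketch_spinlock *l)
{
    /* Debug check: unlocking a lock that is not held is a bug. */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) != 0);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

In the uncontended case the caller pays for a single inlined atomic
operation and no CALL/RET at all.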
This is not what I meant.
By run-time patching I meant a method that doesn't pay any
performance penalty. Using a function pointer, for example, adds an
extra call/ret that wastes pipelines and prefetch buffers (please note
that a jmp doesn't solve your problems here).
A dumb way to do real run-time patching is something like this (in
pseudocode):
if (cpuid & BIT_SSE2) {
   /* the replacement must fit in the space being patched */
   if (dest.size < orig.size)
       panic();
   memcpy(dest.addr, orig.addr, orig.size);
}
What I have in mind is a little more elaborate, but it can't
solve the problem of inlined functions.
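
To make the run-time patching idea concrete, here is a small sketch
that operates on an ordinary byte buffer standing in for a code
region. The names patch_region and patch_site, the feature flag
argument, and the choice to pad leftover bytes with NOPs are all
assumptions of this example. A real kernel would additionally have to
make its text writable and serialize instruction fetch, none of which
is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define X86_NOP 0x90 /* one-byte x86 NOP opcode */

/* Hypothetical descriptor for a run of code bytes. */
struct patch_region {
    unsigned char *addr;
    size_t size;
};

/*
 * Overwrite the generic code at dest with the replacement in repl
 * when the CPU feature is present.
 * Returns 1 if patched, 0 if left untouched, -1 if repl does not fit
 * (the point where the pseudocode in the mail would panic).
 */
static int patch_site(struct patch_region dest, struct patch_region repl,
                      int cpu_has_feature)
{
    assert(dest.addr != NULL && repl.addr != NULL);

    if (!cpu_has_feature)
        return 0; /* keep the generic version */
    if (repl.size > dest.size)
        return -1; /* replacement too large for the site */

    memcpy(dest.addr, repl.addr, repl.size);
    /* Fill the remainder so no stale instruction bytes are left. */
    memset(dest.addr + repl.size, X86_NOP, dest.size - repl.size);
    return 1;
}
```

Because the bytes at the call site itself are rewritten, no function
pointer or extra call/ret is involved once patching is done.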
For my part, I favor compile-time stubs, since I think that if
somebody really cares about performance, recompiling the kernel is
the better option.
:Once we have chosen a method for applying changes, the
:first thing I would like to add (BTW, I don't know if it already
:exists) is sysenter/sysexit support replacing interrupt 0x80 (I have
:an item on the FreeBSD list for volunteers about it, since I think I
:would like to add it there too)
   Hmm.  Well, I'm not keen on the idea, unless a significant savings
   in time -- at least 50ns, can be demonstrated.  Even with
   SYSENTER/SYSEXIT we still have to save most registers.
Yes, this is true, but we completely replace the IDT/GDT code, which
is the real bottleneck. Since the stack is in L1 when we pusha, we
don't need to worry too much about it.
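
Since the discussion hinges on whether a saving of around 50ns can be
demonstrated, a crude way to get a baseline is to time a trivial
system call in a loop. The sketch below is only an assumption about
how one might measure it: the function name syscall_cost_ns is
hypothetical, and getppid() is used simply as a cheap call that libc
does not normally cache.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical micro-benchmark: average wall-clock cost, in
 * nanoseconds, of one getppid() round trip over iters iterations.
 * The figure includes loop and libc wrapper overhead.
 */
static double syscall_cost_ns(long iters)
{
    struct timespec t0, t1;

    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        (void)getppid();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Comparing this number between a kernel using int 0x80 and one using
sysenter/sysexit would show whether the difference reaches the
threshold; the result is machine dependent, so no expected value is
given here.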
   However, I do think this might be viable if combined with argument
   registerization as you describe below.  I am *NOT* keen on using
   FP registers for this, though.  If a system call has too many arguments
   to fit into normal registers I'd rather just leave them on the stack.
It's not clear to me whether you want to use only GP registers, or
just not the FPU in particular (BTW, I mentioned FPU registers only
for completeness; I have to think more about how to implement quick
gathering of syscall parameters).
:and possibly evaluating the use of
:FPU/MMX/XMM registers in order to gather syscall parameters.
:Feedback on this is very much appreciated too. I have other ongoing
:projects we can discuss later, but they are very architecture
:dependent so, while they may fit in FreeBSD, they may not fit in
:DragonflyBSD.
:
:Thanks for your time,
:Attilio
   I think some degree of registerization might be beneficial, but with
   the proviso that all we are doing here is passing the system call
   arguments themselves in registers, not attempting to reimplement
   the libc functions as registerized calls.
Why not? The point is to avoid touching the userspace stack on the
program side and the kernel stack on the kernel side. Even if we can
reasonably assume the stacks are in L1 for successive hits... (so the
improvement is not so important here...).
Attilio

--
Peace can only be achieved by understanding - A. Einstein