cvs commit: src/sys/kern imgact_elf.c init_main.c kern_checkpoint.c kern_descrip.c kern_event.c sys_generic.c sys_pipe.c uipc_syscalls.c uipc_usrreq.c vfs_aio.c vfs_syscalls.c src/sys/sys filedesc.h src/sys/dev/misc/streams streams.c ...

Wed Jun 22 14:49:49 PDT 2005

> If we care more about performance we could make it 16 bytes
> so the array lookup is scaled by a factor of two again, which 
> I might just do actually because that removes the overhead of 
> a multiply.

Call me a naive assembly programmer, but...

I doubt it really matters at all.  Some numbers:
PII, PIII:
  add r,r         1 cycle
  shl r,i         "free"
  lea r,[r+r*i]   1 cycle
  mul             4 cycles

P4:
  add r,r         0.5 (*)
  shl r,i         4!!!
  lea r,[r+r*i]   4
  mul             16
  mov r,r         0.5

  (*) an instruction that depends on the result may overlap with it,
      so the cost is amortized

Now x*12 = x*8+x*4 = 4*(2*x+x)

so we can expect on PII, PIII:
(multiples of 12)
  shl eax, 2
  lea eax, [eax+eax*2]
  (1 cycle)
(multiples of 16)
  shl eax, 4
  ("free")

on the P4:
(multiples of 12)
   add eax, eax
   add eax, eax
   mov ebx, eax
   add ebx, eax
   add ebx, eax
   (2.5 cycles)
(multiples of 16)
   add eax, eax
   add eax, eax
   add eax, eax
   add eax, eax
   (2 cycles)
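
For anyone who wants to check the arithmetic rather than take my word
for it, here is a rough C sketch of the sequences above. The function
names are made up for illustration, and it says nothing about cycle
counts; it only confirms that the shift/lea form and the add-only
chains really do produce x*12 and x*16.

```c
#include <assert.h>
#include <stdint.h>

/* x*12 the PII/PIII way: shl eax,2 followed by lea eax,[eax+eax*2]. */
static uint32_t scale12_shift_lea(uint32_t x)
{
    x <<= 2;              /* shl eax, 2           -> 4x  */
    return x + x * 2;     /* lea eax,[eax+eax*2]  -> 12x */
}

/* x*12 the P4 way: four adds and one register copy. */
static uint32_t scale12_adds(uint32_t x)
{
    uint32_t eax = x, ebx;

    eax += eax;           /* 2x  */
    eax += eax;           /* 4x  */
    ebx = eax;            /* mov ebx, eax */
    ebx += eax;           /* 8x  */
    ebx += eax;           /* 12x */
    return ebx;
}

/* x*16 the P4 way: four doublings. */
static uint32_t scale16_adds(uint32_t x)
{
    x += x;               /* 2x  */
    x += x;               /* 4x  */
    x += x;               /* 8x  */
    x += x;               /* 16x */
    return x;
}

/* Compare each sequence against a plain multiply for indices below n. */
static void check_scaling(uint32_t n)
{
    for (uint32_t i = 0; i < n; i++) {
        assert(scale12_shift_lea(i) == i * 12u);
        assert(scale12_adds(i) == i * 12u);
        assert(scale16_adds(i) == i * 16u);
    }
}
```

Calling check_scaling(4096) from a small test program is enough to
cover any realistic descriptor table index.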

Note: in none of these cases do we ever actually MUL, nor
does any decent compiler.

*Everyone*, not just assembly programmers, should read

http://www.agner.org/assem/pentopt.pdf

which I find is the single most concise and effective guide
to understanding the costs of "standard operations". Sorry,
no commentary on SMP situations there.

Also, I'm sure there are naysayers about optimizing for x86,
but architecture-independent design for performance is
nonsense.  Try running some of those NetBSD benchmarks on non-Intel
hardware and watch the performance just drop to the floor.

-Jon