[PATCH] Suggested FreeBSD merge

Mon Nov 15 12:20:32 PST 2004

On Mon, Nov 15, 2004 at 10:24:18AM -0800, Matthew Dillon wrote:
> :The address was just a value off-hand. I think we can differentiate between
> :(a) the application ABI and (b) the kernel version to be mapped.
> :
> :From the application point-of-view, having a fixed address is very useful,
> :because it allows the compiler to skip the overhead of Position Independent
> :Code, esp. the GOT/PLT setup. Since this should be used for sensitive
> :low-level routines, it makes sense to skip this.
>     
>     I'm not sure I understand what you mean here.  I see only three ways to do
>     this.  Using strlen() as a contrived example.  The first way I don't
>     think we can do because it makes strlen() a function pointer rather than
>     a function.  It would be something like:
> 
> 	#define __section(name) __attribute__((__section__(name))) 
> 
> 	__section(".klib-dragonfly01") size_t (* const strlen)(const char *);
> 
>     This would generate code as follows.  This code would be AS FAST as a
>     direct jump due to the branch prediction cache.  That is, the 
>     movl strlen,%ebx + call combination will take no longer than call strlen
>     would take.
> 
> 	movl strlen,%ebx
> 	call *%ebx

That's the problem. The movl strlen,%ebx only works if strlen is at a static
address. Otherwise the code has to look the pointer up in the GOT first,
which typically means two more instructions. Leaving out the normal function
init, it would be something like this:
	movl strlen@GOT(%ebx), %eax
	movl (%eax), %eax
	call *%eax
The normal calling sequence for PIC is:
	call strlen@PLT
with strlen@PLT being translated into a relative address, which contains:
	jmp strlen
(the real address, somewhat simplified)
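
To see the difference in practice, a small C file like the following can be
compiled with and without -fPIC and the generated assembly compared (cc -S).
This is only a sketch: klib_strlen, len_via_pointer and len_direct are
hypothetical names, and the exact instructions depend on compiler, version
and flags.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the kernel-provided pointer from the quoted
 * example. It is left non-const on purpose so the compiler has to load
 * it at run time instead of folding the call into a direct one.
 */
size_t (*klib_strlen)(const char *) = strlen;

/*
 * Indirect call: load the pointer, then call through it. In PIC the
 * address of the pointer itself is typically fetched from the GOT first.
 */
size_t
len_via_pointer(const char *s)
{
	return klib_strlen(s);
}

/* Direct call: in PIC this typically becomes a call through the PLT. */
size_t
len_direct(const char *s)
{
	return strlen(s);
}
```

Both functions return the same result; only the calling sequence differs.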

>     A second way of doing this is a call/jump:
[this is what you end up with in PIC]

[skip beginning of fixed address discussion]
>     I just don't see this being viable generally without some significant
>     work.  The only way I see a direct-call model working is if the 
>     direct-call code reserved a fixed amount of space for each function
>     so the offsets are well known, and if the function is too big to fit
>     in the reserved space the space would be loaded with a JMP to the
>     actual function instead.

Exactly. The location of the mapping can be considered part of the ABI,
with the best location being at the bottom of the virtual address space,
I guess.
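
With the mapping location fixed by the ABI, the address of each routine is
plain arithmetic. A minimal sketch, assuming fixed-size slots: the base
address, the slot numbers and all KLIB_* / klib_* names are made up for
illustration, and the 64-byte slot size is taken from the example quoted
further down.

```c
#include <stdint.h>

#define KLIB_BASE	0x1000u	/* hypothetical start of the mapping */
#define KLIB_SLOT_SIZE	64u	/* bytes reserved per function */

/* Hypothetical slot numbers; the order would be frozen as part of the ABI. */
enum klib_slot {
	KLIB_SLOT_STRLEN = 0,
	KLIB_SLOT_MEMCPY = 1
};

/* Address of a slot: base plus slot number times slot size. */
static inline uintptr_t
klib_slot_addr(enum klib_slot slot)
{
	return (uintptr_t)KLIB_BASE + (uintptr_t)slot * KLIB_SLOT_SIZE;
}
```

Since every term is a compile-time constant, a call site can use the
resulting absolute address directly, without going through the GOT or PLT.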

>     So the THIRD way would be to do this:
> 
> 	.section	.klib-dragonfly01,"ax",@progbits
> 	.globl		strlen
> 	.type		strlen,@function
> strlen:
> 	[ the entire contents is replaced with actual function if the actual
> 	  function does not exceed 64 bytes, else only the jump vector is
> 	  modified ]
> 	[ the default function can be placed here directly if it does not
> 	  exceed 64 bytes ]
> 	jmp		clib_strlen	; default overridden by kernel
> 	.p2align	6,0x90		; 64 byte blocks
> 
>     Advantages: 
> 
> 	* Direct call, no jump table for simple functions.
> 
> 	* The kernel can just mmap() the replacement library right over the
> 	  top.
> 
>     Disadvantages:
> 
> 	* requires a sophisticated utility to check whether the compiled
> 	  function fits and decide whether to generate a jmp or whether
> 	  to directly embed the function.
> 
> 	* space/compactness tradeoff means that the chosen size may not
> 	  be cache friendly, or may be space friendly, but not both.

Yes, exactly. This is what Apple is doing for MacOS X. The version problem
is not that big, because like I said, the ABI would be fixed and could be
bound to the normal COMPAT handling. Having a default included in libc as
fallback would work too. IIRC the speed difference is bigger on PPC,
because you are doing PIC almost always there.
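
The fits-or-jump decision from the quoted disadvantages could look roughly
like this. It is a sketch under assumptions, not an existing tool:
klib_fill_slot and its parameters are hypothetical, addresses are taken to
be 32-bit (i386), and the stub is the usual 5-byte near jump (opcode 0xE9
followed by a 32-bit displacement relative to the next instruction).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KLIB_SLOT_SIZE	64u	/* matches .p2align 6 in the quoted example */

/*
 * Fill one slot. If the function body fits, embed it directly and
 * return 1. Otherwise write "jmp target" at the start of the slot and
 * return 0. Unused bytes are padded with 0x90 (nop), as .p2align does.
 */
int
klib_fill_slot(uint8_t slot[KLIB_SLOT_SIZE], uint32_t slot_addr,
    const uint8_t *body, size_t body_len, uint32_t target_addr)
{
	uint32_t rel;

	memset(slot, 0x90, KLIB_SLOT_SIZE);
	if (body_len <= KLIB_SLOT_SIZE) {
		memcpy(slot, body, body_len);
		return 1;
	}

	/* The displacement is relative to the end of the 5-byte jmp. */
	rel = target_addr - (slot_addr + 5u);
	slot[0] = 0xE9;
	slot[1] = (uint8_t)(rel & 0xff);	/* little-endian rel32 */
	slot[2] = (uint8_t)((rel >> 8) & 0xff);
	slot[3] = (uint8_t)((rel >> 16) & 0xff);
	slot[4] = (uint8_t)((rel >> 24) & 0xff);
	return 0;
}
```

A real utility would also have to deal with relocations inside an embedded
body, which this sketch ignores.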

Cache friendliness is difficult, but we can at least handle alignment
pretty well. I don't think it makes a difference for normal cache line
lengths, because GCC does some padding of functions by default.
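
For the alignment part, GCC's aligned function attribute (or the
-falign-functions option) is enough for a quick experiment. A small sketch,
assuming GCC or a compatible compiler on x86 and a 64-byte slot size;
klib_dummy and klib_is_slot_aligned are hypothetical names.

```c
#include <stdint.h>

#define KLIB_SLOT_SIZE	64	/* assumed slot size, as in the quoted example */

/* Ask the compiler to start this function on a 64-byte boundary. */
__attribute__((aligned(KLIB_SLOT_SIZE)))
int
klib_dummy(void)
{
	return 0;
}

/* Returns non-zero if fn starts on a slot boundary. */
int
klib_is_slot_aligned(int (*fn)(void))
{
	return ((uintptr_t)fn % KLIB_SLOT_SIZE) == 0;
}
```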

Joerg