cvs commit: src/sys/sys tls.h src/lib/libc/gen tls.c src/lib/libthread_xu/arch/amd64/amd64 pthread_md.c src/lib/libthread_xu/arch/i386/i386 pthread_md.c src/libexec/rtld-elf rtld.c rtld.h rtld_tls.h src/libexec/rtld-elf/i386 reloc.c

Mon Mar 28 09:45:00 PST 2005

:...
:>     prefer NOT to do).  I did a quick timing test on sys_set_tls_area()
:>     and it costs around 339ns on my AMD64 test cube.  But this is still
:>     going to be far higher performing than having to call __tls_get_addr
:>     all the time.  The procedure setup cost for figuring out the GOT offset
:>     alone is 17ns on the same box.
:
:It's not about calling __tls_get_addr, but
:	mov %gs:0, %eax
:	mov a@NTPOFF(%eax), %eax
:vs.
:	mov %gs:a@NTPOFF, %eax
:
:The difference is one load instruction, possibly with a pipeline stall
:involved here. The difference should be zero once the base register is
:loaded.
:
:Joerg

    There's no pipeline stall there.  %gs:0 is likely to ALWAYS be in the
    L1 cache.  The %gs prefix itself can cost time versus a non-prefixed
    relative load instruction so my guess is that it turns out to be a wash.

    Also keep in mind that GCC will cache the data loaded from %gs:0, which
    makes it even less of an issue (and potentially faster than %gs:OFFSET).

    I did a quick test with both the direct and indirect %gs models and
    couldn't see any difference in timing.
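
A comparison along these lines could be sketched in C as below. This is not the test that was actually run for this post, only a minimal illustration under stated assumptions: the names tls_counter, bump_direct, bump_cached and ns_per_iter are hypothetical, the __thread keyword is the GCC/Clang extension, and which instruction sequence the compiler emits for each loop depends on the compiler version, optimization flags and TLS model, so the generated assembly should be checked before drawing conclusions from the numbers.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical thread-local counter; volatile keeps the loops from
 * being folded into a single addition by the optimizer. */
static __thread volatile uint64_t tls_counter;

/* Names the TLS variable on every iteration, leaving the compiler
 * free to use a segment-relative access each time. */
uint64_t bump_direct(uint64_t iters)
{
    for (uint64_t i = 0; i < iters; i++)
        tls_counter++;
    return tls_counter;
}

/* Takes the address once, so the thread base is resolved a single
 * time and the loop uses an ordinary pointer afterwards. */
uint64_t bump_cached(uint64_t iters)
{
    volatile uint64_t *p = &tls_counter;

    for (uint64_t i = 0; i < iters; i++)
        (*p)++;
    return *p;
}

/* Average CPU time per iteration in nanoseconds, measured with the
 * standard clock() call (coarse, so use a large iteration count). */
double ns_per_iter(uint64_t (*fn)(uint64_t), uint64_t iters)
{
    clock_t t0, t1;

    assert(iters > 0);
    t0 = clock();
    fn(iters);
    t1 = clock();
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / (double)iters;
}
```

Calling ns_per_iter(bump_direct, n) and ns_per_iter(bump_cached, n) with the same large n gives two figures to compare. Both loops are dominated by the volatile read-modify-write, so any difference between them is an upper bound on the addressing cost rather than a precise measurement of it.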

					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>