cvs commit: src/sys/sys tls.h src/lib/libc/gen tls.c src/lib/libthread_xu/arch/amd64/amd64 pthread_md.c src/lib/libthread_xu/arch/i386/i386 pthread_md.c src/libexec/rtld-elf rtld.c rtld.h rtld_tls.h src/libexec/rtld-elf/i386 reloc.c
Matthew Dillon
dillon at apollo.backplane.com
Mon Mar 28 09:45:00 PST 2005
:...
:> prefer NOT to do). I did a quick timing test on sys_set_tls_area()
:> and it costs around 339ns on my AMD64 test cube. But this is still
:> going to be far higher performing than having to call __tls_get_addr
:> all the time. The procedure setup cost for figuring out the GOT offset
:> alone is 17ns on the same box.
:
:It's not about calling __tls_get_addr, but
: mov %gs:0, %eax
: mov a@NTPOFF(%eax), %eax
:vs.
: mov %gs:a@NTPOFF, %eax
:
:The difference is one load instruction, with possibly a pipeline stall
:involved here. The difference should be zero once the base register is
:loaded.
:
:Joerg
There's no pipeline stall there. %gs:0 is likely to ALWAYS be in the
L1 cache. The %gs prefix itself can cost time versus a non-prefixed
relative load instruction, so my guess is that it turns out to be a wash.
Also keep in mind that GCC will cache the data loaded from %gs:0, which
makes it even less of an issue (and potentially faster than %gs:OFFSET).
I did a quick test with both the direct and indirect %gs models and
couldn't see any difference in timing.
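
For anyone who wants to repeat that kind of comparison, here is a rough
sketch in C. It is not the test referred to above. The names
(tls_counter, bump_direct, bump_cached, ns_per_iter) are made up for
illustration, and __thread is the GCC/Clang extension keyword, assumed
to be available. bump_direct names the TLS variable on every access,
while bump_cached takes its address once and reuses the pointer, which
roughly mirrors the direct and indirect %gs models.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#include <assert.h>
#include <time.h>

/* Hypothetical thread-local variable, for illustration only. */
static __thread long tls_counter;

/* Reset the calling thread's counter to zero. */
void reset_counter(void)
{
    tls_counter = 0;
}

/* Direct model: every access names the TLS variable itself. */
long bump_direct(long n)
{
    assert(n >= 0);
    for (long i = 0; i < n; i++)
        tls_counter += 1;
    return tls_counter;
}

/* Indirect model: resolve the TLS address once, then go through
 * an ordinary pointer for the rest of the loop. */
long bump_cached(long n)
{
    long *p = &tls_counter;

    assert(n >= 0);
    for (long i = 0; i < n; i++)
        *p += 1;
    return *p;
}

/* Wall-clock cost of fn(n) in nanoseconds per iteration. */
double ns_per_iter(long (*fn)(long), long n)
{
    struct timespec t0, t1;

    assert(n > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / (double)n;
}
```

Call ns_per_iter(bump_direct, N) and ns_per_iter(bump_cached, N) for
some large N and compare. An optimizing compiler may turn both loops
into the same code, so look at the generated assembly (gcc -S) before
reading anything into the numbers.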
Matthew Dillon
<dillon at xxxxxxxxxxxxx>