Buildworld loop seg-fault update -- I believe it is hardware
dillon at apollo.backplane.com
Sun Dec 25 14:19:49 PST 2011
After struggling with this issue for well over a year and really pushing
hard to track it down in the last two months I was finally able to come
up with a test case running 'cc1' from gcc-4.4 in a loop and get it
to fail in less than 60 seconds. Prior to finding this case it would
take anywhere up to 2 days with 48 cores fully loaded to reproduce
SUMMARY OF PROBLEM
* Only cc1 (from gcc-4.4.7) on AMD boxes would fail. The same code & OS
image on Intel boxes would not fail.
* Repeated runs with the same arguments etc would not reproduce the bug
with any consistency. In otherwords, it could not be a bug in CC1
itself. This was further evaluated using a static stackgap and
telling cc1 to not use a random seed with the same results.
* No other program was failing due to this issue that I could determine.
* All of cc1's failures occured approximately in the same place, moving
around a little based on numerous factors (stackgap randomization,
arguments length, etc... see below).
They all involved the fill_sons_in_loop() and/or first_dom_son()
(which is called from fill_sons_in_loop()) functions in GCC 4.4.7.
* The result is a segmentation fault, bus fault, or other fault related
to corrupt program state, causing the compiler to report an internal
* The failure has been observed on three different machines, all running
AMD cpus. A quad opteron 6168 (48 core) box, and two Phenom II x4 820
boxes. All running DragonFly. However, I believe I have discounted
the OS as a possible source of the problem (see below).
* This particular failure has never been observed on an Intel cpu
(SandyBridge consumer and SandyBridge Xeon E3 tested).
* The bug is EXTREMELY sensitive to %rip and %rsp values. Changing the
codebase under test even a little, for example adding a few nop's to
an unrelated procedure which shifts the entire code base, greatly
changed the reproducability of the bug.
* Similarly for stack addresses. DragonFly uses the initial stack
randomization patch from OpenBSD, so repeated runs of the same program
will start with a (16-byte aligned) stack pointer offset in a 0-1008
This also means reproducability is extremely sensitive to the
argument count, length of program arguments, environment count,
length of the environment, length of the path to the binary being
executed, and the signal return code the kernel throws near the
top of the stack (note: reproducability is unrelated to signals of
any sort, no signals are taken during the tests).
* The bug reproduces on DragonFly by simply doing a make -j 20 buildworld
or other similar buildworld in loop, on an AMD box with known buggy
cpu's (current known to occur with the AMD Opteron 6168 Id = 0x100f91
Stepping = 1 -- that's the 12-core opteron by the way, and the
Phenom II X4 820 (Id = 0x100f42 Stepping = 2).
* Tested with a recent BIOS as well as with all the errata MSR's
programmed, at least all the ones I could find. Nothing fixed it.
* It originally took about 2 days to reproduce the bug using the
buildworld test. Recently I was able to find a test case with a
particular cc1 line from the buildworld which, combined with a fixed
environment, command line, and stack gap could reproduce the bug in
less than 60 seconds.
* In addition, the fast-reproducing environment was single threaded.
A single cc1 was executed via vfork/exec and remained on a single
cpu, with no other machine load. The bug still reproduced.
Using different starting stack offsets via DragonFly's stackgap
feature I was able to find an offset (in my case -904 but it depends
a ton of everything else placed on the stack by exec) that reproduces
the bug within 60 seconds.
By continuing to try different offsets I've determined that high
reproducability repeats every 32K of offset. e.g. -904, -33672,
and so forth. Once the initial offset is found with high
reproducability each 32K increment produces another highly
* Finding such an environment on other OS's should be possible but I
will note that it took a lot of effort just to find a quickly
reproducable case with DragonFly. I've been tearing my hair out on
this bug for over a year now trying to find the 'OS bug' that caused
it and couldn't find an easily reproducable (DragonFly-specific) case
Probably the only way to reproduce this on another OS consistently
using my test case is to write a DragonFly syscall emulation layer
for e.g. FreeBSD. I may do this to absolutely guarantee that it isn't
an OS bug in DragonFly, but it's a bit of work and I haven't decided
if I want to spend the time or not.
* I am continuing my efforts to reproduce the bug on another OS but
it isn't going to be easy. Differences in GCC implementations,
lack of stackgap randomization, and other issues make this a fairly
large time consuming affair.
For example, the gcc-4.4 port on FreeBSD produces code that is radically
different than the code produced on DragonFly, due to a multitude of
optimizations that appear to be enabled by default as well as differences
in the ordering of libraries, code modules during link, libc, and so
on and so forth.
EFFORTS AT FINDING A KERNEL BUG THAT WASN'T A KERNEL BUG
I now believe this to be a cpu bug. The following efforts were made
to determine whether the DragonFly kernel was responsible for the bug.
Most of these efforts occured before the fast-reproducing case was
* There were other bugs in the DFly kernel related to MP operations,
the pmap code, and the VM object code. These subsystems were completely
rewritten (and committed) in the course of researching this bug.
* I tried prefaulting all stack and text pages AND also forced malloc()
to bzero() the allocated space. These changes had NO EFFECT on bug
* All user<->kernel transitions (except BIOS SMI which I have no control
over) recorded the saved %rip and %rsp and compared it against the
%rip and %rsp upon iret. This includes system calls, execptions, and
interrupts. No discrepencies were found.
* USB legacy mode was disabled in the BIOS, which greatly changes BIOS
SMI interrupt rates. This had NO EFFECT on bug reproducability.
* The thread switch code in the kernel was modified to unconditionally
invalidated the TLB. This had NO EFFECT on bug reproducability.
* The return-to-user code in the kernel was modified to intentionally
smash the AMD RAS (Return Address Stack) with a chain of 28 call
recursions and return sequences. This had NO EFFECT on bug
* All user<->kernel transitions logged the TSC value which was checked
against the failure point. Cases were found where NO kernel entry
occured for at least 8 milliseconds prior to the failure point.
* We CLI'd the thread switch code. This had NO EFFECT on bug
* The bug reproduced single-threaded and without switching cpu cores
so we know it can't be due to stale TLB entries.
USERLAND DEBUGGING EFFORTS
* Debugging the issue in userland (and kernelland) is extremely
difficult because most debugging mechanisms caused the problem
to stop occuring.
* The user stack appeared to NOT be corrupt in any way.
* Either the %rip or the %rsp was clearly wrong. The stack state
approximately 544 bytes higher up on the stack seemed to match what
the code at the failed %rip was trying to do, but the register state
In the test case I was able to reproduce most readily the program
failed at the retq at the end of fill_sons_in_loop(). The retq
failed with a bus error due to a non-canonical value on the stack
at the point it tried to retq. This value was clearly data left over
on the stack from a local variable in one of the several procedures
called by fill_sons_in_loop(). The register state that was found
was very weird... %r12, %r13, and %r14 matched thec contents of
the stack at the failure point but registers %rbx and %rbp did not.
%rbx and %rbp matched the contents of the stack +544 bytes higher
on the stack (and the stack state +544 bytes higher would have had
a correct return %rip had the program tried to pop/retq from that
point instead of where the %rsp at the failure point was).
NOTE: This assembly dump is from a case where I inserted a
MFENCE + NOP at the top of the routine. The bug still occured.
However, if I moved the MFENCE + NOP t othe end of the routine,
with no other changes (so code offsets remain the same except
for the slight shift within fill_sons_in_loop() itself),
the bug stopped occuring.
(from an x86-64 built cc1, gcc 4.4.7)
(gdb) disassemble fill_sons_in_loop
0x00000000008203cb <+0>: push %r14
0x00000000008203cd <+2>: push %r13
0x00000000008203cf <+4>: push %r12
0x00000000008203d1 <+6>: push %rbp
0x00000000008203d2 <+7>: push %rbx
0x00000000008203d3 <+8>: mov %rdi,%rbp
0x00000000008203d6 <+11>: mov %rdx,%r14
0x00000000008203d9 <+14>: mov %rcx,%r13
0x00000000008203dc <+17>: mfence
0x00000000008203df <+20>: nop
0x00000000008203e0 <+21>: mov (%rcx),%eax
0x00000000008203e2 <+23>: movslq %eax,%rdx
0x00000000008203e5 <+26>: mov %rsi,(%r14,%rdx,8)
0x00000000008203e9 <+30>: add $0x1,%eax
0x00000000008203ec <+33>: mov %eax,(%rcx)
0x00000000008203ee <+35>: mov $0x1,%edi
0x00000000008203f3 <+40>: callq 0x7c32b9 <first_dom_son>
0x00000000008203f8 <+45>: mov %rax,%rbx
0x00000000008203fb <+48>: test %rax,%rax
0x00000000008203fe <+51>: je 0x82046b <fill_sons_in_loop+160>
0x0000000000820400 <+53>: mov $0x0,%r12d
0x0000000000820406 <+59>: mov %rbx,%rsi
0x0000000000820409 <+62>: mov %rbp,%rdi
0x000000000082040c <+65>: callq 0x81fe53 <flow_bb_inside_loop_p>
0x0000000000820411 <+70>: test %al,%al
0x0000000000820413 <+72>: je 0x820440 <fill_sons_in_loop+117>
0x0000000000820415 <+74>: mov 0x10(%rbp),%rsi
0x0000000000820419 <+78>: mov %rbx,%rdx
0x000000000082041c <+81>: mov $0x1,%edi
0x0000000000820421 <+86>: callq 0x7c38b0 <dominated_by_p>
0x0000000000820426 <+91>: test %al,%al
0x0000000000820428 <+93>: je 0x82042f <fill_sons_in_loop+100>
0x000000000082042a <+95>: mov %rbx,%r12
0x000000000082042d <+98>: jmp 0x820440 <fill_sons_in_loop+117>
0x000000000082042f <+100>: mov %r13,%rcx
0x0000000000820432 <+103>: mov %r14,%rdx
0x0000000000820435 <+106>: mov %rbx,%rsi
0x0000000000820438 <+109>: mov %rbp,%rdi
0x000000000082043b <+112>: callq 0x8203cb <fill_sons_in_loop>
0x0000000000820440 <+117>: mov %rbx,%rsi
0x0000000000820443 <+120>: mov $0x1,%edi
0x0000000000820448 <+125>: callq 0x7c3269 <next_dom_son>
0x000000000082044d <+130>: mov %rax,%rbx
0x0000000000820450 <+133>: test %rax,%rax
0x0000000000820453 <+136>: jne 0x820406 <fill_sons_in_loop+59>
0x0000000000820455 <+138>: test %r12,%r12
0x0000000000820458 <+141>: je 0x82046b <fill_sons_in_loop+160>
0x000000000082045a <+143>: mov %r13,%rcx
0x000000000082045d <+146>: mov %r14,%rdx
0x0000000000820460 <+149>: mov %r12,%rsi
0x0000000000820463 <+152>: mov %rbp,%rdi
0x0000000000820466 <+155>: callq 0x8203cb <fill_sons_in_loop>
0x000000000082046b <+160>: pop %rbx
0x000000000082046c <+161>: pop %rbp
0x000000000082046d <+162>: pop %r12
0x000000000082046f <+164>: pop %r13
0x0000000000820471 <+166>: pop %r14
0x0000000000820473 <+168>: retq
(same code but with the mfence+nop moved to the end of the function)
(this adjustment causes the bug to stop occuring)
Dump of assembler code for function fill_sons_in_loop:
0x000000000082046b <+0>: push %r14
0x000000000082046d <+2>: push %r13
0x000000000082046f <+4>: push %r12
0x0000000000820471 <+6>: push %rbp
0x0000000000820472 <+7>: push %rbx
0x0000000000820473 <+8>: mov %rdi,%rbp
0x0000000000820476 <+11>: mov %rdx,%r14
0x0000000000820479 <+14>: mov %rcx,%r13
0x000000000082047c <+17>: mov (%rcx),%eax
0x000000000082047e <+19>: movslq %eax,%rdx
0x0000000000820481 <+22>: mov %rsi,(%r14,%rdx,8)
0x0000000000820485 <+26>: add $0x1,%eax
0x0000000000820488 <+29>: mov %eax,(%rcx)
0x000000000082048a <+31>: mov $0x1,%edi
0x000000000082048f <+36>: callq 0x7c3341 <first_dom_son>
0x0000000000820494 <+41>: mov %rax,%rbx
0x0000000000820497 <+44>: test %rax,%rax
0x000000000082049a <+47>: je 0x820507 <fill_sons_in_loop+156>
0x000000000082049c <+49>: mov $0x0,%r12d
0x00000000008204a2 <+55>: mov %rbx,%rsi
0x00000000008204a5 <+58>: mov %rbp,%rdi
0x00000000008204a8 <+61>: callq 0x81fee4 <flow_bb_inside_loop_p>
0x00000000008204ad <+66>: test %al,%al
0x00000000008204af <+68>: je 0x8204dc <fill_sons_in_loop+113>
0x00000000008204b1 <+70>: mov 0x10(%rbp),%rsi
0x00000000008204b5 <+74>: mov %rbx,%rdx
0x00000000008204b8 <+77>: mov $0x1,%edi
0x00000000008204bd <+82>: callq 0x7c3909 <dominated_by_p>
0x00000000008204c2 <+87>: test %al,%al
0x00000000008204c4 <+89>: je 0x8204cb <fill_sons_in_loop+96>
0x00000000008204c6 <+91>: mov %rbx,%r12
0x00000000008204c9 <+94>: jmp 0x8204dc <fill_sons_in_loop+113>
0x00000000008204cb <+96>: mov %r13,%rcx
0x00000000008204ce <+99>: mov %r14,%rdx
0x00000000008204d1 <+102>: mov %rbx,%rsi
0x00000000008204d4 <+105>: mov %rbp,%rdi
0x00000000008204d7 <+108>: callq 0x82046b <fill_sons_in_loop>
0x00000000008204dc <+113>: mov %rbx,%rsi
0x00000000008204df <+116>: mov $0x1,%edi
0x00000000008204e4 <+121>: callq 0x7c3319 <next_dom_son>
0x00000000008204e9 <+126>: mov %rax,%rbx
0x00000000008204ec <+129>: test %rax,%rax
0x00000000008204ef <+132>: jne 0x8204a2 <fill_sons_in_loop+55>
0x00000000008204f1 <+134>: test %r12,%r12
0x00000000008204f4 <+137>: je 0x820507 <fill_sons_in_loop+156>
0x00000000008204f6 <+139>: mov %r13,%rcx
0x00000000008204f9 <+142>: mov %r14,%rdx
0x00000000008204fc <+145>: mov %r12,%rsi
0x00000000008204ff <+148>: mov %rbp,%rdi
0x0000000000820502 <+151>: callq 0x82046b <fill_sons_in_loop>
0x0000000000820507 <+156>: mfence
0x000000000082050a <+159>: nop
0x000000000082050b <+160>: pop %rbx
0x000000000082050c <+161>: pop %rbp
0x000000000082050d <+162>: pop %r12
0x000000000082050f <+164>: pop %r13
0x0000000000820511 <+166>: pop %r14
0x0000000000820513 <+168>: retq
End of assembler dump.
(NOTE: the 'mfence' may not be needed. Just adding the 'nop'
seems to workaround the bug).
* So, as you can see, I could not determine whether it was the %rsp that
was wrong or the %rip that was wrong. I think it is entirely possible
for the cpu bug to cause a stale return due to the recursive call
That is, if you observe the above code, the callq at location +155
will store the address (fill_sons_in_loop+160) on the stack. When
that function (or other functions in fill_sons_in_loop()) returns,
this old return pc value is left on the stack. Thus the stack is
littered with %rip values that are no longer in use but haven't been
destroyed by local variable state.
So there are two possibilities:
(1) The %rip at the failure point is incorrect only because the cpu
bug caused a retq to return through one of these stale addresses.
Thus the bug is related to the stack pointer becoming corrupt.
(2) The %rsp at the failure point is correct but the %rip is wrong.
There are known errata where a retq returns to the wrong %rip.
These errata are related to 32<->64 bit compatibility mode
transitions, but our pure 64-bit environment has none. I wonder,
though, if the errata is not expansive enough and that the problem
can occur in a pure 64-bit environment too.
* I was able to reproduce the issue with guard variables in
first_dom_son() that pointed to themselves to resolve that
the stack pages in fact contained the proper pages for the
running program and that the data was not shifted.
With many hours of additional debugging I was able to produce a test case
where repeated runs of cc1 would fail within 60 seconds, which made
debugging a whole lot easier.
* ptrace()ing the program using the debug registers anywhere near
the failure points resulted in the failure not occuring. No failure.
* Via ptrace() I was able to remove most OS considerations by pre-faulting
all required stack pages and all program text (though not malloc'd
* I went through all the AMD as of the errata I could find online,
which was circa 2009. The cpu's involved in the tests which had
CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0x100f91 Stepping = 1
CPU: AMD Phenom(tm) II X4 820 Processor (2799.92-MHz K8-class CPU)
Origin = "AuthenticAMD" Id = 0x100f42 Stepping = 2
Again, I will note that this issue does not appear to occur on our
* Locking the process to a single cpu did not solve the problem.
* Once I was able to get a reproducable case running the same cc1 over
and over again (only some runs would fail, not all)... once I was able
to do this I noted that the register state as-of the failure seemed
consistent for all instances of the failure. In otherwords, something
very specific is breaking while cc1 is running, and nothing else.
The failure did not occur on every run of cc1 (usually at least
once out of every 4-20 runs), but every time it DID occur the failure
conditions appeared to be the same.
* No other program has ever failed this way. Only cc1, and it always
seemed to be related to fill_sons_in_loop().
What is notable about fill_sons_in_loop() is that it is executed as
part of a very deep tree recursion in CC1. This recursion is mixed
in with a few other functions which use different stack frame sizes.
The red-zone is in play but exhaustive testing and the fact that cc1
takes no signals during the test remove stack smashing from signals
as a vector for the issue.
In otherwords, these particular routines exercise the cpu hardware
through all sorts of cases... it is deep enough to blow out the
hardware Return Address Stack. It is mixed enough with other code
to potentially create major write-combining effects due to local stack
variables in prior calls and retirement edge cases.
There are many AMD erratum related to write combining, cache effects,
and the return address stack (RAS).
* I programmed all the related MSRs based on the erratum I could
find. It had no effect either way, none, individually, or all
- Errata 63 - disable the TLB flush filter
- Errata 69 - Multiprocessor Coherency Problem with Hardware
- Errata 81 - Cache Coherency Problem with Hardware Prefetching
and Streaming Stores.
- Errata 86 - DRAM Data Masking Feature Can Cause ECC Failures
- Errata 94 - Sequential Prefetch Feature May Cause Incorrect
- Errata 97 (checked that movnt* was no longer used anywhere)
- Errata 101 (made sure the dram scrubber was off)
- Errata 113 - Enhanced Write-Combing Feature Causes System
- Errata 135/136 (checked that no prefix was used for invlpg's)
- Misc MOVS/CMPS/STOS/CLD/STD related REP issues (e.g. like
errata 109). Ran a kernel and userland which removed nearly
all such except for the few MOVS sequences gcc creates for
structural copies (which I couldn't readily figure out how
to remove). No effect.
And so on.
* Some of these erratum imply that a 32<->64 bit mode transition is
required. Our cc1 and OS of course is running in pure 64 bit mode.
I do not trust that a transition is, in fact, required for the erratum
- Errata 95 RET Instruction May Return To Incorrect EIP
(involves bugs for the Return Address Stack (RAS)).
Possible cause if we ignore the 32<->64 bit mode transition
requirement in the errata.
- Errata 82 Certain Faults On Far Transfer Instructions In
64-Bit Mode Save Incorrect RIP.
Unlikely cause as one test case found that no exception,
fault, or interrupt occured for 8ms prior to the failure.
(unless a SYSCALL/SYSRET is responsible, which is also
possible but unlikely).
RESOLUTION OF THE BUG
* The bug still occurs if I place the MFENCE+NOP at the beginning
of the function.
* Placing the MFENCE+NOP at the end of the function causes the
bug to stop occuring.
The bug disappears completely in that case with testing over
* Placing just a NOP at the end of the function causes the
bug to stop occuring. As of this writing this is still undergoing
<dillon at backplane.com>
More information about the Kernel