Buildworld loop seg-fault update -- I believe it is hardware

Sun Dec 25 14:19:49 PST 2011

    After struggling with this issue for well over a year and really pushing
    hard to track it down in the last two months I was finally able to come
    up with a test case running 'cc1' from gcc-4.4 in a loop and get it
    to fail in less than 60 seconds.  Prior to finding this case it would
    take anywhere up to 2 days with 48 cores fully loaded to reproduce
    the failure.

				SUMMARY OF PROBLEM

    * Only cc1 (from gcc-4.4.7) on AMD boxes would fail.  The same code & OS
      image on Intel boxes would not fail.

    * Repeated runs with the same arguments etc would not reproduce the bug
      with any consistency.  In otherwords, it could not be a bug in CC1
      itself.  This was further evaluated using a static stackgap and
      telling cc1 to not use a random seed with the same results.

    * No other program was failing due to this issue that I could determine.
      Only cc1.

    * All of cc1's failures occured approximately in the same place, moving
      around a little based on numerous factors (stackgap randomization,
      arguments length, etc... see below).

      They all involved the fill_sons_in_loop() and/or first_dom_son()
      (which is called from fill_sons_in_loop()) functions in GCC 4.4.7.

    * The result is a segmentation fault, bus fault, or other fault related
      to corrupt program state, causing the compiler to report an internal
      compiler failure.

    * The failure has been observed on three different machines, all running
      AMD cpus.  A quad opteron 6168 (48 core) box, and two Phenom II x4 820
      boxes.  All running DragonFly.  However, I believe I have discounted
      the OS as a possible source of the problem (see below).

    * This particular failure has never been observed on an Intel cpu
      (SandyBridge consumer and SandyBridge Xeon E3 tested).


				BUG REPRODUCABILITY

    * The bug is EXTREMELY sensitive to %rip and %rsp values.  Changing the
      codebase under test even a little, for example adding a few nop's to
      an unrelated procedure which shifts the entire code base, greatly
      changed the reproducability of the bug.

    * Similarly for stack addresses.  DragonFly uses the initial stack
      randomization patch from OpenBSD, so repeated runs of the same program
      will start with a (16-byte aligned) stack pointer offset in a 0-1008
      range.

      This also means reproducability is extremely sensitive to the
      argument count, length of program arguments, environment count,
      length of the environment, length of the path to the binary being
      executed, and the signal return code the kernel throws near the
      top of the stack (note: reproducability is unrelated to signals of
      any sort, no signals are taken during the tests).

    * The bug reproduces on DragonFly by simply doing a make -j 20 buildworld
      or other similar buildworld in loop, on an AMD box with known buggy
      cpu's (current known to occur with the AMD Opteron 6168  Id = 0x100f91
      Stepping = 1 -- that's the 12-core opteron by the way, and the
      Phenom II X4 820 (Id = 0x100f42  Stepping = 2).

    * Tested with a recent BIOS as well as with all the errata MSR's
      programmed, at least all the ones I could find.  Nothing fixed it.

    * It originally took about 2 days to reproduce the bug using the
      buildworld test.  Recently I was able to find a test case with a
      particular cc1 line from the buildworld which, combined with a fixed
      environment, command line, and stack gap could reproduce the bug in
      less than 60 seconds.

    * In addition, the fast-reproducing environment was single threaded.
      A single cc1 was executed via vfork/exec and remained on a single
      cpu, with no other machine load.  The bug still reproduced.

      Using different starting stack offsets via DragonFly's stackgap
      feature I was able to find an offset (in my case -904 but it depends
      a ton of everything else placed on the stack by exec) that reproduces
      the bug within 60 seconds.

      By continuing to try different offsets I've determined that high
      reproducability repeats every 32K of offset.  e.g. -904, -33672,
      and so forth.  Once the initial offset is found with high
      reproducability each 32K increment produces another highly
      reproducable case.

    * Finding such an environment on other OS's should be possible but I
      will note that it took a lot of effort just to find a quickly
      reproducable case with DragonFly.  I've been tearing my hair out on
      this bug for over a year now trying to find the 'OS bug' that caused
      it and couldn't find an easily reproducable (DragonFly-specific) case
      until now.

      Probably the only way to reproduce this on another OS consistently
      using my test case is to write a DragonFly syscall emulation layer
      for e.g. FreeBSD.  I may do this to absolutely guarantee that it isn't
      an OS bug in DragonFly, but it's a bit of work and I haven't decided
      if I want to spend the time or not.

    * I am continuing my efforts to reproduce the bug on another OS but
      it isn't going to be easy.  Differences in GCC implementations,
      lack of stackgap randomization, and other issues make this a fairly
      large time consuming affair.

      For example, the gcc-4.4 port on FreeBSD produces code that is radically
      different than the code produced on DragonFly, due to a multitude of
      optimizations that appear to be enabled by default as well as differences
      in the ordering of libraries, code modules during link, libc, and so
      on and so forth.

	    EFFORTS AT FINDING A KERNEL BUG THAT WASN'T A KERNEL BUG

    I now believe this to be a cpu bug.  The following efforts were made
    to determine whether the DragonFly kernel was responsible for the bug.
    Most of these efforts occured before the fast-reproducing case was
    found.

    * There were other bugs in the DFly kernel related to MP operations,
      the pmap code, and the VM object code.  These subsystems were completely
      rewritten (and committed) in the course of researching this bug.

    * I tried prefaulting all stack and text pages AND also forced malloc()
      to bzero() the allocated space.  These changes had NO EFFECT on bug
      reproducability.

    * All user<->kernel transitions (except BIOS SMI which I have no control
      over) recorded the saved %rip and %rsp and compared it against the
      %rip and %rsp upon iret.  This includes system calls, execptions, and
      interrupts.  No discrepencies were found.

    * USB legacy mode was disabled in the BIOS, which greatly changes BIOS
      SMI interrupt rates.  This had NO EFFECT on bug reproducability.

    * The thread switch code in the kernel was modified to unconditionally
      invalidated the TLB.  This had NO EFFECT on bug reproducability.

    * The return-to-user code in the kernel was modified to intentionally
      smash the AMD RAS (Return Address Stack) with a chain of 28 call
      recursions and return sequences.  This had NO EFFECT on bug
      reproducability.

    * All user<->kernel transitions logged the TSC value which was checked
      against the failure point.  Cases were found where NO kernel entry
      occured for at least 8 milliseconds prior to the failure point.

    * We CLI'd the thread switch code.  This had NO EFFECT on bug
      reproducability.

    * The bug reproduced single-threaded and without switching cpu cores
      so we know it can't be due to stale TLB entries.

			    USERLAND DEBUGGING EFFORTS

    * Debugging the issue in userland (and kernelland) is extremely
      difficult because most debugging mechanisms caused the problem
      to stop occuring.

    * The user stack appeared to NOT be corrupt in any way.

    * Either the %rip or the %rsp was clearly wrong.  The stack state
      approximately 544 bytes higher up on the stack seemed to match what
      the code at the failed %rip was trying to do, but the register state
      was 'imposible'. 

      In the test case I was able to reproduce most readily the program
      failed at the retq at the end of fill_sons_in_loop().  The retq
      failed with a bus error due to a non-canonical value on the stack
      at the point it tried to retq.  This value was clearly data left over
      on the stack from a local variable in one of the several procedures
      called by fill_sons_in_loop().  The register state that was found
      was very weird...  %r12, %r13, and %r14 matched thec contents of
      the stack at the failure point but registers %rbx and %rbp did not.
      %rbx and %rbp matched the contents of the stack +544 bytes higher
      on the stack (and the stack state +544 bytes higher would have had
      a correct return %rip had the program tried to pop/retq from that
      point instead of where the %rsp at the failure point was).

      NOTE: This assembly dump is from a case where I inserted a
	    MFENCE + NOP at the top of the routine.  The bug still occured.
	    However, if I moved the MFENCE + NOP t othe end of the routine,
	    with no other changes (so code offsets remain the same except
	    for the slight shift within fill_sons_in_loop() itself),
	    the bug stopped occuring.

      (from an x86-64 built cc1, gcc 4.4.7)
      (gdb) disassemble fill_sons_in_loop
	   0x00000000008203cb <+0>:     push   %r14
	   0x00000000008203cd <+2>:     push   %r13
	   0x00000000008203cf <+4>:     push   %r12
	   0x00000000008203d1 <+6>:     push   %rbp
	   0x00000000008203d2 <+7>:     push   %rbx
	   0x00000000008203d3 <+8>:     mov    %rdi,%rbp
	   0x00000000008203d6 <+11>:    mov    %rdx,%r14
	   0x00000000008203d9 <+14>:    mov    %rcx,%r13
	   0x00000000008203dc <+17>:    mfence 
	   0x00000000008203df <+20>:    nop
	   0x00000000008203e0 <+21>:    mov    (%rcx),%eax
	   0x00000000008203e2 <+23>:    movslq %eax,%rdx
	   0x00000000008203e5 <+26>:    mov    %rsi,(%r14,%rdx,8)
	   0x00000000008203e9 <+30>:    add    $0x1,%eax
	   0x00000000008203ec <+33>:    mov    %eax,(%rcx)
	   0x00000000008203ee <+35>:    mov    $0x1,%edi
	   0x00000000008203f3 <+40>:    callq  0x7c32b9 <first_dom_son>
	   0x00000000008203f8 <+45>:    mov    %rax,%rbx
	   0x00000000008203fb <+48>:    test   %rax,%rax
	   0x00000000008203fe <+51>:    je     0x82046b <fill_sons_in_loop+160>
	   0x0000000000820400 <+53>:    mov    $0x0,%r12d
	   0x0000000000820406 <+59>:    mov    %rbx,%rsi
	   0x0000000000820409 <+62>:    mov    %rbp,%rdi
	   0x000000000082040c <+65>:    callq  0x81fe53 <flow_bb_inside_loop_p>
	   0x0000000000820411 <+70>:    test   %al,%al
	   0x0000000000820413 <+72>:    je     0x820440 <fill_sons_in_loop+117>
	   0x0000000000820415 <+74>:    mov    0x10(%rbp),%rsi
	   0x0000000000820419 <+78>:    mov    %rbx,%rdx
	   0x000000000082041c <+81>:    mov    $0x1,%edi
	   0x0000000000820421 <+86>:    callq  0x7c38b0 <dominated_by_p>
	   0x0000000000820426 <+91>:    test   %al,%al
	   0x0000000000820428 <+93>:    je     0x82042f <fill_sons_in_loop+100>
	   0x000000000082042a <+95>:    mov    %rbx,%r12
	   0x000000000082042d <+98>:    jmp    0x820440 <fill_sons_in_loop+117>
	   0x000000000082042f <+100>:   mov    %r13,%rcx
	   0x0000000000820432 <+103>:   mov    %r14,%rdx
	   0x0000000000820435 <+106>:   mov    %rbx,%rsi
	   0x0000000000820438 <+109>:   mov    %rbp,%rdi
	   0x000000000082043b <+112>:   callq  0x8203cb <fill_sons_in_loop>
	   0x0000000000820440 <+117>:   mov    %rbx,%rsi
	   0x0000000000820443 <+120>:   mov    $0x1,%edi
	   0x0000000000820448 <+125>:   callq  0x7c3269 <next_dom_son>
	   0x000000000082044d <+130>:   mov    %rax,%rbx
	   0x0000000000820450 <+133>:   test   %rax,%rax
	   0x0000000000820453 <+136>:   jne    0x820406 <fill_sons_in_loop+59>
	   0x0000000000820455 <+138>:   test   %r12,%r12
	   0x0000000000820458 <+141>:   je     0x82046b <fill_sons_in_loop+160>
	   0x000000000082045a <+143>:   mov    %r13,%rcx
	   0x000000000082045d <+146>:   mov    %r14,%rdx
	   0x0000000000820460 <+149>:   mov    %r12,%rsi
	   0x0000000000820463 <+152>:   mov    %rbp,%rdi
	   0x0000000000820466 <+155>:   callq  0x8203cb <fill_sons_in_loop>
	   0x000000000082046b <+160>:   pop    %rbx
	   0x000000000082046c <+161>:   pop    %rbp
	   0x000000000082046d <+162>:   pop    %r12
	   0x000000000082046f <+164>:   pop    %r13
	   0x0000000000820471 <+166>:   pop    %r14
	   0x0000000000820473 <+168>:   retq   

	(same code but with the mfence+nop moved to the end of the function)
	(this adjustment causes the bug to stop occuring)
	Dump of assembler code for function fill_sons_in_loop:
	   0x000000000082046b <+0>:     push   %r14
	   0x000000000082046d <+2>:     push   %r13
	   0x000000000082046f <+4>:     push   %r12
	   0x0000000000820471 <+6>:     push   %rbp
	   0x0000000000820472 <+7>:     push   %rbx
	   0x0000000000820473 <+8>:     mov    %rdi,%rbp
	   0x0000000000820476 <+11>:    mov    %rdx,%r14
	   0x0000000000820479 <+14>:    mov    %rcx,%r13
	   0x000000000082047c <+17>:    mov    (%rcx),%eax
	   0x000000000082047e <+19>:    movslq %eax,%rdx
	   0x0000000000820481 <+22>:    mov    %rsi,(%r14,%rdx,8)
	   0x0000000000820485 <+26>:    add    $0x1,%eax
	   0x0000000000820488 <+29>:    mov    %eax,(%rcx)
	   0x000000000082048a <+31>:    mov    $0x1,%edi
	   0x000000000082048f <+36>:    callq  0x7c3341 <first_dom_son>
	   0x0000000000820494 <+41>:    mov    %rax,%rbx
	   0x0000000000820497 <+44>:    test   %rax,%rax
	   0x000000000082049a <+47>:    je     0x820507 <fill_sons_in_loop+156>
	   0x000000000082049c <+49>:    mov    $0x0,%r12d
	   0x00000000008204a2 <+55>:    mov    %rbx,%rsi
	   0x00000000008204a5 <+58>:    mov    %rbp,%rdi
	   0x00000000008204a8 <+61>:    callq  0x81fee4 <flow_bb_inside_loop_p>
	   0x00000000008204ad <+66>:    test   %al,%al
	   0x00000000008204af <+68>:    je     0x8204dc <fill_sons_in_loop+113>
	   0x00000000008204b1 <+70>:    mov    0x10(%rbp),%rsi
	   0x00000000008204b5 <+74>:    mov    %rbx,%rdx
	   0x00000000008204b8 <+77>:    mov    $0x1,%edi
	   0x00000000008204bd <+82>:    callq  0x7c3909 <dominated_by_p>
	   0x00000000008204c2 <+87>:    test   %al,%al
	   0x00000000008204c4 <+89>:    je     0x8204cb <fill_sons_in_loop+96>
	   0x00000000008204c6 <+91>:    mov    %rbx,%r12
	   0x00000000008204c9 <+94>:    jmp    0x8204dc <fill_sons_in_loop+113>
	   0x00000000008204cb <+96>:    mov    %r13,%rcx
	   0x00000000008204ce <+99>:    mov    %r14,%rdx
	   0x00000000008204d1 <+102>:   mov    %rbx,%rsi
	   0x00000000008204d4 <+105>:   mov    %rbp,%rdi
	   0x00000000008204d7 <+108>:   callq  0x82046b <fill_sons_in_loop>
	   0x00000000008204dc <+113>:   mov    %rbx,%rsi
	   0x00000000008204df <+116>:   mov    $0x1,%edi
	   0x00000000008204e4 <+121>:   callq  0x7c3319 <next_dom_son>
	   0x00000000008204e9 <+126>:   mov    %rax,%rbx
	   0x00000000008204ec <+129>:   test   %rax,%rax
	   0x00000000008204ef <+132>:   jne    0x8204a2 <fill_sons_in_loop+55>
	   0x00000000008204f1 <+134>:   test   %r12,%r12
	   0x00000000008204f4 <+137>:   je     0x820507 <fill_sons_in_loop+156>
	   0x00000000008204f6 <+139>:   mov    %r13,%rcx
	   0x00000000008204f9 <+142>:   mov    %r14,%rdx
	   0x00000000008204fc <+145>:   mov    %r12,%rsi
	   0x00000000008204ff <+148>:   mov    %rbp,%rdi
	   0x0000000000820502 <+151>:   callq  0x82046b <fill_sons_in_loop>
	   0x0000000000820507 <+156>:   mfence 
	   0x000000000082050a <+159>:   nop
	   0x000000000082050b <+160>:   pop    %rbx
	   0x000000000082050c <+161>:   pop    %rbp
	   0x000000000082050d <+162>:   pop    %r12
	   0x000000000082050f <+164>:   pop    %r13
	   0x0000000000820511 <+166>:   pop    %r14
	   0x0000000000820513 <+168>:   retq   
	End of assembler dump.

		(NOTE: the 'mfence' may not be needed.  Just adding the 'nop'
		 seems to workaround the bug).

    * So, as you can see, I could not determine whether it was the %rsp that
      was wrong or the %rip that was wrong.  I think it is entirely possible
      for the cpu bug to cause a stale return due to the recursive call
      to fill_sons_in_loop().

      That is, if you observe the above code, the callq at location +155
      will store the address (fill_sons_in_loop+160) on the stack.  When
      that function (or other functions in fill_sons_in_loop()) returns,
      this old return pc value is left on the stack.  Thus the stack is
      littered with %rip values that are no longer in use but haven't been
      destroyed by local variable state.

      So there are two possibilities:

      (1) The %rip at the failure point is incorrect only because the cpu
	  bug caused a retq to return through one of these stale addresses.
	  Thus the bug is related to the stack pointer becoming corrupt.

      (2) The %rsp at the failure point is correct but the %rip is wrong.
	  There are known errata where a retq returns to the wrong %rip.
	  These errata are related to 32<->64 bit compatibility mode
	  transitions, but our pure 64-bit environment has none.  I wonder,
	  though, if the errata is not expansive enough and that the problem
	  can occur in a pure 64-bit environment too.

    * I was able to reproduce the issue with guard variables in
      first_dom_son() that pointed to themselves to resolve that
      the stack pages in fact contained the proper pages for the
      running program and that the data was not shifted.


    With many hours of additional debugging I was able to produce a test case
    where repeated runs of cc1 would fail within 60 seconds, which made
    debugging a whole lot easier.

    * ptrace()ing the program using the debug registers anywhere near
      the failure points resulted in the failure not occuring.  No failure.  
      Very frustrating.

    * Via ptrace() I was able to remove most OS considerations by pre-faulting
      all required stack pages and all program text (though not malloc'd
      space).

    * I went through all the AMD as of the errata I could find online,
      which was circa 2009.  The cpu's involved in the tests which had
      problems are:

      CPU: AMD Opteron(tm) Processor 6168 (1900.01-MHz K8-class CPU)
	   Origin = "AuthenticAMD"  Id = 0x100f91  Stepping = 1
      Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,
			  MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,
			  SSE,SSE2,HTT>
      Features2=0x802009<SSE3,MON,CX16,POPCNT>
      AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,
			      LM,3DNow!+,3DNow!>
      AMD Features2=0x837ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,
			      Prefetch,OSVW,IBS,SKINIT,WDT,NodeId>

      And also:

      CPU: AMD Phenom(tm) II X4 820 Processor (2799.92-MHz K8-class CPU)
	   Origin = "AuthenticAMD"  Id = 0x100f42  Stepping = 2
      Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,
			  MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,
			  SSE,SSE2,HTT>
      Features2=0x802009<SSE3,MON,CX16,POPCNT>
      AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,
			      LM,3DNow!+,3DNow!>
      AMD Features2=0x37ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,
			      Prefetch,OSVW,IBS,SKINIT,WDT>


      Again, I will note that this issue does not appear to occur on our
      Intel cpus.

    * Locking the process to a single cpu did not solve the problem.

    * Once I was able to get a reproducable case running the same cc1 over
      and over again (only some runs would fail, not all)... once I was able
      to do this I noted that the register state as-of the failure seemed
      consistent for all instances of the failure.  In otherwords, something
      very specific is breaking while cc1 is running, and nothing else.

      The failure did not occur on every run of cc1 (usually at least
      once out of every 4-20 runs), but every time it DID occur the failure
      conditions appeared to be the same.

    * No other program has ever failed this way.  Only cc1, and it always
      seemed to be related to fill_sons_in_loop().

    What is notable about fill_sons_in_loop() is that it is executed as
    part of a very deep tree recursion in CC1.  This recursion is mixed
    in with a few other functions which use different stack frame sizes.
    The red-zone is in play but exhaustive testing and the fact that cc1
    takes no signals during the test remove stack smashing from signals
    as a vector for the issue.

    In otherwords, these particular routines exercise the cpu hardware
    through all sorts of cases... it is deep enough to blow out the
    hardware Return Address Stack.  It is mixed enough with other code
    to potentially create major write-combining effects due to local stack
    variables in prior calls and retirement edge cases.

    There are many AMD erratum related to write combining, cache effects,
    and the return address stack (RAS).

    * I programmed all the related MSRs based on the erratum I could
      find.  It had no effect either way, none, individually, or all
      together.

	- Errata 63 - disable the TLB flush filter

	- Errata 69 - Multiprocessor Coherency Problem with Hardware
		      Prefetch Mechanism.

	- Errata 81 - Cache Coherency Problem with Hardware Prefetching
		      and Streaming Stores.

	- Errata 86 - DRAM Data Masking Feature Can Cause ECC Failures

	- Errata 94 - Sequential Prefetch Feature May Cause Incorrect
		      Processor Operation.

	- Errata 97  (checked that movnt* was no longer used anywhere)

	- Errata 101 (made sure the dram scrubber was off)

	- Errata 113 - Enhanced Write-Combing Feature Causes System
		       Hang.

	- Errata 135/136 (checked that no prefix was used for invlpg's)

	- Misc MOVS/CMPS/STOS/CLD/STD related REP issues (e.g. like 
	  errata 109).  Ran a kernel and userland which removed nearly
	  all such except for the few MOVS sequences gcc creates for
	  structural copies (which I couldn't readily figure out how
	  to remove).  No effect.

	And so on.

    * Some of these erratum imply that a 32<->64 bit mode transition is
      required.  Our cc1 and OS of course is running in pure 64 bit mode.
      I do not trust that a transition is, in fact, required for the erratum
      involved.

	- Errata 95 RET Instruction May Return To Incorrect EIP
	  (involves bugs for the Return Address Stack (RAS)).

	  Possible cause if we ignore the 32<->64 bit mode transition
	  requirement in the errata.

	- Errata 82 Certain Faults On Far Transfer Instructions In
	  64-Bit Mode Save Incorrect RIP.

	  Unlikely cause as one test case found that no exception,
	  fault, or interrupt occured for 8ms prior to the failure.
	  (unless a SYSCALL/SYSRET is responsible, which is also
	  possible but unlikely).

				RESOLUTION OF THE BUG

    * The bug still occurs if I place the MFENCE+NOP at the beginning
      of the function.

    * Placing the MFENCE+NOP at the end of the function causes the
      bug to stop occuring.

      The bug disappears completely in that case with testing over
      2 days.

    * Placing just a NOP at the end of the function causes the
      bug to stop occuring.  As of this writing this is still undergoing
      longer tests.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>