git: kernel - Add workaround support for a probable AMD cpu bug related to cc1

Sun Dec 25 14:03:44 PST 2011

commit 8e32ecc0a77082f1e232a3e6d12e2f163f9667a4
Author: Matthew Dillon <dillon at apollo.backplane.com>
Date:   Sun Dec 25 13:47:39 2011 -0800

    kernel - Add workaround support for a probable AMD cpu bug related to cc1
    
    * Add supporting inlines and a #define.  See the followup commit to
      the gcc-4.4 code in the DFly codebase.
    
    * This bit of code is used to add a single NOP instruction just prior to
      the pop/ret sequence in cc1's fill_sons_in_loop() which works around
      what we believe to be a very difficult to reproduce AMD cpu bug.  The
      bug appears to be present on contemporary AMD cpus and was replicated
      on a Phenom(tm) II X4 820 Processor (Origin = "AuthenticAMD"  Id = 0x100f42
      Stepping = 2) and on a 12-core AMD Opteron(tm) Processor 6168
      (Origin = "AuthenticAMD"  Id = 0x100f91  Stepping = 1).
    
    * The bug is extremely sensitive to %rip and %rsp values as well as
      stack memory use patterns and appears to cause either the %rip or the
      %rsp to become corrupt during the multi-register-pop/ret sequence at
      the end of fill_sons_in_loop() in the GCC 4.4.7 codebase.  This
      procedure is called as part of a deep tree recursion which exercises both
      the AMD RAS (Return Address Stack) hardware circuitry and probably also
      the write combining circuitry.
    
    * I have so far only been able to reproduce the bug on DragonFly but have
      to the best of my ability eliminated the OS as a possible source of the
      problem over the last few months.  I am currently attempting to reproduce
      the bug running FreeBSD on the same hardware but it's virtually impossible
      to replicate the exact environment without adding DragonFly binary emulation
      to FreeBSD (which I just might have to do to truly verify that the bug is
      not a DragonFly OS bug).
    
    * Bug reproducibility: DragonFly utilizes a 0-1023 (~16 byte aligned)
      random stack gap.  Under normal buildworld -j 25 or similar conditions
      it can take anywhere up to 2 days to cause a failure.  Using a fixed
      stack gap of 904 (sysctl kern.stackgap_random=-904) on a particular cc1
      line during the compilation of gcc-4.4 using gcc-4.4, compiling gcc/mcf.c,
      with a carefully constructed environment and command path (to replicate
      a precise starting %rsp for main() of 0x7fffffffe818), I was
      able to replicate the bug in around a 60-second time frame with
      approximately one out of every 16 compiles hitting the bug and failing.
    
    * Changing the stackgap and/or modifying the code in any way (e.g. causing a
      shift in the %rip values) changes the characteristics of the bug, sometimes
      causing it to stop appearing entirely.
    
      It was found that an adjustment of the stackgap in 32768 byte increments
      starting at the gap known to fail also reproduces the bug with the same
      consistency as the original stackgap value.
    
    * Only the fill_sons_in_loop() function in cc1 in a few particular cases
      appears to be able to trigger the bug, across all the compiles we've
      done over a year.
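
The stack-gap observations above reduce to simple arithmetic, sketched here
in C. The constants 904, 32768 and 0x7fffffffe818 come from the commit
message; the helper names are hypothetical and are not part of the commit.
The sketch assumes the gap is measured in bytes and is applied to %rsp
byte-for-byte, which the message does not spell out. Since 32768 is 1 << 15,
moving the stack by a multiple of that stride leaves the low 15 bits of %rsp
unchanged, which may be why those gaps fail just as consistently.

```c
#include <stdint.h>

#define KNOWN_FAILING_GAP 904L    /* fixed gap reported to reproduce the bug */
#define GAP_STRIDE        32768L  /* increment reported to reproduce it too */

/*
 * Hypothetical helper: returns 1 if 'gap' is the known failing gap plus a
 * non-negative multiple of the 32768-byte stride, 0 otherwise.
 */
static int gap_matches_reported_pattern(long gap)
{
    if (gap < KNOWN_FAILING_GAP)
        return 0;
    return ((gap - KNOWN_FAILING_GAP) % GAP_STRIDE) == 0;
}

/*
 * Hypothetical helper: low 15 bits of a stack pointer value.  Subtracting
 * any multiple of 32768 from rsp leaves this value unchanged.
 */
static uint64_t low15(uint64_t rsp)
{
    return rsp & (uint64_t)(GAP_STRIDE - 1);
}
```

For example, gaps of 904 and 904 + 32768 both match the pattern, and the
reported starting %rsp of 0x7fffffffe818 keeps the same low 15 bits after
being lowered by any whole number of strides.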

Summary of changes:
 sys/cpu/i386/include/cpufunc.h   |   32 ++++++++++++++++++++++++++++++++
 sys/cpu/x86_64/include/cpufunc.h |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 0 deletions(-)
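
As a rough illustration of the kind of "supporting inline and #define" the
commit message describes, here is a minimal sketch in C. It is not the code
from this commit: the macro and function names are hypothetical, and the
sketch assumes a GCC-compatible compiler with extended inline assembly. The
#define lets other code (such as the compiler sources) test at build time
whether the helper exists, and the inline emits exactly one NOP instruction
at the point where it is called.

```c
/* Hypothetical feature macro: signals that the workaround inline exists. */
#define HYPOTHETICAL_AMD_NOP_WORKAROUND_AVAILABLE 1

/*
 * Hypothetical inline: emits a single NOP.  'volatile' keeps the compiler
 * from deleting the instruction, and the "memory" clobber keeps it from
 * moving memory accesses across it.
 */
static inline void hypothetical_amd_nop_workaround(void)
{
    __asm__ __volatile__("nop" : : : "memory");
}
```

A caller would guard its use with
#ifdef HYPOTHETICAL_AMD_NOP_WORKAROUND_AVAILABLE and invoke the inline just
before the function's return, so that the NOP lands ahead of the pop/ret
sequence.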

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/8e32ecc0a77082f1e232a3e6d12e2f163f9667a4


-- 
DragonFly BSD source repository