git: kernel - Fix excessive ipiq recursion (3)

Matthew Dillon dillon at crater.dragonflybsd.org
Fri Jul 22 18:48:14 PDT 2016


commit e47e3dba268d93966c871dfaa6cbbca79038d5bb
Author: Matthew Dillon <dillon at apollo.backplane.com>
Date:   Fri Jul 22 18:22:17 2016 -0700

    kernel - Fix excessive ipiq recursion (3)
    
    * Third try.  I'm not quite sure why we are still getting hard locks.  These
      changes (so far) appear to fix the problem, but I don't know why.  It
      is quite possible that the problem is still not fixed.
    
    * Setting target->gd_npoll will prevent *all* other cpus from sending an
      IPI to that target.  This should have been ok because we were in a
      critical section and about to send the IPI to the target ourselves, after
      setting gd_npoll.  The critical section does not prevent Xinvltlb, Xsniff,
      Xspuriousint, or Xcpustop from running, but of these only Xinvltlb does
      anything significant and it should theoretically run at a higher level
      on all cpus than Xipiq (and thus complete without causing a deadlock of
      any sort).
    
      So in short, it should have been ok to allow something like an Xinvltlb
      to interrupt the cpu inbetween setting target->gd_npoll and actually
      sending the Xipiq to the target.  But apparently it is not ok.
    
    * Only clear mycpu->gd_npoll when we either (1) EOI and take the IPIQ
      interrupt or (2) If the IPIQ is made pending via reqflags, when we clear
      the flag.  Previously we were clearing gd_npoll in the IPI processing
      loop itself, potentially racing new incoming interrupts before they get
      EOId by our cpu.  This also should have been just fine, because interrupts
      are enabled in the processing loop so nothing should have been able to
      back-up in the LAPIC.
    
      I can conjecture that possibly there was a race when we cleared gd_npoll
      multiple times, potentially clearing it the second (or later) times,
      allowing multiple incoming IPIs to be queued from multiple cpu sources but
      then cli'ing and entering a e.g. Xinvltlb processing loop before our cpu
      could acknowledge any of them.  And then, possibly, trying to issue an IPI
      with the system in this state.
    
      I don't really see how this can cause a hard lock because I did not observe
      any loop/counter error messages on the console which should have been
      triggered if other cpus got stuck trying to issue IPIs.  But LAPIC IPI
      interactions are not well documented so... perhaps they were being issued
      but blocked our local LAPIC from accepting a Xinvltlb due to having one
      extra unacknowledged Xipiq pending?  But then, our Xinvltlb processing loop
      *does* enable interrupts for the duration, so it should have drained if
      this were so.
    
      In anycase, we no longer gratuitously clear gd_npoll in the processing
      loop.  We only clear it when we know there isn't one in-flight heading to
      our cpu and none queued on our cpu.  What will happen now is that a second
      IPI can be sent to us once we've EOI'd the first one, and wind up in
      reqflags, but will not be acted upon until our current processing loop
      returns.
    
      I will note that the gratuitous clearing we did before *could* have allowed
      substantially all other cpus to try to Xipiq us at nearly the same time,
      so perhaps the deadlock was related to that type of situation.
    
    * When queueing an ipiq command from mycpu to a target, interrupts were
      enabled between our entry into the ipiq fifo, the setting of our cpu bit
      in the target gd_ipimask, the setting of target->gd_npoll, and our
      issuing of the actual IPI to the target.  We now disable interrupts across
      these four steps.
    
      It should have been ok for interrupts to have been left enabled across
      these four steps.  It might still be, but I am not taking any chances now.

Summary of changes:
 sys/kern/lwkt_ipiq.c                       | 120 ++++++-----------------------
 sys/platform/pc64/apic/apic_vector.s       |   2 +
 sys/platform/pc64/x86_64/genassym.c        |   1 +
 sys/platform/pc64/x86_64/global.s          |   2 +
 sys/platform/pc64/x86_64/ipl.s             |   4 +
 sys/platform/pc64/x86_64/mp_machdep.c      |  32 +++-----
 sys/platform/vkernel64/platform/machintr.c |   1 +
 sys/platform/vkernel64/x86_64/exception.c  |   1 +
 sys/sys/thread.h                           |   3 -
 sys/sys/thread2.h                          |  13 ----
 10 files changed, 44 insertions(+), 135 deletions(-)

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/e47e3dba268d93966c871dfaa6cbbca79038d5bb


-- 
DragonFly BSD source repository



More information about the Commits mailing list