Boot hangs starting postfix
    Matthew Dillon 
    dillon at apollo.backplane.com
       
    Mon Jun 14 13:59:04 PDT 2004
    
    
  
    Ah shoot, too bad.  Definitely save the postfix state (just tar it up)
    the next time it happens.
    However, we have gleaned a lot of information about this issue from your
    reports.
    We know it's a livelock issue rather than a deadlock issue, because
    otherwise the addition of the tsleep would not have allowed you to
    ssh in or kill or otherwise signal the postfix processes.
    We can also tell that the scheduler is not getting a
    chance to deschedule the postfix processes that are bouncing between
    each other, which likely means that the livelock is occurring in the
    kernel and neither process is returning to usermode.
    We know it's not stuck in a critical section because interrupts still
    work.
    The output lines you were continuously getting provided a surprise...
    I expected the process to be passed as 'owner' but it looks like
    &proc0 is passed, and the flags indicate F_WAIT|F_FLOCK, so we know
    the issue is occurring with flock() rather than with POSIX locks, which
    really narrows down the code cases.
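
    For anyone reading the log lines by hand, here is a rough sketch of
    how "type 3 flags 00000030" decodes.  The numeric values are
    assumptions taken from the kernel-only lock flags and lock types in
    BSD <sys/fcntl.h>; check them against the tree you are running.  The
    X_ prefix and both helper functions are made up for this example and
    are not part of the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Assumed values mirroring the kernel-only lock flags in BSD <sys/fcntl.h>.
 * The X_ prefix is hypothetical; it only avoids clashing with host headers. */
#define X_F_WAIT   0x010  /* sleep until the lock is granted */
#define X_F_FLOCK  0x020  /* use flock(2) semantics */
#define X_F_POSIX  0x040  /* use POSIX (fcntl) semantics */

/* Assumed BSD lock type numbering. */
#define X_F_RDLCK  1
#define X_F_UNLCK  2
#define X_F_WRLCK  3

/* Map the "type N" field of the log line to a symbolic name. */
static const char *lock_type_name(int type)
{
    switch (type) {
    case X_F_RDLCK: return "F_RDLCK";
    case X_F_UNLCK: return "F_UNLCK";
    case X_F_WRLCK: return "F_WRLCK";
    default:        return "unknown";
    }
}

/* Render the "flags XXXXXXXX" field as a '|'-separated list of names.
 * buf must have room for at least one byte (len > 0). */
static void lock_flags_str(unsigned flags, char *buf, size_t len)
{
    static const struct { unsigned bit; const char *name; } tab[] = {
        { X_F_WAIT,  "F_WAIT"  },
        { X_F_FLOCK, "F_FLOCK" },
        { X_F_POSIX, "F_POSIX" },
    };
    size_t i;

    buf[0] = '\0';
    for (i = 0; i < sizeof(tab) / sizeof(tab[0]); i++) {
        if ((flags & tab[i].bit) == 0)
            continue;
        if (buf[0] != '\0')
            strncat(buf, "|", len - strlen(buf) - 1);
        strncat(buf, tab[i].name, len - strlen(buf) - 1);
    }
}
```

    Under those assumed values, calling lock_flags_str(0x30, ...) yields
    "F_WAIT|F_FLOCK" and lock_type_name(3) yields "F_WRLCK", i.e. a
    blocking exclusive flock()-style request, with F_POSIX not set.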
    So we know a lot now even though we haven't found the smoking gun yet.
					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>
:[joseph, if you happen to have corrupted messages in the queue, please don't
:remove it before giving other people a clue to fix this lock up.]
:
:Yes, this at least keeps ssh alive, and the following messages repeated
:until I removed corrupted messages (files in /var/spool/postfix/corrupt/)
:
:Jun 14 10:46:55 fred /kernel: lf_setlock: 0xcf8df6d4 pid 0 type 3 flags 00000030
: [00000000,7fffffffffffffff]
:Jun 14 10:46:55 fred /kernel: lf_setlock: 0xcf8df7f4 pid 0 type 3 flags 00000030
: [00000000,7fffffffffffffff]
:
:I was so stupid that I didn't keep the corrupted messages, and
:now the older kernel (without your patch) doesn't lock up anymore!
:Just creating a pair of empty files in the corrupt/ directory doesn't
:reproduce it.
    
    