Random server crashes every few weeks (smp_invltlb: endless loop […] retrysmp_invltlb: ipi sent)

Thu May 26 05:43:34 PDT 2016

Hello,

I have tried debugging this on my own, but I am out of ideas now.  I
have a server running Dragonfly BSD, not yet really in production
because it crashes semi-randomly every few weeks.  The first few crashes
were just freezes: The machine would become utterly unresponsive and
would need a hard reboot.

Since a few crashes back, kernel traces have started to appear in syslog
moments before the server freezes.  I never get a downright kernel panic
on the console, it just hangs until put out of its misery and rebooted.
I have included the last three traces below, as I have started to record
them to do some investigations myself.

It -appears- to crash when under heavy I/O load when paging is involved,
but this is not always the case (frozen ’top‘ from moments before the
crash sometimes have a huge swath of pageout, sometimes nil).  The
crashes are not reproduceable, which makes it even harder to debug
(also, even though the server is not yet fully in production, it already
hosts a variety of stuff where I cannot afford large downtimes just to
poke around).

It is not a real machine, but hosted in a virtual datacenter hosted by
ProfitBricks, a german company—perhaps somebody has heard of them.  From
what I know, they are running some sort of KVM for their virtualisation.

Increasing the number of cores seemed to help insofar as I now quite
reliably get a kernel trace in syslog right before the machine dies.
Perhaps the one core senses the problem, cries out in panic and is then
brought down as the other one pulls it down into its own endless loop,
so that they may perish together.

I am trying to set up another ‘real’ machine as a pseudo-server locally
in my office to have something to fiddle with, but so far I have not
managed to get anywhere near a crash, much less a similar one.  So it
may well be that it’s the virtualisation that is to blame—the evidence
is inconclusive, I cannot rule it out.

My -guess- is—and the lines starting with ‘endless loop…’ in the loop
seem to support it—is some sort of race condition that is triggered and
sends the server into an endless loop.  The syslog lines starting with
‘endless loop’ support this at least superficially.

Interestingly, the traces look wildly different.  At first I thought I
was running into some HAMMER bugs or whatnot—Dragonfly is clearly marked
as ‘experimental’, so it’s my own fault if the floor is full with my own
blood from running bleeding-edge software. :o)

But from the traces, it rather looks like HAMMER is just an innocent
bystander.  One of the traces was in some HAMMER function, another in
the middle of paging, the last one in who knows what.  So, basically,
the server was busy minding his own business and crashed at random, with
the trace just recording that the server was doing I/O and light
paging—i.e., was just being a -server-.

As promised, here are the traces of the last three incidends, with the
most current one on top.  All of them are from /var/log/messages; I have
redacted the timestamps and such to help fit it on one screen-width.
The console shows pretty much the same thing, but -no- panic and just
hangs in all three cases.

This was from an hour ago:
| Trace beginning at frame 0xffffffe07f458440
| smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5 
| smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5 
| pmap_qenter() at pmap_qenter+0x6d 0xffffffff80a129da 
| allocbuf() at allocbuf+0x5eb 0xffffffff8065a4cb 
| getblk() at getblk+0x484 0xffffffff8065d641 
| hammer_io_new() at hammer_io_new+0x34 0xffffffff8080ef56 
| hammer_load_buffer() at hammer_load_buffer+0x72 0xffffffff8081ab92 
| hammer_get_buffer() at hammer_get_buffer+0x4b1 0xffffffff8081b114 
| hammer_bnew() at hammer_bnew+0xa6 0xffffffff8081bde3 
| hammer_generate_undo() at hammer_generate_undo+0x114 0xffffffff808242ee 
| hammer_modify_buffer() at hammer_modify_buffer+0xb4 0xffffffff8080f3f3 
| hammer_btree_do_propagation() at hammer_btree_do_propagation+0x17b 0xffffffff807feaae 
| hammer_ip_sync_record_cursor() at hammer_ip_sync_record_cursor+0x5f7 0xffffffff80816ed8 
| hammer_sync_record_callback() at hammer_sync_record_callback+0x24d 0xffffffff80808995 
| hammer_rec_rb_tree_RB_SCAN() at hammer_rec_rb_tree_RB_SCAN+0xfa 0xffffffff80813cc0 
| hammer_sync_inode() at hammer_sync_inode+0x27e 0xffffffff8080b0ad 
| hammer_flusher_flush_inode() at hammer_flusher_flush_inode+0x55 0xffffffff808074bd 
| hammer_fls_rb_tree_RB_SCAN() at hammer_fls_rb_tree_RB_SCAN+0xfc 0xffffffff8080662f 
| hammer_flusher_slave_thread() at hammer_flusher_slave_thread+0x7a 0xffffffff80806763 
| smp_invltlb: endless loop 00000000 00000002, rflags 0000000000000282 retrysmp_invltlb: ipi sent

A few weeks back.  The server mysteriously did -not- crash on this one,
it merely froze for half a minute or so and recovered:
| Trace beginning at frame 0xffffffe05c3136f0
| smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5 
| smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5 
| pmap_qenter() at pmap_qenter+0x6d 0xffffffff80a129da 
| swap_pager_putpages() at swap_pager_putpages+0x453 0xffffffff8083836a 
| vm_pageout_flush() at vm_pageout_flush+0x120 0xffffffff80851cc2 
| vm_pageout_thread() at vm_pageout_thread+0x150a 0xffffffff80853379 
| smp_invltlb: endless loop 00000000 00000002, rflags 0000000000000286 retrysmp_invltlb: ipi sent

The oldes one from about a month ago:
| Trace beginning at frame 0xffffffe0b43c74b8
| smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5 
| smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5 
| pmap_qremove() at pmap_qremove+0x53 0xffffffff80a12a2f 
| vfs_vmio_release() at vfs_vmio_release+0x142 0xffffffff80657f5b 
| getnewbuf() at getnewbuf+0x57d 0xffffffff8065cbd9 
| getblk() at getblk+0x3b1 0xffffffff8065d56e 
| cluster_readx() at cluster_readx+0x34b 0xffffffff806665fa 
| hammer_vop_read() at hammer_vop_read+0x1cf 0xffffffff8082b0fe 
| vop_read() at vop_read+0x85 0xffffffff80682a3f 
| vn_read() at vn_read+0x1d0 0xffffffff80681c10 
| kern_preadv() at kern_preadv+0x171 0xffffffff80630abf 
| sys_read() at sys_read+0x66 0xffffffff80630c11 
| syscall2() at syscall2+0x412 0xffffffff809e9e8c 
| Xfast_syscall() at Xfast_syscall+0xcb 0xffffffff809d28db 
| smp_invltlb: endless loop 00000000 00000002, rflags 0000000000000292 retrysmp_invltlb: ipi sent

The only thing in common is the last line, with the exception of the
‘rflags’ that have a slightly different value.  I can’t make sense out
of it—perhaps someone more experienced can help me shed some light on
what is happening here?

-- 
If you can do it then why do it?