Re: Random server crashes every few weeks (smp_invltlb: endless loop […] retrysmp_invltlb: ipi sent)

Thu May 26 11:00:18 PDT 2016

It's really hard to say from something which is virtually hosted.  It kinda
sounds like the virtual host isn't assigning enough of its own cpus to the
virtual host.  The fact that DragonFly is complaining about smp_invltlb()
implies that the host's virtualized cpu threads are not getting scheduled
properly.

One thing to note is that we do not do any instruction escapes to hint to
virtual hosts when a cpu is in a tight loop waiting for synchronization.
It would be nice if we had some support for that, it would probably make
DFly play better on virtualized systems.

I suggest setting the number of cores to 1.  That will get rid of all SMP
interplay and hopefully remove the issues the virtual host is choking on.

-Matt

On Thu, May 26, 2016 at 5:43 AM, Stefan Unterweger <
232.20711 at chiffre.aleturo.com> wrote:

> Hello,
>
> I have tried debugging this on my own, but I am out of ideas now.  I
> have a server running Dragonfly BSD, not yet really in production
> because it crashes semi-randomly every few weeks.  The first few crashes
> were just freezes: The machine would become utterly unresponsive and
> would need a hard reboot.
>
> Since a few crashes back, kernel traces have started to appear in syslog
> moments before the server freezes.  I never get a downright kernel panic
> on the console, it just hangs until put out of its misery and rebooted.
> I have included the last three traces below, as I have started to record
> them to do some investigations myself.
>
> It -appears- to crash when under heavy I/O load when paging is involved,
> but this is not always the case (frozen ’top‘ from moments before the
> crash sometimes have a huge swath of pageout, sometimes nil).  The
> crashes are not reproduceable, which makes it even harder to debug
> (also, even though the server is not yet fully in production, it already
> hosts a variety of stuff where I cannot afford large downtimes just to
> poke around).
>
> It is not a real machine, but hosted in a virtual datacenter hosted by
> ProfitBricks, a german company—perhaps somebody has heard of them.  From
> what I know, they are running some sort of KVM for their virtualisation.
>
> Increasing the number of cores seemed to help insofar as I now quite
> reliably get a kernel trace in syslog right before the machine dies.
> Perhaps the one core senses the problem, cries out in panic and is then
> brought down as the other one pulls it down into its own endless loop,
> so that they may perish together.
>
>
>
> I am trying to set up another ‘real’ machine as a pseudo-server locally
> in my office to have something to fiddle with, but so far I have not
> managed to get anywhere near a crash, much less a similar one.  So it
> may well be that it’s the virtualisation that is to blame—the evidence
> is inconclusive, I cannot rule it out.
>
> My -guess- is—and the lines starting with ‘endless loop…’ in the loop
> seem to support it—is some sort of race condition that is triggered and
> sends the server into an endless loop.  The syslog lines starting with
> ‘endless loop’ support this at least superficially.
>
> Interestingly, the traces look wildly different.  At first I thought I
> was running into some HAMMER bugs or whatnot—Dragonfly is clearly marked
> as ‘experimental’, so it’s my own fault if the floor is full with my own
> blood from running bleeding-edge software. :o)
>
> But from the traces, it rather looks like HAMMER is just an innocent
> bystander.  One of the traces was in some HAMMER function, another in
> the middle of paging, the last one in who knows what.  So, basically,
> the server was busy minding his own business and crashed at random, with
> the trace just recording that the server was doing I/O and light
> paging—i.e., was just being a -server-.
>
>
>
> As promised, here are the traces of the last three incidends, with the
> most current one on top.  All of them are from /var/log/messages; I have
> redacted the timestamps and such to help fit it on one screen-width.
> The console shows pretty much the same thing, but -no- panic and just
> hangs in all three cases.
>
>
> This was from an hour ago:
> | Trace beginning at frame 0xffffffe07f458440
> | smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5
> | smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5
> | pmap_qenter() at pmap_qenter+0x6d 0xffffffff80a129da
> | allocbuf() at allocbuf+0x5eb 0xffffffff8065a4cb
> | getblk() at getblk+0x484 0xffffffff8065d641
> | hammer_io_new() at hammer_io_new+0x34 0xffffffff8080ef56
> | hammer_load_buffer() at hammer_load_buffer+0x72 0xffffffff8081ab92
> | hammer_get_buffer() at hammer_get_buffer+0x4b1 0xffffffff8081b114
> | hammer_bnew() at hammer_bnew+0xa6 0xffffffff8081bde3
> | hammer_generate_undo() at hammer_generate_undo+0x114 0xffffffff808242ee
> | hammer_modify_buffer() at hammer_modify_buffer+0xb4 0xffffffff8080f3f3
> | hammer_btree_do_propagation() at hammer_btree_do_propagation+0x17b
> 0xffffffff807feaae
> | hammer_ip_sync_record_cursor() at hammer_ip_sync_record_cursor+0x5f7
> 0xffffffff80816ed8
> | hammer_sync_record_callback() at hammer_sync_record_callback+0x24d
> 0xffffffff80808995
> | hammer_rec_rb_tree_RB_SCAN() at hammer_rec_rb_tree_RB_SCAN+0xfa
> 0xffffffff80813cc0
> | hammer_sync_inode() at hammer_sync_inode+0x27e 0xffffffff8080b0ad
> | hammer_flusher_flush_inode() at hammer_flusher_flush_inode+0x55
> 0xffffffff808074bd
> | hammer_fls_rb_tree_RB_SCAN() at hammer_fls_rb_tree_RB_SCAN+0xfc
> 0xffffffff8080662f
> | hammer_flusher_slave_thread() at hammer_flusher_slave_thread+0x7a
> 0xffffffff80806763
> | smp_invltlb: endless loop 00000000 00000002, rflags 0000000000000282
> retrysmp_invltlb: ipi sent
>
> A few weeks back.  The server mysteriously did -not- crash on this one,
> it merely froze for half a minute or so and recovered:
> | Trace beginning at frame 0xffffffe05c3136f0
> | smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5
> | smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5
> | pmap_qenter() at pmap_qenter+0x6d 0xffffffff80a129da
> | swap_pager_putpages() at swap_pager_putpages+0x453 0xffffffff8083836a
> | vm_pageout_flush() at vm_pageout_flush+0x120 0xffffffff80851cc2
> | vm_pageout_thread() at vm_pageout_thread+0x150a 0xffffffff80853379
> | smp_invltlb: endless loop 00000000 00000002, rflags 0000000000000286
> retrysmp_invltlb: ipi sent
>
> The oldes one from about a month ago:
> | Trace beginning at frame 0xffffffe0b43c74b8
> | smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5
> | smp_invltlb() at smp_invltlb+0x229 0xffffffff80a19cd5
> | pmap_qremove() at pmap_qremove+0x53 0xffffffff80a12a2f
> | vfs_vmio_release() at vfs_vmio_release+0x142 0xffffffff80657f5b
> | getnewbuf() at getnewbuf+0x57d 0xffffffff8065cbd9
> | getblk() at getblk+0x3b1 0xffffffff8065d56e
> | cluster_readx() at cluster_readx+0x34b 0xffffffff806665fa
> | hammer_vop_read() at hammer_vop_read+0x1cf 0xffffffff8082b0fe
> | vop_read() at vop_read+0x85 0xffffffff80682a3f
> | vn_read() at vn_read+0x1d0 0xffffffff80681c10
> | kern_preadv() at kern_preadv+0x171 0xffffffff80630abf
> | sys_read() at sys_read+0x66 0xffffffff80630c11
> | syscall2() at syscall2+0x412 0xffffffff809e9e8c
> | Xfast_syscall() at Xfast_syscall+0xcb 0xffffffff809d28db
> | smp_invltlb: endless loop 00000000 00000002, rflags 0000000000000292
> retrysmp_invltlb: ipi sent
>
> The only thing in common is the last line, with the exception of the
> ‘rflags’ that have a slightly different value.  I can’t make sense out
> of it—perhaps someone more experienced can help me shed some light on
> what is happening here?
>
>
> --
> If you can do it then why do it?
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.dragonflybsd.org/pipermail/users/attachments/20160526/135a5a97/attachment-0002.html>