Random server crashes every few weeks (smp_invltlb: endless loop […] retrysmp_invltlb: ipi sent)

Stefan Unterweger 232.20711 at chiffre.aleturo.com
Mon Sep 12 14:51:05 PDT 2016


With a combination of tight RAM (to provoke swapping), a heavy Synth job
and an obscenely large file passing through a pipeline with almost a
dozen ‚xz‘ processes in it I sometimes manage to trigger the flash
reboots.  This time I caught something in the log, though:

| Sep 12 22:59:00 sumi kernel: pid 840971 (conftest) exit race handled
| Sep 12 22:59:51 sumi -- MARK --
| Sep 12 23:15:44 sumi syslogd: kernel boot file is /boot/kernel/kernel
| Sep 12 23:15:44 sumi kernel: DOUBLE FAULT - KERNEL STACK GUARD HIT!
| Sep 12 23:15:44 sumi kernel: 
| Sep 12 23:15:44 sumi kernel: Fatal double fault
| Sep 12 23:15:44 sumi kernel: rip = 0xffffffff8060220e
| Sep 12 23:15:44 sumi kernel: rsp = 0xffffffe0c7290000
| Sep 12 23:15:44 sumi kernel: rbp = 0xffffffe0c7290028
| Sep 12 23:15:44 sumi kernel: cpuid = 0; lapic->id = 00000000
| Sep 12 23:15:44 sumi kernel: Copyright (c) 2003-2016 The DragonFly Project.
| Sep 12 23:15:44 sumi kernel: Copyright (c) 1992-2003 The FreeBSD Project.
| Sep 12 23:15:44 sumi kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
| Sep 12 23:15:44 sumi kernel: The Regents of the University of California. All rights reserved.
| Sep 12 23:15:44 sumi kernel: DragonFly 4.7-DEVELOPMENT #0: Mon Sep 12 19:18:02 CEST 2016

The ‚DOUBLE FAULT‘… message was only logged once the system was
operational again, i.e. after the reboot.

If I crank up the ‚xz‘-processes even more (each at about 650 MB), I
also get messages like these:
| Sep 12 23:38:38 sumi kernel: Warning: bio_page_alloc: memory exhausted during buffer cache page allocation from pkg-static

But they look rather harmless.  It feels as if only the allocation of
large chunks of memory under pressure triggers the reboot; once the
processes have what they want, the server swaps and thrashes like a
madman, but keeps churning without any further incident.

I’ll try a much larger compilation tomorrow and will leave the xz
processes running to see whether they go through.

Just for reference, I’m running 50787d03cd0a0 right now.  If there’s
anything to test, I can set it up much more quickly than before.
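
For anyone who wants to put a similar load on a test box, the xz part of
the setup could be sketched roughly as below.  This is only an
illustration, not the exact pipeline used here: the alternating
compress/decompress layout, the default stage count and the function names
build_xz_chain and run_chain are all assumptions.  The only flags used are
the standard xz ones (-9, -d, -c).

```python
import subprocess


def build_xz_chain(stages=11, level=9):
    """Return argv lists for a chain of xz processes.

    Assumption: stages alternate between compressing and decompressing,
    so the data keeps flowing while every compressor holds a large
    dictionary in memory.
    """
    cmds = []
    for i in range(stages):
        if i % 2 == 0:
            cmds.append(["xz", f"-{level}", "-c"])  # compress stdin to stdout
        else:
            cmds.append(["xz", "-d", "-c"])         # decompress stdin to stdout
    return cmds


def run_chain(cmds, infile, outfile):
    """Run infile -> cmds[0] -> ... -> cmds[-1] -> outfile as one pipeline."""
    procs = []
    with open(infile, "rb") as src, open(outfile, "wb") as dst:
        prev = src
        for i, argv in enumerate(cmds):
            last = i == len(cmds) - 1
            p = subprocess.Popen(
                argv,
                stdin=prev,
                stdout=dst if last else subprocess.PIPE,
            )
            if prev is not src:
                # Drop our copy of the pipe so an upstream stage notices
                # if a downstream stage dies.
                prev.close()
            prev = p.stdout
            procs.append(p)
        # Exit codes of all stages, in pipeline order.
        return [p.wait() for p in procs]
```

Calling run_chain(build_xz_chain(), "/path/to/huge.file", "/dev/null")
(the input path is a placeholder) while the box is already short on RAM
should come close to the memory pressure described above.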

    Stefan



* Stefan Unterweger on Mon, Sep 12, 2016 at 10:33:47PM +0200:
> Hi!
> 
> I haven’t seen your post until now.
> 
> I have finally managed to set up a test machine and am currently
> throwing as many heavy jobs at it as I can think of.
> 
> In general it feels more stable, but I still get crashes.  The crashes
> are different though.  In the last run, the machine just rebooted out of
> the blue, with no salvageable kind of trace or anything (it crashed
> again during boot, the dmesg output was lost after a hard reset).  I
> have restarted the machine and am running it again, hopefully I can
> catch something this time.


-- 
▪ Die Internetbleibe.  Stylish, magical, powerful.  https://internetbleibe.de/
▪ medoly media UG (haftungsbeschränkt) | Hausburgstraße 13, 10249 Berlin
▪ info at medolymedia.de | https://medolymedia.de/ | Tel. 030 609 826‒560 | Fax …‒569
▪ Managing director: Matthias Nothhaft | HRB 131198 (Amtsgericht Berlin-Charlottenburg), registered office: Berlin, VAT ID: DE275221203


