can't gdb vkernel

Sun Jul 13 03:37:57 PDT 2008

2008/7/12 Nicolas Thery <nthery at gmail.com>:
> There is indeed a deadlock:
>
> - The initial vkernel thread is sleeping on the user mutex associated with the
>  vkd cothread.
>
> - The vkd cothread sends a SIGIO (lwp_kill(2)) to the initial thread to
>  simulate an interrupt.  The initial thread's sleep is interrupted and it is
>  made runnable.
>
> - When the cothread is about to return to userland from lwp_kill(2), it is
>  preempted in userexit() and the initial thread runs.
>
> - The initial thread handles the signal (issignal() called from tsleep()).  As
>  the process is being debugged, proc_stop() is called, the process moves to
>  SSTOP and the initial thread is stopped (tstop()).
>
> - The cothread is then awakened and goes back to userland (that's a bug: it
>  should stop too).
>
> - The cothread eventually waits on its condition variable.
>
> - Meanwhile GDB blocks on wait(2) forever because only one lwp out of two is
>  stopped (p_nstopped < p_nthreads).
>
> The kernel tests if the lwp should be stopped in userret():
>
>        if (p->p_stat == SSTOP) {
>                get_mplock();
>                tstop();
>                rel_mplock();
>                goto recheck;
>        }
>
> However, userret() is called *before* userexit() and the cothread is not
> stopped.
>
> To confirm this hypothesis, I added the above code (with the if turned into a
> while) in userexit() after the preemption points and the vkernel booted fine.
> I observed another hang during shutdown though.  I'm not sure this is a correct
> fix.  I'll study this in more detail tomorrow.
>
> Another mystery remains: what change caused this regression?  Running cvs
> annotate on various files didn't point to the culprit.
>

I committed a fix.

It adds yet another location where the kernel tstop()s lwps.  Some
factoring may be possible, but that will have to wait until after 2.0.
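
For readers following the thread, here is a rough userland sketch of the race
described in the quoted message. It is not the committed change and not kernel
code: the mock_* names, the two-state enum and the cut-down struct are
assumptions made up for illustration, and mock_tstop() only counts an lwp as
stopped instead of blocking (which is why the sketch uses a single if where the
quoted experiment used a while).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel's process states (not the real values). */
enum mock_stat { MOCK_SACTIVE, MOCK_SSTOP };

/* Hypothetical miniature of struct proc: only the fields named in the thread. */
struct mock_proc {
	enum mock_stat p_stat;
	int p_nthreads;		/* lwps in the process */
	int p_nstopped;		/* lwps currently stopped */
};

/* Stand-in for tstop(): only counts the lwp as stopped, it does not block. */
static void mock_tstop(struct mock_proc *p)
{
	assert(p->p_nstopped < p->p_nthreads);
	p->p_nstopped++;
}

/* Stand-in for the initial thread taking SIGIO while the process is traced. */
static void mock_initial_thread_handles_signal(struct mock_proc *p)
{
	p->p_stat = MOCK_SSTOP;	/* the process moves to SSTOP */
	mock_tstop(p);		/* and the initial thread is stopped */
}

/*
 * Walk the cothread through its return to userland.  Returns true when
 * every lwp ends up stopped, i.e. when the debugger's wait(2) could return.
 */
static bool mock_cothread_return(bool recheck_after_preemption)
{
	struct mock_proc p = { MOCK_SACTIVE, 2, 0 };

	/* userret(): the process is not stopped yet, so nothing happens. */
	if (p.p_stat == MOCK_SSTOP)
		mock_tstop(&p);

	/* userexit(): preempted here; the initial thread runs and stops. */
	mock_initial_thread_handles_signal(&p);

	/* The quoted experiment: test again after the preemption point. */
	if (recheck_after_preemption && p.p_stat == MOCK_SSTOP)
		mock_tstop(&p);

	return p.p_nstopped == p.p_nthreads;
}
```

In this model, mock_cothread_return(false) leaves p_nstopped < p_nthreads,
which is the state in which GDB blocks in wait(2), while
mock_cothread_return(true) stops both lwps.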