can't gdb vkernel

Fri Jul 11 15:37:11 PDT 2008

2008/7/11 Simon 'corecode' Schubert <corecode at fs.ei.tum.de>:
> Nicolas Thery wrote:
>>
>> I'm looking into this.  There is a deadlock involving the gdb lwp and
>> 2 vkernel lwps. I hope to have a clearer understanding and a fix this
>> week-end.
>
> Great!  When you're saying you are looking into this, I don't have a doubt
> that you will find the cause :)

Thanks for your trust ;-)

There is indeed a deadlock:

- The initial vkernel thread is sleeping on the user mutex associated with the
  vkd cothread.

- The vkd cothread sends a SIGIO (lwp_kill(2)) to the initial thread to
  simulate an interrupt.  The initial thread's sleep is interrupted and it is
  made runnable.

- When the cothread is about to return to userland from lwp_kill(2), it is
  preempted in userexit() and the initial thread runs.

- The initial thread handles the signal (issignal() called from tsleep()).  As
  the process is being debugged, proc_stop() is called, the process moves to
  SSTOP and the initial thread is stopped (tstop()).

- The cothread is then awakened and goes back to userland (that's a bug, it
  should stop too).

- The cothread eventually waits on its condition variable.

- Meanwhile GDB blocks on wait(2) forever because only one lwp out of two is
  stopped (p_nstopped < p_nthreads).
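
The condition GDB ends up waiting on can be illustrated with a small
user-space toy model.  This is only a sketch and not kernel code: the field
names p_nthreads and p_nstopped are taken from the description above, while
struct stop_counts and both helper functions are hypothetical names made up
for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the two counters named above; not the real struct proc. */
struct stop_counts {
	int p_nthreads;		/* lwps in the process */
	int p_nstopped;		/* lwps that have actually stopped */
};

/* Hypothetical helper: account for one lwp stopping. */
static void
toy_lwp_stopped(struct stop_counts *p)
{
	assert(p->p_nstopped < p->p_nthreads);
	p->p_nstopped++;
}

/* Hypothetical helper: true once every lwp has stopped. */
static bool
toy_all_stopped(const struct stop_counts *p)
{
	return (p->p_nstopped == p->p_nthreads);
}
```

With two lwps and a single call to toy_lwp_stopped(), toy_all_stopped() stays
false, which corresponds to the state in which wait(2) never returns.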

The kernel tests if the lwp should be stopped in userret():

	if (p->p_stat == SSTOP) {
		get_mplock();
		tstop();
		rel_mplock();
		goto recheck;
	}

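However, userret() is called *before* userexit().  The cothread passes this
test while the process is still running, is preempted afterwards in
userexit(), and is therefore never stopped.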

To confirm this hypothesis, I added the above code (with the if turned into a
while) to userexit() after the preemption points, and the vkernel booted fine.
I observed another hang during shutdown though.  I'm not sure this is a correct
fix.  I'll study this in more detail tomorrow.
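
The ordering problem can be sketched as a user-space toy model.  It is not the
kernel code and not the actual patch: everything prefixed with toy_ or TOY_ is
a hypothetical name, tstop() is reduced to a counter increment, and the
experimental while loop is reduced to a single second test.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_stat { TOY_SACTIVE, TOY_SSTOP };

/* Toy stand-in for the process state; not the real struct proc. */
struct toy_proc {
	enum toy_stat p_stat;
	int p_nthreads;
	int p_nstopped;
};

/* Stand-in for the "if (p->p_stat == SSTOP) tstop()" test. */
static void
toy_stop_check(struct toy_proc *p)
{
	if (p->p_stat == TOY_SSTOP)
		p->p_nstopped++;
}

/*
 * Stand-in for the initial thread running while the cothread is preempted:
 * it takes the signal, moves the process to the stopped state and stops.
 */
static void
toy_initial_thread_runs(struct toy_proc *p)
{
	p->p_stat = TOY_SSTOP;
	p->p_nstopped++;
}

/*
 * Model of the cothread's return path.  Returns true if every lwp ends up
 * stopped, i.e. if the debugger would be woken up.
 */
static bool
toy_cothread_return(struct toy_proc *p, bool recheck_after_preempt)
{
	toy_stop_check(p);		/* userret(): process not stopped yet */
	toy_initial_thread_runs(p);	/* preemption in userexit() */
	if (recheck_after_preempt)
		toy_stop_check(p);	/* experimental second test */
	assert(p->p_nstopped <= p->p_nthreads);
	return (p->p_nstopped == p->p_nthreads);
}
```

Starting from { TOY_SACTIVE, 2, 0 }, the call without the recheck leaves
p_nstopped at 1, while the call with the recheck reaches 2.  The model says
nothing about the shutdown hang.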

Another mystery remains: what change caused this regression?  Running cvs
annotate on various files didn't point to the culprit.