checkpoint/restart

Wed Oct 8 21:47:01 PDT 2003

>     sigacts has the ps_sigact[] array which I believe contains
>     128 signals by default.  Since only 32 some odd signals are defined,

I clearly didn't look closely enough at it.

>     there are plenty of signals available for your checkpoint and restore
>     vectors (though I still think there is only a need for a checkpoint
>     signal, since the restore can simply resume the signal handler from

If there is extra space that would clearly be cleaner than having my own
upcall function. I can then use the underlying kernel machinery to do the
work. That will also remove the need to make any underlying changes to the
kernel. This may also answer my question about the upcall. I may just be
able to queue up a checkpoint signal.
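
A minimal userland sketch of the signal-based idea, for illustration
only. No checkpoint signal number is named in this thread, so SIGUSR1
stands in for it here, and SIGCKPT_SKETCH, install_ckpt_handler() and
ckpt_pending() are hypothetical names. The handler only records the
request; the actual checkpoint work is assumed to happen elsewhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Assumption: SIGUSR1 stands in for a not-yet-assigned checkpoint signal. */
#define SIGCKPT_SKETCH SIGUSR1

/* Set by the handler, cleared by ckpt_pending(). */
static volatile sig_atomic_t ckpt_requested = 0;

/* Async-signal-safe: only records that a checkpoint was requested. */
static void ckpt_handler(int signo)
{
    (void)signo;
    ckpt_requested = 1;
}

/* Install the handler. Returns 0 on success, -1 on error. */
int install_ckpt_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = ckpt_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGCKPT_SKETCH, &sa, NULL);
}

/* Return 1 if a checkpoint was requested since the last call, else 0. */
int ckpt_pending(void)
{
    int pending = ckpt_requested;

    ckpt_requested = 0;
    return pending;
}
```

A program would call install_ckpt_handler() once at startup and poll
ckpt_pending() at a point where it is safe to checkpoint.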

>     the point where it called freeze()).  There are other little cleanups
>     you can do, but I'm sure you are already aware of them (e.g. converting
>     those macros like READ_CHECK into inlines and removing
>     non-argument-passed dependancies).

Oh yes, there is a fair amount of cleaning up to be done. The first
weekend, when I did the majority of the work, I was just trying to
convince myself that I could get it to work.

>
>     We need to add a stty signal similar to how ^\ core's a program, to
>     checkpoint a program.  That would be so cool.

That would be very easy.

>
>     Am I correct in that currently the program binary that was checkpointed
>     must do the restore?  I tried writing a little program that restore'd

Yes. However, I have two more (untested) system calls in there. One
allows you to checkpoint an arbitrary pid (yes, this has security issues)
and the other allows you to restore a process from another program.

So one can write a simple program ckpt and do the following:
ckpt -f 1345 foobar
ckpt -t myprogram foobar

>     the next step is to save and restore filesystem-based file descriptors
>     so as to reacquire the original text image.  That's a fairly tough nut
>     to crack, but the result will be incredibly powerful.

Your approach is much cleaner from the user's perspective in that it
doesn't require the path to the original binary. But mine was easier to
implement as an intermediate step. Oops, looking in /proc it appears that
one can trivially get the path to the binary:

> cd /proc/102/
> ls -l file
lr-xr-xr-x  1 root  wheel  15 Oct  8 21:37 file -> /sbin/adjkerntz
>
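
The same lookup can be sketched from C with readlink(2). This assumes
the /proc/<pid>/file symlink shown above is present (procfs mounted);
read_link_target() and proc_binary_path() are hypothetical helper names,
not existing functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read the target of a symlink into buf and NUL-terminate it
 * (readlink(2) does not). A target longer than len - 1 bytes is
 * silently truncated. Returns 0 on success, -1 on error.
 */
int read_link_target(const char *link, char *buf, size_t len)
{
    ssize_t n;

    if (len == 0)
        return -1;
    n = readlink(link, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Assumption: the binary's path is exposed as /proc/<pid>/file. */
int proc_binary_path(pid_t pid, char *buf, size_t len)
{
    char link[64];

    snprintf(link, sizeof(link), "/proc/%ld/file", (long)pid);
    return read_link_target(link, buf, len);
}
```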

I'll see where this information is coming from and append it as a note
after the program headers. It is a little harder to do it this way because
there is nothing in the ELF header telling you how big the initial core
file is. So I have to read the ELF header and then read through all the
Elf_Phdrs before doing any real work.
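
A rough sketch of that walk, for illustration only and not the code in
the patch. It assumes a 64-bit ELF file in native byte order and the
type names from <elf.h>; core_image_end() is a hypothetical name. It
reads the ELF header, then every program header, and returns the
largest p_offset + p_filesz, which is where extra data appended to the
file would begin.

```c
#define _POSIX_C_SOURCE 200809L
#include <elf.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Compute the file offset just past the last byte described by the
 * program headers. Returns 0 on success, -1 on a read error or if the
 * file is not a 64-bit ELF file with the expected header entry size.
 */
int core_image_end(int fd, off_t *endp)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    off_t end, pos;
    int i;

    if (pread(fd, &eh, sizeof(eh), 0) != (ssize_t)sizeof(eh))
        return -1;
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof(ph))
        return -1;

    /* The image extends at least past the program header table. */
    end = (off_t)(eh.e_phoff + (Elf64_Off)eh.e_phnum * sizeof(ph));

    /* Every Phdr must be read; the ELF header has no total size field. */
    for (i = 0; i < eh.e_phnum; i++) {
        pos = (off_t)(eh.e_phoff + (Elf64_Off)i * sizeof(ph));
        if (pread(fd, &ph, sizeof(ph), pos) != (ssize_t)sizeof(ph))
            return -1;
        if ((off_t)(ph.p_offset + ph.p_filesz) > end)
            end = (off_t)(ph.p_offset + ph.p_filesz);
    }

    *endp = end;
    return 0;
}
```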

We're actually extremely close to having what you're talking about.
It is largely a matter of cleanup.

Thanks for your enthusiasm. I need it, as mine has waned as other things
of interest (Xen) crop up. I will, however, polish it up over the next few
days to the point where it can be checked in and used by others.

			-Kip