The time has come for a kernel interfacing library layer

Sat May 7 10:00:24 PDT 2005

    Ok.  The time has come to implement the idea some of us have been
    talking about on and off for the last year: a kernel-interfacing
    layer between libc and the system (rather than building the system
    call stubs into libc).

    This layer is going to take over the job of providing the system call
    API to the program.   It will also allow us to safely use more complex
    kernel interfaces, such as a shared-memory userland critical section
    for signal interlocks, shared memory access to things like the 'pid',
    and so forth, without breaking long-term forwards and backwards
    compatibility due to structural changes made in the system.

    This will give us the following features:

    * The ability to change system structures and system call effects without
      having to renumber system calls.

    * The ability to use shared memory between userland and the kernel without
      breaking forwards and backwards compatibility.  i.e. only the layer
      itself would access the shared memory, program binaries would not.
      Again, an all-userland path.

    * The ability to implement 'system calls' that actually run entirely in
      userspace (the layer would JMP to code in the layer rather than JMP
      to code that calls int 0x80).

    * The ability to implement (future) asynchronous-messaged system calls
      without userland being aware that the physical ABI into the kernel
      has changed.

    * And many other things.

    --

    How will this work?  The concept is simple:  Rather than implementing
    system calls directly, all userland programs implement a special
    named section containing system call stubs.  This will be a BSS
    section (it will not contain actual code).  The kernel loader (and
    ld-elf when it loads things) will automatically detect the existence
    of the section and mmap() the actual syscall layer into the BSS
    space, as well as mmap() anything else that it needs for system
    interfacing.  Any additional mmap()'d sections will not be directly
    visible to userland; userland only sees the stub table.
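
    To make the stub-table idea concrete, here is a rough userland-only
    simulation in C.  Everything in it is an assumption made for
    illustration: the slot names, stub_table, shared_page_sim,
    layer_attach() and call_stub() are hypothetical, and a plain array
    stands in for the named BSS section that the loader would fill by
    mmap()ing the layer.  One stub is answered entirely from a simulated
    shared page, the other counts a pretend kernel trap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slot numbers the program binary is compiled against. */
enum { STUB_GETPID, STUB_GETUID, STUB_MAX };

typedef long (*stub_fn)(void);

/* Stands in for the named BSS section: all zero until the layer is attached. */
stub_fn stub_table[STUB_MAX];

/* Stands in for kernel-shared memory that only the layer touches. */
struct shared_page { long pid; };
struct shared_page shared_page_sim = { 1234 };

/* Counts how often a stub would have trapped into the kernel. */
long sim_trap_count;

/* A 'system call' that runs entirely in userspace. */
static long layer_getpid(void) { return shared_page_sim.pid; }

/* A stub that would really enter the kernel; the trap is only simulated. */
static long layer_getuid(void) { sim_trap_count++; return 1001; }

/* What the loader's mmap() of the layer would accomplish, in miniature. */
void layer_attach(void)
{
    stub_table[STUB_GETPID] = layer_getpid;
    stub_table[STUB_GETUID] = layer_getuid;
}

/* The program only ever jumps through its stub slot. */
long call_stub(int slot)
{
    assert(slot >= 0 && slot < STUB_MAX);
    assert(stub_table[slot] != NULL);
    return stub_table[slot]();
}
```

    After layer_attach(), call_stub(STUB_GETPID) is served from the
    simulated shared page without touching sim_trap_count, while
    call_stub(STUB_GETUID) bumps it.  The program cannot tell the two
    cases apart, which is the point of the indirection.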

    The kernel will select the layer file that it maps in based on the ABI
    version of the userland program versus the ABI version of the kernel.
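
    One way to picture that selection is a lookup keyed on the pair of
    ABI versions.  The sketch below is only a guess at the shape of it:
    the version numbers, the file names, layer_table and layer_select()
    are all made up for the example, and nothing here says how the
    kernel would actually store or name the layer files.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: one prebuilt layer per (program ABI, kernel ABI) pair. */
struct layer_entry {
    unsigned    prog_abi;   /* ABI version the program was built against */
    unsigned    kern_abi;   /* ABI version of the running kernel */
    const char *file;       /* made-up layer file name */
};

const struct layer_entry layer_table[] = {
    { 1, 1, "layer-p1-k1" },
    { 1, 2, "layer-p1-k2" },   /* old program on a new kernel */
    { 2, 1, "layer-p2-k1" },   /* new program on an old kernel */
    { 2, 2, "layer-p2-k2" },
};

/* Return the layer matching both versions, or NULL if none was built. */
const struct layer_entry *layer_select(unsigned prog_abi, unsigned kern_abi)
{
    size_t n = sizeof(layer_table) / sizeof(layer_table[0]);

    for (size_t i = 0; i < n; i++) {
        assert(layer_table[i].file != NULL);
        if (layer_table[i].prog_abi == prog_abi &&
            layer_table[i].kern_abi == kern_abi)
            return &layer_table[i];
    }
    return NULL;
}
```

    A NULL result corresponds to a program/kernel pairing for which
    nobody has built a layer yet.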

    This theoretically means that we can make any old program work with any
    new kernel by building the correct layer, independent of both the
    original program's and the original kernel's compilation.  This also
    means that we can make any new program work with any old kernel through
    the same means.

    Joerg, to make this work I need two other things:

    * We need to have the kernel automatically set up the initial TLS
      space.

    * We need to reserve some fixed positive-offset space in the TLS
      to hold a pointer to errno and other things that the layer might
      need to manage.  Since the layer is simply going to be mmap()'d and
      not dynamically linked, it cannot use the standard TLS variable space
      itself.  The layer may need some meta-data (basically one generic
      pointer's worth) to e.g. store a pointer to the shared memory area
      or to other interfacing aspects, private to the layer.

      The biggest piece of the puzzle here is storing a pointer to errno
      at a fixed %gs:POSITIVE_OFFSET so the syscall layer can actually
      set errno.
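
      A small C sketch of the reserved-slot idea, under loud
      assumptions: struct tls_reserved, sim_tls, tls_setup() and
      layer_return() are hypothetical names, an ordinary global stands
      in for the memory at %gs:POSITIVE_OFFSET, and the 'negative
      result means error' convention is a simplification chosen for the
      example rather than how the kernel reports errors.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical layout of the space reserved at a fixed positive TLS offset. */
struct tls_reserved {
    int  *errno_ptr;      /* where the layer must store error codes */
    void *layer_private;  /* one generic pointer owned by the layer */
};

/* Simulated TLS area; the real layer would reach it through %gs. */
struct tls_reserved sim_tls;

/* What the kernel (or thread setup code) would do when creating the TLS. */
void tls_setup(int *errno_location)
{
    sim_tls.errno_ptr = errno_location;
    sim_tls.layer_private = NULL;
}

/*
 * Layer-side helper: turn a raw result into the libc convention.
 * Assumption for this sketch only: a negative value carries the error code.
 */
long layer_return(long raw_result)
{
    if (raw_result < 0) {
        assert(sim_tls.errno_ptr != NULL);
        *sim_tls.errno_ptr = (int)-raw_result;
        return -1;
    }
    return raw_result;
}
```

      Because the layer finds errno through the fixed slot, it never
      needs to be dynamically linked against libc's own TLS variables.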

    The benefits of this are huge.

					-Matt
					Matthew Dillon 
					<dillon at xxxxxxxxxxxxx>