Capsicum API design

Joris Giovannangeli joris at giovannangeli.fr
Tue Sep 24 13:33:10 PDT 2013


Hi,

This is the current design of the new Capsicum API for DragonFly. It is
as compatible as possible with FreeBSD, and I tried to address the
comments and suggestions from Matt and Alex.

                                USER API

In userspace, a capability right is just a signed integer. We
distinguish two types of rights:

 1. predefined rights, which are statically created by the kernel. These
    are the CAP_* rights defined in sys/capability.h in FreeBSD.
 2. composed rights, which are dynamically created by composing the
    predefined ones.

The representation of rights stays in the kernel and is opaque to
userland. The cap_rights_t type from FreeBSD is just an integer used to
store these descriptors.

FreeBSD provides the following functions, which would be implemented as
follows:

 * cap_rights_t *cap_rights_init(cap_rights_t *rights, ...) : this
   function is a syscall (wrapper) which will create a new capability
   right in the kernel, and fill it with the list of rights provided.
   Rights are the integer descriptors described earlier. The syscall
   returns a descriptor to the new right.

 * void cap_rights_destroy(cap_rights_t *rights) : FreeBSD has no such
   function, but we need one to destroy the capability right in the
   kernel (or at least drop its refcount).

 * cap_rights_get/cap_rights_limit: same semantics as FreeBSD. Get a
   descriptor to the capability rights attached to the given file
   descriptor, or limit the rights of a file descriptor.

 * cap_rights_merge, cap_rights_contains, cap_rights_is_set,
   cap_rights_clear: all these functions are syscalls, with the same
   semantics as the FreeBSD ones.

 * cap_rights_is_valid: we don't really need it, but it could check
   whether the descriptor refers to a valid rights object.

Variadic functions are libc wrappers around syscalls. The syscalls take
an array of rights and a count, for instance:
cap_rights_init(int *rights, size_t count)
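
To illustrate the wrapper/syscall split, here is a minimal sketch. It
assumes a terminator value (CAP_END), an upper bound (CAP_INIT_MAX) and
a stub standing in for the syscall; none of these names are part of the
proposal, and the wrapper is named cap_rights_init_sketch so it is not
mistaken for the real signature.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

#define CAP_INIT_MAX 64   /* hypothetical bound on rights per call */
#define CAP_END      (-1) /* hypothetical list terminator */

typedef int cap_rights_t; /* opaque descriptor, as described above */

/* Stub standing in for the syscall: it only records its arguments. */
static int stub_args[CAP_INIT_MAX];
static size_t stub_count;

static cap_rights_t
sys_cap_rights_init(const int *rights, size_t count)
{
    assert(count <= CAP_INIT_MAX);
    for (size_t i = 0; i < count; i++)
        stub_args[i] = rights[i];
    stub_count = count;
    return 100; /* made-up descriptor value */
}

/* libc-side wrapper: flatten the variadic list into (array, count). */
static cap_rights_t
cap_rights_init_sketch(int first, ...)
{
    int rights[CAP_INIT_MAX];
    size_t count = 0;
    va_list ap;

    va_start(ap, first);
    for (int r = first; r != CAP_END; r = va_arg(ap, int)) {
        if (count == CAP_INIT_MAX) {
            va_end(ap);
            return -1; /* too many rights in one call */
        }
        rights[count++] = r;
    }
    va_end(ap);
    return sys_cap_rights_init(rights, count);
}
```

A caller would write cap_rights_init_sketch(3, 5, 7, CAP_END), and the
kernel side would only ever see the flat array and its length.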

Rationale:
  * Pros: using an opaque API lets the rights representation evolve
    without breaking the ABI.

  * Cons: overhead, since a lot of functions become syscalls in this
    implementation. However, they are not believed to be on a critical
    path. There is also a complexity cost to implementing the
    descriptors.

Open questions:
 * Should we use global or per-process descriptors for rights?
   Per-process descriptors could reduce contention.

                                 KERNEL

In the kernel, the base representation of a rights object is a bitfield
array:

typedef uint64_t cap_kern_right_t[CAP_RIGHTS_SIZE];


These rights are stored in a hashtable (or RB-tree, not really decided
yet). When a new rights object is created, for instance with
cap_rights_merge, it is looked up in the hashtable or RB-tree. If it
exists, a new id is allocated and points to the existing rights. If it
doesn't exist, it is built and added to the hashtable; the new id is
then allocated and points to it.

File descriptors contain pointers to the rights they have.
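
The lookup-or-create step could look roughly like the sketch below. To
stay short it replaces the hashtable/RB-tree with a linear scan over a
small fixed table, and uses a plain array for the id space; the names
(cap_entry, cap_intern, cap_id_alloc, cap_id_free) and the sizes are
assumptions made for illustration, and locking is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAP_RIGHTS_SIZE 2  /* assumed array length */
#define CAP_TABLE_MAX   16 /* assumed number of distinct rights */
#define CAP_ID_MAX      32 /* assumed number of descriptors */

typedef uint64_t cap_kern_right_t[CAP_RIGHTS_SIZE];

struct cap_entry {
    cap_kern_right_t bits;
    unsigned refs;         /* 0 means the slot is free */
};

static struct cap_entry cap_table[CAP_TABLE_MAX];
static struct cap_entry *cap_ids[CAP_ID_MAX]; /* id -> entry, NULL = free */

/* Return the shared entry for these bits, creating it if needed. */
static struct cap_entry *
cap_intern(const cap_kern_right_t bits)
{
    struct cap_entry *free_slot = NULL;

    for (int i = 0; i < CAP_TABLE_MAX; i++) {
        struct cap_entry *e = &cap_table[i];
        if (e->refs == 0) {
            if (free_slot == NULL)
                free_slot = e;
        } else if (memcmp(e->bits, bits, sizeof(cap_kern_right_t)) == 0) {
            e->refs++;     /* already known: share it */
            return e;
        }
    }
    if (free_slot == NULL)
        return NULL;       /* table full */
    memcpy(free_slot->bits, bits, sizeof(cap_kern_right_t));
    free_slot->refs = 1;
    return free_slot;
}

/* Allocate a fresh id pointing at the (possibly shared) entry. */
static int
cap_id_alloc(const cap_kern_right_t bits)
{
    for (int id = 0; id < CAP_ID_MAX; id++) {
        if (cap_ids[id] == NULL) {
            cap_ids[id] = cap_intern(bits);
            return cap_ids[id] != NULL ? id : -1;
        }
    }
    return -1;             /* no free id */
}

/* Drop an id; the entry slot becomes reusable when refs reaches 0. */
static void
cap_id_free(int id)
{
    assert(id >= 0 && id < CAP_ID_MAX && cap_ids[id] != NULL);
    cap_ids[id]->refs--;
    cap_ids[id] = NULL;
}
```

Two ids created from equal bitfields end up pointing at the same entry
with a refcount of 2, which is the sharing behaviour described above.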

Rationale:
 * Pros: it will scale well if CAP_RIGHTS_SIZE grows. It maps well to
   the idea of keeping rights bitfields in the kernel.

 * Cons: is it really needed? Currently, DragonFly uses 58 atomic
   rights, and FreeBSD has room for only 4 elements in the array. A
   pointer is needed, which costs the same space as one entry, and it
   needs one more memory access and a refcount.

cap_rights_merge can be implemented by iterating over the bitfield array
entries and ORing (|) each pair of entries. cap_rights_contains is the
same loop using & and an equality test.
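
As a minimal sketch of these two loops, assuming CAP_RIGHTS_SIZE is 2
and using the hypothetical names cap_kern_merge and cap_kern_contains
for the kernel-side helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_RIGHTS_SIZE 2 /* assumed array length */

typedef uint64_t cap_kern_right_t[CAP_RIGHTS_SIZE];

/* dst = a | b, entry by entry. */
static void
cap_kern_merge(cap_kern_right_t dst, const cap_kern_right_t a,
    const cap_kern_right_t b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < CAP_RIGHTS_SIZE; i++)
        dst[i] = a[i] | b[i];
}

/* True if every bit set in little is also set in big. */
static bool
cap_kern_contains(const cap_kern_right_t big,
    const cap_kern_right_t little)
{
    for (size_t i = 0; i < CAP_RIGHTS_SIZE; i++) {
        if ((big[i] & little[i]) != little[i])
            return false;
    }
    return true;
}
```

Both loops cost one pass over the array, so their cost grows linearly
with CAP_RIGHTS_SIZE.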

Open questions:
 * The obvious design here is copy-on-write. But it might be worth
   optimizing a bit. For instance, if the application makes the
   following calls:

   cap_rights_t rights;
   /*
    * First syscall, creates a rights bitfield array if it does not exist
    */
   rights = cap_rights_init(rights, CAP_FOO, CAP_BAR, CAP_FOO2);
   if (cond1)
        /* Another syscall, copy the bitfield */
        rights = cap_rights_merge(rights, someotherights);
   if (cond2)
        /* Another syscall, copy the bitfield */
        rights = cap_rights_merge(rights, someotherights);
   /* Another syscall, limit the rights of a fd */
   cap_rights_limit(fd, rights);

the intermediate copies are never used by anyone else and are private.
Adding a flag to mark a copy as exclusive, and therefore modifiable in
place, could be a good idea. But it adds even more complexity, and we
might end up with a quite large structure to store two longs...

cap_rights_limit takes a file descriptor and a rights descriptor, and
does the following operations:

  1. look up the bitfield array associated with the capability
     descriptor
  2. call cap_rights_contains to check that the new rights are a subset
     of the old ones
  3. update the rights of the file descriptor
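
The three steps above can be sketched as follows. The descriptor table,
the file_sketch structure and the function names are placeholders made
up for this example; refcounting and locking are left out, and a real
kernel would return an errno value rather than -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_RIGHTS_SIZE 1 /* assumed array length */
#define CAP_DESC_MAX    8 /* assumed descriptor table size */

/* Hypothetical stand-in for the file structure. */
struct file_sketch {
    const uint64_t *rights; /* points to a CAP_RIGHTS_SIZE array */
};

/* Hypothetical descriptor table: capability descriptor -> bitfield. */
static const uint64_t *cap_desc_table[CAP_DESC_MAX];

static const uint64_t *
cap_lookup(int capd)
{
    if (capd < 0 || capd >= CAP_DESC_MAX)
        return NULL;
    return cap_desc_table[capd];
}

static bool
cap_kern_contains(const uint64_t *big, const uint64_t *little)
{
    for (size_t i = 0; i < CAP_RIGHTS_SIZE; i++) {
        if ((big[i] & little[i]) != little[i])
            return false;
    }
    return true;
}

static int
cap_rights_limit_sketch(struct file_sketch *fp, int capd)
{
    assert(fp != NULL && fp->rights != NULL);

    /* 1. look up the bitfield array behind the descriptor */
    const uint64_t *newr = cap_lookup(capd);
    if (newr == NULL)
        return -1;

    /* 2. the new rights must be a subset of the current ones */
    if (!cap_kern_contains(fp->rights, newr))
        return -1;

    /* 3. update the rights of the file */
    fp->rights = newr;
    return 0;
}
```

Since step 2 rejects anything that is not a subset, repeated calls can
only narrow the rights of a file, never widen them.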

holdfp is the function which gets the file pointer referenced by a file
descriptor. It's where capability rights are checked during any syscall
that deals with a file descriptor at some point (that means: a lot of
them). holdfp must check that the rights it needs are a subset of the
rights attached to the file descriptor.

The first idea is to generate all the rights needed by all syscalls,
initialize them at boot, and then pass a pointer to a bitfield array
representing the needed rights when calling holdfp. To check rights
inclusion, we need to iterate over the whole bitfield array (the
contains function). If the array is short, as it is now (1 element),
that's OK. If it becomes larger, it can be quite inefficient.

A possible optimization is to group the atomic rights used by the same
syscall into a single bitfield entry, and then only check that entry.
holdfp then needs two arguments: the index of the entry in the bitfield
array, and the bitmask for this entry.
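
A minimal sketch of that single-entry check, assuming a 4-entry array
and a made-up file structure (file_sketch and holdfp_check are
illustrative names, not the real holdfp interface):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_RIGHTS_SIZE 4 /* assumed array length */

/* Hypothetical stand-in for the per-file rights. */
struct file_sketch {
    uint64_t rights[CAP_RIGHTS_SIZE];
};

/*
 * Check only one entry of the array: all bits of mask must be set in
 * rights[idx]. The cost is constant whatever CAP_RIGHTS_SIZE is.
 */
static bool
holdfp_check(const struct file_sketch *fp, size_t idx, uint64_t mask)
{
    assert(fp != NULL && idx < CAP_RIGHTS_SIZE);
    return (fp->rights[idx] & mask) == mask;
}
```

This only works if every right a given syscall needs lives in the same
entry, which constrains how atomic rights are assigned to entries.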

Open question:
    How to implement capability descriptors? With idr?

                                    SUMMARY

This is a hell :). This is maybe over-engineered; there is a clear
trade-off here: future-proofing and flexibility versus performance.

regards,
Joris
