New directions of the dragonfly capsicum implementation

Sun Oct 27 13:00:24 PDT 2013

Hi,

this is a follow-up to my previous mail about the new design directions
of capsicum in dragonfly. After a long talk, Matt Dillon convinced me to
broaden my thinking to handle more than the basic capsicum framework.

DragonFly has no security framework yet beyond the traditional
unix DAC model. I have no knowledge of previous work (yeah, I know, shame
on me!), and I tried to stick to the capsicum core principle: check
permissions at name resolution. But name resolution is not only
performed for filedescriptors. An obvious place is VFS path lookup (in
dragonfly, path resolution uses another structure on top of vnodes), but
you can also imagine sysctl lookup, etc.

So today, I wrote a generic framework to attach capabilities to any
kernel object using a common api. Capability types are opaque to
userspace applications, and are only manipulated in the kernel. Each
process has a set of capability descriptors, each of which points to a
kernel capability structure. There is a set of predefined capabilities,
and they are referenced by predefined descriptor values. For instance,
CAP_READ will be associated with descriptor 1, CAP_WRITE with
descriptor 2, etc.

A set of syscalls lets the user build new capabilities from existing ones:

cap_set(cap1, cap2) which builds a new cap as the union of the previous two.
cap_clear(cap1, cap2) which removes cap2 from cap1

and a syscall to test whether a cap is granted by another:

cap_test(cap1, cap2)

cap_new() creates a new right in the kernel and returns a descriptor for
this right.
cap_close(cap) destroys a capability and closes its descriptor.

To limit the number of syscalls performed, the previous syscalls take an
array of cap descriptors as second parameter, to perform multiple
operations at a time.
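
To make the set/clear/test semantics concrete, here is a minimal
userland sketch. It is only a model under stated assumptions: the real
capabilities are opaque kernel objects reached through descriptors,
while here a capability is a plain uint64_t mask. The model_* names and
the bit values are hypothetical, and I assume that the test succeeds
when the first cap grants every cap in the array.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: a capability is just a mask of rights. */
typedef uint64_t model_cap_t;

/* Illustrative bits only; the real values are not given in this mail. */
#define MODEL_CAP_READ   ((model_cap_t)1 << 0)
#define MODEL_CAP_WRITE  ((model_cap_t)1 << 1)

/* Like cap_set: union of cap and every entry of caps[]. */
model_cap_t
model_cap_set(model_cap_t cap, const model_cap_t *caps, size_t n)
{
	for (size_t i = 0; i < n; i++)
		cap |= caps[i];
	return cap;
}

/* Like cap_clear: remove every entry of caps[] from cap. */
model_cap_t
model_cap_clear(model_cap_t cap, const model_cap_t *caps, size_t n)
{
	for (size_t i = 0; i < n; i++)
		cap &= ~caps[i];
	return cap;
}

/* Like cap_test: 1 if cap grants every entry of caps[], else 0. */
int
model_cap_test(model_cap_t cap, const model_cap_t *caps, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if ((cap & caps[i]) != caps[i])
			return 0;
	}
	return 1;
}
```

Passing an array keeps one call per batch of operations, which is the
point of the array parameter described above.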

Capabilities are arranged in namespaces:
#define CAP_HASHTABLE_SIZE      1024

struct capability_namespace {
        cap_create_t            *cap_create;
        cap_destroy_t           *cap_destroy;
        cap_alloc_t             *cap_alloc;
        cap_hash_t              *cap_hash;
        cap_set_t               *cap_set;
        cap_test_t              *cap_test;
        cap_clear_t             *cap_clear;
        struct lwkt_token      cap_hashtable_tokens[CAP_HASHTABLE_SIZE];
        struct caplist          cap_hashtable[CAP_HASHTABLE_SIZE];
};

A namespace is a type of capability. For instance, capabilities attached
to filedescriptors, the ones used by capsicum, are in a common
namespace. Only capabilities from the same namespace can be merged,
cleared or tested. Capabilities from the same namespace are shared in a
hashtable, to reduce memory footprint. The cost in locking will be quite
small I hope, since the critical path does not use the hashtable.
Capabilities are shared when being attached to a kernel object, like a
filedescriptor (cap_rights_limit in the FreeBSD API). At this point, an
immutable refcounted copy is made if the capability is not yet in the
hashtable, and attached to the object.
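
The sharing step can be sketched roughly as follows. This is a
single-threaded userland model, not the kernel code: the per-bucket
lwkt tokens are left out, the capability is reduced to a rights mask,
and every model_* name as well as the hash function is an assumption
made for the example.

```c
#include <stdint.h>
#include <stdlib.h>

#define MODEL_HASH_SIZE 1024	/* mirrors CAP_HASHTABLE_SIZE */

struct model_cap {
	struct model_cap *next;	/* hash chain */
	uint64_t rights;	/* never modified once interned */
	unsigned refs;		/* number of objects sharing this copy */
};

struct model_ns {
	struct model_cap *buckets[MODEL_HASH_SIZE];
};

/* Hypothetical hash: fold the mask and reduce it to a bucket index. */
unsigned
model_hash(uint64_t rights)
{
	return (unsigned)((rights ^ (rights >> 32)) % MODEL_HASH_SIZE);
}

/* Return the shared copy for rights, creating it on first use. */
struct model_cap *
model_cap_intern(struct model_ns *ns, uint64_t rights)
{
	unsigned h = model_hash(rights);
	struct model_cap *c;

	for (c = ns->buckets[h]; c != NULL; c = c->next) {
		if (c->rights == rights) {
			c->refs++;
			return c;
		}
	}
	c = malloc(sizeof(*c));
	if (c == NULL)
		return NULL;
	c->rights = rights;
	c->refs = 1;
	c->next = ns->buckets[h];
	ns->buckets[h] = c;
	return c;
}

/* Drop one reference; unlink and free the copy when it was the last. */
void
model_cap_release(struct model_ns *ns, struct model_cap *cap)
{
	struct model_cap **pp;

	if (--cap->refs > 0)
		return;
	pp = &ns->buckets[model_hash(cap->rights)];
	while (*pp != cap)
		pp = &(*pp)->next;
	*pp = cap->next;
	free(cap);
}
```

Two objects limited to the same rights end up pointing at the same
structure, and the table is only touched at attach and detach time, not
on each capability check.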

This is the type of the capabilities attached to filedescriptors :

typedef uint64_t cap_right_t;

struct filedesc_capability {
        /* per-descriptor allowed fcntls */
        int32_t                fc_fcntls;
        /* allowed ioctls */
        struct ioctls_list     *fc_ioctls;
        /* atomic rights of the file */
        cap_right_t            fc_rights;
};

Since the structure is opaque, I ended up refactoring the CAP_* bitfield
by splitting rights per object type. For instance, CAP_GETSOCKOPTS has
no meaning for a vnode filedescriptor, and hence it's not stored for
this filedescriptor type. Some orthogonal rights share the same bit
index. This is not an issue if one of them gets a meaning for another
object type in the future, since the bitfield can be re-organized
without breaking userspace compatibility. That's why I'm able to stick
with a uint64_t for now, and I hope forever.
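
As an illustration of rights split per object type, here is a small
sketch. Everything in it is hypothetical (the model_* names, the object
types, the bit positions, and the choice of EINVAL as error code); it
only shows how two orthogonal rights can reuse the same bit index as
long as a capability never mixes object types.

```c
#include <errno.h>
#include <stdint.h>

enum model_objtype { MODEL_OBJ_NONE, MODEL_OBJ_VNODE, MODEL_OBJ_SOCKET };

/* A right is meaningful for one object type only. */
struct model_right {
	enum model_objtype type;
	uint64_t bit;
};

/* Two orthogonal rights sharing bit index 0 (illustrative values). */
static const struct model_right MODEL_VN_MKDIRAT = { MODEL_OBJ_VNODE, 1ULL << 0 };
static const struct model_right MODEL_SO_GETSOCKOPT = { MODEL_OBJ_SOCKET, 1ULL << 0 };

struct model_fdcap {
	enum model_objtype type;	/* fixed by the first right added */
	uint64_t rights;
};

/* Add a right; refuse rights that belong to another object type. */
int
model_fdcap_add(struct model_fdcap *cap, struct model_right r)
{
	if (cap->type != MODEL_OBJ_NONE && cap->type != r.type)
		return EINVAL;	/* assumed error code */
	cap->type = r.type;
	cap->rights |= r.bit;
	return 0;
}
```

With this layout, a capability for a vnode never stores socket rights,
so the same 64 bits can be reused by every object type.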

Like I said earlier, the capability check is performed at name lookup:
in holdfp (like fget), or in nlookup (like namei). Capability checks can
be stacked. I decided that the first check is the one which prevails.
Hence, if a descriptor is granted CAP_ALL and the namecache entry
it needs (for instance, the namecache entry referenced by the
filedescriptor for openat) has a more restricted right, the operation
will be allowed since the filedescriptor is the more authoritative name
reference. By default, filedescriptors are not created with CAP_ALL, but
with a NULL pointer, which means no cap at all.

This way, non-capsicum applications can still be sandboxed. In the long
term, I hope to be able to adapt jails to use this mechanism. For
instance, I could write a new empty FS, which can mount an arbitrary
directory hierarchy read-only. Then, null mounts could be made to any
directory of this hierarchy, and a capability attached to the root
namecache entry of each mount. The capability works the same as a
capability attached to a directory descriptor: it delegates a right to
its sub-hierarchy. A sandboxed application in such a jail can run
normally in an ambient authority context, but still be sandboxed with
fine-grained capabilities.

Let's have a deeper look at holdfp: holdfp is passed a pointer to a
pointer to a capability structure. It performs the capability check. If
it can decide (the capability associated with the descriptor is not
NULL) and the check passes, it returns normally and sets the cap pointer
to NULL. If the filedescriptor does not grant the needed capability, it
returns ENOTCAPABLE. But if the filedescriptor has no capability
(pointer NULL), it will return the file pointer but not clear the
capability pointer taken as argument.

Then, the pointer to pointer to a capability is passed to the underlying
lookup code, in this case nlookup (namei), which will perform a check.
Again, if the pointer is NULL, it will assume it does not need to
perform a capcheck. But if the pointer is not NULL and the namecache
entry has a capability attached, it will perform a capability check.
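
The two stacked checks can be modelled in userland like this. It is a
sketch under assumptions, not the kernel code: the model_* types and
functions are hypothetical stand-ins for holdfp and nlookup, a
capability is reduced to a rights mask, and MODEL_ENOTCAPABLE is a
placeholder value for ENOTCAPABLE.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ENOTCAPABLE 93	/* placeholder for ENOTCAPABLE */

struct model_cap  { uint64_t rights; };
struct model_file { const struct model_cap *f_cap; };	/* NULL: no cap at all */
struct model_ncp  { const struct model_cap *nc_cap; };	/* NULL: no cap at all */

/* 1 if have grants every right in need. */
static int
model_grants(const struct model_cap *have, const struct model_cap *need)
{
	return (have->rights & need->rights) == need->rights;
}

/*
 * holdfp-like step. *needp is the capability the caller needs.
 * When the descriptor can decide, *needp is cleared so that later
 * lookups skip their own check.
 */
int
model_holdfp(const struct model_file *fp, const struct model_cap **needp)
{
	if (*needp == NULL || fp->f_cap == NULL)
		return 0;		/* cannot decide here */
	if (!model_grants(fp->f_cap, *needp))
		return MODEL_ENOTCAPABLE;
	*needp = NULL;			/* first check prevails */
	return 0;
}

/* nlookup-like step: checks only if nobody decided before. */
int
model_nlookup(const struct model_ncp *ncp, const struct model_cap **needp)
{
	if (*needp == NULL || ncp->nc_cap == NULL)
		return 0;
	if (!model_grants(ncp->nc_cap, *needp))
		return MODEL_ENOTCAPABLE;
	return 0;
}
```

A descriptor holding every right therefore lets a write through a
read-only namecache entry, while a descriptor with no capability leaves
the decision to the namecache entry.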

Using this approach, an application can grant more powerful rights than
the sandbox ambient authority to a capsicum application running inside
the sandbox (I don't know if it's useful, but it's quite nice I think).

In the future, we can imagine attaching capabilities to more kernel
objects. For instance, we could adapt sysctl to this design, or create
more descriptor types which are not filedescriptors (filedescriptors
have a fileops vector constraint, which does not map to each kernel
object).

Capabilities could also be more than a rights bitfield. For instance,
Matt Dillon talked about decryption keys attached to a process to
perform encryption/decryption of a filesystem subvolume.

And it's not hard to implement the FreeBSD API on top of this framework.
At this point, I'm hitting only two compatibility issues: the lack of a
cap_right_destroy function in FreeBSD, and the fact that if you try to
set for instance CAP_GETSOCKOPTS and CAP_MKDIRAT on the same capability,
you'll get an error in my implementation.

Best regards,
Joris