Capsicum is a lightweight operating system (OS) capability and sandbox framework planned for inclusion in FreeBSD 9. Capsicum extends, rather than replaces, UNIX APIs, providing new kernel primitives (sandboxed capability mode and capabilities) and a userspace sandbox API. These tools support decomposition of monolithic UNIX applications into compartmentalized logical applications, an increasingly common goal that is supported poorly by existing OS access control primitives. We demonstrate our approach by adapting core FreeBSD utilities and Google's Chromium Web browser to use Capsicum primitives, and compare the complexity and robustness of Capsicum with other sandboxing techniques.
Capsicum is an API that brings capabilities, unforgeable tokens of authority, to UNIX. Fine-grained capabilities have long been the province of research operating systems (OSs) such as EROS.17 UNIX systems have less fine-grained access control, but are widely deployed. Capsicum's additions to the UNIX API suite give application authors an adoption path for one of the ideals of OS security: least-privilege operation. We validate Capsicum through a prototype built on (and planned for inclusion in) FreeBSD 9.0.
Today, many security-critical applications have been decomposed into sandboxed parts in order to mitigate vulnerabilities. Privilege separation,12 or compartmentalization, has been adopted for applications such as OpenSSH, Apple's SecurityServer, and Google's Chromium Web browser. Sandboxing is enforced using various access control techniques, but only with significant programmer effort and limitations: current OSes are simply not designed for this purpose.
Conventional (non-capability-oriented) OSes primarily use Discretionary Access Control (DAC) and Mandatory Access Control (MAC). DAC allows the owner of an object (such as a file) to specify the permissions other users have for it, which are checked when the object is accessed. MAC enforces systemic policies: administrators specify requirements (e.g., "users cleared to Secret may not read Top Secret documents"), which are checked when objects are accessed.
Neither approach was designed to address the case of a single application processing many types of information on behalf of one user. Modern Web browsers must parse HTML, scripting languages, images, and video from many untrusted sources, but act with the ambient authority of the user, having access to all their resources. In order to protect user data from malicious JavaScript, Flash, etc., the Chromium Web browser operates as several OS processes sandboxed using DAC or MAC. Both require significant programmer effort (from hundreds of lines of code to, in one case, 22,000 lines of C++) and often elevated privilege to use them. Our analyses show significant vulnerabilities in all of these sandbox models due to inherent flaws or incorrect use (Section 5).
Capsicum addresses these problems by introducing new (and complementary) security primitives to support compartmentalization: capability mode and capabilities. Capabilities extend UNIX file descriptors, encapsulating rights on specific objects, such as files or sockets; they may be delegated from process to process. Capability mode processes can access only resources that they have been explicitly delegated. Capabilities should not be confused with OS privileges, occasionally described as POSIX capabilities, which are exemptions from access control or system integrity protections, such as the right to override file permissions.
We have modified several applications, including UNIX utilities and Chromium, to use Capsicum. No special privilege is required, and code changes are minimal: the tcpdump utility, plagued with past security vulnerabilities, can be sandboxed with Capsicum in around 10 lines of C, and Chromium in just 100 lines. In addition to being more secure and easier to use than other sandboxing techniques, Capsicum performs well: unlike pure capability systems that employ extensive message passing, Capsicum system calls are just a few percent slower than their UNIX counterparts.
Capsicum blends capabilities with UNIX, achieves many of the benefits of least-privilege operation, preserves existing UNIX APIs and performance, and offers application authors an adoption path for capability-oriented software design. Capsicum extends, rather than replaces, standard UNIX APIs by adding kernel-level primitives (a sandboxed capability mode, capabilities, and others) and userspace support code (libcapsicum and a capability-aware runtime linker). These extensions support application compartmentalization, the decomposition of monolithic applications into logical applications whose components run in sandboxes (Figure 1).
Capsicum requires application modification to exploit new security functionality, but this may be done gradually, rather than requiring a wholesale conversion to a pure capability model. Developers can select the changes that maximize positive security impact while minimizing unacceptable performance costs; where Capsicum replaces existing sandbox technology, a performance improvement may even be seen.
Capsicum incorporates many pragmatic design choices, emphasizing compatibility and performance over capability purism, not least by eschewing microkernel design. While applications may adopt message-passing, and indeed will need to do so to fully benefit from the Capsicum architecture, we provide "fast paths" in the form of direct system calls operating on delegated file descriptors. This allows native UNIX I/O performance, while leaving the door open to techniques such as message-passing system calls if that proves desirable.
2.1. Capability mode
Capability mode is a process credential flag set by a new system call, cap_enter(); once set, the flag cannot be cleared, and it is inherited by all descendent processes. Processes in capability mode are denied access to global namespaces such as absolute filesystem paths and PIDs (Figure 1). Several system management interfaces must also be protected to maintain UNIX process isolation (including /dev device nodes, some ioctl() operations, and APIs such as reboot()).
Capability mode system calls are restricted: those requiring global namespaces are blocked, while others are constrained. For instance, sysctl() can be used not only to query process-local information such as address space layout, but also to monitor a system's network connections. Roughly 30 of 3000 sysctl() MIB entries are permitted in capability mode.
Other constrained system calls include shm_open(), which is permitted to create anonymous memory objects but not named ones, and the openat() family. These calls accept a directory descriptor argument relative to which open(), rename(), etc. path lookups occur; in capability mode, operations are limited to objects "under" the passed directory.
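The "under the passed directory" rule can be pictured with a small userspace model. This is only a sketch: the helper name path_stays_under() is hypothetical, the real check happens inside the kernel's path lookup, and this model deliberately ignores symbolic links.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not a Capsicum API): returns true if `path`, when
 * resolved relative to a directory descriptor, can only name objects under
 * that directory. Absolute paths and any ".." component are rejected.
 * Symbolic links are NOT considered in this model. */
bool path_stays_under(const char *path)
{
    assert(path != NULL);
    if (path[0] == '\0' || path[0] == '/')
        return false;                 /* empty, or absolute: global namespace */

    const char *p = path;
    while (*p != '\0') {
        size_t len = strcspn(p, "/"); /* length of the next path component */
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;             /* ".." could climb out of the subtree */
        p += len;
        while (*p == '/')
            p++;                      /* skip one or more separators */
    }
    return true;
}
```

A caller could gate a POSIX openat(dirfd, path, flags) call on this check; in capability mode the kernel enforces the restriction itself, so no such userspace filter is needed there.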
2.2. Capabilities
The most critical choice in adding capability support to a UNIX system is the relationship between capabilities and file descriptors. Some systems, such as Mach, maintain entirely independent notions: Mac OS X provides each task with both capabilities (Mach ports) and BSD file descriptors. Separating these concerns is logical, as ports have different semantics from file descriptors; however, confusing results can arise for application developers dealing with both Mach and BSD APIs, and we wish to reuse existing APIs as much as possible. Instead, we extend the file descriptor abstraction, and introduce a new file descriptor type, the capability, to wrap and protect raw file descriptors.
File descriptors already have some properties of capabilities: they are unforgeable tokens of authority, and can pass between processes via inheritance or inter-process communication (IPC). Unlike "pure" capabilities, they confer very broad rights: even if a file descriptor is read-only, meta-data writes such as fchmod() are permitted. In Capsicum, we restrict file descriptor operations by wrapping the descriptor in a capability that masks available operations (Figure 2).
There are roughly 60 capability mask rights, striking a balance between message-passing (two rights: send and receive) and MAC systems (hundreds of access control checks). Capability rights align with methods on underlying objects: system calls implementing similar operations require the same rights, and calls may require multiple rights. For example, pread() (read file data) requires CAP_READ, and read() (read data and update the file offset) requires CAP_READ|CAP_SEEK.
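The pread()/read() example amounts to a bitmask test: an operation is allowed only if every right it requires is present in the capability's mask. A minimal sketch follows; the right names mirror the text, but the MODEL_ prefix, the bit values, and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t rights_t;

/* Names follow the text; the bit positions are assumptions. */
#define MODEL_CAP_READ  ((rights_t)1 << 0)
#define MODEL_CAP_WRITE ((rights_t)1 << 1)
#define MODEL_CAP_SEEK  ((rights_t)1 << 2)

enum model_op { OP_PREAD, OP_READ };

/* Rights each modeled operation needs, as described in the text. */
rights_t required_rights(enum model_op op)
{
    switch (op) {
    case OP_PREAD: return MODEL_CAP_READ;                  /* data only */
    case OP_READ:  return MODEL_CAP_READ | MODEL_CAP_SEEK; /* data + offset */
    }
    assert(!"unknown operation");
    return ~(rights_t)0;  /* if asserts are disabled, demand everything */
}

/* True only if `held` contains every right the operation requires. */
bool rights_allow(rights_t held, enum model_op op)
{
    rights_t need = required_rights(op);
    return (held & need) == need;
}
```

With a mask of only MODEL_CAP_READ, the modeled pread() is allowed but read() is refused, because read() also moves the file offset.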
The cap_new() system call creates a new capability given an existing file descriptor and rights mask; if the original descriptor is a capability, the new rights must be a subset of the original rights. Capabilities can wrap any type of file descriptor including directories, which can be passed as arguments to openat() and related system calls. Directory capabilities delegate namespace subtrees, which may be used with *at() system calls (Figure 3). As a result, sandboxed processes can access multiple files in a directory without the performance overhead or complexity of proxying each open() to a process with ambient authority via IPC.
Many past security extensions have composed poorly with UNIX security, leading to vulnerabilities; thus, we disallow privilege elevation via fexecve() using setuid and setgid binaries in capability mode. This restriction does not prevent setuid binaries from using sandboxes.
2.3. Runtime environment
Manually creating sandboxes without leaking resources via file descriptors, memory mappings, or memory contents is difficult. libcapsicum provides a high-level API for managing sandboxes, hiding the implementation details of cutting off global namespace access, closing file descriptors not delegated to the sandbox, and flushing the address space via fexecve(). libcapsicum returns a socket that can be used for IPC with the sandbox, and to delegate further capabilities (Table 1).
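Delegating a descriptor over an IPC socket is conventionally done on UNIX with SCM_RIGHTS control messages on a local socket. The text does not spell out how libcapsicum does this internally, so the sketch below shows only the standard POSIX mechanism; the helper names send_fd() and recv_fd() are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor `fd` over UNIX-domain socket `sock`; 0 on success. */
int send_fd(int sock, int fd)
{
    assert(fd >= 0);
    char byte = 0;                        /* one byte of ordinary payload */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;            /* payload is a descriptor */
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`; returns it, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_level != SOL_SOCKET ||
        c->cmsg_type != SCM_RIGHTS || c->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;                        /* no descriptor attached */

    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(fd));
    return fd;
}
```

The receiver gets a new descriptor number referring to the same underlying object, which is what lets a host process hand a sandbox exactly the files it should touch.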
3.1. Kernel changes
Most constraints are applied in the implementation of kernel services, rather than by filtering system calls. The advantage of this approach is that a single constraint, such as denying access to the global file system (FS) namespace, can be implemented in one place, namei(), which is responsible for processing all path lookups. For example, one might not have expected the fexecve() call to cause global namespace access, since it takes a file descriptor for a binary as its argument rather than a path. However, the binary passed by file descriptor specifies its runtime linker via an embedded path, which the kernel will implicitly open and execute.
Similarly, capability rights are checked by the kernel function fget(), which converts a numeric descriptor into a struct file reference. We have added a new rights argument, allowing callers to declare what capability rights are required to perform the current operation. If the file descriptor is a raw file descriptor, or wrapped by a capability with sufficient rights, the operation succeeds. Otherwise, ENOTCAPABLE is returned. Changing the signature of fget() allows us to use the compiler to detect missed code paths, giving us greater confidence that all cases have been handled.
One less trivial global namespace to handle is the process ID (PID) namespace, which is used for process creation, signaling, debugging, and exit status: critical operations for a logical application. A related problem is that libraries cannot create and manage worker processes without interfering with process management in the application itself, as unexpected process IDs may be returned by wait(). Process descriptors address these problems in a manner similar to Mach task ports: creating a process with pdfork() returns a file descriptor suitable for process management tasks, such as monitoring for exit via poll(). When a process descriptor is closed, its process is terminated, providing a user experience consistent with that of monolithic processes: when the user hits Ctrl-C, all processes in the logical application exit.
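The idea of watching a child through a descriptor can be approximated with portable POSIX calls. The sketch below is an analogue only, not pdfork(): it uses a pipe whose write end lives solely in the child, so the read end reports hang-up when the child exits. It still relies on a PID for reaping, which real process descriptors avoid. All function names here are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `status`. Returns a descriptor that signals
 * end-of-file/hang-up once the child has exited, or -1 on error. */
int spawn_with_exit_fd(int status, pid_t *pid_out)
{
    assert(pid_out != NULL);
    int p[2];
    if (pipe(p) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);
        _exit(status);      /* the kernel closes p[1] when the child exits */
    }
    close(p[1]);            /* parent keeps only the read end */
    *pid_out = pid;
    return p[0];
}

/* Wait up to timeout_ms for the exit notification; 0 if seen, -1 if not. */
int wait_for_exit(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, timeout_ms) <= 0)
        return -1;
    return (pfd.revents & (POLLHUP | POLLIN)) ? 0 : -1;
}

/* Reap the child and return its exit status, or -1 if it did not exit
 * normally. */
int reap_child(pid_t pid)
{
    int st;
    if (waitpid(pid, &st, 0) != pid || !WIFEXITED(st))
        return -1;
    return WEXITSTATUS(st);
}
```

Because the notification arrives through poll(), a library could fold worker-process monitoring into an existing event loop rather than competing with the application for wait() results.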
3.2. The Capsicum runtime environment
Removing access to global namespaces forces fundamental changes to the UNIX runtime environment. Even the most basic UNIX operations for starting processes and running programs are restricted: fork() and exec() rely on global PID and FS namespaces, respectively.
Responsibility for launching a sandbox is split between libcapsicum and rtld-elf-cap. libcapsicum is invoked by the application, forks a new process using pdfork(), gathers delegated capabilities from the application and libraries, and directly executes the runtime linker, passing the target binary as a capability. Directly executing the capability-aware runtime linker avoids dependence on fexecve() loading a runtime linker via the global FS namespace. Once rtld-elf-cap is executing in the new process, it links the binary using libraries loaded via directory capabilities. The application is linked against normal C libraries and has access to the full C runtime, subject to sandbox restrictions.
Programs call lcs_get() to look up delegated capabilities and retrieve an IPC handle so that they can process RPCs. Capsicum does not specify an Interface Description Language (IDL), as existing compartmentalized or privilege-separated applications have their own, often hand-coded, RPC marshalling already. Here, our design differs from historic microkernel systems, which universally have specified IDLs, such as the Mach Interface Generator (MIG).
libcapsicum's fdlist (file descriptor list) abstraction allows modular applications to declare a set of capabilities to be passed into sandboxes. This avoids hard-coding file descriptor numbers into the ABI between applications and their sandboxed components, a technique used in Chromium that we felt was likely to lead to bugs. Instead, application and library components bind file descriptors to names before creating a sandbox; corresponding code in the sandbox retrieves file descriptors using the same names.
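The name-binding idea can be sketched as a small table mapping strings to descriptors. This is an illustrative model only: the struct layout and the functions fdlist_bind() and fdlist_lookup() are hypothetical and are not the libcapsicum interface.

```c
#include <assert.h>
#include <string.h>

#define FDLIST_MAX      8
#define FDLIST_NAME_MAX 32

struct fdlist_entry { char name[FDLIST_NAME_MAX]; int fd; };
struct fdlist       { struct fdlist_entry entry[FDLIST_MAX]; int count; };

/* Host side: bind a descriptor to a name before the sandbox is created.
 * Returns 0 on success, -1 if the list is full, the name is too long or
 * already bound, or the descriptor is invalid. */
int fdlist_bind(struct fdlist *l, const char *name, int fd)
{
    assert(l != NULL && name != NULL);
    if (l->count >= FDLIST_MAX || strlen(name) >= FDLIST_NAME_MAX || fd < 0)
        return -1;
    for (int i = 0; i < l->count; i++)
        if (strcmp(l->entry[i].name, name) == 0)
            return -1;                       /* name already bound */
    strcpy(l->entry[l->count].name, name);   /* length checked above */
    l->entry[l->count].fd = fd;
    l->count++;
    return 0;
}

/* Sandbox side: retrieve a descriptor by the same name; -1 if absent. */
int fdlist_lookup(const struct fdlist *l, const char *name)
{
    assert(l != NULL && name != NULL);
    for (int i = 0; i < l->count; i++)
        if (strcmp(l->entry[i].name, name) == 0)
            return l->entry[i].fd;
    return -1;
}
```

Looking descriptors up by name means that adding a new delegated resource does not renumber the existing ones, which is the failure mode that hard-coded descriptor numbers invite.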
Adapting applications for sandboxing is a nontrivial task, regardless of the framework, as it requires analyzing programs to determine their resource dependencies and adopting a distributed system programming style in which components use message passing or explicit shared memory rather than relying on a common address space. In Capsicum, programmers have access to a number of programming models; each model has its merits and costs in terms of starting point, development complexity, performance, and security:
Modify applications to use cap_enter() directly in order to place an existing process with ambient authority in capability mode, retaining selected capabilities and virtual memory mappings. This works well for applications with simple structures such as "open all resources, process them in an I/O loop," e.g., programs in a UNIX pipeline or that use a single network connection. Performance overhead is extremely low, as changes consist of encapsulating file descriptor rights in capabilities, followed by entering capability mode. We illustrate this approach with tcpdump.
Reinforce existing compartmentalization with cap_enter(). Applications such as dhclient and Chromium are already structured for message passing, and so benefit from Capsicum without performance or complexity impact. Both programs have improved vulnerability mitigation under Capsicum.
Modify the application to use the libcapsicum API, possibly introducing new compartmentalization. libcapsicum offers a simpler and more robust API than handcrafted separation, but at a potentially higher performance cost: residual capabilities and virtual memory mappings are rigorously flushed. Introducing new separation in an application comes at a significant development cost: boundaries must be identified such that not only is security improved (i.e., code processing risky data is isolated), but the resulting performance is also acceptable. We illustrate this technique with gzip.
Compartmentalized application development is distributed application development, with components running in different processes and communicating via message passing. Commodity distributed debugging tools are, unfortunately, unsatisfying and difficult to use. While we have not attempted to extend debuggers, such as gdb, to better support compartmentalization, we have modified several FreeBSD tools to understand Capsicum, and take some comfort in the synchronous nature of compartmentalized applications.
The procstat command inspects kernel state of running processes, including file descriptors, memory mappings, and credentials. In Capsicum, these resource lists become capability lists, representing the rights available to the process. We have extended procstat to show Capsicum-related information, such as capability rights masks on file descriptors and a process credential flag indicating capability mode.
When adapting existing software to run in capability mode, identifying capability requirements can be tricky; often the best technique is to discover them through dynamic analysis, identifying missing dependencies by tracing real-world use. To this end, capability-related failures are distinguished by new errno values, ECAPMODE and ENOTCAPABLE. System calls such as open() are blocked in namei, rather than at the kernel boundary, so that paths are available in ktrace and DTrace.
Another common compartmentalized debugging strategy is to allow the multiprocess logical application to be run as a single process for debugging purposes. libcapsicum provides an API to query sandbox policy, making it easy to disable sandboxing for testing. As RPCs are generally synchronous, the thread stack in a sandbox is logically an extension of the thread stack in the host process, making the distributed debugging task less fraught than it might otherwise appear.
4.1. tcpdump
tcpdump provides not only an excellent example of Capsicum offering immediate security improvement through straightforward changes, but also the subtleties that arise when sandboxing software not written with that in mind. tcpdump has a simple model: compile a Berkeley Packet Filter (BPF) rule, configure a BPF device as an input source, and loop reading and printing packets. This structure lends itself to capability mode: resources are acquired early with ambient authority, and later processing requires only held capabilities. The bottom three lines of Figure 4 implement this change.
This change significantly improves security, as historically fragile packet-parsing code now executes with reduced privilege. However, analysis with the procstat tool is required to confirm that only desired capabilities are exposed, and reveals unconstrained access to /dev/pts/0, which would permit improper access to user input. Adding lc_limitfd calls as in Figure 4 prevents reading stdin while still allowing output. Figure 5 illustrates procstat, including capabilities wrapping file descriptors to narrow delegated rights.
ktrace reveals another problem: the DNS resolver depends on FS access, but only after cap_enter() (Figure 6). This illustrates a subtle problem with sandboxing: modular software often employs on-demand initialization scattered throughout its components. We correct this by proxying DNS via a local resolver daemon, addressing both FS and network address namespace concerns.
Despite these limitations, this example shows that even minor changes can lead to dramatic security improvements, especially for a critical application with a long history of security problems. An exploited buffer overflow, for example, will no longer yield arbitrary FS or network access.
4.2. dhclient
FreeBSD ships with the privilege-separated OpenBSD DHCP client. DHCP requires substantial privilege to open BPF descriptors, create raw sockets, and configure network interfaces, so is an appealing target for attackers: complex network packet processing while running with root privilege. Traditional UNIX provides only weak tools for sandboxing: the DHCP client starts as the root user, opens the resources its unprivileged component requires (raw socket, BPF descriptor, lease configuration file), forks a process to continue privileged network configuration, and then confines the parent process using chroot() and setuid(). Despite hardening of the BPF ioctl() interface to prevent reprogramming the filter, this confinement is weak: chroot() limits only FS access, and switching credentials offers poor protection against incorrectly configured DAC on System V IPC.
The two-line addition of cap_enter() reinforces existing sandboxing with Capsicum, limiting access to previously exposed global namespaces. As there has been no explicit flush of address space or capabilities, it is important to analyze what capabilities are retained by the sandbox (Figure 7). dhclient has done an effective job at eliminating directory access, but continues to allow sandboxes to submit arbitrary log messages, modify the lease database, and use a raw socket. It is easy to imagine extending dhclient to use capabilities to further constrain file descriptors inherited by the sandbox, for example, by limiting the IP raw socket to send() and recv(), disallowing ioctl(). I/O interposition could be used to enforce log message and lease file constraints.
4.3. Gzip
gzip presents an interesting target for several reasons: it implements risky compression routines that have suffered past vulnerabilities, executes with ambient user authority, yet is uncompartmentalized. UNIX sandboxing techniques, such as chroot() and sandbox UIDs, are a poor match not only because of their privilege requirement, but also because the notion of a single global application sandbox is inadequate. Many simultaneous gzip sessions can run independently for many different users, and placing them in the same sandbox provides few of the desired security properties.
The first step is to identify natural fault lines in the application: for example, code that requires ambient authority (e.g., opening files or network connections) and code that performs more risky activities (e.g., decoding data). In gzip, this split is obvious: the main run loop of the application opens input and output files, and supplies file descriptors to compression routines. This suggests a partitioning in which pairs of capabilities are passed to a sandbox for processing.
We modified gzip to optionally proxy compression and decompression to a sandbox. Each RPC passes input and output capabilities into a sandbox, as well as miscellaneous fields such as size, original filename, and modification time. By limiting capability rights to combinations of CAP_READ, CAP_WRITE, and CAP_SEEK, a tightly constrained sandbox is created, preventing access to globally named resources in the event a vulnerability in compression code is exploited.
This change adds 409 lines (16%) to the gzip source code, largely to marshal RPCs. In adapting gzip, we were initially surprised to see a performance improvement; investigation of this unlikely result revealed that we had failed to propagate the compression level (a global variable) into the sandbox, leading to the incorrect algorithm selection. This serves as a reminder that code not originally written for decomposition requires careful analysis. Oversights such as this one are not caught by the compiler: the variable was correctly defined in both processes, but values were not properly propagated.
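One way to guard against this class of oversight is to make every piece of state the sandbox needs an explicit field of the request, so nothing is read from a global on the far side. The sketch below is hypothetical: the struct, the wire layout, and the function names are assumptions and do not reproduce the actual gzip changes. Host byte order is used because both ends run on the same machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request: everything the sandboxed routine needs, including
 * the compression level that would otherwise live in a global variable. */
struct gz_request {
    int32_t level;      /* compression level, carried explicitly */
    int64_t size;       /* input size */
    int64_t mtime;      /* modification time */
    char    name[64];   /* original filename, NUL-terminated */
};

#define GZ_WIRE_SIZE (sizeof(int32_t) + 2 * sizeof(int64_t) + 64)

/* Serialize `r` into `buf`; returns bytes written, or 0 if buf is too small. */
size_t gz_request_pack(const struct gz_request *r, unsigned char *buf,
                       size_t cap)
{
    assert(r != NULL && buf != NULL);
    if (cap < GZ_WIRE_SIZE)
        return 0;
    size_t off = 0;
    memcpy(buf + off, &r->level, sizeof(r->level)); off += sizeof(r->level);
    memcpy(buf + off, &r->size,  sizeof(r->size));  off += sizeof(r->size);
    memcpy(buf + off, &r->mtime, sizeof(r->mtime)); off += sizeof(r->mtime);
    memcpy(buf + off, r->name,   sizeof(r->name));  off += sizeof(r->name);
    return off;
}

/* Parse a request received by the sandbox; 0 on success, -1 on bad length. */
int gz_request_unpack(struct gz_request *r, const unsigned char *buf,
                      size_t len)
{
    assert(r != NULL && buf != NULL);
    if (len != GZ_WIRE_SIZE)
        return -1;
    size_t off = 0;
    memcpy(&r->level, buf + off, sizeof(r->level)); off += sizeof(r->level);
    memcpy(&r->size,  buf + off, sizeof(r->size));  off += sizeof(r->size);
    memcpy(&r->mtime, buf + off, sizeof(r->mtime)); off += sizeof(r->mtime);
    memcpy(r->name,   buf + off, sizeof(r->name));
    r->name[sizeof(r->name) - 1] = '\0';   /* never trust the terminator */
    return 0;
}
```

The input and output capabilities themselves would travel alongside such a message as passed descriptors rather than as bytes in the buffer.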
Compartmentalization of gzip raises an important design question: is there a better way to apply sandboxing to applications most frequently used in pipelines? Seaborn has suggested one possibility: a Principle of Least Authority Shell (PLASH), in which the shell runs with ambient privilege but places pipeline components in sandboxes.16 We have begun to explore this approach on Capsicum, but observe that the design tension exists here as well: gzip's non-pipeline mode performs a number of application-specific operations requiring ambient privilege, and logic like this is equally awkwardly placed in the shell. On the other hand, when operating purely in a pipeline, the PLASH approach offers the possibility of near-zero application modification.
We are also exploring library self-compartmentalization, in which library code sandboxes itself transparently to the host application. This has motivated several of our process model design choices: masking SIGCHLD delivery to the parent when using process descriptors avoids disturbing application state. This approach would allow sandboxed video processing in unmodified Web browsers. However, library APIs are often not crafted for sandbox-friendliness: one reason we placed separation in gzip rather than libz is that whereas gzip internal APIs used file descriptors, libz APIs acted on buffers. Forwarding capabilities offers full I/O performance, whereas the cost of transferring buffers via RPCs grows with file size. This approach does not help where vulnerabilities lie in library API use; for example, historic vulnerabilities in libjpeg have centered on callbacks into applications.
4.4. Chromium
Google's Chromium Web browser uses a multiprocess logical application model to improve robustness.13 Each tab is associated with a renderer process that performs the complex and risky task of rendering page contents through parsing, image rendering, and JavaScript execution. More recently, Chromium has integrated sandboxing to improve resilience to malicious attacks using a variety of techniques (Section 5).
The FreeBSD port of Chromium did not include sandboxing, and the sandboxing facilities provided as part of the similar Linux and Mac OS X ports bear little resemblance to Capsicum. However, existing compartmentalization was a useful starting point: because Chromium assumes sandboxes cannot open files, certain services were already forwarded to renderers (e.g., font loading via passed file descriptors and renderer output via shared memory).
Roughly 100 lines of code were required to constrain file descriptors passed to sandboxes, such as Chromium pak files, stdio, /dev/random, and font files, to call cap_enter(), and to configure capability-oriented POSIX shared memory instead of System V IPC shared memory. This compares favorably with 4.3 million lines of Chromium source code, but would not have been possible without existing sandbox support.
Chromium provides an ideal context for a comparison with existing sandboxing mechanisms, as it employs six different sandboxing technologies (Table 2). Of these, two are DAC-based, two MAC-based, and two capability-based.
5.1. Windows ACLs and SIDs
On Windows, Chromium employs DAC to create sandboxes.13 The unsuitability of inter-user protections for the intra-user context is well demonstrated: the model is both incomplete and unwieldy. Chromium uses Access Control Lists (ACLs) and Security Identifiers (SIDs) to sandbox renderers on Windows. Chromium creates a SID with reduced privilege, which does not appear in the ACL of any object, in effect running the renderer as an anonymous user, and attaches renderers to an "invisible desktop," isolating them from the user's desktop environment. Many legitimate system calls are denied to sandboxed processes. These calls are forwarded to a trusted process responsible for filtering and processing, which comprises most of the 22,000 lines of code in the sandbox module.
Objects without ACL support are not protected, including FAT FSs and TCP/IP sockets. A sandbox may be unable to read NTFS files, but it can communicate with any server on the Internet or use a configured VPN. USB sticks present a significant concern, as they are used for file sharing, backup, and robustness against malware.
5.2. Linux chroot
Chromium's Linux suid model also attempts to create a sandbox using legacy access control; the result is similarly porous, but with the additional risk posed by the need for OS privilege to create the sandbox. In this model, access to the filesystem is limited to a directory via chroot(): the directory becomes the sandbox's virtual root directory. Access to other namespaces, including System V shared memory (where the user's X window server can be contacted) and network access, is unconstrained, and great care must be taken to avoid leaking resources when entering the sandbox.
Invoking chroot() requires a setuid binary helper with full system privilege. While similar in intent to Capsicum's capability mode, this model suffers from significant weaknesses (e.g., permitting full access to System V shared memory as well as all operations on passed file descriptors).
5.3. Mac OS X Sandbox
On Mac OS X, Chromium uses Apple's Sandbox system. Sandbox constrains processes according to a scheme-based policy language5 implemented via the MAC Framework.19 Chromium uses three policies for different components, allowing access to font directories while restricting access to the global FS namespace. Chromium can create stronger sandboxes than is possible with DAC, but rights granted to renderer processes are still very broad, and policy must be specified independently from code.
As with other techniques, resources are acquired before constraints are imposed, so care must be taken to avoid leaking resources into the sandbox. Fine-grained file system constraints are possible, but other namespaces such as POSIX IPC are an all-or-nothing affair. The Seatbelt-based sandbox model is less verbose than other approaches, but like all MAC systems, policy must be expressed separately from code. This can lead to inconsistencies and vulnerabilities.
5.4. SELinux
Chromium's SELinux sandbox employs a Type Enforcement (TE) policy.9 SELinux provides fine-grained rights management, but in practice, broad rights are often granted as fine-grained TE policies are difficult to write and maintain. SELinux requires that an administrator be involved in defining new policy, which is a significant inflexibility: application policies are effectively immutable.
The Fedora reference policy for Chromium creates a single SELinux domain, chrome_sandbox_t, shared by all renderers, risking potential interference. The domain is assigned broad rights, such as the ability to read the terminal device and all files in /etc. Such policies are easier to craft than fine-grained ones, reducing the impact of the dual-coding problem, but are less effective, allowing leakage between sandboxes and broad access to resources outside of the sandbox.
In contrast, Capsicum eliminates dual-coding by combining policy with code, with both benefits and drawbacks. Bugs cannot arise due to inconsistencies between policy and code, but there is no easily statically analyzable policy. This reinforces our belief that MAC and capabilities are complementary, filling different security niches.
5.5. Linux seccomp
Linux has an optionally compiled capability mode-like facility called seccomp. Processes in seccomp mode are denied access to all system calls except read(), write(), and exit(). At face value this seems promising, but software infrastructure is minimal, so application writers must write their own. In order to allow other system calls within sandboxes, Chromium constructs a process in which one thread executes in seccomp mode, and another thread shares the same address space and has full system call access. Chromium rewrites glibc to forward system calls to the trusted thread, where they are filtered to prevent access to inappropriate memory objects, opening files for write, etc. However, this default policy is quite weak, as read of any file is permitted.
The Chromium seccomp sandbox contains over a thousand lines of handcrafted assembly to set up sandboxing, implement system call forwarding, and craft a security policy. Such code is difficult to write and maintain, with any bugs likely leading to security vulnerabilities. Capsicum's approach resembles seccomp, but offers a rich set of services to sandboxes, so is easier to use correctly.
5.6. Summary of Chromium isolation models
Table 2 compares the security properties of the different sandbox models. Capsicum offers the most complete isolation across various system interfaces: FS, IPC, and networking (Net), as well as isolating sandboxes from one another (S ≠ S'), and avoiding the requirement for OS privilege to instantiate new sandboxes (Priv). Exclamation marks indicate cases where protection does exist in a model, but is either incomplete (FAT protection in Windows) or improperly used (while seccomp blocks open, Chromium re-enables it with excessive scope via forwarding).
Typical OS security benchmarks try to illustrate near-zero overhead in the hopes of selling general applicability of a technology. Our thrust is different: application authors already adopting compartmentalization accept significant overheads for mixed security return. Our goal is to accomplish comparable performance with significantly improved security. We summarize our results here; detailed exploration may be found in our USENIX Security paper.18
We evaluated performance by characterizing the overhead of Capsicum's new primitives through API micro-benchmarks and broader application benchmarks. We were unable to measure a performance change in our adapted tcpdump and dhclient due to the negligible cost of entering capability mode; on turning our attention to gzip, we found an overhead of 2.4 ms to decompress an empty file. Micro-benchmarks revealed a cost of 1.5 ms for creating and destroying a sandbox, largely attributable to process management. This cost is quickly amortized with growth in data file size: by 512K, performance overhead was <5%.
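The amortization argument can be written as a simple fixed-cost model. This is a sketch under an assumption the text does not state: that the sandboxing cost c is independent of file size and equals the 2.4 ms measured for an empty file. The symbols c, T(n), and r(n) are introduced here for illustration, and the 48 ms threshold is a consequence of the model, not a measured result.

```latex
\begin{align*}
  r(n) &= \frac{c}{T(n)}, \qquad c \approx 2.4\ \text{ms}
       && \text{(relative overhead for an $n$-byte file)} \\
  r(n) < 0.05 \;&\Longleftrightarrow\; T(n) > \frac{2.4\ \text{ms}}{0.05} = 48\ \text{ms}
       && \text{(baseline run time needed for $<5\%$ overhead)}
\end{align*}
```

Here T(n) is the unsandboxed run time. Under this model the relative overhead falls as file size grows, since T(n) increases with n while c stays fixed.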
Capsicum provides an effective platform for capability work on UNIX platforms. However, further research and development are required to bring this project to fruition.
Refinement of the Capsicum primitives would be useful. Performance might be improved for sandbox creation by employing Bittau's S-thread primitive.2 A formal "logical application" construct might improve termination properties.
Another area for research is in integrating user interfaces and OS security; Shapiro has proposed that capability-centered window systems are a natural extension to capability OSs. It is in the context of windowing systems that we have found capability delegation most valuable: gesture-based access control can be investigated through Capsicum enhancements to UI elements, such as Powerboxes (file dialogues with ambient authority) and drag-and-drop. Improving the mapping of application security into OS sandboxes would improve the security of Chromium, which does not consistently assign Web security domains to OS sandboxes.
Finally, it is clear that the single largest problem with Capsicum and similar approaches is programmability: converting local development into de facto distributed system development hampers application-writers. Aligning security separation with application structure is also important if such systems are to mitigate vulnerabilities on a large scale: how can the programmer identify and correctly implement compartmentalizations with real security benefits?
Saltzer and Schroeder's 1975 exploration of Multics-era OS protection describes the concepts of hardware capabilities and ACLs, and observes that systems combine the two approaches in order to offer a blend of protection and performance.14 Neumann et al.'s Provably Secure Operating System (PSOS),11 and successor LOCK, propose a tight integration of MAC and capabilities; TE is extended in LOCK to address perceived shortcomings in the capability model,15 and later appears in systems such as SELinux.9 We adopt a similar philosophy in Capsicum, supporting DAC, MAC, and capabilities.
Despite experimental hardware such as Wilkes' CAP computer,20 the eventual dominance of page-oriented virtual memory over hardware capabilities led to exploration of microkernel object-capability systems. Hydra,3 Mach,1 and, later, L4,8 epitomize this approach, exploring successively greater extraction of historic kernel components into separate tasks, and integrating message passing-based capability security throughout their designs. Microkernels have, however, been largely rejected by commodity OS vendors in favor of higher-performance monolithic kernels. Microkernel capability research has continued in the form of systems such as EROS,17 inspired by KEYKOS.6 Capsicum is a hybrid capability system, observably not a microkernel, and retains support for global namespaces (outside of capability mode), emphasizing compatibility over capability purism.
Provos's OpenSSH privilege separation12 and Kilpatrick's Privman7 in the early 2000s rekindled interest in microkernel-like compartmentalization projects, such as the Chromium Web browser13 and Capsicum's logical applications. In fact, large application suites compare formidably with the size and complexity of monolithic kernels: the FreeBSD kernel is composed of 3.8 million lines of C, whereas Chromium and WebKit come to a total of 4.1 million lines of C++. How best to decompose monolithic applications remains an open research question; Bittau's Wedge offers a promising avenue through automated identification of software boundaries.2
Seaborn and Hand have explored application compartmentalization on UNIX through capability-centric Plash,16 and Xen,10 respectively. Plash offers an intriguing layering of capability security over UNIX semantics by providing POSIX APIs over capabilities, but is forced to rely on the same weak UNIX primitives analyzed in Section 5. Hand's approach suffers from similar issues to seccomp, in that the runtime environment for Xen-based sandboxes is functionality-poor. Garfinkel's Ostia4 proposes a delegation-centric UNIX approach, but focuses on providing sandboxing as an extension, rather than a core OS facility.
We have described Capsicum, a capability security extension to the POSIX API to appear in FreeBSD 9.0 (with ports to other systems, including Linux, under way). Capsicum's capability mode and capabilities appear a more natural fit to application compartmentalization than widely deployed discretionary and mandatory schemes. Adaptations of real-world applications, from tcpdump to the Chromium Web browser, suggest that Capsicum improves the effectiveness of OS sandboxing. Unlike research capability systems, Capsicum implements a hybrid capability model that supports commodity applications. Security and performance analyses show that improved security is not without cost, but that Capsicum improves on the state of the art. Capsicum blends immediate security improvements to current applications with long-term prospects of a more capability-oriented future. More information is available at: http://www.cl.cam.ac.uk/research/security/capsicum/
We thank Mark Seaborn, Andrew Moore, Joseph Bonneau, Saar Drimer, Bjoern Zeeb, Andrew Lewis, Heradon Douglas, Steve Bellovin, Peter Neumann, Jon Crowcroft, Mark Handley, and the anonymous reviewers for their help.
1. Accetta, M., Baron, R., Golub, D., Rashid, R., Tevanian, A., Young, M. Mach: A New Kernel Foundation for UNIX Development. Technical report, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, Aug. 1986.
2. Bittau, A., Marchenko, P., Handley, M., Karp, B. Wedge: Splitting applications into reduced-privilege compartments. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation (2008), USENIX Association, 309–322.
3. Cohen, E., Jefferson, D. Protection in the Hydra operating system. In SOSP'75: Proceedings of the Fifth ACM Symposium on Operating Systems Principles (1975), ACM, NY, 141–160.
4. Garfinkel, T., Pfaff, B., Rosenblum, M. Ostia: A delegating architecture for secure system call interposition. In Proceedings of the Internet Society (2003).
5. Google, Inc. The Chromium Project: Design Documents: OS X Sandboxing Design. http://dev.chromium.org/developers/design-documents/sandbox/osx-sandboxing-design, Oct. 2010.
6. Hardy, N. KeyKos architecture. SIGOPS Oper. Syst. Rev. 19, 4 (1985), 8–25.
7. Kilpatrick, D. Privman: A library for partitioning applications. In Proceedings of USENIX Annual Technical Conference (2003), USENIX Association, 273–284.
8. Liedtke, J. On microkernel construction. In SOSP'95: Proceedings of the 15th ACM Symposium on Operating Systems Principles (Copper Mountain Resort, CO, Dec. 1995).
9. Loscocco, P.A., Smalley, S.D. Integrating flexible support for security policies into the Linux operating system. In Proceedings of the USENIX Annual Technical Conference (June 2001), USENIX Association, 29–42.
10. Murray, D.G., Hand, S. Privilege separation made easy. In Proceedings of the ACM SIGOPS European Workshop on System Security (EUROSEC) (2008), ACM, 40–46.
11. Neumann, P.G., Boyer, R.S., Feiertag, R.J., Levitt, K.N., Robinson, L. A Provably Secure Operating System: The System, Its Applications, and Proofs, Second Edition. Technical Report CSL-116, Computer Science Laboratory, SRI International, Menlo Park, CA, May 1980.
12. Provos, N., Friedl, M., Honeyman, P. Preventing privilege escalation. In Proceedings of the 12th USENIX Security Symposium (2003), USENIX Association.
13. Reis, C., Gribble, S.D. Isolating web programs in modern browser architectures. In EuroSys'09: Proceedings of the 4th ACM European Conference on Computer Systems (2009), ACM, NY, 219–232.
14. Saltzer, J.H., Schroeder, M.D. The protection of information in computer systems. In Proceedings of the IEEE 63, 9 (Sep. 1975), 1278–1308.
15. Sami Saydjari, O. Lock: An historical perspective. In Proceedings of the 18th Annual Computer Security Applications Conference (2002), IEEE Computer Society.
16. Seaborn, M. Plash: Tools for practical least privilege, 2007. http://plash.beasts.org/
17. Shapiro, J., Smith, J., Farber, D. EROS: A fast capability system. In SOSP'99: Proceedings of the Seventeenth ACM Symposium on Operating Systems Principles, Dec. 1999.
18. Watson, R.N.M., Anderson, J., Laurie, B., Kennaway, K. Capsicum: Practical capabilities for UNIX. In Proceedings of the 19th USENIX Security Symposium (2010), USENIX Association, Berkeley, CA.
19. Watson, R.N.M., Feldman, B., Migus, A., Vance, C. Design and implementation of the TrustedBSD MAC framework. In Proceedings of the Third DARPA Information Survivability Conference and Exhibition (DISCEX) (April 2003), IEEE.
20. Wilkes, M.V., Needham, R.M. The Cambridge CAP Computer and Its Operating System (Operating and Programming Systems Series). Elsevier North-Holland, Inc., Amsterdam, the Netherlands, 1979.
b. Supported by the Rothermere Foundation and the Natural Sciences and Engineering Research Council of Canada.
The original version of this paper "Capsicum: Practical Capabilities for UNIX" was published in the Proceedings of the 19th USENIX Security Symposium, 2010.
Figure 1. Application self-compartmentalization.
Figure 2. Capabilities "wrap" normal file descriptors.
Figure 3. FS delegation to sandboxes.
Figure 4. Capsicum changes to tcpdump.
Figure 5. procstat -C displays a process's capabilities.
©2012 ACM 0001-0782/12/0300 $10.00