A seccomp overview

By Jake Edge
September 2, 2015

In the "refereed talks" track at the Linux Plumbers Conference, Michael Kerrisk looked at the "secure computing" (seccomp) facility in the kernel and how it can be used to reduce the kernel's attack surface. Seccomp is a topic that has come up fairly frequently here at LWN, but we have mostly looked at the development process of the feature, while Kerrisk provided a nice overview and some ideas about how it can be used. As he put it, seccomp has a long history in the kernel, but it has gotten much more interesting in the last few years—and it is still expanding.

Kerrisk introduced himself as the maintainer of the Linux man-pages project. He also reviews and tests new kernel APIs, with an eye toward documenting them. His "day job" is as a programmer, trainer, and writer, he said.

The idea behind seccomp is to restrict the system calls that can be made from a process, he said. The Linux kernel has a few hundred system calls, but most of them are not needed by any given process. If a process can be compromised and tricked into making other system calls, though, it may lead to a security vulnerability that could result in the compromise of the whole system. By restricting what system calls can be made, seccomp is a key component for building application sandboxes.

History

The first version of seccomp was merged in 2005 into Linux 2.6.12. It was enabled by writing a "1" to /proc/PID/seccomp. Once that was done, the process could only make four system calls: read(), write(), exit(), and sigreturn(). The latter is a call made "behind the scenes" for signal handlers, he said. Any other system call made by the process would result in a SIGKILL. The idea and patches came from Andrea Arcangeli as a way to securely run other people's code, so that there could be a marketplace for selling unused CPU cycles. The idea never really took off, however.

In 2007, the interface to enable seccomp changed in kernel 2.6.23. A prctl() operation (PR_SET_SECCOMP with the SECCOMP_MODE_STRICT argument) was added and the /proc interface was removed. The corresponding PR_GET_SECCOMP operation had rather interesting behavior: it would return zero if the process was not in seccomp mode, but it would raise a SIGKILL if it was (since prctl() is not a permitted system call). It is, Kerrisk said, evidence that kernel developers do have a sense of humor.

Things were calm in seccomp land for the next five years or so until "seccomp mode 2" (or "seccomp filter mode") was added to Linux 3.5 in 2012. It added a second mode for seccomp: SECCOMP_MODE_FILTER. Using that mode, processes can specify which system calls are permitted. By using a mini-program in the Berkeley packet filter (BPF) language, processes could restrict system calls entirely or only for certain argument values. There are now a number of tools that are using seccomp filters, including the Chrome/Chromium browser, OpenSSH, vsftpd, and Firefox OS. It is "not quite" in Docker yet, he said.

By 3.8 in 2013, "the joke is getting old", so a numeric "Seccomp" field was added to /proc/PID/status. Reading that file will allow a process to discover its seccomp mode (0 for disabled, 1 for strict, and 2 for filter). Kerrisk noted that the process may need to obtain a file descriptor for that file from elsewhere in order to be sure it won't receive a SIGKILL.
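As a rough illustration of the status-file check described above, the following C sketch extracts the "Seccomp" field. The helper names parse_seccomp_mode() and current_seccomp_mode() are invented for this example; the only interface assumed is the /proc/PID/status field mentioned in the talk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the numeric "Seccomp" field found in the given text (the contents
 * of a /proc/PID/status file), or -1 if the field is absent. */
int parse_seccomp_mode(const char *status_text)
{
    assert(status_text != NULL);

    const char *line = status_text;
    while (line != NULL && *line != '\0') {
        int mode;

        /* The space in the format matches the tab that follows the colon. */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            return mode;

        line = strchr(line, '\n');
        if (line != NULL)
            line++;             /* step past the newline to the next line */
    }
    return -1;
}

/* Read /proc/self/status and report the caller's seccomp mode
 * (0 = disabled, 1 = strict, 2 = filter), or -1 on error. */
int current_seccomp_mode(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_seccomp_mode(buf);
}
```

In strict mode, the fopen() call here would itself be fatal, which is why the file descriptor may need to come from elsewhere.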

The seccomp() system call was added in 2014 (Linux 3.17) rather than further multiplexing the "monster" prctl() system call. The seccomp() system call provides a superset of the existing functionality. It also adds the ability to synchronize all threads of a process to the same set of filters, which is useful to ensure that even threads created before the filters are installed are still subject to them.

BPF

The seccomp filter mode allows developers to write BPF programs that determine whether a given system call will be allowed or not. That decision can be based on the system call number and on the argument values, which come from the registers (up to six) in which they are passed. Only the values passed are available, as any pointer arguments are not dereferenced by the BPF virtual machine.

Filters are installed using either seccomp() or prctl(). The BPF program must be constructed first, then installed in the kernel; after that, every system call triggers the filter code. Also, filters cannot be removed once they have been installed, since installing a filter is effectively a declaration that any subsequently executed code is not trusted.

The BPF language nearly predates Linux, Kerrisk said. It came about in 1992 for the tcpdump program, which is a monitoring tool for network packets. But the volume of packets can be enormous, so transferring all of them to user space for filtering there is quite expensive. BPF provided a way to do in-kernel filtering so that user space only needed to handle those packets it was interested in.

The seccomp filter developers realized that they wanted to do a similar task, so BPF was generalized to allow system call filtering. There is a small in-kernel virtual machine that interprets a simple set of BPF instructions.

BPF allows branches, but only forward branches so there can be no loops. That guarantees that the program will complete, which is important when accepting code to run inside the kernel. BPF programs are limited to 4096 instructions and their validity can be verified at load time. In addition, the verifier can ensure that the program will always terminate with a return instruction that tells the kernel what action to take for the system call.
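The load-time checks can be pictured with a toy validator. This is a simplified sketch and not the kernel's verifier: filter_looks_valid() and jump_target_in_bounds() are hypothetical names, and only three properties are checked (the length limit, jumps that stay inside the program, and a final return instruction). It relies on struct sock_filter and the BPF_* macros from <linux/filter.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_* macros */

#define MAX_INSNS 4096      /* program size limit quoted in the text */

/* True if a jump of `offset` instructions from `pc` lands inside the program.
 * Offsets are relative to the instruction that follows the jump. */
static bool jump_target_in_bounds(size_t pc, size_t offset, size_t len)
{
    assert(pc < len);
    return offset < len - pc - 1;
}

bool filter_looks_valid(const struct sock_filter *prog, size_t len)
{
    if (prog == NULL || len == 0 || len > MAX_INSNS)
        return false;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        if (BPF_CLASS(insn->code) != BPF_JMP)
            continue;

        if (BPF_OP(insn->code) == BPF_JA) {
            /* Unconditional jump: the offset is held in the 32-bit field. */
            if (!jump_target_in_bounds(pc, insn->k, len))
                return false;
        } else if (!jump_target_in_bounds(pc, insn->jt, len) ||
                   !jump_target_in_bounds(pc, insn->jf, len)) {
            return false;
        }
    }

    /* Offsets are unsigned, so every jump goes forward; if the last
     * instruction is a return, every path must end at some return. */
    return BPF_CLASS(prog[len - 1].code) == BPF_RET;
}
```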

Further generalization of BPF is ongoing. Extended BPF (eBPF) has been added to the kernel, as have filters for tracepoints (Linux 3.18) and for raw sockets (3.19). Also, filtering for perf events with eBPF was merged for the 4.1 kernel.

BPF has a single accumulator register, a data area (for seccomp, that contains information about the system call), and an implicit program counter. All instructions are 64 bits in length, with 16 bits dedicated to the opcode, two 8-bit fields for jump destinations, and a 32-bit field to hold values with an interpretation that is opcode-dependent.
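The instruction format described above corresponds to struct sock_filter in <linux/filter.h>. The short sketch below reports the width of each field; struct insn_layout and bpf_insn_layout() are names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>   /* struct sock_filter */

/* Field widths, in bits, of one classic BPF instruction. */
struct insn_layout {
    size_t total_bits;
    size_t code_bits;   /* opcode */
    size_t jt_bits;     /* jump-if-true offset */
    size_t jf_bits;     /* jump-if-false offset */
    size_t k_bits;      /* opcode-dependent value */
};

struct insn_layout bpf_insn_layout(void)
{
    struct sock_filter insn;
    struct insn_layout layout = {
        .total_bits = sizeof(insn) * 8,
        .code_bits  = sizeof(insn.code) * 8,
        .jt_bits    = sizeof(insn.jt) * 8,
        .jf_bits    = sizeof(insn.jf) * 8,
        .k_bits     = sizeof(insn.k) * 8,
    };

    /* The four fields fill the instruction with no padding. */
    assert(layout.code_bits + layout.jt_bits + layout.jf_bits +
           layout.k_bits == layout.total_bits);
    return layout;
}
```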

The basic set of instructions that you would expect are available in BPF: load, store, jump, arithmetic and logic operations, and return. There are conditional and unconditional jump instructions, with the latter using the 32-bit field as its offset. Conditional jumps use the two jump destination fields in the instruction—each holds an offset to jump to depending on whether the condition is true or false.

Due to having two jump destinations, BPF can get by with a simpler set of conditional jump instructions (for example, it has "jump if equal", but no "jump if not equal"), since the two offsets can be swapped if the other sense of the comparison is needed. The destinations are offsets, so 0 means "no jump" (execute the next instruction) and, because they are 8-bit values, the maximum jump is 255 instructions. As mentioned earlier, negative offsets are not allowed to avoid any possibility of looping.
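The offset-swapping trick might look like the following in C. The BPF_JUMP macro and opcode constants come from <linux/filter.h>; the wrapper functions jump_if_equal() and jump_if_not_equal() are hypothetical helpers written for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/filter.h>   /* struct sock_filter, BPF_JUMP */

/* Skip `skip` instructions when the accumulator equals `value`;
 * otherwise continue with the next instruction (offset 0). */
struct sock_filter jump_if_equal(uint32_t value, unsigned skip)
{
    assert(skip <= 255);    /* jump offsets are 8-bit fields */
    struct sock_filter insn =
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, value, (uint8_t)skip, 0);
    return insn;
}

/* There is no "jump if not equal" opcode: use the same opcode with the
 * true and false offsets swapped. */
struct sock_filter jump_if_not_equal(uint32_t value, unsigned skip)
{
    assert(skip <= 255);
    struct sock_filter insn =
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, value, 0, (uint8_t)skip);
    return insn;
}
```

Both helpers emit the same opcode; only the placement of the offset differs.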

The BPF data area for seccomp (described by struct seccomp_data) has a few different fields that describe the system call being made: system call number (which is architecture-dependent), architecture, instruction pointer, and system call arguments. It is a read-only buffer that the program can use but not change.
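Since BPF load instructions address this buffer by byte offset, it can help to see where the fields sit. The sketch below assumes the struct seccomp_data definition shipped in <linux/seccomp.h>; seccomp_arg_offset() is a hypothetical convenience function.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/seccomp.h>  /* struct seccomp_data */

/* Byte offset of system call argument `index` (0 through 5) within the
 * read-only data area handed to the filter. */
size_t seccomp_arg_offset(unsigned index)
{
    assert(index < 6);      /* at most six arguments are passed */
    return offsetof(struct seccomp_data, args) +
           index * sizeof(((struct seccomp_data *)0)->args[0]);
}
```

Each argument slot is 64 bits wide while a BPF word load fetches 32 bits, so a filter that inspects a full argument would need two loads.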

Writing filters

If one is "feeling masochistic", they could write BPF programs numerically, but there are some constants and macros available to make it easier. For example:

    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             (offsetof(struct seccomp_data, arch)))

That would create a BPF load operation (BPF_LD), for a word (BPF_W), using the value in the instruction as an offset into the data area (BPF_ABS). That value is the offset of the architecture field within the data area, so the end result is an instruction that loads the accumulator with the architecture (from the AUDIT_ARCH_* values in audit.h). The next instruction might look something like:

    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
             AUDIT_ARCH_X86_64, 1, 0)

That would create a jump-if-equal instruction (BPF_JMP | BPF_JEQ) that compares the value in the instruction, which is known as "k", (BPF_K) to the value in the accumulator. So, if the architecture is x86-64, this jump will skip the next instruction (the offset of "1" for the jump true destination), otherwise it will execute it ("0" for jump false).

Kerrisk stressed that BPF programs should check the architecture as their first step to ensure that the system-call numbers match what is expected by the program. The BPF program could have been created on a different architecture than the one it is being run on.
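Putting the two instructions above into a whole program gives something like the following sketch, which is not the example from the slides. It checks the architecture first, then kills on open(). Several things here are assumptions made for illustration: X86_64_NR_OPEN hard-codes 2 as the x86-64 number for open(), and run_filter() is a toy user-space interpreter (supporting only the three opcodes used) so that the logic can be exercised without installing anything in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>    /* AUDIT_ARCH_* */
#include <linux/filter.h>   /* struct sock_filter, BPF_* */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */

#define X86_64_NR_OPEN 2    /* assumption: open() is system call 2 on x86-64 */

static const struct sock_filter deny_open_filter[] = {
    /* [0] load the architecture into the accumulator */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* [1] x86-64: skip the kill; anything else: fall through to it */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    /* [2] unexpected architecture */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* [3] load the system call number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [4] open(): fall through to the kill; otherwise skip to the allow */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, X86_64_NR_OPEN, 0, 1),
    /* [5] */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    /* [6] */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Toy interpreter: one accumulator, forward jumps only. */
uint32_t run_filter(const struct sock_filter *prog, size_t len,
                    const struct seccomp_data *data)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            assert((size_t)insn->k + sizeof(acc) <= sizeof(*data));
            memcpy(&acc, (const char *)data + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* offsets are relative to the next instruction */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL;    /* opcode not handled by this toy */
        }
    }
    return SECCOMP_RET_KILL;            /* ran off the end of the program */
}

/* Evaluate deny_open_filter for a given architecture and call number. */
uint32_t check_syscall(uint32_t arch, int nr)
{
    struct seccomp_data data = { .nr = nr, .arch = arch };

    return run_filter(deny_open_filter,
                      sizeof(deny_open_filter) / sizeof(deny_open_filter[0]),
                      &data);
}
```

With this filter, check_syscall(AUDIT_ARCH_I386, ...) yields the kill action regardless of the call number, which is the point of testing the architecture before anything else.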

Once the filter is installed, it will be run for every system call, which does have a performance impact, he said. Each program must have a return instruction on all exit paths; otherwise, the verifier will fail with an EINVAL when the filter is installed. The return value is a 32-bit quantity, with the most significant 16 bits specifying the action the kernel should take. The other bits are available to return data associated with the action.

There are five actions that the program can return. SECCOMP_RET_ALLOW indicates that the system call is allowed. SECCOMP_RET_KILL terminates the process as though it had been killed with a SIGSYS (though the process can't actually catch the signal). SECCOMP_RET_ERRNO tells the kernel not to execute the system call and to return the data value specified as errno. SECCOMP_RET_TRACE indicates that the kernel should try to notify a ptrace() tracer, which gives it the opportunity to take control. Finally, SECCOMP_RET_TRAP tells the kernel to immediately send a real SIGSYS that the process could catch if that was desired.
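The split between action and data in the return value can be sketched as follows. The SECCOMP_RET_* constants are from <linux/seccomp.h>; the two mask macros and the three helper functions are local names chosen for this example, with the masks simply following the 16/16-bit split described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/seccomp.h>  /* SECCOMP_RET_* */

#define RET_ACTION_MASK 0xffff0000u  /* most significant 16 bits: the action */
#define RET_DATA_MASK   0x0000ffffu  /* remaining bits: associated data */

/* Build a return value that fails the system call with the given errno. */
uint32_t make_errno_result(int err)
{
    assert(err >= 0 && err <= 0xffff);
    return SECCOMP_RET_ERRNO | ((uint32_t)err & RET_DATA_MASK);
}

uint32_t result_action(uint32_t ret)
{
    return ret & RET_ACTION_MASK;
}

uint32_t result_data(uint32_t ret)
{
    return ret & RET_DATA_MASK;
}
```

A filter that wanted open() to fail with "permission denied" rather than kill the process could return make_errno_result(EPERM).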

To install a BPF program, one either uses seccomp() (since Linux 3.17) or prctl(). In both cases, a struct sock_fprog pointer is passed; it contains the number of instructions and a pointer to the program. In order for the installation to succeed, either the caller must have the CAP_SYS_ADMIN capability or the process must have set the no_new_privs attribute using the prctl() PR_SET_NO_NEW_PRIVS operation (which effectively ignores set-UID, set-GID, and file capabilities when new programs are run with execve()).
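A minimal installation sequence using prctl() might look like the sketch below. install_filter() and demo_errno_injection() are hypothetical names; the demo forks a child so that the filter, which cannot be removed, does not affect the caller. For brevity the demo filter skips the architecture check and trusts that SYS_getppid from the build headers matches the running kernel, which a production filter should not do.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Install a filter for the calling thread.
 * Returns 0 on success, -1 with errno set on failure. */
int install_filter(struct sock_filter *insns, unsigned short count)
{
    struct sock_fprog prog = { .len = count, .filter = insns };

    assert(insns != NULL && count > 0);

    /* Without CAP_SYS_ADMIN, no_new_privs must be set first. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Child installs a filter that makes getppid() fail with EPERM.
 * Returns 0 if the child saw the injected error, 1 if it did not,
 * 2 if the filter could not be installed, -1 on fork/wait problems. */
int demo_errno_injection(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* getppid(): fall through to the errno return; else skip to allow */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;
    if (child == 0) {
        if (install_filter(filter, sizeof(filter) / sizeof(filter[0])) == -1)
            _exit(2);
        errno = 0;
        long rc = syscall(SYS_getppid);
        _exit(rc == -1 && errno == EPERM ? 0 : 1);
    }
    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The same struct sock_fprog pointer could instead be handed to seccomp() on kernels that provide it.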

He showed an example (which can be seen in his slides [PDF]) that would not allow a call to open() (the BPF program would return SECCOMP_RET_KILL). It was installed by a program that did indeed call open(), which failed with a "Bad system call" error, indicating that it received a SIGSYS. Due to time constraints, he skipped over a more complicated example, but it is worth looking at in the slides.

If the filters allow a program to call prctl() or seccomp(), it can install further filters. They will all be run in the reverse order that they were added. The highest priority value returned by any of the filters is what gets returned to the kernel (with KILL as the highest priority and ALLOW as the lowest). Filters are preserved across calls to fork(), clone(), and execve() if those calls are allowed by the filters at all.
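How the results of several filters are combined can be sketched in a few lines. This is an illustration and not kernel code: combine_filter_results() is a hypothetical function, and it assumes that, for the five actions listed earlier, a numerically lower action value means a higher priority (KILL is 0 and ALLOW is the largest). The results array is taken to be in the order in which the filters ran.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>  /* SECCOMP_RET_* and SECCOMP_RET_ACTION */

/* Pick the value that would be acted upon: the first-seen result whose
 * action has the highest priority, together with its data bits. */
uint32_t combine_filter_results(const uint32_t *results, size_t count)
{
    assert(results != NULL && count > 0);

    uint32_t winner = results[0];
    for (size_t i = 1; i < count; i++) {
        /* Strict comparison: on a tie, the earlier result is kept. */
        if ((results[i] & SECCOMP_RET_ACTION) < (winner & SECCOMP_RET_ACTION))
            winner = results[i];
    }
    return winner;
}
```

One consequence is that a later, more permissive filter cannot undo a restriction imposed by an earlier one.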

The performance cost for the filters is not insubstantial. He tested his simple "deny open" example, which is six BPF instructions, in a program that continually called getppid()—one of the cheapest system calls. That resulted in 25% more execution time than running it without the filter.
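A measurement of this kind can be approximated with a simple timing loop. The sketch below is not Kerrisk's test program; time_getppid_calls() is a hypothetical helper that times raw getppid() system calls, to be run once without a filter and once with one installed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Return the elapsed time, in nanoseconds, for `iterations` getppid()
 * system calls, or -1 if the clock cannot be read. */
int64_t time_getppid_calls(long iterations)
{
    struct timespec start, end;

    assert(iterations > 0);

    if (clock_gettime(CLOCK_MONOTONIC, &start) == -1)
        return -1;
    for (long i = 0; i < iterations; i++)
        (void)syscall(SYS_getppid);     /* enter the kernel on every pass */
    if (clock_gettime(CLOCK_MONOTONIC, &end) == -1)
        return -1;

    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL +
           (end.tv_nsec - start.tv_nsec);
}
```

Comparing the two timings gives a relative overhead figure; as with any microbenchmark, the result will vary with the hardware, the kernel, and the filter.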

The two main uses for seccomp filters are sandboxing and failure-mode testing. The former restricts programs, especially those that handle untrusted input, to a (small) subset of system calls, typically with a whitelist. For failure-mode testing, one can inject various kinds of unexpected failures into programs using seccomp, which may be useful to find bugs in error paths and the like.

There are a number of tools and resources that can make it easier to work with seccomp filters and BPF. Libseccomp provides a higher-level API for creating filters. He noted that the project also has man pages (for example, seccomp_init()) with lots of examples.

There is also a BPF compiler (bpfc) that is part of the netsniff-ng toolkit project. LLVM has a BPF backend as of its 3.7 release that compiles a subset of C to BPF, though he noted that there is little documentation as yet.

Finally, the kernel has a just-in-time (JIT) compiler that turns the BPF bytecode into native machine code, which can achieve 2-3x performance (or even better in some cases). The JIT compiler is disabled by default, but it can be enabled by writing a "1" to:

    /proc/sys/net/core/bpf_jit_enable

Kerrisk's slides have a wealth of information, including additional resources for more information.

[I would like to thank the Linux Plumbers Conference organizing committee for travel assistance to Seattle for LPC.]


A seccomp overview

Posted Sep 4, 2015 19:27 UTC (Fri) by kjp (guest, #39639)

Well, at least it seems simpler than selinux. It would be great if the article estimated the LOC of filter code vs the LOC of "surface" it is trying to protect.

A seccomp overview

Posted Sep 6, 2015 21:49 UTC (Sun) by robert_s (subscriber, #42402)

It is trying to do a very different thing from selinux though.

A seccomp overview

Posted Sep 9, 2015 10:44 UTC (Wed) by nix (subscriber, #2304)

> It would be great if the article estimated the LOC of filter code vs the LOC of "surface" it is trying to protect

Since this is protecting other processes from buggy userspace code, which can be any length, that's impossible.

(The disparity can be enormous -- e.g. perhaps the first thing ever sandboxed with seccomp was the Chromium renderer process: *part* of that is WebKit (now Blink) which takes about an hour just to compile...)

A seccomp overview

Posted Sep 7, 2015 12:47 UTC (Mon) by gebi (guest, #59940)

Was the getppid() performance test with JIT enabled?

A seccomp overview

Posted Sep 9, 2015 12:22 UTC (Wed) by mkerrisk (subscriber, #1978)

> Was the getppid() performance test with JIT enabled?

No, it was with the JIT compiler disabled. With the JIT compiler enabled, the cost of the filter was around 15%.

But don't overread those numbers. They're just examples to indicate that there is overhead. The relative figures will vary according to the complexity/cost of the system call, the cost of the filter, and other factors. In some simple experiments that I ran with very large (but pretty "dumb") filters, the JIT compiler could improve performance by nearly 10x in some cases.

My getppid() test program can be found at http://man7.org/tlpi/code/online/dist/seccomp/seccomp_per...

A seccomp overview

Posted Sep 9, 2015 16:18 UTC (Wed) by gebi (guest, #59940)

oh nice, thx for the numbers with JIT enabled.

ACK, about the absolute numbers, but it's imho nice that even for this fast syscall case JIT is faster.

eBPF with seccomp

Posted Nov 28, 2019 14:32 UTC (Thu) by krishnatayal (guest, #135817)

Can we use eBPF with seccomp? We need to use the maps (key-value pairs) provided in eBPF.