A journey on evaluating Control-Flow Integrity (CFI): LLVM-CFI versus RAP

This post started out of the need to provide a little more clarification after long and heated discussions on Twitter (initial discussion and follow up) about the origins of Control-Flow Integrity (CFI), the contributions of academia, and the precision, performance, and compatibility of different existing implementations.

CFI is a stop-the-exploit defense that protects the control flow of processes in the presence of memory corruption. The threat model assumes that an attacker can read and write the entire address space according to the read/write permissions of the corresponding pages. The mitigation restricts execution to valid control flows by checking the targets of indirect control-flow transfers (indirect calls and indirect jumps on the forward edge, function returns on the backward edge). While many different CFI proposals and implementations exist, most leverage a conceptual set check (e.g., checking a set hash or a set id) to test whether the observed target is valid. Most CFI implementations are static, i.e., they rely on an analysis pass to build the target sets and, except for the target pointer itself, do not need any writable data (that could be modified by the adversary) in the process.

Without memory corruption, only a single target is valid for each executed indirect control-flow transfer, but due to the imprecision of the static CFI policy, all targets in the set must be accepted (i.e., the implementations are neither context nor path sensitive). Similarly, due to imprecision in the analysis, the target sets are an over-approximation. In practice, most implementations use a type-based analysis and group all functions with the same prototype into a target set. These two imprecisions allow the attacker to replace the original target with another one from the target set without triggering the defense. The power of CFI as a mitigation is therefore tied to the size of the sets: intuitively, the smaller the sets, the stronger the defense.
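
As a minimal sketch of such a type-based set check (the id value, the id_of helper, and the explicit call_checked wrapper are illustrative; real implementations emit the check in the compiler, not in source), consider two functions that share the prototype int(int, int):

/* Conceptual sketch of a type-based CFI set check; all names are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define ID_INT_INT 0x1234abcdu  /* hypothetical set id for the prototype int(int, int) */

static int mul(int a, int b) { return a * b; }
static int add(int a, int b) { return a + b; }

/* A real implementation emits the id next to the function or in a table;
   this lookup only models "same prototype, hence same id". */
static unsigned int id_of(void *target) {
    if (target == (void *)mul || target == (void *)add)
        return ID_INT_INT;
    return 0;  /* any other target carries no valid id */
}

static int call_checked(int (*fptr)(int, int), int a, int b) {
    if (id_of((void *)fptr) != ID_INT_INT)
        abort();           /* target outside the set: terminate */
    return fptr(a, b);     /* indirect call to a target inside the set */
}

int main(void) {
    int (*fptr)(int, int) = mul;
    /* An attacker who corrupts fptr and swaps mul for add stays inside the
       target set and is not detected; this is exactly the imprecision above. */
    printf("%d\n", call_checked(fptr, 6, 7));
    return 0;
}

Swapping mul for add here is harmless, but in real programs two functions with the same prototype may have very different security-relevant behavior.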

For the backward edge, a set check is insufficient due to the massive amount of over-approximation. In its most precise form, set checking leverages the function symbol and matches the return sites of all calls to that function. Most implementations similarly leverage the function prototype to match the return site. This means that a void(*)(void) function may return to the instruction after any call to a void(*)(void) function, often resulting in huge target sets. These over-approximations (both for function name and function prototype matching) give the attacker an opportunity to short-circuit different parts of the program. For example, two calls to printf in the same function allow the attacker to implement a loop where the return address during the second call is overwritten to point back to the first call. This short-circuiting becomes a powerful primitive, often resulting in arbitrary computation (see Control-Flow Bending for details). To protect the backward edge, we therefore recommend stack integrity through a shadow stack or safe stack. Intel CET is a promising upcoming hardware implementation of a shadow stack and the LLVM shadow stack is a strong software implementation for aarch64.
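
To make the printf example concrete, here is a benign sketch (no attack code, just the layout of the two return sites) of why function-name or prototype matching on returns is too permissive:

#include <stdio.h>

void report(int a, int b) {
    /* return site R1: the instruction right after this call */
    printf("a = %d\n", a);
    /* return site R2: the instruction right after this call */
    printf("b = %d\n", b);
    /* Under function-name or prototype matching, returning from the second
       printf to R1 is a "valid" backward edge, so an attacker who corrupts
       the saved return address during the second call can loop back to the
       first call. A shadow stack rejects this, since it records the exact
       return address of the active call. */
}

int main(void) {
    report(1, 2);
    return 0;
}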

We have extensively discussed CFI in an earlier blog post, a survey, and also as part of our BOPC, CFIXX, Kernel CFI, Control-Flow Bending, Lockdown, and CPI papers. For a quick overview, I recommend the blog post; for a deeper comparison, I recommend the survey.

History of taming control flow

Multiple sources claim ownership of the idea of restricting control flow based on set checks. The original 2005 CFI paper coined the term control-flow integrity, formalized the property for forward and backward edges, and evaluated a prototype implementation (that was not released openly). The same authors revisited CFI in 2007 and proposed a shadow stack to protect return targets.

In 2003, PaXTeam wrote in their future ideas text file about protecting function pointers and returns. For function pointers (c1), they propose to make the function pointers themselves read-only. This approach would fundamentally solve the problem: read-only function pointers are precise, as an attacker can no longer overwrite the function pointer itself. Unfortunately, the approach does not scale due to transitivity: an attacker can still modify a pointer to the function pointer. Similar to how non-executable data pages did not prevent ROP, read-only function pointers do not stop control-flow hijacking. Control-flow hijacking merely becomes harder, as the attacker now must control an additional layer of indirection. For returns (c2), they propose a set check based on the function prototype, similar to the original CFI paper. Unfortunately, this idea was not implemented, and the future ideas text file was not well known and therefore not cited in the Abadi CFI paper.

After we were made aware of this text file in about 2016, we started giving credit for the idea of restricting dynamic control flow. Given the very short description, the exclusive focus on returns for the control-flow check, and the lack of an implementation or evaluation, the future ideas file can serve as inspiration but does not provide a convincing case for CFI and should therefore not serve as the main citation for CFI. The PaX future ideas text file may be credited for the idea of restricting control flow as a defense, together with other related work from the same time such as program shepherding from 2002.

But as it turns out, the idea of taming control flow is much older than PaX. Hoelzle, Chambers, and Ungar proposed polymorphic inline caches in 1991, which replace indirect calls with a type check and a direct call, similar to the CFI set check. A type mismatch could be detected and result in termination of the program. Going back even further, Deutsch and Schiffman explored inline caches for Smalltalk-80 in 1984. (Thanks to Stefan Brunthaler for the references and discussion.)

The idea of taming indirect control flow is ancient, with research in runtime systems, programming languages, hardware architectures, compilers, and defenses. This older work is rarely cited in newer papers but may deserve a revisit. It may be worthwhile to start conducting computer science archaeology, as academics regularly miss related work.

Evaluating CFI

The defensive power of CFI is program dependent. Under the powerful attacker model, the usefulness of CFI depends on whether the target set contains the gadget(s) that are useful for the attacker. In addition to security, two other properties that can be evaluated are performance and compatibility. LLVM-CFI has negligible (less than 1%) performance overhead on standard benchmarks such as SPEC CPU2006 and we will not repeat the measurements here.

We evaluate LLVM-CFI (version 4.0.1) and RAP (via the 4.9.24-test7 patch), two CFI mechanisms that implement prototype-based set checks for the forward edge. RAP also offers prototype-based set checks on the backward edge but, due to the security concerns mentioned above, we will restrict the evaluation to the forward edge. Our tests focus on user-space code.

Using LLVM-CFI

LLVM-CFI is extremely easy to use. Install your favorite (recent) LLVM through your favorite package manager, e.g., through apt install clang-3.8 for the current Debian default. To activate LLVM-CFI runtime checking, simply compile your software with clang -flto -fsanitize=cfi test.c. There's also plenty of documentation available if you need to know more.
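
As a quick sanity check, the following toy program (my own example, not taken from the LLVM documentation) can be compiled with the command above; the well-typed indirect call runs fine, while uncommenting the mismatched call exercises the kind of violation the cfi-icall check is designed to catch at runtime:

/* test.c: compile with  clang -flto -fsanitize=cfi test.c -o test  */
#include <stdio.h>

static int twice(int x) { return 2 * x; }

int main(void) {
    int (*good)(int) = twice;
    printf("%d\n", good(21));  /* prototype matches: passes the CFI check */

    /* Calling through a pointer cast to a mismatched prototype is the kind
       of indirect call that should be rejected at runtime:
       int (*bad)(long) = (int (*)(long))twice;
       bad(21);
    */
    return 0;
}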

Using RAP

The story for RAP is a little more complicated, and compiling the underlying GCC plugin is only slightly more involved (yes, this is sarcastic). The first challenge is to discover the actual source. The history behind RAP is a little obscure, as no official write-up or code repository exists. The only hints are @paxteam on Twitter and an H2H presentation. Without an official release, I resorted to searching the old PaX Linux patches, as they apparently contained a "public" version of RAP. The most recent patch I found was pax-linux-4.9.24-test7.patch. Download the patch and apply it against the linux-4.9.24.tar.gz tarball. Enter the unpacked source tree and run make menuconfig to select a reasonable configuration. For some reason, the patch did not change Kconfig and the options for RAP did not show up. I therefore had to manually edit the .config file and make sure the following entries were selected:

CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_GCC_PLUGINS=y
CONFIG_PAX_RAP=y

After selecting this configuration, the plugin is compiled as a side effect of building the kernel through make. After the compilation finishes, you can grab the plugin from linux-4.9.24-pax/scripts/gcc-plugins/rap_plugin/rap_plugin.so and RAP is ready to use. If you know the command line arguments, that is. As no documentation exists for RAP, I once again resorted to the code and discovered the following command line switches in the Makefiles:

CFLAGS=-DRAP_PLUGIN -fplugin-arg-rap_plugin-typecheck=call,ret
CFLAGS+=-fplugin-arg-rap_plugin-hash=abs-finish
CFLAGS+=-fplugin-arg-rap_plugin-hash=abs-ops
CFLAGS+=-fplugin-arg-rap_plugin-hash=abs-attr
CFLAGS+=-fplugin-arg-rap_plugin-report=func,fptr,abs
CFLAGS+=-DX86_RAP_CALL_VECTOR=0x82
CFLAGS+=-DX86_RAP_RET_VECTOR=0x83
CFLAGS+= '-fplugin-arg-rap_plugin-callabort=int $$0x82'
CFLAGS+= '-fplugin-arg-rap_plugin-retabort=int $$0x83'
CFLAGS+= -DRAP_PLUGIN

My educated guess is that the first line activates RAP for calls and returns while lines two through four activate higher precision depending on parts of the function prototype. The report switch on line five prints debug information about the hashes (and can be incredibly helpful when debugging RAP). The remaining lines select how traps are handled. With this information, we are ready to use RAP for test software.
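
Putting these pieces together, my best (untested) guess for protecting a user-space test program is an invocation along the following lines; the plugin path and the exact flag selection are assumptions derived from GCC's standard -fplugin interface and the Makefile switches above, not from any RAP documentation:

gcc -fplugin=./rap_plugin.so -DRAP_PLUGIN \
    -fplugin-arg-rap_plugin-typecheck=call,ret \
    -O3 test.c -o test_rap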

Getting to this stage required about 10-15 hours of software archaeology and pointers from @lazytyped, @raistolo, and @stevecheckoway, who reached out to help after a heated discussion on Twitter.

Orthogonally, I received a lot of love from the Twitter "hacker" community, e.g., I now know that I'm a useless, incompetent academic who is too stupid to read code, and several other things that I was not aware of before. Thanks folks, I love you too! A core aspect of hacker culture is to share information and to help, not to attack others who are trying to reproduce and understand.

Precision measurements

Precision is an important property for CFI. Strict CFI defenses may result in false positives where the execution of a program is stopped even though no attacker modified memory. For example, if a function pointer is cast to a different type, dereferencing it may result in a CFI violation. LLVM-CFI enforces strict prototype checking and may therefore cause incompatibilities with existing code that plays loosely with function pointer types.

Both LLVM-CFI and RAP implement CFI checks based on function prototype matching: function prototypes are encoded into a type hash (in the case of RAP) or a type ID (for LLVM-CFI). Both policies detect if any aspect of the function prototype is changed, e.g., the type of a parameter or the return type. Both policies are precise and check for specific pointer types.

There is one key difference between RAP and LLVM-CFI: LLVM-CFI only allows calls to functions that are address-taken, i.e., only functions that had their address taken through the address-of operator can be called indirectly. Functions that are not address-taken cannot be targets. The underlying idea behind this check is that only a small subset of all functions are called indirectly, and LLVM-CFI reduces the size of the target sets this way (instead of all void(*)(void) functions, only the void(*)(void) functions that have their address taken are in the set of valid targets). This vastly reduces the size of the sets. The power to distinguish between address-taken and not-address-taken functions comes at a price: LLVM-CFI requires -flto (link time optimization) to decide if a function is address-taken anywhere in the program. RAP here trades precision for simplicity.
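
A small sketch of this difference (function names are made up): with three functions of the same prototype, only the two whose addresses are taken end up in LLVM-CFI's target set, whereas a pure prototype check like RAP's also accepts the third:

#include <stdio.h>

static void handler_a(void) { puts("a"); }
static void handler_b(void) { puts("b"); }
static void internal(void)  { puts("internal"); }  /* never address-taken */

int main(void) {
    /* only handler_a and handler_b have their address taken ... */
    void (*table[2])(void) = { handler_a, handler_b };
    table[0]();
    table[1]();
    /* ... so under LLVM-CFI the valid targets for table[i]() are exactly
       {handler_a, handler_b}; under a pure prototype check, internal()
       (same prototype void(void)) would also be accepted as a target. */
    internal();  /* called directly, so it is kept but never address-taken */
    return 0;
}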

Both LLVM-CFI and RAP fail to detect a compromise of the function pointer if it is overwritten with an address-taken function. Note that this imprecision is expected from any type-based CFI mechanism.

To conclude, both LLVM-CFI and RAP implement CFI through type-based prototype matching. LLVM-CFI only matches address-taken functions while RAP matches all functions (resulting in some imprecision).

Performance

We run the performance test by indirectly dispatching a tiny function many times and leveraging the rdtsc time stamp counter to count the number of cycles for each dispatch (including the execution of the function). We run the benchmark on a mobile Intel Core i7-7500 at 2.7GHz running the performance governor. We compile the code with -O3 and average 5 executions after one warmup. The dispatch code is:

__attribute__((noinline)) int quickfun(int a, int b) {
  __asm__ volatile("nop\n");  /* keep the call from being optimized away */
  return a*b;
}
...
int (*ptr)(int, int) = &quickfun;
/* read the time stamp counter before the dispatch loop */
__asm__ volatile ( "rdtsc" : "=a" (lo), "=d" (hi));
start = lo | (hi << 32);
for (unsigned long i = 0; i < NRSPEED; i++)
  ptr(a, b);  /* the indirect dispatch under test */
/* read the time stamp counter after the loop */
__asm__ volatile ( "rdtsc" : "=a" (lo), "=d" (hi));
end = lo | (hi << 32);

Note that this microbenchmark only provides a very rough estimate of the performance for one kind of dispatch. As a microbenchmark, it tries to show the worst-case performance overhead for the protected dispatch; the reported numbers are simply (end - start) / NRSPEED cycles, averaged across runs.

Amazingly, RAP has no measurable performance overhead. Without RAP, gcc compiles the code to execute in 4.172 cycles per dispatch. With RAP, the same code executes in 4.173 cycles per dispatch. This is less than 0.04% overhead.

LLVM-CFI results in some overhead. Without CFI, LLVM compiles the code to execute in 4.179 cycles per dispatch. With LLVM-CFI, the code executes in 5.012 cycles per dispatch. This is about 19.93% overhead for a pure dispatch.

Let's dive into the compiled code to see where the overhead comes from, starting with RAP. The hot loop is compiled to:

2181: 0f 31                 rdtsc
2183: 48 c1 e2 20           shl    $0x20,%rdx
2187: bb 00 ca 9a 3b        mov    $0x3b9aca00,%ebx
218c: 48 09 c2              or     %rax,%rdx
218f: 49 89 d6              mov    %rdx,%r14
2192: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
2198: 48 81 7d f8 a8 2f bd  cmpq   $0x17bd2fa8,-0x8(%rbp)
219f: 17
21a0: 75 61                 jne    2203 <speedfun+0x93>
21a2: 44 89 e6              mov    %r12d,%esi
21a5: 44 89 ef              mov    %r13d,%edi
21a8: ff d5                 callq  *%rbp
21aa: 48 83 eb 01           sub    $0x1,%rbx
21ae: 75 e8                 jne    2198 <speedfun+0x28>
21b0: 0f 31                 rdtsc

We see that the type check is executed in each loop iteration through the cmpq instruction, which compares the expected hash (the immediate 0x17bd2fa8) against the value stored just before the function entry (-0x8(%rbp)), dispatching after a successful comparison. The target in %rbp points directly to the function quickfun.

LLVM-CFI reshuffles code quite a bit. The hot loop is compiled to:

4013cc: 0f 31                 rdtsc
4013ce: 49 89 d6              mov    %rdx,%r14
4013d1: b9 18 15 40 00        mov    $0x401518,%ecx
4013d6: 49 39 cc              cmp    %rcx,%r12
4013d9: 75 79                 jne    401454 <main+0x244>
4013db: 49 c1 e6 20           shl    $0x20,%r14
4013df: 49 09 c6              or     %rax,%r14
4013e2: bb 00 ca 9a 3b        mov    $0x3b9aca00,%ebx
4013e7: 66 0f 1f 84 00 00 00  nopw   0x0(%rax,%rax,1)
4013ee: 00 00
4013f0: 44 89 ff              mov    %r15d,%edi
4013f3: 89 ee                 mov    %ebp,%esi
4013f5: 41 ff d4              callq  *%r12
4013f8: 48 ff cb              dec    %rbx
4013fb: 75 f3                 jne    4013f0 <main+0x1e0>
4013fd: 0f 31                 rdtsc

Interestingly, the check itself is hoisted outside of the loop. The target 0x401518 contains a single jump to the real implementation of quickfun.

0000000000401518 <quickfun>:
401518: e9 03 fc ff ff        jmpq   401120 <quickfun.cfi>
40151d: cc                    int3
40151e: cc                    int3
40151f: cc                    int3

LLVM-CFI implements the CFI check as a range check over a table of trampolines. All address-taken functions of the same type are placed next to each other as 8-byte aligned jump-table entries, and a valid target must fall onto one of these aligned entries within the range for the corresponding type. The extra hop through the jump table costs roughly 1 cycle for each dispatch due to the double indirection.
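
Conceptually (this is a sketch of the idea, not the exact code clang emits; the table base is taken from the disassembly above and the entry count is specific to this toy program), the forward-edge check boils down to an alignment-plus-range test over the per-type jump table:

#include <stdint.h>
#include <stdlib.h>

#define TABLE_BASE    0x401518u  /* start of the jump table for this prototype, from the dump above */
#define TABLE_ENTRIES 1u         /* only quickfun is address-taken with this prototype here */
#define ENTRY_SIZE    8u         /* each entry is an 8-byte aligned jump */

/* valid targets are exactly the aligned entries inside the per-type table */
static void cfi_check(const void *target) {
    uintptr_t off = (uintptr_t)target - TABLE_BASE;
    if (off % ENTRY_SIZE != 0 || off / ENTRY_SIZE >= TABLE_ENTRIES)
        abort();  /* target is not a trampoline of the expected type */
}

int main(void) {
    cfi_check((const void *)(uintptr_t)TABLE_BASE);  /* aligned and in range: passes */
    return 0;
}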

Given this simple measurement on a single CPU, the RAP-style instrumentation is about 1 cycle per dispatch faster than the LLVM-CFI-based one. The LLVM-CFI instrumentation encodes all valid targets tightly in a single table, which potentially helps a reverse engineer. There is no reason why LLVM-CFI could not implement simple prototype-based checking that is encoded inline next to the function, as proposed in the original CFI paper.

Summary

In summary, in the current implementations, RAP is slightly faster due to direct encoding of target sets but LLVM-CFI is more precise by limiting indirect dispatch to only address-taken functions. Both mechanisms implement a prototype-based CFI policy.

Just looking at how easy it is to run LLVM-CFI and how complicated it is to run RAP, I don't think any reasonable person would use RAP for their code: compatibility and long-term maintainability are real concerns, as it is not clear if or how RAP will be supported in the future. Also, I have no way of knowing how complete the RAP plugin is or whether it is even the most recent version, as PaXTeam turned closed-source and no longer openly shares the plugin (but offered to run my tests on his machine).

I wonder why LLVM-CFI relies on a trampoline table to dispatch targets instead of prepending each function with a type identifier. The double indirection for LLVM-CFI seems excessive and unnecessary.

Get the full code for the benchmark and play with it yourself

Edits, corrections, and updates

The blog post initially claimed that there may be a TOCTTOU window against RAP. This was a mistake made when reading the assembly.

Added a note about the microbenchmark results.
