Improving Fuzzing Speed with userfaultfd

About the same time I wrote up my previous post about snapshot fuzzing, I was thinking about other ways to restore program state for fuzzing, ideally in userland for ease of use.

There are of course many program side effects that need to be accounted for to restore program state perfectly: threads, files, timers, etc. However those all have to interact with the kernel in a way that can be intercepted with either libc hooks or with syscall (seccomp) hacks. And better yet for the purpose of fuzzing, they can often be disregarded and cleaned up in bulk after some large number of runs - for example, having extra open files shouldn’t break well-written programs.

The biggest challenge is restoring memory state since there’s no easy way to determine what memory has changed between runs from userland. The kernel can do this without too much effort (see the previous post), but this information isn’t easily accessible to userland. You could duplicate and restore all memory regions, however this doubles the running memory overhead of any fuzz target since it has to keep the pristine copy as well as the working copy. That may also end up taking a significant amount of time to reset between runs if there are hundreds of megabytes of shared libraries to be restored for example.

While looking at something entirely unrelated, I had the idea to use userfaultfd to do this memory dirtyness tracking, which could then be restored on a page-granular level after the program finished running.

userfaultfd?

In short, userfaultfd is a newer Linux-specific interface for userland pagefault handlers. Instead of having a single SIGSEGV handler and tweaking memory protections, userfaultfd allows memory to be registered to a userfaultfd object which is then polled by another thread (or even another process) to respond to those faults. It’s a much more flexible and performant way of handling page faults compared to a signal handler, and is perfect for our needs (except for a few small hindrances which can be worked around).

Implementation

To test the viability of this, I first created a minimal proof of concept which:

  1. Duplicates all program memory (besides the relocation stub itself) into anonymous pages since userfaultfd cannot hook file-backed pages.
  2. Registers each writeable (now anonymous) page with a userfaultfd object in write-protect mode. When one of these pages is written to, its address and contents are added to a simple statically allocated array for later restoration and its write protect bit is disabled.
  3. Calls a target function/program.
  4. Restores memory by iterating over the array, copying out the previously saved “pristine” content.
  5. GOTO 3

I’m glossing over some details here (e.g. having to switch stacks when entering the snapshot/restore code so that the program doesn’t fault on its own stack and hang), but that’s the high level overview.

Benefits

This approach has many nice benefits, perhaps one of the most significant is actually something I didn’t realize until later. With this method, since the dirty page list is kept between runs and the write protect bit is cleared after the first write to the page, the overhead of intercepting memory writes goes down as more iterations are run since the commonly written pages are already in the restore list and aren’t hooked in subsequent runs. This means fewer kernel context switches and more time spent actually running the target code.

After a few days of hacking on this I got it working, targeting a simple program which did a malloc and printed out the returned address to show that heap restoration worked. It also benchmarked quite nicely with restoration taking under 2 microseconds, encouraging me on.

This proof-of-concept code is available here.

A Real Benchmark

As seems to be fuzzing tradition, I decided to ensure this worked on “real” programs by wrapping libjpeg-turbo. Specifically I was targeting djpeg converting an image of Tux into decompressed form and printing out the output.

In addition to the userfaultfd proof of concept, I also wrote up samples for a few other common ways of doing fuzzing to do comparisons against:

  1. A simple fork server which just calls fork in a loop and then exec’s the target
  2. The same as above, but using vfork
  3. An “improved” fork server which is inline in the target program, allowing initialization to happen once and forking just before main is called - you may know this as persistent mode in AFL

Code for the userfaultfd and persistent mode versions is avilable in the branches of this repo.

Results

For 10,000 iterations of djpeg /tmp/tux.jpg:

Method Median (ns) Min (ns) Max (ns)
fork 475737 457088 898557
vfork 442299 427317 815917
persistent 321325 311980 1610773
userfaultfd 174630 166653 734212

As these results show, even on a more complex program the userfaultfd technique resulted in a ~1.8x median performance increase over persistent mode, validating the idea!

Limitations

As with everything there are some limitations to the proof of concept:

Most notably, the write protect userfaultfd mode is currently implemented only for x86_64, meaning this approach will not work at all for ARM systems. I don’t believe there’s any technical reason for this however, so it could be added in the future. Additionally, this technique could be implemented by clearing PROT_WRITE on every page and setting a normal SIGSEGV handler, however this is notably slower (and much more annoying to implement) than userfaultfd.

Second, the fuzz framework would need to intercept mmap (or really any syscall which could alter memory mappings) and “do the right thing.” mprotect also needs to be hooked, and any pages being marked PROT_WRITE would need to be added to the userfaultfd before the mprotect returns. None of these are done in the proof of concept, but these wouldn’t be too hard to add.

Conclusion

While there is still more to be done to create a full implementation, this proof of concept shows that the strategy of using userfaultfd to reset program memory is viable, and even works as-is on moderately complex software. Being fully in userland, it should be possible to adopt this technique into source-available fuzzers like AFL(++) with relatively little maintenance work (compared to custom kernel modifications). I don’t have the time to do that much unfortunately, but hopefully someone does!

As always, feel free to reach out with any questions, suggestions, or if you happen to implement this technique in a real fuzzer :)