| .. SPDX-License-Identifier: GPL-2.0 |
| .. Copyright (C) 2020, Google LLC. |
| |
| Kernel Electric-Fence (KFENCE) |
| ============================== |
| |
| Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety |
| error detector. KFENCE detects heap out-of-bounds access, use-after-free, and |
| invalid-free errors. |
| |
| KFENCE is designed to be enabled in production kernels, and has near zero |
| performance overhead. Compared to KASAN, KFENCE trades precision for |
| performance. The main motivation behind KFENCE's design is that, with enough |
| total uptime, KFENCE will detect bugs in code paths not typically exercised by |
| non-production test workloads. One way to quickly achieve a large enough total |
| uptime is to deploy the tool across a large fleet of machines. |
| |
| Usage |
| ----- |
| |
| To enable KFENCE, configure the kernel with:: |
| |
| CONFIG_KFENCE=y |
| |
| To build a kernel with KFENCE support, but disabled by default (to enable, set |
| ``kfence.sample_interval`` to a non-zero value), configure the kernel with:: |
| |
| CONFIG_KFENCE=y |
| CONFIG_KFENCE_SAMPLE_INTERVAL=0 |
| |
| KFENCE provides several other configuration options to customize behaviour (see |
| the respective help text in ``lib/Kconfig.kfence`` for more info). |
| |
| Tuning performance |
| ~~~~~~~~~~~~~~~~~~ |
| |
| The most important parameter is KFENCE's sample interval, which can be set via |
| the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The |
| sample interval determines the frequency with which heap allocations will be |
| guarded by KFENCE. The default is configurable via the Kconfig option |
| ``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0`` |
| disables KFENCE. |
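
As a rough illustration of how the sample interval bounds the number of guarded
allocations, the following sketch computes an approximate upper limit for a
given uptime. The helper name ``max_sampled_allocations`` and the 100 ms
interval are assumptions made for this example; the model ignores burst mode
and pool exhaustion.

```python
def max_sampled_allocations(uptime_s: float, sample_interval_ms: int) -> int:
    """Approximate upper bound on guarded allocations, one sample per interval."""
    if sample_interval_ms == 0:
        return 0  # kfence.sample_interval=0 disables KFENCE
    return int(uptime_s * 1000) // sample_interval_ms


# One hour of uptime with an assumed 100 ms sample interval.
print(max_sampled_allocations(3600, 100))
```

The bound grows linearly with uptime, which is why aggregate uptime across many
machines matters more than the uptime of any single machine.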
| |
| The sample interval controls a timer that sets up KFENCE allocations. By |
| default, to keep the real sample interval predictable, the normal timer also |
| causes CPU wake-ups when the system is completely idle. This may be undesirable |
| on power-constrained systems. The boot parameter ``kfence.deferrable=1`` |
| instead switches to a "deferrable" timer which does not force CPU wake-ups on |
| idle systems, at the risk of unpredictable sample intervals. The default is |
| configurable via the Kconfig option ``CONFIG_KFENCE_DEFERRABLE``. |
| |
| .. warning:: |
| The KUnit test suite is very likely to fail when using a deferrable timer |
| since it currently causes very unpredictable sample intervals. |
| |
| By default KFENCE will only sample 1 heap allocation within each sample |
| interval. *Burst mode* allows sampling successive heap allocations: the |
| kernel boot parameter ``kfence.burst`` can be set to a non-zero value which |
| denotes the number of *additional* successive allocations within a sample |
| interval; setting ``kfence.burst=N`` means that ``1 + N`` successive |
| allocations are attempted through KFENCE for each sample interval. |
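
The interaction between the sample interval and burst mode can be sketched with
a small simulation. This is a simplified model and not the kernel
implementation: ``sampled_allocations`` is a hypothetical helper, timestamps
are in milliseconds, and the timer is assumed to be re-armed once the last
allocation of a burst has been served.

```python
def sampled_allocations(alloc_times_ms, sample_interval_ms, burst=0):
    """Return the allocation times that would be attempted through KFENCE."""
    sampled = []
    budget = 0                        # allocations still allowed in this burst
    next_expiry = sample_interval_ms  # first timer expiration
    for t in sorted(alloc_times_ms):
        if budget == 0 and t >= next_expiry:
            budget = 1 + burst        # timer expired: 1 + N allocations allowed
        if budget > 0:
            sampled.append(t)
            budget -= 1
            if budget == 0:
                next_expiry = t + sample_interval_ms  # timer is reset
    return sampled


times = range(0, 301, 10)  # one allocation every 10 ms
print(sampled_allocations(times, 100))
print(sampled_allocations(times, 100, burst=2))
```

With ``burst=2``, three successive allocations are taken each time the interval
expires, instead of one.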
| |
| The KFENCE memory pool is of fixed size, and if the pool is exhausted, no |
| further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default |
| 255), the number of available guarded objects can be controlled. Each object |
| requires 2 pages, one for the object itself and the other one used as a guard |
| page; object pages are interleaved with guard pages, and every object page is |
| therefore surrounded by two guard pages. |
| |
| The total memory dedicated to the KFENCE memory pool can be computed as:: |
| |
| ( #objects + 1 ) * 2 * PAGE_SIZE |
| |
| Using the default config, and assuming a page size of 4 KiB, results in |
| dedicating 2 MiB to the KFENCE memory pool. |
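
The same computation, written out as a short sketch; the function name
``kfence_pool_bytes`` is chosen for this example only:

```python
def kfence_pool_bytes(num_objects: int, page_size: int = 4096) -> int:
    """Pool size in bytes: (#objects + 1) * 2 * PAGE_SIZE."""
    return (num_objects + 1) * 2 * page_size


# Default of 255 objects with 4 KiB pages.
print(kfence_pool_bytes(255) // (1024 * 1024), "MiB")
```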
| |
| Note: On architectures that support huge pages, KFENCE will ensure that the |
| pool is using pages of size ``PAGE_SIZE``. This will result in additional page |
| tables being allocated. |
| |
| Error reports |
| ~~~~~~~~~~~~~ |
| |
| A typical out-of-bounds access looks like this:: |
| |
| ================================================================== |
| BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234 |
| |
| Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72): |
| test_out_of_bounds_read+0xa6/0x234 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32 |
| |
| allocated by task 484 on cpu 0 at 32.919330s: |
| test_alloc+0xfe/0x738 |
| test_out_of_bounds_read+0x9b/0x234 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7 |
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 |
| ================================================================== |
| |
| The header of the report provides a short summary of the function involved in |
| the access. It is followed by more detailed information about the access and |
| its origin. Note that real kernel addresses are only shown when using the |
| kernel command line option ``no_hash_pointers``. |
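
When triaging reports with scripts, the object line can be related back to the
faulting address. The sketch below is a hypothetical helper and not part of
KFENCE; it assumes the line format shown in the example report and unhashed
addresses (``no_hash_pointers``).

```python
import re

# Matches e.g. "kfence-#72: 0x...-0x..., size=32, cache=kmalloc-32"
OBJECT_RE = re.compile(
    r"kfence-#(\d+): 0x([0-9a-f]+)-0x([0-9a-f]+), size=(\d+), cache=(\S+)"
)


def describe_access(access_addr: int, object_line: str) -> str:
    """Classify an access address relative to the reported object bounds."""
    m = OBJECT_RE.search(object_line)
    if m is None:
        raise ValueError("not a KFENCE object line")
    index = m.group(1)
    start, end = int(m.group(2), 16), int(m.group(3), 16)  # end is inclusive
    if access_addr < start:
        return f"{start - access_addr}B left of kfence-#{index}"
    if access_addr > end:
        return f"right of kfence-#{index}"
    return f"in kfence-#{index}"


line = ("kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, "
        "size=32, cache=kmalloc-32")
print(describe_access(0xffff8c3f2e291fff, line))
```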
| |
| Use-after-free accesses are reported as:: |
| |
| ================================================================== |
| BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143 |
| |
| Use-after-free read at 0xffff8c3f2e2a0000 (in kfence-#79): |
| test_use_after_free_read+0xb3/0x143 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| kfence-#79: 0xffff8c3f2e2a0000-0xffff8c3f2e2a001f, size=32, cache=kmalloc-32 |
| |
| allocated by task 488 on cpu 2 at 33.871326s: |
| test_alloc+0xfe/0x738 |
| test_use_after_free_read+0x76/0x143 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| freed by task 488 on cpu 2 at 33.871358s: |
| test_use_after_free_read+0xa8/0x143 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| CPU: 2 PID: 488 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7 |
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 |
| ================================================================== |
| |
| KFENCE also reports on invalid frees, such as double-frees:: |
| |
| ================================================================== |
| BUG: KFENCE: invalid free in test_double_free+0xdc/0x171 |
| |
| Invalid free of 0xffff8c3f2e2a4000 (in kfence-#81): |
| test_double_free+0xdc/0x171 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| kfence-#81: 0xffff8c3f2e2a4000-0xffff8c3f2e2a401f, size=32, cache=kmalloc-32 |
| |
| allocated by task 490 on cpu 1 at 34.175321s: |
| test_alloc+0xfe/0x738 |
| test_double_free+0x76/0x171 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| freed by task 490 on cpu 1 at 34.175348s: |
| test_double_free+0xa8/0x171 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| CPU: 1 PID: 490 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7 |
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 |
| ================================================================== |
| |
| KFENCE also uses pattern-based redzones on the other side of an object's guard |
| page, to detect out-of-bounds writes on the unprotected side of the object. |
| These are reported on frees:: |
| |
| ================================================================== |
| BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184 |
| |
| Corrupted memory at 0xffff8c3f2e33aff9 [ 0xac . . . . . . ] (in kfence-#156): |
| test_kmalloc_aligned_oob_write+0xef/0x184 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| kfence-#156: 0xffff8c3f2e33afb0-0xffff8c3f2e33aff8, size=73, cache=kmalloc-96 |
| |
| allocated by task 502 on cpu 7 at 42.159302s: |
| test_alloc+0xfe/0x738 |
| test_kmalloc_aligned_oob_write+0x57/0x184 |
| kunit_try_run_case+0x61/0xa0 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x176/0x1b0 |
| ret_from_fork+0x22/0x30 |
| |
| CPU: 7 PID: 502 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7 |
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 |
| ================================================================== |
| |
| For such errors, the address where the corruption occurred as well as the |
| invalidly written bytes (offset from the address) are shown; in this |
| representation, '.' denotes an untouched byte. In the example above ``0xac`` is |
| the value written to the invalid address at offset 0, and the remaining '.' |
| denote that no following bytes have been touched. Note that real values are |
| only shown if the kernel was booted with ``no_hash_pointers``; to avoid |
| information disclosure otherwise, '!' is used instead to denote invalidly |
| written bytes. |
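
The byte representation can be decoded mechanically. The following sketch uses
a hypothetical helper, ``decode_corruption``, and assumes the bracketed format
shown in the report above; it maps each offset from the reported address to the
written value, or to ``None`` where the value is hidden behind '!'.

```python
def decode_corruption(field: str) -> dict:
    """Map offsets from the reported address to invalidly written bytes."""
    tokens = field.strip("[] ").split()
    written = {}
    for offset, tok in enumerate(tokens):
        if tok == ".":
            continue  # untouched byte
        # '!' marks a written byte whose value is not disclosed
        written[offset] = None if tok == "!" else int(tok, 16)
    return written


print(decode_corruption("[ 0xac . . . . . . ]"))
```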
| |
| And finally, KFENCE may also report on invalid accesses to any protected page |
| where it was not possible to determine an associated object, e.g. if adjacent |
| object pages had not yet been allocated:: |
| |
| ================================================================== |
| BUG: KFENCE: invalid read in test_invalid_access+0x26/0xe0 |
| |
| Invalid read at 0xffffffffb670b00a: |
| test_invalid_access+0x26/0xe0 |
| kunit_try_run_case+0x51/0x85 |
| kunit_generic_run_threadfn_adapter+0x16/0x30 |
| kthread+0x137/0x160 |
| ret_from_fork+0x22/0x30 |
| |
| CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7 |
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014 |
| ================================================================== |
| |
| DebugFS interface |
| ~~~~~~~~~~~~~~~~~ |
| |
| Some debugging information is exposed via debugfs: |
| |
| * The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics. |
| |
| * The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects |
| allocated via KFENCE, including those already freed but protected. |
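
For monitoring, the statistics can be collected by a script. The sketch below
assumes that the ``stats`` file consists of ``name: value`` lines with integer
values; both this format and the counter names in the sample text are
assumptions for illustration, so check the file on the target kernel before
relying on them.

```python
def parse_kfence_stats(text: str) -> dict:
    """Parse 'name: value' lines into a dict, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


# Hypothetical sample; on a live system, read the text from
# /sys/kernel/debug/kfence/stats instead.
sample = "enabled: 1\ncurrently allocated: 3\ntotal bugs: 0\n"
print(parse_kfence_stats(sample))
```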
| |
| Implementation Details |
| ---------------------- |
| |
| Guarded allocations are set up based on the sample interval. After expiration |
| of the sample interval, the next allocation through the main allocator (SLAB or |
| SLUB) returns a guarded allocation from the KFENCE object pool (allocation |
| sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and |
| the next allocation is set up after the expiration of the interval. |
| |
| When using ``CONFIG_KFENCE_STATIC_KEYS=y``, KFENCE allocations are "gated" |
| through the main allocator's fast-path by relying on static branches via the |
| static keys infrastructure. The static branch is toggled to redirect the |
| allocation to KFENCE. Depending on sample interval, target workloads, and |
| system architecture, this may perform better than the simple dynamic branch. |
| Careful benchmarking is recommended. |
| |
| KFENCE objects each reside on a dedicated page, at either the left or right |
| page boundaries selected at random. The pages to the left and right of the |
| object page are "guard pages", whose attributes are changed to a protected |
| state, and cause page faults on any attempted access. Such page faults are then |
| intercepted by KFENCE, which handles the fault gracefully by reporting an |
| out-of-bounds access, and marking the page as accessible so that the faulting |
| code can (wrongly) continue executing (set ``panic_on_warn`` to panic instead). |
| |
| To detect out-of-bounds writes to memory within the object's page itself, |
| KFENCE also uses pattern-based redzones. For each object page, a redzone is set |
| up for all non-object memory. For typical alignments, the redzone is only |
| required on the unguarded side of an object. Because KFENCE must honor the |
| cache's requested alignment, special alignments may result in unprotected gaps |
| on either side of an object, all of which are redzoned. |
| |
| The following figure illustrates the page layout:: |
| |
|     ---+-----------+-----------+-----------+-----------+-----------+--- |
|        | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |    |
|        | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |    |
|        | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |    |
|        | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |    |
|        | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |    |
|        | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |    |
|     ---+-----------+-----------+-----------+-----------+-----------+--- |
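
The placement of an object within its page, and the resulting redzones, can be
sketched as follows. This is a simplified model under stated assumptions and
not the kernel code: a 4 KiB page, a right-placed object aligned down to the
cache alignment, and an alignment of 8 bytes chosen for the example. With these
inputs, the sketch reproduces the object address of the memory corruption
report shown earlier (``size=73``).

```python
PAGE_SIZE = 4096  # assumed page size


def place_object(page_addr: int, size: int, align: int, right: bool) -> int:
    """Return the object address within its page."""
    if not right:
        return page_addr  # left placement: object starts at the page boundary
    addr = page_addr + PAGE_SIZE - size  # push against the right boundary
    return addr - (addr % align)         # honor the requested alignment


def redzones(page_addr: int, obj_addr: int, size: int):
    """Return the [start, end) ranges of non-object memory in the page."""
    return (page_addr, obj_addr), (obj_addr + size, page_addr + PAGE_SIZE)


page = 0xffff8c3f2e33a000
obj = place_object(page, size=73, align=8, right=True)
left, right = redzones(page, obj, 73)
print(hex(obj), right[1] - right[0])
```

The gap of 7 bytes between the end of the object and the guard page is the
unprotected space created by the alignment; it is covered by the redzone, which
is where the corrupted byte in that report was found.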
| |
| Upon deallocation of a KFENCE object, the object's page is again protected and |
| the object is marked as freed. Any further access to the object causes a fault |
| and KFENCE reports a use-after-free access. Freed objects are inserted at the |
| tail of KFENCE's freelist, so that the least recently freed objects are reused |
| first, and the chances of detecting use-after-frees of recently freed objects |
| are increased. |
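
The reuse order can be illustrated with a FIFO queue. The class below is a
sketch with hypothetical names and models only the ordering of the freelist,
not page protection or metadata.

```python
from collections import deque


class FreelistSketch:
    """FIFO freelist: free at the tail, allocate from the head."""

    def __init__(self, num_objects: int):
        self.free = deque(range(num_objects))

    def alloc(self):
        # Returns None when the pool is exhausted.
        return self.free.popleft() if self.free else None

    def release(self, obj) -> None:
        self.free.append(obj)  # most recently freed object is reused last


fl = FreelistSketch(3)
first = fl.alloc()
fl.release(first)
print([fl.alloc() for _ in range(3)])
```

The object freed most recently is handed out only after all other free objects
have been used, which keeps it protected for as long as possible.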
| |
| If pool utilization reaches 75% (default) or above, to reduce the risk of the |
| pool eventually being fully occupied by allocated objects yet ensure diverse |
| coverage of allocations, KFENCE limits currently covered allocations of the |
| same source from further filling up the pool. The "source" of an allocation is |
| based on its partial allocation stack trace. A side-effect is that this also |
| limits frequent long-lived allocations (e.g. pagecache) of the same source |
| filling up the pool permanently, which is the most common risk for the pool |
| becoming full and the sampled allocation rate dropping to zero. The threshold |
| at which to start limiting currently covered allocations can be configured via |
| the boot parameter ``kfence.skip_covered_thresh`` (pool usage%). |
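
The limiting policy can be sketched as follows. This is an illustrative model
only: the class and method names are hypothetical, an allocation source is
represented by a plain string instead of a partial stack trace, and a simple
counter stands in for whatever data structure the kernel uses to track covered
sources.

```python
from collections import Counter


class CoverageLimiter:
    """Skip already-covered sources once pool usage reaches the threshold."""

    def __init__(self, num_objects: int, skip_covered_thresh: int = 75):
        self.num_objects = num_objects
        self.thresh = skip_covered_thresh  # pool usage in percent
        self.allocated = 0
        self.covered = Counter()           # live allocations per source

    def should_skip(self, source: str) -> bool:
        usage = 100 * self.allocated // self.num_objects
        return usage >= self.thresh and self.covered[source] > 0

    def on_alloc(self, source: str) -> None:
        self.allocated += 1
        self.covered[source] += 1

    def on_free(self, source: str) -> None:
        self.allocated -= 1
        self.covered[source] -= 1


limiter = CoverageLimiter(num_objects=4)
for src in ("a", "a", "b"):
    limiter.on_alloc(src)
print(limiter.should_skip("a"), limiter.should_skip("c"))
```

At 75% usage, a further allocation from source ``a`` is skipped because that
source is already covered, while a new source ``c`` may still take the last
object.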
| |
| Interface |
| --------- |
| |
| The following describes the functions which are used by allocators as well as |
| page handling code to set up and deal with KFENCE allocations. |
| |
| .. kernel-doc:: include/linux/kfence.h |
| :functions: is_kfence_address |
| kfence_shutdown_cache |
| kfence_alloc kfence_free __kfence_free |
| kfence_ksize kfence_object_start |
| kfence_handle_page_fault |
| |
| Related Tools |
| ------------- |
| |
| In userspace, a similar approach is taken by `GWP-ASan |
| <http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and |
| a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is |
| directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another |
| similar but non-sampling approach, that also inspired the name "KFENCE", can be |
| found in the userspace `Electric Fence Malloc Debugger |
| <https://linux.die.net/man/3/efence>`_. |
| |
| In the kernel, several tools exist to debug memory access errors, and in |
| particular KASAN can detect all bug classes that KFENCE can detect. While KASAN |
| is more precise, relying on compiler instrumentation, this comes at a |
| performance cost. |
| |
| It is worth highlighting that KASAN and KFENCE are complementary, with |
| different target environments. For instance, KASAN is the better debugging aid |
| when test cases or reproducers exist: because KFENCE has a lower chance of |
| detecting the error, debugging with it would require more effort. Deployments |
| at scale that cannot afford to enable KASAN, however, would benefit from using |
| KFENCE to discover bugs in code paths not exercised by test cases or fuzzers. |