xnu-kernel-restricted.md - doc/allocators/xnu-kernel-restricted.md - Xnu source code xnu-12377.101.15

# Protecting Kernel-Private Memory

## Intro

We can classify all kernel-allocated memory into two high-level categories:

1. *Kernel-private memory*
2. *Kernel-shareable memory*

*Kernel-private memory* covers all the memory used exclusively by the kernel,
that is never meant to be shared with external domains. Therefore, such memory
should never be mapped into different address spaces — neither to userspace nor
to coprocessors via IOMMUs/DARTs. All zone/`kalloc_type()` managed memory which
contains pointers is de facto kernel private — as sharing kernel pointers with
other domains would be a security violation. It is however worth noting that
some data allocations are never going to be shared, and can be considered
kernel-private memory.

*Kernel-shareable memory* covers allocations made by the kernel that are meant
to be shared with external address spaces by-design. Such memory is not allowed
to contain kernel pointers nor any kernel-private information, and as a result
is always pure data allocations.

A lot of work has been done in our type-segregated allocators that we can
leverage so that the kernel can enforce appropriate mapping policies to make
sure that kernel-private memory actually stays private even in the presence of
bugs. Without such enforcement, attackers could attempt exploiting various
kernel interfaces to gain access to kernel-private memory into their address
space, which would bypass most of the state-of-the-art mitigations in the
kernel.

This document covers the problem space, the security boundaries we defend, and
the technical details of the mitigation.

## Problem space

The security boundaries we consider here are:

1. **user → kernel**: we consider attackers that have successfully compromised
a userspace process, and attempt to compromise the kernel via any form of
kernel vulnerability, including Mach VM logic bugs;
2. **coprocessors → kernel**: we consider attackers that have successfully
compromised a coprocessor, and attempt to compromise the kernel via any RPC
interface exposed by kernel extensions to these coprocessors.

These boundaries are special, because they often comprise APIs to map or share
memory between the kernel and userspace or coprocessors, that could be misused:

* The Mach VM subsystem manages virtual address spaces; therefore, bugs in this
subsystem could be abused to create illegal mappings to kernel-private memory.
* Many coprocessors operate on memory shared with the Application Processor, and
need to access memory owned by userspace tasks as well as memory managed by the
kernel. Because of that, some kernel extensions expose RPC interfaces to their
counterpart coprocessor, that allow for mapping memory via their IOMMU/DART.
This exposes a wide — and usually bespoke — attack surface that can lead to
illegal mappings to kernel-private memory to be created.

If attackers could gain the ability to map kernel-private memory into an
address space they control, they effectively defeat the boundary. This allows
them to access kernel pointers freely, which at least gives them a way to guide
attacks — if the mapping is read-only — but could even give them arbitrary
kernel read-write right away. At this point, most kernel mitigations can more
easily be bypassed, and exploitation becomes significantly easier.

## XNU_KERNEL_RESTRICTED

The Secure Page Table Monitor (SPTM) is highly privileged component that
defines and enforces all the policies that govern page table management,
for both the kernel and user applications, on behalf of XNU. Its goal is
to protect the overall system by securing the page tables against bad
actors, even in the presence of a compromised kernel.

The SPTM has a *type system*, which sits at the heart of the SPTM security
policies and primarily comprises the *frame table*, a data structure that
stores metadata associated with every managed physical page in the system,
alongside an immutable security policies that described what is allowed or
disallowed for that specific physical page at any given time. For each frame
type, there is a very clear set of policies that governs the permitted states
for a given physical page and restricts which transitions are allowed.

To address the above, the SPTM introduced a dedicated frame type for
kernel-private memory: `XNU_KERNEL_RESTRICTED` (X_K_R). This type has three
special policies that the SPTM enforces even in the presence of an XNU
compromise:

1. `XNU_KERNEL_RESTRICTED` pages can only be mapped in the kernel address
space — hence never in any user process.
2. `XNU_KERNEL_RESTRICTED` pages are not allowed be mapped via IOMMU/DART.
3. `XNU_KERNEL_RESTRICTED` pages are only allowed a single mapping beyond the
physical aperture static one.

Because all transitions that would affect mappings have to go through the SPTM,
these policies can be enforced, and will lead to a panic if an
`XNU_KERNEL_RESTRICTED` page is being involved in an illegal transition.


```

 ┌──────────────────────────────────────────────────────┐
 │                                                      │
 │                                                      │
 │                     ┌────────────┐   ┌────────────┐  │
 │                     │            │   │            │  │
 │   userspace         │   Task A   │   │   Task B   │  │
 │                     │            │   │            │  │
 │                     └──────▲─────┘   └─────▲──────┘  │
 │                                            │         │
 │                            │               │         │
 │                                            │         │
 │                            │               │         │
 │                                            │         │
 ├────────────────────────────┼───────────────┼─────────┤
 │                                            │         │
 │                   ┌────────┴────────┐      │         │        ┌─────────────┐
 │                   │   X_K_R page    │      │         │        │             │
 │                   │   refcnt == 1   │─ ─ ─ ┼ ─ ─ ─ ─ ┼ ─ ─ ─ ▶│ Coprocessor │
 │                   │                 │      │         │  ┌────▶│      C      │
 │  kernelspace      └─────────────────┘      │         │  │     │             │
 │                                            │         │  │     └─────────────┘
 │                                  ┌─────────┴───────┐ │  │
 │                                  │ non-X_K_R page  │ │  │
 │                                  │   refcnt == 3   │─┼──┘
 │                                  │                 │ │
 │                                  └─────────────────┘ │
 └──────────────────────────────────────────────────────┘


  ─────────▶    Legal mapping


  ─ ─ ─ ─ ─▶   Illegal mapping

```

## Security value

### Deterministic runtime mitigation

This mitigation stops **any** exploitation technique that involves mapping
kernel-private memory outside of the kernel address space, and forces attackers
to go down the path of full classic kernel exploitation. This means facing all
the kernel mitigations, including MTE. On top of mitigating all the attacks that
rely on using sharing/mapping interfaces, there is an immediate impact on
another class of MachVM security bugs: *Physical Use-after-free*.

The Mach VM manages the lifecycle of physical pages on the system. VM maps are
the source of truth of the system, and the pmap and page-tables are a live
cache of that state. The Mach VM has had bugs where the page tables would have
dangling page table entries (PTEs) — where these PTEs represented mappings that
should not exist anymore, and that the VM lost track of.

We call this class of inconsistency bugs *Physical Use-after-free (PUAF)*. When
the VM thinks a page became unused, it adds it to a freelist of physical pages
in order to repurpose it to hold new content, for possibly a completely
different security domain. In the case of a PUAF, the VM leaves a dangling
mapping that an attacker can take abuse to still access the content of a page
after it has been repurposed.

`XNU_KERNEL_RESTRICTED` forms a guarantee around this bug-class. SPTM requires
that a page has no active mappings when it is retyped, and while the VM has
lost track of the dangling mapping, SPTM will not. As a result, it becomes
impossible for an attacker to maintain access via a dangling PTE to a page
that was or would become `XNU_KERNEL_RESTRICTED`: the SPTM would detect the
illegal retyping operation and would panic the system immediately. Gaining
access to `XNU_KERNEL_RESTRICTED` memory via PUAF is hence deterministically
stopped.

However, attackers can try to exploit PUAFs on the same frame type, which
would not go through an SPTM retyping operation. For example, a page that was
used for user data in a task A that gets reused to hold completely different
data into a task B is such a scenario, and leads to an attacker breaking the
process address space isolation the VM is meant to provide. To address that,
we use a runtime check each time the Mach VM moves a physical page into a
“freed” state. We simply utilize SPTM’s precise tracking of mappings and
use it to assert that the page indeed has no active mapping. As a result,
any direct attempt to recycle a physical page with active mappings
deterministically panic the system.

### Protecting MTE

We apply MTE to the kernel to any dynamic memory that contains kernel pointers,
in order to mitigate use-after-free and out-of-bounds bugs. This can also be
extended to all kernel-private memory, not just the part that contains kernel
pointers — but isn’t at this time. The more memory we tag, the larger the
attack surface we protect.

However, if there is any way to access tagged memory without going through tag
checks, MTE is bypassed. Which is why we have to disallow any attempt of
mapping MTE tagged pages outside of the kernel address space, which completely
coincide with the purpose of `XNU_KERNEL_RESTRICTED`. In the end, this is no
different than the motivation described above; it is just that MTE makes it
even more appealing for attackers to reach for said primitives, which makes
`XNU_KERNEL_RESTRICTED` even more important.

## Typing memory

To know what is kernel-private with pointers, kernel-private without pointers
and kernel-shareable, we use the type-based segregation provided by in
[kalloc_type](https://security.apple.com/blog/towards-the-next-generation-of-xnu-memory-safety/).
The zone allocator (*zalloc*) and *kmem* already propagate metadata describing
the allocation, via the `KMEM_DATA`, or `KMEM_DATA_SHARED` flags:

* All *kmem* based allocations, besides `KMEM_COMPRESSOR` and
`KMEM_DATA_SHARED` pages, are typed as `XNU_KERNEL_RESTRICTED`.
* All *zalloc* allocations, besides the shareable data heap and ROAllocator,
are typed `XNU_KERNEL_RESTRICTED`.