guard-objects.md diff - doc/allocators/guard-objects.md - Xnu source code xnu-12377.101.15

doc/allocators/guard-objects.md xnu-12377.101.15 ⇄ /dev/null
--- xnu/xnu-12377.101.15/doc/allocators/guard-objects.md
+++ /dev/null
@@ -1,286 +0,0 @@
-# Guard-object allocation policy
-
-The contemporary techniques we use to group and protect smaller allocations,
-such as zalloc and kalloc_type, are premised around type-isolation and VA
-sequestering. These techniques are impractical for larger allocations because
-they consume outsized chunks of virtual address space, and could lead to
-exhaustion. We therefore use a different strategy for larger allocations, which
-we call guard objects.
-
-
-## Algorithm
-
-Allocation policies are assigned to particular regions of kernel address space.
-The strategy specified by this document is used by regions such as the various
-pointer ranges of `kmem_alloc()`, which previously used a first-fit allocator.
-Under the guard-objects policy, the virtual address space is divided in
-*chunks*. A chunk has a size class of the form $2^k \times \mathtt{PAGE\_SIZE}$,
-and is made of $\mathcal{S}$ *slots* of that size. $\mathcal{S}$ varies with the
-size class: smaller size-classes will have more slots, and larger size classes
-will have fewer. Chunks are also configured to have $\mathcal{G}$ _guards_ and
-up to $\mathcal{Q}$ _quarantines_.
-
-Chunks maintain several other important pieces of information, such as:
-
-* which slots are allocated and which slots are free;
-* a dynamic count of quarantined slots within $[0, \mathcal{Q})$;
-* a count of *available* slots, which are the number of free slots in excess
-  of guards $\mathcal{G}$ and the number of currently quarantined slots.
-
-A chunk can be in three states: *empty* if all its slots are free, *partial* if
-it has at least one available slot, *full* if it has no available slots. Chunks
-are maintained in lists segregated by size class and state.
-
-### Allocating memory
-
-Memory requests for a given size are rounded up to the next size class of the
-form $2^k$. The allocator must first select an appropriate chunk. Partial chunks
-are preferred, and if no partial chunk exists, an empty chunk is allocated from
-unused virtual address space.
-
-A random slot is then chosen from any of the free slots in that chunk, and the
-available count of the chunk is decremented. If the chunk has now exhausted its
-available slots — only quarantined slots and guards are left — it's placed on
-its corresponding size class's full list.
-
-Visually, let’s consider an example with two partial chunks A and B, with 8
-slots each, for $\mathcal{G} = \mathcal{Q} = 2$:
-
-```
-                A───────────────┬─┬─┬─┬─┐     B───────────────┬─┬─┬─┬─┐
-                │ Available:  1 │X│ │ │X│     │ Available:  4 │ │ │X│ │
-             ┌─▶│ Quarantine: 0 ├─┼─┼─┼─┤────▶│ Quarantine: 0 ├─┼─┼─┼─┤
-┌─────────┐  │  │ Guards:     2 │X│X│X│ │     │ Guards:     2 │ │X│ │ │
-│ partial │──┘  └───────────────┴─┴─┴─┴─┘     └───────────────┴─┴─┴─┴─┘
-├─────────┤
-│  full   │
-└─────────┘
-
-Legend:
-┌─┐                  ┌─┐
-│ │ free slot        │X│ allocated slot
-└─┘                  └─┘
-```
-
-The first allocation will be performed from chunk A, using its last available
-slot and moving it to the full list:
-
-```
-                B───────────────┬─┬─┬─┬─┐
-                │ Available:  4 │ │ │X│ │
-             ┌─▶│ Quarantine: 0 ├─┼─┼─┼─┤
-┌─────────┐  │  │ Guards:     2 │ │X│ │ │
-│ partial │──┘  └───────────────┴─┴─┴─┴─┘
-├─────────┤
-│  full   │──┐  A───────────────┬─┬─┬─┬─┐
-└─────────┘  │  │ Available:  0 │X│ │X│X│
-             └─▶│ Quarantine: 0 ├─┼─┼─┼─┤
-                │ Guards:     2 │X│X│X│ │
-                └───────────────┴─┴─┴─┴─┘
-```
-
-When the next allocation request in this size class arrives, the allocator will
-select B because A is now full:
-
-```
-                B───────────────┬─┬─┬─┬─┐
-                │ Available:  3 │ │ │X│ │
-             ┌─▶│ Quarantine: 0 ├─┼─┼─┼─┤
-┌─────────┐  │  │ Guards:     2 │ │X│X│ │
-│ partial │──┘  └───────────────┴─┴─┴─┴─┘
-├─────────┤
-│  full   │──┐  A───────────────┬─┬─┬─┬─┐
-└─────────┘  │  │ Available:  0 │X│ │X│X│
-             └─▶│ Quarantine: 0 ├─┼─┼─┼─┤
-                │ Guards:     2 │X│X│X│ │
-                └───────────────┴─┴─┴─┴─┘
-```
-
-### Deallocating memory
-
-Deallocating a virtual memory range works in reverse. First, we evaluate which
-chunk and slot correspond to the range being freed. Since the semantics of the
-virtual memory subsystem mandate that we must support partial deallocations, we
-next consider whether the slot has become only partially free. If so, we have
-nothing more to do for now; the slot remains in use.
-
-If however the slot is now entirely free, then the quarantine count of the chunk
-is incremented. If at least $\mathcal{G} + \mathcal{Q}$ are free, then the
-quarantine is cleared. The idea behind this policy is that maintaining a good
-entropy requires enough free slots to choose from. As a result, once the free
-slot count dips below $\mathcal{G} + \mathcal{Q}$, freed slots are quarantined
-rather than made immediately available.
-
-Finally, we evaluate whether the chunk needs to be moved:
-
-* if a chunk's slots are all fully free, the chunk is marked as empty, and is
-  typically returned to the system as free space;
-* if the chunk previously had no slots available, but has any available now,
-  the chunk is moved to the partially-free chunks list.
-
-Visually, let’s consider an example with a chunk using 16 slots,
-under a configuration in which $\mathcal{G} = \mathcal{Q} = 4$:
-
-```
-                 A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
-┌─────────┐      │ Available:  1 │ │X│X│X│ │ │X│ │
-│ partial │─────▶│ Quarantine: 1 ├─┼─┼─┼─┼─┼─┼─┼─┤
-├─────────┤      │ Guards:     4 │ │X│X│X│X│X│ │X│
-│  full   │      └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
-└─────────┘
-
-Legend:
-┌─┐                  ┌─┐
-│ │ free slot        │X│ allocated slot
-└─┘                  └─┘
-```
-
-If we now free an element, its slot is marked free, and the quarantine count
-is increased:
-
-```
-                 A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
-┌─────────┐      │ Available:  1 │ │X│X│X│ │ │X│ │
-│ partial │─────▶│ Quarantine: 2 ├─┼─┼─┼─┼─┼─┼─┼─┤
-├─────────┤      │ Guards:     4 │ │X│X│ │X│X│ │X│
-│  full   │      └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
-└─────────┘
-```
-
-If we allocate the last available element, a slot is now marked used,
-the available count drops to 0, and causes the chunk to now be full,
-and the quarantine stays untouched:
-
-```
-┌─────────┐
-│ partial │      A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
-├─────────┤      │ Available:  0 │X│X│X│X│ │ │X│ │
-│  full   │─────▶│ Quarantine: 2 ├─┼─┼─┼─┼─┼─┼─┼─┤
-└─────────┘      │ Guards:     4 │ │X│X│ │X│X│ │X│
-                 └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
-```
-
-Freeing just one element would return just one slot, and bump the quarantine
-count to 3. It takes freeing two elements for more than $\mathcal{G} +
-\mathcal{Q}$ slots to be free, leading to clearing the quarantine:
-
-```
-                 A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
-┌─────────┐      │ Available:  4 │X│X│X│X│ │ │X│ │
-│ partial │─────▶│ Quarantine: 0 ├─┼─┼─┼─┼─┼─┼─┼─┤
-├─────────┤      │ Guards:     4 │ │X│X│ │ │ │ │X│
-│  full   │      └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
-└─────────┘
-```
-
-### Operation clamping to a slot
-
-As long as a VM operation does not exceed slot bounds, any VM call is permitted,
-whether it is configuration altering calls such as `vm_protect()` or
-`vm_inherit()`, taking copies of the range with `mach_make_memory_entry()`, or
-replacing part of the mapping with `vm_allocate(..., VM_FLAGS_FIXED |
-VM_FLAGS_OVERWRITE)`.
-
-However, operations that cross a slot boundary are not permitted, and lead to
-termination. When guard object policies are in effect, allocations are
-randomized, and even in a single threaded context, clients can't assume that
-consecutive allocations will be served in address order, and as a result,
-operations crossing slot boundaries are always bugs.
-
-
-## Security motivation
-
-With the algorithm explanation, the “guard object” moniker should start to make
-sense: the goal is that unused free slots are object-sized guards — in the
-guard-page sense. Unlike usage of traditional guard pages, which only protect
-against linear buffer overflows, this scheme also adds a probabilistic
-mitigation against use-after-free or non-linear buffer overflows, forcing
-attackers to contend with a high probability of hitting these guards.
-
-Due to visible timing differences, it is assumed that an attacker can observe
-when a chunk is being allocated or freed. However, this doesn't give them an
-edge because the system maintains a guarantee that at least
-$\mathcal{G}/\mathcal{S}$ of the address space causes a crash when accessed at
-any given time. This is why we can let go of any form of sequestering of the
-address space for ranges managed with the guard-object allocation policy.
-
-### Use-after-Free exploitation strategies and failure rates
-
-Attackers attempting to exploit a use-after-free will be able to cause an
-element to be freed, knowing that a dangling pointer to it remains. They will
-then try to replace this freed element with another one of a different type
-that they control, creating a type confusion.
-
-Because allocating new chunks causes visible timing differences, we can assume
-that attackers are able to fill chunks with all slots corresponding to elements
-that they control. They also know which elements are in the same chunk, but
-don't know which element corresponds to which slot.
-
-**In the absence of the quarantine**, the best strategy for attackers trying to
-exploit a use-after free is to perform $\mathcal{S} − \mathcal{G}$ rounds
-of freeing then reallocating each element in the chunk, where the first free is
-to the element they are trying to use-after-free, so that they retain a dangling
-pointer to the original slot. Each round, the allocator will choose one slot
-among $\mathcal{G} + 1$, when only one corresponds to the slot that was freed by
-triggering the bug. The failure rate the attackers face with this strategy is
-
-$$failure\_rate = \left(
-\frac{\mathcal{G}}{\mathcal{G+1}}\right)^{\mathcal{S} - \mathcal{G}}$$
-
-* $\mathcal{S} = 8, \mathcal{G} = 2$ yields a 8.8% failure rate;
-* $\mathcal{S} = 16, \mathcal{G} = 4$ yields a 6.9 failure rate.
-
-**Using the quarantine** further reduces an attacker's odds of success. Unlike
-before, they need to free at least $\mathcal{Q}$ elements before elements are
-made available for allocations again. As a result, a round now needs to be
-$\mathcal{Q}$ deallocations followed by $\mathcal{Q}$ allocations. As before,
-the very first deallocation is to the element the attacker tries to
-use-after-free. A round gives attackers $\frac{\mathcal{G}}{
-\mathcal{G}+\mathcal{Q}}$ chances of failure. The aggregate failure rate for
-this strategy is
-
-$$failure\_rate =\left( 1 - \frac{(\mathcal{S} - \mathcal{G)} \bmod \mathcal{Q}
-}{\mathcal{G} + \mathcal{Q}} \right)
-\left(\frac{\mathcal{G}}{\mathcal{G} + \mathcal{Q}}\right)^{
-\left\lfloor \frac{\mathcal{S} -\mathcal{G}}{\mathcal{Q}} \right\rfloor}$$
-
-* $\mathcal{S}=8, \mathcal{G}=1, \mathcal{Q}=4$ yields a 8% failure rate;
-* $\mathcal{S}=8, \mathcal{G}=2, \mathcal{Q}=2$ yields a 12.5% failure rate;
-* $\mathcal{S}=16, \mathcal{G}=4, \mathcal{Q}=3$ yields a 10.7% failure rate;
-* $\mathcal{S}=16, \mathcal{G}=4, \mathcal{Q}=4$ yields a 12.5% failure rate.
-
-### Out-of-bound exploitation strategies
-
-Exploiting out-of-bound bugs requires knowing the distance between allocated
-objects, which an attacker a priori doesn’t know without some information leak.
-The regions protected by guard objects are coarsely randomized in the address
-space, so that attackers can’t predict how far an allocation is from other juicy
-exploitation targets in the address space such as the `__DATA` segment.  Lastly,
-guard objects are combined in XNU with type isolation and allocation fronts. It
-makes cross-type attacks unreliable and unpredictable — as the various buckets
-of types are randomized per boot.
-
-The last remaining avenue of attack targets types that fall in the same
-allocation front. However, attackers still have to contend with the uniform
-$\mathcal{G}/\mathcal{S}$ density of holes, making out-of-bound unreliable with
-a probability of failure close to $\mathcal{G}/\mathcal{S}$. This probability of
-failure is typically worse than use-after-free for attackers, and intuitively,
-this is precisely because they need to know more information — the distance
-between objects — unlike use-after-free where that distance is obviously known
-to be zero.
-
-### Choice of parameters
-
-In the actual implementation, $\mathcal{S}$ scales up with the size of the
-allocations going up — as a way to limit the amount of metadata needed.
-Maintaining the $\mathcal{G}/\mathcal{S}$ and $\mathcal{Q}/\mathcal{S}$ ratios
-constant for any $\mathcal{S}$ allows for all probabilities to become
-independent of $\mathcal{S}$.
-
-Our goal is to make attackers face at least 10% chance of failure — in the
-absence of information disclosure — which is why we chose numbers where
-$\mathcal{G} = \mathcal{Q} = \mathcal{S}/4$, yielding:
-
-* 25% failure rates for out-of-bound exploitation;
-* 12.5% failure rates for use-after-free exploitation.
-