# Guard-object allocation policy

The contemporary techniques we use to group and protect smaller allocations, such as zalloc and kalloc_type, are premised on type isolation and VA sequestering. These techniques are impractical for larger allocations because they consume outsized chunks of virtual address space, and could lead to its exhaustion. We therefore use a different strategy for larger allocations, which we call guard objects.

## Algorithm

Allocation policies are assigned to particular regions of kernel address space. The strategy specified by this document is used by regions such as the various pointer ranges of `kmem_alloc()`, which previously used a first-fit allocator.

Under the guard-objects policy, the virtual address space is divided into *chunks*. A chunk has a size class of the form $2^k \times \mathtt{PAGE\_SIZE}$, and is made of $\mathcal{S}$ *slots* of that size.
$\mathcal{S}$ varies with the size class: smaller size classes have more slots, and larger size classes have fewer. Chunks are also configured to have $\mathcal{G}$ _guards_ and up to $\mathcal{Q}$ _quarantines_.

Chunks maintain several other important pieces of information, such as:

* which slots are allocated and which slots are free;
* a dynamic count of quarantined slots within $[0, \mathcal{Q})$;
* a count of *available* slots, which is the number of free slots in excess of the $\mathcal{G}$ guards and the number of currently quarantined slots.

A chunk can be in one of three states:

* *empty* if all its slots are free,
* *partial* if it has at least one available slot,
* *full* if it has no available slots.

Chunks are maintained in lists segregated by size class and state.

### Allocating memory

Memory requests for a given size are rounded up to the next size class of the form $2^k$. The allocator must first select an appropriate chunk. Partial chunks are preferred; if no partial chunk exists, an empty chunk is allocated from unused virtual address space.

A random slot is then chosen from among the free slots in that chunk, and the available count of the chunk is decremented. If the chunk has now exhausted its available slots — only quarantined slots and guards are left — it is placed on its corresponding size class's full list.
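As a rough illustration, the chunk bookkeeping and the allocation step above could be modeled in Python as follows. All names are hypothetical, and quarantined slots are assumed to be ineligible for allocation; this is a sketch of the policy, not XNU's implementation:

```python
import random

class Chunk:
    """Toy model of one chunk; names and layout are illustrative only."""

    def __init__(self, slots, guards):
        self.guards = guards           # G: free slots always kept as guards
        self.free = set(range(slots))  # indices of free slots
        self.quarantined = set()       # freed slots held back from reuse

    @property
    def available(self):
        # free slots in excess of the guards and currently quarantined slots
        return len(self.free) - self.guards - len(self.quarantined)

def allocate(chunk):
    """Pick a random eligible free slot; None means the chunk is full."""
    if chunk.available <= 0:
        return None                    # only guards/quarantine remain
    eligible = sorted(chunk.free - chunk.quarantined)
    slot = random.choice(eligible)
    chunk.free.remove(slot)
    return slot
```

With $\mathcal{S} = 8$ and $\mathcal{G} = 2$ and an empty quarantine, repeated calls to `allocate()` succeed exactly six times before the chunk reports itself full, leaving the two guard slots untouched.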
Visually, let’s consider an example with two partial chunks A and B, with 8 slots each, for $\mathcal{G} = \mathcal{Q} = 2$:

```
               A───────────────┬─┬─┬─┬─┐     B───────────────┬─┬─┬─┬─┐
               │ Available:  1 │X│ │ │X│     │ Available:  4 │ │ │X│ │
            ┌─▶│ Quarantine: 0 ├─┼─┼─┼─┤────▶│ Quarantine: 0 ├─┼─┼─┼─┤
┌─────────┐ │  │ Guards:     2 │X│X│X│ │     │ Guards:     2 │ │X│ │ │
│ partial │─┘  └───────────────┴─┴─┴─┴─┘     └───────────────┴─┴─┴─┴─┘
├─────────┤
│ full    │
└─────────┘

Legend:
┌─┐             ┌─┐
│ │ free slot   │X│ allocated slot
└─┘             └─┘
```

The first allocation will be performed from chunk A, using its last available slot and moving it to the full list:

```
               B───────────────┬─┬─┬─┬─┐
               │ Available:  4 │ │ │X│ │
            ┌─▶│ Quarantine: 0 ├─┼─┼─┼─┤
┌─────────┐ │  │ Guards:     2 │ │X│ │ │
│ partial │─┘  └───────────────┴─┴─┴─┴─┘
├─────────┤
│ full    │─┐  A───────────────┬─┬─┬─┬─┐
└─────────┘ │  │ Available:  0 │X│ │X│X│
            └─▶│ Quarantine: 0 ├─┼─┼─┼─┤
               │ Guards:     2 │X│X│X│ │
               └───────────────┴─┴─┴─┴─┘
```

When the next allocation request in this size class arrives, the allocator will select B because A is now full:

```
               B───────────────┬─┬─┬─┬─┐
               │ Available:  3 │ │ │X│ │
            ┌─▶│ Quarantine: 0 ├─┼─┼─┼─┤
┌─────────┐ │  │ Guards:     2 │ │X│X│ │
│ partial │─┘  └───────────────┴─┴─┴─┴─┘
├─────────┤
│ full    │─┐  A───────────────┬─┬─┬─┬─┐
└─────────┘ │  │ Available:  0 │X│ │X│X│
            └─▶│ Quarantine: 0 ├─┼─┼─┼─┤
               │ Guards:     2 │X│X│X│ │
               └───────────────┴─┴─┴─┴─┘
```

### Deallocating memory

Deallocating a virtual memory range works in reverse. First, we evaluate which chunk and slot correspond to the range being freed. Since the semantics of the virtual memory subsystem mandate that we support partial deallocations, we next consider whether the slot has become only partially free. If so, we have nothing more to do for now; the slot remains in use.

If however the slot is now entirely free, the quarantine count of the chunk is incremented. If at least $\mathcal{G} + \mathcal{Q}$ slots are free, the quarantine is cleared.
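This free-path bookkeeping can be sketched in a few lines of Python. The function name and counter arguments are hypothetical, chosen only to mirror the prose, and not XNU's implementation:

```python
def on_slot_freed(free_slots, quarantine_count, guards, quarantine_max):
    """A slot just became entirely free: quarantine it, then clear the
    quarantine if at least G + Q slots of the chunk are now free.

    free_slots counts free slots *including* the one just returned.
    Returns the new quarantine count for the chunk."""
    quarantine_count += 1
    if free_slots >= guards + quarantine_max:
        quarantine_count = 0  # enough free slots for entropy: release all
    return quarantine_count
```

For instance, with $\mathcal{G} = \mathcal{Q} = 4$, a free that brings a chunk to 7 free slots merely bumps the quarantine count, while the free that reaches 8 free slots clears the quarantine entirely.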
The idea behind this policy is that maintaining good entropy requires enough free slots to choose from. As a result, once the free slot count dips below $\mathcal{G} + \mathcal{Q}$, freed slots are quarantined rather than made immediately available.

Finally, we evaluate whether the chunk needs to be moved:

* if a chunk's slots are all free, the chunk is marked as empty, and is typically returned to the system as free space;
* if the chunk previously had no slots available, but has some available now, the chunk is moved to the partial list.

Visually, let’s consider an example with a chunk using 16 slots, under a configuration in which $\mathcal{G} = \mathcal{Q} = 4$:

```
                 A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
┌─────────┐      │ Available:  1 │ │X│X│X│ │ │X│ │
│ partial │─────▶│ Quarantine: 1 ├─┼─┼─┼─┼─┼─┼─┼─┤
├─────────┤      │ Guards:     4 │ │X│X│X│X│X│ │X│
│ full    │      └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
└─────────┘

Legend:
┌─┐             ┌─┐
│ │ free slot   │X│ allocated slot
└─┘             └─┘
```

If we now free an element, its slot is marked free, and the quarantine count is increased:

```
                 A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
┌─────────┐      │ Available:  1 │ │X│X│X│ │ │X│ │
│ partial │─────▶│ Quarantine: 2 ├─┼─┼─┼─┼─┼─┼─┼─┤
├─────────┤      │ Guards:     4 │ │X│X│ │X│X│ │X│
│ full    │      └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
└─────────┘
```

If we allocate the last available element, a slot is marked used, the available count drops to 0, the chunk becomes full, and the quarantine stays untouched:

```
┌─────────┐
│ partial │      A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
├─────────┤      │ Available:  0 │X│X│X│X│ │ │X│ │
│ full    │─────▶│ Quarantine: 2 ├─┼─┼─┼─┼─┼─┼─┼─┤
└─────────┘      │ Guards:     4 │ │X│X│ │X│X│ │X│
                 └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
```

Freeing just one element would return just one slot, and bump the quarantine count to 3.
It takes freeing two elements for at least $\mathcal{G} + \mathcal{Q}$ slots to be free, which clears the quarantine:

```
                 A───────────────┬─┬─┬─┬─┬─┬─┬─┬─┐
┌─────────┐      │ Available:  4 │X│X│X│X│ │ │X│ │
│ partial │─────▶│ Quarantine: 0 ├─┼─┼─┼─┼─┼─┼─┼─┤
├─────────┤      │ Guards:     4 │ │X│X│ │ │ │ │X│
│ full    │      └───────────────┴─┴─┴─┴─┴─┴─┴─┴─┘
└─────────┘
```

### Operation clamping to a slot

As long as a VM operation does not exceed slot bounds, any VM call is permitted, whether it is a configuration-altering call such as `vm_protect()` or `vm_inherit()`, taking a copy of the range with `mach_make_memory_entry()`, or replacing part of the mapping with `vm_allocate(..., VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE)`.

However, operations that cross a slot boundary are not permitted, and lead to termination. When guard-object policies are in effect, allocations are randomized; even in a single-threaded context, clients can't assume that consecutive allocations will be served in address order. As a result, operations crossing slot boundaries are always bugs.

## Security motivation

With the algorithm explained, the “guard object” moniker should start to make sense: the goal is that unused free slots are object-sized guards — in the guard-page sense. Unlike traditional guard pages, which only protect against linear buffer overflows, this scheme also adds a probabilistic mitigation against use-after-frees and non-linear buffer overflows, forcing attackers to contend with a high probability of hitting these guards.

Due to visible timing differences, it is assumed that an attacker can observe when a chunk is being allocated or freed. However, this doesn't give them an edge, because the system maintains a guarantee that at least $\mathcal{G}/\mathcal{S}$ of the address space causes a crash when accessed at any given time. This is why we can let go of any form of sequestering of the address space for ranges managed with the guard-object allocation policy.
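The $\mathcal{G}/\mathcal{S}$ density guarantee can be sanity-checked with a small simulation of the policy: because a chunk only hands out slots while its available count is positive, it can never hold fewer than $\mathcal{G}$ free slots. The code below is a self-contained sketch with hypothetical names, not XNU code:

```python
import random

def min_free_slots(slots=8, guards=2, quarantine_max=2, steps=10_000, seed=0):
    """Drive one model chunk with random allocs/frees and report the
    minimum number of free (guard) slots ever observed."""
    rng = random.Random(seed)
    free = set(range(slots))
    quarantined = set()
    min_free = len(free)
    for _ in range(steps):
        available = len(free) - guards - len(quarantined)
        if available > 0 and rng.random() < 0.5:
            # allocate: pick a random free, non-quarantined slot
            free.remove(rng.choice(sorted(free - quarantined)))
        else:
            used = [s for s in range(slots) if s not in free]
            if used:
                # free a random allocated slot; it starts out quarantined
                slot = rng.choice(used)
                free.add(slot)
                quarantined.add(slot)
                if len(free) >= guards + quarantine_max:
                    quarantined.clear()  # quarantine is released
        min_free = min(min_free, len(free))
    return min_free
```

For $\mathcal{S} = 8, \mathcal{G} = 2$, the simulation never observes fewer than 2 free slots, i.e. at least $\mathcal{G}/\mathcal{S} = 25\%$ of the chunk remains a guard at all times.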
### Use-after-free exploitation strategies and failure rates

Attackers attempting to exploit a use-after-free will be able to cause an element to be freed, knowing that a dangling pointer to it remains. They will then try to replace this freed element with another one of a different type that they control, creating a type confusion.

Because allocating new chunks causes visible timing differences, we can assume that attackers are able to fill chunks with all slots corresponding to elements that they control. They also know which elements are in the same chunk, but don't know which element corresponds to which slot.

**In the absence of the quarantine**, the best strategy for attackers trying to exploit a use-after-free is to perform $\mathcal{S} - \mathcal{G}$ rounds of freeing then reallocating each element in the chunk, where the first free is of the element they are trying to use after free, so that they retain a dangling pointer to the original slot. Each round, the allocator will choose one slot among $\mathcal{G} + 1$, of which only one corresponds to the slot that was freed by triggering the bug. The failure rate the attackers face with this strategy is

$$failure\_rate = \left( \frac{\mathcal{G}}{\mathcal{G}+1}\right)^{\mathcal{S} - \mathcal{G}}$$

* $\mathcal{S} = 8, \mathcal{G} = 2$ yields an 8.8% failure rate;
* $\mathcal{S} = 16, \mathcal{G} = 4$ yields a 6.9% failure rate.

**Using the quarantine** further reduces an attacker's odds of success. Unlike before, they need to free at least $\mathcal{Q}$ elements before elements are made available for allocation again. As a result, a round now needs to be $\mathcal{Q}$ deallocations followed by $\mathcal{Q}$ allocations. As before, the very first deallocation is of the element the attacker tries to use after free. Each round gives attackers a $\frac{\mathcal{G}}{\mathcal{G}+\mathcal{Q}}$ chance of failure.
The aggregate failure rate for this strategy is

$$failure\_rate = \left( 1 - \frac{(\mathcal{S} - \mathcal{G}) \bmod \mathcal{Q}}{\mathcal{G} + \mathcal{Q}} \right) \left(\frac{\mathcal{G}}{\mathcal{G} + \mathcal{Q}}\right)^{\left\lfloor \frac{\mathcal{S} - \mathcal{G}}{\mathcal{Q}} \right\rfloor}$$

* $\mathcal{S}=8, \mathcal{G}=1, \mathcal{Q}=4$ yields an 8% failure rate;
* $\mathcal{S}=8, \mathcal{G}=2, \mathcal{Q}=2$ yields a 12.5% failure rate;
* $\mathcal{S}=16, \mathcal{G}=4, \mathcal{Q}=3$ yields a 10.7% failure rate;
* $\mathcal{S}=16, \mathcal{G}=4, \mathcal{Q}=4$ yields a 12.5% failure rate.

### Out-of-bounds exploitation strategies

Exploiting out-of-bounds bugs requires knowing the distance between allocated objects, which an attacker a priori doesn’t know without some information leak. The regions protected by guard objects are coarsely randomized in the address space, so that attackers can’t predict how far an allocation is from other juicy exploitation targets in the address space such as the `__DATA` segment.

Lastly, guard objects are combined in XNU with type isolation and allocation fronts. This makes cross-type attacks unreliable and unpredictable, as the various buckets of types are randomized per boot. The last remaining avenue of attack targets types that fall in the same allocation front. However, attackers still have to contend with the uniform $\mathcal{G}/\mathcal{S}$ density of holes, making out-of-bounds exploitation unreliable, with a probability of failure close to $\mathcal{G}/\mathcal{S}$.

This probability of failure is typically worse for attackers than for use-after-free, and intuitively, this is precisely because they need to know more information — the distance between objects — unlike use-after-free, where that distance is obviously known to be zero.

### Choice of parameters

In the actual implementation, $\mathcal{S}$ scales up with the size of the allocations — as a way to limit the amount of metadata needed.
Keeping the $\mathcal{G}/\mathcal{S}$ and $\mathcal{Q}/\mathcal{S}$ ratios constant for any $\mathcal{S}$ makes all the probabilities independent of $\mathcal{S}$. Our goal is to make attackers face at least a 10% chance of failure — in the absence of information disclosure — which is why we chose numbers where $\mathcal{G} = \mathcal{Q} = \mathcal{S}/4$, yielding:

* a 25% failure rate for out-of-bounds exploitation;
* a 12.5% failure rate for use-after-free exploitation.
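The failure rates quoted in this section can be reproduced directly from the two formulas above. The short Python helpers below (hypothetical names, written only for this document) evaluate them:

```python
def uaf_failure_no_quarantine(S, G):
    """(G / (G + 1)) ** (S - G): use-after-free failure rate, no quarantine."""
    return (G / (G + 1)) ** (S - G)

def uaf_failure_with_quarantine(S, G, Q):
    """Aggregate use-after-free failure rate with the quarantine enabled."""
    partial_round = 1 - ((S - G) % Q) / (G + Q)
    return partial_round * (G / (G + Q)) ** ((S - G) // Q)
```

With $\mathcal{G} = \mathcal{Q} = \mathcal{S}/4$, the quarantined rate evaluates to $(1/2)^3 = 12.5\%$ for any $\mathcal{S}$, matching the figure above, while the out-of-bounds failure rate is simply $\mathcal{G}/\mathcal{S} = 25\%$.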