# CPU Counters

How xnu manages CPU performance counters, hardware registers that count events
for monitoring CPU efficiency.

This document describes the ARM hardware and interfaces available.

Counters that measure events closely correlated with each CPU's execution
pipeline are managed by the Core Performance Monitoring Unit (CPMU). The CPMU
contains fixed instructions and cycles counters, as well as configurable
counters that can be programmed to count any of several hundred possible events.

In addition to the CPMU, the Last Level Cache (LLC) hosts the Uncore Performance
Monitoring Unit (UPMU), which measures effects that aren't necessarily
correlated to a single CPU. All counters in the UPMU are configurable.

Counters are typically used in one of two ways.
In "counting" mode, their counts are periodically queried and tallied up for a
duration of interest. In "sampling" mode, the counters are programmed to
generate a Performance Monitor Interrupt (PMI) periodically, during which the
currently running code can be sampled, like a time profiler.

In xnu, the CPU Performance Counter (CPC) subsystem manages the CPMU counters
and enforces the security policy and ownership for the CPMU and UPMU.

## CPC Kernel Interfaces

See the header documentation in `<kern/cpc.h>` for information about how these
interfaces can be called and their expected arguments.

### Configuration

To configure counters, fill in an array of `struct cpc_event_select` structures
to create a `cpc_set_t` with `cpc_set_alloc`. This can be applied system-wide
with `cpc_set_apply`. This is currently only used by kpc for backwards
compatibility.

### Counting

Convenience functions for reading the fixed counters of the current CPU's CPMU
are offered because they don't require an expensive cross-call:

- `cpc_cycles` and `cpc_instructions` return the cycles elapsed and instructions
  retired on the current CPU, using the CPMU.
- `cpc_cycles_instrs` returns both cycles and instructions together for
  correlation between the two values as "cycles-per-instruction."
- `cpc_cycles_instrs_spec` does not include a synchronization barrier, allowing
  code to execute speculatively around the call.

These functions are used throughout xnu when CPU usage is required, but
predominantly in Recount.

To read the counters of the current unit on either the CPMU or UPMU, use
`cpc_hw_counts`.

### Sampling

Repeatedly sampling what is executing when a counter overflows is handled by
`cpc_cyclic_t`. These can be created in one of two ways:

- `cpc_cyclic_alloc` allocates and initializes a repeating call for a well-known
  `cpc_slot_t`, like instructions or cycles. The client can then use
  `cpc_cyclic_activate` and `cpc_cyclic_cancel` to start and stop the periodic
  calls.
  These are heavyweight operations since they apply to the system as a whole,
  requiring a broadcast cross-call. This is how the telemetry subsystem triggers
  a microstackshot using the cycle counter.
- `cpc_set_alloc` accepts a list of `struct cpc_cyclic_info` structures, which
  turn into `cpc_cyclic_t`s attached to the returned `cpc_set_t`. When the
  settings are applied, each cyclic is also activated. This is how kpc triggers
  kperf samples.

The interfaces are relatively basic. Aside from the counter slot, they take
three values:

- A function to call.
- A `void *` context pointer for the client.
- The period of (i.e. number of events between) each function call.

If created outside of a `cpc_set_t`, a cyclic must be activated with
`cpc_cyclic_activate`, which involves an expensive broadcast cross-call, but
only needs to be done once.

Unlike a traditional `timer_call_t`, these interfaces are repeating and do not
need to be re-armed (i.e. with `timer_call_enter`) after each firing. Use the
low-level `cpc_call_t` for a one-shot interface.

### Performance Considerations

- Recount calls `cpc_cycles_instrs_spec` on every user/kernel transition, a very
  hot path in xnu.
  - ~20 cycles to read a CPMU PMC.
  - Currently ~5 cycles to read/write a CPMU control register.
- Broadcast cross-calls to do local MSR reads and writes when applying settings
  or entering cyclics, as PIO is unavailable.

### Security Considerations

- No PIO access to CPMU or UPMU counters or control registers.
- Restrict events available to user space with an allow-list.
- Reject PMIs that fire too quickly, to prevent single-stepping of
  interrupts-enabled kernel code.

### Low-level Interfaces

A `cpc_call_t` is the lowest-level interface to configure the counters for
sampling.
It operates on hardware counters, instead of slots, and should only be used for
implementing higher-level constructs like the repeating cyclics, a quantum timer
(local to a given thread), or a critical section watchdog (temporarily on the
same CPU).

- `cpc_call_init` initializes a `cpc_call_t` structure with a function to call
  and the counter to select.
- `cpc_call_enter` arms the call with a given period. Unlike `timer_call_t`, an
  invocation of the call will re-arm the call after returning from the callback.
- `cpc_call_cancel` disables the call and returns it to its initialized state.

## Other Subsystems

In addition to CPC, several other subsystems provide access to CPU counter
hardware:

- kpc: The Kernel Performance Counter system provides backwards-compatible shim
  interfaces for user space clients of kperf.framework. It can use PMIs from
  these counters to trigger kperf samples, and counter values can be recorded in
  kperf samples.
- Monotonic: The UPMU counters are provided by a dev node interface under
  `/dev/monotonic/uncore`. Eventually, CPC will manage the UPMU counters as
  well, replacing Monotonic.
- Closed Loop Performance Controller (CLPC): The per-CPU and per-cluster AON
  counters are managed by CLPC, which can also use the CPMU and UPMU if
  necessary and no other client is active.

## Integrations

- The Recount subsystem makes extensive use of the fixed CPMU counters to
  attribute CPU resources back to threads and processes.
- The kperf profiling system can trigger samples of thread states and call
  stacks using CPMU PMIs. CPU counter values can also be sampled by kperf on
  other triggers, like timers or kdebug events.
- Stackshot includes cycles and instructions for each thread container in its
  kcdata.
- Microstackshot telemetry is sampled periodically using the CPMU's cycle PMI
  trigger.
- Scheduler hygiene diagnostics, which enforce an upper bound on time spent with
  interrupts or preemption disabled, include CPMU cycles and instructions in the
  panic report.

## CPC Internals

Counter hardware, like the CPMU, is identified in a `cpc_hw_t` enumeration (e.g.
`CPC_HW_CPMU`). Within each hardware class, there are multiple `cpc_unit_t`
structures that store per-PMU information. For instance, the CPMU support uses a
`cpc_unit_t` for each CPU on the system.

Each hardware counter has a corresponding `cpc_counter_t` that maintains its
state, like the current total tally of the counter and what the last read value
was. The unit's data structure holds any fixed counters, while configurable
counters are stored inside the active `cpc_set_t`.

The support for sampling the counters mimics the interfaces of `timer_call_t`,
but the internals are significantly more primitive, as there is a hard-coded
maximum of `CPC_MAX_CALLS` (currently 8) active at once. This avoids needing a
priority queue object, since a linear scan is sufficiently quick to determine
the earliest deadline. Each counter has an array of deadlines with indices that
match another array of calls. When a PMI fires for a counter, its deadlines are
scanned for any that are prior to the current counter's sum. Any deadlines that
have expired have their calls invoked.

Cyclics protect against interrupt storms with a `struct cpc_storm_watch`
structure. This ensures that the number of interrupts within a configurable time
window is not excessive. If it is, the cyclic is disabled.

### CPU Transitions

CPC needs to preserve the CPMU state in the face of CPUs idling or going
offline, clusters powering down, and system sleep. To achieve this, there are
hooks across the machine-dependent code that notify CPC about state transitions.
At each transition, CPC can take the following actions:

#### Actions

These actions are handled in the machine-independent layer of CPC.
- **Configure Full:** All registers need to be re-synchronized.
  1. Check for any new cyclics that need their calls enqueued and enter them as
     deferred. This is required because an offline CPU will not respond to the
     broadcast cross-call, which sets up each CPU's calls. When the CPU comes
     back online, it needs to check the global state of cyclics on the system
     and enter any calls it missed.
  2. Set the previous values (or values updated for new deadlines) of PMCs in
     each PMC register.
  3. Set any control registers to their active state to get the counters
     running.
- **Configure Partial:** Same as above _except_ skip (1). Any cyclic activation
  that occurred would have taken effect on the CPU.
- **Update Counts:** Save the values of the hardware PMCs to counters.

#### Transitions

These transitions are noticed by the machine-dependent layers of CPC, which then
translate them into actions to call in the machine-independent layer.

- CPU is booting from a reset vector, like power off or
  hibernation/suspend-to-RAM. Needs to **Configure Full**.
- CPU is going offline. Needs to **Update Counts**.
- CPU is being brought back online. Needs to **Configure Full**.
- CPU is going idle. Needs to **Update Counts**.
- CPU is going active after idle. On systems with architectural state
  restoration, do nothing. Otherwise, needs to **Configure Partial**, because
  CPU idle leaves interrupts enabled.

### Diagnostics

Run `sysctl kern.cpc.state` to return a string describing the current state of
the CPC subsystem. This includes all machine-independent structures, like the
configuration of sets and cyclics, as well as machine-specific and HW state,
like the reset and live values of registers.

### Tunables

- Maximum number of active calls.

## Defense of Design

- Init, activate, cancel, destroy design:
  - Allows allocator operations to be separated from start/stop, so they can
    happen in allocator-unsafe contexts.
- Front-end vs. backend (`cpc.h` vs.
  `machine_cpc.h`):
  - Avoiding obvious code duplication in generic interfaces.

### `cpc_counter_t`, `cpc_slot_t`

- Index-based vs. object-based:
  - Referencing an object makes it easier to hold them in sets.
  - Slots offer a way to reference a particular PMC across all units.
- Store configurable counters in set:
  - Need to reset the counter values to zero for each new set anyway.
  - Multiplexing will need this.

### `cpc_event_select_t`, `cpc_set_t`

- Array vs. field with index:
  - kpc uses an array with different offsets based on "classes," which has
    proven to be error-prone and confusing. This was probably done to simplify
    its sysctl interface for user space.
  - An array of just the events requested is potentially less efficient, but is
    much more flexible.
- Flags vs. selector encoding:
  - Flags are explicit and avoid overloading selectors.
  - Selectors are also used for the legacy raw PMU support in kpc.
- Built-in cyclics:
  - Periods and callbacks only make sense for a particular configuration.
- Explicitly apply to system:
  - Eventually applying sets will happen implicitly for multiplexing and during
    context-switch.
- CPMU-mimicking data structure:
  - Easier debugging.
  - More efficient when applying is common.
  - Reset state can be applied the same way.

### `cpc_call_t`, `cpc_cyclic_t`

- Maximum active calls:
  - Not expected to need more than a half-dozen concurrent calls on each
    counter.
- Slot-focused repeating cyclics vs. counter-focused one-shot calls:
  - Cyclics are not bound to a particular unit, whereas calls are unit-local
    (like `timer_call_t`).
- Targeting calls at `cpc_counter_t`:
  - Calls target counters and their management metadata is stored there.
- Array vs. priority queue:
  - Overhead of queue management dominates for low number of pending calls.

## CPC Roadmap

- Applying `cpc_set_t` to single processes.
- CPC kobject server for user space configuration.
- Replace Monotonic for UPMU configuration.
- DTrace CPC provider.
- Integration with Recount to count configurable counters for processes and
  threads.
- Multiplexing arrays of `cpc_event_select_t` in a single `cpc_set_t` based on
  cycles, instructions, or time.
- Support CPMU in VM guests.
- Better interface for CLPC to make use of the counters.
- Support hardware-specific features of the CPMU and UPMU:
  - CPMU: Op-code matching.
  - CPMU: Instruction address matching.
  - CPMU: Watchpoint matching.
  - CPMU: Source matching.
  - UPMU: Latency histogram thresholds.
- Boot-args to configure counters.
- Always-on configurable CPMU and UPMU counting for telemetry.
- Use synchronous data aborts for CPMU PMIs.

## See Also

- <doc:recount>