Performance Counters for Linux
------------------------------

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hw events, such
as instructions executed, cache misses suffered, or branches mispredicted,
without slowing down the kernel or applications. These registers can also
trigger interrupts when a threshold number of events have passed, and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
hardware capabilities. It provides per task and per CPU counters, and
it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.

The special file descriptor is opened via the perf_counter_open()
system call:

    int
    perf_counter_open(u32 hw_event_type,
                      u32 hw_event_period,
                      u32 record_type,
                      pid_t pid,
                      int cpu);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
can be used to set the blocking mode, etc.

Multiple counters can be kept open at a time, and the counters
can be poll()ed.

When creating a new counter fd, 'hw_event_type' is one of:

    enum hw_event_types {
        PERF_COUNT_CYCLES,
        PERF_COUNT_INSTRUCTIONS,
        PERF_COUNT_CACHE_REFERENCES,
        PERF_COUNT_CACHE_MISSES,
        PERF_COUNT_BRANCH_INSTRUCTIONS,
        PERF_COUNT_BRANCH_MISSES,
    };

These are standardized types of events that work uniformly on all CPUs
that implement Performance Counters support under Linux. If a CPU is
not able to count a requested event type (branch misses, for example),
then the system call will return -EINVAL.

[ Note: more hw_event_types are supported as well, but they are CPU
  specific and are enumerated via /sys on a per CPU basis. Raw hw event
  types can be passed in as negative numbers. For example, to count
  "External bus cycles while bus lock signal asserted" events on Intel
  Core CPUs, pass in a -0x4064 event type value. ]

The parameter 'hw_event_period' is the number of events before waking up
a read() that is blocked on a counter fd. A value of zero means a
non-blocking counter.

'record_type' is the type of data that a read() will provide for the
counter, and it can be one of:

    enum perf_record_type {
        PERF_RECORD_SIMPLE,
        PERF_RECORD_IRQ,
    };

A "simple" counter is one that counts hardware events and allows
them to be read out into a u64 count value. (read() returns 8 on
a successful read of a simple counter.)

An "irq" counter is one that also provides IRQ context information:
the IP of the interrupted context. In this case read() will return
the 8-byte counter value, plus the Instruction Pointer address of the
interrupted context.

The 'pid' parameter allows the counter to be specific to a task:

 pid == 0: the counter is attached to the current task.

 pid > 0: the counter is attached to a specific task (if the current task
 has sufficient privilege to do so)

 pid < 0: all tasks are counted (per cpu counters)

The 'cpu' parameter allows a counter to be made specific to a full
CPU:

 cpu >= 0: the counter is restricted to a specific CPU
 cpu == -1: the counter counts on all CPUs

Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
gets scheduled to. Per task counters can be created by any user, for
their own tasks.

A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
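
The 'pid' and 'cpu' rules above can be summarized in a small userspace
helper. The following is only an illustrative sketch: it does not call
perf_counter_open(), and the names counter_scope, classify_counter() and
scope_name() are hypothetical, invented for this example. Treating
'cpu < -1' as invalid is an assumption, since the text only describes
'cpu >= 0' and 'cpu == -1'.

```c
#include <assert.h>
#include <sys/types.h>	/* pid_t */

/* Hypothetical labels for this sketch; not part of the kernel API. */
enum counter_scope {
	SCOPE_INVALID,		/* e.g. pid == -1 together with cpu == -1 */
	SCOPE_TASK_ANY_CPU,	/* one task, followed across all CPUs */
	SCOPE_TASK_ONE_CPU,	/* one task, counted on one CPU only */
	SCOPE_CPU_WIDE,		/* all tasks on one CPU */
};

/* Map a (pid, cpu) pair to the kind of counter the text describes. */
static enum counter_scope classify_counter(pid_t pid, int cpu)
{
	/* Assumption: cpu values below -1 are not described, so reject them. */
	if (cpu < -1)
		return SCOPE_INVALID;

	/* pid < 0: all tasks are counted, which requires a specific CPU. */
	if (pid < 0)
		return cpu == -1 ? SCOPE_INVALID : SCOPE_CPU_WIDE;

	/* pid == 0 (current task) or pid > 0 (specific task). */
	return cpu == -1 ? SCOPE_TASK_ANY_CPU : SCOPE_TASK_ONE_CPU;
}

/* Human-readable description of a scope, for diagnostics. */
static const char *scope_name(enum counter_scope scope)
{
	switch (scope) {
	case SCOPE_INVALID:
		return "invalid";
	case SCOPE_TASK_ANY_CPU:
		return "per-task, follows the task across CPUs";
	case SCOPE_TASK_ONE_CPU:
		return "per-task, counted on one CPU only";
	case SCOPE_CPU_WIDE:
		return "per-CPU, all tasks";
	}
	assert(!"unhandled counter_scope value");
	return "invalid";
}
```

A caller could run classify_counter(pid, cpu) before issuing the system
call and refuse SCOPE_INVALID early, or use SCOPE_CPU_WIDE as a hint
that CAP_SYS_ADMIN will be needed.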