Diffstat (limited to 'Documentation/static-keys.txt')
-rw-r--r-- | Documentation/static-keys.txt | 331 |
1 files changed, 0 insertions, 331 deletions
diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt
deleted file mode 100644
index 38290b9f25eb..000000000000
--- a/Documentation/static-keys.txt
+++ /dev/null
@@ -1,331 +0,0 @@
-===========
-Static Keys
-===========
-
-.. warning::
-
-   DEPRECATED API:
-
-   The use of 'struct static_key' directly, is now DEPRECATED. In addition
-   static_key_{true,false}() is also DEPRECATED. IE DO NOT use the following::
-
-        struct static_key false = STATIC_KEY_INIT_FALSE;
-        struct static_key true = STATIC_KEY_INIT_TRUE;
-        static_key_true()
-        static_key_false()
-
-   The updated API replacements are::
-
-        DEFINE_STATIC_KEY_TRUE(key);
-        DEFINE_STATIC_KEY_FALSE(key);
-        DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
-        DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
-        static_branch_likely()
-        static_branch_unlikely()
-
-Abstract
-========
-
-Static keys allows the inclusion of seldom used features in
-performance-sensitive fast-path kernel code, via a GCC feature and a code
-patching technique. A quick example::
-
-        DEFINE_STATIC_KEY_FALSE(key);
-
-        ...
-
-        if (static_branch_unlikely(&key))
-                do unlikely code
-        else
-                do likely code
-
-        ...
-        static_branch_enable(&key);
-        ...
-        static_branch_disable(&key);
-        ...
-
-The static_branch_unlikely() branch will be generated into the code with as little
-impact to the likely code path as possible.
-
-
-Motivation
-==========
-
-
-Currently, tracepoints are implemented using a conditional branch. The
-conditional check requires checking a global variable for each tracepoint.
-Although the overhead of this check is small, it increases when the memory
-cache comes under pressure (memory cache lines for these global variables may
-be shared with other memory accesses). As we increase the number of tracepoints
-in the kernel this overhead may become more of an issue. In addition,
-tracepoints are often dormant (disabled) and provide no direct kernel
-functionality. Thus, it is highly desirable to reduce their impact as much as
-possible. Although tracepoints are the original motivation for this work, other
-kernel code paths should be able to make use of the static keys facility.
-
-
-Solution
-========
-
-
-gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
-
-https://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
-
-Using the 'asm goto', we can create branches that are either taken or not taken
-by default, without the need to check memory. Then, at run-time, we can patch
-the branch site to change the branch direction.
-
-For example, if we have a simple branch that is disabled by default::
-
-        if (static_branch_unlikely(&key))
-                printk("I am the true branch\n");
-
-Thus, by default the 'printk' will not be emitted. And the code generated will
-consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
-straight-line code path. When the branch is 'flipped', we will patch the
-'no-op' in the straight-line codepath with a 'jump' instruction to the
-out-of-line true branch. Thus, changing branch direction is expensive but
-branch selection is basically 'free'. That is the basic tradeoff of this
-optimization.
-
-This lowlevel patching mechanism is called 'jump label patching', and it gives
-the basis for the static keys facility.
-
-Static key label API, usage and examples
-========================================
-
-
-In order to make use of this optimization you must first define a key::
-
-        DEFINE_STATIC_KEY_TRUE(key);
-
-or::
-
-        DEFINE_STATIC_KEY_FALSE(key);
-
-
-The key must be global, that is, it can't be allocated on the stack or dynamically
-allocated at run-time.
-
-The key is then used in code as::
-
-        if (static_branch_unlikely(&key))
-                do unlikely code
-        else
-                do likely code
-
-Or::
-
-        if (static_branch_likely(&key))
-                do likely code
-        else
-                do unlikely code
-
-Keys defined via DEFINE_STATIC_KEY_TRUE(), or DEFINE_STATIC_KEY_FALSE, may
-be used in either static_branch_likely() or static_branch_unlikely()
-statements.
-
-Branch(es) can be set true via::
-
-        static_branch_enable(&key);
-
-or false via::
-
-        static_branch_disable(&key);
-
-The branch(es) can then be switched via reference counts::
-
-        static_branch_inc(&key);
-        ...
-        static_branch_dec(&key);
-
-Thus, 'static_branch_inc()' means 'make the branch true', and
-'static_branch_dec()' means 'make the branch false' with appropriate
-reference counting. For example, if the key is initialized true, a
-static_branch_dec(), will switch the branch to false. And a subsequent
-static_branch_inc(), will change the branch back to true. Likewise, if the
-key is initialized false, a 'static_branch_inc()', will change the branch to
-true. And then a 'static_branch_dec()', will again make the branch false.
-
-The state and the reference count can be retrieved with 'static_key_enabled()'
-and 'static_key_count()'. In general, if you use these functions, they
-should be protected with the same mutex used around the enable/disable
-or increment/decrement function.
-
-Note that switching branches results in some locks being taken,
-particularly the CPU hotplug lock (in order to avoid races against
-CPUs being brought in the kernel while the kernel is getting
-patched). Calling the static key API from within a hotplug notifier is
-thus a sure deadlock recipe. In order to still allow use of the
-functionality, the following functions are provided:
-
-        static_key_enable_cpuslocked()
-        static_key_disable_cpuslocked()
-        static_branch_enable_cpuslocked()
-        static_branch_disable_cpuslocked()
-
-These functions are *not* general purpose, and must only be used when
-you really know that you're in the above context, and no other.
-
-Where an array of keys is required, it can be defined as::
-
-        DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
-
-or::
-
-        DEFINE_STATIC_KEY_ARRAY_FALSE(keys, count);
-
-4) Architecture level code patching interface, 'jump labels'
-
-
-There are a few functions and macros that architectures must implement in order
-to take advantage of this optimization. If there is no architecture support, we
-simply fall back to a traditional, load, test, and jump sequence. Also, the
-struct jump_entry table must be at least 4-byte aligned because the
-static_key->entry field makes use of the two least significant bits.
-
-* ``select HAVE_ARCH_JUMP_LABEL``,
-    see: arch/x86/Kconfig
-
-* ``#define JUMP_LABEL_NOP_SIZE``,
-    see: arch/x86/include/asm/jump_label.h
-
-* ``__always_inline bool arch_static_branch(struct static_key *key, bool branch)``,
-    see: arch/x86/include/asm/jump_label.h
-
-* ``__always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)``,
-    see: arch/x86/include/asm/jump_label.h
-
-* ``void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type)``,
-    see: arch/x86/kernel/jump_label.c
-
-* ``__init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type)``,
-    see: arch/x86/kernel/jump_label.c
-
-* ``struct jump_entry``,
-    see: arch/x86/include/asm/jump_label.h
-
-
-5) Static keys / jump label analysis, results (x86_64):
-
-
-As an example, let's add the following branch to 'getppid()', such that the
-system call now looks like::
-
-        SYSCALL_DEFINE0(getppid)
-        {
-                int pid;
-
-        +       if (static_branch_unlikely(&key))
-        +               printk("I am the true branch\n");
-
-                rcu_read_lock();
-                pid = task_tgid_vnr(rcu_dereference(current->real_parent));
-                rcu_read_unlock();
-
-                return pid;
-        }
-
-The resulting instructions with jump labels generated by GCC is::
-
-        ffffffff81044290 <sys_getppid>:
-        ffffffff81044290:  55                    push   %rbp
-        ffffffff81044291:  48 89 e5              mov    %rsp,%rbp
-        ffffffff81044294:  e9 00 00 00 00        jmpq   ffffffff81044299 <sys_getppid+0x9>
-        ffffffff81044299:  65 48 8b 04 25 c0 b6  mov    %gs:0xb6c0,%rax
-        ffffffff810442a0:  00 00
-        ffffffff810442a2:  48 8b 80 80 02 00 00  mov    0x280(%rax),%rax
-        ffffffff810442a9:  48 8b 80 b0 02 00 00  mov    0x2b0(%rax),%rax
-        ffffffff810442b0:  48 8b b8 e8 02 00 00  mov    0x2e8(%rax),%rdi
-        ffffffff810442b7:  e8 f4 d9 00 00        callq  ffffffff81051cb0 <pid_vnr>
-        ffffffff810442bc:  5d                    pop    %rbp
-        ffffffff810442bd:  48 98                 cltq
-        ffffffff810442bf:  c3                    retq
-        ffffffff810442c0:  48 c7 c7 e3 54 98 81  mov    $0xffffffff819854e3,%rdi
-        ffffffff810442c7:  31 c0                 xor    %eax,%eax
-        ffffffff810442c9:  e8 71 13 6d 00        callq  ffffffff8171563f <printk>
-        ffffffff810442ce:  eb c9                 jmp    ffffffff81044299 <sys_getppid+0x9>
-
-Without the jump label optimization it looks like::
-
-        ffffffff810441f0 <sys_getppid>:
-        ffffffff810441f0:  8b 05 8a 52 d8 00     mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
-        ffffffff810441f6:  55                    push   %rbp
-        ffffffff810441f7:  48 89 e5              mov    %rsp,%rbp
-        ffffffff810441fa:  85 c0                 test   %eax,%eax
-        ffffffff810441fc:  75 27                 jne    ffffffff81044225 <sys_getppid+0x35>
-        ffffffff810441fe:  65 48 8b 04 25 c0 b6  mov    %gs:0xb6c0,%rax
-        ffffffff81044205:  00 00
-        ffffffff81044207:  48 8b 80 80 02 00 00  mov    0x280(%rax),%rax
-        ffffffff8104420e:  48 8b 80 b0 02 00 00  mov    0x2b0(%rax),%rax
-        ffffffff81044215:  48 8b b8 e8 02 00 00  mov    0x2e8(%rax),%rdi
-        ffffffff8104421c:  e8 2f da 00 00        callq  ffffffff81051c50 <pid_vnr>
-        ffffffff81044221:  5d                    pop    %rbp
-        ffffffff81044222:  48 98                 cltq
-        ffffffff81044224:  c3                    retq
-        ffffffff81044225:  48 c7 c7 13 53 98 81  mov    $0xffffffff81985313,%rdi
-        ffffffff8104422c:  31 c0                 xor    %eax,%eax
-        ffffffff8104422e:  e8 60 0f 6d 00        callq  ffffffff81715193 <printk>
-        ffffffff81044233:  eb c9                 jmp    ffffffff810441fe <sys_getppid+0xe>
-        ffffffff81044235:  66 66 2e 0f 1f 84 00  data32 nopw %cs:0x0(%rax,%rax,1)
-        ffffffff8104423c:  00 00 00 00
-
-Thus, the disable jump label case adds a 'mov', 'test' and 'jne' instruction
-vs. the jump label case just has a 'no-op' or 'jmp 0'. (The jmp 0, is patched
-to a 5 byte atomic no-op instruction at boot-time.) Thus, the disabled jump
-label case adds::
-
-        6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 addition bytes.
-
-If we then include the padding bytes, the jump label code saves, 16 total bytes
-of instruction memory for this small function. In this case the non-jump label
-function is 80 bytes long. Thus, we have saved 20% of the instruction
-footprint. We can in fact improve this even further, since the 5-byte no-op
-really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
-However, we have not yet implemented optimal no-op sizes (they are currently
-hard-coded).
-
-Since there are a number of static key API uses in the scheduler paths,
-'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
-performance improvement. Testing done on 3.3.0-rc2:
-
-jump label disabled::
-
-        Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
-
-             855.700314 task-clock                #    0.534 CPUs utilized      ( +-  0.11% )
-                200,003 context-switches          #    0.234 M/sec              ( +-  0.00% )
-                      0 CPU-migrations            #    0.000 M/sec              ( +- 39.58% )
-                    487 page-faults               #    0.001 M/sec              ( +-  0.02% )
-          1,474,374,262 cycles                    #    1.723 GHz                ( +-  0.17% )
-        <not supported> stalled-cycles-frontend
-        <not supported> stalled-cycles-backend
-          1,178,049,567 instructions              #    0.80  insns per cycle    ( +-  0.06% )
-            208,368,926 branches                  #  243.507 M/sec              ( +-  0.06% )
-              5,569,188 branch-misses             #    2.67% of all branches    ( +-  0.54% )
-
-            1.601607384 seconds time elapsed                                    ( +-  0.07% )
-
-jump label enabled::
-
-        Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
-
-             841.043185 task-clock                #    0.533 CPUs utilized      ( +-  0.12% )
-                200,004 context-switches          #    0.238 M/sec              ( +-  0.00% )
-                      0 CPU-migrations            #    0.000 M/sec              ( +- 40.87% )
-                    487 page-faults               #    0.001 M/sec              ( +-  0.05% )
-          1,432,559,428 cycles                    #    1.703 GHz                ( +-  0.18% )
-        <not supported> stalled-cycles-frontend
-        <not supported> stalled-cycles-backend
-          1,175,363,994 instructions              #    0.82  insns per cycle    ( +-  0.04% )
-            206,859,359 branches                  #  245.956 M/sec              ( +-  0.04% )
-              4,884,119 branch-misses             #    2.36% of all branches    ( +-  0.85% )
-
-            1.579384366 seconds time elapsed
-
-The percentage of saved branches is .7%, and we've saved 12% on
-'branch-misses'. This is where we would expect to get the most savings, since
-this optimization is about reducing the number of branches. In addition, we've
-saved .2% on instructions, and 2.8% on cycles and 1.4% on elapsed time.
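
Note on the figures quoted in the removed file: the size and percentage claims in its closing analysis can be re-derived from the two disassembly listings and the two perf runs. The following is a minimal user-space C sketch of that arithmetic, not kernel code. The helper names `span_bytes` and `percent_saved` are hypothetical, and the function end addresses used in the usage note are an assumption: they are taken as the address of the last listed instruction plus its encoded length.

```c
#include <stdint.h>

/*
 * Size in bytes of the half-open address range [start, end).
 * Hypothetical helper for reading a disassembly listing; not a kernel API.
 */
uint64_t span_bytes(uint64_t start, uint64_t end)
{
	return end - start;
}

/*
 * Relative reduction from `before` to `after`, expressed in percent.
 * Hypothetical helper; `before` must be non-zero.
 */
double percent_saved(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

Under that assumption, `span_bytes(0xffffffff810441f0, 0xffffffff81044240)` gives 80 bytes for the version without jump labels and `span_bytes(0xffffffff81044290, 0xffffffff810442d0)` gives 64 bytes with them, so `percent_saved(80, 64)` matches the quoted 20%. Applied to the perf counters, `percent_saved(5569188, 4884119)` comes out at roughly 12.3 for 'branch-misses' and `percent_saved(1474374262, 1432559428)` at roughly 2.8 for cycles, in line with the figures in the text.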