summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2021-11-02Merge branch 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Nothing too interesting. An optimization to short-circuit noop cpumask updates, debug dump code reorg, and doc update" * 'for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: doc: Call out the non-reentrance conditions workqueue: Introduce show_one_worker_pool and show_one_workqueue. workqueue: make sysfs of unbound kworker cpumask more clever
2021-11-02Merge branches 'pm-em' and 'powercap'Rafael J. Wysocki
Merge Energy Model and power capping updates for 5.16-rc1: - Add support for inefficient operating performance points to the Energy Model and modify cpufreq to use them properly (Vincent Donnefort). - Rearrange the DTPM framework code to simplify it and make it easier to follow (Daniel Lezcano). - Fix power intialization in DTPM (Daniel Lezcano). - Add CPU load consideration when estimating the instaneous power consumption in DTPM (Daniel Lezcano). * pm-em: cpufreq: mediatek-hw: Fix cpufreq_table_find_index_dl() call PM: EM: Mark inefficiencies in CPUFreq cpufreq: Use CPUFREQ_RELATION_E in DVFS governors cpufreq: Introducing CPUFREQ_RELATION_E cpufreq: Add an interface to mark inefficient frequencies cpufreq: Make policy min/max hard requirements PM: EM: Allow skipping inefficient states PM: EM: Extend em_perf_domain with a flag field PM: EM: Mark inefficient states PM: EM: Fix inefficient states detection * powercap: powercap/drivers/dtpm: Fix power limit initialization powercap/drivers/dtpm: Scale the power with the load powercap/drivers/dtpm: Use container_of instead of a private data field powercap/drivers/dtpm: Simplify the dtpm table powercap/drivers/dtpm: Encapsulate even more the code
2021-11-02Merge branch 'pm-sleep'Rafael J. Wysocki
Merge updates related to system sleep for 5.16-rc1: - Clean up hib_wait_io() (Falla Coulibaly). - Fix sparse warnings in hibernation-related code (Anders Roxell). - Use vzalloc() and kzalloc() instead of their open-coded equivalents in hibernation-related code (Cai Huoqing). - Prevent user space from crashing the kernel by attempting to restore the system state from a swap partition in use (Ye Bin). - Do not let "syscore" devices runtime-suspend during system PM transitions (Rafael Wysocki). - Do not pause cpuidle in the suspend-to-idle path (Rafael Wysocki). - Pause cpuidle later and resume it earlier during system PM transitions (Rafael Wysocki). - Make system suspend code use valid_state() consistently (Rafael Wysocki). - Add support for enabling wakeup IRQs after invoking the ->runtime_suspend() callback and make two drivers use it (Chunfeng Yun). * pm-sleep: usb: mtu3: enable wake-up interrupt after runtime_suspend called usb: xhci-mtk: enable wake-up interrupt after runtime_suspend called PM / wakeirq: support enabling wake-up irq after runtime_suspend called PM: suspend: Use valid_state() consistently PM: sleep: Pause cpuidle later and resume it earlier during system transitions PM: suspend: Do not pause cpuidle in the suspend-to-idle path PM: sleep: Do not let "syscore" devices runtime-suspend during system transitions PM: hibernate: Get block device exclusively in swsusp_check() PM: hibernate: swap: Use vzalloc() and kzalloc() PM: hibernate: fix sparse warnings Revert "PM: sleep: Do not assume that "mem" is always present" PM: hibernate: Remove blk_status_to_errno in hib_wait_io PM: sleep: Do not assume that "mem" is always present
2021-11-02Merge tag 'printk-for-5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Extend %pGp print format to print hex value of the page flags - Use kvmalloc instead of kmalloc to allocate devkmsg buffers - Misc cleanup and warning fixes * tag 'printk-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: vsprintf: Update %pGp documentation about that it prints hex value lib/vsprintf.c: Amend static asserts for format specifier flags vsprintf: Make %pGp print the hex value test_printf: Append strings more efficiently test_printf: Remove custom appending of '|' test_printf: Remove separate page_flags variable test_printf: Make pft array const ia64: don't do IA64_CMPXCHG_DEBUG without CONFIG_PRINTK printk: use gnu_printf format attribute for printk_sprint() printk: avoid -Wsometimes-uninitialized warning printk: use kvmalloc instead of kmalloc for devkmsg_user
2021-11-02Merge tag 'net-next-for-5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Remove socket skb caches - Add a SO_RESERVE_MEM socket op to forward allocate buffer space and avoid memory accounting overhead on each message sent - Introduce managed neighbor entries - added by control plane and resolved by the kernel for use in acceleration paths (BPF / XDP right now, HW offload users will benefit as well) - Make neighbor eviction on link down controllable by userspace to work around WiFi networks with bad roaming implementations - vrf: Rework interaction with netfilter/conntrack - fq_codel: implement L4S style ce_threshold_ect1 marking - sch: Eliminate unnecessary RCU waits in mini_qdisc_pair_swap() BPF: - Add support for new btf kind BTF_KIND_TAG, arbitrary type tagging as implemented in LLVM14 - Introduce bpf_get_branch_snapshot() to capture Last Branch Records - Implement variadic trace_printk helper - Add a new Bloomfilter map type - Track <8-byte scalar spill and refill - Access hw timestamp through BPF's __sk_buff - Disallow unprivileged BPF by default - Document BPF licensing Netfilter: - Introduce egress hook for looking at raw outgoing packets - Allow matching on and modifying inner headers / payload data - Add NFT_META_IFTYPE to match on the interface type either from ingress or egress Protocols: - Multi-Path TCP: - increase default max additional subflows to 2 - rework forward memory allocation - add getsockopts: MPTCP_INFO, MPTCP_TCPINFO, MPTCP_SUBFLOW_ADDRS - MCTP flow support allowing lower layer drivers to configure msg muxing as needed - Automatic Multicast Tunneling (AMT) driver based on RFC7450 - HSR support the redbox supervision frames (IEC-62439-3:2018) - Support for the ip6ip6 encapsulation of IOAM - Netlink interface for CAN-FD's Transmitter Delay Compensation - Support SMC-Rv2 eliminating the current same-subnet restriction, by exploiting the UDP encapsulation feature of RoCE adapters - TLS: add SM4 GCM/CCM crypto support - Bluetooth: initial support for link quality and audio/codec offload Driver APIs: - Add a batched interface for RX buffer allocation in AF_XDP buffer pool - ethtool: Add ability to control transceiver modules' power mode - phy: Introduce supported interfaces bitmap to express MAC capabilities and simplify PHY code - Drop rtnl_lock from DSA .port_fdb_{add,del} callbacks New drivers: - WiFi driver for Realtek 8852AE 802.11ax devices (rtw89) - Ethernet driver for ASIX AX88796C SPI device (x88796c) Drivers: - Broadcom PHYs - support 72165, 7712 16nm PHYs - support IDDQ-SR for additional power savings - PHY support for QCA8081, QCA9561 PHYs - NXP DPAA2: support for IRQ coalescing - NXP Ethernet (enetc): support for software TCP segmentation - Renesas Ethernet (ravb) - support DMAC and EMAC blocks of Gigabit-capable IP found on RZ/G2L SoC - Intel 100G Ethernet - support for eswitch offload of TC/OvS flow API, including offload of GRE, VxLAN, Geneve tunneling - support application device queues - ability to assign Rx and Tx queues to application threads - PTP and PPS (pulse-per-second) extensions - Broadcom Ethernet (bnxt) - devlink health reporting and device reload extensions - Mellanox Ethernet (mlx5) - offload macvlan interfaces - support HW offload of TC rules involving OVS internal ports - support HW-GRO and header/data split - support application device queues - Marvell OcteonTx2: - add XDP support for PF - add PTP support for VF - Qualcomm Ethernet switch (qca8k): support for QCA8328 - Realtek Ethernet DSA switch (rtl8366rb) - support bridge offload - support STP, fast aging, disabling address learning - support for Realtek RTL8365MB-VC, a 4+1 port 10M/100M/1GE switch - Mellanox Ethernet/IB switch (mlxsw) - multi-level qdisc hierarchy offload (e.g. RED, prio and shaping) - offload root TBF qdisc as port shaper - support multiple routing interface MAC address prefixes - support for IP-in-IP with IPv6 underlay - MediaTek WiFi (mt76) - mt7921 - ASPM, 6GHz, SDIO and testmode support - mt7915 - LED and TWT support - Qualcomm WiFi (ath11k) - include channel rx and tx time in survey dump statistics - support for 80P80 and 160 MHz bandwidths - support channel 2 in 6 GHz band - spectral scan support for QCN9074 - support for rx decapsulation offload (data frames in 802.3 format) - Qualcomm phone SoC WiFi (wcn36xx) - enable Idle Mode Power Save (IMPS) to reduce power consumption during idle - Bluetooth driver support for MediaTek MT7922 and MT7921 - Enable support for AOSP Bluetooth extension in Qualcomm WCN399x and Realtek 8822C/8852A - Microsoft vNIC driver (mana) - support hibernation and kexec - Google vNIC driver (gve) - support for jumbo frames - implement Rx page reuse Refactor: - Make all writes to netdev->dev_addr go thru helpers, so that we can add this address to the address rbtree and handle the updates - Various TCP cleanups and optimizations including improvements to CPU cache use - Simplify the gnet_stats, Qdisc stats' handling and remove qdisc->running sequence counter - Driver changes and API updates to address devlink locking deficiencies" * tag 'net-next-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2122 commits) Revert "net: avoid double accounting for pure zerocopy skbs" selftests: net: add arp_ndisc_evict_nocarrier net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter net: arp: introduce arp_evict_nocarrier sysctl parameter libbpf: Deprecate AF_XDP support kbuild: Unify options for BTF generation for vmlinux and modules selftests/bpf: Add a testcase for 64-bit bounds propagation issue. bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit. bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off. net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c net: avoid double accounting for pure zerocopy skbs tcp: rename sk_wmem_free_skb netdevsim: fix uninit value in nsim_drv_configure_vfs() selftests/bpf: Fix also no-alu32 strobemeta selftest bpf: Add missing map_delete_elem method to bloom filter map selftests/bpf: Add bloom map success test for userspace calls bpf: Add alignment padding for "map_extra" + consolidate holes bpf: Bloom filter map naming fixups selftests/bpf: Add test cases for struct_ops prog bpf: Add dummy BPF STRUCT_OPS for test purpose ...
2021-11-02Merge branch 'for-5.16-vsprintf-pgp' into for-linusPetr Mladek
2021-11-01Merge tag 'audit-pr-20211101' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "Add some additional audit logging to capture the openat2() syscall open_how struct info. Previous variations of the open()/openat() syscalls allowed audit admins to inspect the syscall args to get the information contained in the new open_how struct used in openat2()" * tag 'audit-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: return early if the filter rule has a lower priority audit: add OPENAT2 record to list "how" info audit: add support for the openat2 syscall audit: replace magic audit syscall class numbers with macros lsm_audit: avoid overloading the "key" audit field audit: Convert to SPDX identifier audit: rename struct node to struct audit_node to prevent future name collisions
2021-11-01Merge tag 'selinux-pr-20211101' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux updates from Paul Moore: - Add LSM/SELinux/Smack controls and auditing for io-uring. As usual, the individual commit descriptions have more detail, but we were basically missing two things which we're adding here: + establishment of a proper audit context so that auditing of io-uring ops works similarly to how it does for syscalls (with some io-uring additions because io-uring ops are *not* syscalls) + additional LSM hooks to enable access control points for some of the more unusual io-uring features, e.g. credential overrides. The additional audit callouts and LSM hooks were done in conjunction with the io-uring folks, based on conversations and RFC patches earlier in the year. - Fixup the binder credential handling so that the proper credentials are used in the LSM hooks; the commit description and the code comment which is removed in these patches are helpful to understand the background and why this is the proper fix. - Enable SELinux genfscon policy support for securityfs, allowing improved SELinux filesystem labeling for other subsystems which make use of securityfs, e.g. IMA. * tag 'selinux-pr-20211101' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: security: Return xattr name from security_dentry_init_security() selinux: fix a sock regression in selinux_ip_postroute_compat() binder: use cred instead of task for getsecid binder: use cred instead of task for selinux checks binder: use euid from cred instead of using task LSM: Avoid warnings about potentially unused hook variables selinux: fix all of the W=1 build warnings selinux: make better use of the nf_hook_state passed to the NF hooks selinux: fix race condition when computing ocontext SIDs selinux: remove unneeded ipv6 hook wrappers selinux: remove the SELinux lockdown implementation selinux: enable genfscon labeling for securityfs Smack: Brutalist io_uring support selinux: add support for the io_uring access controls lsm,io_uring: add LSM hooks to io_uring io_uring: convert io_uring to the secure anon inode interface fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() audit: add filtering for io_uring records audit,io_uring,io-wq: add some basic audit support to io_uring audit: prepare audit_context for use in calling contexts beyond syscalls
2021-11-01Merge tag 'rcu.2021.11.01a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull RCU updates from Paul McKenney: - Miscellaneous fixes - Torture-test updates for smp_call_function(), most notably improved checking of module parameters. - Tasks-trace RCU updates that fix a number of rare but important race-condition bugs. - Other torture-test updates, most notably better checking of module parameters. In addition, rcutorture may once again be run on CONFIG_PREEMPT_RT kernels. - Torture-test scripting updates, most notably specifying the new CONFIG_KCSAN_STRICT kconfig option rather than maintaining an ever-changing list of individual KCSAN kconfig options. * tag 'rcu.2021.11.01a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (46 commits) rcu: Fix rcu_dynticks_curr_cpu_in_eqs() vs noinstr rcu: Always inline rcu_dynticks_task*_{enter,exit}() torture: Make kvm-remote.sh print size of downloaded tarball torture: Allot 1G of memory for scftorture runs tools/rcu: Add an extract-stall script scftorture: Warn on individual scf_torture_init() error conditions scftorture: Count reschedule IPIs scftorture: Account for weight_resched when checking for all zeroes scftorture: Shut down if nonsensical arguments given scftorture: Allow zero weight to exclude an smp_call_function*() category rcu: Avoid unneeded function call in rcu_read_unlock() rcu-tasks: Update comments to cond_resched_tasks_rcu_qs() rcu-tasks: Fix IPI failure handling in trc_wait_for_one_reader rcu-tasks: Fix read-side primitives comment for call_rcu_tasks_trace rcu-tasks: Clarify read side section info for rcu_tasks_rude GP primitives rcu-tasks: Correct comparisons for CPU numbers in show_stalled_task_trace rcu-tasks: Correct firstreport usage in check_all_holdout_tasks_trace rcu-tasks: Fix s/rcu_add_holdout/trc_add_holdout/ typo in comment rcu-tasks: Move RTGS_WAIT_CBS to beginning of rcu_tasks_kthread() loop rcu-tasks: Fix s/instruction/instructions/ typo in comment ...
2021-11-01Merge tag 'trace-v5.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - kprobes: Restructured stack unwinder to show properly on x86 when a stack dump happens from a kretprobe callback. - Fix to bootconfig parsing - Have tracefs allow owner and group permissions by default (only denying others). There's been pressure to allow non root to tracefs in a controlled fashion, and using groups is probably the safest. - Bootconfig memory managament updates. - Bootconfig clean up to have the tools directory be less dependent on changes in the kernel tree. - Allow perf to be traced by function tracer. - Rewrite of function graph tracer to be a callback from the function tracer instead of having its own trampoline (this change will happen on an arch by arch basis, and currently only x86_64 implements it). - Allow multiple direct trampolines (bpf hooks to functions) be batched together in one synchronization. - Allow histogram triggers to add variables that can perform calculations against the event's fields. - Use the linker to determine architecture callbacks from the ftrace trampoline to allow for proper parameter prototypes and prevent warnings from the compiler. - Extend histogram triggers to key off of variables. - Have trace recursion use bit magic to determine preempt context over if branches. - Have trace recursion disable preemption as all use cases do anyway. - Added testing for verification of tracing utilities. - Various small clean ups and fixes. * tag 'trace-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (101 commits) tracing/histogram: Fix semicolon.cocci warnings tracing/histogram: Fix documentation inline emphasis warning tracing: Increase PERF_MAX_TRACE_SIZE to handle Sentinel1 and docker together tracing: Show size of requested perf buffer bootconfig: Initialize ret in xbc_parse_tree() ftrace: do CPU checking after preemption disabled ftrace: disable preemption when recursion locked tracing/histogram: Document expression arithmetic and constants tracing/histogram: Optimize division by a power of 2 tracing/histogram: Covert expr to const if both operands are constants tracing/histogram: Simplify handling of .sym-offset in expressions tracing: Fix operator precedence for hist triggers expression tracing: Add division and multiplication support for hist triggers tracing: Add support for creating hist trigger variables from literal selftests/ftrace: Stop tracing while reading the trace file by default MAINTAINERS: Update KPROBES and TRACING entries test_kprobes: Move it from kernel/ to lib/ docs, kprobes: Remove invalid URL and add new reference samples/kretprobes: Fix return value if register_kretprobe() failed lib/bootconfig: Fix the xbc_get_info kerneldoc ...
2021-11-01Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2021-11-01 We've added 181 non-merge commits during the last 28 day(s) which contain a total of 280 files changed, 11791 insertions(+), 5879 deletions(-). The main changes are: 1) Fix bpf verifier propagation of 64-bit bounds, from Alexei. 2) Parallelize bpf test_progs, from Yucong and Andrii. 3) Deprecate various libbpf apis including af_xdp, from Andrii, Hengqi, Magnus. 4) Improve bpf selftests on s390, from Ilya. 5) bloomfilter bpf map type, from Joanne. 6) Big improvements to JIT tests especially on Mips, from Johan. 7) Support kernel module function calls from bpf, from Kumar. 8) Support typeless and weak ksym in light skeleton, from Kumar. 9) Disallow unprivileged bpf by default, from Pawan. 10) BTF_KIND_DECL_TAG support, from Yonghong. 11) Various bpftool cleanups, from Quentin. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (181 commits) libbpf: Deprecate AF_XDP support kbuild: Unify options for BTF generation for vmlinux and modules selftests/bpf: Add a testcase for 64-bit bounds propagation issue. bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit. bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off. selftests/bpf: Fix also no-alu32 strobemeta selftest bpf: Add missing map_delete_elem method to bloom filter map selftests/bpf: Add bloom map success test for userspace calls bpf: Add alignment padding for "map_extra" + consolidate holes bpf: Bloom filter map naming fixups selftests/bpf: Add test cases for struct_ops prog bpf: Add dummy BPF STRUCT_OPS for test purpose bpf: Factor out helpers for ctx access checking bpf: Factor out a helper to prepare trampoline for struct_ops prog selftests, bpf: Fix broken riscv build riscv, libbpf: Add RISC-V (RV64) support to bpf_tracing.h tools, build: Add RISC-V to HOSTARCH parsing riscv, bpf: Increase the maximum number of iterations selftests, bpf: Add one test for sockmap with strparser selftests, bpf: Fix test_txmsg_ingress_parser error ... ==================== Link: https://lore.kernel.org/r/20211102013123.9005-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01bpf: Fix propagation of signed bounds from 64-bit min/max into 32-bit.Alexei Starovoitov
Similar to unsigned bounds propagation fix signed bounds. The 'Fixes' tag is a hint. There is no security bug here. The verifier was too conservative. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211101222153.78759-2-alexei.starovoitov@gmail.com
2021-11-01bpf: Fix propagation of bounds from 64-bit min/max into 32-bit and var_off.Alexei Starovoitov
Before this fix: 166: (b5) if r2 <= 0x1 goto pc+22 from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0xffffffff)) After this fix: 166: (b5) if r2 <= 0x1 goto pc+22 from 166 to 189: R2=invP(id=1,umax_value=1,var_off=(0x0; 0x1)) While processing BPF_JLE the reg_set_min_max() would set true_reg->umax_value = 1 and call __reg_combine_64_into_32(true_reg). Without the fix it would not pass the condition: if (__reg64_bound_u32(reg->umin_value) && __reg64_bound_u32(reg->umax_value)) since umin_value == 0 at this point. Before commit 10bf4e83167c the umin was incorrectly ingored. The commit 10bf4e83167c fixed the correctness issue, but pessimized propagation of 64-bit min max into 32-bit min max and corresponding var_off. Fixes: 10bf4e83167c ("bpf: Fix propagation of 32 bit unsigned bounds from 64 bit bounds") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211101222153.78759-1-alexei.starovoitov@gmail.com
2021-11-01tracing/histogram: Document hist trigger variablesKalesh Singh
Update the tracefs README to describe how hist trigger variables can be created. Link: https://lkml.kernel.org/r/20211029183339.3216491-4-kaleshsingh@google.com Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/histogram: Optimize division by constantsKalesh Singh
If the divisor is a constant use specific division functions to avoid extra branches when the trigger is hit. If the divisor constant but not a power of 2, the division can be replaced with a multiplication and shift in the following case: Let X = dividend and Y = divisor. Choose Z = some power of 2. If Y <= Z, then: X / Y = (X * (Z / Y)) / Z (Z / Y) is a constant (mult) which is calculated at parse time, so: X / Y = (X * mult) / Z The division by Z can be replaced by a shift since Z is a power of 2: X / Y = (X * mult) >> shift As long, as X < Z the results will not be off by more than 1. Link: https://lkml.kernel.org/r/20211029232410.3494196-1-kaleshsingh@google.com Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01Merge tag 'hardening-v5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull compiler hardening updates from Kees Cook: "These are various compiler-related hardening feature updates. Notable is the addition of an explicit limited rationale for, and deprecation schedule of, gcc-plugins. gcc-plugins: - remove support for GCC 4.9 and older (Ard Biesheuvel) - remove duplicate include in gcc-common.h (Ye Guojin) - Explicitly document purpose and deprecation schedule (Kees Cook) - Remove cyc_complexity (Kees Cook) instrumentation: - Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO (Kees Cook) Clang LTO: - kallsyms: strip LTO suffixes from static functions (Nick Desaulniers)" * tag 'hardening-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: gcc-plugins: remove duplicate include in gcc-common.h gcc-plugins: Remove cyc_complexity gcc-plugins: Explicitly document purpose and deprecation schedule kallsyms: strip LTO suffixes from static functions gcc-plugins: remove support for GCC 4.9 and older hardening: Avoid harmless Clang option under CONFIG_INIT_STACK_ALL_ZERO
2021-11-01Merge tag 'cpu-to-thread_info-v5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull thread_info update to move 'cpu' back from task_struct from Kees Cook: "Cross-architecture update to move task_struct::cpu back into thread_info on arm64, x86, s390, powerpc, and riscv. All Acked by arch maintainers. Quoting Ard Biesheuvel: 'Move task_struct::cpu back into thread_info Keeping CPU in task_struct is problematic for architectures that define raw_smp_processor_id() in terms of this field, as it requires linux/sched.h to be included, which causes a lot of pain in terms of circular dependencies (aka 'header soup') This series moves it back into thread_info (where it came from) for all architectures that enable THREAD_INFO_IN_TASK, addressing the header soup issue as well as some pointless differences in the implementations of task_cpu() and set_task_cpu()'" * tag 'cpu-to-thread_info-v5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: riscv: rely on core code to keep thread_info::cpu updated powerpc: smp: remove hack to obtain offset of task_struct::cpu sched: move CPU field back into thread_info if THREAD_INFO_IN_TASK=y powerpc: add CPU field to struct thread_info s390: add CPU field to struct thread_info x86: add CPU field to struct thread_info arm64: add CPU field to struct thread_info
2021-11-01Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "There's the usual summary below, but the highlights are support for the Armv8.6 timer extensions, KASAN support for asymmetric MTE, the ability to kexec() with the MMU enabled and a second attempt at switching to the generic pfn_valid() implementation. Summary: - Support for the Arm8.6 timer extensions, including a self-synchronising view of the system registers to elide some expensive ISB instructions. - Exception table cleanup and rework so that the fixup handlers appear correctly in backtraces. - A handful of miscellaneous changes, the main one being selection of CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK. - More mm and pgtable cleanups. - KASAN support for "asymmetric" MTE, where tag faults are reported synchronously for loads (via an exception) and asynchronously for stores (via a register). - Support for leaving the MMU enabled during kexec relocation, which significantly speeds up the operation. - Minor improvements to our perf PMU drivers. - Improvements to the compat vDSO build system, particularly when building with LLVM=1. - Preparatory work for handling some Coresight TRBE tracing errata. - Cleanup and refactoring of the SVE code to pave the way for SME support in future. - Ensure SCS pages are unpoisoned immediately prior to freeing them when KASAN is enabled for the vmalloc area. - Try moving to the generic pfn_valid() implementation again now that the DMA mapping issue from last time has been resolved. - Numerous improvements and additions to our FPSIMD and SVE selftests" [ armv8.6 timer updates were in a shared branch and already came in through -tip in the timer pull - Linus ] * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits) arm64: Select POSIX_CPU_TIMERS_TASK_WORK arm64: Document boot requirements for FEAT_SME_FA64 arm64/sve: Fix warnings when SVE is disabled arm64/sve: Add stub for sve_max_virtualisable_vl() arm64: errata: Add detection for TRBE write to out-of-range arm64: errata: Add workaround for TSB flush failures arm64: errata: Add detection for TRBE overwrite in FILL mode arm64: Add Neoverse-N2, Cortex-A710 CPU part definition selftests: arm64: Factor out utility functions for assembly FP tests arm64: vmlinux.lds.S: remove `.fixup` section arm64: extable: add load_unaligned_zeropad() handler arm64: extable: add a dedicated uaccess handler arm64: extable: add `type` and `data` fields arm64: extable: use `ex` for `exception_table_entry` arm64: extable: make fixup_exception() return bool arm64: extable: consolidate definitions arm64: gpr-num: support W registers arm64: factor out GPR numbering helpers arm64: kvm: use kvm_exception_table_entry arm64: lib: __arch_copy_to_user(): fold fixups into body ...
2021-11-01Merge tag 'x86_cc_for_v5.16_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull generic confidential computing updates from Borislav Petkov: "Add an interface called cc_platform_has() which is supposed to be used by confidential computing solutions to query different aspects of the system. The intent behind it is to unify testing of such aspects instead of having each confidential computing solution add its own set of tests to code paths in the kernel, leading to an unwieldy mess" * tag 'x86_cc_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide: Replace the use of mem_encrypt_active() with cc_platform_has() x86/sev: Replace occurrences of sev_es_active() with cc_platform_has() x86/sev: Replace occurrences of sev_active() with cc_platform_has() x86/sme: Replace occurrences of sme_active() with cc_platform_has() powerpc/pseries/svm: Add a powerpc version of cc_platform_has() x86/sev: Add an x86 version of cc_platform_has() arch/cc: Introduce a function to check for confidential computing features x86/ioremap: Selectively build arch override encryption functions
2021-11-01tracing/osnoise: Remove PREEMPT_RT ifdefs from inside functionsDaniel Bristot de Oliveira
Remove CONFIG_PREEMPT_RT from inside functions, avoiding compilation problems in the future. Link: https://lkml.kernel.org/r/37ee0881b033cdc513efc84ebea26cf77880c8c2.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Remove STACKTRACE ifdefs from inside functionsDaniel Bristot de Oliveira
Remove CONFIG_STACKTRACE from inside functions, avoiding compilation problems in the future. Link: https://lkml.kernel.org/r/3465cca2f28e1ba602a1fc8bdb28d12950b5226e.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Allow multiple instances of the same tracerDaniel Bristot de Oliveira
Currently, the user can start only one instance of timerlat/osnoise tracers and the tracers cannot run in parallel. As starting point to add more flexibility, let's allow the same tracer to run on different trace instances. The workload will start when the first trace_array (instance) is registered and stop when the last instance is unregistered. So, while this patch allows the same tracer to run in multiple instances (e.g., two instances running osnoise), it still does not allow instances of timerlat and osnoise in parallel (e.g., one timerlat and osnoise). That is because the osnoise: events have different behavior depending on which tracer is enabled (osnoise or timerlat). Enabling the parallel usage of these two tracers is in my TODO list. Link: https://lkml.kernel.org/r/38c8f14b613492a4f3f938d9d3bf0b063b72f0f0.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Remove TIMERLAT ifdefs from inside functionsDaniel Bristot de Oliveira
Remove CONFIG_TIMERLAT_TRACER from inside functions, avoiding compilation problems in the future. Link: https://lkml.kernel.org/r/8245abb5a112d249f5da6c1df499244ad9e647bc.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Support a list of trace_array *trDaniel Bristot de Oliveira
osnoise/timerlat were built to run a single instance, and for this, a single variable is enough to store the current struct trace_array *tr with information about the tracing instance. This is done via the *osnoise_trace variable. A trace_array represents a trace instance. In preparation to support multiple instances, replace the *osnoise_trace variable with an RCU protected list of instances. The operations that refer to an instance now propagate to all elements of the list (all instances). Also, replace the osnoise_busy variable with a check if the list has elements (busy). No functional change is expected with this patch, i.e., only one instance is allowed yet. Link: https://lkml.kernel.org/r/91d006e889b9a5d1ff258fe6077f021ae3f26372.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Use start/stop_per_cpu_kthreads() on osnoise_cpus_write()Daniel Bristot de Oliveira
When writing a new CPU mask via osnoise/cpus, if the tracer is running, the workload is restarted to follow the new cpumask. The restart is currently done using osnoise_workload_start/stop(), which disables the workload *and* the instrumentation. However, disabling the instrumentation is not necessary. Calling start/stop_per_cpu_kthreads() is enough to apply the new osnoise/cpus config. Link: https://lkml.kernel.org/r/ee633e82867c5b88851aa6040522a799c0034486.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Split workload start from the tracer startDaniel Bristot de Oliveira
In preparation from supporting multiple trace instances, create workload start/stop specific functions. No functional change. Link: https://lkml.kernel.org/r/74b090971e9acdd13625be1c28ef3270d2275e77.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Improve comments about barrier need for NMI callbacksDaniel Bristot de Oliveira
trace_osnoise_callback_enabled is used by ftrace_nmi_enter/exit() to know when to call the NMI callback. The barrier is used to avoid having callbacks enabled before the resetting date during the start or to touch the values after stopping the tracer. Link: https://lkml.kernel.org/r/a413b8f14aa9312fbd1ba99f96225a8aed831053.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01tracing/osnoise: Do not follow tracing_cpumaskDaniel Bristot de Oliveira
In preparation to support multiple instances, decouple the osnoise/timelat workload from instance-specific tracing_cpumask. Different instances can have conflicting cpumasks, making osnoise workload management needlessly complex. Osnoise already has its global cpumask. I also thought about using the first instance mask, but the "first" instance could be removed before the others. This also fixes the problem that changing the tracing_mask was not re-starting the trace. Link: https://lkml.kernel.org/r/169a71bcc919ce3ab53ae6f9ca5cde57fffaf9c6.1635702894.git.bristot@kernel.org Cc: Ingo Molnar <mingo@redhat.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: linux-rt-users@vger.kernel.org Cc: linux-trace-devel@vger.kernel.org Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-01bpf: Add missing map_delete_elem method to bloom filter mapEric Dumazet
Without it, kernel crashes in map_delete_elem(), as reported by syzbot. BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 72c97067 P4D 72c97067 PUD 1e20c067 PMD 0 Oops: 0010 [#1] PREEMPT SMP KASAN CPU: 0 PID: 6518 Comm: syz-executor196 Not tainted 5.15.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:0x0 Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. RSP: 0018:ffffc90002bafcb8 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: 1ffff92000575f9f RCX: 0000000000000000 RDX: 1ffffffff1327aba RSI: 0000000000000000 RDI: ffff888025a30c00 RBP: ffffc90002baff08 R08: 0000000000000000 R09: 0000000000000001 R10: ffffffff818525d8 R11: 0000000000000000 R12: ffffffff8993d560 R13: ffff888025a30c00 R14: ffff888024bc0000 R15: 0000000000000000 FS: 0000555557491300(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 0000000070189000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: map_delete_elem kernel/bpf/syscall.c:1220 [inline] __sys_bpf+0x34f1/0x5ee0 kernel/bpf/syscall.c:4606 __do_sys_bpf kernel/bpf/syscall.c:4719 [inline] __se_sys_bpf kernel/bpf/syscall.c:4717 [inline] __x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:4717 do_syscall_x64 arch/x86/entry/common.c:50 [inline] Fixes: 9330986c0300 ("bpf: Add bloom filter map implementation") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211031171353.4092388-1-eric.dumazet@gmail.com
2021-11-01bpf: Bloom filter map naming fixupsJoanne Koong
This patch has two changes in the kernel bloom filter map implementation: 1) Change the names of map-ops functions to include the "bloom_map" prefix. As Martin pointed out on a previous patchset, having generic map-ops names may be confusing in tracing and in perf-report. 2) Drop the "& 0xF" when getting nr_hash_funcs, since we already ascertain that no other bits in map_extra beyond the first 4 bits can be set. Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211029224909.1721024-2-joannekoong@fb.com
2021-11-01bpf: Add dummy BPF STRUCT_OPS for test purposeHou Tao
Currently the test of BPF STRUCT_OPS depends on the specific bpf implementation of tcp_congestion_ops, but it can not cover all basic functionalities (e.g, return value handling), so introduce a dummy BPF STRUCT_OPS for test purpose. Loading a bpf_dummy_ops implementation from userspace is prohibited, and its only purpose is to run BPF_PROG_TYPE_STRUCT_OPS program through bpf(BPF_PROG_TEST_RUN). Now programs for test_1() & test_2() are supported. The following three cases are exercised in bpf_dummy_struct_ops_test_run(): (1) test and check the value returned from state arg in test_1(state) The content of state is copied from userspace pointer and copied back after calling test_1(state). The user pointer is saved in an u64 array and the array address is passed through ctx_in. (2) test and check the return value of test_1(NULL) Just simulate the case in which an invalid input argument is passed in. (3) test multiple arguments passing in test_2(state, ...) 5 arguments are passed through ctx_in in form of u64 array. The first element of array is userspace pointer of state and others 4 arguments follow. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20211025064025.2567443-4-houtao1@huawei.com
2021-11-01bpf: Factor out helpers for ctx access checkingHou Tao
Factor out two helpers to check the read access of ctx for raw tp and BTF function. bpf_tracing_ctx_access() is used to check the read access to argument is valid, and bpf_tracing_btf_ctx_access() checks whether the btf type of argument is valid besides the checking of argument read. bpf_tracing_btf_ctx_access() will be used by the following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20211025064025.2567443-3-houtao1@huawei.com
2021-11-01bpf: Factor out a helper to prepare trampoline for struct_ops progHou Tao
Factor out a helper bpf_struct_ops_prepare_trampoline() to prepare trampoline for BPF_PROG_TYPE_STRUCT_OPS prog. It will be used by .test_run callback in following patch. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20211025064025.2567443-2-houtao1@huawei.com
2021-11-01Merge tag 'x86-fpu-2021-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: - Cleanup of extable fixup handling to be more robust, which in turn allows to make the FPU exception fixups more robust as well. - Change the return code for signal frame related failures from explicit error codes to a boolean fail/success as that's all what the calling code evaluates. - A large refactoring of the FPU code to prepare for adding AMX support: - Distangle the public header maze and remove especially the misnomed kitchen sink internal.h which is despite it's name included all over the place. - Add a proper abstraction for the register buffer storage (struct fpstate) which allows to dynamically size the buffer at runtime by flipping the pointer to the buffer container from the default container which is embedded in task_struct::tread::fpu to a dynamically allocated container with a larger register buffer. - Convert the code over to the new fpstate mechanism. - Consolidate the KVM FPU handling by moving the FPU related code into the FPU core which removes the number of exports and avoids adding even more export when AMX has to be supported in KVM. This also removes duplicated code which was of course unnecessary different and incomplete in the KVM copy. - Simplify the KVM FPU buffer handling by utilizing the new fpstate container and just switching the buffer pointer from the user space buffer to the KVM guest buffer when entering vcpu_run() and flipping it back when leaving the function. This cuts the memory requirements of a vCPU for FPU buffers in half and avoids pointless memory copy operations. This also solves the so far unresolved problem of adding AMX support because the current FPU buffer handling of KVM inflicted a circular dependency between adding AMX support to the core and to KVM. With the new scheme of switching fpstate AMX support can be added to the core code without affecting KVM. - Replace various variables with proper data structures so the extra information required for adding dynamically enabled FPU features (AMX) can be added in one place - Add AMX (Advanced Matrix eXtensions) support (finally): AMX is a large XSTATE component which is going to be available with Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD) which allows to trap the (first) use of an AMX related instruction, which has two benefits: 1) It allows the kernel to control access to the feature 2) It allows the kernel to dynamically allocate the large register state buffer instead of burdening every task with the the extra 8K or larger state storage. It would have been great to gain this kind of control already with AVX512. The support comes with the following infrastructure components: 1) arch_prctl() to - read the supported features (equivalent to XGETBV(0)) - read the permitted features for a task - request permission for a dynamically enabled feature Permission is granted per process, inherited on fork() and cleared on exec(). The permission policy of the kernel is restricted to sigaltstack size validation, but the syscall obviously allows further restrictions via seccomp etc. 2) A stronger sigaltstack size validation for sys_sigaltstack(2) which takes granted permissions and the potentially resulting larger signal frame into account. This mechanism can also be used to enforce factual sigaltstack validation independent of dynamic features to help with finding potential victims of the 2K sigaltstack size constant which is broken since AVX512 support was added. 3) Exception handling for #NM traps to catch first use of a extended feature via a new cause MSR. If the exception was caused by the use of such a feature, the handler checks permission for that feature. If permission has not been granted, the handler sends a SIGILL like the #UD handler would do if the feature would have been disabled in XCR0. If permission has been granted, then a new fpstate which fits the larger buffer requirement is allocated. In the unlikely case that this allocation fails, the handler sends SIGSEGV to the task. That's not elegant, but unavoidable as the other discussed options of preallocation or full per task permissions come with their own set of horrors for kernel and/or userspace. So this is the lesser of the evils and SIGSEGV caused by unexpected memory allocation failures is not a fundamentally new concept either. When allocation succeeds, the fpstate properties are filled in to reflect the extended feature set and the resulting sizes, the fpu::fpstate pointer is updated accordingly and the trap is disarmed for this task permanently. 4) Enumeration and size calculations 5) Trap switching via MSR_XFD The XFD (eXtended Feature Disable) MSR is context switched with the same life time rules as the FPU register state itself. The mechanism is keyed off with a static key which is default disabled so !AMX equipped CPUs have zero overhead. On AMX enabled CPUs the overhead is limited by comparing the tasks XFD value with a per CPU shadow variable to avoid redundant MSR writes. In case of switching from a AMX using task to a non AMX using task or vice versa, the extra MSR write is obviously inevitable. All other places which need to be aware of the variable feature sets and resulting variable sizes are not affected at all because they retrieve the information (feature set, sizes) unconditonally from the fpstate properties. 6) Enable the new AMX states Note, this is relatively new code despite the fact that AMX support is in the works for more than a year now. The big refactoring of the FPU code, which allowed to do a proper integration has been started exactly 3 weeks ago. Refactoring of the existing FPU code and of the original AMX patches took a week and has been subject to extensive review and testing. The only fallout which has not been caught in review and testing right away was restricted to AMX enabled systems, which is completely irrelevant for anyone outside Intel and their early access program. There might be dragons lurking as usual, but so far the fine grained refactoring has held up and eventual yet undetected fallout is bisectable and should be easily addressable before the 5.16 release. Famous last words... Many thanks to Chang Bae and Dave Hansen for working hard on this and also to the various test teams at Intel who reserved extra capacity to follow the rapid development of this closely which provides the confidence level required to offer this rather large update for inclusion into 5.16-rc1 * tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits) Documentation/x86: Add documentation for using dynamic XSTATE features x86/fpu: Include vmalloc.h for vzalloc() selftests/x86/amx: Add context switch test selftests/x86/amx: Add test cases for AMX state management x86/fpu/amx: Enable the AMX feature in 64-bit mode x86/fpu: Add XFD handling for dynamic states x86/fpu: Calculate the default sizes independently x86/fpu/amx: Define AMX state components and have it used for boot-time checks x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers x86/fpu/xstate: Add fpstate_realloc()/free() x86/fpu/xstate: Add XFD #NM handler x86/fpu: Update XFD state where required x86/fpu: Add sanity checks for XFD x86/fpu: Add XFD state to fpstate x86/msr-index: Add MSRs for XFD x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit x86/fpu: Reset permission and fpstate on exec() x86/fpu: Prepare fpu_clone() for dynamically enabled features x86/fpu/signal: Prepare for variable sigframe length x86/signal: Use fpu::__state_user_size for sigalt stack validation ...
2021-11-01Merge tag 'sched-core-2021-11-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Thomas Gleixner: - Revert the printk format based wchan() symbol resolution as it can leak the raw value in case that the symbol is not resolvable. - Make wchan() more robust and work with all kind of unwinders by enforcing that the task stays blocked while unwinding is in progress. - Prevent sched_fork() from accessing an invalid sched_task_group - Improve asymmetric packing logic - Extend scheduler statistics to RT and DL scheduling classes and add statistics for bandwith burst to the SCHED_FAIR class. - Properly account SCHED_IDLE entities - Prevent a potential deadlock when initial priority is assigned to a newly created kthread. A recent change to plug a race between cpuset and __sched_setscheduler() introduced a new lock dependency which is now triggered. Break the lock dependency chain by moving the priority assignment to the thread function. - Fix the idle time reporting in /proc/uptime for NOHZ enabled systems. - Improve idle balancing in general and especially for NOHZ enabled systems. - Provide proper interfaces for live patching so it does not have to fiddle with scheduler internals. - Add cluster aware scheduling support. - A small set of tweaks for RT (irqwork, wait_task_inactive(), various scheduler options and delaying mmdrop) - The usual small tweaks and improvements all over the place * tag 'sched-core-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits) sched/fair: Cleanup newidle_balance sched/fair: Remove sysctl_sched_migration_cost condition sched/fair: Wait before decaying max_newidle_lb_cost sched/fair: Skip update_blocked_averages if we are defering load balance sched/fair: Account update_blocked_averages in newidle_balance cost x86: Fix __get_wchan() for !STACKTRACE sched,x86: Fix L2 cache mask sched/core: Remove rq_relock() sched: Improve wake_up_all_idle_cpus() take #2 irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT irq_work: Handle some irq_work in a per-CPU thread on PREEMPT_RT irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support. sched/rt: Annotate the RT balancing logic irqwork as IRQ_WORK_HARD_IRQ sched: Add cluster scheduler level for x86 sched: Add cluster scheduler level in core and related Kconfig for ARM64 topology: Represent clusters of CPUs within a die sched: Disable -Wunused-but-set-variable sched: Add wrapper for get_wchan() to keep task blocked x86: Fix get_wchan() to support the ORC unwinder proc: Use task_is_running() for wchan in /proc/$pid/stat ...
2021-11-01Merge tag 'objtool-core-2021-10-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool updates from Thomas Gleixner: - Improve retpoline code patching by separating it from alternatives which reduces memory footprint and allows to do better optimizations in the actual runtime patching. - Add proper retpoline support for x86/BPF - Address noinstr warnings in x86/kvm, lockdep and paravirtualization code - Add support to handle pv_opsindirect calls in the noinstr analysis - Classify symbols upfront and cache the result to avoid redundant str*cmp() invocations. - Add a CFI hash to reduce memory consumption which also reduces runtime on a allyesconfig by ~50% - Adjust XEN code to make objtool handling more robust and as a side effect to prevent text fragmentation due to placement of the hypercall page. * tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits) bpf,x86: Respect X86_FEATURE_RETPOLINE* bpf,x86: Simplify computing label offsets x86,bugs: Unconditionally allow spectre_v2=retpoline,amd x86/alternative: Add debug prints to apply_retpolines() x86/alternative: Try inline spectre_v2=retpoline,amd x86/alternative: Handle Jcc __x86_indirect_thunk_\reg x86/alternative: Implement .retpoline_sites support x86/retpoline: Create a retpoline thunk array x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h x86/asm: Fixup odd GEN-for-each-reg.h usage x86/asm: Fix register order x86/retpoline: Remove unused replacement symbols objtool,x86: Replace alternatives with .retpoline_sites objtool: Shrink struct instruction objtool: Explicitly avoid self modifying code in .altinstr_replacement objtool: Classify symbols objtool: Support pv_opsindirect calls for noinstr x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays x86/xen: Mark xen_force_evtchn_callback() noinstr x86/xen: Make irq_disable() noinstr ...
2021-11-01Merge tag 'locking-core-2021-10-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Thomas Gleixner: - Move futex code into kernel/futex/ and split up the kitchen sink into seperate files to make integration of sys_futex_waitv() simpler. - Add a new sys_futex_waitv() syscall which allows to wait on multiple futexes. The main use case is emulating Windows' WaitForMultipleObjects which allows Wine to improve the performance of Windows Games. Also native Linux games can benefit from this interface as this is a common wait pattern for this kind of applications. - Add context to ww_mutex_trylock() to provide a path for i915 to rework their eviction code step by step without making lockdep upset until the final steps of rework are completed. It's also useful for regulator and TTM to avoid dropping locks in the non contended path. - Lockdep and might_sleep() cleanups and improvements - A few improvements for the RT substitutions. - The usual small improvements and cleanups. * tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) locking: Remove spin_lock_flags() etc locking/rwsem: Fix comments about reader optimistic lock stealing conditions locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able() locking/rwsem: Disable preemption for spinning region docs: futex: Fix kernel-doc references futex: Fix PREEMPT_RT build futex2: Documentation: Document sys_futex_waitv() uAPI selftests: futex: Test sys_futex_waitv() wouldblock selftests: futex: Test sys_futex_waitv() timeout selftests: futex: Add sys_futex_waitv() test futex,arm: Wire up sys_futex_waitv() futex,x86: Wire up sys_futex_waitv() futex: Implement sys_futex_waitv() futex: Simplify double_lock_hb() futex: Split out wait/wake futex: Split out requeue futex: Rename mark_wake_futex() futex: Rename: match_futex() futex: Rename: hb_waiter_{inc,dec,pending}() futex: Split out PI futex ...
2021-11-01Merge tag 'perf-core-2021-10-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Thomas Gleixner: "Core: - Allow ftrace to instrument parts of the perf core code - Add a new mem_hops field to perf_mem_data_src which allows to represent intra-node/package or inter-node/off-package details to prepare for next generation systems which have more hieararchy within the node/pacakge level. Tools: - Update for the new mem_hops field in perf_mem_data_src Arch: - A set of constraints fixes for the Intel uncore PMU - The usual set of small fixes and improvements for x86 and PPC" * tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses tools/perf: Add mem_hops field in perf_mem_data_src structure perf: Add mem_hops field in perf_mem_data_src structure perf: Add comment about current state of PERF_MEM_LVL_* namespace and remove an extra line perf/core: Allow ftrace for functions in kernel/event/core.c perf/x86: Add new event for AUX output counter index perf/x86: Add compiler barrier after updating BTS perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints perf/x86/intel/uncore: Fix Intel SPR IIO event constraints perf/x86/intel/uncore: Fix Intel SPR CHA event constraints perf/x86/intel/uncore: Fix Intel ICX IIO event constraints perf/x86/intel/uncore: Fix invalid unit check perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
2021-11-01Merge tag 'irq-core-2021-10-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt subsystem: Core changes: - Prevent a potential deadlock when initial priority is assigned to a newly created interrupt thread. A recent change to plug a race between cpuset and __sched_setscheduler() introduced a new lock dependency which is now triggered. Break the lock dependency chain by moving the priority assignment to the thread function. - A couple of small updates to make the irq core RT safe. - Confine the irq_cpu_online/offline() API to the only left unfixable user Cavium Octeon so that it does not grow new usage. - A small documentation update Driver changes: - A large cross architecture rework to move irq_enter/exit() into the architecture code to make addressing the NOHZ_FULL/RCU issues simpler. - The obligatory new irq chip driver for Microchip EIC - Modularize a few irq chip drivers - Expand usage of devm_*() helpers throughout the driver code - The usual small fixes and improvements all over the place" * tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) h8300: Fix linux/irqchip.h include mess dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings MIPS: irq: Avoid an unused-variable error genirq: Hide irq_cpu_{on,off}line() behind a deprecated option irqchip/mips-gic: Get rid of the reliance on irq_cpu_online() MIPS: loongson64: Drop call to irq_cpu_offline() irq: remove handle_domain_{irq,nmi}() irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY irq: riscv: perform irqentry in entry code irq: openrisc: perform irqentry in entry code irq: csky: perform irqentry in entry code irq: arm64: perform irqentry in entry code irq: arm: perform irqentry in entry code irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ irq: add generic_handle_arch_irq() irq: unexport handle_irq_desc() irq: simplify handle_domain_{irq,nmi}() irq: mips: simplify do_domain_IRQ() ...
2021-11-01cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.cHe Fengqing
In commit 324bda9e6c5a("bpf: multi program support for cgroup+bpf") cgroup_bpf_*() called from kernel/bpf/syscall.c, but now they are only used in kernel/bpf/cgroup.c, so move these function to kernel/bpf/cgroup.c, like cgroup_bpf_replace(). Signed-off-by: He Fengqing <hefengqing@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-01cgroup: Fix rootcg cpu.stat guest double countingDan Schatzberg
In account_guest_time in kernel/sched/cputime.c guest time is attributed to both CPUTIME_NICE and CPUTIME_USER in addition to CPUTIME_GUEST_NICE and CPUTIME_GUEST respectively. Therefore, adding both to calculate usage results in double counting any guest time at the rootcg. Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup") Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-01Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - mq-deadline accounting improvements (Bart) - blk-wbt timer fix (Andrea) - Untangle the block layer includes (Christoph) - Rework the poll support to be bio based, which will enable adding support for polling for bio based drivers (Christoph) - Block layer core support for multi-actuator drives (Damien) - blk-crypto improvements (Eric) - Batched tag allocation support (me) - Request completion batching support (me) - Plugging improvements (me) - Shared tag set improvements (John) - Concurrent queue quiesce support (Ming) - Cache bdev in ->private_data for block devices (Pavel) - bdev dio improvements (Pavel) - Block device invalidation and block size improvements (Xie) - Various cleanups, fixes, and improvements (Christoph, Jackie, Masahira, Tejun, Yu, Pavel, Zheng, me) * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits) blk-mq-debugfs: Show active requests per queue for shared tags block: improve readability of blk_mq_end_request_batch() virtio-blk: Use blk_validate_block_size() to validate block size loop: Use blk_validate_block_size() to validate block size nbd: Use blk_validate_block_size() to validate block size block: Add a helper to validate the block size block: re-flow blk_mq_rq_ctx_init() block: prefetch request to be initialized block: pass in blk_mq_tags to blk_mq_rq_ctx_init() block: add rq_flags to struct blk_mq_alloc_data block: add async version of bio_set_polled block: kill DIO_MULTI_BIO block: kill unused polling bits in __blkdev_direct_IO() block: avoid extra iter advance with async iocb block: Add independent access ranges support blk-mq: don't issue request directly in case that current is to be blocked sbitmap: silence data race warning blk-cgroup: synchronize blkg creation against policy deactivation block: refactor bio_iov_bvec_set() block: add single bio async direct IO helper ...
2021-11-01bpf: Disallow unprivileged bpf by defaultPawan Gupta
Disabling unprivileged BPF would help prevent unprivileged users from creating certain conditions required for potential speculative execution side-channel attacks on unmitigated affected hardware. A deep dive on such attacks and current mitigations is available here [0]. Sync with what many distros are currently applying already, and disable unprivileged BPF by default. An admin can enable this at runtime, if necessary, as described in 08389d888287 ("bpf: Add kconfig knob for disabling unpriv bpf by default"). [0] "BPF and Spectre: Mitigating transient execution attacks", Daniel Borkmann, eBPF Summit '21 https://ebpf.io/summit-2021-slides/eBPF_Summit_2021-Keynote-Daniel_Borkmann-BPF_and_Spectre.pdf Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/bpf/0ace9ce3f97656d5f62d11093ad7ee81190c3c25.1635535215.git.pawan.kumar.gupta@linux.intel.com
2021-11-01Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull memory folios from Matthew Wilcox: "Add memory folios, a new type to represent either order-0 pages or the head page of a compound page. This should be enough infrastructure to support filesystems converting from pages to folios. The point of all this churn is to allow filesystems and the page cache to manage memory in larger chunks than PAGE_SIZE. The original plan was to use compound pages like THP does, but I ran into problems with some functions expecting only a head page while others expect the precise page containing a particular byte. The folio type allows a function to declare that it's expecting only a head page. Almost incidentally, this allows us to remove various calls to VM_BUG_ON(PageTail(page)) and compound_head(). This converts just parts of the core MM and the page cache. For 5.17, we intend to convert various filesystems (XFS and AFS are ready; other filesystems may make it) and also convert more of the MM and page cache to folios. For 5.18, multi-page folios should be ready. The multi-page folios offer some improvement to some workloads. The 80% win is real, but appears to be an artificial benchmark (postgres startup, which isn't a serious workload). Real workloads (eg building the kernel, running postgres in a steady state, etc) seem to benefit between 0-10%. I haven't heard of any performance losses as a result of this series. Nobody has done any serious performance tuning; I imagine that tweaking the readahead algorithm could provide some more interesting wins. There are also other places where we could choose to create large folios and currently do not, such as writes that are larger than PAGE_SIZE. I'd like to thank all my reviewers who've offered review/ack tags: Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil Babka, William Kucharski, Yu Zhao and Zi Yan. I'd also like to thank those who gave feedback I incorporated but haven't offered up review tags for this part of the series: Nick Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard, Hugh Dickins, and probably a few others who I forget" * tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits) mm/writeback: Add folio_write_one mm/filemap: Add FGP_STABLE mm/filemap: Add filemap_get_folio mm/filemap: Convert mapping_get_entry to return a folio mm/filemap: Add filemap_add_folio() mm/filemap: Add filemap_alloc_folio mm/page_alloc: Add folio allocation functions mm/lru: Add folio_add_lru() mm/lru: Convert __pagevec_lru_add_fn to take a folio mm: Add folio_evictable() mm/workingset: Convert workingset_refault() to take a folio mm/filemap: Add readahead_folio() mm/filemap: Add folio_mkwrite_check_truncate() mm/filemap: Add i_blocks_per_folio() mm/writeback: Add folio_redirty_for_writepage() mm/writeback: Add folio_account_redirty() mm/writeback: Add folio_clear_dirty_for_io() mm/writeback: Add folio_cancel_dirty() mm/writeback: Add folio_account_cleaned() mm/writeback: Add filemap_dirty_folio() ...
2021-10-31Merge branches 'apple/dart', 'arm/mediatek', 'arm/renesas', 'arm/smmu', ↵Joerg Roedel
'arm/tegra', 'iommu/fixes', 'x86/amd', 'x86/vt-d' and 'core' into next
2021-10-31sched/fair: Cleanup newidle_balanceVincent Guittot
update_next_balance() uses sd->last_balance which is not modified by load_balance() so we can merge the 2 calls in one place. No functional change Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-6-vincent.guittot@linaro.org
2021-10-31sched/fair: Remove sysctl_sched_migration_cost conditionVincent Guittot
With a default value of 500us, sysctl_sched_migration_cost is significanlty higher than the cost of load_balance. Remove the condition and rely on the sd->max_newidle_lb_cost to abort newidle_balance. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-5-vincent.guittot@linaro.org
2021-10-31sched/fair: Wait before decaying max_newidle_lb_costVincent Guittot
Decay max_newidle_lb_cost only when it has not been updated for a while and ensure to not decay a recently changed value. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
2021-10-31sched/fair: Skip update_blocked_averages if we are defering load balanceVincent Guittot
In newidle_balance(), the scheduler skips load balance to the new idle cpu when the 1st sd of this_rq is: this_rq->avg_idle < sd->max_newidle_lb_cost Doing a costly call to update_blocked_averages() will not be useful and simply adds overhead when this condition is true. Check the condition early in newidle_balance() to skip update_blocked_averages() when possible. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-3-vincent.guittot@linaro.org
2021-10-31sched/fair: Account update_blocked_averages in newidle_balance costVincent Guittot
The time spent to update the blocked load can be significant depending of the complexity fo the cgroup hierarchy. Take this time into account in the cost of the 1st load balance of a newly idle cpu. Also reduce the number of call to sched_clock_cpu() and track more actual work. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20211019123537.17146-2-vincent.guittot@linaro.org