summaryrefslogtreecommitdiff
path: root/kernel/trace
AgeCommit message (Collapse)Author
2023-11-01Merge tag 'probes-v6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: "Cleanups: - kprobes: Fixes typo in kprobes samples - tracing/eprobes: Remove 'break' after return kretprobe/fprobe performance improvements: - lib: Introduce new `objpool`, which is a high performance lockless object queue. This uses per-cpu ring array to allocate/release objects from the pre-allocated object pool. Since the index of ring array is a 32bit sequential counter, we can retry to push/pop the object pointer from the ring without lock (as seq-lock does) - lib: Add an objpool test module to test the functionality and evaluate the performance under some circumstances - kprobes/fprobe: Improve kretprobe and rethook scalability performance with objpool. This improves both legacy kretprobe and fprobe exit handler (which is based on rethook) to be scalable on SMP systems. Even with 8-threads parallel test, it shows a great scalability improvement - Remove unneeded freelist.h which is replaced by objpool - objpool: Add maintainers entry for the objpool - objpool: Fix to remove unused include header lines" * tag 'probes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobes: unused header files removed MAINTAINERS: objpool added kprobes: freelist.h removed kprobes: kretprobe scalability improvement lib: objpool test module added lib: objpool added: ring-array based lockless MPMC tracing/eprobe: drop unneeded breaks samples: kprobes: Fixes a typo
2023-10-31Merge tag 'net-next-6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF: - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure: https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code: - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API: - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc: - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed: - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend" * tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits) net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos() net: mana: Use xdp_set_features_flag instead of direct assignment vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() iavf: delete the iavf client interface iavf: add a common function for undoing the interrupt scheme iavf: use unregister_netdev iavf: rely on netdev's own registered state iavf: fix the waiting time for initial reset iavf: in iavf_down, don't queue watchdog_task if comms failed iavf: simplify mutex_trylock+sleep loops iavf: fix comments about old bit locks doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name tools: ynl: introduce option to process unknown attributes or types ipvlan: properly track tx_errors netdevsim: Block until all devices are released nfp: using napi_build_skb() to replace build_skb() net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy" net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN net: dsa: microchip: Refactor switch shutdown routine for WoL preparation ...
2023-10-30Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds
Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...
2023-10-28tracing/kprobes: Fix symbol counting logic by looking at modules as wellAndrii Nakryiko
Recent changes to count number of matching symbols when creating a kprobe event failed to take into account kernel modules. As such, it breaks kprobes on kernel module symbols, by assuming there is no match. Fix this my calling module_kallsyms_on_each_symbol() in addition to kallsyms_on_each_match_symbol() to perform a proper counting. Link: https://lore.kernel.org/all/20231027233126.2073148-1-andrii@kernel.org/ Cc: Francis Laniel <flaniel@linux.microsoft.com> Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-27tracing/kprobes: Fix the description of variable length argumentsYujie Liu
Fix the following kernel-doc warnings: kernel/trace/trace_kprobe.c:1029: warning: Excess function parameter 'args' description in '__kprobe_event_gen_cmd_start' kernel/trace/trace_kprobe.c:1097: warning: Excess function parameter 'args' description in '__kprobe_event_add_fields' Refer to the usage of variable length arguments elsewhere in the kernel code, "@..." is the proper way to express it in the description. Link: https://lore.kernel.org/all/20231027041315.2613166-1-yujie.liu@intel.com/ Fixes: 2a588dd1d5d6 ("tracing: Add kprobe event command generation functions") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310190437.paI6LYJF-lkp@intel.com/ Signed-off-by: Yujie Liu <yujie.liu@intel.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/mac80211/rx.c 91535613b609 ("wifi: mac80211: don't drop all unprotected public action frames") 6c02fab72429 ("wifi: mac80211: split ieee80211_drop_unencrypted_mgmt() return value") Adjacent changes: drivers/net/ethernet/apm/xgene/xgene_enet_main.c 61471264c018 ("net: ethernet: apm: Convert to platform remove callback returning void") d2ca43f30611 ("net: xgene: Fix unused xgene_enet_of_match warning for !CONFIG_OF") net/vmw_vsock/virtio_transport.c 64c99d2d6ada ("vsock/virtio: support to send non-linear skb") 53b08c498515 ("vsock/virtio: initialize the_virtio_vsock before using VQs") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-24kprobes: unused header files removedwuqiang.matt
As kernel test robot reported, lib/test_objpool.c (trace:probes/for-next) has linux/version.h included, but version.h is not used at all. Then more unused headers are found in test_objpool.c and rethook.c, and all of them should be removed. Link: https://lore.kernel.org/all/20231023112245.6112-1-wuqiang.matt@bytedance.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310191512.vvypKU5Z-lkp@intel.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-20tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbolsFrancis Laniel
When a kprobe is attached to a function that's name is not unique (is static and shares the name with other functions in the kernel), the kprobe is attached to the first function it finds. This is a bug as the function that it is attaching to is not necessarily the one that the user wants to attach to. Instead of blindly picking a function to attach to what is ambiguous, error with EADDRNOTAVAIL to let the user know that this function is not unique, and that the user must use another unique function with an address offset to get to the function they want to attach to. Link: https://lore.kernel.org/all/20231020104250.9537-2-flaniel@linux.microsoft.com/ Cc: stable@vger.kernel.org Fixes: 413d37d1eb69 ("tracing: Add kprobe-based event tracer") Suggested-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com> Link: https://lore.kernel.org/lkml/20230819101105.b0c104ae4494a7d1f2eea742@kernel.org/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. net/mac80211/key.c 02e0e426a2fb ("wifi: mac80211: fix error path key leak") 2a8b665e6bcc ("wifi: mac80211: remove key_mtx") 7d6904bf26b9 ("Merge wireless into wireless-next") https://lore.kernel.org/all/20231012113648.46eea5ec@canb.auug.org.au/ Adjacent changes: drivers/net/ethernet/ti/Kconfig a602ee3176a8 ("net: ethernet: ti: Fix mixed module-builtin object") 98bdeae9502b ("net: cpmac: remove driver to prepare for platform removal") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19fs: create helper file_user_path() for user displayed mapped file pathAmir Goldstein
Overlayfs uses backing files with "fake" overlayfs f_path and "real" underlying f_inode, in order to use underlying inode aops for mapped files and to display the overlayfs path in /proc/<pid>/maps. In preparation for storing the overlayfs "fake" path instead of the underlying "real" path in struct backing_file, define a noop helper file_user_path() that returns f_path for now. Use the new helper in procfs and kernel logs whenever a path of a mapped file is displayed to users. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18kprobes: kretprobe scalability improvementwuqiang.matt
kretprobe is using freelist to manage return-instances, but freelist, as LIFO queue based on singly linked list, scales badly and reduces the overall throughput of kretprobed routines, especially for high contention scenarios. Here's a typical throughput test of sys_prctl (counts in 10 seconds, measured with perf stat -a -I 10000 -e syscalls:sys_enter_prctl): OS: Debian 10 X86_64, Linux 6.5rc7 with freelist HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s 1T 2T 4T 8T 16T 24T 24150045 29317964 15446741 12494489 18287272 17708768 32T 48T 64T 72T 96T 128T 16200682 13737658 11645677 11269858 10470118 9931051 This patch introduces objpool to replace freelist. objpool is a high performance queue, which can bring near-linear scalability to kretprobed routines. Tests of kretprobe throughput show the biggest ratio as 159x of original freelist. Here's the result: 1T 2T 4T 8T 16T native: 41186213 82336866 164250978 328662645 658810299 freelist: 24150045 29317964 15446741 12494489 18287272 objpool: 23926730 48010314 96125218 191782984 385091769 32T 48T 64T 96T 128T native: 1330338351 1969957941 2512291791 2615754135 2671040914 freelist: 16200682 13737658 11645677 10470118 9931051 objpool: 764481096 1147149781 1456220214 1502109662 1579015050 Testings on 96-core ARM64 output similarly, but with the biggest ratio up to 448x: OS: Debian 10 AARCH64, Linux 6.5rc7 HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s 1T 2T 4T 8T 16T native: . 30066096 63569843 126194076 257447289 505800181 freelist: 16152090 11064397 11124068 7215768 5663013 objpool: 13997541 28032100 55726624 110099926 221498787 24T 32T 48T 64T 96T native: 763305277 1015925192 1521075123 2033009392 3021013752 freelist: 5015810 4602893 3766792 3382478 2945292 objpool: 328192025 439439564 668534502 887401381 1319972072 Link: https://lore.kernel.org/all/20231017135654.82270-4-wuqiang.matt@bytedance.com/ Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-16Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17fprobe: Fix to ensure the number of active retprobes is not zeroMasami Hiramatsu (Google)
The number of active retprobes can be zero but it is not acceptable, so return EINVAL error if detected. Link: https://lore.kernel.org/all/169750018550.186853.11198884812017796410.stgit@devnote2/ Reported-by: wuqiang.matt <wuqiang.matt@bytedance.com> Closes: https://lore.kernel.org/all/20231016222103.cb9f426edc60220eabd8aa6a@kernel.org/ Fixes: 5b0ab78998e3 ("fprobe: Add exit_handler support") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-10-13bpf: Change syscall_nr type to int in struct syscall_tp_tArtem Savkov
linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support for lazy preemption")) that adds an extra member to struct trace_entry. This causes the offset of args field in struct trace_event_raw_sys_enter be different from the one in struct syscall_trace_enter: struct trace_event_raw_sys_enter { struct trace_entry ent; /* 0 12 */ /* XXX last struct has 3 bytes of padding */ /* XXX 4 bytes hole, try to pack */ long int id; /* 16 8 */ long unsigned int args[6]; /* 24 48 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ char __data[]; /* 72 0 */ /* size: 72, cachelines: 2, members: 4 */ /* sum members: 68, holes: 1, sum holes: 4 */ /* paddings: 1, sum paddings: 3 */ /* last cacheline: 8 bytes */ }; struct syscall_trace_enter { struct trace_entry ent; /* 0 12 */ /* XXX last struct has 3 bytes of padding */ int nr; /* 12 4 */ long unsigned int args[]; /* 16 0 */ /* size: 16, cachelines: 1, members: 3 */ /* paddings: 1, sum paddings: 3 */ /* last cacheline: 16 bytes */ }; This, in turn, causes perf_event_set_bpf_prog() fail while running bpf test_profiler testcase because max_ctx_offset is calculated based on the former struct, while off on the latter: 10488 if (is_tracepoint || is_syscall_tp) { 10489 int off = trace_event_get_offsets(event->tp_event); 10490 10491 if (prog->aux->max_ctx_offset > off) 10492 return -EACCES; 10493 } What bpf program is actually getting is a pointer to struct syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes the problem by aligning struct syscall_tp_t with struct syscall_trace_(enter|exit) and changing the tests to use these structs to dereference context. Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20231013054219.172920-1-asavkov@redhat.com
2023-10-10tracing/eprobe: drop unneeded breaksJulia Lawall
Drop break after return. Link: https://lore.kernel.org/all/20230928104334.41215-1-Julia.Lawall@inria.fr/ Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-09-30tracing/user_events: Align set_bit() address for all archsBeau Belgrave
All architectures should use a long aligned address passed to set_bit(). User processes can pass either a 32-bit or 64-bit sized value to be updated when tracing is enabled when on a 64-bit kernel. Both cases are ensured to be naturally aligned, however, that is not enough. The address must be long aligned without affecting checks on the value within the user process which require different adjustments for the bit for little and big endian CPUs. Add a compat flag to user_event_enabler that indicates when a 32-bit value is being used on a 64-bit kernel. Long align addresses and correct the bit to be used by set_bit() to account for this alignment. Ensure compat flags are copied during forks and used during deletion clears. Link: https://lore.kernel.org/linux-trace-kernel/20230925230829.341-2-beaub@linux.microsoft.com Link: https://lore.kernel.org/linux-trace-kernel/20230914131102.179100-1-cleger@rivosinc.com/ Cc: stable@vger.kernel.org Fixes: 7235759084a4 ("tracing/user_events: Use remote writes for event enablement") Reported-by: Clément Léger <cleger@rivosinc.com> Suggested-by: Clément Léger <cleger@rivosinc.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30tracing: relax trace_event_eval_update() execution with cond_resched()Clément Léger
When kernel is compiled without preemption, the eval_map_work_func() (which calls trace_event_eval_update()) will not be preempted up to its complete execution. This can actually cause a problem since if another CPU call stop_machine(), the call will have to wait for the eval_map_work_func() function to finish executing in the workqueue before being able to be scheduled. This problem was observe on a SMP system at boot time, when the CPU calling the initcalls executed clocksource_done_booting() which in the end calls stop_machine(). We observed a 1 second delay because one CPU was executing eval_map_work_func() and was not preempted by the stop_machine() task. Adding a call to cond_resched() in trace_event_eval_update() allows other tasks to be executed and thus continue working asynchronously like before without blocking any pending task at boot time. Link: https://lore.kernel.org/linux-trace-kernel/20230929191637.416931-1-cleger@rivosinc.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Clément Léger <cleger@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-30ring-buffer: Update "shortest_full" in pollingSteven Rostedt (Google)
It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit 42fb0a1e84ff by having the polling code check if the ring buffer had more data than what the user specified "buffer percent" had. The problem now is that the polling code did not register itself to the writer that it wanted to wait for a specific "full" value of the ring buffer. The result was that the writer would wake the polling waiter whenever there was a new event. The polling waiter would then wake up, see that there's not enough data in the ring buffer to notify user space and then go back to sleep. The next event would wake it up again. Before the polling fix was added, the code would wake up around 100 times for a hackbench 30 benchmark. After the "fix", due to the constant waking of the writer, it would wake up over 11,0000 times! It would never leave the kernel, so the user space behavior was still "correct", but this definitely is not the desired effect. To fix this, have the polling code add what it's waiting for to the "shortest_full" variable, to tell the writer not to wake it up if the buffer is not as full as it expects to be. Note, after this fix, it appears that the waiter is now woken up around 2x the times it was before (~200). This is a tremendous improvement from the 11,000 times, but I will need to spend some time to see why polling is more aggressive in its wakeups than the read blocking code. Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark") Reported-by: Julia Lawall <julia.lawall@inria.fr> Tested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-25bpf: Count missed stats in trace_call_bpfJiri Olsa
Increase misses stats in case bpf array execution is skipped because of recursion check in trace_call_bpf. Adding bpf_prog_inc_misses_counters that increase misses counts for all bpf programs in bpf_prog_array. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-5-jolsa@kernel.org
2023-09-25bpf: Add missed value to kprobe perf link infoJiri Olsa
Add missed value to kprobe attached through perf link info to hold the stats of missed kprobe handler execution. The kprobe's missed counter gets incremented when kprobe handler is not executed due to another kprobe running on the same cpu. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org
2023-09-25bpf: Add missed value to kprobe_multi link infoJiri Olsa
Add missed value to kprobe_multi link info to hold the stats of missed kprobe_multi probe. The missed counter gets incremented when fprobe fails the recursion check or there's no rethook available for return probe. In either case the attached bpf program is not executed. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230920213145.1941596-3-jolsa@kernel.org
2023-09-25bpf: Count stats for kprobe_multi programsJiri Olsa
Adding support to gather missed stats for kprobe_multi programs due to bpf_prog_active protection. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Song Liu <song@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-2-jolsa@kernel.org
2023-09-24Merge tag 'trace-v6.6-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the "bytes" output of the per_cpu stat file The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the accounting was not accurate. It is suppose to show how many used bytes are still in the ring buffer, but even when the ring buffer was empty it would still show there were bytes used. - Fix a bug in eventfs where reading a dynamic event directory (open) and then creating a dynamic event that goes into that diretory screws up the accounting. On close, the newly created event dentry will get a "dput" without ever having a "dget" done for it. The fix is to allocate an array on dir open to save what dentries were actually "dget" on, and what ones to "dput" on close. * tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Remember what dentries were created on dir open ring-buffer: Fix bytes info in per_cpu buffer stats
2023-09-22ring-buffer: Fix bytes info in per_cpu buffer statsZheng Yejian
The 'bytes' info in file 'per_cpu/cpu<X>/stats' means the number of bytes in cpu buffer that have not been consumed. However, currently after consuming data by reading file 'trace_pipe', the 'bytes' info was not changed as expected. # cat per_cpu/cpu0/stats entries: 0 overrun: 0 commit overrun: 0 bytes: 568 <--- 'bytes' is problematical !!! oldest event ts: 8651.371479 now ts: 8653.912224 dropped events: 0 read events: 8 The root cause is incorrect stat on cpu_buffer->read_bytes. To fix it: 1. When stat 'read_bytes', account consumed event in rb_advance_reader(); 2. When stat 'entries_bytes', exclude the discarded padding event which is smaller than minimum size because it is invisible to reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE at where accounting for page-based read/remove/overrun. Also correct the comments of ring_buffer_bytes_cpu() in this patch. Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== The following pull-request contains BPF updates for your *net* tree. We've added 21 non-merge commits during the last 8 day(s) which contain a total of 21 files changed, 450 insertions(+), 36 deletions(-). The main changes are: 1) Adjust bpf_mem_alloc buckets to match ksize(), from Hou Tao. 2) Check whether override is allowed in kprobe mult, from Jiri Olsa. 3) Fix btf_id symbol generation with ld.lld, from Jiri and Nick. 4) Fix potential deadlock when using queue and stack maps from NMI, from Toke Høiland-Jørgensen. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Alan Maguire, Biju Das, Björn Töpel, Dan Carpenter, Daniel Borkmann, Eduard Zingerman, Hsin-Wei Hung, Marcus Seyfarth, Nathan Chancellor, Satya Durga Srinivasu Prabhala, Song Liu, Stephen Rothwell ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-15bpf: Fix uprobe_multi get_pid_task error pathJiri Olsa
Dan reported Smatch static checker warning due to missing error value set in uprobe multi link's get_pid_task error path. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/bpf/c5ffa7c0-6b06-40d5-aca2-63833b5cd9af@moroto.mountain/ Signed-off-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230915101420.1193800-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-13Merge tag 'trace-v6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add missing LOCKDOWN checks for eventfs callers When LOCKDOWN is active for tracing, it causes inconsistent state when some functions succeed and others fail. - Use dput() to free the top level eventfs descriptor There was a race between accesses and freeing it. - Fix a long standing bug that eventfs exposed due to changing timings by dynamically creating files. That is, If a event file is opened for an instance, there's nothing preventing the instance from being removed which will make accessing the files cause use-after-free bugs. - Fix a ring buffer race that happens when iterating over the ring buffer while writers are active. Check to make sure not to read the event meta data if it's beyond the end of the ring buffer sub buffer. - Fix the print trigger that disappeared because the test to create it was looking for the event dir field being filled, but now it has the "ef" field filled for the eventfs structure. - Remove the unused "dir" field from the event structure. - Fix the order of the trace_dynamic_info as it had it backwards for the offset and len fields for which one was for which endianess. - Fix NULL pointer dereference with eventfs_remove_rec() If an allocation fails in one of the eventfs_add_*() functions, the caller of it in event_subsystem_dir() or event_create_dir() assigns the result to the structure. But it's assigning the ERR_PTR and not NULL. This was passed to eventfs_remove_rec() which expects either a good pointer or a NULL, not ERR_PTR. The fix is to not assign the ERR_PTR to the structure, but to keep it NULL on error. - Fix list_for_each_rcu() to use list_for_each_srcu() in dcache_dir_open_wrapper(). One iteration of the code used RCU but because it had to call sleepable code, it had to be changed to use SRCU, but one of the iterations was missed. - Fix synthetic event print function to use "as_u64" instead of passing in a pointer to the union. To fix big/little endian issues, the u64 that represented several types was turned into a union to define the types properly. * tag 'trace-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec() tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper() tracing/synthetic: Print out u64 values properly tracing/synthetic: Fix order of struct trace_dynamic_info selftests/ftrace: Fix dependencies for some of the synthetic event tests tracing: Remove unused trace_event_file dir field tracing: Use the new eventfs descriptor for print trigger ring-buffer: Do not attempt to read past "commit" tracefs/eventfs: Free top level files on removal ring-buffer: Avoid softlockup in ring_buffer_resize() tracing: Have event inject files inc the trace array ref count tracing: Have option files inc the trace array ref count tracing: Have current_trace inc the trace array ref count tracing: Have tracing_max_latency inc the trace array ref count tracing: Increase trace array ref count on enable and filter files tracefs/eventfs: Use dput to free the toplevel events directory tracefs/eventfs: Add missing lockdown checks tracefs: Add missing lockdown check to tracefs_create_dir()
2023-09-12eventfs: Fix the NULL pointer dereference bug in eventfs_remove_rec()Jinjie Ruan
Inject fault while probing btrfs.ko, if kstrdup() fails in eventfs_prepare_ef() in eventfs_add_dir(), it will return ERR_PTR to assign file->ef. But the eventfs_remove() check NULL in trace_module_remove_events(), which causes the below NULL pointer dereference. As both Masami and Steven suggest, allocater side should handle the error carefully and remove it, so fix the places where it failed. Could not create tracefs 'raid56_write' directory Btrfs loaded, zoned=no, fsverity=no Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000102544000 [000000000000001c] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: btrfs(-) libcrc32c xor xor_neon raid6_pq cfg80211 rfkill 8021q garp mrp stp llc ipv6 [last unloaded: btrfs] CPU: 15 PID: 1343 Comm: rmmod Tainted: G N 6.5.0+ #40 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : eventfs_remove_rec+0x24/0xc0 lr : eventfs_remove+0x68/0x1d8 sp : ffff800082d63b60 x29: ffff800082d63b60 x28: ffffb84b80ddd00c x27: ffffb84b3054ba40 x26: 0000000000000002 x25: ffff800082d63bf8 x24: ffffb84b8398e440 x23: ffffb84b82af3000 x22: dead000000000100 x21: dead000000000122 x20: ffff800082d63bf8 x19: fffffffffffffff4 x18: ffffb84b82508820 x17: 0000000000000000 x16: 0000000000000000 x15: 000083bc876a3166 x14: 000000000000006d x13: 000000000000006d x12: 0000000000000000 x11: 0000000000000001 x10: 00000000000017e0 x9 : 0000000000000001 x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffffb84b84289804 x5 : 0000000000000000 x4 : 9696969696969697 x3 : ffff33a5b7601f38 x2 : 0000000000000000 x1 : ffff800082d63bf8 x0 : fffffffffffffff4 Call trace: eventfs_remove_rec+0x24/0xc0 eventfs_remove+0x68/0x1d8 remove_event_file_dir+0x88/0x100 event_remove+0x140/0x15c trace_module_notify+0x1fc/0x230 notifier_call_chain+0x98/0x17c blocking_notifier_call_chain+0x4c/0x74 __arm64_sys_delete_module+0x1a4/0x298 invoke_syscall+0x44/0x100 el0_svc_common.constprop.1+0x68/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x3c/0xc4 el0t_64_sync_handler+0xa0/0xc4 el0t_64_sync+0x174/0x178 Code: 5400052c a90153b3 aa0003f3 aa0103f4 (f9401400) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: 0x384b00c00000 from 0xffff800080000000 PHYS_OFFSET: 0xffffcc5b80000000 CPU features: 0x88000203,3c020000,1000421b Memory Limit: none Rebooting in 1 seconds.. Link: https://lore.kernel.org/linux-trace-kernel/20230912134752.1838524-1-ruanjinjie@huawei.com Link: https://lore.kernel.org/all/20230912025808.668187-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230911052818.1020547-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230909072817.182846-1-ruanjinjie@huawei.com/ Link: https://lore.kernel.org/all/20230908074816.3724716-1-ruanjinjie@huawei.com/ Cc: Ajay Kaher <akaher@vmware.com> Fixes: 5bdcd5f5331a ("eventfs: Implement removal of meta data from eventfs") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-11tracing/synthetic: Print out u64 values properlyTero Kristo
The synth traces incorrectly print pointer to the synthetic event values instead of the actual value when using u64 type. Fix by addressing the contents of the union properly. Link: https://lore.kernel.org/linux-trace-kernel/20230911141704.3585965-1-tero.kristo@linux.intel.com Fixes: ddeea494a16f ("tracing/synthetic: Use union instead of casts") Cc: stable@vger.kernel.org Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08tracing: Remove unused trace_event_file dir fieldSteven Rostedt (Google)
Now that eventfs structure is used to create the events directory via the eventfs dynamically allocate code, the "dir" field of the trace_event_file structure is no longer used. Remove it. Link: https://lkml.kernel.org/r/20230908022001.580400115@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08tracing: Use the new eventfs descriptor for print triggerSteven Rostedt (Google)
The check to create the print event "trigger" was using the obsolete "dir" value of the trace_event_file to determine if it should create the trigger or not. But that value will now be NULL because it uses the event file descriptor. Change it to test the "ef" field of the trace_event_file structure so that the trace_marker "trigger" file appears again. Link: https://lkml.kernel.org/r/20230908022001.371815239@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Fixes: 27152bceea1df ("eventfs: Move tracing/events to eventfs") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08ring-buffer: Do not attempt to read past "commit"Steven Rostedt (Google)
When iterating over the ring buffer while the ring buffer is active, the writer can corrupt the reader. There's barriers to help detect this and handle it, but that code missed the case where the last event was at the very end of the page and has only 4 bytes left. The checks to detect the corruption by the writer to reads needs to see the length of the event. If the length in the first 4 bytes is zero then the length is stored in the second 4 bytes. But if the writer is in the process of updating that code, there's a small window where the length in the first 4 bytes could be zero even though the length is only 4 bytes. That will cause rb_event_length() to read the next 4 bytes which could happen to be off the allocated page. To protect against this, fail immediately if the next event pointer is less than 8 bytes from the end of the commit (last byte of data), as all events must be a minimum of 8 bytes anyway. Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-08bpf: Add override check to kprobe multi link attachJiri Olsa
Currently the multi_kprobe link attach does not check error injection list for programs with bpf_override_return helper and allows them to attach anywhere. Adding the missing check. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20230907200652.926951-1-jolsa@kernel.org
2023-09-07ring-buffer: Avoid softlockup in ring_buffer_resize()Zheng Yejian
When user resize all trace ring buffer through file 'buffer_size_kb', then in ring_buffer_resize(), kernel allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and there are many buffer pages to be allocated, it may not give up cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer allocation. Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have event inject files inc the trace array ref countSteven Rostedt (Google)
The event inject files add events for a specific trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when a event inject file is opened. Link: https://lkml.kernel.org/r/20230907024804.292337868@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 6c3edaf9fd6a ("tracing: Introduce trace event injection") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have option files inc the trace array ref countSteven Rostedt (Google)
The option files update the options for a given trace array. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when an option file is opened. Link: https://lkml.kernel.org/r/20230907024804.086679464@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have current_trace inc the trace array ref countSteven Rostedt (Google)
The current_trace updates the trace array tracer. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace array when current_trace is opened. Link: https://lkml.kernel.org/r/20230907024803.877687227@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Have tracing_max_latency inc the trace array ref countSteven Rostedt (Google)
The tracing_max_latency file points to the trace_array max_latency field. For an instance, if the file is opened and the instance is deleted, reading or writing to the file will cause a use after free. Up the ref count of the trace_array when tracing_max_latency is opened. Link: https://lkml.kernel.org/r/20230907024803.666889383@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-07tracing: Increase trace array ref count on enable and filter filesSteven Rostedt (Google)
When the trace event enable and filter files are opened, increment the trace array ref counter, otherwise they can be accessed when the trace array is being deleted. The ref counter keeps the trace array from being deleted while those files are opened. Link: https://lkml.kernel.org/r/20230907024803.456187066@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: 8530dec63e7b4 ("tracing: Add tracing_check_open_get_tr()") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-02Merge tag 'probes-v6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - kprobes: use struct_size() for variable size kretprobe_instance data structure. - eprobe: Simplify trace_eprobe list iteration. - probe events: Data structure field access support on BTF argument. - Update BTF argument support on the functions in the kernel loadable modules (only loaded modules are supported). - Move generic BTF access function (search function prototype and get function parameters) to a separated file. - Add a function to search a member of data structure in BTF. - Support accessing BTF data structure member from probe args by C-like arrow('->') and dot('.') operators. e.g. 't sched_switch next=next->pid vruntime=next->se.vruntime' - Support accessing BTF data structure member from $retval. e.g. 'f getname_flags%return +0($retval->name):string' - Add string type checking if BTF type info is available. This will reject if user specify ":string" type for non "char pointer" type. - Automatically assume the fprobe event as a function return event if $retval is used. - selftests/ftrace: Add BTF data field access test cases. - Documentation: Update fprobe event example with BTF data field. * tag 'probes-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: Documentation: tracing: Update fprobe event example with BTF field selftests/ftrace: Add BTF fields access testcases tracing/fprobe-event: Assume fprobe is a return event by $retval tracing/probes: Add string type check with BTF tracing/probes: Support BTF field access from $retval tracing/probes: Support BTF based data structure field access tracing/probes: Add a function to search a member of a struct/union tracing/probes: Move finding func-proto API and getting func-param API to trace_btf tracing/probes: Support BTF argument on module functions tracing/eprobe: Iterate trace_eprobe directly kernel: kprobes: Use struct_size()
2023-09-01tracing/filters: Fix coding style issuesValentin Schneider
Recent commits have introduced some coding style issues, fix those up. Link: https://lkml.kernel.org/r/20230901151039.125186-5-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01tracing/filters: Change parse_pred() cpulist ternary into an if blockValentin Schneider
Review comments noted that an if block would be clearer than a ternary, so swap it out. No change in behaviour intended Link: https://lkml.kernel.org/r/20230901151039.125186-4-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01tracing/filters: Fix double-free of struct filter_pred.maskValentin Schneider
When a cpulist filter is found to contain a single CPU, that CPU is saved as a scalar and the backing cpumask storage is freed. Also NULL the mask to avoid a double-free once we get down to free_predicate(). Link: https://lkml.kernel.org/r/20230901151039.125186-3-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01tracing/filters: Fix error-handling of cpulist parsing bufferValentin Schneider
parse_pred() allocates a string buffer to parse the user-provided cpulist, but doesn't check the allocation result nor does it free the buffer once it is no longer needed. Add an allocation check, and free the buffer as soon as it is no longer needed. Link: https://lkml.kernel.org/r/20230901151039.125186-2-vschneid@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Reported-by: Steven Rostedt <rostedt@goodmis.org> Reported-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01tracing: Zero the pipe cpumask on alloc to avoid spurious -EBUSYBrian Foster
The pipe cpumask used to serialize opens between the main and percpu trace pipes is not zeroed or initialized. This can result in spurious -EBUSY returns if underlying memory is not fully zeroed. This has been observed by immediate failure to read the main trace_pipe file on an otherwise newly booted and idle system: # cat /sys/kernel/debug/tracing/trace_pipe cat: /sys/kernel/debug/tracing/trace_pipe: Device or resource busy Zero the allocation of pipe_cpumask to avoid the problem. Link: https://lore.kernel.org/linux-trace-kernel/20230831125500.986862-1-bfoster@redhat.com Cc: stable@vger.kernel.org Fixes: c2489bb7e6be ("tracing: Introduce pipe_cpumask to avoid race on trace_pipes") Reviewed-by: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01ftrace: Use LIST_HEAD to initialize clear_hashRuan Jinjie
Use LIST_HEAD() to initialize clear_hash instead of open-coding it. Link: https://lore.kernel.org/linux-trace-kernel/20230809071551.913041-1-ruanjinjie@huawei.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01ftrace: Use within_module to check rec->ip within specified module.Levi Yun
within_module_core && within_module_init condition is same to within module but it's more readable. Use within_module instead of former condition to check rec->ip within specified module area or not. Link: https://lore.kernel.org/linux-trace-kernel/20230803205236.32201-1-ppbuk5246@gmail.com Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01tracing: Fix race issue between cpu buffer write and swapZheng Yejian
Warning happened in rb_end_commit() at code: if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing))) WARNING: CPU: 0 PID: 139 at kernel/trace/ring_buffer.c:3142 rb_commit+0x402/0x4a0 Call Trace: ring_buffer_unlock_commit+0x42/0x250 trace_buffer_unlock_commit_regs+0x3b/0x250 trace_event_buffer_commit+0xe5/0x440 trace_event_buffer_reserve+0x11c/0x150 trace_event_raw_event_sched_switch+0x23c/0x2c0 __traceiter_sched_switch+0x59/0x80 __schedule+0x72b/0x1580 schedule+0x92/0x120 worker_thread+0xa0/0x6f0 It is because the race between writing event into cpu buffer and swapping cpu buffer through file per_cpu/cpu0/snapshot: Write on CPU 0 Swap buffer by per_cpu/cpu0/snapshot on CPU 1 -------- -------- tracing_snapshot_write() [...] ring_buffer_lock_reserve() cpu_buffer = buffer->buffers[cpu]; // 1. Suppose find 'cpu_buffer_a'; [...] rb_reserve_next_event() [...] ring_buffer_swap_cpu() if (local_read(&cpu_buffer_a->committing)) goto out_dec; if (local_read(&cpu_buffer_b->committing)) goto out_dec; buffer_a->buffers[cpu] = cpu_buffer_b; buffer_b->buffers[cpu] = cpu_buffer_a; // 2. cpu_buffer has swapped here. rb_start_commit(cpu_buffer); if (unlikely(READ_ONCE(cpu_buffer->buffer) != buffer)) { // 3. This check passed due to 'cpu_buffer->buffer' [...] // has not changed here. return NULL; } cpu_buffer_b->buffer = buffer_a; cpu_buffer_a->buffer = buffer_b; [...] // 4. Reserve event from 'cpu_buffer_a'. ring_buffer_unlock_commit() [...] cpu_buffer = buffer->buffers[cpu]; // 5. Now find 'cpu_buffer_b' !!! rb_commit(cpu_buffer) rb_end_commit() // 6. WARN for the wrong 'committing' state !!! Based on above analysis, we can easily reproduce by following testcase: ``` bash #!/bin/bash dmesg -n 7 sysctl -w kernel.panic_on_warn=1 TR=/sys/kernel/tracing echo 7 > ${TR}/buffer_size_kb echo "sched:sched_switch" > ${TR}/set_event while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & while [ true ]; do echo 1 > ${TR}/per_cpu/cpu0/snapshot done & ``` To fix it, IIUC, we can use smp_call_function_single() to do the swap on the target cpu where the buffer is located, so that above race would be avoided. Link: https://lore.kernel.org/linux-trace-kernel/20230831132739.4070878-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Fixes: f1affcaaa861 ("tracing: Add snapshot in the per_cpu trace directories") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01tracing: Remove extra space at the end of hwlat_detector/modeMikhail Kobuk
Space is printed after each mode value including the last one: $ echo \"$(sudo cat /sys/kernel/tracing/hwlat_detector/mode)\" "none [round-robin] per-cpu " Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lore.kernel.org/linux-trace-kernel/20230825103432.7750-1-m.kobuk@ispras.ru Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 8fa826b7344d ("trace/hwlat: Implement the mode config option") Signed-off-by: Mikhail Kobuk <m.kobuk@ispras.ru> Reviewed-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-09-01Merge tag 'trace-v6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "User visible changes: - Added a way to easier filter with cpumasks: # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter - Show actual size of ring buffer after modifying the ring buffer size via buffer_size_kb. Currently it just returns what was written, but the actual size rounds up to the sub buffer size. Show that real size instead. Major changes: - Added "eventfs". This is the code that handles the inodes and dentries of tracefs/events directory. As there are thousands of events, and each event has several inodes and dentries that currently exist even when tracing is never used, they take up precious memory. Instead, eventfs will allocate the inodes and dentries in a JIT way (similar to what procfs does). There is now metadata that handles the events and subdirectories, and will create the inodes and dentries when they are used. Note, I also have patches that remove the subdirectory meta data, but will wait till the next merge window before applying them. It's a little more complex, and I want to make sure the dynamic code works properly before adding more complexity, making it easier to revert if need be. Minor changes: - Optimization to user event list traversal - Remove intermediate permission of tracefs files (note the intermediate permission removes all access to the files so it is not a security concern, but just a clean up) - Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic - Other minor cleanups" * tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits) tracefs: Remove kerneldoc from struct eventfs_file tracefs: Avoid changing i_mode to a temp value tracing/user_events: Optimize safe list traversals ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon() tracing: Remove unused function declarations tracing/filters: Document cpumask filtering tracing/filters: Further optimise scalar vs cpumask comparison tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU tracing/filters: Enable filtering the CPU common field by a cpumask tracing/filters: Enable filtering a scalar field by a cpumask tracing/filters: Enable filtering a cpumask field by another cpumask tracing/filters: Dynamically allocate filter_pred.regex test: ftrace: Fix kprobe test for eventfs eventfs: Move tracing/events to eventfs eventfs: Implement removal of meta data from eventfs eventfs: Implement functions to create files and dirs when accessed eventfs: Implement eventfs lookup, read, open functions eventfs: Implement eventfs file add functions ...