Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series
in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
consistency to the APIs and behaviour of these two core allocation
functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes -
mode code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No
functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a
little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
Butt. This is a feature, it adds new feilds to /proc/vmstat such as
$ grep kstack /proc/vmstat
kstack_1k 3
kstack_2k 188
kstack_4k 11391
kstack_8k 243
kstack_16k 0
which tells us that 11391 processes used 4k of stack while none at
all used 16k. Useful for some system tuning things, but
partivularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel
Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3
independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
David Hildenbrand. Improves PTE/PMD splitlock detection, makes
powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand.
Some folio conversions which make arch_make_page_accessible()
unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
Finkel. Cleans up and fixes our handling of the resetting of the
cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo
Stoakes. Rationalizaion and encapsulation of the VMA manipulation
APIs. With a view to better enable testing of the VMA functions,
even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
issues in the zswap global shrinker, resulting in improved
performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand.
Code cleanups and rationalizations (conversion to folio_walk())
resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat
Pham. Some tuning to improve zswap's dynamic shrinker. Significant
reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill
Shutemov. Improvements to the new unaccepted memory feature,
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
DAX PUDs. This was missing, although nobody seems to have notied
yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha
Kumar. Cleanups and modest performance improvements for the maple
tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt.
Adds various warnings telling users that memcg v1 features are
deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from
Chris Li. Greatly improves the success rate of the mTHP swap
allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various
disparate per-arch implementations of numa_memblk code into generic
code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin
Wang. With this series we no longer split shmem large folios into
simgle-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for
khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew
Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy
page flags are now folio flags, so the page-based flags and their
accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama
Arif. An optimization which permits us to avoid writing/reading
zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
window which occurs when a MAP_FIXED operqtion is occurring during
an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
the vma_merge() functionality, making ot cleaner, more testable and
better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts.
Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song.
Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with
per-context one" from SeongJae Park. DAMON histogram
rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from
SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
improve related doc and warn" from Jason Wang: fixes usage of page
allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped
area" from Mark Brown. Fix up the various arch_get_unmapped_area()
implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()"
from Huang Ying. Fix a bug in region_intersects() for systems with
CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
a couple more code paths to correctly recover from the encountering
of poisoned memry.
- "mm: enable large folios swap-in support" from Barry Song. Support
the swapin of mTHP memory into appropriately-sized folios, rather
than into single-page folios"
* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
zram: free secondary algorithms names
uprobes: turn xol_area->pages[2] into xol_area->page
uprobes: introduce the global struct vm_special_mapping xol_mapping
Revert "uprobes: use vm_special_mapping close() functionality"
mm: support large folios swap-in for sync io devices
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
set_memory: add __must_check to generic stubs
mm/vma: return the exact errno in vms_gather_munmap_vmas()
memcg: cleanup with !CONFIG_MEMCG_V1
mm/show_mem.c: report alloc tags in human readable units
mm: support poison recovery from copy_present_page()
mm: support poison recovery from do_cow_fault()
resource, kunit: add test case for region_intersects()
resource: make alloc_free_mem_region() works for iomem_resource
mm: z3fold: deprecate CONFIG_Z3FOLD
vfio/pci: implement huge_fault support
mm/arm64: support large pfn mappings
mm/x86: support large pfn mappings
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot
de Oliveira's last major contribution to the kernel:
"SCHED_DEADLINE servers can help fixing starvation issues of low
priority tasks (e.g., SCHED_OTHER) when higher priority tasks
monopolize CPU cycles. Today we have RT Throttling; DEADLINE
servers should be able to replace and improve that."
(Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef
Esmat, Huang Shijie)
- Preparatory changes for sched_ext integration:
- Use set_next_task(.first) where required
- Fix up set_next_task() implementations
- Clean up DL server vs. core sched
- Split up put_prev_task_balance()
- Rework pick_next_task()
- Combine the last put_prev_task() and the first set_next_task()
- Rework dl_server
- Add put_prev_task(.next)
(Peter Zijlstra, with a fix by Tejun Heo)
- Complete the EEVDF transition and refine EEVDF scheduling:
- Implement delayed dequeue
- Allow shorter slices to wakeup-preempt
- Use sched_attr::sched_runtime to set request/slice suggestion
- Document the new feature flags
- Remove unused and duplicate-functionality fields
- Simplify & unify pick_next_task_fair()
- Misc debuggability enhancements
(Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin
Schneider and Chuyi Zhou)
- Initialize the vruntime of a new task when it is first enqueued,
resulting in significant decrease in latency of newly woken tasks
(Zhang Qiao)
- Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
(K Prateek Nayak, Peter Zijlstra)
- Clean up and clarify the usage of Clean up usage of rt_task()
(Qais Yousef)
- Preempt SCHED_IDLE entities in strict cgroup hierarchies
(Tianchen Ding)
- Clarify the documentation of time units for deadline scheduler
parameters (Christian Loehle)
- Remove the HZ_BW chicken-bit feature flag introduced a year ago,
the original change seems to be working fine (Phil Auld)
- Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie,
Peilin He, Qais Yousefm and Vincent Guittot)
* tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
sched/cpufreq: Use NSEC_PER_MSEC for deadline task
cpufreq/cppc: Use NSEC_PER_MSEC for deadline task
sched/deadline: Clarify nanoseconds in uapi
sched/deadline: Convert schedtool example to chrt
sched/debug: Fix the runnable tasks output
sched: Fix sched_delayed vs sched_core
kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
kthread: Fix task state in kthread worker if being frozen
sched/pelt: Use rq_clock_task() for hw_pressure
sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
sched: Add put_prev_task(.next)
sched: Rework dl_server
sched: Combine the last put_prev_task() and the first set_next_task()
sched: Rework pick_next_task()
sched: Split up put_prev_task_balance()
sched: Clean up DL server vs core sched
sched: Fixup set_next_task() implementations
sched: Use set_next_task(.first) where required
sched/fair: Properly deactivate sched_delayed task upon class change
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"12 hotfixes, 11 of which are cc:stable.
Four fixes for longstanding ocfs2 issues and the remainder address
random MM things"
* tag 'mm-hotfixes-stable-2024-09-19-00-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/madvise: process_madvise() drop capability check if same mm
mm/huge_memory: ensure huge_zero_folio won't have large_rmappable flag set
mm/hugetlb.c: fix UAF of vma in hugetlb fault pathway
mm: change vmf_anon_prepare() to __vmf_anon_prepare()
resource: fix region_intersects() vs add_memory_driver_managed()
zsmalloc: use unique zsmalloc caches names
mm/damon/vaddr: protect vma traversal in __damon_va_thre_regions() with rcu read lock
mm: vmscan.c: fix OOM on swap stress test
ocfs2: cancel dqi_sync_work before freeing oinfo
ocfs2: fix possible null-ptr-deref in ocfs2_set_buffer_uptodate
ocfs2: remove unreasonable unlock in ocfs2_read_blocks
ocfs2: fix null-ptr-deref when journal load failed.
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- support DMA zones for arm64 systems where memory starts at > 4GB
(Baruch Siach, Catalin Marinas)
- support direct calls into dma-iommu and thus obsolete dma_map_ops for
many common configurations (Leon Romanovsky)
- add DMA-API tracing (Sean Anderson)
- remove the not very useful return value from various dma_set_* APIs
(Christoph Hellwig)
- misc cleanups and minor optimizations (Chen Y, Yosry Ahmed, Christoph
Hellwig)
* tag 'dma-mapping-6.12-2024-09-19' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: reflow dma_supported
dma-mapping: reliably inform about DMA support for IOMMU
dma-mapping: add tracing for dma-mapping API calls
dma-mapping: use IOMMU DMA calls for common alloc/free page calls
dma-direct: optimize page freeing when it is not addressable
dma-mapping: clearly mark DMA ops as an architecture feature
vdpa_sim: don't select DMA_OPS
arm64: mm: keep low RAM dma zone
dma-mapping: don't return errors from dma_set_max_seg_size
dma-mapping: don't return errors from dma_set_seg_boundary
dma-mapping: don't return errors from dma_set_min_align_mask
scsi: check that busses support the DMA API before setting dma parameters
arm64: mm: fix DMA zone when dma-ranges is missing
dma-mapping: direct calls for dma-iommu
dma-mapping: call ->unmap_page and ->unmap_sg unconditionally
arm64: support DMA zone above 4GB
dma-mapping: replace zone_dma_bits by zone_dma_limit
dma-mapping: use bit masking to check VM_DMA_COHERENT
|
|
Pull drm updates from Dave Airlie:
"This adds a couple of patches outside the drm core, all should be
acked appropriately, the string and pstore ones are the main ones that
come to mind.
Otherwise it's the usual drivers, xe is getting enabled by default on
some new hardware, we've changed the device number handling to allow
more devices, and we added some optional rust code to create QR codes
in the panic handler, an idea first suggested I think 10 years ago :-)
string:
- add mem_is_zero()
core:
- support more device numbers
- use XArray for minor ids
- add backlight constants
- Split dma fence array creation into alloc and arm
fbdev:
- remove usage of old fbdev hooks
kms:
- Add might_fault() to drm_modeset_lock priming
- Add dynamic per-crtc vblank configuration support
dma-buf:
- docs cleanup
buddy:
- Add start address support for trim function
printk:
- pass description to kmsg_dump
scheduler:
- Remove full_recover from drm_sched_start
ttm:
- Make LRU walk restartable after dropping locks
- Allow direct reclaim to allocate local memory
panic:
- add display QR code (in rust)
displayport:
- mst: GUID improvements
bridge:
- Silence error message on -EPROBE_DEFER
- analogix: Clean aup
- bridge-connector: Fix double free
- lt6505: Disable interrupt when powered off
- tc358767: Make default DP port preemphasis configurable
- lt9611uxc: require DRM_BRIDGE_ATTACH_NO_CONNECTOR
- anx7625: simplify OF array handling
- dw-hdmi: simplify clock handling
- lontium-lt8912b: fix mode validation
- nwl-dsi: fix mode vsync/hsync polarity
xe:
- Enable LunarLake and Battlemage support
- Introducing Xe2 ccs modifiers for integrated and discrete graphics
- rename xe perf to xe observation
- use wb caching on DGFX for system memory
- add fence timeouts
- Lunar Lake graphics/media/display workarounds
- Battlemage workarounds
- Battlemage GSC support
- GSC and HuC fw updates for LL/BM
- use dma_fence_chain_free
- refactor hw engine lookup and mmio access
- enable priority mem read for Xe2
- Add first GuC BMG fw
- fix dma-resv lock
- Fix DGFX display suspend/resume
- Use xe_managed for kernel BOs
- Use reserved copy engine for user binds on faulting devices
- Allow mixing dma-fence jobs and long-running faulting jobs
- fix media TLB invalidation
- fix rpm in TTM swapout path
- track resources and VF state by PF
i915:
- Type-C programming fix for MTL+
- FBC cleanup
- Calc vblank delay more accurately
- On DP MST, Enable LT fallback for UHBR<->non-UHBR rates
- Fix DP LTTPR detection
- limit relocations to INT_MAX
- fix long hangs in buddy allocator on DG2/A380
amdgpu:
- Per-queue reset support
- SDMA devcoredump support
- DCN 4.0.1 updates
- GFX12/VCN4/JPEG4 updates
- Convert vbios embedded EDID to drm_edid
- GFX9.3/9.4 devcoredump support
- process isolation framework for GFX 9.4.3/4
- take IOMMU mappings into account for P2P DMA
amdkfd:
- CRIU fixes
- HMM fix
- Enable process isolation support for GFX 9.4.3/4
- Allow users to target recommended SDMA engines
- KFD support for targetting queues on recommended SDMA engines
radeon:
- remove .load and drm_dev_alloc
- Fix vbios embedded EDID size handling
- Convert vbios embedded EDID to drm_edid
- Use GEM references instead of TTM
- r100 cp init cleanup
- Fix potential overflows in evergreen CS offset tracking
msm:
- DPU:
- implement DP/PHY mapping on SC8180X
- Enable writeback on SM8150, SC8180X, SM6125, SM6350
- DP:
- Enable widebus on all relevant chipsets
- MSM8998 HDMI support
- GPU:
- A642L speedbin support
- A615/A306/A621 support
- A7xx devcoredump support
ast:
- astdp: Support AST2600 with VGA
- Clean up HPD
- Fix timeout loop for DP link training
- reorganize output code by type (VGA, DP, etc)
- convert to struct drm_edid
- fix BMC handling for all outputs
exynos:
- drop stale MAINTAINERS pattern
- constify struct
loongson:
- use GEM refcount over TTM
mgag200:
- Improve BMC handling
- Support VBLANK intterupts
- transparently support BMC outputs
nouveau:
- Refactor and clean up internals
- Use GEM refcount over TTM's
gm12u320:
- convert to struct drm_edid
gma500:
- update i2c terms
lcdif:
- pixel clock fix
host1x:
- fix syncpoint IRQ during resume
- use iommu_paging_domain_alloc()
imx:
- ipuv3: convert to struct drm_edid
omapdrm:
- improve error handling
- use common helper for_each_endpoint_of_node()
panel:
- add support for BOE TV101WUM-LL2 plus DT bindings
- novatek-nt35950: improve error handling
- nv3051d: improve error handling
- panel-edp:
- add support for BOE NE140WUM-N6G
- revert support for SDC ATNA45AF01
- visionox-vtdr6130:
- improve error handling
- use devm_regulator_bulk_get_const()
- boe-th101mb31ig002:
- Support for starry-er88577 MIPI-DSI panel plus DT
- Fix porch parameter
- edp: Support AOU B116XTN02.3, AUO B116XAN06.1, AOU B116XAT04.1, BOE
NV140WUM-N41, BOE NV133WUM-N63, BOE NV116WHM-A4D, CMN N116BCA-EA2,
CMN N116BCP-EA2, CSW MNB601LS1-4
- himax-hx8394: Support Microchip AC40T08A MIPI Display panel plus DT
- ilitek-ili9806e: Support Densitron DMT028VGHMCMI-1D TFT plus DT
- jd9365da:
- Support Melfas lmfbx101117480 MIPI-DSI panel plus DT
- Refactor for code sharing
- panel-edp: fix name for HKC MB116AN01
- jd9365da: fix "exit sleep" commands
- jdi-fhd-r63452: simplify error handling with DSI multi-style
helpers
- mantix-mlaf057we51: simplify error handling with DSI multi-style
helpers
- simple:
- support Innolux G070ACE-LH3 plus DT bindings
- support On Tat Industrial Company KD50G21-40NT-A1 plus DT
bindings
- st7701:
- decouple DSI and DRM code
- add SPI support
- support Anbernic RG28XX plus DT bindings
mediatek:
- support alpha blending
- remove cl in struct cmdq_pkt
- ovl adaptor fix
- add power domain binding for mediatek DPI controller
renesas:
- rz-du: add support for RZ/G2UL plus DT bindings
rockchip:
- Improve DP sink-capability reporting
- dw_hdmi: Support 4k@60Hz
- vop:
- Support RGB display on Rockchip RK3066
- Support 4096px width
sti:
- convert to struct drm_edid
stm:
- Avoid UAF wih managed plane and CRTC helpers
- Fix module owner
- Fix error handling in probe
- Depend on COMMON_CLK
- ltdc:
- Fix transparency after disabling plane
- Remove unused interrupt
tegra:
- gr3d: improve PM domain handling
- convert to struct drm_edid
- Call drm_atomic_helper_shutdown()
vc4:
- fix PM during detect
- replace DRM_ERROR() with drm_error()
- v3d: simplify clock retrieval
v3d:
- Clean up perfmon
virtio:
- add DRM capset"
* tag 'drm-next-2024-09-19' of https://gitlab.freedesktop.org/drm/kernel: (1326 commits)
drm/xe: Fix missing conversion to xe_display_pm_runtime_resume
drm/xe/xe2hpg: Add Wa_15016589081
drm/xe: Don't keep stale pointer to bo->ggtt_node
drm/xe: fix missing 'xe_vm_put'
drm/xe: fix build warning with CONFIG_PM=n
drm/xe: Suppress missing outer rpm protection warning
drm/xe: prevent potential UAF in pf_provision_vf_ggtt()
drm/amd/display: Add all planes on CRTC to state for overlay cursor
drm/i915/bios: fix printk format width
drm/i915/display: Fix BMG CCS modifiers
drm/amdgpu: get rid of bogus includes of fdtable.h
drm/amdkfd: CRIU fixes
drm/amdgpu: fix a race in kfd_mem_export_dmabuf()
drm: new helper: drm_gem_prime_handle_to_dmabuf()
drm/amdgpu/atomfirmware: Silence UBSAN warning
drm/amdgpu: Fix kdoc entry in 'amdgpu_vm_cpu_prepare'
drm/amd/amdgpu: apply command submission parser for JPEG v1
drm/amd/amdgpu: apply command submission parser for JPEG v2+
drm/amd/pm: fix the pp_dpm_pcie issue on smu v14.0.2/3
drm/amd/pm: update the features set on smu v14.0.2/3
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Implement per-PMU context rescheduling to significantly improve
single-PMU performance, and related cleanups/fixes (Peter Zijlstra
and Namhyung Kim)
- Fix ancient bug resulting in a lot of events being dropped
erroneously at higher sampling frequencies (Luo Gengkun)
- uprobes enhancements:
- Implement RCU-protected hot path optimizations for better
performance:
"For baseline vs SRCU, peak througput increased from 3.7 M/s
(million uprobe triggerings per second) up to about 8 M/s. For
uretprobes it's a bit more modest with bump from 2.4 M/s to
5 M/s.
For SRCU vs RCU Tasks Trace, peak throughput for uprobes
increases further from 8 M/s to 10.3 M/s (+28%!), and for
uretprobes from 5.3 M/s to 5.8 M/s (+11%), as we have more
work to do on uretprobes side.
Even single-thread (no contention) performance is slightly
better: 3.276 M/s to 3.396 M/s (+3.5%) for uprobes, and 2.055
M/s to 2.174 M/s (+5.8%) for uretprobes."
(Andrii Nakryiko et al)
- Document mmap_lock, don't abuse get_user_pages_remote() (Oleg
Nesterov)
- Cleanups & fixes to prepare for future work:
- Remove uprobe_register_refctr()
- Simplify error handling for alloc_uprobe()
- Make uprobe_register() return struct uprobe *
- Fold __uprobe_unregister() into uprobe_unregister()
- Shift put_uprobe() from delete_uprobe() to uprobe_unregister()
- BPF: Fix use-after-free in bpf_uprobe_multi_link_attach()
(Oleg Nesterov)
- New feature & ABI extension: allow events to use PERF_SAMPLE READ
with inheritance, enabling sample based profiling of a group of
counters over a hierarchy of processes or threads (Ben Gainey)
- Intel uncore & power events updates:
- Add Arrow Lake and Lunar Lake support
- Add PERF_EV_CAP_READ_SCOPE
- Clean up and enhance cpumask and hotplug support
(Kan Liang)
- Add LNL uncore iMC freerunning support
- Use D0:F0 as a default device
(Zhenyu Wang)
- Intel PT: fix AUX snapshot handling race (Adrian Hunter)
- Misc fixes and cleanups (James Clark, Jiri Olsa, Oleg Nesterov and
Peter Zijlstra)
* tag 'perf-core-2024-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
dmaengine: idxd: Clean up cpumask and hotplug for perfmon
iommu/vt-d: Clean up cpumask and hotplug for perfmon
perf/x86/intel/cstate: Clean up cpumask and hotplug
perf: Add PERF_EV_CAP_READ_SCOPE
perf: Generic hotplug support for a PMU with a scope
uprobes: perform lockless SRCU-protected uprobes_tree lookup
rbtree: provide rb_find_rcu() / rb_find_add_rcu()
perf/uprobe: split uprobe_unregister()
uprobes: travers uprobe's consumer list locklessly under SRCU protection
uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
uprobes: protected uprobe lifetime with SRCU
uprobes: revamp uprobe refcounting and lifetime management
bpf: Fix use-after-free in bpf_uprobe_multi_link_attach()
perf/core: Fix small negative period being ignored
perf: Really fix event_function_call() locking
perf: Optimize __pmu_ctx_sched_out()
perf: Add context time freeze
perf: Fix event_function_call() locking
perf: Extract a few helpers
perf: Optimize context reschedule for single PMU cases
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- binfmt_elf: Dump smaller VMAs first in ELF cores (Brian Mak)
- binfmt_elf: mseal address zero (Jeff Xu)
- binfmt_elf, coredump: Log the reason of the failed core dumps (Roman
Kisel)
* tag 'execve-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
binfmt_elf: mseal address zero
binfmt_elf: Dump smaller VMAs first in ELF cores
binfmt_elf, coredump: Log the reason of the failed core dumps
coredump: Standartize and fix logging
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
"This time it's mostly refactoring and improving APIs for slab users in
the kernel, along with some debugging improvements.
- kmem_cache_create() refactoring (Christian Brauner)
Over the years have been growing new parameters to
kmem_cache_create() where most of them are needed only for a small
number of caches - most recently the rcu_freeptr_offset parameter.
To avoid adding new parameters to kmem_cache_create() and adjusting
all its callers, or creating new wrappers such as
kmem_cache_create_rcu(), we can now pass extra parameters using the
new struct kmem_cache_args. Not explicitly initialized fields
default to values interpreted as unused.
kmem_cache_create() is for now a wrapper that works both with the
new form: kmem_cache_create(name, object_size, args, flags) and the
legacy form: kmem_cache_create(name, object_size, align, flags,
ctor)
- kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil
Babka, Uladislau Rezki)
Since SLOB removal, kfree() is allowed for freeing objects
allocated by kmem_cache_create(). By extension kfree_rcu() as
allowed as well, which can allow converting simple call_rcu()
callbacks that only do kmem_cache_free(), as there was never a
kmem_cache_free_rcu() variant. However, for caches that can be
destroyed e.g. on module removal, the cache owners knew to issue
rcu_barrier() first to wait for the pending call_rcu()'s, and this
is not sufficient for pending kfree_rcu()'s due to its internal
batching optimizations. Ulad has provided a new
kvfree_rcu_barrier() and to make the usage less error-prone,
kmem_cache_destroy() calls it. Additionally, destroying
SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier()
synchronously instead of using an async work, because the past
motivation for async work no longer applies. Users of custom
call_rcu() callbacks should however keep calling rcu_barrier()
before cache destruction.
- Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn)
Currently, KASAN cannot catch UAFs in such caches as it is legal to
access them within a grace period, and we only track the grace
period when trying to free the underlying slab page. The new
CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual
object to be RCU-delayed, after which KASAN can poison them.
- Delayed memcg charging (Shakeel Butt)
In some cases, the memcg is uknown at allocation time, such as
receiving network packets in softirq context. With
kmem_cache_charge() these may be now charged later when the user
and its memcg is known.
- Misc fixes and improvements (Pedro Falcato, Axel Rasmussen,
Christoph Lameter, Yan Zhen, Peng Fan, Xavier)"
* tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits)
mm, slab: restore kerneldoc for kmem_cache_create()
io_uring: port to struct kmem_cache_args
slab: make __kmem_cache_create() static inline
slab: make kmem_cache_create_usercopy() static inline
slab: remove kmem_cache_create_rcu()
file: port to struct kmem_cache_args
slab: create kmem_cache_create() compatibility layer
slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args
slab: port KMEM_CACHE() to struct kmem_cache_args
slab: remove rcu_freeptr_offset from struct kmem_cache
slab: pass struct kmem_cache_args to do_kmem_cache_create()
slab: pull kmem_cache_open() into do_kmem_cache_create()
slab: pass struct kmem_cache_args to create_cache()
slab: port kmem_cache_create_usercopy() to struct kmem_cache_args
slab: port kmem_cache_create_rcu() to struct kmem_cache_args
slab: port kmem_cache_create() to struct kmem_cache_args
slab: add struct kmem_cache_args
slab: s/__kmem_cache_create/do_kmem_cache_create/g
memcg: add charging of already allocated slab objects
mm/slab: Optimize the code logic in find_mergeable()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull core dump update from Paul McKenney:
"Sleep at TASK_IDLE when waiting for application core dump
This causes the coredump_task_exit() function to sleep at TASK_IDLE,
thus preventing task-blocked splats in case of large core dumps to
slow devices"
* tag 'misc.2024.09.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
exit: Sleep at TASK_IDLE when waiting for application core dump
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull kcsan update from Paul McKenney:
"Use min() to fix Coccinelle warning.
Courtesy of Thorsten Blum"
* tag 'kcsan.2024.09.14a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
kcsan: Use min() to fix Coccinelle warning
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Neeraj Upadhyay:
"Context tracking:
- rename context tracking state related symbols and remove references
to "dynticks" in various context tracking state variables and
related helpers
- force context_tracking_enabled_this_cpu() to be inlined to avoid
leaving a noinstr section
CSD lock:
- enhance CSD-lock diagnostic reports
- add an API to provide an indication of ongoing CSD-lock stall
nocb:
- update and simplify RCU nocb code to handle (de-)offloading of
callbacks only for offline CPUs
- fix RT throttling hrtimer being armed from offline CPU
rcutorture:
- remove redundant rcu_torture_ops get_gp_completed fields
- add SRCU ->same_gp_state and ->get_comp_state functions
- add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU
polled grace periods
- add CFcommon.arch for arch-specific Kconfig options
- print number of update types in rcu_torture_write_types()
- add rcutree.nohz_full_patience_delay testing to the TREE07 scenario
- add a stall_cpu_repeat module parameter to test repeated CPU stalls
- add argument to limit number of CPUs a guest OS can use in
torture.sh
rcustall:
- abbreviate RCU CPU stall warnings during CSD-lock stalls
- Allow dump_cpu_task() to be called without disabling preemption
- defer printing stall-warning backtrace when holding rcu_node lock
srcu:
- make SRCU gp seq wrap-around faster
- add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and
->reschedule_count which are used in heuristics governing
auto-expediting of normal SRCU grace periods and
grace-period-state-machine delays
- mark idle SRCU-barrier callbacks to help identify stuck
SRCU-barrier callback
rcu tasks:
- remove RCU Tasks Rude asynchronous APIs as they are no longer used
- stop testing RCU Tasks Rude asynchronous APIs
- fix access to non-existent percpu regions
- check processor-ID assumptions during chosen CPU calculation for
callback enqueuing
- update description of rtp->tasks_gp_seq grace-period sequence
number
- add rcu_barrier_cb_is_done() to identify whether a given
rcu_barrier callback is stuck
- mark idle Tasks-RCU-barrier callbacks
- add *torture_stats_print() functions to print detailed diagnostics
for Tasks-RCU variants
- capture start time of rcu_barrier_tasks*() operation to help
distinguish a hung barrier operation from a long series of barrier
operations
refscale:
- add a TINY scenario to support tests of Tiny RCU and Tiny
SRCU
- optimize process_durations() operation
rcuscale:
- dump stacks of stalled rcu_scale_writer() instances and
grace-period statistics when rcu_scale_writer() stalls
- mark idle RCU-barrier callbacks to identify stuck RCU-barrier
callbacks
- print detailed grace-period and barrier diagnostics on
rcu_scale_writer() hangs for Tasks-RCU variants
- warn if async module parameter is specified for RCU implementations
that do not have async primitives such as RCU Tasks Rude
- make all writer tasks report upon hang
- tolerate repeated GFP_KERNEL failure in rcu_scale_writer()
- use special allocator for rcu_scale_writer()
- NULL out top-level pointers to heap memory to avoid double-free
bugs on modprobe failures
- maintain per-task instead of per-CPU callbacks count to avoid any
issues with migration of either tasks or callbacks
- constify struct ref_scale_ops
Fixes:
- use system_unbound_wq for kfree_rcu work to avoid disturbing
isolated CPUs
Misc:
- warn on unexpected rcu_state.srs_done_tail state
- better define "atomic" for list_replace_rcu() and
hlist_replace_rcu() routines
- annotate struct kvfree_rcu_bulk_data with __counted_by()"
* tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (90 commits)
rcu: Defer printing stall-warning backtrace when holding rcu_node lock
rcu/nocb: Remove superfluous memory barrier after bypass enqueue
rcu/nocb: Conditionally wake up rcuo if not already waiting on GP
rcu/nocb: Fix RT throttling hrtimer armed from offline CPU
rcu/nocb: Simplify (de-)offloading state machine
context_tracking: Tag context_tracking_enabled_this_cpu() __always_inline
context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching
rcu: Update stray documentation references to rcu_dynticks_eqs_{enter, exit}()
rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs()
rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck()
rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save()
rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snap
rcu: Rename struct rcu_data .dynticks_snap into .watching_snap
rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs()
rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since()
rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs()
rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online()
context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into rcu_is_watching_curr_cpu()
context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*()
refscale: Constify struct ref_scale_ops
...
|
|
Pull workqueue updates from Tejun Heo:
"Nothing major:
- workqueue.panic_on_stall boot param added
- alloc_workqueue_lockdep_map() added (used by DRM)
- Other cleanusp and doc updates"
* tag 'wq-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansion
workqueue: Fix another htmldocs build warning
workqueue: fix null-ptr-deref on __alloc_workqueue() error
workqueue: Don't call va_start / va_end twice
workqueue: Fix htmldocs build warning
workqueue: Add interface for user-defined workqueue lockdep map
workqueue: Change workqueue lockdep map to pointer
workqueue: Split alloc_workqueue into internal function and lockdep init
Documentation: kernel-parameters: add workqueue.panic_on_stall
workqueue: add cmdline parameter workqueue.panic_on_stall
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset isolation improvements
- cpuset cgroup1 support is split into its own file behind the new
config option CONFIG_CPUSET_V1. This makes it the second controller
which makes cgroup1 support optional after memcg
- Handling of unavailable v1 controller handling improved during
cgroup1 mount operations
- union_find applied to cpuset. It makes code simpler and more
efficient
- Reduce spurious events in pids.events
- Cleanups and other misc changes
- Contains a merge of cgroup/for-6.11-fixes to receive cpuset fixes
that further changes build upon
* tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (34 commits)
cgroup: Do not report unavailable v1 controllers in /proc/cgroups
cgroup: Disallow mounting v1 hierarchies without controller implementation
cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
cgroup/cpuset: Move cpu.h include to cpuset-internal.h
cgroup/cpuset: add sefltest for cpuset v1
cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1
cgroup/cpuset: rename functions shared between v1 and v2
cgroup/cpuset: move v1 interfaces to cpuset-v1.c
cgroup/cpuset: move validate_change_legacy to cpuset-v1.c
cgroup/cpuset: move legacy hotplug update to cpuset-v1.c
cgroup/cpuset: add callback_lock helper
cgroup/cpuset: move memory_spread to cpuset-v1.c
cgroup/cpuset: move relax_domain_level to cpuset-v1.c
cgroup/cpuset: move memory_pressure to cpuset-v1.c
cgroup/cpuset: move common code to cpuset-internal.h
cgroup/cpuset: introduce cpuset-v1.c
selftest/cgroup: Make test_cpuset_prs.sh deal with pre-isolated CPUs
cgroup/cpuset: Account for boot time isolated CPUs
cgroup/cpuset: remove use_parent_ecpus of cpuset
cgroup/cpuset: remove fetch_xcpus
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Thomas Gleixner:
"Updates for KCOV instrumentation on x86:
- Prevent spurious KCOV coverage in common_interrupt()
- Fixup the KCOV Makefile directive which got stale due to a source
file rename
- Exclude stack unwinding from KCOV as it creates large amounts of
uninteresting coverage
- Provide a self test to validate that KCOV coverage of the interrupt
handling code starts not before preempt count got updated"
* tag 'x86-build-2024-09-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Ignore stack unwinding in KCOV
module: Fix KCOV-ignored file name
kcov: Add interrupt handling self test
x86/entry: Remove unwanted instrumentation in common_interrupt()
|
|
Now that xol_mapping has its own ->fault() method we no longer need
xol_area->pages[1] == NULL, we need a single page.
Link: https://lkml.kernel.org/r/20240911131437.GC3448@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently each xol_area has its own instance of vm_special_mapping, this
is suboptimal and ugly. Kill xol_area->xol_mapping and add a single
global instance of vm_special_mapping, the ->fault() method can use
area->pages rather than xol_mapping->pages.
As a side effect this fixes the problem introduced by the recent commit
223febc6e557 ("mm: add optional close() to struct vm_special_mapping"), if
special_mapping_close() is called from the __mmput() paths, it will use
vma->vm_private_data = &area->xol_mapping freed by uprobe_clear_state().
Link: https://lkml.kernel.org/r/20240911131407.GB3448@redhat.com
Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Closes: https://lore.kernel.org/all/yt9dy149vprr.fsf@linux.ibm.com/
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commit 08e28de1160a712724268fd33d77b32f1bc84d1c.
A malicious application can munmap() its "[uprobes]" vma and in this case
xol_mapping.close == uprobe_clear_state() will free the memory which can
be used by another thread, or the same thread when it hits the uprobe bp
afterwards.
Link: https://lkml.kernel.org/r/20240911131320.GA3448@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "resource: Fix region_intersects() vs
add_memory_driver_managed()", v3.
The patchset fixes a bug of region_intersects() for systems with CXL
memory. The details of the bug can be found in [1/3]. To avoid similar
bugs in the future. A kunit test case for region_intersects() is added in
[3/3]. [2/3] is a preparation patch for [3/3].
This patch (of 3):
region_intersects() is important because it's used for /dev/mem permission
checking. To avoid possible bug of region_intersects() in the future, a
kunit test case for region_intersects() is added.
Link: https://lkml.kernel.org/r/20240906030713.204292-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20240906030713.204292-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
During developing a kunit test case for region_intersects(), some fake
resources need to be inserted into iomem_resource. To do that, a resource
hole needs to be found first in iomem_resource.
However, alloc_free_mem_region() cannot work for iomem_resource now.
Because the start address to check cannot be 0 to detect address wrapping
0 in gfr_continue(), while iomem_resource.start == 0. To make
alloc_free_mem_region() works for iomem_resource, gfr_start() is changed
to avoid to return 0 even if base->start == 0. We don't need to check 0
as start address.
Link: https://lkml.kernel.org/r/20240906030713.204292-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On a system with CXL memory, the resource tree (/proc/iomem) related to
CXL memory may look like something as follows.
490000000-50fffffff : CXL Window 0
490000000-50fffffff : region0
490000000-50fffffff : dax0.0
490000000-50fffffff : System RAM (kmem)
Because drivers/dax/kmem.c calls add_memory_driver_managed() during
onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL
Window X". This confuses region_intersects(), which expects all "System
RAM" resources to be at the top level of iomem_resource. This can lead to
bugs.
For example, when the following command line is executed to write some
memory in CXL memory range via /dev/mem,
$ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1
dd: error writing '/dev/mem': Bad address
1+0 records in
0+0 records out
0 bytes copied, 0.0283507 s, 0.0 kB/s
the command fails as expected. However, the error code is wrong. It
should be "Operation not permitted" instead of "Bad address". More
seriously, the /dev/mem permission checking in devmem_is_allowed() passes
incorrectly. Although the accessing is prevented later because ioremap()
isn't allowed to map system RAM, it is a potential security issue. During
command executing, the following warning is reported in the kernel log for
calling ioremap() on system RAM.
ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff
WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d
Call Trace:
memremap+0xcb/0x184
xlate_dev_mem_ptr+0x25/0x2f
write_mem+0x94/0xfb
vfs_write+0x128/0x26d
ksys_write+0xac/0xfe
do_syscall_64+0x9a/0xfd
entry_SYSCALL_64_after_hwframe+0x4b/0x53
The details of command execution process are as follows. In the above
resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a
top level resource. So, region_intersects() will report no System RAM
resources in the CXL memory region incorrectly, because it only checks the
top level resources. Consequently, devmem_is_allowed() will return 1
(allow access via /dev/mem) for CXL memory region incorrectly.
Fortunately, ioremap() doesn't allow to map System RAM and reject the
access.
So, region_intersects() needs to be fixed to work correctly with the
resource tree with "System RAM" not at top level as above. To fix it, if
we found a unmatched resource in the top level, we will continue to search
matched resources in its descendant resources. So, we will not miss any
matched resources in resource tree anymore.
In the new implementation, an example resource tree
|------------- "CXL Window 0" ------------|
|-- "System RAM" --|
will behave similar as the following fake resource tree for
region_intersects(, IORESOURCE_SYSTEM_RAM, ),
|-- "System RAM" --||-- "CXL Window 0a" --|
Where "CXL Window 0a" is part of the original "CXL Window 0" that
isn't covered by "System RAM".
Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk updates from Petr Mladek:
"This is the "last" part of the support for the new nbcon consoles.
Where "nbcon" stays for "No Big console lock CONsoles" aka not under
the console_lock.
New callbacks are added to struct console:
- write_thread() for flushing nbcon consoles in task context.
- write_atomic() for flushing nbcon consoles in atomic context,
including NMI.
- con->device_lock() and device_unlock() for taking the driver
specific lock, for example, port->lock.
New printk-specific kthreads are created:
- per-console kthreads which get responsible for flushing normal
priority messages on nbcon consoles.
- thread which gets responsible for flushing normal priority messages
on all consoles when CONFIG_RT enabled.
The new callbacks are called under a special per-console lock which
has already been added back in v6.7. It allows to distinguish three
severities: normal, emergency, and panic. A context with a higher
priority could take over the ownership when it is safe even in the
middle of handling a record. The panic context could do it even when
it is not safe. But it is allowed only for the final desperate flush
before entering the infinite loop.
The new lock helps to flush the messages directly in emergency and
panic contexts. But it is not enough in all situations:
- console_lock() is still need for synchronization against boot
consoles.
- con->device_lock() is need for synchronization against other
operations on the same HW, e.g. serial port speed setting,
non-printk related read/write.
The dependency on con->device_lock() is mutual. Any code taking the
driver specific lock has to acquire the related nbcon console context
as well. For example, see the new uart_port_lock() API. It provides
the necessary synchronization against emergency and panic contexts
where the messages are flushed only under the new per-console lock.
Maybe surprisingly, a quite tricky part is the decision how to flush
the consoles in various situations. It has to take into account:
- message priority: normal, emergency, panic
- scheduling context: task, atomic, deferred_legacy
- registered consoles: boot, legacy, nbcon
- threads are running: early boot, suspend, shutdown, panic
- caller: printk(), pr_flush(), printk_flush_in_panic(),
console_unlock(), console_start(), ...
The primary decision is made in printk_get_console_flush_type(). It
creates a hint what the caller should do:
- flush nbcon consoles directly or via the kthread
- call the legacy loop (console_unlock()) directly or via irq_work
The existing behavior is preserved for the legacy consoles. The only
exception is that they are not longer flushed directly from printk()
in panic() before CPUs are stopped. But this blocking happens only
when at least one nbcon console is registered. The motivation is to
increase a chance to produce the crash dump. They legacy consoles
might create a deadlock in compare with nbcon consoles. The nbcon
console should allow to see the messages even when the crash dump
fails.
There are three possible ways how nbcon consoles are flushed:
- The per-nbcon-console kthread is responsible for flushing messages
added with the normal priority. This is the default mode.
- The legacy loop, aka console_unlock(), is used when there is still
a boot console registered. There is no easy way how to match an
early console driver with a nbcon console driver. And the
console_lock() provides the only reliable serialization at the
moment.
The legacy loop uses either con->write_atomic() or
con->write_thread() callbacks depending on whether it is allowed to
schedule. The atomic variant has to be used from printk().
- In other situations, the messages are flushed directly using
write_atomic() which can be called in any context, including NMI.
It is primary needed during early boot or shutdown, in emergency
situations, and panic.
The emergency priority is used by a code called within
nbcon_cpu_emergency_enter()/exit(). At the moment, it is used in four
situations: WARN(), Oops, lockdep, and RCU stall reports.
Finally, there is no nbcon console at the moment. It means that the
changes should _not_ modify the existing behavior. The only exception
is CONFIG_RT which would force offloading the legacy loop, for normal
priority context, into the dedicated kthread"
* tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (54 commits)
printk: Avoid false positive lockdep report for legacy printing
printk: nbcon: Assign nice -20 for printing threads
printk: Implement legacy printer kthread for PREEMPT_RT
tty: sysfs: Add nbcon support for 'active'
proc: Add nbcon support for /proc/consoles
proc: consoles: Add notation to c_start/c_stop
printk: nbcon: Show replay message on takeover
printk: Provide helper for message prepending
printk: nbcon: Rely on kthreads for normal operation
printk: nbcon: Use thread callback if in task context for legacy
printk: nbcon: Relocate nbcon_atomic_emit_one()
printk: nbcon: Introduce printer kthreads
printk: nbcon: Init @nbcon_seq to highest possible
printk: nbcon: Add context to usable() and emit()
printk: Flush console on unregister_console()
printk: Fail pr_flush() if before SYSTEM_SCHEDULING
printk: nbcon: Add function for printers to reacquire ownership
printk: nbcon: Use raw_cpu_ptr() instead of open coding
printk: Use the BITS_PER_LONG macro
lockdep: Mark emergency sections in lockdep splats
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"Core:
- Overhaul of posix-timers in preparation of removing the workaround
for periodic timers which have signal delivery ignored.
- Remove the historical extra jiffie in msleep()
msleep() adds an extra jiffie to the timeout value to ensure
minimal sleep time. The timer wheel ensures minimal sleep time
since the large rewrite to a non-cascading wheel, but the extra
jiffie in msleep() remained unnoticed. Remove it.
- Make the timer slack handling correct for realtime tasks.
The procfs interface is inconsistent and does neither reflect
reality nor conforms to the man page. Show the correct 0 slack for
real time tasks and enforce it at the core level instead of having
inconsistent individual checks in various timer setup functions.
- The usual set of updates and enhancements all over the place.
Drivers:
- Allow the ACPI PM timer to be turned off during suspend
- No new drivers
- The usual updates and enhancements in various drivers"
* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
ntp: Make sure RTC is synchronized when time goes backwards
treewide: Fix wrong singular form of jiffies in comments
cpu: Use already existing usleep_range()
timers: Rename next_expiry_recalc() to be unique
platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
clocksource/drivers/jcore: Use request_percpu_irq()
clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
clocksource: acpi_pm: Add external callback for suspend/resume
clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
dt-bindings: timer: rockchip: Add rk3576 compatible
timers: Annotate possible non critical data race of next_expiry
timers: Remove historical extra jiffie for timeout in msleep()
hrtimer: Use and report correct timerslack values for realtime tasks
hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
timers: Add sparse annotation for timer_sync_wait_running().
signal: Replace BUG_ON()s
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Core:
- Remove a global lock in the affinity setting code
The lock protects a cpumask for intermediate results and the lock
causes a bottleneck on simultaneous start of multiple virtual
machines. Replace the lock and the static cpumask with a per CPU
cpumask which is nicely serialized by raw spinlock held when
executing this code.
- Provide support for giving a suffix to interrupt domain names.
That's required to support devices with subfunctions so that the
domain names are distinct even if they originate from the same
device node.
- The usual set of cleanups and enhancements all over the place
Drivers:
- Support for longarch AVEC interrupt chip
- Refurbishment of the Armada driver so it can be extended for new
variants.
- The usual set of cleanups and enhancements all over the place"
* tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits)
genirq: Use cpumask_intersects()
genirq/cpuhotplug: Use cpumask_intersects()
irqchip/apple-aic: Only access system registers on SoCs which provide them
irqchip/apple-aic: Add a new "Global fast IPIs only" feature level
irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi
dt-bindings: apple,aic: Document A7-A11 compatibles
irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy()
genirq/msi: Use kmemdup_array() instead of kmemdup()
genirq/proc: Change the return value for set affinity permission error
genirq/proc: Use irq_move_pending() in show_irq_affinity()
genirq/proc: Correctly set file permissions for affinity control files
genirq: Get rid of global lock in irq_do_set_affinity()
genirq: Fix typo in struct comment
irqchip/loongarch-avec: Add AVEC irqchip support
irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC
irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING
LoongArch: Architectural preparation for AVEC irqchip
LoongArch: Move irqchip function prototypes to irq-loongson.h
irqchip/loongson-pch-msi: Switch to MSI parent domains
softirq: Remove unused 'action' parameter from action callback
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull clocksource watchdog updates from Thomas Gleixner:
- Make the uncertainty margin handling more robust to prevent false
positives
- Clarify comments
* tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin
clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW
clocksource: Improve comments for watchdog skew bounds
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
- Prepare the core for supporting parallel hotplug on loongarch
- A small set of cleanups and enhancements
* tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Mark smp_prepare_boot_cpu() __init
cpu: Fix W=1 build kernel-doc warning
cpu/hotplug: Provide weak fallback for arch_cpuhp_init_parallel_bringup()
cpu/hotplug: Make HOTPLUG_PARALLEL independent of HOTPLUG_SMT
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
- Fix some remaining problems with PID/TGID reporting
When most users think about PIDs, what they are really thinking about
is the TGID. This commit shifts the audit PID logging and filtering
to use the TGID value which should provide a more meaningful audit
stream and filtering experience for users.
- Migrate to the str_enabled_disabled() helper
Evidently we have helper functions that help ensure if we mistype
"enabled" or "disabled" it is now caught at compile time. I guess
we're fancy now.
* tag 'audit-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: Make use of str_enabled_disabled() helper
audit: use task_tgid_nr() instead of task_pid_nr()
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual pile of misc updates:
Features:
- Add F_CREATED_QUERY fcntl() that allows userspace to query whether
a file was actually created. Often userspace wants to know whether
an O_CREATE request did actually create a file without using
O_EXCL. The current logic is that to first attempts to open the
file without O_CREAT | O_EXCL and if ENOENT is returned userspace
tries again with both flags. If that succeeds all is well. If it
now reports EEXIST it retries.
That works fairly well but some corner cases make this more
involved. If this operates on a dangling symlink the first openat()
without O_CREAT | O_EXCL will return ENOENT but the second openat()
with O_CREAT | O_EXCL will fail with EEXIST.
The reason is that openat() without O_CREAT | O_EXCL follows the
symlink while O_CREAT | O_EXCL doesn't for security reasons. So
it's not something we can really change unless we add an explicit
opt-in via O_FOLLOW which seems really ugly.
All available workarounds are really nasty (fanotify, bpf lsm etc)
so add a simple fcntl().
- Try an opportunistic lookup for O_CREAT. Today, when opening a file
we'll typically do a fast lookup, but if O_CREAT is set, the kernel
always takes the exclusive inode lock. This was likely done with
the expectation that O_CREAT means that we always expect to do the
create, but that's often not the case. Many programs set O_CREAT
even in scenarios where the file already exists (see related
F_CREATED_QUERY patch motivation above).
The series contained in the pr rearranges the pathwalk-for-open
code to also attempt a fast_lookup in certain O_CREAT cases. If a
positive dentry is found, the inode_lock can be avoided altogether
and it can stay in rcuwalk mode for the last step_into.
- Expose the 64 bit mount id via name_to_handle_at()
Now that we provide a unique 64-bit mount ID interface in statx(2),
we can now provide a race-free way for name_to_handle_at(2) to
provide a file handle and corresponding mount without needing to
worry about racing with /proc/mountinfo parsing or having to open a
file just to do statx(2).
While this is not necessary if you are using AT_EMPTY_PATH and
don't care about an extra statx(2) call, users that pass full paths
into name_to_handle_at(2) need to know which mount the file handle
comes from (to make sure they don't try to open_by_handle_at a file
handle from a different filesystem) and switching to AT_EMPTY_PATH
would require allocating a file for every name_to_handle_at(2) call
- Add a per dentry expire timeout to autofs
There are two fairly well known automounter map formats, the autofs
format and the amd format (more or less System V and Berkley).
Some time ago Linux autofs added an amd map format parser that
implemented a fair amount of the amd functionality. This was done
within the autofs infrastructure and some functionality wasn't
implemented because it either didn't make sense or required extra
kernel changes. The idea was to restrict changes to be within the
existing autofs functionality as much as possible and leave changes
with a wider scope to be considered later.
One of these changes is implementing the amd options:
1) "unmount", expire this mount according to a timeout (same as
the current autofs default).
2) "nounmount", don't expire this mount (same as setting the
autofs timeout to 0 except only for this specific mount) .
3) "utimeout=<seconds>", expire this mount using the specified
timeout (again same as setting the autofs timeout but only for
this mount)
To implement these options per-dentry expire timeouts need to be
implemented for autofs indirect mounts. This is because all map
keys (mounts) for autofs indirect mounts use an expire timeout
stored in the autofs mount super block info. structure and all
indirect mounts use the same expire timeout.
Fixes:
- Fix missing fput for FSCONFIG_SET_FD in autofs
- Use param->file for FSCONFIG_SET_FD in coda
- Delete the 'fs/netfs' proc subtreee when netfs module exits
- Make sure that struct uid_gid_map fits into a single cacheline
- Don't flush in-flight wb switches for superblocks without cgroup
writeback
- Correcting the idmapping mount example in the idmapping
documentation
- Fix a race between evice_inodes() and find_inode() and iput()
- Refine the show_inode_state() macro definition in writeback code
- Prevent dump_mapping() from accessing invalid dentry.d_name.name
- Show actual source for debugfs in /proc/mounts
- Annotate data-race of busy_poll_usecs in eventpoll
- Don't WARN for racy path_noexec check in exec code
- Handle OOM on mnt_warn_timestamp_expiry()
- Fix some spelling in the iomap design documentation
- Fix typo in procfs comment
- Fix typo in fs/namespace.c comment
Cleanups:
- Add the VFS git tree to the MAINTAINERS file
- Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode
bit in struct file bringing us to 5 free f_mode bits
- Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify
the wait mechanism
- Remove the unused path_put_init() helper
- Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi
specific
- Replace the unsigned long i_state member with a u32 i_state member
in struct inode freeing up 4 bytes in struct inode. Instead of
using the bit based wait apis we're now using the var event apis
and using the individual bytes of the i_state member to wait on
state changes
- Explain how per-syscall AT_* flags should be allocated
- Use in_group_or_capable() helper to simplify the posix acl mode
update code
- Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code
- Removed comment about d_rcu_to_refcount() as that function doesn't
exist anymore
- Add kernel documentation for lookup_fast()
- Don't re-zero evenpoll fields
- Remove outdated comment after close_fd()
- Fix imprecise wording in comment about the pipe filesystem
- Drop GFP_NOFAIL mode from alloc_page_buffers
- Missing blank line warnings and struct declaration improved in
file_table
- Annotate struct poll_list with __counted_by()
- Remove the unused read parameter in percpu-rwsem
- Remove linux/prefetch.h include from direct-io code
- Use kmemdup_array instead of kmemdup for multiple allocation in
mnt_idmapping code
- Remove unused mnt_cursor_del() declaration
Performance tweaks:
- Dodge smp_mb in break_lease and break_deleg in the common case
- Only read fops once in fops_{get,put}()
- Use RCU in ilookup()
- Elide smp_mb in iversion handling in the common case
- Drop one lock trip in evict()"
* tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits)
uidgid: make sure we fit into one cacheline
proc: Fix typo in the comment
fs/pipe: Correct imprecise wording in comment
fhandle: expose u64 mount id to name_to_handle_at(2)
uapi: explain how per-syscall AT_* flags should be allocated
fs: drop GFP_NOFAIL mode from alloc_page_buffers
writeback: Refine the show_inode_state() macro definition
fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name
mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation
netfs: Delete subtree of 'fs/netfs' when netfs module exits
fs: use LIST_HEAD() to simplify code
inode: make i_state a u32
inode: port __I_LRU_ISOLATING to var event
vfs: fix race between evice_inodes() and find_inode()&iput()
inode: port __I_NEW to var event
inode: port __I_SYNC to var event
fs: reorder i_state bits
fs: add i_state helpers
MAINTAINERS: add the VFS git tree
fs: s/__u32/u32/ for s_fsnotify_mask
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"By the number of new lines of code, the most visible change here is
the addition of hybrid CPU capacity scaling support to the
intel_pstate driver. Next are the amd-pstate driver changes related to
the calculation of the AMD boost numerator and preferred core
detection.
As far as new hardware support is concerned, the intel_idle driver
will now handle Granite Rapids Xeon processors natively, the
intel_rapl power capping driver will recognize family 1Ah of AMD
processors and Intel ArrowLake-U chipos, and intel_pstate will handle
Granite Rapids and Sierra Forest chips in the out-of-band (OOB) mode.
Apart from the above, there is a usual collection of assorted fixes
and code cleanups in many places and there are tooling updates.
Specifics:
- Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef)
- Add support for Granite Rapids and Sierra Forest in OOB mode to the
intel_pstate cpufreq driver (Srinivas Pandruvada)
- Add basic support for CPU capacity scaling on x86 and make the
intel_pstate driver set asymmetric CPU capacity on hybrid systems
without SMT (Rafael Wysocki)
- Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
driver (Jeff Johnson)
- Several OF related cleanups in cpufreq drivers (Rob Herring)
- Enable COMPILE_TEST for ARM drivers (Rob Herrring)
- Introduce quirks for syscon failures and use socinfo to get
revision for TI cpufreq driver (Dhruva Gole, Nishanth Menon)
- Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
Ugwekar)
- Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
(Danila Tikhonov, Huacai Chen, and Liu Jing)
- Make amd-pstate validate return of any attempt to update EPP
limits, which fixes the masking hardware problems (Mario
Limonciello)
- Move the calculation of the AMD boost numerator outside of
amd-pstate, correcting acpi-cpufreq on systems with preferred cores
(Mario Limonciello)
- Harden preferred core detection in amd-pstate to avoid potential
false positives (Mario Limonciello)
- Add extra unit test coverage for mode state machine (Mario
Limonciello)
- Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang
Liu)
- Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy)
- Disable promotion to C1E on Jasper Lake and Elkhart Lake in
intel_idle (Kai-Heng Feng)
- Use scoped device node handling to fix missing of_node_put() and
simplify walking OF children in the riscv-sbi cpuidle driver
(Krzysztof Kozlowski)
- Remove dead code from cpuidle_enter_state() (Dhruva Gole)
- Change an error pointer to NULL to fix error handling in the
intel_rapl power capping driver (Dan Carpenter)
- Fix off by one in get_rpi() in the intel_rapl power capping driver
(Dan Carpenter)
- Add support for ArrowLake-U to the intel_rapl power capping driver
(Sumeet Pawnikar)
- Fix the energy-pkg event for AMD CPUs in the intel_rapl power
capping driver (Dhananjay Ugwekar)
- Add support for AMD family 1Ah processors to the intel_rapl power
capping driver (Dhananjay Ugwekar)
- Remove unused stub for saveable_highmem_page() and remove
deprecated macros from power management documentation (Andy
Shevchenko)
- Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM
sysfs interface (Xueqin Luo)
- Update the maintainers information for the
operating-points-v2-ti-cpu DT binding (Dhruva Gole)
- Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring)
- Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff
Johnson)
- Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand
Moon)
- Use of_property_present() instead of of_get_property() in the
imx-bus devfreq driver (Rob Herring)
- Update directory handling and installation process in the pm-graph
Makefile and add .gitignore to ignore sleepgraph.py artifacts to
pm-graph (Amit Vadhavana, Yo-Jung Lin)
- Make cpupower display residency value in idle-info (Aboorva
Devarajan)
- Add missing powercap_set_enabled() stub function to cpupower (John
B. Wyatt IV)
- Add SWIG support to cpupower (John B. Wyatt IV)"
* tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (62 commits)
cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
cpufreq/amd-pstate-ut: Add test case for mode switches
cpufreq/amd-pstate: Export symbols for changing modes
amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
x86/amd: Move amd_get_highest_perf() out of amd-pstate
ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
ACPI: CPPC: Drop check for non zero perf ratio
x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
PM: hibernate: Remove unused stub for saveable_highmem_page()
pm:cpupower: Add error warning when SWIG is not installed
MAINTAINERS: Add Maintainers for SWIG Python bindings
pm:cpupower: Include test_raw_pylibcpupower.py
pm:cpupower: Add SWIG bindings files for libcpupower
pm:cpupower: Add missing powercap_set_enabled() stub function
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"The highlights are support for Arm's "Permission Overlay Extension"
using memory protection keys, support for running as a protected guest
on Android as well as perf support for a bunch of new interconnect
PMUs.
Summary:
ACPI:
- Enable PMCG erratum workaround for HiSilicon HIP10 and 11
platforms.
- Ensure arm64-specific IORT header is covered by MAINTAINERS.
CPU Errata:
- Enable workaround for hardware access/dirty issue on Ampere-1A
cores.
Memory management:
- Define PHYSMEM_END to fix a crash in the amdgpu driver.
- Avoid tripping over invalid kernel mappings on the kexec() path.
- Userspace support for the Permission Overlay Extension (POE) using
protection keys.
Perf and PMUs:
- Add support for the "fixed instruction counter" extension in the
CPU PMU architecture.
- Extend and fix the event encodings for Apple's M1 CPU PMU.
- Allow LSM hooks to decide on SPE permissions for physical
profiling.
- Add support for the CMN S3 and NI-700 PMUs.
Confidential Computing:
- Add support for booting an arm64 kernel as a protected guest under
Android's "Protected KVM" (pKVM) hypervisor.
Selftests:
- Fix vector length issues in the SVE/SME sigreturn tests
- Fix build warning in the ptrace tests.
Timers:
- Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
non-determinism arising from the architected counter.
Miscellaneous:
- Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
don't succeed.
- Minor fixes and cleanups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
perf: arm-ni: Fix an NULL vs IS_ERR() bug
arm64: hibernate: Fix warning for cast from restricted gfp_t
arm64: esr: Define ESR_ELx_EC_* constants as UL
arm64: pkeys: remove redundant WARN
perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
MAINTAINERS: List Arm interconnect PMUs as supported
perf: Add driver for Arm NI-700 interconnect PMU
dt-bindings/perf: Add Arm NI-700 PMU
perf/arm-cmn: Improve format attr printing
perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
arm64/mm: use lm_alias() with addresses passed to memblock_free()
mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
arm64: Expose the end of the linear map in PHYSMEM_END
arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
arm64/mm: Delete __init region from memblock.reserved
perf/arm-cmn: Support CMN S3
dt-bindings: perf: arm-cmn: Add CMN S3
perf/arm-cmn: Refactor DTC PMU register access
perf/arm-cmn: Make cycle counts less surprising
perf/arm-cmn: Improve build-time assertion
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu"
"API:
- Make self-test asynchronous
Algorithms:
- Remove MPI functions added for SM3
- Add allocation error checks to remaining MPI functions (introduced
for SM3)
- Set default Jitter RNG OSR to 3
Drivers:
- Add hwrng driver for Rockchip RK3568 SoC
- Allow disabling SR-IOV VFs through sysfs in qat
- Fix device reset bugs in hisilicon
- Fix authenc key parsing by using generic helper in octeontx*
Others:
- Fix xor benchmarking on parisc"
* tag 'v6.12-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (96 commits)
crypto: n2 - Set err to EINVAL if snprintf fails for hmac
crypto: camm/qi - Use ERR_CAST() to return error-valued pointer
crypto: mips/crc32 - Clean up useless assignment operations
crypto: qcom-rng - rename *_of_data to *_match_data
crypto: qcom-rng - fix support for ACPI-based systems
dt-bindings: crypto: qcom,prng: document support for SA8255p
crypto: aegis128 - Fix indentation issue in crypto_aegis128_process_crypt()
crypto: octeontx* - Select CRYPTO_AUTHENC
crypto: testmgr - Hide ENOENT errors
crypto: qat - Remove trailing space after \n newline
crypto: hisilicon/sec - Remove trailing space after \n newline
crypto: algboss - Pass instance creation error up
crypto: api - Fix generic algorithm self-test races
crypto: hisilicon/qm - inject error before stopping queue
crypto: hisilicon/hpre - mask cluster timeout error
crypto: hisilicon/qm - reset device before enabling it
crypto: hisilicon/trng - modifying the order of header files
crypto: hisilicon - add a lock for the qp send operation
crypto: hisilicon - fix missed error branch
crypto: ccp - do not request interrupt on cmd completion when irqs disabled
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"The zero-copy changes are relatively significant, but regression risk
should be contained. The feature needs to be used to cause trouble.
Also it feels like we got an order of magnitude more semi-automated
"refactoring" chaff than usual, I wonder if it's just us.
Core & protocols:
- Support Device Memory TCP, ability to zero-copy receive TCP
payloads to a DMABUF region of memory while packet headers land
separately in normal kernel buffers, and TCP processes then as
usual.
- The ability to read the PTP PHC (Physical Hardware Clock) alongside
MONOTONIC_RAW timestamps with PTP_SYS_OFFSET_EXTENDED. Previously
only CLOCK_REALTIME was supported.
- Allow matching on all bits of IP DSCP for routing decisions.
Previously we only supported on matching TOS bits in IPv4 which is
a narrower interpretation of the same header field.
- Increase the range of weights used for multi-path routing from
8 bits to 16 bits.
- Add support for IPv6 PIO p flag in the Prefix Information Option
per draft-ietf-6man-pio-pflag.
- IPv6 IOAM6 support for new tunsrc encap mode for better
performance.
- Detect destinations which blackhole MPTCP traffic and avoid
initiating MPTCP connections to them for a certain period of time,
1h by default.
- Improve IPsec control path performance by removing the inexact
policies list.
- AF_VSOCK: add support for SIOCOUTQ ioctl.
- Add enum for reasons TCP reset was sent for easier tracing.
- Add SMC ringbufs usage statistics.
Drivers:
- Handle netconsole setup failures more gracefully, don't fail
loading, retain the specified target as disabled.
- Extend bonding's IPsec offload pass thru capabilities (ESN, stats).
Filtering:
- Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case
when long-lived sockets miss a chance to set additional callbacks
if a sockops program was not attached early in their lifetime.
- Support using BPF skb helpers in tracepoints.
- Conntrack Netlink: support CTA_FILTER for flush.
- Improve SCTP support in nfnetlink_queue.
- Improve performance of large nftables flush transactions.
Things we sprinkled into general kernel code:
- selftests: support setting an "interpreter" for script files; make
it easy to run as separate cases tests where one "interpreter" is
fed various test descriptions (in our case packet sequences).
Driver API:
- Extend core and ethtool APIs to support many PHYs connected to a
single interface (PHY topologies).
- Extend cable diagnostics to specify whether Time Domain
Reflectometry (TDR) or Active Link Cable Diagnostic (ALCD) was
used.
- Add library for implementing MAC-PHY Ethernet drivers for SPI
devices compatible with Open Alliance 10BASE-T1x MAC-PHY Serial
Interface (TC6) standard.
- Add helpers to the PHY framework, for PHYs following the Open
Alliance standards:
- 1000BaseT1 link settings
- cable test and diagnostics
- Support listing / dumping all allocated RSS contexts.
- Add configuration for frequency Embedded SYNC in DPLL, which
magically embeds sync pulses into Ethernet signaling.
Device drivers:
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- use better FW APIs for queue reset
- support QOS and TPID settings for the SR-IOV VLAN
- support dynamic MSI-X allocation
- Intel (100G, ice, idpf):
- ice: support PCIe subfunctions
- iavf: add support for TC U32 filters on VFs
- ice: support Embedded SYNC in DPLL
- nVidia/Mellanox (mlx5):
- support HW managed steering tables
- support PCIe PTM cross timestamping
- AMD/Pensando:
- ionic: use page_pool to increase Rx performance
- Cisco (enic):
- report per-queue statistics
- Ethernet virtual:
- Microsoft vNIC:
- mana: support configuring ring length
- netvsc: enable more channels on systems with many CPUs
- IBM veth:
- optimize polling to improve TCP_RR performance
- optimize performance of Tx handling
- VirtIO net:
- synchronize the operstate with the admin state to allow a
lower virtio-net to propagate the link status to an upper
device like macvlan
- Ethernet NICs consumer, and embedded:
- Add driver for Realtek automotive PCIe devices (RTL9054,
RTL9068, RTL9072, RTL9075, RTL9068, RTL9071)
- Add driver for Microchip LAN8650/1 10BASE-T1S MAC-PHY.
- Microchip:
- lan743x: use phylink - support WOL, EEE, pause, link settings
- add Wake-on-LAN support for KSZ87xx family
- add KSZ8895/KSZ8864 switch support
- factor out FDMA code and use it in sparx5 and lan966x
(including DCB support in both)
- Synopsys (stmmac):
- support frame preemption (configured using TC and ethtool)
- support Loongson DWMAC (GMAC v3.73)
- support RockChips RK3576 DWMAC
- TI:
- am65-cpsw: add multi queue RX support
- icssg-prueth: HSR offload support
- Cadence (macb):
- enable software (hrtimer based) IRQ coalescing by default
- Xilinx (axinet):
- expose HW statistics
- improve multicast filtering
- relax Rx checksum offload constraints
- MediaTek:
- mt7530: add EN7581 support
- Aspeed (ftgmac100):
- report link speed and duplex
- Intel:
- igc: add mqprio offload
- igc: report EEE configuration
- RealTek (r8169):
- add support for RTL8126A rev.b
- Vitesse (vsc73xx):
- implement FDB add/del/dump operations
- Freescale (fs_enet):
- use phylink
- Ethernet PHYs:
- vitesse: implement downshift and MDI-X in vsc73xx PHYs
- microchip: support LAN887x, supporting IEEE 802.3bw (100BASE-T1)
and IEEE 802.3bp (1000BASE-T1) specifications
- add Applied Micro QT2025 PHY driver (in Rust)
- add Motorcomm yt8821 2.5G Ethernet PHY driver
- CAN:
- add driver for Rockchip RK3568 CAN-FD controller
- flexcan: add wakeup support for imx95
- kvaser_usb: set hardware timestamp on transmitted packets
- WiFi:
- mac80211/cfg80211:
- EHT rate support in AQL airtime fairness
- handle DFS (radar detection) per link in Multi-Link Operation
- RealTek (rtw89):
- support RTL8852BT and 8852BE-VT (WiFi 6)
- support hardware rfkill
- support HW encryption in unicast management frames
- support Wake-on-WLAN with supported network detection
- RealTek (rtw89):
- improve Rx performance by using USB frame aggregation
- support USB 3 with RTL8822CU/RTL8822BU
- Intel (iwlwifi/mvm):
- offload RLC/SMPS functionality to firmware
- Marvell (mwifiex):
- add host based MLME to enable WPA3
- Bluetooth:
- add support for Amlogic HCI UART protocol
- add support for ISO data/packets to Intel and NXP drivers"
* tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1303 commits)
net/mlx5: HWS, check the correct variable in hws_send_ring_alloc_sq()
netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level()
ice: Fix a NULL vs IS_ERR() check in probe()
ice: Fix a couple NULL vs IS_ERR() bugs
net: ethernet: fs_enet: Make the per clock optional
net: ti: icssg-prueth: Add multicast filtering support in HSR mode
net: ti: icssg-prueth: Enable HSR Tx duplication, Tx Tag and Rx Tag offload
net: ti: icssg-prueth: Add support for HSR frame forward offload
net: ti: icssg-prueth: Stop hardcoding def_inc
net: ti: icss-iep: Move icss_iep structure
net: ibm: emac: get rid of wol_irq
net: ibm: emac: remove all waiting code
net: ibm: emac: replace of_get_property
net: ibm: emac: use netdev's phydev directly
net: ibm: emac: use devm for register_netdev
net: ibm: emac: remove mii_bus with devm
net: ibm: emac: use devm for of_iomap
net: ibm: emac: manage emac_irq with devm
net: ibm: emac: use devm for alloc_etherdev
octeontx2-af: debugfs: Add Channel info to RPM map
...
|
|
Call the missed kfree() in btf_parse_struct_metas() when there is no
special field in btf, otherwise will get the following kmemleak report:
unreferenced object 0xffff888101033620 (size 8):
comm "test_progs", pid 604, jiffies 4295127011
......
backtrace (crc e77dc444):
[<00000000186f90f3>] kmemleak_alloc+0x4b/0x80
[<00000000ac8e9c4d>] __kmalloc_cache_noprof+0x2a1/0x310
[<00000000d99d68d6>] btf_new_fd+0x72d/0xe90
[<00000000f010b7f8>] __sys_bpf+0xec3/0x2410
[<00000000e077ed6f>] __x64_sys_bpf+0x1f/0x30
[<00000000a12f9e55>] x64_sys_call+0x199/0x9f0
[<00000000f3029ea6>] do_syscall_64+0x3b/0xc0
[<000000005640913a>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: 7a851ecb1806 ("bpf: Search for kptrs in prog BTF structs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240912012845.3458483-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When security_bpf_map_create() in map_create() fails, map_create() will
call btf_put() and ->map_free() callback to free the map. It doesn't
free the btf_record of map value, so add the missed btf_record_free()
when map creation fails.
However btf_record_free() needs to be called after ->map_free() just
like bpf_map_free_deferred() did, because ->map_free() may use the
btf_record to free the special fields in preallocated map value. So
factor out bpf_map_free() helper to free the map, btf_record, and btf
orderly and use the helper in both map_create() and
bpf_map_free_deferred().
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240912012845.3458483-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input
arguments, zero the value for the case of an error as otherwise it could leak
memory. For tracing, it is not needed given CAP_PERFMON can already read all
kernel memory anyway hence bpf_get_func_arg() and bpf_get_func_ret() is skipped
in here.
Also, the MTU helpers mtu_len pointer value is being written but also read.
Technically, the MEM_UNINIT should not be there in order to always force init.
Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now
implies two things actually: i) write into memory, ii) memory does not have
to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory,
ii) memory must be initialized. This means that for bpf_*_check_mtu() we're
readding the issue we're trying to fix, that is, it would then be able to
write back into things like .rodata BPF maps. Follow-up work will rework the
MEM_UNINIT semantics such that the intent can be better expressed. For now
just clear the *mtu_len on error path which can be lifted later again.
Fixes: 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When checking malformed helper function signatures, also take other argument
types into account aside from just ARG_PTR_TO_UNINIT_MEM.
This concerns (formerly) ARG_PTR_TO_{INT,LONG} given uninitialized memory can
be passed there, too.
The func proto sanity check goes back to commit 435faee1aae9 ("bpf, verifier:
add ARG_PTR_TO_RAW_STACK type"), and its purpose was to detect wrong func protos
which had more than just one MEM_UNINIT-tagged type as arguments.
The reason more than one is currently not supported is as we mark stack slots with
STACK_MISC in check_helper_call() in case of raw mode based on meta.access_size to
allow uninitialized stack memory to be passed to helpers when they just write into
the buffer.
Probing for base type as well as MEM_UNINIT tagging ensures that other types do not
get missed (as it used to be the case for ARG_PTR_TO_{INT,LONG}).
Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240913191754.13290-4-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Lonial found an issue that despite user- and BPF-side frozen BPF map
(like in case of .rodata), it was still possible to write into it from
a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT}
as arguments.
In check_func_arg() when the argument is as mentioned, the meta->raw_mode
is never set. Later, check_helper_mem_access(), under the case of
PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the
subsequent call to check_map_access_type() and given the BPF map is
read-only it succeeds.
The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT
when results are written into them as opposed to read out of them. The
latter indicates that it's okay to pass a pointer to uninitialized memory
as the memory is written to anyway.
However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM
just with additional alignment requirement. So it is better to just get
rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the
fixed size memory types. For this, add MEM_ALIGNED to additionally ensure
alignment given these helpers write directly into the args via *<ptr> = val.
The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>).
MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated
argument types, since in !MEM_FIXED_SIZE cases the verifier does not know
the buffer size a priori and therefore cannot blindly write *<ptr> = val.
Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Both bpf_strtol() and bpf_strtoul() helpers passed a temporary "long long"
respectively "unsigned long long" to __bpf_strtoll() / __bpf_strtoull().
Later, the result was checked for truncation via _res != ({unsigned,} long)_res
as the destination buffer for the BPF helpers was of type {unsigned,} long
which is 32bit on 32bit architectures.
Given the latter was a bug in the helper signatures where the destination buffer
got adjusted to {s,u}64, the truncation check can now be removed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240913191754.13290-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The bpf_strtol() and bpf_strtoul() helpers are currently broken on 32bit:
The argument type ARG_PTR_TO_LONG is BPF-side "long", not kernel-side "long"
and therefore always considered fixed 64bit no matter if 64 or 32bit underlying
architecture.
This contract breaks in case of the two mentioned helpers since their BPF_CALL
definition for the helpers was added with {unsigned,}long *res. Meaning, the
transition from BPF-side "long" (BPF program) to kernel-side "long" (BPF helper)
breaks here.
Both helpers call __bpf_strtoll() with "long long" correctly, but later assigning
the result into 32-bit "*(long *)" on 32bit architectures. From a BPF program
point of view, this means upper bits will be seen as uninitialised.
Therefore, fix both BPF_CALL signatures to {s,u}64 types to fix this situation.
Now, changing also uapi/bpf.h helper documentation which generates bpf_helper_defs.h
for BPF programs is tricky: Changing signatures there to __{s,u}64 would trigger
compiler warnings (incompatible pointer types passing 'long *' to parameter of type
'__s64 *' (aka 'long long *')) for existing BPF programs.
Leaving the signatures as-is would be fine as from BPF program point of view it is
still BPF-side "long" and thus equivalent to __{s,u}64 on 64 or 32bit underlying
architectures.
Note that bpf_strtol() and bpf_strtoul() are the only helpers with this issue.
Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/481fcec8-c12c-9abb-8ecb-76c71c009959@iogearbox.net
Link: https://lore.kernel.org/r/20240913191754.13290-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Zac Ecob reported a problem where a bpf program may cause kernel crash due
to the following error:
Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI
The failure is due to the below signed divide:
LLONG_MIN/-1 where LLONG_MIN equals to -9,223,372,036,854,775,808.
LLONG_MIN/-1 is supposed to give a positive number 9,223,372,036,854,775,808,
but it is impossible since for 64-bit system, the maximum positive
number is 9,223,372,036,854,775,807. On x86_64, LLONG_MIN/-1 will
cause a kernel exception. On arm64, the result for LLONG_MIN/-1 is
LLONG_MIN.
Further investigation found all the following sdiv/smod cases may trigger
an exception when bpf program is running on x86_64 platform:
- LLONG_MIN/-1 for 64bit operation
- INT_MIN/-1 for 32bit operation
- LLONG_MIN%-1 for 64bit operation
- INT_MIN%-1 for 32bit operation
where -1 can be an immediate or in a register.
On arm64, there are no exceptions:
- LLONG_MIN/-1 = LLONG_MIN
- INT_MIN/-1 = INT_MIN
- LLONG_MIN%-1 = 0
- INT_MIN%-1 = 0
where -1 can be an immediate or in a register.
Insn patching is needed to handle the above cases and the patched codes
produced results aligned with above arm64 result. The below are pseudo
codes to handle sdiv/smod exceptions including both divisor -1 and divisor 0
and the divisor is stored in a register.
sdiv:
tmp = rX
tmp += 1 /* [-1, 0] -> [0, 1]
if tmp >(unsigned) 1 goto L2
if tmp == 0 goto L1
rY = 0
L1:
rY = -rY;
goto L3
L2:
rY /= rX
L3:
smod:
tmp = rX
tmp += 1 /* [-1, 0] -> [0, 1]
if tmp >(unsigned) 1 goto L1
if tmp == 1 (is64 ? goto L2 : goto L3)
rY = 0;
goto L2
L1:
rY %= rX
L2:
goto L4 // only when !is64
L3:
wY = wY // only when !is64
L4:
[1] https://lore.kernel.org/bpf/tPJLTEh7S_DxFEqAI2Ji5MBSoZVg7_G-Py2iaZpAaWtM961fFTWtsnlzwvTbzBzaUzwQAoNATXKUlt0LZOFgnDcIyKCswAnAGdUF3LBrhGQ=@protonmail.com/
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240913150326.1187788-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
commit ac3b43283923 ("module: replace module_layout with module_memory")
introduced a set of memory regions for the module layout sharing the
same attributes. However, it didn't update the kmemleak scanned areas
which intended to limit kmemleak scan to sections containing writable
data. This means sections such as .text and .rodata are scanned by
kmemleak.
Refine the scanned areas for modules by limiting it to MOD_TEXT and
MOD_INIT_TEXT mod_mem regions.
CC: Song Liu <song@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
When insmod a kernel module, if fails in add_notes_attrs or
add_sysfs_attrs such as memory allocation fail, mod_sysfs_setup
will still return success, but we can't access user interface
on android device.
Patch for make mod_sysfs_setup can check the error of
add_notes_attrs and add_sysfs_attrs
[mcgrof: the section stuff comes from linux history.git [0]]
Fixes: 3f7b0672086b ("Module section offsets in /sys/module") [0]
Fixes: 6d76013381ed ("Add /sys/module/name/notes")
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409010016.3XIFSmRA-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202409072018.qfEzZbO7-lkp@intel.com/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=3f7b0672086b97b2d7f322bdc289cbfa203f10ef [0]
Signed-off-by: Xion Wang <xion.wang@mediatek.com>
Signed-off-by: Chunhui Li <chunhui.li@mediatek.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-09-11
We've added 12 non-merge commits during the last 16 day(s) which contain
a total of 20 files changed, 228 insertions(+), 30 deletions(-).
There's a minor merge conflict in drivers/net/netkit.c:
00d066a4d4ed ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
d96608794889 ("netkit: Disable netpoll support")
The main changes are:
1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used
to easily parse skbs in BPF programs attached to tracepoints,
from Philo Lu.
2) Add a cond_resched() point in BPF's sock_hash_free() as there have
been several syzbot soft lockup reports recently, from Eric Dumazet.
3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which
got noticed when zero copy ice driver started to use it,
from Maciej Fijalkowski.
4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs
up via netif_receive_skb_list() to better measure latencies,
from Daniel Xu.
5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann.
6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but
instead gather the actual value via /proc/sys/net/core/max_skb_frags,
also from Maciej Fijalkowski.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
sock_map: Add a cond_resched() in sock_hash_free()
selftests/bpf: Expand skb dynptr selftests for tp_btf
bpf: Allow bpf_dynptr_from_skb() for tp_btf
tcp: Use skb__nullable in trace_tcp_send_reset
selftests/bpf: Add test for __nullable suffix in tp_btf
bpf: Support __nullable argument suffix for tp_btf
bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv
selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob
xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc()
tcp_bpf: Remove an unused parameter for bpf_tcp_ingress()
bpf, sockmap: Correct spelling skmsg.c
netkit: Disable netpoll support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Keep file reference through the entire thing, don't bother with grabbing
struct path reference and while we are at it, don't confuse the hell out
of readers by random mix of path.dentry->d_sb and path.mnt->mnt_sb uses -
these two are equal, so just put one of those into a local variable and
use that.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"A fix for a NULL worker->pool deref bug which can be triggered when a
worker is created and then destroyed immediately"
* tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Clear worker->pool in the worker thread context
|
|
dma_supported has become too much spaghetti for my taste. Reflow it to
remove the duplicate use_dma_iommu condition and make the main path more
obvious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
When I expanded uidgid mappings I intended for a struct uid_gid_map to
fit into a single cacheline on x86 as they tend to be pretty
performance sensitive (idmapped mounts etc). But a 4 byte hole was added
that brought it over 64 bytes. Fix that by adding the static extent
array and the extent counter into a substruct. C's type punning for
unions guarantees that we can access ->nr_extents even if the last
written to member wasn't within the same object. This is also what we
rely on in struct_group() and friends. This of course relies on
non-strict aliasing which we don't do.
99) If the member used to read the contents of a union object is not the
same as the member last used to store a value in the object, the
appropriate part of the object representation of the value is
reinterpreted as an object representation in the new type as
described in 6.2.6 (a process sometimes called "type punning").
Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf
Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
If the DMA IOMMU path is going to be used, the appropriate check should
return that DMA is supported.
Fixes: b5c58b2fdc42 ("dma-mapping: direct calls for dma-iommu")
Closes: https://lore.kernel.org/all/181e06ff-35a3-434f-b505-672f430bd1cb@notapiano
Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from
__update_blocked_others()") added update_other_load_avgs() in
kernel/sched/syscalls.c right above effective_cpu_util(). This location
didn't fit that well in the first place, and with 5d871a63997f ("sched/fair:
Move effective_cpu_util() and effective_cpu_util() in fair.c") moving
effective_cpu_util() to kernel/sched/fair.c, it looks even more out of
place.
Relocate the function to kernel/sched/pelt.c where all its callees are.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
|
|
Marc Hartmayer reported:
[ 23.133876] Unable to handle kernel pointer dereference in virtual kernel address space
[ 23.133950] Failing address: 0000000000000000 TEID: 0000000000000483
[ 23.133954] Fault in home space mode while using kernel ASCE.
[ 23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d
[ 23.134207] Oops: 0004 ilc:2 [#1] SMP
(snip)
[ 23.134516] Call Trace:
[ 23.134520] [<0000024e326caf28>] worker_thread+0x48/0x430
[ 23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430)
[ 23.134528] [<0000024e326d3a3e>] kthread+0x11e/0x130
[ 23.134533] [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60
[ 23.134536] [<0000024e333fb37a>] ret_from_fork+0xa/0x38
[ 23.134552] Last Breaking-Event-Address:
[ 23.134553] [<0000024e333f4c04>] mutex_unlock+0x24/0x30
[ 23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops
With debuging and analysis, worker_thread() accesses to the nullified
worker->pool when the newly created worker is destroyed before being
waken-up, in which case worker_thread() can see the result detach_worker()
reseting worker->pool to NULL at the begining.
Move the code "worker->pool = NULL;" out from detach_worker() to fix the
problem.
worker->pool had been designed to be constant for regular workers and
changeable for rescuer. To share attaching/detaching code for regular
and rescuer workers and to avoid worker->pool being accessed inadvertently
when the worker has been detached, worker->pool is reset to NULL when
detached no matter the worker is rescuer or not.
To maintain worker->pool being reset after detached, move the code
"worker->pool = NULL;" in the worker thread context after detached.
It is either be in the regular worker thread context after PF_WQ_WORKER
is cleared or in rescuer worker thread context with wq_pool_attach_mutex
held. So it is safe to do so.
Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Fixes: f4b7b53c94af ("workqueue: Detach workers directly in idle_cull_fn()")
Cc: stable@vger.kernel.org # v6.11+
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Salvatore Benedetto reported an issue that when doing syscall tracepoint
tracing the kernel stack is empty. For example, using the following
command line
bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }'
the output for both commands is
===
Kernel Stack
===
Further analysis shows that pt_regs used for bpf syscall tracepoint
tracing is from the one constructed during user->kernel transition.
The call stack looks like
perf_syscall_enter+0x88/0x7c0
trace_sys_enter+0x41/0x80
syscall_trace_enter+0x100/0x160
do_syscall_64+0x38/0xf0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The ip address stored in pt_regs is from user space hence no kernel
stack is printed.
To fix the issue, kernel address from pt_regs is required.
In kernel repo, there are already a few cases like this. For example,
in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
instances are used to supply ip address or use ip address to construct
call stack.
Instead of allocate fake_regs in the stack which may consume
a lot of bytes, the function perf_trace_buf_alloc() in
perf_syscall_{enter, exit}() is leveraged to create fake_regs,
which will be passed to perf_call_bpf_{enter,exit}().
For the above bpftrace script, I got the following output with this patch:
for tracepoint:syscalls:sys_enter_read
===
Kernel Stack
syscall_trace_enter+407
syscall_trace_enter+407
do_syscall_64+74
entry_SYSCALL_64_after_hwframe+75
===
and for tracepoint:syscalls:sys_exit_read
===
Kernel Stack
syscall_exit_work+185
syscall_exit_work+185
syscall_exit_to_user_mode+305
do_syscall_64+118
entry_SYSCALL_64_after_hwframe+75
===
Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev
|