linux-next.git/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c, branch master

drm/amdgpu: flush pending RCU callbacks on module unload

2026-07-01T15:54:54+00:00

Call rcu_barrier() in module exit to wait for outstanding call_rcu() callbacks before freeing module text, preventing late callback execution in freed memory.

BUG: unable to handle page fault for address: ffffffffc1d59c40
PGD 6a12067 P4D 6a12067 PUD 6a14067 PMD 13698b067 PTE 0
Oops: 0010 [#1] SMP NOPTI
RIP: 0010:0xffffffffc1d59c40
Code: Unable to access opcode bytes at RIP 0xffffffffc1d59c16.
RSP: 0018:ffffc900198c0f28 EFLAGS: 00010286
RAX: ffffffffc1d59c40 RBX: ffff897c7d6b61c0 RCX: ffff88826aff4590
RDX: ffff8884d8b35490 RSI: ffffc900198c0f30 RDI: ffff88812af67290
RBP: 000000000000000a (DONE segment entries) R08: 0000000000000000 R09: 0000000000000100
R10: 0000000000000000 R11: ffffffff82a06100 R12: ffff88811a4e3700
R13: 0000000000000000 R14: ffff897c7d6b6270 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff897c7d680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc1d59c16 CR3: 00000104a980a001 CR4: 0000000002770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
? rcu_do_batch+0x163/0x450
? rcu_core+0x177/0x1c0
? __do_softirq+0xc1/0x280
? asm_call_irq_on_stack+0xf/0x20
? do_softirq_own_stack+0x37/0x50
? irq_exit_rcu+0xc4/0x100
? sysvec_apic_timer_interrupt+0x36/0x80
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? cpuidle_enter_state+0xd4/0x360
? cpuidle_enter+0x29/0x40
? cpuidle_idle_call+0x108/0x1a0
? do_idle+0x77/0xf0
? cpu_startup_entry+0x19/0x20
? secondary_startup_64_no_verify+0xbf/0xcb

Signed-off-by: Perry Yuan
Reviewed-by: Yifan Zhang
Reviewed-by: Christian König
Signed-off-by: Alex Deucher
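The failure mode above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `queue_defer()` stands in for call_rcu(), `queue_drain()` for rcu_barrier(), and every name in the block is hypothetical. It counts how many deferred callbacks would have executed after the module text was "freed", with and without a flush before unload.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct deferred_cb {
    void (*func)(struct deferred_cb *cb);
    struct deferred_cb *next;
};

struct cb_queue {
    struct deferred_cb *head;
    struct deferred_cb **tail;
    int module_loaded; /* 1 while the callbacks' code is still mapped */
    int unsafe_runs;   /* callbacks that came due after unload */
};

static int freed_objects;

/* Sample callback living in the "module". */
static void free_object_cb(struct deferred_cb *cb)
{
    (void)cb;
    freed_objects++;
}

static void queue_init(struct cb_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
    q->module_loaded = 1;
    q->unsafe_runs = 0;
}

/* Stands in for call_rcu(): queue a callback to run later. */
static void queue_defer(struct cb_queue *q, struct deferred_cb *cb,
                        void (*func)(struct deferred_cb *cb))
{
    assert(cb != NULL && func != NULL);
    cb->func = func;
    cb->next = NULL;
    *q->tail = cb;
    q->tail = &cb->next;
}

/* Stands in for callback processing; rcu_barrier() waits for this to finish. */
static int queue_drain(struct cb_queue *q)
{
    int n = 0;

    while (q->head) {
        struct deferred_cb *cb = q->head;

        q->head = cb->next;
        if (q->module_loaded)
            cb->func(cb);     /* safe: code is still present */
        else
            q->unsafe_runs++; /* the real kernel would jump into freed text */
        n++;
    }
    q->tail = &q->head;
    return n;
}

/* Model of module exit, with or without the flush this commit adds. */
static void module_unload(struct cb_queue *q, int flush_first)
{
    if (flush_first)
        queue_drain(q);
    q->module_loaded = 0;
}

/* Returns the number of callbacks that came due after unload. */
static int simulate_unload(int flush_first)
{
    struct cb_queue q;
    struct deferred_cb a, b;

    queue_init(&q);
    queue_defer(&q, &a, free_object_cb);
    queue_defer(&q, &b, free_object_cb);
    module_unload(&q, flush_first);
    queue_drain(&q); /* late processing, like rcu_do_batch() in the trace */
    return q.unsafe_runs;
}
```

In this model, `simulate_unload(1)` leaves no callbacks pending at unload, while `simulate_unload(0)` leaves both queued callbacks to come due afterwards, which corresponds to the page fault at an address inside the unloaded module.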

drm/amdgpu: Support some Barco AMD based graphics adapters

2026-07-01T15:15:02+00:00

These adapters are typically supported by Barco only on the Windows platform. However, with these changes in the Linux driver, multiple monitor support should work correctly.

Signed-off-by: Matthew Jacob
Signed-off-by: Alex Deucher

drm/amdgpu: Add IP block soft reset as a GPU recovery method

2026-07-01T15:14:11+00:00

Implement IP block soft reset as a recovery method that fits into the current GPU recovery code, as opposed to being hacked into the full GPU reset code path. This can gracefully handle GPU hangs when other reset methods are not available or have failed. It minimizes collateral damage (i.e. affected non-guilty jobs) and does a backup and restore on all affected queues.

Note that some of the new helpers may be useful for other reset types as well, which we can explore later.

Reviewed-by: Alex Deucher
Signed-off-by: Timur Kristóf
Signed-off-by: Alex Deucher

drm/amdgpu: Clarify name of soft recovery to avoid confusion

2026-07-01T15:11:33+00:00

Soft recovery is not the same as soft reset:

* Soft recovery attempts to resolve a GPU hang by sending a command to terminate shaders.
* Soft reset completely re-initializes an entire device IP block, which may affect multiple rings and jobs at the same time.

Reviewed-by: Alex Deucher
Signed-off-by: Timur Kristóf
Reviewed-by: Christian König
Signed-off-by: Alex Deucher

drm/amdgpu: Export ip_discovery sysfs on probe failure

2026-06-17T20:23:22+00:00

When driver probe fails (missing firmware, unsupported hardware, etc.), the entire device is torn down including the ip_discovery sysfs folder, preventing users from identifying what hardware is present.

Export ip_discovery sysfs even when probe fails by creating it early in the probe flow and tying its lifetime to the PCI device rather than the driver. The sysfs folder persists across probe failures and module reloads, but is cleaned up on driver unbind.

Acked-by: Alex Deucher
Signed-off-by: Mario Limonciello
Signed-off-by: Alex Deucher

drm/amd: add AMDGPU_DEBUG_HIBERNATION_THAW_RESUME_GPU debug mask

2026-06-17T20:21:48+00:00

The kernel parameter `no_console_suspend` is required to capture the complete hibernation kernel log via the serial console. But when the parameter is set, the GPU is resumed in the thaw stage. This causes many issues on the alinux3 kernel.

Fix: add a new debug mask `AMDGPU_DEBUG_HIBERNATION_THAW_RESUME_GPU` to replace the check of `console_suspend_enabled` in the thaw() callback. Users can enable it using `amdgpu.debug_mask=0x800`.

Signed-off-by: Samuel Zhang
Reviewed-by: Mario Limonciello (AMD)
Signed-off-by: Alex Deucher
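As a rough illustration of how such a mask bit is tested, the sketch below checks the 0x800 value quoted in the commit message. The helper `thaw_should_resume_gpu()` is hypothetical and only assumes that the bit directly gates the resume in thaw(); the real driver logic may involve further conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit value taken from the commit message (amdgpu.debug_mask=0x800). */
#define AMDGPU_DEBUG_HIBERNATION_THAW_RESUME_GPU 0x800u

/* 0x800 is bit 11 of the mask. */
static_assert(AMDGPU_DEBUG_HIBERNATION_THAW_RESUME_GPU == (1u << 11),
              "debug mask bit mismatch");

/*
 * Hypothetical helper: returns true when the debug mask asks for the GPU
 * to be resumed in the thaw stage. Assumes the bit alone decides this.
 */
static bool thaw_should_resume_gpu(uint32_t debug_mask)
{
    return (debug_mask & AMDGPU_DEBUG_HIBERNATION_THAW_RESUME_GPU) != 0;
}
```

Because the check is a bitwise AND, the bit can be combined with other debug bits, for example `amdgpu.debug_mask=0x801`.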

drm/amdgpu: add ioctl to handle RAS poison error

2026-06-17T19:51:36+00:00

Add a new DRM_IOCTL_AMDGPU_PROC_OPTIONS ioctl with the AMDGPU_PROC_OPTIONS_OP_KFD_SIGBUS_DELAY option, allowing userspace (ROCr) to control per-process SIGBUS delivery.

Userspace for this can be found at:
https://github.com/ROCm/rocm-systems/pull/6190

Reviewed-by: Lijo Lazar
Reviewed-by: Alex Deucher
Signed-off-by: Yifan Zhang
Signed-off-by: Alex Deucher

drm/amdgpu: Add lockdep annotations for lock ordering validation

2026-06-04T19:24:22+00:00

Add lockdep annotations to teach lockdep the correct lock hierarchy and catch ordering violations during development. This follows the pattern established by dma-resv in drivers/dma-buf/dma-resv.c.

Lock ordering hierarchy (outermost to innermost):
1. userq_sch_mutex - Global userq scheduler (enforce_isolation)
2. userq_mutex - Per-context userq (held across queue create/destroy)
3. notifier_lock - MMU notifier synchronization
4. vram_lock - VRAM memory allocator
5. reset_domain->sem - GPU reset synchronization
6. reset_lock - Reset control mutex
7. srbm_mutex - SRBM register access
8. grbm_idx_mutex - GRBM index register access
9. mmio_idx_lock - MMIO index access (spinlock)

The implementation provides:
- Lock ordering training at module init (amdgpu_lockdep_init)
- Lock class association for real driver locks (amdgpu_lockdep_set_class)

Dummy locks are associated with the same class keys as real driver locks via lockdep_set_class(), ensuring lockdep connects the training ordering with actual runtime locks.

Testing:

Build the kernel with CONFIG_PROVE_LOCKING=y (enables CONFIG_LOCKDEP):

  scripts/config --enable PROVE_LOCKING
  scripts/config --enable DEBUG_LOCKDEP
  make -j$(nproc)

On boot, dmesg should show:

  AMDGPU: Lockdep annotations initialized (9 lock levels)

The companion IGT test (tests/amdgpu/amd_lockdep) exercises lock-heavy GPU code paths concurrently to trigger lockdep warnings on violations:

  sudo ./build/tests/amdgpu/amd_lockdep
  sudo dmesg | grep -A 50 "circular locking dependency"

IGT subtests:
  concurrent-reset-and-submit - reset_sem vs submission locks
  concurrent-mmap-and-evict - mmap_lock vs vram_lock
  concurrent-userptr-and-reset - notifier_lock vs reset_sem
  stress-all-paths - all of the above simultaneously

A clean dmesg (no "circular locking dependency" or "possible recursive locking detected" messages) confirms no lock ordering violations.

For CI integration, the test should be run on kernels compiled with CONFIG_LOCKDEP=y; dmesg is scanned post-run for lockdep splats.

v2: (Christian)
- Move notifier_lock and vram_lock before reset locks in hierarchy. HMM invalidation holds notifier_lock and can wait for GPU reset completion, so notifier_lock must be outer to reset_domain->sem.
- Associate dummy locks with lock class keys via lockdep_set_class() so lockdep connects training with real driver locks.
- Update commit message to list all 9 lock levels.

Requires CONFIG_PROVE_LOCKING=y to activate.

Cc: Christian Konig
Cc: Alex Deucher
Signed-off-by: Vitaly Prosyak
Reviewed-by: Christian Konig
Signed-off-by: Alex Deucher
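The nine-level hierarchy can be modelled in userspace as a rank check. This is a deliberately simplified sketch: lockdep itself records observed dependencies between lock classes and looks for cycles, whereas the code below just assigns each lock a fixed rank and rejects any acquisition that is not strictly inner to the lock taken last. The enum, struct, and function names are hypothetical; only the ordering comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks follow the commit message: 1 is outermost, 9 is innermost. */
enum sketch_lock_level {
    SKETCH_USERQ_SCH_MUTEX = 1,
    SKETCH_USERQ_MUTEX,
    SKETCH_NOTIFIER_LOCK,
    SKETCH_VRAM_LOCK,
    SKETCH_RESET_DOMAIN_SEM,
    SKETCH_RESET_LOCK,
    SKETCH_SRBM_MUTEX,
    SKETCH_GRBM_IDX_MUTEX,
    SKETCH_MMIO_IDX_LOCK,
};

#define SKETCH_MAX_DEPTH 9

/* Per-thread record of the locks currently held, in acquisition order. */
struct sketch_held_locks {
    int depth;
    enum sketch_lock_level stack[SKETCH_MAX_DEPTH];
};

/*
 * Record an acquisition. Returns false (and records nothing) when the new
 * lock is not strictly inner to the most recently acquired one, which is
 * the kind of inversion the annotations are meant to expose.
 */
static bool sketch_acquire(struct sketch_held_locks *h,
                           enum sketch_lock_level lvl)
{
    assert(lvl >= SKETCH_USERQ_SCH_MUTEX && lvl <= SKETCH_MMIO_IDX_LOCK);
    if (h->depth > 0 && lvl <= h->stack[h->depth - 1])
        return false;
    /* Ranks are strictly increasing, so depth cannot exceed 9 here. */
    h->stack[h->depth++] = lvl;
    return true;
}

/* Drop the most recently acquired lock. */
static void sketch_release_last(struct sketch_held_locks *h)
{
    assert(h->depth > 0);
    h->depth--;
}
```

With this model, taking notifier_lock and then reset_domain->sem is accepted, matching the v2 note about HMM invalidation, while taking vram_lock with reset_domain->sem already held is rejected.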

drm/amd: Reduce code duplication in runtime PM

2026-05-18T22:19:39+00:00

[Why]
amdgpu_pmops_runtime_suspend() runs almost the same code as amdgpu_pmops_runtime_idle(), so the code is needlessly duplicated.

[How]
Move amdgpu_pmops_runtime_idle() up, extract the common code into a helper, and call it from both functions.

No intended functional changes.

Reviewed-by: Alex Deucher
Signed-off-by: Mario Limonciello
Signed-off-by: Alex Deucher

drm/amdgpu: add amdgpu.ptl module parameter for PTL control

2026-05-11T19:55:56+00:00

Add a new kernel module parameter 'amdgpu.ptl' to allow users to enable or disable the PTL feature at driver load time.

Parameter values:
*) 0 or -1: disable PTL (default)
*) 1: enable PTL
*) 2: permanently disable PTL

Signed-off-by: Perry Yuan
Reviewed-by: Yifan Zhang
Acked-by: Alex Deucher
Signed-off-by: Alex Deucher
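The value table above maps naturally onto a small decoding function. The sketch below is illustrative only: the enum and function names are hypothetical, and treating any other value as invalid is an assumption, since the commit message does not say how out-of-range values are handled.

```c
/* Hypothetical decoding of the amdgpu.ptl values listed above. */
enum sketch_ptl_mode {
    SKETCH_PTL_DISABLED,             /* 0 or -1 (default) */
    SKETCH_PTL_ENABLED,              /* 1 */
    SKETCH_PTL_PERMANENTLY_DISABLED, /* 2 */
    SKETCH_PTL_INVALID,              /* assumption: anything else */
};

static enum sketch_ptl_mode sketch_ptl_mode_from_param(int value)
{
    switch (value) {
    case -1:
    case 0:
        return SKETCH_PTL_DISABLED;
    case 1:
        return SKETCH_PTL_ENABLED;
    case 2:
        return SKETCH_PTL_PERMANENTLY_DISABLED;
    default:
        return SKETCH_PTL_INVALID;
    }
}
```

For example, loading the driver with `amdgpu.ptl=1` corresponds to `sketch_ptl_mode_from_param(1)`, which yields the enabled mode in this sketch.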