linux-next.git/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c, branch master

drm/amdgpu: Do not fiddle with the idle workers too much

2026-07-01T15:56:47+00:00

Idle workers only need to be canceled or pushed back if we are potentially idle. Make both operations conditional on the pre-increment and post-decrement status of the in-flight job counter. Reviewed-by: Timur Kristóf Signed-off-by: Tvrtko Ursulin Cc: Alex Deucher Cc: Christian König Cc: Timur Kristóf Signed-off-by: Alex Deucher
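The counter-gating idea can be sketched in userspace C. This is only an illustration under assumptions: `struct idle_ctl`, `job_begin()` and `job_end()` are hypothetical names that do not come from the driver, the two plain counters stand in for canceling and re-arming the idle worker, and the mapping of "cancel" to the 0 -> 1 transition and "push back" to the 1 -> 0 transition is one reading of the commit message.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model; none of these names exist in amdgpu. */
struct idle_ctl {
	atomic_int in_flight;	/* jobs currently in flight */
	int cancel_calls;	/* times the idle worker would be canceled */
	int schedule_calls;	/* times the idle worker would be pushed back */
};

static void idle_ctl_init(struct idle_ctl *c)
{
	atomic_init(&c->in_flight, 0);
	c->cancel_calls = 0;
	c->schedule_calls = 0;
}

/* A job starts: touch the idle worker only if we may have been idle. */
static void job_begin(struct idle_ctl *c)
{
	/* atomic_fetch_add() returns the pre-increment value. */
	if (atomic_fetch_add(&c->in_flight, 1) == 0)
		c->cancel_calls++;
}

/* A job ends: re-arm the idle worker only if we just became idle. */
static void job_end(struct idle_ctl *c)
{
	/* atomic_fetch_sub() returns the pre-decrement value. */
	int prev = atomic_fetch_sub(&c->in_flight, 1);

	assert(prev > 0);	/* unbalanced job_end() would be a bug */
	if (prev == 1)		/* post-decrement value is zero */
		c->schedule_calls++;
}
```

With two overlapping jobs, the model touches the idle worker once on the way in and once on the way out, instead of once per job.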

drm/amdgpu: Fix false error return to non-KCQ

2026-07-01T15:41:10+00:00

amdgpu_gfx_reset_mes_compute is used to coordinate suspend_all, reset, and resume_all between KCQ and compute user queues. When a hung queue comes from the compute user queues and the reset is successful, the KCQ failure after reset should be sent to KCQ only and not the compute user queues. Compute user queues can operate after a successful reset without a mode reset. Fixes: a4e4d945cba8 ("drm/amdgpu/gfx: defer per-queue helper_end until after MES resume") Signed-off-by: Amber Lin Acked-by: Jesse Zhang Signed-off-by: Alex Deucher

Revert "drm/amdgpu: defer KCQ remap until after MES resume in reset flow"

2026-07-01T15:30:04+00:00

This reverts commit 36b6c723d82c07dbbeae95d5883d4ecf0a643727. It introduced a regression on gfx11: the kfd negative test failed. Signed-off-by: Jesse Zhang Reviewed-by: Amber Lin Signed-off-by: Alex Deucher

drm/amdgpu: defer KCQ remap until after MES resume in reset flow

2026-07-01T15:15:47+00:00

Split amdgpu_gfx_mes_reset_queue_start() into reset+unmap now and queue reinit later, and do the remap only after amdgpu_mes_resume(). Avoids re-adding legacy queues while MES gangs are still suspended. Suggested-by: Shaoyun Liu Acked-by: Alex Deucher Signed-off-by: Jesse Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Fix mes remove_hw_queue lock

2026-07-01T15:07:40+00:00

The down_read/up_read pair on the adev->reset_domain semaphore should be placed around the queue removal. v2: remove the empty function recover_bad_queue_mes to avoid a compile error on RHEL Fixes: f401a2633e02 ("drm/amdgpu: Remove faulty queue before resume") Signed-off-by: Amber Lin Reviewed-by: Jesse Zhang Signed-off-by: Alex Deucher

drm/amdgpu/gfx: fix cleaner shader IB buffer overflow

2026-06-17T20:13:02+00:00

The cleaner shader sysfs path allocates a 16-dword (64 byte) IB but incorrectly fills (align_mask + 1) dwords. On GFX rings align_mask is 0xff, so the loop wrote 256 dwords into a 64-byte buffer, causing a kernel page fault. The IB only needs to be a minimal NOP shell to schedule the job; the cleaner shader itself is emitted on the ring via emit_cleaner_shader(). Fill 16 dwords to match the allocation. v2: Use ib_size_dw variable (Lijo) Fixes: d361ad5d2fc0 ("drm/amdgpu: Add sysfs interface for running cleaner shader") Suggested-by: Lijo Lazar Signed-off-by: Asad Kamal Reviewed-by: Lijo Lazar Signed-off-by: Alex Deucher
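The size mismatch described above can be modeled in a few lines of standalone C. This is a sketch under assumptions, not the driver code: `fill_ib_nops()`, `buggy_fill_count()` and `NOP_DWORD` are hypothetical, and only the numbers (16 dwords allocated, align_mask of 0xff) are taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IB_SIZE_DW 16u		/* allocation size from the commit message */
#define NOP_DWORD 0x12345678u	/* placeholder value, not a real packet */

/* Fill exactly ib_size_dw dwords, so the loop bound matches the allocation. */
static size_t fill_ib_nops(uint32_t *ib, size_t ib_size_dw, uint32_t nop)
{
	size_t i;

	for (i = 0; i < ib_size_dw; i++)
		ib[i] = nop;
	return i;	/* number of dwords written */
}

/* The old loop bound: derived from ring alignment, not from the IB size. */
static size_t buggy_fill_count(uint32_t align_mask)
{
	return (size_t)align_mask + 1;
}
```

For align_mask = 0xff the old bound is 256 dwords, or 1024 bytes, written into a 64-byte buffer. Tying the loop to the same variable that sized the allocation removes the mismatch.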

drm/amdgpu/gfx: defer per-queue helper_end until after MES resume

2026-06-17T20:11:54+00:00

amdgpu_gfx_reset_mes_compute() runs amdgpu_mes_suspend(adev, 0) to quiesce all gangs, resets the offending queue(s), then resumes. The existing amdgpu_gfx_mes_reset_queue() called amdgpu_ring_reset_helper_end() right after unmap/restore/map of the reset queue, which re-emits backed-up commands and rings the doorbell. That doorbell hits a still-suspended CP: on the subsequent resume the queue partially wedges -- the first new IB after the reset may execute but later submissions stall, which surfaces as repeated timeouts on the same ring under concurrent workloads. Split out amdgpu_gfx_mes_reset_queue_start() (backup + MES reset + unmap/restore/map only) and defer helper_end. amdgpu_gfx_reset_mes_compute() collects the (ring, fence) pair for every queue it resets and runs helper_end on each after amdgpu_mes_resume(), so the re-emit doorbells land on a running CP. amdgpu_gfx_reset_mes_kcq() now reports the matched ring/fence back to the caller for the same reason. Reviewed-by: Alex Deucher Signed-off-by: Jesse Zhang Signed-off-by: Alex Deucher
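The ordering change can be illustrated with a small standalone model. Everything in it is hypothetical (the `model_*` names, the boolean "suspended" flag, the fixed-size pending array); it only mirrors the sequence described above: suspend, reset each queue while collecting (ring, fence) pairs, resume, and only then run the helper_end step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 8

struct model_ring {
	bool reemitted;			/* backed-up commands re-emitted */
	bool doorbell_while_suspended;	/* the failure mode being avoided */
};

struct model_pending {
	struct model_ring *ring;
	unsigned int fence;
};

struct model_cp {
	bool suspended;
};

/* Stand-in for backup + reset + unmap/restore/map; rings no doorbell. */
static void model_reset_queue_start(struct model_ring *ring)
{
	ring->reemitted = false;
	ring->doorbell_while_suspended = false;
}

/* Stand-in for helper_end: re-emit and ring the doorbell. */
static void model_helper_end(const struct model_cp *cp,
			     struct model_ring *ring, unsigned int fence)
{
	(void)fence;	/* carried along only to mirror the (ring, fence) pair */
	if (cp->suspended)
		ring->doorbell_while_suspended = true;
	ring->reemitted = true;
}

static size_t model_reset_compute(struct model_cp *cp,
				  struct model_ring **hung,
				  const unsigned int *fences, size_t n)
{
	struct model_pending pending[MAX_PENDING];
	size_t count = 0, i;

	assert(n <= MAX_PENDING);
	cp->suspended = true;			/* quiesce all gangs */
	for (i = 0; i < n; i++) {
		model_reset_queue_start(hung[i]);
		pending[count].ring = hung[i];	/* remember for later */
		pending[count].fence = fences[i];
		count++;
	}
	cp->suspended = false;			/* resume */
	for (i = 0; i < count; i++)		/* doorbells land on a running CP */
		model_helper_end(cp, pending[i].ring, pending[i].fence);
	return count;
}
```

Calling `model_helper_end()` directly while the model is suspended sets `doorbell_while_suspended`, which corresponds to the pre-fix ordering.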

drm/amdgpu: Remove faulty queue before resume

2026-06-17T19:51:36+00:00

When the driver already knows of a bad queue, but MES suspend_all succeeds and MES hung queue detection doesn't detect it, remove this queue before resume_all. Signed-off-by: Amber Lin Reviewed-by: Alex Deucher Signed-off-by: Alex Deucher

drm/amdgpu/gfx: add a common helper to handle MES compute resets

2026-06-17T19:51:35+00:00

Add helpers to handle MES compute queue resets when multiple queues are affected. Can be used by both KGD and KFD. v2: squash in updates v3: squash in userq updates Co-developed-by: Jesse Zhang Co-developed-by: Amber Lin Signed-off-by: Amber Lin Signed-off-by: Jesse Zhang Reviewed-by: Jesse Zhang Signed-off-by: Alex Deucher

drm/amdgpu: Use a common KGQ and KCQ reset helper for gfx11/12

2026-06-17T19:51:35+00:00

They are all the same so use a common implementation. Reviewed-by: Jesse Zhang Signed-off-by: Alex Deucher