summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2026-06-03drm/amdgpu: Add size guard before copy discovery binaryFeifei Xu
Fix the firmware blob copied into fixed-size buffer without length check. Signed-off-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Le Ma <le.ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-06-03drm/amd: Fix amdgpu_device_find_parent()Mario Limonciello
commit eb53125a7ad9 ("drm/amd: Add dedicated helper for amdgpu_device_find_parent()") created a dedicated helper to find the parent device outside of the dGPU but it had a logic error that caused it to walk all the way up the topology and return the wrong device. Break out of the loop when the device is found. Reviewed-by: Alexander Deucher <alexander.deucher@amd.com> Fixes: eb53125a7ad9 ("drm/amd: Add dedicated helper for amdgpu_device_find_parent()") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-06-03drm/amd/display: Enable DM for DCN 4.2.1Matthew Stewart
[Why & How] Add DM IP block to amdgpu_discovery Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Matthew Stewart <Matthew.Stewart2@amd.com> Signed-off-by: Ray Wu <ray.wu@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-06-03Merge tag 'amd-drm-next-7.2-2026-05-29' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-7.2-2026-05-29: amdgpu: - GEM_OP warning fix - GEM_OP locking fix - Userq fixes - DCN 2.1 refclk fix - SI fixes - HMM fixes - Add DC KUNIT tests - UML fixes - Switch to system_dfl_wq - Old DC power state cleanup - RAS fixes amdkfd: - svm_range_set_attr locking fix - CRIU restore fix - KFD debugger fix radeon: - Use struct drm_edid instead of struct edid Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patch.msgid.link/20260529214346.2328355-1-alexander.deucher@amd.com
2026-05-28drm/amdgpu: fix amdgpu_vm_bo_reset_state_machineChristian König
Can't splice the list but need to handle each entry individually. Otherwise we run into issues after a GPU reset. Fixes: 4cdbba5a16aa ("drm/amdgpu: restructure VM state machine v4") Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Jesse Zhang <jesse.zhang@amd.com> Tested-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-28Merge drm/drm-next into drm-misc-nextThomas Zimmermann
Backmerging to get GEM LRU fixes from commit 379e8f1c ("drm/gem: Make the GEM LRU lock part of drm_device") and other updates from v7.1-rc5. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2026-05-27drm/amdgpu: fix calling VM invalidation in amdgpu_hmm_invalidate_gfxChristian König
Otherwise we don't invalidate page tables on next CS. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: fix amdgpu_hmm_range_get_pagesChristian König
The notifier sequence must only be read once or otherwise we could work with invalid pages. While at it also fix the coding style, e.g. drop the pre-initialized return value and use the common define for 2G range. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Tested-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: init locals in umc_v12_0_convert_error_addressStanley.Yang
row, col, col_lower, row_lower, row_high and bank could be read on code paths that never assign them. Initialize them to 0. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: use array instead of list for userq_vasSunil Khatri
Use arrays instead of list for userq_vas since we have fixed no of bos. Also, we dont have to worry to free that memory later since this array would be free along with queue only. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: move mqd_destroy to later stage to keep core obj validSunil Khatri
mqd_destroy cleans up queue core objects like mqd and fw_object which are needed for any pending fence to signal properly. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amd: Add dedicated helper for amdgpu_device_find_parent()Mario Limonciello
There are a few cases that code walks up the topology to find the link partner of the integrated switch in a dGPU. Split this out to a helper and call in all places. This does have a functional change that amdgpu_device_gpu_bandwidth() doesn't cache the internal link but only the parent. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: remove amdgpu_userq_create/destroy_object wrapperSunil Khatri
Remove the amdgpu_userq_create/destroy_object wrappers and use directly the kernel bo allocation function which does all the things which are done in wrapper. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: bound SR-IOV RAS CPER dump parsing against used_sizeChenglei Xie
The VF copies a PF-provided CPER telemetry blob and walks records using cper_dump->count and each entry's record_length. count is u64 while the loop used u32, so a large count could loop indefinitely. record_length was not limited to the kmemdup'd region, so the first iteration could read far past the allocation; record_length == 0 could spin forever on the same entry. Together that allowed a malicious hypervisor to leak heap past the blob into the CPER ring or hang the guest. Require used_size to cover the fixed header before buf and stay within the telemetry cap. Track remaining bytes in buf, cap iterations with u64 and CPER_MAX_ALLOWED_COUNT, and reject record_length outside [sizeof(cper_hdr), remaining] before writing to the ring. Signed-off-by: Chenglei Xie <Chenglei.Xie@amd.com> Reviewed-by: YiPeng Chai <YiPeng.Chai@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: fix potential overflow in fs_info.debugfs_nameStanley.Yang
Use snprintf() with sizeof(fs_info.debugfs_name) so a long RAS block name plus the "_err_inject" suffix cannot overflow the 32-byte buffer. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: make sure queue is valid in the hang_detect_workSunil Khatri
Thread 1: Running amdgpu_userq_destroy which eventually remove the queue from door bell and set userq_mgr = NULL. Thread2: An interrupt might have scheduled the hang_detect_work which still need userq_mgr to be valid but could get an NULL ptrs. To fix that make sure we cancel the hang_detect_work again before setting userq_mgr to NULL. Along with that we also need all the queue va to remain valid till we could be running anything on the queue and hence moving the userq_va post hang_detect handler is cancelled. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: reserve root bo without interruptionSunil Khatri
Fix the code to make it an uninterruptible reservation for root bo. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: add amdgpu_bo_unpin when amdgpu_ttm_alloc_gart failsSunil Khatri
Unpin the wptr_obj->obj when amdgpu_ttm_alloc_gart fails. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: simplify return value in amdgpu_userq_get_doorbell_indexSunil Khatri
amdgpu_userq_get_doorbell_index returns a uint64 type index as well as a int type failure values. Simplifying this and using a int type return value and getting the index in input pointer of type uint64 type. Also since it's used at once place making it static would be better. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: Fix the mutex_init cleanup for fence_drv_lockSunil Khatri
mutex fence_drv_lock is destroyed in amdgpu_userq_fence_driver_free also in one of the jump condition mutex_destroy is also called leading to double mutex_destroy. So rearranging the code so amdgpu_userq_fence_driver_free takes care of the clean up along with mutex_destroy. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu/userq: Fix doorbell object cleanup of queueSunil Khatri
Unpin and unref the door bell obj if queue creation fails before initialization is complete. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: Replace use of system_unbound_wq with system_dfl_wqMarco Crivellari
This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: check num_entries in GEM_OP GET_MAPPING_INFOZiyi Guo
kvcalloc(args->num_entries, sizeof(*vm_entries), GFP_KERNEL) at amdgpu_gem.c:1050 uses the user-supplied num_entries directly without any upper bounds check. Since num_entries is a __u32 and sizeof(drm_amdgpu_gem_vm_entry) is 32 bytes, a large num_entries produces an allocation exceeding INT_MAX, triggering WARNING in __kvmalloc_node_noprof(), causing a kernel WARNING, TAINT_WARN, and panic on CONFIG_PANIC_ON_WARN=y systems. Add a size bounds check before we invoke the kvzalloc() to reject oversized num_entries early with -EINVAL. Fixes: 4d82724f7f2b ("drm/amdgpu: Add mapping info option for GEM_OP ioctl") Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-27drm/amdgpu: fix lock leak on ENOMEM in AMDGPU_GEM_OP_GET_MAPPING_INFOMichael Bommarito
The AMDGPU_GEM_OP_GET_MAPPING_INFO branch of amdgpu_gem_op_ioctl() holds three cleanup-tracked resources before calling kvcalloc(): the drm_gem_object reference from drm_gem_object_lookup(), the drm_exec lock on the looked-up GEM via drm_exec_lock_obj(), and the drm_exec lock on the per-process VM root page directory via amdgpu_vm_lock_pd(). All three are released by the out_exec label that every other error path in this function jumps to. The kvcalloc() failure path returns -ENOMEM directly, skipping out_exec and leaking all three. The leaked per-process VM root PD dma_resv lock is the load-bearing leak: any subsequent operation on the same VM (further GEM ops, command-submission, eviction, TTM shrinker callbacks) blocks on the held lock. DRM_IOCTL_AMDGPU_GEM_OP is DRM_AUTH | DRM_RENDER_ALLOW, so this is an unprivileged-local denial of service against the caller's GPU context, reachable by any process with /dev/dri/renderD* access. Route the failure through out_exec so drm_exec_fini() and drm_gem_object_put() run. Reproduced on stock 7.0.0-10, Ryzen 7 5700U / Radeon Vega (Lucienne): the failing ioctl returns -ENOMEM and a second GET_MAPPING_INFO on the same fd then blocks in drm_exec_lock_obj() on the leaked dma_resv. SIGKILL on the caller does not reap the task; the fd-release path during process exit goes through amdgpu_gem_object_close() -> drm_exec_prepare_obj() on the same lock, leaving the task in D state until the box is rebooted. The patched kernel was not rebuilt and re-tested on this hardware; the fix is mechanical. Tested on a single Lucienne / Vega box only. Ziyi Guo posted an independent INT_MAX-bound check for args->num_entries in the same branch [1]; the two patches are complementary and can land in either order. Fixes: 4d82724f7f2b ("drm/amdgpu: Add mapping info option for GEM_OP ioctl") Link: https://lore.kernel.org/all/20260208000255.4073363-1-n7l8m4@u.northwestern.edu/ # [1] Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-25drm/exec, drm/xe, drm/amdgpu: Add an accessor for struct drm_exec::ticketThomas Hellström
Drivers were accessing this drm_exec member directly. While that may seem harmless, it will require action if the drm_exec utility is made a subclass of a dma-resv transaction utility as outlined in the cover-letter. Provide an accessor, drm_exec_ticket() to avoid that. v2: - Fix amdgpu compile error (Intel CI) - Update the commit message. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patch.msgid.link/20260520101616.41284-5-thomas.hellstrom@linux.intel.com
2026-05-25drm/exec: Remove the index parameter from drm_exec_for_each_locked_obj[_reverse]Thomas Hellström
Nobody makes any use of it. Possible internal future users can instead use the _index variable. External users shouldn't use it since the array it's pointing into is internal drm_exec state. v2: - Use a unique id for the loop variable (Christian) Assisted-by: GitHub Copilot:claude-sonnet-4.6 Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patch.msgid.link/20260520101616.41284-2-thomas.hellstrom@linux.intel.com
2026-05-19drm/amdgpu: restructure VM state machine v4Christian König
Instead of coming up with more sophisticated names for states a VM BO can be in, group them by the type of BO first and then by the state. So we end with BO type kernel, always_valid and individual and then states evicted, moved and idle. Not much functional change, except that evicted_user is moved back together with the other BOs again which makes the handling in amdgpu_vm_validate() a bit more complex. Also fixes a problem with user queues and amdgpu_vm_ready(). We didn't considered the VM ready when user BOs were not ideally placed, harmless performance impact for kernel queues but a complete show stopper for userqueues. v2: fix a few typos in comments, rename the BO types to make them more descriptive, fix a couple of bugs found during testing v3: squashed together with revert to old status lock handling, looks like the first patch still had some bug which this one here should fix. Fix a missing lock around debugfs printing. v4: fix merge clash pointed out by Prike Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: check and drop invalid bad page recordsYiPeng Chai
Check and drop invalid bad page records. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: validate and share PSP fw_pri_buf copies via psp_copy_fwCandice Li
Change psp_copy_fw from void to int: return -ENODEV when drm_dev_enter fails, and -EINVAL when the image size is zero or larger than the 1 MiB PSP private buffer. Replace open-coded memset/memcpy into fw_pri_buf with psp_copy_fw. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: fix handling in amdgpu_userq_createChristian König
Well mostly the same issues the other code had as well: 1. Memory allocation while holding the userq_mutex lock is forbidden! 2. Things were created/started/published in the wrong order. 3. The reset lock was taken in the wrong order and seems to be unecessary in the first place. 4. Error messages on invalid input parameters can spam the logs. 5. Error messages on memory allocation failures are usually superflous as well. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Prike Liang <Prike.Liang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: add first record offset checkGangliang Xie
check the upper and lower limits of first record offset Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: Bound GPIO I2C table entry count from VBIOSCandice Li
Reject undersized tables and cap the derived entry count to AMDGPU_MAX_I2C_BUS so we do not overrun adev->i2c_bus[] or walk an absurd number of entries on corrupt size fields. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: cap ATOM command table nesting depthCandice Li
Cap nesting at 32 levels with execute_depth and return -ELOOP when exceeded. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: Fix memory leak of i2s_pdata in ACP initializationCe Sun
Currently, the i2s_pdata structure is dynamically allocated in acp_hw_init() but never freed in both the error handling path and the acp_hw_fini() cleanup path, causing a permanent memory leak. Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amd/ras: Fix SMU EEPROM record field decodingXiang Liu
The SMU EEPROM read paths pass byte-sized record field addresses to mca_ipid_parse(), whose outputs are u32 pointers. Writing through those widened pointers can clobber adjacent fields and bytes beyond the record storage. Parse the IPID values into local u32 temporaries instead, then explicitly narrow the values when storing them in the EEPROM record. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amd/ras: reset CPER ring on corrupt entry sizeXiang Liu
When CPER ring overflow handling advances the read pointer, it trusts the parsed entry size from the current ring contents. Corrupt CPER data can produce an entry size that does not advance rptr after dword conversion and pointer masking. In that case the recovery loop keeps testing the same location while holding the CPER ring mutex. This can hang the worker that is writing the next CPER record. Detect a no-progress rptr update and reset the CPER ring to an empty state instead. This drops the corrupt contents and lets the writer leave the recovery path without spinning. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: userq_va_mapped should remain true once doneSunil Khatri
Multiple queues needs these bo_va objects belonging to the same uq_mgr. So once they are mapped lets not unmap them as at any point of time any of the queues might be using it. Also userq_va_mapped should be a boolean than atomic. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: avoid integer overflow in VA range checkCe Sun
The original addition operation in 64-bit unsigned type may encounter overflow situations. To prevent such issues and safely reject invalid inputs, the check_add_overflow() function is used. Signed-off-by: Ce Sun <cesun102@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amd/ras: Fix UMC error address allocation leakXiang Liu
amdgpu_umc_handle_bad_pages() allocates err_data->err_addr before querying UMC error information. In the direct and firmware query paths, the pointer is reassigned to a fresh allocation before the original buffer is released, so the initial allocation is leaked on each handled event. Free the existing buffer before replacing it in those query paths so the function exit cleanup only owns the active allocation. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-19drm/amdgpu: unmap all user mappings of framebuffer and doorbell before mode1 ↵Yifan Zhang
reset During Mode 1 reset, the ASIC undergoes a reset cycle and becomes temporarily inaccessible via PCIe. Any attempt to access framebuffer or MMIO registers during this window can result in uncompleted PCIe transactions, leading to NMI panics or system hangs. To prevent this, Unmap all of the applications mappings of the framebuffer and doorbell BARs before mode1 reset. Also prevent new mappings from coming in during the reset process. v2: remove inode in kfd_dev (Christian) v3: correct unmap offset (Felix), remove prevent new mappings part to avoid deadlock (Christian) Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amd: Reduce code duplication in runtime PMMario Limonciello
[Why] amdgpu_pmops_runtime_suspend() runs almost the same code that amdgpu_pmops_runtime_idle() runs. That is there is pointless code duplication. [How] Move amdgpu_pmops_runtime_idle() up, extract common code and then call from both functions. No intended functional changes. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu: use atomic operation to achieve lockless serializationSunil Khatri
In amdgpu_seq64_alloc there is a possibility that two difference cores from two separate NODES can try to and could get the same free slot. So this fixes that race here using atomic test_and_set clear operations. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce3: Fix VCE 3 firmware size and offsetsTimur Kristóf
The VCPU BO contains the actual FW at an offset, but it was not calculated into the VCPU BO size. Subtract this from the FW size to make sure there is no out of bounds access. This may fix VM faults when using VCE 3. Cc: John Olender <john.olender@gmail.com> Fixes: e98226221467 ("drm/amdgpu: recalculate VCE firmware BO size") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce2: Fix VCE 2 firmware size and offsetsTimur Kristóf
The VCPU BO contains the actual FW at an offset, but it was not calculated into the VCPU BO size. Subtract this from the FW size to make sure there is no out of bounds access. Additionally, increase the VCE_V2_0_DATA_SIZE to have extra space after the VCE handles. Also increase the data size used for each VCE handle. The FW needs 23744 bytes, use 24K to be safe. This fixes VM faults when using VCE 2. Cc: John Olender <john.olender@gmail.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/work_items/4802 Fixes: e98226221467 ("drm/amdgpu: recalculate VCE firmware BO size") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce1: Stop using amdgpu_vce_resumeTimur Kristóf
The VCE1 firmware works slightly differently and is already loaded by vce_v1_0_load_fw(). It doesn't actually need to call amdgpu_vce_resume(). Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce1: Fix VCE 1 firmware size and offsetsTimur Kristóf
The VCPU BO contains the actual FW at an offset, but it was not calculated into the VCPU BO size. Subtract this from the FW size to make sure there is no out of bounds access. Make sure the stack and data offsets are aligned to the 32K TLB size. Check that the FW microcode actually fits in the space that is reserved for it. Fixes: d4a640d4b9f3 ("drm/amdgpu/vce1: Implement VCE1 IP block (v2)") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce1: Don't repeat GTT MGR node allocationTimur Kristóf
Only allocate entries from the GTT manager when the VCE GTT node is not allocated yet. This prevents the possibility of allocating them multiple times, which causes issues during GPU reset and suspend/resume. Fixes: 71aec08f80e7 ("amdgpu/vce: use amdgpu_gtt_mgr_alloc_entries") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce1: Check if VRAM address is lower than GART.Timur Kristóf
Previously, I had assumed this was not possible so it was OK to not handle it, but now we got a report from a user who has a board that is configured this way. When the VCPU BO is already located in a low 32-bit address in VRAM (eg. when VRAM is mapped to the low address space), don't do the workaround. Fixes: 71aec08f80e7 ("amdgpu/vce: use amdgpu_gtt_mgr_alloc_entries") Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce1: Remove superfluous address checkTimur Kristóf
The same thing is already checked a few lines above. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2026-05-18drm/amdgpu/vce1: Check that the GPU address is < 128 MiBTimur Kristóf
When ensuring the low 32-bit address, make sure it is less than 128 MiB, otherwise the VCE seems to fail to initialize. This seems to be an undocumented limitation of the firmware validation mechanism. Note that in case of VCE1 the BAR address is zero and we can't change it also due to the firmware validator. When programming the mmVCE_VCPU_CACHE_OFFSETn registers, don't AND them with a mask. This is incorrect because the register mask is actually 0x0fffffff and useless because we already ensure the addresses are below the limit. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>