drm/xe/guc: Plumb GuC-capture into dev coredump

When we decide to kill a job (from guc_exec_queue_timedout_job), we can be
in one of 4 possible scenarios at the starting point of this decision:

1. the guc-captured register-dump is already there.
2. the driver is wedged.mode > 1, so GuC-engine-reset / GuC-err-capture
   will not happen.
3. the user has started the driver in execlist-submission mode.
4. the guc-captured register-dump is not ready yet so we force GuC to kill
   that context now, but:
   A. we don't know yet if GuC will be successful on the engine-reset and
      get the guc-err-capture; if it is not, kmd will do a manual reset
      later, OR
   B. guc will be successful and we will get a guc-err-capture shortly.

So to accommodate scenarios 2 and 4A, we need to do a manual KMD capture
first (which is not reliable in guc-submission mode) and decide later
whether we need to use it for cases 2 or 4A. This flow is part of the
implementation in this patch.

Provide xe_guc_capture_get_reg_desc_list to get the register descriptor
list. Add manual capture by reading from the hw engine if GuC capture is
not ready. If it becomes ready at a later time, GuC-sourced data will be
used.

Although there may only be a small delay between (1) the check for whether
guc-err-capture is available at the start of guc_exec_queue_timedout_job
and (2) the decision on using a valid guc-err-capture or manual-capture,
let's not take any chances and lock the matching node down so it doesn't
get reclaimed if the GuC-Err-Capture subsystem is running out of
pre-cached nodes.

Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241004193428.3311145-6-zhanjun.dong@intel.com
author: Zhanjun Dong <zhanjun.dong@intel.com> 2024-10-04 12:34:27 -0700
committer: Matt Roper <matthew.d.roper@intel.com> 2024-10-08 09:39:58 -0700
commit: ecb6336463911d6eb684998754f8701d0f437f18 (patch)
tree: 80ef747b7088e9183142cd5c864fe6ff81857830 /drivers/gpu/drm/xe/xe_hw_engine.h
parent: 8bfc496327ce0f3bd02445048e3a70cc97accc6d (diff)
1 files changed, 2 insertions, 2 deletions
diff --git a/drivers/gpu/drm/xe/xe_hw_engine.h b/drivers/gpu/drm/xe/xe_hw_engine.h
index 022819a4a8eb..c2428326a366 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine.h
+++ b/drivers/gpu/drm/xe/xe_hw_engine.h
@@ -11,6 +11,7 @@
 struct drm_printer;
 struct drm_xe_engine_class_instance;
 struct xe_device;
+struct xe_sched_job;
 
 #ifdef CONFIG_DRM_XE_JOB_TIMEOUT_MIN
 #define XE_HW_ENGINE_JOB_TIMEOUT_MIN CONFIG_DRM_XE_JOB_TIMEOUT_MIN
@@ -54,9 +55,8 @@ void xe_hw_engine_handle_irq(struct xe_hw_engine *hwe, u16 intr_vec);
 void xe_hw_engine_enable_ring(struct xe_hw_engine *hwe);
 u32 xe_hw_engine_mask_per_class(struct xe_gt *gt,
 				enum xe_engine_class engine_class);
-
 struct xe_hw_engine_snapshot *
-xe_hw_engine_snapshot_capture(struct xe_hw_engine *hwe);
+xe_hw_engine_snapshot_capture(struct xe_hw_engine *hwe, struct xe_sched_job *job);
 void xe_hw_engine_snapshot_free(struct xe_hw_engine_snapshot *snapshot);
 void xe_hw_engine_snapshot_print(struct xe_hw_engine_snapshot *snapshot,
 				 struct drm_printer *p);
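
The capture-source decision described in the commit message can be sketched in plain C as follows. This is only an illustrative model, not code from the patch: `struct timeout_state`, `need_manual_capture()` and `pick_capture_source()` are hypothetical names that do not exist in the xe driver, and the real flow also involves locking the matching GuC capture node, which is not modeled here.

```c
#include <stdbool.h>

/* Hypothetical model of what is known when the job timeout fires. */
enum capture_source { CAPTURE_GUC, CAPTURE_MANUAL };

struct timeout_state {
	bool guc_capture_ready;	/* scenario 1: GuC dump already present */
	int wedged_mode;	/* scenario 2 when > 1 */
	bool execlist_mode;	/* scenario 3 */
};

/* GuC can only deliver a capture in GuC submission with wedged mode <= 1. */
bool guc_can_capture(const struct timeout_state *s)
{
	return !s->execlist_mode && s->wedged_mode <= 1;
}

/* Step 1: take a manual KMD capture up front unless GuC data is ready. */
bool need_manual_capture(const struct timeout_state *s)
{
	return !(guc_can_capture(s) && s->guc_capture_ready);
}

/*
 * Step 2: decide later which data to report. GuC-sourced data wins if it
 * was ready at the start or arrived afterwards (scenario 4B); otherwise
 * fall back to the manual capture (scenarios 2, 3 and 4A).
 */
enum capture_source pick_capture_source(const struct timeout_state *s,
					bool guc_capture_arrived_later)
{
	if (guc_can_capture(s) &&
	    (s->guc_capture_ready || guc_capture_arrived_later))
		return CAPTURE_GUC;
	return CAPTURE_MANUAL;
}
```

In this model, scenario 4 is the case where `need_manual_capture()` returns true while `guc_can_capture()` is also true; whether it resolves to 4A or 4B is only known once the `guc_capture_arrived_later` input is available.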