git.baikalelectronics.ru Git - kernel.git/commit

author	Jack Zhang <Jack.Zhang1@amd.com>
	Mon, 8 Mar 2021 04:41:27 +0000 (12:41 +0800)
committer	Alex Deucher <alexander.deucher@amd.com>
	Fri, 9 Apr 2021 20:45:45 +0000 (16:45 -0400)
commit	17e2c642941d14b0342a02d33e4d3a4fd13a93dd
tree	589404ebe414ab0b537e4bbfe5388dce0e444557
parent	1345df5f12a44c2914e8a61723528c88a302ff64

drm/amd/amdgpu implement tdr advanced mode

[Why]
The previous tdr design treats the first job in job_timeout as the bad
job. However, a later bad compute job can sometimes block a good gfx job
and cause an unexpected gfx job timeout, because the gfx and compute
rings share internal GC HW.

[How]
This patch implements an advanced tdr mode. It adds an additional
synchronous pre-resubmit step (Step0 Resubmit) before the normal
resubmit step in order to find the real bad job.

1. At the Step0 Resubmit stage, it synchronously submits the first job
and waits for it to be signaled. If the wait times out, we identify that
job as guilty and do a hw reset. After that, we do the normal resubmit
step to resubmit the remaining jobs.

2. For a whole gpu reset (vram lost), resubmit the old way.
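
The two cases above can be sketched as a small userspace simulation.
This is only an illustrative model, not the code from this patch: every
name in it (sim_job, sim_ring, sim_submit_and_wait, sim_recover, and so
on) is hypothetical, the fence wait is reduced to a boolean, and it
assumes that Step0 is applied to the first pending job of each ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job model; not the kernel's struct drm_sched_job. */
struct sim_job {
	bool hangs;       /* job never signals its fence when run */
	bool signaled;    /* set when Step0 saw the job complete */
	bool guilty;      /* set when Step0 saw the job time out */
	bool resubmitted; /* set by the normal resubmit step */
};

/* Hypothetical ring model: jobs still pending on one ring, oldest first. */
struct sim_ring {
	struct sim_job *jobs;
	size_t njobs;
};

static int sim_hw_resets; /* counts simulated hw resets */

/* Stand-in for "submit synchronously, then wait with a timeout".
 * Returns true if the job signaled in time. */
static bool sim_submit_and_wait(const struct sim_job *job)
{
	return !job->hangs;
}

/* Step0 Resubmit: run the first pending job of each ring on its own.
 * A job that times out here is treated as the real bad job.
 * Returns the number of guilty jobs found. */
static int sim_step0_resubmit(struct sim_ring *rings, size_t nrings)
{
	int found = 0;

	for (size_t i = 0; i < nrings; i++) {
		if (rings[i].njobs == 0)
			continue;

		struct sim_job *first = &rings[i].jobs[0];

		if (sim_submit_and_wait(first)) {
			first->signaled = true;
		} else {
			first->guilty = true;
			sim_hw_resets++; /* reset again to clear the hang */
			found++;
		}
	}
	return found;
}

/* Normal resubmit: queue every job that Step0 did not already settle. */
static void sim_normal_resubmit(struct sim_ring *rings, size_t nrings)
{
	for (size_t i = 0; i < nrings; i++) {
		for (size_t j = 0; j < rings[i].njobs; j++) {
			struct sim_job *job = &rings[i].jobs[j];

			if (!job->guilty && !job->signaled)
				job->resubmitted = true;
		}
	}
}

/* Recovery after the initial reset. With vram lost, Step0 is skipped
 * and all jobs are resubmitted the old way. */
static int sim_recover(struct sim_ring *rings, size_t nrings, bool vram_lost)
{
	int found = 0;

	assert(rings != NULL || nrings == 0);

	if (!vram_lost)
		found = sim_step0_resubmit(rings, nrings);
	sim_normal_resubmit(rings, nrings);
	return found;
}
```

With one gfx ring holding a good job and one compute ring whose first
job hangs, sim_recover() with vram_lost set to false marks only the
compute job as guilty, which is the case the [Why] section describes.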

v2: squash in build fix (Alex)

Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
drivers/gpu/drm/scheduler/sched_main.c
include/drm/gpu_scheduler.h