
Commit 93db7d6

Andrey Grodzovsky authored and Terminus-IMRC committed
drm/scheduler: Avoid accessing freed bad job.
Problem: Due to a race between drm_sched_cleanup_jobs in the sched thread and drm_sched_job_timedout in the timeout work, there is a possibility that the bad job is freed while it is still being accessed from the timeout thread.

Fix: Instead of just peeking at the bad job in the mirror list, remove it from the list under lock and put it back later, once we are guaranteed that no race with the main sched thread is possible, which is after the thread is parked.

v2: Lock around processing ring_mirror_list in drm_sched_cleanup_jobs.

v3: Rebase on top of drm-misc-next. v2 is not needed anymore as drm_sched_get_cleanup_job already has a lock there.

v4: Fix comments to reflect latest code in drm-misc.

Signed-off-by: Andrey Grodzovsky <[email protected]>
Reviewed-by: Christian König <[email protected]>
Reviewed-by: Emily Deng <[email protected]>
Tested-by: Emily Deng <[email protected]>
Signed-off-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/342356
1 parent: 3c37926

1 file changed, 27 insertions(+), 0 deletions(-)
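
To make the race described above concrete, here is a minimal userspace C analogue of the pre-patch behaviour. This is a sketch only: struct job, job_lock, cleanup_one_job and peek_bad_job are hypothetical stand-ins (a pthread mutex for job_list_lock, a plain singly linked list for ring_mirror_list), not the DRM scheduler API.

/* Hypothetical userspace analogue of the race; not the real kernel code. */
#include <pthread.h>
#include <stdlib.h>

struct job {
        struct job *next;
        int id;
};

static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
static struct job *job_list;            /* head = oldest job */

/* Sched-thread side (analogue of drm_sched_get_cleanup_job + free_job):
 * pop the oldest finished job and free it. */
static void cleanup_one_job(void)
{
        pthread_mutex_lock(&job_lock);
        struct job *j = job_list;
        if (j)
                job_list = j->next;
        pthread_mutex_unlock(&job_lock);
        free(j);                        /* free(NULL) is a no-op */
}

/* Timeout side, pre-patch: only peek at the head, leaving it linked. */
static struct job *peek_bad_job(void)
{
        pthread_mutex_lock(&job_lock);
        struct job *bad = job_list;     /* job stays on the list ...      */
        pthread_mutex_unlock(&job_lock);
        return bad;                     /* ... so cleanup_one_job() can
                                         * free it while the timeout
                                         * handler still dereferences it:
                                         * a use-after-free.              */
}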

drivers/gpu/drm/scheduler/sched_main.c

@@ -284,10 +284,21 @@ static void drm_sched_job_timedout(struct work_struct *work)
 	unsigned long flags;
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
+
+	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+	spin_lock_irqsave(&sched->job_list_lock, flags);
 	job = list_first_entry_or_null(&sched->ring_mirror_list,
 				       struct drm_sched_job, node);
 
 	if (job) {
+		/*
+		 * Remove the bad job so it cannot be freed by concurrent
+		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
+		 * is parked at which point it's safe.
+		 */
+		list_del_init(&job->node);
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
+
 		job->sched->ops->timedout_job(job);
 
 		/*
@@ -298,6 +309,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 			job->sched->ops->free_job(job);
 			sched->free_guilty = false;
 		}
+	} else {
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
 	}
 
 	spin_lock_irqsave(&sched->job_list_lock, flags);
@@ -369,6 +382,20 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 
 	kthread_park(sched->thread);
 
+	/*
+	 * Reinsert back the bad job here - now it's safe as
+	 * drm_sched_get_cleanup_job cannot race against us and release the
+	 * bad job at this point - we parked (waited for) any in progress
+	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
+	 * now until the scheduler thread is unparked.
+	 */
+	if (bad && bad->sched == sched)
+		/*
+		 * Add at the head of the queue to reflect it was the earliest
+		 * job extracted.
+		 */
+		list_add(&bad->node, &sched->ring_mirror_list);
+
 	/*
	 * Iterate the job list from later to earlier one and either deactive
	 * their HW callbacks or remove them from mirror list if they already
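
In the same hypothetical userspace terms as the sketch above, the pattern this diff implements, unlink the bad job under the lock, park the worker, and only then relink it at the head, can be illustrated as follows. Again, grab_bad_job and reinsert_bad_job are illustrative stand-ins for the kernel's list_del_init() / kthread_park() / list_add() sequence, not real API.

#include <pthread.h>
#include <stdlib.h>

struct job {                            /* same hypothetical type as     */
        struct job *next;               /* above, repeated so this       */
        int id;                         /* sketch compiles on its own    */
};

static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
static struct job *job_list;            /* head = oldest job */

/* Timeout path, post-patch: unlink instead of peeking, so a concurrent
 * cleanup can no longer reach (and free) this job. */
static struct job *grab_bad_job(void)
{
        pthread_mutex_lock(&job_lock);
        struct job *bad = job_list;
        if (bad)
                job_list = bad->next;   /* analogue of list_del_init()   */
        pthread_mutex_unlock(&job_lock);
        return bad;
}

/* Call only once the worker is parked (kthread_park() analogue): put the
 * job back at the head, since it was the earliest one extracted. */
static void reinsert_bad_job(struct job *bad)
{
        if (!bad)
                return;
        pthread_mutex_lock(&job_lock);
        bad->next = job_list;           /* analogue of list_add() at head */
        job_list = bad;
        pthread_mutex_unlock(&job_lock);
}

int main(void)
{
        struct job *j = calloc(1, sizeof(*j));
        if (!j)
                return 1;
        j->id = 1;
        job_list = j;

        struct job *bad = grab_bad_job();  /* timeout handler takes it    */
        /* ... the worker thread would be parked at this point ...        */
        reinsert_bad_job(bad);             /* now safe to relink          */

        free(job_list);                    /* tidy up the sketch          */
        return 0;
}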
