Commit 62b5b96
BUG/MINOR: sched: properly account for the CPU time of dying tasks
When task profiling is enabled, the scheduler can measure and report the cumulated time spent in each task and their respective latencies. But this was wrong for tasks with few wakeups as well as for self-waking ones, because the call date needed to measure how long the task takes to process is retrieved from the task itself (->wake_date was turned into the call date), and we could face two conditions:

  - a new wakeup while the task is executing would reset the ->wake_date
    field before returning, causing abnormally low values to be reported;
    that was likely the case for task_run_applet for self-waking applets;

  - when the task dies, NULL is returned and the call date could not be
    retrieved, so that CPU time was not accounted for at all. This was
    particularly visible with process_stream(), which is usually called
    only twice per request and whose reported time was systematically
    halved.

The cleanest solution here is to keep in mind that the scheduler already uses quite a bit of local context in th_ctx, and to place the intermediary values there so that they cannot vanish. The wake_date has to be reset immediately once read, and only its copy is used along the function. Note that this must be done for both tasks and tasklets, and that until recently tasklets were also able to report wrong values due to their sole dependency on TH_FL_TASK_PROFILING between tests.

One nice benefit for future improvements is that such information will now be available from the task without having to be stored in the task itself anymore.

Since the tasklet part was computed with wrapping 32-bit arithmetic while the task one used 64-bit, the values have now been consistently moved to 32-bit, as that is already largely sufficient (4s spent in a task is more than twice what the watchdog would tolerate). Some further cleanups might be necessary, but the patch aims to stay minimal.
Task profiling output after 1 million HTTP requests previously looked like this:

  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    2012338    4.850s   2.410us    12.91s   6.417us
    process_stream              2000136    9.594s   4.796us    34.26s   17.13us
    sc_conn_io_cb               2000135    1.973s   986.0ns    30.24s   15.12us
    h1_timeout_task                 137         -         -   2.649ms   19.34us
    accept_queue_process             49   152.3us   3.107us   321.7yr   6.564yr
    main+0x146430                     7   5.250us   750.0ns   25.92us   3.702us
    srv_cleanup_idle_conns            1   559.0ns   559.0ns   918.0ns   918.0ns
    task_run_applet                   1         -         -   2.162us   2.162us

Now it looks like this:

  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    2014194    4.794s   2.380us    13.75s   6.826us
    process_stream              2000151    20.01s   10.00us    36.04s   18.02us
    sc_conn_io_cb               2000148    2.167s   1.083us    32.27s   16.13us
    h1_timeout_task                 198   54.24us   273.0ns   3.487ms   17.61us
    accept_queue_process             52   158.3us   3.044us   409.9us   7.882us
    main+0x1466e0                    18   16.77us   931.0ns   63.98us   3.554us
    srv_cleanup_toremove_conns        8   282.1us   35.26us   546.8us   68.35us
    srv_cleanup_idle_conns            3   149.2us   49.73us   8.131us   2.710us
    task_run_applet                   3   268.1us   89.38us   11.61us   3.871us

Note the two-fold difference on process_stream().

This feature is essentially used for debugging, so it has extremely limited impact. However, it is used quite a bit more in bug reports, and it would be desirable that at least 2.6 gets this fix backported. It depends on at least these two previous patches, which will then also have to be backported:

  MINOR: task: permanently enable latency measurement on tasklets
  CLEANUP: task: rename ->call_date to ->wake_date
Parent: 04e50b3

3 files changed: +40, -34 lines

include/haproxy/task-t.h

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@ struct task {
 	int expire;          /* next expiration date for this task, in ticks */
 	short nice;          /* task prio from -1024 to +1024 */
 	short tid;           /* TID where it's allowed to run, <0 if anywhere */
-	uint64_t wake_date;  /* date of the last task wakeup */
+	uint32_t wake_date;  /* date of the last task wakeup */
 	uint64_t lat_time;   /* total latency time experienced */
 	uint64_t cpu_time;   /* total CPU time consumed */
 };

include/haproxy/tinfo-t.h

Lines changed: 3 additions & 0 deletions
@@ -129,6 +129,9 @@ struct thread_ctx {
 	uint flags;                 /* thread flags, TH_FL_*, atomic! */
 	/* 32-bit hole here */

+	uint32_t sched_wake_date;   /* current task/tasklet's wake date or 0 */
+	uint32_t sched_call_date;   /* current task/tasklet's call date (valid if sched_wake_date > 0) */
+
 	uint64_t prev_cpu_time;     /* previous per thread CPU time */
 	uint64_t prev_mono_time;    /* previous system wide monotonic time */

src/task.c

Lines changed: 36 additions & 33 deletions
@@ -564,19 +564,19 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 		_HA_ATOMIC_DEC(&th_ctx->rq_total);

 		if (t->state & TASK_F_TASKLET) {
-			uint64_t before = 0;
-
 			LIST_DEL_INIT(&((struct tasklet *)t)->list);
 			__ha_barrier_store();

-			if (unlikely(_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_TASK_PROFILING)) {
-				profile_entry = sched_activity_entry(sched_activity, t->process);
-				before = now_mono_time();
+			th_ctx->sched_wake_date = ((struct tasklet *)t)->wake_date;
+			if (th_ctx->sched_wake_date) {
+				uint32_t now_ns = now_mono_time();
+				uint32_t lat = now_ns - th_ctx->sched_wake_date;

-				if (((struct tasklet *)t)->wake_date) {
-					HA_ATOMIC_ADD(&profile_entry->lat_time, (uint32_t)(before - ((struct tasklet *)t)->wake_date));
-					((struct tasklet *)t)->wake_date = 0;
-				}
+				((struct tasklet *)t)->wake_date = 0;
+				th_ctx->sched_call_date = now_ns;
+				profile_entry = sched_activity_entry(sched_activity, t->process);
+				HA_ATOMIC_ADD(&profile_entry->lat_time, lat);
+				HA_ATOMIC_INC(&profile_entry->calls);
 			}

 			state = _HA_ATOMIC_FETCH_AND(&t->state, TASK_PERSISTENT);
@@ -593,10 +593,8 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 				continue;
 			}

-			if (unlikely(_HA_ATOMIC_LOAD(&th_ctx->flags) & TH_FL_TASK_PROFILING)) {
-				HA_ATOMIC_INC(&profile_entry->calls);
-				HA_ATOMIC_ADD(&profile_entry->cpu_time, now_mono_time() - before);
-			}
+			if (th_ctx->sched_wake_date)
+				HA_ATOMIC_ADD(&profile_entry->cpu_time, (uint32_t)(now_mono_time() - th_ctx->sched_call_date));

 			done++;
 			th_ctx->current = NULL;
@@ -607,6 +605,21 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 		LIST_DEL_INIT(&((struct tasklet *)t)->list);
 		__ha_barrier_store();

+		th_ctx->sched_wake_date = t->wake_date;
+		if (unlikely(t->wake_date)) {
+			uint32_t now_ns = now_mono_time();
+			uint32_t lat = now_ns - t->wake_date;
+
+			t->lat_time += lat;
+			t->wake_date = 0;
+			th_ctx->sched_call_date = now_ns;
+			profile_entry = sched_activity_entry(sched_activity, t->process);
+			HA_ATOMIC_ADD(&profile_entry->lat_time, lat);
+			HA_ATOMIC_INC(&profile_entry->calls);
+		}
+
+		__ha_barrier_store();
+
 		/* We must be the exclusive owner of the TASK_RUNNING bit, and
 		 * have to be careful that the task is not being manipulated on
 		 * another thread finding it expired in wake_expired_tasks().
@@ -627,18 +640,6 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 		/* OK then this is a regular task */

 		_HA_ATOMIC_DEC(&ha_thread_ctx[tid].tasks_in_list);
-		if (unlikely(t->wake_date)) {
-			uint64_t now_ns = now_mono_time();
-			uint64_t lat = now_ns - t->wake_date;
-
-			t->lat_time += lat;
-			t->wake_date = now_ns;
-			profile_entry = sched_activity_entry(sched_activity, t->process);
-			HA_ATOMIC_ADD(&profile_entry->lat_time, lat);
-			HA_ATOMIC_INC(&profile_entry->calls);
-		}
-
-		__ha_barrier_store();

 		/* Note for below: if TASK_KILLED arrived before we've read the state, we
 		 * directly free the task. Otherwise it will be seen after processing and
@@ -661,18 +662,20 @@ unsigned int run_tasks_from_lists(unsigned int budgets[])
 		}
 		th_ctx->current = NULL;
 		__ha_barrier_store();
+
+		/* stats are only registered for non-zero wake dates */
+		if (unlikely(th_ctx->sched_wake_date)) {
+			uint32_t cpu = (uint32_t)now_mono_time() - th_ctx->sched_call_date;
+
+			if (t)
+				t->cpu_time += cpu;
+			HA_ATOMIC_ADD(&profile_entry->cpu_time, cpu);
+		}
+
 		/* If there is a pending state we have to wake up the task
 		 * immediately, else we defer it into wait queue
 		 */
 		if (t != NULL) {
-			if (unlikely(t->wake_date)) {
-				uint64_t cpu = now_mono_time() - t->wake_date;
-
-				t->cpu_time += cpu;
-				t->wake_date = 0;
-				HA_ATOMIC_ADD(&profile_entry->cpu_time, cpu);
-			}
-
 			state = _HA_ATOMIC_LOAD(&t->state);
 			if (unlikely(state & TASK_KILLED)) {
 				task_unlink_wq(t);