Fine-tuning backpressure clamping, and API cleanups #1442
Conversation
Requested some changes, but the main logic of the PR makes sense to me.
```rust
BackpressureAmount::Saturated => {
    let err = "write queue is saturated";
    error!(self.log, "{err}");
    Err(CrucibleError::IoError(err.to_owned()))
}
```
I wonder if this should have a separate CrucibleError variant, but maybe that only makes sense if there's some meaningful NVMe error type that Propolis can return to the guest?
I'm concerned about returning errors rather than retrying forever (albeit with longer and longer delays), so I thought I would at least document my concerns to allow for more discussion around them.

At some point we don't have the memory to keep taking IOs in from the guest, so I guess we can't just keep queueing forever. That being said, what would it take for a guest to generate enough traffic to get us saturated? I suppose it all depends on how fast we can process incoming IOs, and on what happens if the overall system is so busy that Crucible is not getting the resources it needs.
```rust
match bp {
    BackpressureAmount::Saturated => {
        let err = "write queue is saturated";
        error!(self.log, "{err}");
```
I wonder a few things here.

- If we are saturated, what impact will we see from logging a bunch of messages? Will it further impact our ability to process IOs, or are things so bad at this point that it does not matter that we are spending time logging things?
- I do like the idea of knowing this has happened; if we don't log a message, then bumping some internal counter would be good, as it could help us explain why some VM got a bunch of errors yesterday but now things look okay.
- And, finally, how difficult is it to trigger this condition from a VM?
- In the saturated case, perhaps we should still sleep for `delay_max` here before returning the error code (a sketch follows this list)? That would limit us to one error every 30 seconds, which seems perfectly reasonable. We could also log an error and then delay for some long amount of time (e.g. 1 hour), which would match our previous behavior but seems fraught; I worry that delaying a guest by 1 hour would be more confusing than just returning an error.
- Agreed, we should definitely record it somewhere, if not in an error message.
- From a VM, you should only see this error condition if writes take > 30 seconds to execute in Crucible. The VM can't really affect this (assuming it's already spamming IOs as quickly as possible); the saturated error indicates that even at maximum backpressure, IOs are not being answered fast enough.
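A minimal sketch of that first suggestion, assuming a tokio runtime; apart from the `Saturated` name and the error string taken from the diff above, the types and the `apply_backpressure` helper are simplified stand-ins for illustration, not Crucible's actual API:

```rust
use std::time::Duration;

/// Simplified stand-in for the real type: only `Saturated` is taken from
/// the diff above; the `Duration` variant is assumed for illustration.
enum BackpressureAmount {
    Duration(Duration),
    Saturated,
}

/// Hypothetical helper: even when saturated, sleep for `delay_max` before
/// surfacing the error, so a saturated guest sees at most one error (and
/// one log line) per `delay_max` interval.
async fn apply_backpressure(
    bp: BackpressureAmount,
    delay_max: Duration,
) -> Result<(), String> {
    match bp {
        BackpressureAmount::Duration(d) => {
            // Normal case: just delay the caller by the computed amount.
            tokio::time::sleep(d).await;
            Ok(())
        }
        BackpressureAmount::Saturated => {
            // Wait out the maximum delay first, then report the error.
            tokio::time::sleep(delay_max).await;
            Err("write queue is saturated".to_owned())
        }
    }
}
```

Sleeping before returning rate-limits the saturated path to one error per `delay_max`, which also bounds the volume of log messages raised in the earlier comment.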
I think we have a pile of backpressure work in flight, so there will be further updates in this area; I think this can go in while we iterate.
Staged on top of #1442

Right now, we're manually tracking backpressure bytes by incrementing / decrementing a counter any time a job becomes active / inactive. This can be error-prone! This PR makes tracking backpressure values automatic, through the power of RAII (a rough sketch follows the list):

- We add a new `struct BackpressureCounters`, which contains the three flavors of counters (write bytes, total bytes, job count)
- `BackpressureCounters::increment` increments counters and hands you a `BackpressureGuard`
- When dropped, the `BackpressureGuard` decrements the counters automatically
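A rough sketch of that RAII shape using atomic counters; the field names, `Ordering` choice, and `increment` signature here are illustrative assumptions, not the PR's exact code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Three counters feeding the backpressure calculation (write bytes,
/// total IO bytes, job count), matching the description above; the exact
/// field layout is an assumption.
#[derive(Default)]
struct BackpressureCounters {
    write_bytes: AtomicU64,
    io_bytes: AtomicU64,
    jobs: AtomicU64,
}

/// Handed out by `increment`; undoes the increment when dropped, so a job
/// that goes out of scope can never leak backpressure accounting.
struct BackpressureGuard {
    counters: Arc<BackpressureCounters>,
    write_bytes: u64,
    io_bytes: u64,
}

impl BackpressureCounters {
    /// Bump the counters for a new job and return a guard tied to it.
    fn increment(
        this: &Arc<Self>,
        write_bytes: u64,
        io_bytes: u64,
    ) -> BackpressureGuard {
        this.write_bytes.fetch_add(write_bytes, Ordering::Relaxed);
        this.io_bytes.fetch_add(io_bytes, Ordering::Relaxed);
        this.jobs.fetch_add(1, Ordering::Relaxed);
        BackpressureGuard {
            counters: Arc::clone(this),
            write_bytes,
            io_bytes,
        }
    }
}

impl Drop for BackpressureGuard {
    fn drop(&mut self) {
        let c = &self.counters;
        c.write_bytes.fetch_sub(self.write_bytes, Ordering::Relaxed);
        c.io_bytes.fetch_sub(self.io_bytes, Ordering::Relaxed);
        c.jobs.fetch_sub(1, Ordering::Relaxed);
    }
}

// Usage sketch:
//   let counters = Arc::new(BackpressureCounters::default());
//   let _guard = BackpressureCounters::increment(&counters, 4096, 4096);
//   // ... job is active while the guard lives ...
//   // guard drops here; counters decrement automatically
```

The point of the guard is that the decrement happens in `Drop`, so any path that retires a job (success, error, or early return) releases its backpressure contribution without a manual bookkeeping call.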
Right now, backpressure goes to infinity as you approach the `max` value. However, it's possible to exceed `max` if you (1) start below `max` and (2) send a single job that pushes over the limit (e.g. a large write). If this happens, then we special-case a 1 hour backpressure delay. All of this is kinda awkward.

This PR cleans it up (a rough sketch follows the list):

- A `delay_max` stored in the backpressure configuration, which is both a clamp and the delay used if our counter exceeds `max_value`
- A new `Saturated` value, which is used if we exceed `max_value`
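A hedged sketch of how that clamping might fit together; the quadratic ramp, the `start_value` field, and the `get_backpressure` signature are assumptions for illustration, not the PR's exact curve:

```rust
use std::time::Duration;

/// Simplified stand-in for the backpressure result described above.
enum BackpressureAmount {
    Duration(Duration),
    Saturated,
}

/// Hypothetical config shape; the real structure may differ.
struct BackpressureConfig {
    start_value: u64,    // counter value where backpressure begins (assumed)
    max_value: u64,      // counter value where we report saturation
    delay_max: Duration, // clamp on the delay
}

impl BackpressureConfig {
    /// Map the current counter to a delay bounded by `delay_max`,
    /// switching to `Saturated` once the counter exceeds `max_value`.
    fn get_backpressure(&self, counter: u64) -> BackpressureAmount {
        if counter > self.max_value {
            BackpressureAmount::Saturated
        } else if counter <= self.start_value {
            BackpressureAmount::Duration(Duration::ZERO)
        } else {
            let frac = (counter - self.start_value) as f64
                / (self.max_value - self.start_value) as f64;
            // Example ramp: quadratic growth up to delay_max (the exact
            // curve in the PR may differ); frac is in (0, 1], so the
            // result is already bounded by delay_max.
            BackpressureAmount::Duration(self.delay_max.mul_f64(frac * frac))
        }
    }
}
```

The key change relative to the old behavior is that the delay can never exceed `delay_max`, and overshooting `max_value` yields an explicit `Saturated` result instead of a special-cased 1 hour delay.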