Shred fetch bounded channel #6768

Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #6768     +/-   ##
=========================================
- Coverage     83.2%    83.2%     -0.1%
=========================================
  Files          852      852
  Lines       377089   377104       +15
=========================================
- Hits        314051   314046        -5
- Misses       63038    63058       +20
Is it somehow possible to measure at what shred arrival rate these channels would overflow?
We can make some educated guesses with the current instrumentation. shred_sigverify is the downstream consumer of this channel, and its metrics are reported every ~2s. In this sample (which represents a typical mean), shred_sigverify was actively processing shreds for ~461ms, i.e., idle for ~77% of the 2s window. The crucial piece of data, which I do not have, is how shred_sigverify scales with input / batch size. Very dumb, linear scaling would entail something like a 4.3x increase in shreds to fully saturate the 2s window with work (i.e., we are processing shreds for the entire duration of the stat submission window, with no idle time). Past that point we risk adding messages to the channel faster than we're processing them. I don't think this kind of analysis is very useful in practice, but it is perhaps useful as a starting point.

The data also reveal that we are doing a fairly poor job batching, and thus parallelizing, the work.

So, back to your question: how much channel headroom do we actually have? Given the current state of the code, it depends on how much an increase in shred ingress increases packet batch saturation, and how shred_sigverify scales with batch size. Our batches are currently very poorly saturated (it is hard to even call them "batches", given the data). Assuming the worst case, an increase in shred ingress without an increase in packet batch saturation (i.e., linear scaling), we have roughly 4x of headroom before we have to start thinking about whether we're falling behind. With shred_sigverify properly batching and parallelizing, I would conjecture we have significantly more headroom.
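As a sanity check on the arithmetic above, here is a minimal sketch of the back-of-the-envelope headroom estimate, assuming shred_sigverify scales linearly with input size; the numbers are the sample values quoted above and the variable names are illustrative only.

```rust
// Back-of-the-envelope headroom estimate under the (assumed) linear-scaling
// model discussed above. Numbers come from the quoted sample; names are
// illustrative, not taken from the agave codebase.
fn main() {
    let window_ms = 2_000.0_f64; // metrics submission window (~2s)
    let busy_ms = 461.0_f64; // time spent processing shreds in the sample

    // ~77% of the window was idle in the sample.
    let idle_pct = 100.0 * (1.0 - busy_ms / window_ms);

    // With linear scaling, ingress can grow by roughly this factor before the
    // consumer is busy for the whole window (~4.3x here).
    let headroom = window_ms / busy_ms;

    println!("idle: {idle_pct:.0}%, approx. headroom: {headroom:.1}x");
}
```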
Thank you for the detailed analysis! From what I understand, we used to have a short sleep in solana-streamer that would accumulate packets for up to 1ms while building a batch for sigverify; practical measurements showed that it makes very little difference. In the hunt for reduced latency, the wait was removed in #5213, based on no data I can track in the PRs. When I tried a similar change for gossip, it turned out to increase CPU usage through the mechanism you have mentioned (#5319). I do not know if we can/should trade CPU for latency reduction here.

Arguably, if we get the architecture right, we might be able to get away with no batching while preserving the low latency of the no-coalesce solution. As I understand it, sigverify itself does not benefit much from batching, since it verifies packets one at a time anyway, so the benefits are in channel utilization. Switching to a bounded channel with try_send/try_recv access might be preferable to outright adding latency in the hope of making meaningfully sized batches. I would not be surprised if much of the "work" that sigverify does is just managing rayon pools...
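For concreteness, here is a minimal sketch of the bounded-channel try_send/try_recv pattern suggested above, using crossbeam_channel; the capacity, payload type, and drop handling are illustrative assumptions, not the actual agave code.

```rust
use crossbeam_channel::{bounded, TrySendError};

fn main() {
    // Illustrative capacity; the real sizing discussion is in the PR description below.
    let (sender, receiver) = bounded::<Vec<u8>>(4);

    // Producer side: never block. When the consumer falls behind, handle the
    // overflow (e.g., count a drop in metrics) instead of growing an unbounded queue.
    for i in 0..8u8 {
        match sender.try_send(vec![i]) {
            Ok(()) => {}
            Err(TrySendError::Full(_packet)) => { /* record the drop */ }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }

    // Consumer side: drain whatever is currently available without blocking.
    let drained: Vec<Vec<u8>> = receiver.try_iter().collect();
    println!("drained {} packets, {} dropped", drained.len(), 8 - drained.len());
}
```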
I actually ran a test with coalesce enabled on the shred_sigverify channel, and found it yields a significant improvement in processing time (~7.3x).
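For reference, receiver-side coalescing along the lines discussed above can be sketched roughly as follows; the function name, time budget, and batch cap are illustrative assumptions, not the actual solana-streamer implementation.

```rust
use crossbeam_channel::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

// Block for the first item, then keep accumulating until a small time budget
// or a batch cap is hit, so downstream sigverify sees fewer, larger batches.
fn recv_coalesced<T>(
    receiver: &Receiver<T>,
    max_batch: usize,
    coalesce: Duration, // e.g. ~1ms, like the old solana-streamer wait
) -> Result<Vec<T>, RecvTimeoutError> {
    let first = receiver.recv_timeout(coalesce)?;
    let deadline = Instant::now() + coalesce;
    let mut batch = vec![first];
    while batch.len() < max_batch {
        match receiver.recv_deadline(deadline) {
            Ok(item) => batch.push(item),
            Err(_) => break, // time budget exhausted or sender disconnected
        }
    }
    Ok(batch)
}
```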
turbine/src/sigverify_shreds.rs (Outdated)

@@ -47,6 +47,9 @@ const CLUSTER_NODES_CACHE_NUM_EPOCH_CAP: usize = 2;
// are needed, we can use longer durations for cache TTL.
const CLUSTER_NODES_CACHE_TTL: Duration = Duration::from_secs(30);

/// Maximum number of packet batches to process in a single sigverify iteration.
const SIGVERIFY_SHRED_BATCH_SIZE: usize = 1024 * 8;
Why is this so huge? Under which conditions would we get 8k batches of packets for sigverify?
Yeah, this was a mistake. Will fix
Do you have any numbers on how small this can be before we start losing performance? For minimal latency we'd want this to be as small as possible, right?
Force-pushed from 244e638 to 0d59906
LGTM, thank you!
Problem
Similar to #5177 and #6107. The shred fetch stage uses two unbounded channels to pass shreds via packet batches between threads. Additionally, run_shred_sigverify uses the Vec::extend(receiver.try_iter()) pattern, which can continuously reallocate as shreds are received and potentially introduce lag time before shreds are actually processed.

Summary of Changes

This bounds both channels involved in shred fetch with EvictingSender, and updates run_shred_sigverify to pull off a fixed number of packet batches on each iteration.

A note on channel size. The general case sees shred and repair ingress in the hundreds of packet batches per second. However, in the case of catch-up, we may see upwards of 8k packet batches per second, which would suggest a roughly 16k packet batch limit for ample headroom. We're setting it to 4x that amount to future-proof for increases in CU limits (e.g., a future 100k CU limit).
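To make the change concrete, here is a hedged sketch of the consumer side described above: a bounded channel plus a per-iteration cap on how many packet batches run_shred_sigverify pulls. The capacity, cap value, and payload type are illustrative; the actual code uses EvictingSender, whose API is not reproduced here.

```rust
use crossbeam_channel::{bounded, Receiver};

// Pull at most `max_batches` items per iteration instead of draining the
// channel with an open-ended `Vec::extend(receiver.try_iter())`; anything
// beyond the cap stays in the (bounded) channel for the next pass.
fn recv_bounded_iteration<T>(receiver: &Receiver<T>, max_batches: usize) -> Vec<T> {
    receiver.try_iter().take(max_batches).collect()
}

fn main() {
    // Roughly 4x the ~16k catch-up estimate, per the sizing note above.
    let (sender, receiver) = bounded::<u64>(64 * 1024);
    for i in 0..100 {
        let _ = sender.try_send(i);
    }
    let work = recv_bounded_iteration(&receiver, 32);
    println!("processing {} packet batches this iteration", work.len());
}
```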