
Shred fetch bounded channel #6768


Merged: 3 commits merged into anza-xyz:master on Jul 10, 2025

Conversation

@cpubot commented Jun 27, 2025

Problem

Similar to #5177 and #6107, the shred fetch stage uses two unbounded channels to pass shreds (as packet batches) between threads. Additionally, run_shred_sigverify uses the Vec::extend(receiver.try_iter()) pattern, which can continuously reallocate as shreds are received and can introduce lag before shreds are actually processed.

Summary of Changes

This bounds both channels involved in shred fetch with EvictingSender, and updates run_shred_sigverify to pull off a fixed number of packet batches on each iteration.

A note on channel size. The general case sees shred and repair ingress in the hundreds of packet batches per second. However, during catch-up we may see upwards of 8k packet batches per second, which suggests a roughly 16k packet batch limit for ample headroom. We're setting it to 4x that amount to future-proof against increases in CU limits (e.g., a future 100k CU limit).
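
For illustration, here is a minimal sketch of the two ideas, approximated with crossbeam_channel primitives. PacketBatch is a stand-in type and the eviction-on-full policy is simplified; the real EvictingSender lives elsewhere in the codebase and differs in detail.

```rust
use crossbeam_channel::{bounded, Receiver, Sender, TrySendError};

// Stand-in for the real PacketBatch type.
struct PacketBatch;

/// Rough approximation of an evicting send: if the bounded channel is full,
/// drop the oldest queued batch and retry once, rather than blocking the
/// fetch thread or growing an unbounded queue.
fn evicting_send(
    sender: &Sender<PacketBatch>,
    evict_rx: &Receiver<PacketBatch>,
    batch: PacketBatch,
) {
    if let Err(TrySendError::Full(batch)) = sender.try_send(batch) {
        let _ = evict_rx.try_recv(); // evict the oldest batch to make room
        let _ = sender.try_send(batch); // best effort; drop if still full
    }
}

/// Pull at most `max_batches` batches per iteration instead of draining the
/// channel with `Vec::extend(receiver.try_iter())`.
fn recv_batches(receiver: &Receiver<PacketBatch>, max_batches: usize) -> Vec<PacketBatch> {
    receiver.try_iter().take(max_batches).collect()
}

fn main() {
    // 4x the ~16k suggested above; capacity shown here is purely illustrative.
    let (tx, rx) = bounded::<PacketBatch>(64 * 1024);
    evicting_send(&tx, &rx, PacketBatch);
    let _ = recv_batches(&rx, 8);
}
```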

@cpubot requested a review from alessandrod on June 27, 2025
@codecov-commenter commented Jun 27, 2025

Codecov Report

Attention: Patch coverage is 69.69697% with 10 lines in your changes missing coverage. Please review.

Project coverage is 83.2%. Comparing base (87451d5) to head (0d59906).

Additional details and impacted files

```
@@            Coverage Diff            @@
##           master    #6768     +/-   ##
=========================================
- Coverage    83.2%    83.2%   -0.1%
=========================================
  Files         852      852
  Lines      377089   377104     +15
=========================================
- Hits       314051   314046      -5
- Misses      63038    63058     +20
```

@alexpyattaev commented:

Is it somehow possible to measure at what shred arrival rate would these channels overflow?

@cpubot (Author) commented Jun 30, 2025

> Is it somehow possible to measure at what shred arrival rate would these channels overflow?

We can make some educated guesses with the current instrumentation.

[screenshot: shred_sigverify metrics sample]

shred_sigverify is the downstream consumer of this channel. Its metrics are reported every ~2s. In this sample (which represents a typical mean), shred_sigverify was actively processing shreds for ~461ms, i.e., idle for ~77% of the 2s window.

The crucial piece of data here, which I do not have, is how the shred_sigverify function scales with input/batch size. Assuming very dumb, linear scaling, it would take roughly a 4.3x increase in shreds to fully saturate the 2s window with work (i.e., processing shreds for the entire duration of the stat submission window, with no idle time). Beyond that point we risk adding messages to the channel faster than we're processing them. I don't think this kind of analysis is very useful in practice, but it's perhaps a useful starting point.
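
For reference, that factor is just the ratio of the reporting window to the measured busy time (a back-of-the-envelope estimate under the linear-scaling assumption):

$$\frac{2000\ \text{ms window}}{461\ \text{ms busy}} \approx 4.3$$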

[screenshots: packet batch size/saturation metrics]

The data reveal that we are actually doing a fairly poor job batching and thus parallelizing the work. In fact, using par_iter in shred_sigverify is likely hurting more than it is helping.

So, back to your question: how much channel headroom do we actually have? Given the current state of the code, it depends on how much an increase in shred ingress increases packet batch saturation, and how shred_sigverify scales with batch size. Our batches are currently very poorly saturated (it's actually hard to call them "batches", given the data). Assuming the worst case, an increase in shred ingress without an increase in packet batch saturation (i.e., linear scaling), we have roughly 4x of headroom before we have to start thinking about whether we're falling behind. With shred_sigverify properly batching and parallelizing, I would conjecture we have significantly more headroom.

@alexpyattaev commented:

Thank you for the detailed analysis! From what I understand, we used to have a short sleep in solana-streamer that would accumulate packets for up to 1ms while making a batch for sigverify. However, practical measurements showed that it made very little difference.

However, in the hunt for reduced latency, the wait was removed in #5213, based on no data I can track down in the PRs. When I tried a similar change for gossip, it turned out to increase CPU usage through the mechanism you mentioned (#5319). I do not know if we can/should trade CPU for latency reduction here.

Arguably, if we get the architecture right, we might be able to get away with no batching while preserving the low latency of the no-coalesce solution. As I understand it, sigverify itself does not benefit much from batching, as it verifies packets one at a time anyway, so the benefit is in channel utilization. Switching to a bounded channel with try_send/try_recv access might be preferable to straight-up adding latency in the hope of making meaningfully sized batches. I would not be surprised if much of the "work" that sigverify does is just managing rayon pools...

@cpubot (Author) commented Jul 1, 2025

> As I understand it, sigverify itself does not benefit much from batching, as it verifies packets one at a time anyway, so the benefit is in channel utilization. Switching to a bounded channel with try_send/try_recv access might be preferable to straight-up adding latency in the hope of making meaningfully sized batches.

I actually ran a test with coalesce enabled on the shred_sigverify channel, and found that it yields a significant improvement in processing time (~7.3x).

before: [screenshot: shred_sigverify metrics without coalescing]

after: [screenshot: shred_sigverify metrics with coalescing]
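
For context, this is roughly what a coalescing receive does; a minimal sketch assuming a crossbeam_channel receiver (the function name, window, and batch cap below are illustrative, not the exact code used in this test):

```rust
use crossbeam_channel::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

// Stand-in for the real PacketBatch type.
struct PacketBatch;

/// Block for the first batch, then keep pulling batches until the coalesce
/// window expires or a batch-count cap is hit, so sigverify sees fewer,
/// better-saturated batches per iteration.
fn recv_coalesced(
    receiver: &Receiver<PacketBatch>,
    coalesce: Duration,
    max_batches: usize,
) -> Result<Vec<PacketBatch>, RecvTimeoutError> {
    let first = receiver.recv_timeout(coalesce)?;
    let deadline = Instant::now() + coalesce;
    let mut batches = vec![first];
    while batches.len() < max_batches {
        match receiver.recv_deadline(deadline) {
            Ok(batch) => batches.push(batch),
            Err(_) => break, // coalesce window expired or sender dropped
        }
    }
    Ok(batches)
}
```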

```diff
@@ -47,6 +47,9 @@ const CLUSTER_NODES_CACHE_NUM_EPOCH_CAP: usize = 2;
 // are needed, we can use longer durations for cache TTL.
 const CLUSTER_NODES_CACHE_TTL: Duration = Duration::from_secs(30);
 
+/// Maximum number of packet batches to process in a single sigverify iteration.
+const SIGVERIFY_SHRED_BATCH_SIZE: usize = 1024 * 8;
```
@alexpyattaev commented Jul 3, 2025

Why is this so huge? Under which conditions would we get 8k batches of packets for sigverify?

@cpubot (Author) replied:

Yeah, this was a mistake. Will fix


Do you have any numbers on how small this can be before we start losing performance? For minimal latency we'd want this to be as small as possible, right?

@cpubot force-pushed the shred-fetch-bounded-channel branch from 244e638 to 0d59906 on July 9, 2025
@alexpyattaev left a comment

LGTM, thank you!

@cpubot merged commit d277fd2 into anza-xyz:master on Jul 10, 2025
41 checks passed
@cpubot deleted the shred-fetch-bounded-channel branch on July 10, 2025