Shred fetch bounded channel #6768

Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #6768     +/-   ##
=========================================
- Coverage     83.2%    83.2%     -0.1%
=========================================
  Files          852      852
  Lines       377089   377104       +15
=========================================
- Hits        314051   314046        -5
- Misses       63038    63058       +20
Is it somehow possible to measure at what shred arrival rate these channels would overflow?
We can make some educated guesses with the current instrumentation. shred_sigverify is the downstream consumer of this channel, and its metrics are reported every ~2s. In this sample (which represents a typical mean), shred_sigverify was actively processing shreds for ~461ms, i.e., idle for ~77% of the 2s window. The crucial piece of data, which I do not have, is how shred_sigverify scales with input / batch size. Very dumb, linear scaling would entail something like a 4.3x increase in shreds to fully saturate the 2s window with work (i.e., we are processing shreds for the entire duration of the stat submission window, with no idle time). Past that point we risk adding messages to the channel faster than we're processing them. I don't think this kind of analysis is very useful in practice, but it is perhaps useful as a starting point.

The data also reveal that we are doing a fairly poor job batching, and thus parallelizing, the work.

So, back to your question: how much channel headroom do we actually have? Given the current state of the code, it depends on how much an increase in shred ingress increases packet batch saturation, and how shred_sigverify scales with batch size. Our batches are currently very poorly saturated (it is hard to even call them "batches", given the data). Assuming the worst case, an increase in shred ingress without an increase in packet batch saturation (i.e., linear scaling), we have roughly 4x of headroom before we have to start thinking about whether we're falling behind. With shred_sigverify properly batching and parallelizing, I would conjecture we have significantly more headroom.
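As a sanity check on the arithmetic above, here is a minimal sketch of the back-of-the-envelope headroom estimate, assuming shred_sigverify scales linearly with input size; the numbers are the sample values quoted above and the variable names are illustrative only.

```rust
// Back-of-the-envelope headroom estimate under the (assumed) linear-scaling
// model discussed above. Numbers come from the quoted sample; names are
// illustrative, not taken from the agave codebase.
fn main() {
    let window_ms = 2_000.0_f64; // metrics submission window (~2s)
    let busy_ms = 461.0_f64; // time spent processing shreds in the sample

    // ~77% of the window was idle in the sample.
    let idle_pct = 100.0 * (1.0 - busy_ms / window_ms);

    // With linear scaling, ingress can grow by roughly this factor before the
    // consumer is busy for the whole window (~4.3x here).
    let headroom = window_ms / busy_ms;

    println!("idle: {idle_pct:.0}%, approx. headroom: {headroom:.1}x");
}
```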
Thank you for the detailed analysis! From what I understand, we used to have a short sleep in solana-streamer that would accumulate packets for up to 1ms while building a batch for sigverify; practical measurements showed that it makes very little difference. In the hunt for reduced latency, the wait was removed in #5213, based on no data I can track in the PRs. When I tried a similar change for gossip, it turned out to increase CPU usage through the mechanism you have mentioned (#5319). I do not know if we can/should trade CPU for latency reduction here.

Arguably, if we get the architecture right, we might be able to get away with no batching while preserving the low latency of the no-coalesce solution. As I understand it, sigverify itself does not benefit much from batching, since it verifies packets one at a time anyway, so the benefits are in channel utilization. Switching to a bounded channel with try_send/try_recv access might be preferable to outright adding latency in the hope of making meaningfully sized batches. I would not be surprised if much of the "work" that sigverify does is just managing rayon pools...
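For concreteness, here is a minimal sketch of the bounded-channel try_send/try_recv pattern suggested above, using crossbeam_channel; the capacity, payload type, and drop handling are illustrative assumptions, not the actual agave code.

```rust
use crossbeam_channel::{bounded, TrySendError};

fn main() {
    // Illustrative capacity; the real sizing discussion is in the PR description below.
    let (sender, receiver) = bounded::<Vec<u8>>(4);

    // Producer side: never block. When the consumer falls behind, handle the
    // overflow (e.g., count a drop in metrics) instead of growing an unbounded queue.
    for i in 0..8u8 {
        match sender.try_send(vec![i]) {
            Ok(()) => {}
            Err(TrySendError::Full(_packet)) => { /* record the drop */ }
            Err(TrySendError::Disconnected(_)) => break,
        }
    }

    // Consumer side: drain whatever is currently available without blocking.
    let drained: Vec<Vec<u8>> = receiver.try_iter().collect();
    println!("drained {} packets, {} dropped", drained.len(), 8 - drained.len());
}
```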
I actually ran a test with coalesce enabled on the shred_sigverify channel, and found it yields a significant improvement in processing time (~7.3x).
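For reference, receiver-side coalescing along the lines discussed above can be sketched roughly as follows; the function name, time budget, and batch cap are illustrative assumptions, not the actual solana-streamer implementation.

```rust
use crossbeam_channel::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

// Block for the first item, then keep accumulating until a small time budget
// or a batch cap is hit, so downstream sigverify sees fewer, larger batches.
fn recv_coalesced<T>(
    receiver: &Receiver<T>,
    max_batch: usize,
    coalesce: Duration, // e.g. ~1ms, like the old solana-streamer wait
) -> Result<Vec<T>, RecvTimeoutError> {
    let first = receiver.recv_timeout(coalesce)?;
    let deadline = Instant::now() + coalesce;
    let mut batch = vec![first];
    while batch.len() < max_batch {
        match receiver.recv_deadline(deadline) {
            Ok(item) => batch.push(item),
            Err(_) => break, // time budget exhausted or sender disconnected
        }
    }
    Ok(batch)
}
```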
turbine/src/sigverify_shreds.rs (Outdated)

@@ -47,6 +47,9 @@ const CLUSTER_NODES_CACHE_NUM_EPOCH_CAP: usize = 2;
// are needed, we can use longer durations for cache TTL.
const CLUSTER_NODES_CACHE_TTL: Duration = Duration::from_secs(30);

/// Maximum number of packet batches to process in a single sigverify iteration.
const SIGVERIFY_SHRED_BATCH_SIZE: usize = 1024 * 8;
Why is this so huge? Under which conditions would we get 8k batches of packets for sigverify?
Yeah, this was a mistake. Will fix
Do you have any numbers on how small this can be before we start losing performance? For minimal latency we'd want this to be as small as possible, right?
Force-pushed from 244e638 to 0d59906
LGTM, thank you!
Problem
Similar to #5177 and #6107. The shred fetch stage uses two unbounded channels to pass shreds via packet batches between threads. Additionally, run_shred_sigverify uses the Vec::extend(receiver.try_iter()) pattern, which can continuously reallocate as shreds are received and potentially introduce lag time before shreds are actually processed.

Summary of Changes

This bounds both channels involved in shred fetch with EvictingSender, and updates run_shred_sigverify to pull off a fixed number of packet batches on each iteration.

A note on channel size. The general case sees shred and repair ingress in the hundreds of packet batches per second. However, in the case of catch-up, we may see upwards of 8k packet batches per second, which would suggest a roughly 16k packet batch limit for ample headroom. We're setting it to 4x that amount to future-proof for increases in CU limits (e.g., a future 100k CU limit).
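To make the change concrete, here is a hedged sketch of the consumer side described above: a bounded channel plus a per-iteration cap on how many packet batches run_shred_sigverify pulls. The capacity, cap value, and payload type are illustrative; the actual code uses EvictingSender, whose API is not reproduced here.

```rust
use crossbeam_channel::{bounded, Receiver};

// Pull at most `max_batches` items per iteration instead of draining the
// channel with an open-ended `Vec::extend(receiver.try_iter())`; anything
// beyond the cap stays in the (bounded) channel for the next pass.
fn recv_bounded_iteration<T>(receiver: &Receiver<T>, max_batches: usize) -> Vec<T> {
    receiver.try_iter().take(max_batches).collect()
}

fn main() {
    // Roughly 4x the ~16k catch-up estimate, per the sizing note above.
    let (sender, receiver) = bounded::<u64>(64 * 1024);
    for i in 0..100 {
        let _ = sender.try_send(i);
    }
    let work = recv_bounded_iteration(&receiver, 32);
    println!("processing {} packet batches this iteration", work.len());
}
```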