Trash overgrown bp unified scheduler #6686


Merged: 6 commits merged into master on Jul 7, 2025

Conversation

ryoqun
Member

@ryoqun ryoqun commented Jun 22, 2025

Problem

Currently, the block production unified scheduler's overgrown status isn't checked due to insufficient plumbing. Relatedly, there's a largely theoretical attack vector of a remotely controlled overgrown-status-check bypass in the (not-production-ready-yet) unified scheduler as a banking stage. Last but not least, trashing it isn't working to begin with due to the lack of a proper implementation in disconnect_new_task_sender().

As a bonus, BankingStageHelper::task_id could overlap, but we're ignoring that fact. ;)

Summary of Changes

Fix them all with proper plumbing, the introduction of a mitigation step in the cleaner thread, and
bp-ready thread shutdown signaling in disconnect_new_task_sender().

... and don't forget about the bonus. Let's wire that to overgrown. After years, it's finally evident to the public why I insisted on introducing the word overgrown at the time (#1672), when it meant nothing more than a too-big UsageQueueLoader. :)

There should be no functional change for the block verification code path.

extracted from #3946
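To illustrate the "wire the task-id bonus to overgrown" idea, here is a minimal, hedged sketch. All names (`TASK_ID_BUDGET`, `BankingStageHelperLike`, `is_task_id_overgrown`) and the budget constant are illustrative stand-ins, not the actual agave identifiers:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Allow ids only up to half the usize range; fetch_add wraps on overflow,
// so the budget must be enforced well before wrapping could happen.
const TASK_ID_BUDGET: usize = usize::MAX / 2;

struct BankingStageHelperLike {
    next_task_id: AtomicUsize,
}

impl BankingStageHelperLike {
    fn generate_task_id(&self) -> usize {
        // Relaxed is fine here: the id only needs to be unique, not
        // ordered against other memory operations.
        self.next_task_id.fetch_add(1, Ordering::Relaxed)
    }

    // The cleaner thread can poll this off the hot path and trash the
    // scheduler once the budget is exhausted, so the hot path itself
    // never pays for a limit check.
    fn is_task_id_overgrown(&self) -> bool {
        self.next_task_id.load(Ordering::Relaxed) >= TASK_ID_BUDGET
    }
}
```

The point of the design is that overflow detection lives entirely outside the allocation hot path.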

@@ -1281,10 +1421,6 @@ where
// before that.
self.thread_manager.are_threads_joined()
}

fn is_overgrown(&self) -> bool {
Member Author

this inherent method is promoted to a trait method of SchedulerInner to make it callable inside cleaner_main_loop....
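The promotion being described can be sketched like this (the trait name `SchedulerInner` is from the comment above; the struct, field, and threshold are made up for illustration):

```rust
// Promoting an inherent method to a trait method lets generic code,
// such as a cleaner loop, call it without knowing the concrete type.
trait SchedulerInner {
    fn is_overgrown(&self) -> bool;
}

struct PooledScheduler {
    usage_queue_count: usize,
}

impl SchedulerInner for PooledScheduler {
    fn is_overgrown(&self) -> bool {
        const MAX_USAGE_QUEUE_COUNT: usize = 200_000; // illustrative limit
        self.usage_queue_count > MAX_USAGE_QUEUE_COUNT
    }
}

// cleaner_main_loop-style code can now check any S: SchedulerInner.
fn should_trash<S: SchedulerInner>(pooled: &S) -> bool {
    pooled.is_overgrown()
}
```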

@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch from 4ec4752 to c9df3e8 Compare June 23, 2025 12:20
Comment on lines -1124 to -1129
pub struct UsageQueueLoader {
struct UsageQueueLoaderInner {
usage_queues: DashMap<Pubkey, UsageQueue>,
}

impl UsageQueueLoader {
pub fn load(&self, address: Pubkey) -> UsageQueue {
Member Author

@ryoqun ryoqun Jun 23, 2025

fewer pubs, less cognitive load. :)
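The "fewer pubs" shape can be sketched as follows. The real code uses `DashMap<Pubkey, UsageQueue>`; std types stand in here so the sketch is self-contained, and the `count` helper is purely illustrative:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Pubkey = [u8; 32]; // stand-in for the real Pubkey type

#[derive(Clone, Default)]
struct UsageQueue; // stand-in

// Private inner struct: callers outside the module can't touch the map
// directly, only go through UsageQueueLoader's public surface.
#[derive(Default)]
struct UsageQueueLoaderInner {
    usage_queues: Mutex<HashMap<Pubkey, UsageQueue>>,
}

#[derive(Default)]
pub struct UsageQueueLoader {
    inner: UsageQueueLoaderInner,
}

impl UsageQueueLoader {
    pub fn load(&self, address: Pubkey) -> UsageQueue {
        // Return the existing queue for this address, creating it on
        // first use.
        self.inner
            .usage_queues
            .lock()
            .unwrap()
            .entry(address)
            .or_default()
            .clone()
    }

    pub fn count(&self) -> usize {
        self.inner.usage_queues.lock().unwrap().len()
    }
}
```

Hiding the map behind a private inner type keeps the public API to one method, which is the cognitive-load win the comment is after.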

@@ -653,7 +731,7 @@ where
.lock()
.unwrap()
.as_mut()
.map(|respawner| respawner.banking_stage_monitor.status())
Member Author

well, respawner is a remnant of very old prototype code....

@ryoqun ryoqun requested a review from apfitzge June 23, 2025 12:54
@codecov-commenter

codecov-commenter commented Jun 23, 2025

Codecov Report

Attention: Patch coverage is 96.73913% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.8%. Comparing base (d25252f) to head (e1dd743).
Report is 135 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6686    +/-   ##
========================================
  Coverage    82.8%    82.8%            
========================================
  Files         849      849            
  Lines      379311   379485   +174     
========================================
+ Hits       314229   314395   +166     
- Misses      65082    65090     +8     

fn discard_buffer(&self) {
self.thread_manager.discard_buffered_tasks();
}

#[cfg(test)]
fn change_next_task_id_for_block_production(&self, next_task_id: usize) {


nit: set_next_task_id_for_block_production. It also seems the chain of calls is not consistently named; some are called banking_stage, some block_production. Let's consistently use block_production.

Member Author

done: bd6078c

OwnedBySelf {
usage_queue_loader_inner: UsageQueueLoaderInner,
},
SharedWithBankingStage {


I'm trying to recall the reason this is "shared with banking stage". The usage queue is on the BankingStageHelper, which is held by both scheduler & workers - why do we need it on the workers again?

Member Author

Hope this doc rework should help future you recall the reason..?: 4e7f8e3

@ryoqun ryoqun requested a review from apfitzge June 24, 2025 14:16
@apfitzge apfitzge requested review from apfitzge and removed request for apfitzge July 2, 2025 02:12
// AtomicUsize's fetch_add entails wrapping semantics. So, it needs to be rather
// conservative to prevent overflowing from happening in production, under the constraint of not
// compromising performance at all (i.e. no limit check on hot path and no d-cache pressure). With
// this background given, it's exceedingly hard to conceive of task ids being allotted for more than half of


it's hard to conceive that we will ever hit it regardless of single-session status. If we could ingest & index a transaction in a single nanosecond, we'd need to be running for almost 300 years continuously to index u64::max/2 txs.

Member Author

@ryoqun ryoqun Jul 7, 2025

nice catch! I adjusted the tone of prose accordingly: 4940a57
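The reviewer's estimate checks out with quick arithmetic (a standalone sketch; the year constant is the Julian year of 365.25 days):

```rust
// At one task per nanosecond, how long would it take to allot
// u64::MAX / 2 task ids?
fn years_to_exhaust_half_range() -> f64 {
    let tasks = (u64::MAX / 2) as f64;  // ~9.22e18 ids in the budget
    let seconds = tasks * 1e-9;         // 1 ns per task
    seconds / (365.25 * 24.0 * 3600.0)  // seconds per Julian year
}
```

That works out to roughly 292 years of continuous nanosecond-rate allocation, which matches the "almost 300 years" figure above.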

break;
};

if let Some(pooled) = inner.peek_pooled() {
{
if pooled.is_overgrown() {
// This is very unlikely code path to address a theoretically-possible


Can you explain to me why this is very unlikely? The u64 indexing overflow is never going to happen, but the overgrown usage queue seems like it could happen reasonably often during normal operation, right?

Member Author

another nice catch! I think the previous prose was misleading. I think things are clarified now?: e1dd743
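The mitigation step in the cleaner thread amounts to: when an idle (pooled) scheduler reports overgrown, drop it instead of returning it for reuse. A minimal sketch, with all names besides `is_overgrown` invented for illustration:

```rust
trait InnerLike {
    fn is_overgrown(&self) -> bool;
}

struct Pool<S: InnerLike> {
    pooled: Vec<S>,
}

impl<S: InnerLike> Pool<S> {
    // One pass of the cleaner: trash overgrown pooled schedulers (they
    // are simply dropped here) and keep the rest for reuse. Returns how
    // many were trashed.
    fn clean_overgrown(&mut self) -> usize {
        let before = self.pooled.len();
        self.pooled.retain(|scheduler| !scheduler.is_overgrown());
        before - self.pooled.len()
    }
}
```

Because the check runs against pooled (idle) schedulers only, it never touches the scheduling hot path.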

@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch 4 times, most recently from 7606c17 to 3dbd55a Compare July 7, 2025 01:56
@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch from 3dbd55a to f0b1962 Compare July 7, 2025 02:03
@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch from f0b1962 to e1dd743 Compare July 7, 2025 02:12
@ryoqun ryoqun requested a review from apfitzge July 7, 2025 02:16
@ryoqun ryoqun added the automerge label (Merge this Pull Request automatically once CI passes) Jul 7, 2025
@apfitzge apfitzge merged commit 0de8717 into anza-xyz:master Jul 7, 2025
29 checks passed
Labels: automerge (Merge this Pull Request automatically once CI passes), need:merge-assist

3 participants