Trash overgrown bp unified scheduler #6686


Merged: 6 commits merged into master on Jul 7, 2025

Conversation

ryoqun
Member

@ryoqun ryoqun commented Jun 22, 2025

Problem

Currently, the block production unified scheduler's overgrown status isn't checked due to insufficient plumbing. Relatedly, there's a largely theoretical attack vector of a remotely controlled overgrown-status-check bypass in the (not-production-ready-yet) unified scheduler as a banking stage. Last but not least, trashing it isn't working to begin with due to the lack of a proper implementation in disconnect_new_task_sender().

As a bonus, BankingStageHelper::task_id could overlap, but we're ignoring that fact. ;)

Summary of Changes

Fix them all with proper plumbing, the introduction of a mitigation step in the cleaner thread, and
bp-ready thread shutdown signaling in disconnect_new_task_sender().

... and don't forget about the bonus. Let's wire that to overgrown. After years, it's finally evident to the public why I insisted on introducing the word overgrown at the time (#1672), when it meant nothing more than a too-big UsageQueueLoader. :)

There should be no functional change for the block verification code path.

extracted from #3946
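To illustrate the "wire the task-id bonus to overgrown" idea, here is a minimal, hedged sketch. All names (`TASK_ID_BUDGET`, `BankingStageHelperLike`, `is_task_id_overgrown`) and the budget constant are illustrative stand-ins, not the actual agave identifiers:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Allow ids only up to half the usize range; fetch_add wraps on overflow,
// so the budget must be enforced well before wrapping could happen.
const TASK_ID_BUDGET: usize = usize::MAX / 2;

struct BankingStageHelperLike {
    next_task_id: AtomicUsize,
}

impl BankingStageHelperLike {
    fn generate_task_id(&self) -> usize {
        // Relaxed is fine here: the id only needs to be unique, not
        // ordered against other memory operations.
        self.next_task_id.fetch_add(1, Ordering::Relaxed)
    }

    // The cleaner thread can poll this off the hot path and trash the
    // scheduler once the budget is exhausted, so the hot path itself
    // never pays for a limit check.
    fn is_task_id_overgrown(&self) -> bool {
        self.next_task_id.load(Ordering::Relaxed) >= TASK_ID_BUDGET
    }
}
```

The point of the design is that overflow detection lives entirely outside the allocation hot path.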

@@ -1281,10 +1421,6 @@ where
// before that.
self.thread_manager.are_threads_joined()
}

fn is_overgrown(&self) -> bool {
Member Author

this inherent method is promoted to a trait method of SchedulerInner to make it callable inside cleaner_main_loop....
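The promotion being described can be sketched like this (the trait name `SchedulerInner` is from the comment above; the struct, field, and threshold are made up for illustration):

```rust
// Promoting an inherent method to a trait method lets generic code,
// such as a cleaner loop, call it without knowing the concrete type.
trait SchedulerInner {
    fn is_overgrown(&self) -> bool;
}

struct PooledScheduler {
    usage_queue_count: usize,
}

impl SchedulerInner for PooledScheduler {
    fn is_overgrown(&self) -> bool {
        const MAX_USAGE_QUEUE_COUNT: usize = 200_000; // illustrative limit
        self.usage_queue_count > MAX_USAGE_QUEUE_COUNT
    }
}

// cleaner_main_loop-style code can now check any S: SchedulerInner.
fn should_trash<S: SchedulerInner>(pooled: &S) -> bool {
    pooled.is_overgrown()
}
```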

@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch from 4ec4752 to c9df3e8 Compare June 23, 2025 12:20
Comment on lines -1124 to -1129
pub struct UsageQueueLoader {
struct UsageQueueLoaderInner {
usage_queues: DashMap<Pubkey, UsageQueue>,
}

impl UsageQueueLoader {
pub fn load(&self, address: Pubkey) -> UsageQueue {
Member Author

@ryoqun ryoqun Jun 23, 2025

fewer pubs, less cognitive load. :)
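The "fewer pubs" shape can be sketched as follows. The real code uses `DashMap<Pubkey, UsageQueue>`; std types stand in here so the sketch is self-contained, and the `count` helper is purely illustrative:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

type Pubkey = [u8; 32]; // stand-in for the real Pubkey type

#[derive(Clone, Default)]
struct UsageQueue; // stand-in

// Private inner struct: callers outside the module can't touch the map
// directly, only go through UsageQueueLoader's public surface.
#[derive(Default)]
struct UsageQueueLoaderInner {
    usage_queues: Mutex<HashMap<Pubkey, UsageQueue>>,
}

#[derive(Default)]
pub struct UsageQueueLoader {
    inner: UsageQueueLoaderInner,
}

impl UsageQueueLoader {
    pub fn load(&self, address: Pubkey) -> UsageQueue {
        // Return the existing queue for this address, creating it on
        // first use.
        self.inner
            .usage_queues
            .lock()
            .unwrap()
            .entry(address)
            .or_default()
            .clone()
    }

    pub fn count(&self) -> usize {
        self.inner.usage_queues.lock().unwrap().len()
    }
}
```

Hiding the map behind a private inner type keeps the public API to one method, which is the cognitive-load win the comment is after.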

@@ -653,7 +731,7 @@ where
.lock()
.unwrap()
.as_mut()
.map(|respawner| respawner.banking_stage_monitor.status())
Member Author

well, respawner is a remnant of very old prototype code....

@ryoqun ryoqun requested a review from apfitzge June 23, 2025 12:54
@codecov-commenter

codecov-commenter commented Jun 23, 2025

Codecov Report

Attention: Patch coverage is 96.73913% with 6 lines in your changes missing coverage. Please review.

Project coverage is 82.8%. Comparing base (d25252f) to head (e1dd743).
Report is 135 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #6686    +/-   ##
========================================
  Coverage    82.8%    82.8%            
========================================
  Files         849      849            
  Lines      379311   379485   +174     
========================================
+ Hits       314229   314395   +166     
- Misses      65082    65090     +8     

fn discard_buffer(&self) {
self.thread_manager.discard_buffered_tasks();
}

#[cfg(test)]
fn change_next_task_id_for_block_production(&self, next_task_id: usize) {


nit: set_next_task_id_for_block_production. It also seems the chain of calls is not consistently named; some are called banking_stage, some block_production. Let's consistently use block_production.

Member Author

done: bd6078c

OwnedBySelf {
usage_queue_loader_inner: UsageQueueLoaderInner,
},
SharedWithBankingStage {


I'm trying to recall the reason this is "shared with banking stage". The usage queue is on the BankingStageHelper, which is held by both scheduler & workers - why do we need it on the workers again?

Member Author

Hope this doc rework should help future you recall the reason..?: 4e7f8e3

@ryoqun ryoqun requested a review from apfitzge June 24, 2025 14:16
@apfitzge apfitzge requested review from apfitzge and removed request for apfitzge July 2, 2025 02:12
// AtomicUsize's fetch_add entails wrapping semantics. So, it needs to be rather
// conservative to prevent overflowing from happening in production, under the constraint of not
// compromising performance at all (i.e. no limit check on hot path and no d-cache pressure). With
// this background given, it's exceedingly hard to conceive of task ids being allotted for more than half of


it's hard to conceive that we will ever hit it regardless of single-session status. If we could ingest & index a transaction in a single nanosecond, we'd need to be running for almost 300 years continuously to index u64::max/2 txs.

Member Author

@ryoqun ryoqun Jul 7, 2025

nice catch! I adjusted the tone of prose accordingly: 4940a57
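The reviewer's estimate checks out with quick arithmetic (a standalone sketch; the year constant is the Julian year of 365.25 days):

```rust
// At one task per nanosecond, how long would it take to allot
// u64::MAX / 2 task ids?
fn years_to_exhaust_half_range() -> f64 {
    let tasks = (u64::MAX / 2) as f64;  // ~9.22e18 ids in the budget
    let seconds = tasks * 1e-9;         // 1 ns per task
    seconds / (365.25 * 24.0 * 3600.0)  // seconds per Julian year
}
```

That works out to roughly 292 years of continuous nanosecond-rate allocation, which matches the "almost 300 years" figure above.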

break;
};

if let Some(pooled) = inner.peek_pooled() {
{
if pooled.is_overgrown() {
// This is very unlikely code path to address a theoretically-possible


Can you explain to me why this is very unlikely? The u64 indexing overflow is never going to happen, but the overgrown usage queue seems like it could happen reasonably often during normal operation, right?

Member Author

another nice catch! I think the previous prose was misleading. I think things are clarified now?: e1dd743
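The mitigation step in the cleaner thread amounts to: when an idle (pooled) scheduler reports overgrown, drop it instead of returning it for reuse. A minimal sketch, with all names besides `is_overgrown` invented for illustration:

```rust
trait InnerLike {
    fn is_overgrown(&self) -> bool;
}

struct Pool<S: InnerLike> {
    pooled: Vec<S>,
}

impl<S: InnerLike> Pool<S> {
    // One pass of the cleaner: trash overgrown pooled schedulers (they
    // are simply dropped here) and keep the rest for reuse. Returns how
    // many were trashed.
    fn clean_overgrown(&mut self) -> usize {
        let before = self.pooled.len();
        self.pooled.retain(|scheduler| !scheduler.is_overgrown());
        before - self.pooled.len()
    }
}
```

Because the check runs against pooled (idle) schedulers only, it never touches the scheduling hot path.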

@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch 4 times, most recently from 7606c17 to 3dbd55a Compare July 7, 2025 01:56
@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch from 3dbd55a to f0b1962 Compare July 7, 2025 02:03
@ryoqun ryoqun force-pushed the overgrown-bp-scheduler-trashing branch from f0b1962 to e1dd743 Compare July 7, 2025 02:12
@ryoqun ryoqun requested a review from apfitzge July 7, 2025 02:16
@ryoqun ryoqun added the automerge label (Merge this Pull Request automatically once CI passes) Jul 7, 2025
@apfitzge apfitzge merged commit 0de8717 into anza-xyz:master Jul 7, 2025
29 checks passed
Labels: automerge (Merge this Pull Request automatically once CI passes), need:merge-assist

3 participants