Trash overgrown bp unified scheduler #6686

Conversation
@@ -1281,10 +1421,6 @@ where
        // before that.
        self.thread_manager.are_threads_joined()
    }

    fn is_overgrown(&self) -> bool {
This inherent method is promoted to a trait method of `SchedulerInner` to make it callable inside `cleaner_main_loop`.
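The promotion described above can be sketched as follows. Only `SchedulerInner` and `is_overgrown` come from the diff; the struct fields, threshold, and the generic `should_trash` helper are hypothetical stand-ins showing why a trait method (unlike an inherent one) is callable from generic cleaner-side code.

```rust
// Hypothetical sketch: promoting an inherent method to a trait method so
// generic code (like a cleaner loop) can call it on any scheduler type.

trait SchedulerInner {
    // Promoted from an inherent method on the concrete type.
    fn is_overgrown(&self) -> bool;
}

struct PooledScheduler {
    usage_queue_count: usize,
    max_usage_queue_count: usize, // hypothetical threshold field
}

impl SchedulerInner for PooledScheduler {
    fn is_overgrown(&self) -> bool {
        self.usage_queue_count > self.max_usage_queue_count
    }
}

// Generic caller: works for any S implementing the trait, which an
// inherent method on one concrete type could not offer.
fn should_trash<S: SchedulerInner>(scheduler: &S) -> bool {
    scheduler.is_overgrown()
}

fn main() {
    let small = PooledScheduler { usage_queue_count: 10, max_usage_queue_count: 100 };
    let big = PooledScheduler { usage_queue_count: 1_000, max_usage_queue_count: 100 };
    assert!(!should_trash(&small));
    assert!(should_trash(&big));
    println!("ok");
}
```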
Force-pushed from 4ec4752 to c9df3e8
pub struct UsageQueueLoader {
struct UsageQueueLoaderInner {
    usage_queues: DashMap<Pubkey, UsageQueue>,
}

impl UsageQueueLoader {
    pub fn load(&self, address: Pubkey) -> UsageQueue {
fewer `pub`s, less cognitive load. :)
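The diff above keys a `DashMap` by `Pubkey` to hand out one shared `UsageQueue` per address. A minimal sketch of that load-or-create behavior, using only the standard library (`Mutex<HashMap>` standing in for the concurrent `DashMap`, and placeholder `Pubkey`/`UsageQueue` types — all assumptions, not the real definitions):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Placeholder stand-ins for the real solana types.
type Pubkey = [u8; 32];

#[derive(Clone, Default)]
struct UsageQueue(Arc<Mutex<Vec<u64>>>); // placeholder payload

#[derive(Default)]
struct UsageQueueLoader {
    // The real code uses DashMap for lock-free concurrent access;
    // a Mutex-wrapped HashMap approximates the semantics here.
    usage_queues: Mutex<HashMap<Pubkey, UsageQueue>>,
}

impl UsageQueueLoader {
    // Load-or-create: every address maps to exactly one shared queue,
    // and callers get cheap clonable handles to it.
    fn load(&self, address: Pubkey) -> UsageQueue {
        self.usage_queues
            .lock()
            .unwrap()
            .entry(address)
            .or_default()
            .clone()
    }
}

fn main() {
    let loader = UsageQueueLoader::default();
    let a = loader.load([1; 32]);
    let b = loader.load([1; 32]);
    // The same address yields handles to the same underlying queue.
    assert!(Arc::ptr_eq(&a.0, &b.0));
    println!("ok");
}
```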
@@ -653,7 +731,7 @@ where
        .lock()
        .unwrap()
        .as_mut()
        .map(|respawner| respawner.banking_stage_monitor.status())
well, `respawner` is a remnant of very old prototype code....
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           master    #6686     +/-   ##
=========================================
  Coverage    82.8%    82.8%
=========================================
  Files         849      849
  Lines      379311   379485    +174
=========================================
+ Hits       314229   314395    +166
- Misses      65082    65090      +8
unified-scheduler-pool/src/lib.rs (Outdated)
fn discard_buffer(&self) {
    self.thread_manager.discard_buffered_tasks();
}

#[cfg(test)]
fn change_next_task_id_for_block_production(&self, next_task_id: usize) {
nit: `set_next_task_id_for_block_production`. It also seems the chain of calls is not consistently named; some are called banking_stage, some block_production. Let's consistently use block_production.
done: bd6078c
OwnedBySelf {
    usage_queue_loader_inner: UsageQueueLoaderInner,
},
SharedWithBankingStage {
I'm trying to recall the reason this is "shared with banking stage". The usage queue is on the BankingStageHelper, which is held by both scheduler & workers - why do we need it on the workers again?
Hope this doc rework helps future you recall the reason..? 4e7f8e3
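The two ownership modes debated above can be sketched with the variant names from the diff. Everything except `OwnedBySelf`, `SharedWithBankingStage`, `usage_queue_loader_inner`, and `UsageQueueLoaderInner` is a hypothetical simplification: block verification owns its loader outright, while block production shares one loader (held by the banking-stage side) via `Arc`.

```rust
use std::sync::Arc;

// Hypothetical unit struct standing in for the real inner loader.
#[derive(Default)]
struct UsageQueueLoaderInner;

// Variant names taken from the diff; the Arc-based sharing is an
// illustrative assumption about how "shared with banking stage" works.
enum UsageQueueLoader {
    OwnedBySelf {
        usage_queue_loader_inner: UsageQueueLoaderInner,
    },
    SharedWithBankingStage {
        shared: Arc<UsageQueueLoaderInner>,
    },
}

impl UsageQueueLoader {
    // Callers see one inner loader either way; only ownership differs.
    fn inner(&self) -> &UsageQueueLoaderInner {
        match self {
            Self::OwnedBySelf { usage_queue_loader_inner } => usage_queue_loader_inner,
            Self::SharedWithBankingStage { shared } => shared,
        }
    }
}

fn main() {
    let owned = UsageQueueLoader::OwnedBySelf {
        usage_queue_loader_inner: UsageQueueLoaderInner,
    };
    let banking_stage_side = Arc::new(UsageQueueLoaderInner);
    let shared = UsageQueueLoader::SharedWithBankingStage {
        shared: Arc::clone(&banking_stage_side),
    };
    let _ = owned.inner();
    let _ = shared.inner();
    // Two handles alive: the banking stage's and the scheduler's.
    assert_eq!(Arc::strong_count(&banking_stage_side), 2);
    println!("ok");
}
```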
unified-scheduler-pool/src/lib.rs (Outdated)
// AtomicUsize's fetch_add entails the wrapping semantics. So, it's needed to be rather
// conservative to prevent overflowing from happening on production, under the constraint of not
// compromising performance at all (i.e. no limit check on hot path and no d-cache pressure). With
// these background given, it's exceedingly hard to conceive task id is alloted more than half of
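The wrapping semantics the comment refers to can be demonstrated directly. A minimal sketch (using `AtomicU64` for determinism across platforms; the quoted comment names `AtomicUsize`, which behaves the same on 64-bit targets): `fetch_add` silently wraps on overflow, which is why any safety limit must live off the hot path rather than as a per-increment check.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    // fetch_add has no checked variant: at the maximum value it wraps
    // silently back to zero instead of panicking or saturating.
    let next_task_id = AtomicU64::new(u64::MAX);
    let previous = next_task_id.fetch_add(1, Ordering::Relaxed);
    assert_eq!(previous, u64::MAX);
    assert_eq!(next_task_id.load(Ordering::Relaxed), 0); // wrapped around
    println!("ok");
}
```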
it's hard to conceive that we will ever hit it regardless of single-session status. If we could ingest & index a transaction in a single nanosecond, we'd need to be running for almost 300 years continuously to index u64::max/2 txs.
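The back-of-the-envelope arithmetic above checks out: `u64::MAX / 2` is about 9.2 × 10^18, and a year is about 3.16 × 10^16 nanoseconds, so consuming one id per nanosecond takes roughly 292 years.

```rust
fn main() {
    // At one task id per nanosecond, how long until u64::MAX / 2
    // ids have been allotted?
    let half_max = u64::MAX / 2; // ids available before the halfway point
    let ns_per_year = 365.25 * 24.0 * 3600.0 * 1e9;
    let years = half_max as f64 / ns_per_year;
    assert!(years > 290.0 && years < 295.0); // roughly 292 years
    println!("{years:.0} years");
}
```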
nice catch! I adjusted the tone of the prose accordingly: 4940a57
unified-scheduler-pool/src/lib.rs (Outdated)
            break;
        };

        if let Some(pooled) = inner.peek_pooled() {
            if pooled.is_overgrown() {
                // This is very unlikely code path to address a theoretically-possible
Can you explain to me why this is very unlikely? The u64 indexing overflow is never going to happen, but the overgrown usage queue seems like it could happen reasonably often during normal operation, right?
another nice catch! I think the previous prose was misleading. I think things are clarified now?: e1dd743
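The cleaner-side mitigation under discussion can be sketched as follows. Only `is_overgrown` and the pooled-scheduler idea come from the diff; the pool representation, threshold, and `clean_pool` helper are hypothetical. The point is that the check happens out of the hot path, and an overgrown scheduler is simply dropped (trashed) instead of being returned to the pool.

```rust
// Hypothetical scheduler with only the field the check needs.
struct Scheduler {
    usage_queue_count: usize,
}

impl Scheduler {
    fn is_overgrown(&self) -> bool {
        const MAX_USAGE_QUEUE_COUNT: usize = 100; // hypothetical threshold
        self.usage_queue_count > MAX_USAGE_QUEUE_COUNT
    }
}

// Cleaner-side sweep: drop overgrown schedulers, keep the rest pooled.
// Returns how many were trashed.
fn clean_pool(pool: &mut Vec<Scheduler>) -> usize {
    let before = pool.len();
    // retain() drops (trashes) the overgrown ones in place.
    pool.retain(|scheduler| !scheduler.is_overgrown());
    before - pool.len()
}

fn main() {
    let mut pool = vec![
        Scheduler { usage_queue_count: 10 },
        Scheduler { usage_queue_count: 1_000 },
    ];
    let trashed = clean_pool(&mut pool);
    assert_eq!(trashed, 1);
    assert_eq!(pool.len(), 1);
    println!("ok");
}
```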
Force-pushed from 7606c17 to 3dbd55a
Force-pushed from 3dbd55a to f0b1962
Force-pushed from f0b1962 to e1dd743
Problem

Currently, the block production unified scheduler's overgrown status isn't checked due to insufficient plumbing. Relatedly, there's a largely theoretical attack vector of a remotely controlled overgrown-status-check bypass in the (not-yet-production-ready) unified scheduler as a banking stage. Last but not least, trashing it isn't working to begin with, due to the lack of a proper implementation in disconnect_new_task_sender().

As a bonus, BankingStageHelper::task_id could overlap, but we're ignoring that fact. ;)

Summary of Changes

Fix them all with proper plumbing, the introduction of a mitigation step in the cleaner thread, and bp-ready thread shutdown signaling in disconnect_new_task_sender().

... and don't forget about the bonus. Let's wire that to overgrown. After years, it's finally evident to the public why I insisted on introducing the word "overgrown" back then (#1672), when it meant nothing more than a too-big UsageQueueLoader. :)

There should be no functional change for the block verification code path.

extracted from #3946
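The shutdown-signaling idea behind `disconnect_new_task_sender()` can be sketched with standard channels (the channel type and worker loop here are illustrative assumptions, not the PR's actual implementation): dropping the task sender disconnects the channel, so the worker's `recv()` returns an error and its loop exits cleanly, with no separate shutdown flag needed.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (new_task_sender, new_task_receiver) = mpsc::channel::<u64>();

    let worker = thread::spawn(move || {
        let mut processed = 0;
        // recv() returns Err once every sender has been dropped
        // (the channel is "disconnected"), ending the loop.
        while let Ok(_task) = new_task_receiver.recv() {
            processed += 1;
        }
        processed
    });

    new_task_sender.send(1).unwrap();
    new_task_sender.send(2).unwrap();
    drop(new_task_sender); // the "disconnect": signals the worker to stop

    assert_eq!(worker.join().unwrap(), 2);
    println!("ok");
}
```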