-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup() #4830
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup() #4830
Conversation
Test build #28122 has started for PR 4830 at commit
|
Test build #28122 has finished for PR 4830 at commit
|
Test FAILed. |
Jenkins, retest this please. |
Test build #28125 has started for PR 4830 at commit
|
Test build #28125 has finished for PR 4830 at commit
|
Test PASSed. |
Jenkins, retest this please. |
@pwendell @andrewor14, could one of you take a quick look at this patch? Should be pretty straightforward. |
Test build #28412 has started for PR 4830 at commit
|
Test build #28412 has finished for PR 4830 at commit
|
Test PASSed. |
@@ -109,7 +111,8 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging { | |||
"failedJobs" -> failedJobs.size, | |||
"completedStages" -> completedStages.size, | |||
"skippedStages" -> skippedStages.size, | |||
"failedStages" -> failedStages.size | |||
"failedStages" -> failedStages.size, | |||
"jobGroupToJobIds" -> jobGroupToJobIds.values.map(_.size).sum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, based on when elements are removed, the size of this should be the same as the size of jobIdToData
. Why not place jobGroupToJobIds
alongside it in getSizesOfSoftSizeLimitedCollections
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point; I'll make this change.
LGTM for the most part. I think this is mergeable as is but I would like to see the for loop written in a more readable way. |
Test build #29185 has started for PR 4830 at commit
|
Test build #29185 has finished for PR 4830 at commit
|
Test PASSed. |
Thanks for the new comments @JoshRosen. The latest version is really easy to follow. Merging into master. |
Also backported this to |
`StatusTracker.getJobIdsForGroup()` is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs. This patch adds a new map to `JobProgressListener` in order to speed up these lookups. Author: Josh Rosen <[email protected]> Closes #4830 from JoshRosen/statustracker-job-group-indexing and squashes the following commits: e39c5c7 [Josh Rosen] Address review feedback 6709fb2 [Josh Rosen] Merge remote-tracking branch 'origin/master' into statustracker-job-group-indexing 2c49614 [Josh Rosen] getOrElse 97275a7 [Josh Rosen] Add jobGroup to jobId index to JobProgressListener (cherry picked from commit d44a336) Signed-off-by: Josh Rosen <[email protected]>
StatusTracker.getJobIdsForGroup()
is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs.This patch adds a new map to
JobProgressListener
in order to speed up these lookups.