Skip to content

[SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup() #4830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

JoshRosen
Copy link
Contributor

StatusTracker.getJobIdsForGroup() is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs.

This patch adds a new map to JobProgressListener in order to speed up these lookups.

@SparkQA
Copy link

SparkQA commented Feb 28, 2015

Test build #28122 has started for PR 4830 at commit 2c49614.

  • This patch merges cleanly.

@JoshRosen JoshRosen changed the title [SPARK-6079 ] Use index to speed up StatusTracker.getJobIdsForGroup() [SPARK-6079] Use index to speed up StatusTracker.getJobIdsForGroup() Feb 28, 2015
@SparkQA
Copy link

SparkQA commented Feb 28, 2015

Test build #28122 has finished for PR 4830 at commit 2c49614.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28122/
Test FAILed.

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Feb 28, 2015

Test build #28125 has started for PR 4830 at commit 2c49614.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 28, 2015

Test build #28125 has finished for PR 4830 at commit 2c49614.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28125/
Test PASSed.

@JoshRosen
Copy link
Contributor Author

Jenkins, retest this please.

@JoshRosen
Copy link
Contributor Author

@pwendell @andrewor14, could one of you take a quick look at this patch? Should be pretty straightforward.

@SparkQA
Copy link

SparkQA commented Mar 9, 2015

Test build #28412 has started for PR 4830 at commit 2c49614.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 10, 2015

Test build #28412 has finished for PR 4830 at commit 2c49614.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28412/
Test PASSed.

@@ -109,7 +111,8 @@ class JobProgressListener(conf: SparkConf) extends SparkListener with Logging {
"failedJobs" -> failedJobs.size,
"completedStages" -> completedStages.size,
"skippedStages" -> skippedStages.size,
"failedStages" -> failedStages.size
"failedStages" -> failedStages.size,
"jobGroupToJobIds" -> jobGroupToJobIds.values.map(_.size).sum
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, based on when elements are removed, the size of this should be the same as the size of jobIdToData. Why not place jobGroupToJobIds alongside it in getSizesOfSoftSizeLimitedCollections?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; I'll make this change.

@andrewor14
Copy link
Contributor

LGTM for the most part. I think this is mergeable as is but I would like to see the for loop written in a more readable way.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29185 has started for PR 4830 at commit e39c5c7.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 25, 2015

Test build #29185 has finished for PR 4830 at commit e39c5c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29185/
Test PASSed.

@andrewor14
Copy link
Contributor

Thanks for the new comments @JoshRosen. The latest version is really easy to follow. Merging into master.

@asfgit asfgit closed this in d44a336 Mar 26, 2015
@JoshRosen JoshRosen deleted the statustracker-job-group-indexing branch April 2, 2015 20:13
@JoshRosen
Copy link
Contributor Author

Also backported this to branch-1.3 (1.3.1).

asfgit pushed a commit that referenced this pull request Apr 2, 2015
`StatusTracker.getJobIdsForGroup()` is implemented via a linear scan over a HashMap rather than using an index, which might be an expensive operation if there are many (e.g. thousands) of retained jobs.

This patch adds a new map to `JobProgressListener` in order to speed up these lookups.

Author: Josh Rosen <[email protected]>

Closes #4830 from JoshRosen/statustracker-job-group-indexing and squashes the following commits:

e39c5c7 [Josh Rosen] Address review feedback
6709fb2 [Josh Rosen] Merge remote-tracking branch 'origin/master' into statustracker-job-group-indexing
2c49614 [Josh Rosen] getOrElse
97275a7 [Josh Rosen] Add jobGroup to jobId index to JobProgressListener

(cherry picked from commit d44a336)
Signed-off-by: Josh Rosen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants