Merge Release 2.4.0 #153
Merged
Conversation
We were scaling up only when the job was waiting for resources to become available (job reason code = Resources). In the case of a requeued job the cluster didn't scale up, because the pending reason is NodeDown and then BeginTime. I'm adding them to the list of reasons that trigger scale-up. I also added Priority, because I think we can scale up if the job is waiting for other, higher-priority jobs. Signed-off-by: Enrico Usai <[email protected]>
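A minimal sketch of the reason-code check described above. The set name, function name, and exact reason strings are illustrative assumptions, not the project's actual jobwatcher code:

```python
# Hypothetical: Slurm pending-reason codes that should trigger scale-up,
# per the commit above (Resources plus the requeue/priority cases).
SCALE_UP_PENDING_REASONS = {"Resources", "NodeDown", "BeginTime", "Priority"}

def requires_scale_up(pending_reason):
    """Return True if a pending job with this Slurm reason code
    should count toward the scale-up decision."""
    return pending_reason in SCALE_UP_PENDING_REASONS
```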
Failed messages are not re-queued after 2 retries Signed-off-by: Francesco De Martino <[email protected]>
…able retries Signed-off-by: Francesco De Martino <[email protected]>
This allows the REMOVE event to remove a partially configured node from the scheduler. Signed-off-by: Francesco De Martino <[email protected]>
… as expected https://github.com/paramiko/paramiko/pull/1118/files Signed-off-by: Francesco De Martino <[email protected]>
…ort updates The instance properties containing the number of vcpus to use when adding a node to the scheduler were computed only when the jobwatcher was initialized, using a static instance type value read from the jobwatcher config file. After an update of the compute instance type, the jobwatcher kept using the old CPU values to compute the number of instances to start when scaling up the cluster. Now the instance properties are re-computed at every iteration of the jobwatcher, and the instance type is retrieved from the CloudFormation stack parameters. A better solution would be to update and reload the node daemons every time a pcluster update operation is performed. Signed-off-by: Francesco De Martino <[email protected]>
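A sketch of the per-iteration refresh under stated assumptions: the stack parameter key `ComputeInstanceType` and the function name are guesses, not the real jobwatcher code. The point is that the stack lookup sits inside the loop, so a `pcluster update` is picked up without restarting the daemon:

```python
import time
import boto3

def jobwatcher_loop(stack_name, region, interval=60):
    cfn = boto3.client("cloudformation", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)
    while True:
        # Re-read the compute instance type from the CloudFormation stack
        # parameters on every iteration.
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
        params = {p["ParameterKey"]: p["ParameterValue"] for p in stack["Parameters"]}
        instance_type = params["ComputeInstanceType"]  # assumed parameter key
        # Derive vcpus from the (possibly updated) instance type.
        info = ec2.describe_instance_types(InstanceTypes=[instance_type])
        vcpus = info["InstanceTypes"][0]["VCpuInfo"]["DefaultVCpus"]
        print("compute instance type: %s, vcpus: %s" % (instance_type, vcpus))
        time.sleep(interval)
```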
This logic will be used by sqswatcher as well Signed-off-by: Francesco De Martino <[email protected]>
Note: without FastSchedule set to 1, Slurm still accepts jobs that violate the CPU constraint. Signed-off-by: Francesco De Martino <[email protected]>
This additional timeout prevents the entire pool from timing out and allows more granular failure handling. Signed-off-by: Francesco De Martino <[email protected]>
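A minimal sketch of the per-task timeout idea using the standard library; the function and task shape are illustrative, not the project's actual code:

```python
import concurrent.futures

def run_all(tasks, task_timeout=30):
    """tasks: iterable of (callable, args_tuple). Returns (results, failed_tasks)."""
    results, failures = [], []
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(fn, *args): (fn, args) for fn, args in tasks}
        for future, task in futures.items():
            try:
                # Per-future timeout: one slow or hung task raises
                # TimeoutError for its own future instead of stalling
                # the whole batch, so each failure is handled alone.
                results.append(future.result(timeout=task_timeout))
            except (concurrent.futures.TimeoutError, Exception):
                failures.append(task)
    return results, failures
```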
- Fix: correctly handle the maintain_size call from self_terminate. If maintain_size was true, the node was not terminating but entered an infinite termination loop.
- Fix: dump idletime to file at every iteration, not only when it increases as was done before. There are cases where the idletime is reset to 0, and the stored copy must be updated accordingly.
- Fix: when _has_pending_jobs fails, the log message said the instance was not terminating while the instance was in fact being terminated. Decided to terminate the instance in this case, to avoid keeping a node up and running with a broken scheduler daemon.
- Refactoring: move the check for stack readiness outside of the main loop.
- Enhancement: reset idletime to 0 when the host becomes essential for the cluster, because of the ASG min_size or because there are pending jobs in the scheduler queue (see the sketch after this list).
- Refactoring: check for pending jobs before increasing the idletime, which also avoids misleading log messages.
- Refactoring: terminate the nodewatcher daemon when the node is terminating, rather than sleeping forever.
Signed-off-by: Francesco De Martino <[email protected]>
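A hypothetical sketch of the idletime handling from the list above; function names, parameters, and the on-disk format are assumptions, not the actual nodewatcher state handling:

```python
def update_idletime(idletime, has_pending_jobs, at_asg_min_size, interval_minutes):
    if has_pending_jobs or at_asg_min_size:
        # The node is (or may become) essential to the cluster:
        # reset the idle counter instead of letting it grow.
        return 0
    return idletime + interval_minutes

def persist_idletime(path, idletime):
    # Dump on every iteration, not only on increase: the counter can be
    # reset to 0, and the stored copy must follow.
    with open(path, "w") as f:
        f.write(str(idletime))
```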
Currently with Slurm and Torque, when a job is submitted and the requested slots per node are bigger than the number of available vcpus on a single instance, the jobwatcher selects a bigger number of nodes to fit the requested CPUs. This operation/assumption is incorrect, since the scheduler will never schedule the job: the CPU constraints are never satisfied. This causes the cluster to spin up new capacity that stays idle forever, or until the job is canceled. This commit fixes the above-mentioned issue. Signed-off-by: Francesco De Martino <[email protected]>
…irements When verifying whether there are pending jobs in the Slurm queue, we now discard pending jobs with unsatisfiable CPU requirements. This prevents the cluster from keeping nodes up and running for jobs that will never be scheduled. Signed-off-by: Francesco De Martino <[email protected]>
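A sketch of the filtering described in the two commits above. The job representation and field names are assumptions for illustration, not the real jobwatcher data model:

```python
from dataclasses import dataclass

@dataclass
class PendingJob:
    slots_per_node: int    # hypothetical field: CPUs requested per node
    requested_nodes: int   # hypothetical field: nodes requested

def schedulable_jobs(pending_jobs, vcpus_per_node, max_cluster_nodes):
    """Discard jobs whose per-node CPU request exceeds a single instance's
    vcpus, or whose node count exceeds the cluster limit: such jobs can
    never be scheduled, so they must neither drive scale-up nor keep
    idle nodes alive."""
    return [
        job for job in pending_jobs
        if job.slots_per_node <= vcpus_per_node
        and job.requested_nodes <= max_cluster_nodes
    ]
```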
This is done to speed up cluster recovery in case of failing nodes: jobwatcher will start requesting an additional node if possible, while nodewatcher takes care of terminating the faulty instance. Signed-off-by: Francesco De Martino <[email protected]>
nodewatcher is now able to self-terminate the instance in case the node is reported as down by the scheduler, or if it is not attached to the scheduler at all. The logs are dumped to the shared /home/logs/compute directory in order to facilitate future investigations. Signed-off-by: Francesco De Martino <[email protected]>
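A minimal self-termination sketch, assuming the EC2 instance metadata service and an Auto Scaling group; the actual nodewatcher flow may differ, but terminate_instance_in_auto_scaling_group is a real Auto Scaling API:

```python
import boto3
import requests

def self_terminate(region):
    # Discover this instance's ID from the EC2 instance metadata service.
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text
    asg = boto3.client("autoscaling", region_name=region)
    # Terminate this instance and shrink the group, so the ASG does not
    # immediately replace a node the scheduler considers broken.
    asg.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True
    )
```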
… for job Signed-off-by: Francesco De Martino <[email protected]>
Assuming that a node in an orphaned state is no longer in the ASG, it is not included in the busy count, to prevent overscaling. Signed-off-by: Francesco De Martino <[email protected]>
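A hypothetical sketch of the busy-count rule above; the node fields and the IP-based ASG membership check are illustrative assumptions:

```python
def busy_node_count(scheduler_nodes, asg_instance_ips):
    """Count busy nodes, excluding orphaned ones (known to the scheduler
    but no longer in the ASG), so they cannot inflate the count and make
    jobwatcher request more capacity than the jobs actually need."""
    return sum(
        1 for node in scheduler_nodes
        if node.is_busy and node.ip in asg_instance_ips
    )
```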
Merge feature branch sge-enhancements into develop
… 2.6 In Python 2.6, CalledProcessError does not contain the output of the failed command. Signed-off-by: Francesco De Martino <[email protected]>
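A sketch of a version-tolerant error handler for this quirk; the wrapper function is illustrative, not the project's actual code:

```python
import subprocess

def run_command(cmd):
    try:
        subprocess.check_call(cmd)
    except subprocess.CalledProcessError as e:
        # On Python 2.6 the exception has no `output` attribute,
        # so read it defensively instead of assuming it exists.
        output = getattr(e, "output", None)
        raise RuntimeError(
            "Command %s failed with code %s, output: %s"
            % (cmd, e.returncode, output)
        )
```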
Starting the compute node SGE daemon after the node is added to the queue causes the scheduler to see the node as unknown for a few seconds. If the jobwatcher wakes up in this period, it will mark the node as unreachable and spin up a new instance, causing overscaling. Signed-off-by: Francesco De Martino <[email protected]>
This adds some logic to correctly recompute the number of required nodes Signed-off-by: Francesco De Martino <[email protected]>
This adds some logic to correctly filter out jobs that cannot be executed due to cluster limits Signed-off-by: Francesco De Martino <[email protected]>
…rque Note: SGE and Slurm are not affected by this. Signed-off-by: Francesco De Martino <[email protected]>
Since a compute node cannot change its instance type, it makes no sense to dynamically fetch this value in the nodewatcher loop. This change reduces the calls to the DescribeStacks API, which allows only a relatively low TPS. Signed-off-by: Francesco De Martino <[email protected]>
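A sketch of the caching described above, mirroring the jobwatcher example earlier but with the lookup hoisted out of the loop; the parameter key is an assumption:

```python
import time
import boto3

def nodewatcher_main(stack_name, region, interval=60):
    cfn = boto3.client("cloudformation", region_name=region)
    # Fetch the instance type once at startup: a compute node's type
    # cannot change, so there is no need to poll for it.
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    params = {p["ParameterKey"]: p["ParameterValue"] for p in stack["Parameters"]}
    instance_type = params["ComputeInstanceType"]  # assumed key; read once
    print("cached compute instance type: %s" % instance_type)
    while True:
        # ... idle checks use the cached instance_type; no further
        # DescribeStacks traffic from inside the loop.
        time.sleep(interval)
```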
sean-smith approved these changes on Jun 11, 2019
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.