
Merge Release 2.4.0 #153


Merged
merged 65 commits into master on Jun 11, 2019

Conversation

lukeseawalker
Contributor

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

enrico-usai and others added 30 commits April 15, 2019 14:40
We were scaling up only when the job was waiting for resources
to become available (Job reason code = Resources).

In case of a requeued job the cluster didn't scale up, because the
pending reason is NodeDown and then BeginTime.
I'm adding them to the list of reasons for scale-up.

I also added Priority because I think we can scale up if the job
is waiting on other higher-priority jobs.

Signed-off-by: Enrico Usai <[email protected]>
Signed-off-by: Luca Carrogu <[email protected]>
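
A minimal sketch of the reason-code check described in the commit message above, assuming Slurm's `squeue` is available; the function name and the exact set of reasons the project handles are illustrative:

```python
import subprocess

# Pending-reason codes that should trigger a scale-up. "Resources" was the only
# one considered before; "NodeDown", "BeginTime" and "Priority" are the additions
# described in the commit message.
SCALE_UP_REASONS = {"Resources", "NodeDown", "BeginTime", "Priority"}

def pending_jobs_requiring_scale_up():
    """Return ids of pending Slurm jobs whose reason justifies adding nodes."""
    # %i = job id, %t = compact state, %r = pending reason; one job per line
    output = subprocess.check_output(
        ["squeue", "--noheader", "--format=%i|%t|%r"]
    ).decode()
    jobs = []
    for line in output.splitlines():
        job_id, state, reason = line.split("|", 2)
        if state == "PD" and reason in SCALE_UP_REASONS:
            jobs.append(job_id)
    return jobs
```
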
Failed messages are not re-queued after 2 retries

Signed-off-by: Francesco De Martino <[email protected]>
This allows the REMOVE event to remove a partially configured node from the scheduler

Signed-off-by: Francesco De Martino <[email protected]>
…ort updates

The instance properties containing the number of vcpus to use when adding a node
to the scheduler were computed only when the jobwatcher was initialized, and were
based on a static instance type value read from the jobwatcher config file.
In case of an update of the compute instance type, the jobwatcher kept using
the old vcpu values to compute the number of instances to start when
scaling up the cluster.
Now the instance properties are re-computed at every iteration of the jobwatcher,
and the instance type is retrieved from the CloudFormation stack parameters.
A better solution would consist of updating and reloading the node daemons
every time a pcluster update operation is performed.

Signed-off-by: Francesco De Martino <[email protected]>
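
A rough sketch of the per-iteration lookup described above, using boto3 and assuming a stack parameter named ComputeInstanceType; the parameter name and function signature are illustrative, not the project's actual code:

```python
import boto3

def get_compute_instance_type(stack_name, region):
    """Read the compute instance type from the cluster's CloudFormation stack.

    Called at every jobwatcher iteration so that a 'pcluster update' changing
    the compute instance type is picked up without restarting the daemon.
    """
    cfn = boto3.client("cloudformation", region_name=region)
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    params = {p["ParameterKey"]: p["ParameterValue"] for p in stack["Parameters"]}
    return params["ComputeInstanceType"]  # parameter name is an assumption
```
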
This logic will be used by sqswatcher as well

Signed-off-by: Francesco De Martino <[email protected]>
Note: without FastSchedule set to 1, Slurm still accepts
jobs that violate the CPU constraint

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
This additional timeout prevents the entire pool from timing out and
allows more granular failure handling

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
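
A minimal sketch of the per-task timeout idea described in the commit message above, using multiprocessing's `AsyncResult.get(timeout=...)`; the function, worker count and timeout value are illustrative, not the project's actual pool handling:

```python
from multiprocessing import Pool, TimeoutError

def process_with_timeouts(func, items, task_timeout=60, processes=4):
    """Apply func to each item with a per-task timeout.

    A single slow task only fails its own result instead of stalling the
    whole pool, so failures can be handled one by one.
    """
    pool = Pool(processes=processes)
    try:
        async_results = [(item, pool.apply_async(func, (item,))) for item in items]
        results = {}
        for item, async_result in async_results:
            try:
                results[item] = async_result.get(timeout=task_timeout)
            except TimeoutError:
                results[item] = None  # record the single failure and keep going
        return results
    finally:
        pool.terminate()
```
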
- Fix: correctly handle the maintain_size call from self_terminate. If maintain_size
  was true the node was not terminating but was entering an infinite termination loop.
- Fix: dump idletime to file at every iteration, not only when it increases as was
  done before. There are cases where the idletime is reset to 0, so the stored copy
  needs to be updated as well (see the sketch after this commit message).
- Fix: when _has_pending_jobs failed, the log message said the instance was not being
  terminated even though it was. Decided to terminate the instance in this case to avoid
  keeping a node up and running with a broken scheduler daemon.
- Refactoring: move the check for stack readiness outside of the main loop
- Enhancement: reset idletime to 0 when the host becomes essential for the cluster
  (because of the ASG min_size or because there are pending jobs in the scheduler queue)
- Refactoring: check for pending jobs before increasing the idletime, also to avoid
  misleading log messages
- Refactoring: terminate the nodewatcher daemon when the node is terminating rather than
  sleeping forever

Signed-off-by: Francesco De Martino <[email protected]>
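
A minimal sketch of the idletime handling described in the fixes above; the file path, field names and loop period are illustrative:

```python
import json

IDLETIME_FILE = "/var/run/nodewatcher/node_idletime.json"  # path is illustrative

def update_idletime(idletime, has_pending_jobs, at_asg_min_size, loop_minutes):
    """Recompute and persist idletime for one nodewatcher iteration.

    The node is 'essential' when the ASG is at min_size or the scheduler has
    pending jobs, so idletime is reset to 0; otherwise it grows by the loop period.
    """
    if has_pending_jobs or at_asg_min_size:
        idletime = 0
    else:
        idletime += loop_minutes

    # Dump at every iteration (not only when it increases) so that resets to 0
    # are also reflected in the stored copy.
    with open(IDLETIME_FILE, "w") as f:
        json.dump({"current_idletime": idletime}, f)
    return idletime
```
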
Currently, with Slurm and Torque, when a job is submitted and the requested
slots per node are larger than the number of available vcpus on a single
instance, the jobwatcher selects a larger number of nodes to fit the
requested CPUs. This operation/assumption is incorrect since the scheduler
will never schedule the job, because the CPU constraints can never be
satisfied. This causes the cluster to spin up new capacity that stays
idle forever, or until the job is canceled.
This commit fixes the above issue.

Signed-off-by: Francesco De Martino <[email protected]>
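
A rough sketch of the corrected node estimation described above, assuming pending jobs are represented as dicts with illustrative keys:

```python
def required_compute_nodes(pending_jobs, vcpus_per_instance):
    """Estimate how many compute nodes the pending jobs need.

    A job requesting more slots per node than one instance offers is skipped:
    the scheduler never relaxes the per-node CPU constraint, so adding nodes
    for it would only spin up capacity that stays idle.
    """
    nodes = 0
    for job in pending_jobs:
        if job["slots_per_node"] > vcpus_per_instance:
            continue  # cannot be satisfied by this instance type, do not scale for it
        nodes += job["nodes"]
    return nodes
```
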
…irements

When verifying whether there are pending jobs in the Slurm queue, we
now discard pending jobs with unsatisfiable CPU requirements.
This prevents the cluster from keeping nodes up and running for jobs
that will never be scheduled.

Signed-off-by: Francesco De Martino <[email protected]>
This is done to speed up cluster recovery in case of failing
nodes. jobwatcher will request an additional node if possible,
while nodewatcher takes care of terminating the faulty instance.

Signed-off-by: Francesco De Martino <[email protected]>
nodewatcher is now able to self-terminate the instance in case the node
is reported as down by the scheduler, or is not attached to the scheduler
at all. The logs are dumped to the shared /home/logs/compute directory
in order to facilitate future investigations.

Signed-off-by: Francesco De Martino <[email protected]>
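
A minimal sketch of the self-termination path described above, assuming the instance id is read from instance metadata and termination goes through the Auto Scaling API; whether desired capacity is decremented here is an assumption, not something this PR confirms:

```python
import boto3
from urllib.request import urlopen

def self_terminate_if_unhealthy(node_is_healthy, region):
    """Terminate this instance when the scheduler reports it down or unattached."""
    if node_is_healthy:
        return
    # Instance id of the node we are running on, from the EC2 metadata service.
    instance_id = urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    # Terminate through the ASG; decrementing desired capacity here is one
    # possible choice, letting jobwatcher request replacement capacity instead.
    boto3.client("autoscaling", region_name=region).terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True
    )
```
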
Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
demartinofra and others added 28 commits May 16, 2019 12:08
Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
Assuming that a node in an orphaned state is no longer in the ASG, it is
not included in the busy count, to prevent overscaling.

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Luca Carrogu <[email protected]>
Merge feature branch sge-enhancements into develop
… 2.6

In Python 2.6, CalledProcessError does not contain the output of the failed command

Signed-off-by: Francesco De Martino <[email protected]>
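
A sketch of the usual workaround: capture the output with Popen and attach it to the exception manually, since `CalledProcessError` only gained an `output` attribute (and `subprocess.check_output`) in Python 2.7; the function name is illustrative:

```python
import subprocess

def run_command(command):
    """Run a command and return its output, raising with the output on failure."""
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    output = process.communicate()[0]
    if process.returncode != 0:
        error = subprocess.CalledProcessError(process.returncode, command)
        error.output = output  # not populated by Python 2.6, so attach it manually
        raise error
    return output
```
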
Starting the compute node SGE daemon after the node is added to the queue causes
the scheduler to see the node as unknown for a few seconds. If jobwatcher wakes up
in this window, it marks the node as unreachable and spins up a new instance,
causing overscaling

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
This adds some logic to correctly recompute the number of required nodes

Signed-off-by: Francesco De Martino <[email protected]>
This adds some logic to correctly filter out jobs that cannot be executed due to cluster limits

Signed-off-by: Francesco De Martino <[email protected]>
…rque

Note: SGE and Slurm are not affected by this
Signed-off-by: Francesco De Martino <[email protected]>
Since the compute node cannot change its instance type, it makes no sense
to fetch this value dynamically in the nodewatcher loop.
This change reduces the calls to the DescribeStacks API, which allows a
relatively low TPS

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
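
A minimal sketch of the caching choice described in the DescribeStacks commit message above, the inverse of the per-iteration lookup jobwatcher uses; the module-level cache and the ComputeInstanceType parameter name are illustrative:

```python
import boto3

_CACHED_INSTANCE_TYPE = None  # resolved once at daemon start, reused afterwards

def compute_instance_type(stack_name, region):
    """Return the compute instance type, calling DescribeStacks at most once.

    nodewatcher runs on the compute node itself, whose instance type cannot
    change, so the value is cached after the first lookup instead of being
    re-fetched against a low-TPS API at every loop iteration.
    """
    global _CACHED_INSTANCE_TYPE
    if _CACHED_INSTANCE_TYPE is None:
        cfn = boto3.client("cloudformation", region_name=region)
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
        params = {p["ParameterKey"]: p["ParameterValue"] for p in stack["Parameters"]}
        _CACHED_INSTANCE_TYPE = params["ComputeInstanceType"]  # name is an assumption
    return _CACHED_INSTANCE_TYPE
```
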
@lukeseawalker merged commit 9dbff99 into master on Jun 11, 2019