Merge Release 2.4.0 #153

Merged
merged 65 commits into from
Jun 11, 2019
d33723f
slurm: add other reasons for the cluster scale up
enrico-usai Apr 12, 2019
0581c01
Bump version to 2.3.2 alpha 1
lukeseawalker Apr 8, 2019
88f08cf
sqswatcher: retry failed event messages up to 2 times
demartinofra Apr 17, 2019
de73a3b
sqswatcher: raise exception when _restart_compute_daemons fails to en…
demartinofra Apr 17, 2019
b1c43d3
sqswatcher: add item to dynamodb also on failed ADD events
demartinofra Apr 17, 2019
f74ca50
sqswatcher - slurm: remove timeout recv_exit_status since not working…
demartinofra Apr 17, 2019
810ef26
sqswatcher - slurm: remove duplicate log message
demartinofra Apr 17, 2019
9dafebc
jobwatcher: re-compute instance properties at every iteration to supp…
demartinofra Apr 15, 2019
0960f06
remove log parameter from common.utils functions
demartinofra Apr 15, 2019
1f42802
Move logic to retrieve instance properties to common.utils
demartinofra Apr 16, 2019
e3d421b
sqswatcher - slurm: set correct CPUs for dummy nodes
demartinofra Apr 16, 2019
33be70e
sqswatcher: remove unneeded max_queue_size from sqs_config
demartinofra Apr 17, 2019
744fc6f
Remove unused daemons test config files
demartinofra Apr 17, 2019
e69b508
Add fallback and retry logic to get_compute_instance_type
demartinofra Apr 17, 2019
e01b022
sqswatcher - slurm: close ssh_client in finally clause
demartinofra Apr 17, 2019
9f8c4e0
sqswatcher - slurm: add timeout for ssh command execution
demartinofra Apr 17, 2019
bdfa61b
Add tox and tox travis support
demartinofra Apr 18, 2019
3fc5e48
Reformat files with black and isort and enable travis check
demartinofra Apr 18, 2019
cd8e840
Minor enhancements and refactoring for nodewatcher
demartinofra Apr 18, 2019
896bbac
jobwatcher: do not scale-up if CPU constraints cannot be satisfied
demartinofra Apr 19, 2019
0c32a24
nodewatcher - slurm: scale down if pending jobs have invalid CPU requ…
demartinofra Apr 23, 2019
9a1148d
Add time_utils module with time conversion functions
demartinofra May 1, 2019
64eb1ec
nodewatcher: handle broader Exception in hasPendingJobs
demartinofra May 1, 2019
e4d2cc6
nodewatcher - slurm: add function to check if node is down
demartinofra May 1, 2019
a071fec
jobwatcher - slurm: add down nodes to busy nodes count
demartinofra May 1, 2019
0e6cb18
jobwatcher: remove unused param from get_busy_node func
demartinofra May 1, 2019
cdf6bbd
nodewatcher: add logic to self terminate failing node
demartinofra May 1, 2019
edf5abe
nodewatcher: defined termination timeouts in constants
demartinofra May 2, 2019
c9b2a8d
fix imports order
demartinofra May 2, 2019
3f00c75
nodewatcher: fix tarfile for python 2.6
demartinofra May 6, 2019
3a97dbf
Define class to execute remote commands over SSH
demartinofra May 10, 2019
20f35c0
_run_command: return command output on errors
demartinofra May 10, 2019
f166e50
sqswatcher - sge: parallelize node processing and improve error handling
demartinofra May 10, 2019
c984304
Add some comments to clarify functions behaviour
demartinofra May 15, 2019
3fd705f
sqswatcher - slurm: use RemoteCommandExecutor to reboot compute nodes
demartinofra May 14, 2019
8183bc1
nodewatcher - sge: check only for running/suspended jobs in hasJobs f…
demartinofra May 14, 2019
058e758
Create a common module for SGE related commands
demartinofra May 15, 2019
54f0d92
sge: move locking functions to sge_commands
demartinofra May 15, 2019
18bf0e3
sge: implement logic to parse xml output from sge commands
demartinofra May 16, 2019
2b0e545
sge: add unit-tests for sge_commands
demartinofra May 16, 2019
70e0f8e
Add configparser dependency for python 2/3 compatibility
demartinofra May 16, 2019
1f7100c
unit tests: fix dependencies for python 2.6
demartinofra May 16, 2019
36366cb
nodewatcher - sge: implement logic to self terminate faulty node
demartinofra May 14, 2019
ea10f62
jobwatcher - sge: include unusable nodes in busy node count
demartinofra May 16, 2019
26f3470
jobwatcher - sge: verify cluster limits when computing slots required…
demartinofra May 14, 2019
307205f
jobwatcher - sge: do not scale up when a job is pending a dependency
demartinofra May 14, 2019
39c322e
nodewatcher - sge: scale down if pending jobs cannot be scheduled
demartinofra May 16, 2019
9302beb
Add missing copyright to unit tests
demartinofra May 21, 2019
957e5e4
Replace dict comprehension for python 2.6 compatibility
demartinofra May 21, 2019
fe9ac37
jobwatcher - sge: skip orphaned nodes from busy nodes count
demartinofra May 22, 2019
74d2bf2
sge: remove queue info from hostname when parsing xml
demartinofra May 22, 2019
c2f5889
Bump version to 2.4.0
lukeseawalker May 23, 2019
d870fb2
Merge remote-tracking branch 'upstream/develop' into sge-enhancements
demartinofra May 27, 2019
838d814
Merge pull request #145 from aws/sge-enhancements
demartinofra May 27, 2019
7fb8e6c
Fallback to empty string on subprocess.CalledProcessError with python…
demartinofra May 28, 2019
6cce65f
sqswatcher - sge: start compute daemon before adding to queue
demartinofra May 29, 2019
bcb8df1
nodewatcher - sge: Fix logging message
demartinofra May 27, 2019
b6bbe99
slurm: add ReqNodeNotAvail to whitelisted PENDING_RESOURCES_REASONS
demartinofra May 27, 2019
328ed5d
slurm: implement common module to run Slurm commands
demartinofra May 27, 2019
bfd5763
jobwatcher - slurm: integrate new get_pending_jobs_info function
demartinofra May 27, 2019
86bfca3
nodewatcher - slurm: integrate new get_pending_jobs_info function
demartinofra May 27, 2019
c3e20a3
Revert a fix in get_optimal_nodes to allow additional testing with to…
demartinofra May 27, 2019
4574613
nodewatcher: fetch instance type only at start-up
demartinofra Jun 5, 2019
291f81a
Reduce calls to CloudFormation APIs to avoid throttling
demartinofra Jun 5, 2019
9db8e2e
Update changelog for v2.4.0
demartinofra Jun 6, 2019
nodewatcher - slurm: add function to check if node is down
Signed-off-by: Francesco De Martino <[email protected]>
demartinofra committed May 2, 2019
commit e4d2cc65c0b23dda341f85f72a92405a05eeca3c
6 changes: 6 additions & 0 deletions nodewatcher/plugins/sge.py
@@ -72,3 +72,9 @@ def lockHost(hostname, unlock=False):
        run_sge_command(command)
    except subprocess.CalledProcessError:
        log.error("Error %s host %s", "unlocking" if unlock else "locking", hostname)


def is_node_down():
    """Check if node is down according to scheduler"""
    # ToDo: to be implemented
    return False
18 changes: 18 additions & 0 deletions nodewatcher/plugins/slurm.py
@@ -89,6 +89,24 @@ def lockHost(hostname, unlock=False):
        log.error("Error %s host %s", "unlocking" if unlock else "locking", hostname)


def is_node_down():
    """Check if node is down according to scheduler"""
    try:
        # retrieves the state of a specific node
        # https://slurm.schedmd.com/sinfo.html#lbAG
        # Output format:
        # down*
        command = "/bin/bash -c \"/opt/slurm/bin/sinfo --noheader -o '%T' -n $(hostname)\""
        output = check_command_output(command).strip()
        log.info("Node is in state: '{0}'".format(output))
        if output and all(state not in output for state in ["down", "drained", "fail"]):
            return False
    except Exception as e:
        log.error("Failed when checking if node is down with exception %s. Reporting node as down.", e)

    return True


def _get_node_slots():
    hostname = check_command_output("hostname")
    # retrieves number of slots for a specific node in the cluster.
6 changes: 6 additions & 0 deletions nodewatcher/plugins/torque.py
@@ -89,3 +89,9 @@ def lockHost(hostname, unlock=False):
        run_command(command)
    except subprocess.CalledProcessError:
        log.error("Error %s host %s", "unlocking" if unlock else "locking", hostname)


def is_node_down():
    """Check if node is down according to scheduler"""
    # ToDo: to be implemented
    return False
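
The Slurm version of is_node_down in the diff above is fail-safe: it reports the node as healthy only when sinfo returns a non-empty state that contains none of "down", "drained" or "fail", and it treats empty output or any exception as down. The sketch below is not part of this commit. It separates that decision from the sinfo call so it can be run without a scheduler. The names DOWN_STATES and is_down_state, and the run_sinfo callable passed to is_node_down, are hypothetical and introduced only for illustration.

```python
# Illustrative sketch only: DOWN_STATES, is_down_state and the run_sinfo
# parameter are hypothetical names, not part of the nodewatcher plugins.
DOWN_STATES = ("down", "drained", "fail")


def is_down_state(sinfo_output):
    """Return True unless the state string is non-empty and free of down markers."""
    state = sinfo_output.strip()
    # Substring match, as in the diff, so suffixed states such as "down*" still match.
    if state and all(marker not in state for marker in DOWN_STATES):
        return False
    # Empty output or any marker present: report the node as down.
    return True


def is_node_down(run_sinfo):
    """Apply the check to the output of run_sinfo(); any exception counts as down."""
    try:
        return is_down_state(run_sinfo())
    except Exception:
        return True
```

With this split, is_down_state("idle") returns False, while is_down_state("down*") and is_down_state("") both return True. Because the match is on substrings, a state of "draining" does not contain "drained" and is therefore reported as not down by this logic.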