
Merge Release 2.4.0 #153


Merged
merged 65 commits into master on Jun 11, 2019

Conversation

lukeseawalker
Contributor

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

enrico-usai and others added 30 commits April 15, 2019 14:40
We were scaling up only when the job was waiting for resources
to become available (Job reason code = Resources).

In case of a requeued job the cluster didn't scale up, because the
pending reason is NodeDown and then BeginTime.
I'm adding them to the list of reasons for scale-up.

I also added Priority because I think we can scale up if the job
is waiting on other higher-priority jobs.

Signed-off-by: Enrico Usai <[email protected]>
Signed-off-by: Luca Carrogu <[email protected]>
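
A minimal sketch of the reason-code check described in the commit message above, assuming Slurm's `squeue` is available; the function name and the exact set of reasons the project handles are illustrative:

```python
import subprocess

# Pending-reason codes that should trigger a scale-up. "Resources" was the only
# one considered before; "NodeDown", "BeginTime" and "Priority" are the additions
# described in the commit message.
SCALE_UP_REASONS = {"Resources", "NodeDown", "BeginTime", "Priority"}

def pending_jobs_requiring_scale_up():
    """Return ids of pending Slurm jobs whose reason justifies adding nodes."""
    # %i = job id, %t = compact state, %r = pending reason; one job per line
    output = subprocess.check_output(
        ["squeue", "--noheader", "--format=%i|%t|%r"]
    ).decode()
    jobs = []
    for line in output.splitlines():
        job_id, state, reason = line.split("|", 2)
        if state == "PD" and reason in SCALE_UP_REASONS:
            jobs.append(job_id)
    return jobs
```
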
Failed messages are not re-queued after 2 retries

Signed-off-by: Francesco De Martino <[email protected]>
This allows the REMOVE event to remove a partially configured node from the scheduler

Signed-off-by: Francesco De Martino <[email protected]>
…ort updates

The instance properties containing the number of vcpus to use when adding a node
to the scheduler were computed only when the jobwatcher was initialized, and were
based on a static instance type value read from the jobwatcher config file.
In case of an update of the compute instance type, the jobwatcher kept using
the old vcpu values to compute the number of instances to start when
scaling up the cluster.
Now the instance properties are re-computed at every iteration of the jobwatcher,
and the instance type is retrieved from the CloudFormation stack parameters.
A better solution would consist of updating and reloading the node daemons
every time a pcluster update operation is performed.

Signed-off-by: Francesco De Martino <[email protected]>
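
A rough sketch of the per-iteration lookup described above, using boto3 and assuming a stack parameter named ComputeInstanceType; the parameter name and function signature are illustrative, not the project's actual code:

```python
import boto3

def get_compute_instance_type(stack_name, region):
    """Read the compute instance type from the cluster's CloudFormation stack.

    Called at every jobwatcher iteration so that a 'pcluster update' changing
    the compute instance type is picked up without restarting the daemon.
    """
    cfn = boto3.client("cloudformation", region_name=region)
    stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
    params = {p["ParameterKey"]: p["ParameterValue"] for p in stack["Parameters"]}
    return params["ComputeInstanceType"]  # parameter name is an assumption
```
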
This logic will be used by sqswatcher as well

Signed-off-by: Francesco De Martino <[email protected]>
Note: without FastSchedule set to 1, Slurm still accepts
jobs that violate the CPU constraint

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
This additional timeout prevents the entire pool from timing out and
allows more granular failure handling

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
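
A minimal sketch of the per-task timeout idea described in the commit message above, using multiprocessing's `AsyncResult.get(timeout=...)`; the function, worker count and timeout value are illustrative, not the project's actual pool handling:

```python
from multiprocessing import Pool, TimeoutError

def process_with_timeouts(func, items, task_timeout=60, processes=4):
    """Apply func to each item with a per-task timeout.

    A single slow task only fails its own result instead of stalling the
    whole pool, so failures can be handled one by one.
    """
    pool = Pool(processes=processes)
    try:
        async_results = [(item, pool.apply_async(func, (item,))) for item in items]
        results = {}
        for item, async_result in async_results:
            try:
                results[item] = async_result.get(timeout=task_timeout)
            except TimeoutError:
                results[item] = None  # record the single failure and keep going
        return results
    finally:
        pool.terminate()
```
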
- Fix: correctly handle the maintain_size call from self_terminate. If maintain_size
  was true the node was not terminating but was entering an infinite termination loop.
- Fix: dump idletime to file at every iteration, not only when it increases as was
  done before. There are cases where the idletime is reset to 0, so the stored copy
  needs to be updated as well (see the sketch after this commit message).
- Fix: when _has_pending_jobs failed, the log message said the instance was not being
  terminated even though it was. Decided to terminate the instance in this case to avoid
  keeping a node up and running with a broken scheduler daemon.
- Refactoring: move the check for stack readiness outside of the main loop
- Enhancement: reset idletime to 0 when the host becomes essential for the cluster
  (because of the ASG min_size or because there are pending jobs in the scheduler queue)
- Refactoring: check for pending jobs before increasing the idletime, also to avoid
  misleading log messages
- Refactoring: terminate the nodewatcher daemon when the node is terminating rather than
  sleeping forever

Signed-off-by: Francesco De Martino <[email protected]>
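
A minimal sketch of the idletime handling described in the fixes above; the file path, field names and loop period are illustrative:

```python
import json

IDLETIME_FILE = "/var/run/nodewatcher/node_idletime.json"  # path is illustrative

def update_idletime(idletime, has_pending_jobs, at_asg_min_size, loop_minutes):
    """Recompute and persist idletime for one nodewatcher iteration.

    The node is 'essential' when the ASG is at min_size or the scheduler has
    pending jobs, so idletime is reset to 0; otherwise it grows by the loop period.
    """
    if has_pending_jobs or at_asg_min_size:
        idletime = 0
    else:
        idletime += loop_minutes

    # Dump at every iteration (not only when it increases) so that resets to 0
    # are also reflected in the stored copy.
    with open(IDLETIME_FILE, "w") as f:
        json.dump({"current_idletime": idletime}, f)
    return idletime
```
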
Currently, with Slurm and Torque, when a job is submitted and the requested
slots per node are larger than the number of available vcpus on a single
instance, the jobwatcher selects a larger number of nodes to fit the
requested CPUs. This operation/assumption is incorrect since the scheduler
will never schedule the job, because the CPU constraints can never be
satisfied. This causes the cluster to spin up new capacity that stays
idle forever, or until the job is canceled.
This commit fixes the above issue.

Signed-off-by: Francesco De Martino <[email protected]>
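
A rough sketch of the corrected node estimation described above, assuming pending jobs are represented as dicts with illustrative keys:

```python
def required_compute_nodes(pending_jobs, vcpus_per_instance):
    """Estimate how many compute nodes the pending jobs need.

    A job requesting more slots per node than one instance offers is skipped:
    the scheduler never relaxes the per-node CPU constraint, so adding nodes
    for it would only spin up capacity that stays idle.
    """
    nodes = 0
    for job in pending_jobs:
        if job["slots_per_node"] > vcpus_per_instance:
            continue  # cannot be satisfied by this instance type, do not scale for it
        nodes += job["nodes"]
    return nodes
```
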
…irements

When verifying whether there are pending jobs in the Slurm queue, we
now discard pending jobs with unsatisfiable CPU requirements.
This prevents the cluster from keeping nodes up and running for jobs
that will never be scheduled.

Signed-off-by: Francesco De Martino <[email protected]>
This is done to speed up cluster recovery in case of failing
nodes. jobwatcher will request an additional node if possible,
while nodewatcher takes care of terminating the faulty instance.

Signed-off-by: Francesco De Martino <[email protected]>
nodewatcher is now able to self-terminate the instance in case the node
is reported as down by the scheduler, or is not attached to the scheduler
at all. The logs are dumped to the shared /home/logs/compute directory
in order to facilitate future investigations.

Signed-off-by: Francesco De Martino <[email protected]>
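
A minimal sketch of the self-termination path described above, assuming the instance id is read from instance metadata and termination goes through the Auto Scaling API; whether desired capacity is decremented here is an assumption, not something this PR confirms:

```python
import boto3
from urllib.request import urlopen

def self_terminate_if_unhealthy(node_is_healthy, region):
    """Terminate this instance when the scheduler reports it down or unattached."""
    if node_is_healthy:
        return
    # Instance id of the node we are running on, from the EC2 metadata service.
    instance_id = urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    # Terminate through the ASG; decrementing desired capacity here is one
    # possible choice, letting jobwatcher request replacement capacity instead.
    boto3.client("autoscaling", region_name=region).terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True
    )
```
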
Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
demartinofra and others added 28 commits May 16, 2019 12:08
Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
Assuming that a node in an orphaned state is no longer in the ASG, it is
not included in the busy count, to prevent overscaling.

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Luca Carrogu <[email protected]>
Merge feature branch sge-enhancements into develop
… 2.6

In Python 2.6, CalledProcessError does not contain the output of the failed command

Signed-off-by: Francesco De Martino <[email protected]>
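
A sketch of the usual workaround: capture the output with Popen and attach it to the exception manually, since `CalledProcessError` only gained an `output` attribute (and `subprocess.check_output`) in Python 2.7; the function name is illustrative:

```python
import subprocess

def run_command(command):
    """Run a command and return its output, raising with the output on failure."""
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    output = process.communicate()[0]
    if process.returncode != 0:
        error = subprocess.CalledProcessError(process.returncode, command)
        error.output = output  # not populated by Python 2.6, so attach it manually
        raise error
    return output
```
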
Starting the compute node SGE daemon after the node is added to the queue causes
the scheduler to see the node as unknown for a few seconds. If jobwatcher wakes up
in this window, it marks the node as unreachable and spins up a new instance,
causing overscaling

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
This adds some logic to correctly recompute the number of required nodes

Signed-off-by: Francesco De Martino <[email protected]>
This adds some logic to correctly filter out jobs that cannot be executed due to cluster limits

Signed-off-by: Francesco De Martino <[email protected]>
…rque

Note: SGE and Slurm are not affected by this
Signed-off-by: Francesco De Martino <[email protected]>
Since the compute node cannot change its instance type, it makes no sense
to fetch this value dynamically in the nodewatcher loop.
This change reduces the calls to the DescribeStacks API, which allows a
relatively low TPS

Signed-off-by: Francesco De Martino <[email protected]>
Signed-off-by: Francesco De Martino <[email protected]>
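
A minimal sketch of the caching choice described in the DescribeStacks commit message above, the inverse of the per-iteration lookup jobwatcher uses; the module-level cache and the ComputeInstanceType parameter name are illustrative:

```python
import boto3

_CACHED_INSTANCE_TYPE = None  # resolved once at daemon start, reused afterwards

def compute_instance_type(stack_name, region):
    """Return the compute instance type, calling DescribeStacks at most once.

    nodewatcher runs on the compute node itself, whose instance type cannot
    change, so the value is cached after the first lookup instead of being
    re-fetched against a low-TPS API at every loop iteration.
    """
    global _CACHED_INSTANCE_TYPE
    if _CACHED_INSTANCE_TYPE is None:
        cfn = boto3.client("cloudformation", region_name=region)
        stack = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]
        params = {p["ParameterKey"]: p["ParameterValue"] for p in stack["Parameters"]}
        _CACHED_INSTANCE_TYPE = params["ComputeInstanceType"]  # name is an assumption
    return _CACHED_INSTANCE_TYPE
```
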
@lukeseawalker merged commit 9dbff99 into master on Jun 11, 2019