
(BOLT-459) Create reboot plan #178


Merged
merged 9 commits on Nov 19, 2018

Conversation

MikaelSmith
Contributor

Creates a reboot plan that queries last boot time, reboots targets, then waits until all have rebooted or we time out. Reverts the earlier work that added a wait function.
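
As a rough sketch of that flow (the plan, task, and parameter names below are illustrative assumptions, not necessarily the exact code in this PR):

# Illustrative sketch only; the real task and parameter names may differ.
plan reboot::example(
  TargetSpec $nodes,
  Integer    $reboot_delay = 1,
  String     $message      = 'Bolt is rebooting this node',
) {
  # Record each target's current boot time so we can tell later whether
  # it has actually gone down and come back up.
  $before = run_task('reboot::last_boot_time', $nodes)

  # Trigger the reboot; catch errors because the connection is expected
  # to drop while the task is still running.
  run_task('reboot', $nodes, timeout => $reboot_delay, message => $message,
           '_catch_errors' => true)

  # Finally, poll until each target reports a newer boot time than $before,
  # failing the plan if a timeout is exceeded (polling loop elided here).
}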

…t_function"

This reverts commit d4e8444, reversing changes made to 313b52f.
@MikaelSmith
Contributor Author

Still needs some testing.

@MikaelSmith force-pushed the BOLT-957 branch 2 times, most recently from fe00117 to ffd6370 on October 31, 2018 22:40
Contributor

@dylanratcliffe left a comment

Main concerns are detailed in the inline comments, but also: if a task fails and the plan handles that failure, will it still show as a failure in the PE GUI? If so, this will generate a lot of failures and be very ugly...

@MikaelSmith force-pushed the BOLT-957 branch 3 times, most recently from a0dc39c to a26a2ce on November 1, 2018 18:47
}

# Reboot; catch errors here because the connection may get cut out from underneath
$reboot_result = run_task('reboot', $nodes, timeout => $reboot_delay, message => $message)
Contributor Author

Having this be a single plan means that if most nodes successfully reboot but one fails, it's hard to recover. May need to split waiting for the reboot into a separate plan. Should we catch errors, wait for reboot on the successful nodes, then fail?

Contributor Author

@reidmv any input on this question?

Contributor

Catch errors -> wait for all nodes to finish -> fail seems like a logical event flow to me, but I don't have an exact use case in mind.
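
In plan code, that flow would look roughly like this (a sketch only, assuming Bolt's ResultSet API and the `wait_until_available` function this PR later adopts; names and option values are illustrative, not the PR's exact diff):

# Sketch of catch errors -> wait for successful nodes -> fail.
$reboot_result = run_task('reboot', $nodes, message => $message, '_catch_errors' => true)

# Only wait for the targets that accepted the reboot command.
wait_until_available($reboot_result.ok_set.targets, wait_time => $reconnect_timeout)

# Once the surviving targets are back, surface the original failures.
unless $reboot_result.ok {
  fail_plan("Reboot failed on: ${reboot_result.error_set.names}")
}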

@MikaelSmith
Contributor Author

In PE, it should look like this [screenshot: 2018-11-01, 1:26:21 PM], and the plan will be recorded as Complete.

@MikaelSmith force-pushed the BOLT-957 branch 9 times, most recently from e35d86c to 7021a80 on November 2, 2018 00:55
@MikaelSmith force-pushed the BOLT-957 branch 4 times, most recently from 92d3f69 to 95146e2 on November 2, 2018 22:14
@MikaelSmith
Contributor Author

I think I've added the minimum required testing. Probably won't have time to expand on it for a few weeks.

@dylanratcliffe dismissed their stale review on November 8, 2018 07:21

Out of date

@MikaelSmith closed this Nov 8, 2018
@MikaelSmith reopened this Nov 8, 2018
@lucywyman
Contributor

2 approvals: is this ready to merge?

@MikaelSmith
Contributor Author

Still working on getting acceptance tests to pass in CI.

# Ensure Bolt logger is initialized so 'notice' works when running the reboot plan.
# Needs to be moved to BoltSpec.
require 'bolt/logger'
Bolt::Logger.initialize_logging
Contributor Author

This stuff is really messy, and doesn't seem to work consistently. @adreyer any idea what's going on with logging in BoltSpec?

Contributor Author

I reworked this somewhat. There's still a lot of awkward setup we should probably roll into a BoltSpec helper.

@MikaelSmith force-pushed the BOLT-957 branch 4 times, most recently from 011c59c to 07d7208 on November 9, 2018 00:32
@MikaelSmith
Contributor Author

Looks like I got a clean CI run yesterday evening.

@MikaelSmith
Contributor Author

Oh, finally figured out what's going on with tests. Some VMs don't like to be rebooted. The reboot acceptance tests themselves catch the reboot command and kill it, so no actual reboot happens.

@MikaelSmith force-pushed the BOLT-957 branch 2 times, most recently from a4850d6 to b923ca4 on November 14, 2018 22:55
Add bash and powershell implementations of the reboot task so it can be run on systems without Ruby and Puppet installed.

Adds a plan that reboots targets, then waits until they're available again.

Adds rspec unit tests for the reboot plan based on BoltSpec::Plans. Tests are constrained by needing to kill the shutdown command so we don't actually restart VMs.

Update Bolt test dependency to 1.3 to eliminate kludges that work around issues in BoltSpec.

Use Bolt 1.3's new `wait_until_available` function to reduce the number of task runs needed to determine whether targets have rebooted.
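
For reference, a hedged sketch of how that `wait_until_available` call might look in the plan (the option values are illustrative, not the PR's exact settings):

# Wait for targets to accept connections again after the reboot, instead of
# polling them with repeated task runs. Values here are illustrative.
wait_until_available($nodes,
  description    => 'wait for nodes to finish rebooting',
  wait_time      => 300,  # total seconds to keep retrying
  retry_interval => 1,    # seconds between connection attempts
)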