(BOLT-459) Create reboot plan #178
Changes from all commits
b04eb99
6d73a90
bba4efb
63794fb
134cb6e
30a61ff
a578da7
7481dd2
c9bf10d
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,3 @@ | ||
fixtures: | ||
symlinks: | ||
reboot: "#{source_dir}" | ||
boltlib: "#{source_dir}/spec/fixtures/modules/bolt/bolt-modules/boltlib" | ||
repositories: | ||
bolt: https://github.com/puppetlabs/bolt.git |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,19 +0,0 @@ | ||
# This configuration was generated by | ||
# `rubocop --auto-gen-config` | ||
# on 2018-10-08 10:49:35 +0800 using RuboCop version 0.49.1. | ||
# The point is for the user to remove these configuration records | ||
# one by one as the offenses are removed from the code base. | ||
# Note that changes in the inspected code, or installation of new | ||
# versions of RuboCop, may require this file to be generated again. | ||
|
||
# Offense count: 6 | ||
RSpec/AnyInstance: | ||
Exclude: | ||
- 'spec/functions/wait_spec.rb' | ||
|
||
# Offense count: 1 | ||
# Configuration parameters: SkipBlocks, EnforcedStyle, SupportedStyles. | ||
# SupportedStyles: described_class, explicit | ||
RSpec/DescribedClass: | ||
Exclude: | ||
- 'spec/functions/wait/bolt/executor_spec.rb' | ||
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Sleeps for specified number of seconds. | ||
Puppet::Functions.create_function(:'reboot::sleep') do | ||
Is the
I don't believe Puppet has a builtin |
||
# @param period Time to sleep (in seconds) | ||
dispatch :sleeper do | ||
required_param 'Integer', :period | ||
end | ||
|
||
def sleeper(period) | ||
sleep(period) | ||
end | ||
end |
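As a rough illustration of what the `dispatch :sleeper` declaration above amounts to, here is a minimal plain-Ruby sketch. It assumes only that the `Integer` parameter type rejects non-integer arguments before `sleeper` runs. The name `checked_sleep` is hypothetical and the sketch does not use `Puppet::Functions`.

```ruby
# Hypothetical plain-Ruby stand-in for reboot::sleep; not the module's code.
# It mimics the Integer check that the dispatch declaration performs,
# then delegates to Kernel#sleep as the original sleeper method does.
def checked_sleep(period)
  unless period.is_a?(Integer)
    raise ArgumentError, "expected an Integer number of seconds, got #{period.class}"
  end

  sleep(period)
end
```

Passing a string or a float raises before any sleeping happens, which is the behaviour the plan relies on when it calls `reboot::sleep($wait+$disconnect_wait)` with integer arguments.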
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
# Reboots targets and waits for them to be available again. | ||
# | ||
# @param nodes Targets to reboot. | ||
# @param message Message to log with the reboot (for platforms that support it). | ||
# @param reboot_delay How long (in seconds) to wait before rebooting. Defaults to 1. | ||
# @param disconnect_wait How long (in seconds) to wait before checking whether the server has rebooted. Defaults to 10. | ||
# @param reconnect_timeout How long (in seconds) to attempt to reconnect before giving up. Defaults to 180. | ||
# @param retry_interval How long (in seconds) to wait between retries. Defaults to 1. | ||
plan reboot ( | ||
TargetSpec $nodes, | ||
Optional[String] $message = undef, | ||
Integer[1] $reboot_delay = 1, | ||
Integer[0] $disconnect_wait = 10, | ||
It looks like the actual default is 10 when it should be 1?
Documentation was wrong, I updated it to 10. |
||
Integer[0] $reconnect_timeout = 180, | ||
Integer[0] $retry_interval = 1, | ||
) { | ||
$targets = get_targets($nodes) | ||
|
||
# Get last boot time | ||
$begin_boot_time_results = without_default_logging() || { | ||
run_task('reboot::last_boot_time', $targets) | ||
} | ||
|
||
# Reboot; catch errors here because the connection may get cut out from underneath | ||
$reboot_result = run_task('reboot', $nodes, timeout => $reboot_delay, message => $message) | ||
Having this be a single plan means that if most nodes successfully reboot but one fails, it's hard to recover. May need to split waiting for the reboot into a separate plan. Should we catch errors, wait for reboot on the successful nodes, then fail?
@reidmv any input on this question?
Catch errors -> wait for all nodes to finish -> fail seems like a logical event flow to me, but I don't have an exact use case in mind. |
||
|
||
# Wait long enough for all targets to trigger reboot, plus disconnect_wait to allow for shutdown time. | ||
$timeouts = $reboot_result.map |$result| { $result['timeout'] } | ||
$wait = max($timeouts) | ||
reboot::sleep($wait+$disconnect_wait) | ||
MikaelSmith marked this conversation as resolved.
|
||
|
||
$start_time = Timestamp() | ||
# Wait for reboot in a loop | ||
MikaelSmith marked this conversation as resolved.
|
||
## Check if we can connect; if we can retrieve last boot time. | ||
## Mark finished for targets with a new last boot time. | ||
## If we still have targets check for timeout, sleep if not done. | ||
$failed = without_default_logging() || { | ||
$reconnect_timeout.reduce($targets) |$down, $_| { | ||
if $down.empty() { | ||
break() | ||
} | ||
|
||
$plural = if $down.size() > 1 { 's' } | ||
notice("Waiting: ${$down.size()} target${plural} rebooting") | ||
$current_boot_time_results = run_task('reboot::last_boot_time', $down, _catch_errors => true) | ||
|
||
# Compare boot times | ||
$failed_results = $current_boot_time_results.filter |$current_boot_time_res| { | ||
# If this one errored, need to check it again | ||
if !$current_boot_time_res.ok() { | ||
true | ||
} | ||
else { | ||
# If this succeeded, then we have a boot time, compare it against the begin_boot_time | ||
$target_name = $current_boot_time_res.target().name() | ||
$begin_boot_time_res = $begin_boot_time_results.find($target_name) | ||
|
||
# If the boot times are the same, then we need to check it again | ||
$current_boot_time_res.value() == $begin_boot_time_res.value() | ||
} | ||
} | ||
|
||
# $failed_results is an array of results, turn it into a ResultSet so we can | ||
# extract the targets from it | ||
$failed_targets = ResultSet($failed_results).targets() | ||
dylanratcliffe marked this conversation as resolved.
|
||
|
||
# Check for timeout if we still have failed targets | ||
if !$failed_targets.empty() { | ||
$elapsed_time_sec = Integer(Timestamp() - $start_time) | ||
if $elapsed_time_sec >= $reconnect_timeout { | ||
fail_plan( | ||
"Hosts failed to come up after reboot within ${reconnect_timeout} seconds: ${failed_targets}", | ||
'bolt/reboot-timeout', | ||
{ | ||
'failed_targets' => $failed_targets, | ||
} | ||
) | ||
} | ||
|
||
# sleep for a small time before trying again | ||
reboot::sleep($retry_interval) | ||
|
||
# wait for all targets to be available again | ||
$remaining_time = $reconnect_timeout - $elapsed_time_sec | ||
wait_until_available($failed_targets, wait_time => $remaining_time, retry_interval => $retry_interval) | ||
} | ||
|
||
$failed_targets | ||
} | ||
} | ||
|
||
if !$failed.empty() { | ||
fail_plan( | ||
"Failed to reboot ${failed}", | ||
'bolt/reboot-failed', | ||
{ | ||
'failed_targets' => $failed, | ||
}, | ||
) | ||
} | ||
} |
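The wait loop in the plan above can be hard to follow in diff form, so here is a simplified Ruby sketch of the same idea: keep re-checking the targets whose boot time is unchanged or unreadable, and fail once `reconnect_timeout` has elapsed. This is an illustration under stated assumptions, not the plan's implementation. The names `wait_for_reboot`, `probe`, `clock`, and `sleeper` are hypothetical, `probe` stands in for the `reboot::last_boot_time` task, and the `wait_until_available` step is omitted.

```ruby
# Hypothetical stand-in for the plan's wait loop; not part of the module.
# probe:   assumed callable returning a target's last boot time, raising
#          while the target is unreachable (stands in for the task run).
# clock:   assumed callable returning elapsed seconds as a number.
# sleeper: assumed callable that pauses between retries.
def wait_for_reboot(targets, begin_boot_times, probe,
                    reconnect_timeout: 180, retry_interval: 1,
                    clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                    sleeper: ->(seconds) { sleep(seconds) })
  start = clock.call
  pending = targets.dup

  until pending.empty?
    # Keep a target if the probe errored or its boot time has not changed.
    pending = pending.select do |target|
      begin
        probe.call(target) == begin_boot_times.fetch(target)
      rescue StandardError
        true
      end
    end
    break if pending.empty?

    # Give up once the reconnect window is exhausted.
    if clock.call - start >= reconnect_timeout
      raise "Hosts failed to come up after reboot within #{reconnect_timeout} seconds: #{pending.join(', ')}"
    end

    sleeper.call(retry_interval)
  end

  pending
end
```

Injecting `clock` and `sleeper` is a choice made for this sketch so the loop can be exercised without real delays. The plan itself uses `Timestamp()` and `reboot::sleep` directly.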