(BOLT-459) Create reboot plan #178

Merged: 9 commits, Nov 19, 2018
3 changes: 0 additions & 3 deletions .fixtures.yml
@@ -1,6 +1,3 @@
fixtures:
  symlinks:
    reboot: "#{source_dir}"
    boltlib: "#{source_dir}/spec/fixtures/modules/bolt/bolt-modules/boltlib"
  repositories:
    bolt: https://github.com/puppetlabs/bolt.git
19 changes: 0 additions & 19 deletions .rubocop_todo.yml
@@ -1,19 +0,0 @@
# This configuration was generated by
# `rubocop --auto-gen-config`
# on 2018-10-08 10:49:35 +0800 using RuboCop version 0.49.1.
# The point is for the user to remove these configuration records
# one by one as the offenses are removed from the code base.
# Note that changes in the inspected code, or installation of new
# versions of RuboCop, may require this file to be generated again.

# Offense count: 6
RSpec/AnyInstance:
  Exclude:
    - 'spec/functions/wait_spec.rb'

# Offense count: 1
# Configuration parameters: SkipBlocks, EnforcedStyle, SupportedStyles.
# SupportedStyles: described_class, explicit
RSpec/DescribedClass:
  Exclude:
    - 'spec/functions/wait/bolt/executor_spec.rb'
13 changes: 0 additions & 13 deletions .sync.yml
@@ -14,10 +14,6 @@ Gemfile:
- gem: master_manipulator
- gem: puppet-blacksmith
version: '~> 3.4'
optional:
':development':
- gem: 'bolt'
condition: "Gem::Version.new(RUBY_VERSION.dup) >= Gem::Version.new('2.3.0')"

appveyor.yml:
matrix_extras:
@@ -35,14 +31,5 @@ appveyor.yml:
.gitlab-ci.yml:
delete: true

# Due to https://tickets.puppetlabs.com/browse/PDK-1199 we need to stop the symlink checks in Travis CI
# The symlink checks are still done in Appveyor so there's no loss in coverage
.travis.yml:
includes:
- env: CHECK="syntax lint metadata_lint check:git_ignore check:dot_underscore check:test_file rubocop"
- env: CHECK=parallel_spec
- env: PUPPET_GEM_VERSION="~> 4.0" CHECK=parallel_spec
rvm: 2.1.9

spec/default_facts.yml:
unmanaged: true
5 changes: 3 additions & 2 deletions .travis.yml
@@ -20,11 +20,12 @@ env:
matrix:
fast_finish: true
include:
# PDK Update doesn't replace this line, but instead adds a new one.
-
env: CHECK="syntax lint metadata_lint check:git_ignore check:dot_underscore check:test_file rubocop"
env: CHECK="syntax lint metadata_lint check:symlinks check:git_ignore check:dot_underscore check:test_file rubocop"
-
env: CHECK=parallel_spec
-
env: PUPPET_GEM_VERSION="~> 6.0" GEM_BOLT=true CHECK=parallel_spec
-
env: PUPPET_GEM_VERSION="~> 4.0" CHECK=parallel_spec
rvm: 2.1.9
8 changes: 7 additions & 1 deletion Gemfile
@@ -33,14 +33,20 @@ group :development do
  gem "puppet-module-posix-dev-r#{minor_version}", require: false, platforms: [:ruby]
  gem "puppet-module-win-default-r#{minor_version}", require: false, platforms: [:mswin, :mingw, :x64_mingw]
  gem "puppet-module-win-dev-r#{minor_version}", require: false, platforms: [:mswin, :mingw, :x64_mingw]
  gem "bolt", require: false if Gem::Version.new(RUBY_VERSION.dup) >= Gem::Version.new('2.3.0')
  if ENV['GEM_BOLT']
    gem 'bolt', '~> 1.3', require: false
  end
end
group :system_tests do
  gem "puppet-module-posix-system-r#{minor_version}", require: false, platforms: [:ruby]
  gem "puppet-module-win-system-r#{minor_version}", require: false, platforms: [:mswin, :mingw, :x64_mingw]
  gem "beaker-testmode_switcher", '~> 0.4', require: false
  gem "master_manipulator", require: false
  gem "puppet-blacksmith", '~> 3.4', require: false
  if ENV['GEM_BOLT']
    gem 'bolt', '~> 1.3', require: false
    gem 'beaker-task_helper', '~> 1.5', require: false
  end
end
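
Note on the `if ENV['GEM_BOLT']` guard: every Ruby string is truthy, so setting `GEM_BOLT=false` (or even an empty value) still pulls in the Bolt gems. A minimal sketch of a stricter check follows, assuming a hypothetical helper named `env_enabled?` that is not part of this PR:

```ruby
# Hypothetical helper (not part of this PR): only explicit "on" values enable a feature.
TRUTHY_VALUES = %w[1 true yes on].freeze

def env_enabled?(name, env = ENV)
  # env[name] is nil when the variable is unset; nil.to_s is "", which is not in the list.
  TRUTHY_VALUES.include?(env[name].to_s.strip.downcase)
end

# A plain Hash stands in for ENV so the example does not depend on the real environment.
puts env_enabled?('GEM_BOLT', { 'GEM_BOLT' => 'true' })   # true
puts env_enabled?('GEM_BOLT', { 'GEM_BOLT' => 'false' })  # false
puts env_enabled?('GEM_BOLT', {})                         # false
```

Whether the stricter behavior is wanted here is a judgment call; the guard in the diff is enough for the CI matrix, which only ever sets `GEM_BOLT=true`.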

puppet_version = ENV['PUPPET_GEM_VERSION']
36 changes: 22 additions & 14 deletions README.md
@@ -158,9 +158,9 @@ This can take a single reason or an array of reasons.

See the [Reboot when certain conditions are met](#reboot-when-certain-conditions-are-met) section for reasons why you might reboot.

### Function: `reboot::wait`
### Plan: `reboot`

This function is intended to be used as part of a [plan](https://puppet.com/docs/bolt/latest/writing_plans.html) and allows Bolt to wait for a server to reboot before continuing. This function has no use in normal Puppet code (outside of plans) and will not work.
This plan is intended to be used as part of other [plans](https://puppet.com/docs/bolt/latest/writing_plans.html) and allows Bolt to wait for a server to reboot before continuing.

Here is an example of using this module to reboot servers, wait for them to come back, then check the status of a service:

@@ -172,12 +172,8 @@ plan myapp::patch (
  # Upgrade the application
  run_task('myapp::upgrade', $servers, { 'version' => $version })

  # Reboot the servers
  run_task('reboot', $servers)

  # Wait for them to come back, this app is slow to shut down so give them
  # 5 min to shut down
  reboot::wait($servers, { 'disconnect_wait' => 300 })
  # Reboot the servers. This app is slow to shut down so give them 5 minutes to reboot.
  run_plan('reboot', $servers, reconnect_timeout => 300)

  # Check the status of the service
  return run_task('service', $nodes, {
@@ -189,17 +185,29 @@ plan myapp::patch (

#### Parameters

##### `targets`
##### `nodes`

A `TargetSpec` object containing all nodes to wait for.

##### `params`
##### `message`

An optional message to log when rebooting.

##### `reboot_delay`

How long (in seconds) to wait before shutting down. Defaults to 1.

##### `disconnect_wait`

How long (in seconds) to wait before checking whether the server has rebooted. Defaults to 10.

##### `reconnect_timeout`

How long (in seconds) to attempt to reconnect before giving up. Defaults to 180.

A `Hash` of optional timing parameters, these should be specified as an `Integer` representing seconds. Available parameters are:
##### `retry_interval`

* `disconnect_wait`
* `reconnect_wait`
* `retry_interval`
How long (in seconds) to wait between retries. Defaults to 1.
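
To see how these timings combine, here is a rough sketch in Ruby. It assumes the plan sleeps for `reboot_delay + disconnect_wait` and then polls for up to `reconnect_timeout` seconds, as `plans/init.pp` in this PR does. `worst_case_wait` is a hypothetical helper name, and its defaults are taken from the plan signature.

```ruby
# Approximate upper bound (in seconds) on how long the plan blocks before giving up.
# It ignores the time the tasks themselves take to run, so treat it as an estimate.
def worst_case_wait(reboot_delay: 1, disconnect_wait: 10, reconnect_timeout: 180)
  reboot_delay + disconnect_wait + reconnect_timeout
end

puts worst_case_wait                          # 191
puts worst_case_wait(reconnect_timeout: 300)  # 311
```

`retry_interval` does not appear in the sum because it only controls how often the plan polls inside the `reconnect_timeout` window.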

## Limitations

2 changes: 2 additions & 0 deletions Rakefile
@@ -3,4 +3,6 @@ require 'puppet-syntax/tasks/puppet-syntax'
require 'puppet_blacksmith/rake_tasks' if Bundler.rubygems.find_name('puppet-blacksmith').any?

PuppetLint.configuration.send('disable_relative')
PuppetSyntax.exclude_paths << %w[plans/*]

task :beaker => :spec_prep
11 changes: 11 additions & 0 deletions lib/puppet/functions/reboot/sleep.rb
@@ -0,0 +1,11 @@
# Sleeps for specified number of seconds.
Puppet::Functions.create_function(:'reboot::sleep') do
Contributor:

Is the sleep() function not available from plans? Only curious, I trust there's good reason for this!

Contributor Author:

I don't believe Puppet has a builtin sleep function. It's not a common need in catalogs.

  # @param period Time to sleep (in seconds)
  dispatch :sleeper do
    required_param 'Integer', :period
  end

  def sleeper(period)
    sleep(period)
  end
end
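
The dispatch above accepts any `Integer`, including negative values, and Ruby's `Kernel#sleep` raises `ArgumentError` for a negative duration. A minimal sketch of a guarded variant in plain Ruby follows, assuming a hypothetical `safe_sleep` helper that is not part of this PR:

```ruby
# Hypothetical helper (not part of this PR): validate the period before sleeping.
def safe_sleep(period)
  unless period.is_a?(Integer) && period >= 0
    raise ArgumentError, "period must be a non-negative Integer, got #{period.inspect}"
  end
  sleep(period) # Kernel#sleep returns the number of seconds slept, rounded
end

safe_sleep(0) # returns immediately
begin
  safe_sleep(-5)
rescue ArgumentError => e
  puts e.message # period must be a non-negative Integer, got -5
end
```

Inside the Puppet function itself, the same constraint could likely be expressed by declaring the parameter type as `Integer[0]` instead of `Integer`, which is how the plan in this PR constrains its own timing parameters.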
72 changes: 0 additions & 72 deletions lib/puppet/functions/reboot/wait.rb

This file was deleted.

101 changes: 101 additions & 0 deletions plans/init.pp
@@ -0,0 +1,101 @@
# Reboots targets and waits for them to be available again.
#
# @param nodes Targets to reboot.
# @param message Message to log with the reboot (for platforms that support it).
# @param reboot_delay How long (in seconds) to wait before rebooting. Defaults to 1.
# @param disconnect_wait How long (in seconds) to wait before checking whether the server has rebooted. Defaults to 10.
# @param reconnect_timeout How long (in seconds) to attempt to reconnect before giving up. Defaults to 180.
# @param retry_interval How long (in seconds) to wait between retries. Defaults to 1.
plan reboot (
  TargetSpec $nodes,
  Optional[String] $message = undef,
  Integer[1] $reboot_delay = 1,
  Integer[0] $disconnect_wait = 10,
Contributor:

It looks like the actual default is 10 when it should be 1?

Contributor Author:

Documentation was wrong, I updated it to 10.

  Integer[0] $reconnect_timeout = 180,
  Integer[0] $retry_interval = 1,
) {
  $targets = get_targets($nodes)

  # Get last boot time
  $begin_boot_time_results = without_default_logging() || {
    run_task('reboot::last_boot_time', $targets)
  }

  # Reboot; catch errors here because the connection may get cut out from underneath
  $reboot_result = run_task('reboot', $nodes, timeout => $reboot_delay, message => $message)
Contributor Author:

Having this be a single plan means that if most nodes successfully reboot but one fails, it's hard to recover. May need to split waiting for the reboot into a separate plan. Should we catch errors, wait for reboot on the successful nodes, then fail?

Contributor Author:

@reidmv any input on this question?

Contributor:

Catch errors -> wait for all nodes to finish -> fail seems like a logical event flow to me, but I don't have an exact use case in mind.
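
The event flow suggested in this thread can be sketched in plain Ruby. This is an illustration only, not the plan's implementation: `reboot_all`, `RebootFailed`, and the `reboot` and `wait_for` callables are hypothetical stand-ins, not Bolt APIs.

```ruby
# Hypothetical error type carrying the targets whose reboot could not be triggered.
class RebootFailed < StandardError
  attr_reader :failed_targets

  def initialize(failed_targets)
    @failed_targets = failed_targets
    super("Failed to reboot #{failed_targets.join(', ')}")
  end
end

# `reboot` and `wait_for` are caller-supplied callables standing in for the Bolt steps.
def reboot_all(targets, reboot:, wait_for:)
  # 1. Catch errors per target instead of aborting on the first failure.
  started, failed = targets.partition do |target|
    begin
      reboot.call(target)
      true
    rescue StandardError
      false
    end
  end

  # 2. Wait for every target whose reboot was actually triggered.
  wait_for.call(started) unless started.empty?

  # 3. Only now report the targets that could not be rebooted.
  raise RebootFailed.new(failed) unless failed.empty?

  started
end

flaky = ->(target) { raise 'ssh error' if target == 'db2' }
waited = []
begin
  reboot_all(%w[web1 db2 web3], reboot: flaky, wait_for: ->(ts) { waited.concat(ts) })
rescue RebootFailed => e
  puts "waited for #{waited.inspect}, failed: #{e.failed_targets.inspect}"
  # waited for ["web1", "web3"], failed: ["db2"]
end
```

The point of the ordering is that one unreachable node does not leave the healthy nodes unverified: they are still waited on before the failure is raised.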


  # Wait long enough for all targets to trigger reboot, plus disconnect_wait to allow for shutdown time.
  $timeouts = $reboot_result.map |$result| { $result['timeout'] }
  $wait = max($timeouts)
  reboot::sleep($wait+$disconnect_wait)

  $start_time = Timestamp()
  # Wait for reboot in a loop
  ## Check if we can connect; if we can, retrieve last boot time.
  ## Mark finished for targets with a new last boot time.
  ## If we still have targets, check for timeout; sleep if not done.
  $failed = without_default_logging() || {
    $reconnect_timeout.reduce($targets) |$down, $_| {
      if $down.empty() {
        break()
      }

      $plural = if $down.size() > 1 { 's' }
      notice("Waiting: ${$down.size()} target${plural} rebooting")
      $current_boot_time_results = run_task('reboot::last_boot_time', $down, _catch_errors => true)

      # Compare boot times
      $failed_results = $current_boot_time_results.filter |$current_boot_time_res| {
        # If this one errored, need to check it again
        if !$current_boot_time_res.ok() {
          true
        }
        else {
          # If this succeeded, then we have a boot time, compare it against the begin_boot_time
          $target_name = $current_boot_time_res.target().name()
          $begin_boot_time_res = $begin_boot_time_results.find($target_name)

          # If the boot times are the same, then we need to check it again
          $current_boot_time_res.value() == $begin_boot_time_res.value()
        }
      }

      # $failed_results is an array of results, turn it into a ResultSet so we can
      # extract the targets from it
      $failed_targets = ResultSet($failed_results).targets()

      # Check for timeout if we still have failed targets
      if !$failed_targets.empty() {
        $elapsed_time_sec = Integer(Timestamp() - $start_time)
        if $elapsed_time_sec >= $reconnect_timeout {
          fail_plan(
            "Hosts failed to come up after reboot within ${reconnect_timeout} seconds: ${failed_targets}",
            'bolt/reboot-timeout',
            {
              'failed_targets' => $failed_targets,
            }
          )
        }

        # sleep for a small time before trying again
        reboot::sleep($retry_interval)

        # wait for all targets to be available again
        $remaining_time = $reconnect_timeout - $elapsed_time_sec
        wait_until_available($failed_targets, wait_time => $remaining_time, retry_interval => $retry_interval)
      }

      $failed_targets
    }
  }

  if !$failed.empty() {
    fail_plan(
      "Failed to reboot ${failed}",
      'bolt/reboot-failed',
      {
        'failed_targets' => $failed,
      },
    )
  }
}
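
The wait loop in this plan (poll the last boot time, drop targets whose boot time changed, give up after `reconnect_timeout`) can be sketched in plain Ruby. This is a simplified illustration under stated assumptions, not a port of the plan: `wait_for_reboot` and the `boot_time` callable are hypothetical, the clock and sleep are injectable so the loop runs without real waiting, and it returns the stuck targets where the plan calls `fail_plan`.

```ruby
# `boot_time` is a hypothetical callable: it returns a boot timestamp for a target,
# or raises if the target is unreachable. `before` maps target => pre-reboot boot time.
def wait_for_reboot(targets, before, boot_time:, reconnect_timeout: 180, retry_interval: 1,
                    clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                    sleeper: ->(seconds) { sleep(seconds) })
  start = clock.call
  down = targets.dup
  loop do
    # A target is still down if it is unreachable or still reports the old boot time.
    down = down.select do |target|
      begin
        boot_time.call(target) == before.fetch(target)
      rescue StandardError
        true
      end
    end
    return [] if down.empty?
    return down if clock.call - start >= reconnect_timeout

    sleeper.call(retry_interval)
  end
end

before = { 'web1' => 100, 'web2' => 200 }
calls = Hash.new(0)
fake_boot_time = lambda do |target|
  calls[target] += 1
  raise 'unreachable' if calls[target] == 1 # first poll: still rebooting
  target == 'web1' ? 150 : 200              # web2 never gets a new boot time
end
now = 0
stuck = wait_for_reboot(before.keys, before, boot_time: fake_boot_time,
                        reconnect_timeout: 3,
                        clock: -> { now }, sleeper: ->(seconds) { now += seconds })
puts stuck.inspect # ["web2"]
```

Comparing boot times rather than only checking connectivity matters because a target that has not yet started shutting down is still reachable and would otherwise look as if it had already come back.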