add custom problem detector plugin #145
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| { | ||
| "plugin": "custom", | ||
| "pluginConfig": { | ||
| "invoke_interval": "30s", | ||
| "timeout": "5s", | ||
|
Review comment: I feel like we could have global default
Review comment: For now, only
Review comment: I'm fine with this for now, but please add a TODO in the code. We should have a per-rule interval.
Review comment: Will do. |
||
| "max_output_length": 80, | ||
| "concurrency": 3 | ||
| }, | ||
| "source": "ntp-custom-plugin-monitor", | ||
| "conditions": [ | ||
| { | ||
| "type": "NTPProblem", | ||
| "reason": "NTPIsUp", | ||
| "message": "ntp service is up" | ||
| } | ||
| ], | ||
| "rules": [ | ||
| { | ||
| "type": "temporary", | ||
| "reason": "NTPIsDown", | ||
| "path": "./config/plugin/check_ntp.sh", | ||
| "timeout": "3s" | ||
|
Review comment: This is the per-plugin timeout config. :) |
||
| }, | ||
| { | ||
|
Review comment: Do we run the script twice?
Review comment: Yes. The reason for doing so is that this gives users more control over how events and conditions are emitted. When users want both an event and a condition for a reason, they should declare this explicitly.
Review comment: I think we should generate an event for a condition change. With that, you shouldn't need 2 rules here.
Review comment: Sometimes, for the same condition, the status is not changed but the reason is. Without an event, people will not even notice that.
Review comment: According to the log condition, when the reason changes, the timestamp and reason will be updated in the node status. I am fine with emitting events when a condition changes, but since events already put stress on etcd, maybe we should emit as few events as we can. :)
Review comment: Condition change is quite rare. :) |
||
| "type": "permanent", | ||
| "condition": "NTPProblem", | ||
| "reason": "NTPIsDown", | ||
| "path": "./config/plugin/check_ntp.sh", | ||
| "timeout": "3s" | ||
| } | ||
| ] | ||
| } | ||
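The config above sets a global `timeout` under `pluginConfig` and a per-rule `timeout` on each rule. The following is a minimal sketch of how such a file could be decoded in Go. The type names, the `parseConfig` and `effectiveTimeout` helpers, and the fallback from a rule's timeout to the global one are assumptions made for illustration; they are not taken from the PR's `cpmtypes` package, which is not shown here. The sample JSON is a trimmed copy of the config above, with the second rule's `timeout` removed to exercise the fallback.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"
)

// pluginConfig, rule and config are hypothetical types for this sketch only.
type pluginConfig struct {
	InvokeInterval  string `json:"invoke_interval"`
	Timeout         string `json:"timeout"`
	MaxOutputLength int    `json:"max_output_length"`
	Concurrency     int    `json:"concurrency"`
}

type rule struct {
	Type      string `json:"type"`
	Condition string `json:"condition,omitempty"`
	Reason    string `json:"reason"`
	Path      string `json:"path"`
	Timeout   string `json:"timeout,omitempty"`
}

type config struct {
	Plugin       string       `json:"plugin"`
	PluginConfig pluginConfig `json:"pluginConfig"`
	Source       string       `json:"source"`
	Rules        []rule       `json:"rules"`
}

// sample is a trimmed copy of the example config; the second rule
// deliberately omits "timeout".
const sample = `{
  "plugin": "custom",
  "pluginConfig": {"invoke_interval": "30s", "timeout": "5s", "max_output_length": 80, "concurrency": 3},
  "source": "ntp-custom-plugin-monitor",
  "rules": [
    {"type": "temporary", "reason": "NTPIsDown", "path": "./config/plugin/check_ntp.sh", "timeout": "3s"},
    {"type": "permanent", "condition": "NTPProblem", "reason": "NTPIsDown", "path": "./config/plugin/check_ntp.sh"}
  ]
}`

// parseConfig decodes the JSON document into the sketch's config type.
func parseConfig(data []byte) (config, error) {
	var c config
	err := json.Unmarshal(data, &c)
	return c, err
}

// effectiveTimeout returns the rule's own timeout, or the global timeout
// when the rule does not set one (an assumed fallback policy).
func effectiveTimeout(c config, r rule) (time.Duration, error) {
	s := r.Timeout
	if s == "" {
		s = c.PluginConfig.Timeout
	}
	return time.ParseDuration(s)
}

func main() {
	c, err := parseConfig([]byte(sample))
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}
	for _, r := range c.Rules {
		d, err := effectiveTimeout(c, r)
		if err != nil {
			log.Fatalf("rule %q: %v", r.Reason, err)
		}
		fmt.Printf("%s rule %s: timeout %s\n", r.Type, r.Reason, d)
	}
}
```

Keeping durations as strings and converting them with `time.ParseDuration` is one simple option; a custom `UnmarshalJSON` on a duration type would work as well.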
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| #!/bin/bash | ||
|
|
||
| # NOTE: THIS NTP SERVICE CHECK SCRIPT ASSUMES THAT NTP SERVICE IS RUNNING UNDER SYSTEMD. | ||
| # THIS IS JUST AN EXAMPLE. YOU CAN WRITE YOUR OWN NODE PROBLEM PLUGIN ON DEMAND. | ||
|
|
||
|
Review comment: Check whether this node is using systemd? If not, return
Review comment: Good catch. Will do.
Review comment: Actually what we use in our production env has those checks. :) |
||
| OK=0 | ||
| NONOK=1 | ||
| UNKNOWN=2 | ||
|
|
||
| which systemctl >/dev/null | ||
| if [ $? -ne 0 ]; then | ||
| echo "Systemd is not supported" | ||
| exit $UNKNOWN | ||
| fi | ||
|
|
||
| systemctl status ntp.service | grep 'Active:' | grep -q running | ||
| if [ $? -ne 0 ]; then | ||
| echo "NTP service is not running" | ||
| exit $NONOK | ||
| fi | ||
|
|
||
| echo "NTP service is running" | ||
| exit $OK | ||
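The script above reports its result through the exit code (0 for OK, 1 for non-OK, 2 for unknown) and a one-line message on standard output. The following is a minimal sketch of how a caller could run such a script with a timeout and interpret the result. All names here (`exitStatus`, `statusFromExitCode`, `truncate`, `runCheck`) are hypothetical, and treating every exit code other than 0 or 1 as unknown is an assumption, not something the PR states.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// exitStatus mirrors the OK / NONOK / UNKNOWN values used by the script.
type exitStatus int

const (
	ok exitStatus = iota
	nonOK
	unknown
)

// statusFromExitCode maps a process exit code to a status. Any code other
// than 0 or 1 is treated as unknown (an assumption of this sketch).
func statusFromExitCode(code int) exitStatus {
	switch code {
	case 0:
		return ok
	case 1:
		return nonOK
	default:
		return unknown
	}
}

// truncate trims surrounding whitespace and caps the message at max bytes.
// It cuts on bytes, so a multi-byte character may be split.
func truncate(s string, max int) string {
	s = strings.TrimSpace(s)
	if len(s) > max {
		return s[:max]
	}
	return s
}

// runCheck executes the script at path, killing it after timeout, and
// returns the mapped status plus the (truncated) standard output.
func runCheck(path string, timeout time.Duration, maxLen int) (exitStatus, string) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	out, err := exec.CommandContext(ctx, path).Output()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return unknown, "timeout"
	}
	var exitErr *exec.ExitError
	switch {
	case err == nil:
		return ok, truncate(string(out), maxLen)
	case errors.As(err, &exitErr):
		return statusFromExitCode(exitErr.ExitCode()), truncate(string(out), maxLen)
	default:
		// The script could not be started at all (missing file, no permission).
		return unknown, truncate(err.Error(), maxLen)
	}
}

func main() {
	for _, code := range []int{0, 1, 2} {
		fmt.Printf("exit code %d -> status %d\n", code, statusFromExitCode(code))
	}
	// Optionally run a real script: go run . ./config/plugin/check_ntp.sh
	if len(os.Args) > 1 {
		status, msg := runCheck(os.Args[1], 3*time.Second, 80)
		fmt.Printf("status %d: %s\n", status, msg)
	}
}
```

The 3-second timeout and 80-byte limit in `main` echo the values in the example config, but nothing in this sketch reads that config.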
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| # Custom Plugin Monitor | ||
|
Review comment: We should complete the document. I'm fine with doing that in a following PR.
Review comment: Agreed on detailing this in another PR.
Review comment: Filed #150 to track unresolved comments. :) |
||
|
|
||
| Custom plugin monitor is a plugin mechanism for node-problem-detector. It | ||
| extends node-problem-detector to execute monitor scripts written in any language. | ||
| The monitor scripts must conform to the plugin protocol in exit code and standard | ||
| output. For more information about the plugin protocol, please refer to the | ||
| [node-problem-detector plugin interface proposal](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,167 @@ | ||
| /* | ||
| Copyright 2017 The Kubernetes Authors All rights reserved. | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| */ | ||
|
|
||
| package custompluginmonitor | ||
|
|
||
| import ( | ||
|
Review comment: Follow the convention:
import(
"encoding/json"
"io/ioutil"
"time"
"github.com/golang/glog"
cpmtypes "k8s.io/node-problem-detector/pkg/custompluginmonitor/types"
"k8s.io/node-problem-detector/pkg/types"
"k8s.io/node-problem-detector/pkg/custompluginmonitor/plugin"
"k8s.io/node-problem-detector/pkg/util/tomb"
)
Review comment: Will do. What conversion should kubernetes use?
Review comment: I manually do that. :p |
||
| "encoding/json" | ||
| "io/ioutil" | ||
| "time" | ||
|
|
||
| "github.com/golang/glog" | ||
|
|
||
| "k8s.io/node-problem-detector/pkg/custompluginmonitor/plugin" | ||
| cpmtypes "k8s.io/node-problem-detector/pkg/custompluginmonitor/types" | ||
| "k8s.io/node-problem-detector/pkg/types" | ||
| "k8s.io/node-problem-detector/pkg/util/tomb" | ||
| ) | ||
|
|
||
| type customPluginMonitor struct { | ||
| config cpmtypes.CustomPluginConfig | ||
| conditions []types.Condition | ||
| plugin *plugin.Plugin | ||
| resultChan <-chan cpmtypes.Result | ||
| statusChan chan *types.Status | ||
| tomb *tomb.Tomb | ||
| } | ||
|
|
||
| // NewCustomPluginMonitorOrDie creates a new customPluginMonitor and panics if an error occurs. | ||
| func NewCustomPluginMonitorOrDie(configPath string) types.Monitor { | ||
| c := &customPluginMonitor{ | ||
| tomb: tomb.NewTomb(), | ||
| } | ||
| f, err := ioutil.ReadFile(configPath) | ||
| if err != nil { | ||
| glog.Fatalf("Failed to read configuration file %q: %v", configPath, err) | ||
| } | ||
| err = json.Unmarshal(f, &c.config) | ||
| if err != nil { | ||
| glog.Fatalf("Failed to unmarshal configuration file %q: %v", configPath, err) | ||
| } | ||
| // Apply configurations | ||
| err = (&c.config).ApplyConfiguration() | ||
| if err != nil { | ||
| glog.Fatalf("Failed to apply configuration for %q: %v", configPath, err) | ||
| } | ||
|
|
||
| // Validate configurations | ||
| err = c.config.Validate() | ||
| if err != nil { | ||
| glog.Fatalf("Failed to validate custom plugin config %+v: %v", c.config, err) | ||
| } | ||
|
|
||
| glog.Infof("Finish parsing custom plugin monitor config file: %+v", c.config) | ||
|
|
||
| c.plugin = plugin.NewPlugin(c.config) | ||
| // A 1000 size channel should be big enough. | ||
| c.statusChan = make(chan *types.Status, 1000) | ||
| return c | ||
| } | ||
|
|
||
| func (c *customPluginMonitor) Start() (<-chan *types.Status, error) { | ||
| glog.Info("Start custom plugin monitor") | ||
| go c.plugin.Run() | ||
| go c.monitorLoop() | ||
| return c.statusChan, nil | ||
| } | ||
|
|
||
| func (c *customPluginMonitor) Stop() { | ||
| glog.Info("Stop custom plugin monitor") | ||
| c.tomb.Stop() | ||
| } | ||
|
|
||
| // monitorLoop is the main loop of custom plugin monitor. | ||
| func (c *customPluginMonitor) monitorLoop() { | ||
| c.initializeStatus() | ||
|
|
||
| resultChan := c.plugin.GetResultChan() | ||
|
|
||
| for { | ||
| select { | ||
| case result := <-resultChan: | ||
| glog.V(3).Infof("Receive new plugin result: %+v", result) | ||
| status := c.generateStatus(result) | ||
| glog.Infof("New status generated: %+v", status) | ||
| c.statusChan <- status | ||
| case <-c.tomb.Stopping(): | ||
| c.plugin.Stop() | ||
| glog.Infof("Custom plugin monitor stopped") | ||
| c.tomb.Done() | ||
| // Return here: a bare break would only leave the select, not the for loop. | ||
| return | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // generateStatus generates status from the plugin check result. | ||
| func (c *customPluginMonitor) generateStatus(result cpmtypes.Result) *types.Status { | ||
| timestamp := time.Now() | ||
| var events []types.Event | ||
| if result.Rule.Type == types.Temp { | ||
| // For a temporary error, only generate an event when the exit status is NonOK or higher | ||
| if result.ExitStatus >= cpmtypes.NonOK { | ||
| events = append(events, types.Event{ | ||
| Severity: types.Warn, | ||
| Timestamp: timestamp, | ||
| Reason: result.Rule.Reason, | ||
| Message: result.Message, | ||
| }) | ||
| } | ||
| } else { | ||
| // For a permanent error, update the matching condition | ||
| for i := range c.conditions { | ||
| condition := &c.conditions[i] | ||
| if condition.Type == result.Rule.Condition { | ||
|
Review comment: Combine the logic:
status = (result.ExitStatus >= cpmtypes.NonOK)
if condition.Status != status || condition.Reason != result.Rule.Reason {
condition.Transition = timestamp
condition.Message = result.Message
}
condition.Status = status
condition.Reason = result.Rule.Reason
Review comment: And please generate event. :P
Review comment: Once we have a conclusion on emitting events when a condition changes, I would prefer to add this as a TODO and address it in the next PR. :) |
||
| status := result.ExitStatus >= cpmtypes.NonOK | ||
| if condition.Status != status || condition.Reason != result.Rule.Reason { | ||
| condition.Transition = timestamp | ||
| condition.Message = result.Message | ||
| } | ||
| condition.Status = status | ||
| condition.Reason = result.Rule.Reason | ||
| break | ||
| } | ||
| } | ||
| } | ||
| return &types.Status{ | ||
| Source: c.config.Source, | ||
| // TODO(random-liu): Aggregate events and conditions and then do periodically report. | ||
| Events: events, | ||
| Conditions: c.conditions, | ||
| } | ||
| } | ||
|
|
||
| // initializeStatus initializes the internal condition and also reports it to the node problem detector. | ||
| func (c *customPluginMonitor) initializeStatus() { | ||
| // Initialize the default node conditions | ||
| c.conditions = initialConditions(c.config.DefaultConditions) | ||
| glog.Infof("Initialize condition generated: %+v", c.conditions) | ||
| // Update the initial status | ||
| c.statusChan <- &types.Status{ | ||
| Source: c.config.Source, | ||
| Conditions: c.conditions, | ||
| } | ||
| } | ||
|
|
||
| func initialConditions(defaults []types.Condition) []types.Condition { | ||
| conditions := make([]types.Condition, len(defaults)) | ||
| copy(conditions, defaults) | ||
| for i := range conditions { | ||
| // TODO(random-liu): Validate default conditions | ||
| conditions[i].Status = false | ||
| conditions[i].Transition = time.Now() | ||
| } | ||
| return conditions | ||
| } | ||
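The review thread on `generateStatus` asks for an event to be emitted whenever a condition changes, which the PR defers to a follow-up. The following is a minimal sketch of what that combined logic could look like. The `condition` and `event` types and the `updateCondition` helper are hypothetical stand-ins for illustration; they are not the `types.Condition` and `types.Event` definitions from node-problem-detector, and this is not the behavior of the code in this PR.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a simplified, hypothetical stand-in for a node condition.
type condition struct {
	Type       string
	Status     bool
	Transition time.Time
	Reason     string
	Message    string
}

// event is a simplified, hypothetical stand-in for an emitted event.
type event struct {
	Timestamp time.Time
	Reason    string
	Message   string
}

// updateCondition applies a check result to cond. It returns an event only
// when the status or the reason actually changed; otherwise it leaves the
// condition (including its transition time) untouched and returns nil.
func updateCondition(cond *condition, status bool, reason, message string, now time.Time) *event {
	if cond.Status == status && cond.Reason == reason {
		return nil
	}
	cond.Status = status
	cond.Reason = reason
	cond.Message = message
	cond.Transition = now
	return &event{Timestamp: now, Reason: reason, Message: message}
}

func main() {
	cond := condition{Type: "NTPProblem", Reason: "NTPIsUp", Message: "ntp service is up"}

	// First failing check: the condition flips, so an event is produced.
	if ev := updateCondition(&cond, true, "NTPIsDown", "NTP service is not running", time.Now()); ev != nil {
		fmt.Printf("event: %s (%s)\n", ev.Reason, ev.Message)
	}

	// Same result again: nothing changed, so no event is produced.
	if ev := updateCondition(&cond, true, "NTPIsDown", "NTP service is not running", time.Now()); ev == nil {
		fmt.Println("no change, no event")
	}
}
```

Emitting events only on change keeps the event volume low, which matches the concern raised in the review about event load on etcd.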
Review comment: Why indent?
Review comment: This is mainly to keep the code block indented in line with the preceding content in the same section.