-
Notifications
You must be signed in to change notification settings - Fork 679
Detect kubelet and container runtime frequent restarts #223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/assign @Random-Liu |
b1cd976 to
b398d5f
Compare
ea26527 to
aa53d4c
Compare
|
@Random-Liu, I have updated this PR with delay option. PTAL |
cmd/logcounter/options/options.go
Outdated
|
|
||
| // AddFlags adds log counter command line options to pflag. | ||
| func (fedo *LogCounterOptions) AddFlags(fs *pflag.FlagSet) { | ||
| fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/docker/dockerd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cmd/logcounter/options/options.go
Outdated
| fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc") | ||
| fs.StringVar(&fedo.LogPath, "log-path", "", "The log path that log watcher looks up") | ||
| fs.StringVar(&fedo.Lookback, "lookback", "", "The time log watcher looks up") | ||
| fs.StringVar(&fedo.Delay, "delay", "", "The time duration log watcher delays after node boot time") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a bit more explanation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| glog.Fatalf("Failed to get system info: %v", err) | ||
| startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
| if err != nil { | ||
| glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the error log more informative: "failed to get xxx"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| func NewJournaldWatcher(cfg types.WatcherConfig) types.LogWatcher { | ||
| startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
| if err != nil { | ||
| glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| glog.Infof("Lookback changed to system uptime: %v", lookback) | ||
| } | ||
| // Seek journal client based on the lookback duration. | ||
| start := time.Now().Add(-lookback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now considering lookback when getting start time.
pkg/util/helpers.go
Outdated
| } | ||
| } | ||
|
|
||
| func GetDelayedStartTime(delay string) (time.Time, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this can be combined with lookback, just pass in both delay and lookback, and return starttime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| "--log-path=/var/log/journal", | ||
| "--lookback=20m", | ||
| "--count=5", | ||
| "--pattern=Starting Docker Application Container Engine..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use Starting for docker and containerd, but Started for kubelet? Any particular reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there is no Starting message for kubelet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting.
aa53d4c to
a7b6902
Compare
a7b6902 to
48cfb6d
Compare
wangzhen127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Random-Liu, I have updated this PR. PTAL.
cmd/logcounter/options/options.go
Outdated
|
|
||
| // AddFlags adds log counter command line options to pflag. | ||
| func (fedo *LogCounterOptions) AddFlags(fs *pflag.FlagSet) { | ||
| fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
cmd/logcounter/options/options.go
Outdated
| fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, docker, etc") | ||
| fs.StringVar(&fedo.LogPath, "log-path", "", "The log path that log watcher looks up") | ||
| fs.StringVar(&fedo.Lookback, "lookback", "", "The time log watcher looks up") | ||
| fs.StringVar(&fedo.Delay, "delay", "", "The time duration log watcher delays after node boot time") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| "--log-path=/var/log/journal", | ||
| "--lookback=20m", | ||
| "--count=5", | ||
| "--pattern=Starting Docker Application Container Engine..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there is no Starting message for kubelet.
| glog.Fatalf("Failed to get system info: %v", err) | ||
| startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
| if err != nil { | ||
| glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| func NewJournaldWatcher(cfg types.WatcherConfig) types.LogWatcher { | ||
| startTime, err := util.GetDelayedStartTime(cfg.Delay) | ||
| if err != nil { | ||
| glog.Fatalf("%v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| glog.Infof("Lookback changed to system uptime: %v", lookback) | ||
| } | ||
| // Seek journal client based on the lookback duration. | ||
| start := time.Now().Add(-lookback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now considering lookback when getting start time.
pkg/util/helpers.go
Outdated
| } | ||
| } | ||
|
|
||
| func GetDelayedStartTime(delay string) (time.Time, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
/retest |
48cfb6d to
e5ccf98
Compare
|
Rebased |
|
Just tested on a GKE node and it works. |
Random-Liu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with nits
| if s.clock.Since(log.Timestamp) > lookback || log.Timestamp.Before(s.uptime) { | ||
| // Discard messages before start time. | ||
| if log.Timestamp.Before(s.startTime) { | ||
| glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", log.Message, log.Timestamp, s.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/%v/%q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
|
|
||
| if entry.RealtimeTimestamp < startTimestamp { | ||
| glog.V(5).Infof("Throwing away journal entry before start time: %v < %v", entry.RealtimeTimestamp, startTimestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Include message content?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| glog.V(5).Infof("Throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String()) | ||
| // Discard messages before start time. | ||
| if msg.Timestamp.Before(k.startTime) { | ||
| glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", msg.Message, msg.Timestamp, k.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/%v/%q
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
e5ccf98 to
1f63638
Compare
wangzhen127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. PTAL
| if s.clock.Since(log.Timestamp) > lookback || log.Timestamp.Before(s.uptime) { | ||
| // Discard messages before start time. | ||
| if log.Timestamp.Before(s.startTime) { | ||
| glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", log.Message, log.Timestamp, s.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
|
|
||
| if entry.RealtimeTimestamp < startTimestamp { | ||
| glog.V(5).Infof("Throwing away journal entry before start time: %v < %v", entry.RealtimeTimestamp, startTimestamp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| glog.V(5).Infof("Throwing away msg %v for being too old: %v > %v", msg.Message, msg.Timestamp.String(), lookback.String()) | ||
| // Discard messages before start time. | ||
| if msg.Timestamp.Before(k.startTime) { | ||
| glog.V(5).Infof("Throwing away msg %v before start time: %v < %v", msg.Message, msg.Timestamp, k.startTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Random-Liu, wangzhen127 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR adds customer plugin that watch systemd logs for frequent kubelet and container runtime restarts. It reuses log counter for frequent restart detection. Notice that kubelet and container runtime restart logs are actually in systemd logs, not in kubelet or container runtime logs.
Noticeable changes:
dockerdas the source name.