Add SMART Monitoring with dash and alerts #251
Force-pushed from 8f37c1c to 6de74ee
This has been tested on sms-lab and can be demoed there. Good job @technowhizz
Just noticed we are missing the change to globals.yml, @technowhizz, please could you add it?
E.g. this bit:
# Additional command line flags for node exporter to enable the textfile collector for disk metrics, and a docker volume for the textfile directory
prometheus_node_exporter_extra_volumes: "textfile:/var/lib/node_exporter/textfile_collector"
prometheus_node_exporter_cmdline_extras: "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector"
Ah ok. I guess that answers my question to Mark about including the changes in this PR or not. :) Thank you!
Enabled the textfile collector in node exporter in kolla/globals.yml.
Added the smartmon script as-is from the prometheus-community GitHub and then removed NVMe support from this script in favour of using the nvme-cli script, which has also been added. This is because the nvme-cli script provides better metrics than the smartmon script does. The script also adds the serial number of the disk as a label to all SMART metrics.
Added a Kayobe custom playbook to easily deploy the scripts and the associated cron job. This playbook installs smartmontools and nvme-cli, copies the scripts to the hosts, and sets up a cron job which runs the scripts and stores the metrics in the docker volume for node exporter. The playbook changes the way the metrics are saved to a file: the output is written to a temporary file and then moved into place with mv, which is atomic. This was needed because Prometheus would at times read a partially written file.
Added a Prometheus alert that fires when a drive is reported as not healthy for more than 10 minutes.
Added a Grafana dashboard to display the number of healthy and unhealthy drives reported in Prometheus.
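For readers unfamiliar with the atomic-write approach mentioned in the commit message, here is a minimal sketch of the idea. It is not the playbook's actual cron job: the function name `write_metrics_atomically` and the example paths are hypothetical. It assumes the temporary file lives in the same directory as the destination, so that mv is a rename on the same filesystem. The temporary name does not end in `.prom`, so the textfile collector ignores it.

```shell
#!/bin/sh
# Sketch only: write_metrics_atomically is a hypothetical helper, not part of the PR.
# Usage: write_metrics_atomically DEST COMMAND [ARGS...]
# Runs COMMAND, captures its stdout in a temporary file next to DEST, and
# renames the temporary file over DEST only if COMMAND succeeded.
write_metrics_atomically() {
    dest="$1"
    shift
    # Same directory as DEST so that mv is a rename, not a copy.
    # The ".$$" suffix (shell PID) means the name does not end in ".prom".
    tmp="${dest}.$$"
    if "$@" > "$tmp"; then
        mv "$tmp" "$dest"
    else
        # Leave any previous DEST untouched and clean up the partial output.
        rm -f "$tmp"
        return 1
    fi
}

# Example invocation (paths are assumptions, adjust to the deployment):
#   write_metrics_atomically /path/to/textfile_collector/smartmon.prom /path/to/smartmon.sh
```

With this pattern a scrape sees either the previous complete file or the new complete file, never a half-written one. If the script fails, the old metrics stay in place rather than being replaced by truncated output.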
Force-pushed from 6de74ee to d83ecde
@dougszumski @markgoddard Added in the changes to globals.yml
docs now in stackhpc/yoga
…e-config into feature/smartmon
Fix kayobe command (Co-authored-by: Will Szumski)
Fix spelling (Co-authored-by: Will Szumski)
Nice job!
The force push means the commit is gone, so the only way to resolve this comment is to dismiss the review. Sorry! :)
Excellent job @technowhizz