Add SMART Monitoring with dash and alerts #251

Merged
merged 13 commits into from
Dec 1, 2022

technowhizz
Contributor

Added the smartmon script as-is from the prometheus-community GitHub repository, then removed NVMe support from it in favour of the nvme-cli script, which has also been added. This is because the nvme-cli script provides better metrics than the smartmon script does. The script also adds the serial number of the disk as a label to all SMART metrics.

Added a Kayobe custom playbook to easily deploy the scripts and the associated cron job. The playbook installs smartmontools and nvme-cli, copies the scripts over to the hosts, and sets up a cron job which runs the scripts and stores the metrics in the Docker volume for node exporter. The playbook also changes the way the metrics are saved: the output is written to a file and then moved into place with the mv command, which is atomic. This was needed because Prometheus would at times read a partially written file.
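
For illustration only, the write-then-rename pattern described above might look like the following Ansible task. This is a minimal sketch, not the playbook from this PR: the script location, the five-minute schedule, the `.temp` suffix, and the host path of the `textfile` Docker volume are all assumptions.

```yaml
# Hypothetical task; every path and the schedule below are assumptions.
- name: Schedule SMART metrics collection for the textfile collector
  ansible.builtin.cron:
    name: smartmon
    user: root
    minute: "*/5"
    # Write to a temporary file first, then rename it into place with mv,
    # so a reader never sees a half-written metrics file.
    job: >-
      /usr/local/bin/smartmon.sh
      > /var/lib/docker/volumes/textfile/_data/smartmon.prom.temp
      && mv /var/lib/docker/volumes/textfile/_data/smartmon.prom.temp
      /var/lib/docker/volumes/textfile/_data/smartmon.prom
```

The rename is only atomic when the temporary file and the final file are on the same filesystem, which is why the sketch keeps both in the same directory. The textfile collector only reads files ending in `.prom`, so the temporary file is ignored while it is being written.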

Added a Prometheus alert that fires when a drive is reported as not healthy for more than 10 minutes.
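
As a rough sketch of what such a rule could look like (not the rule added in this PR), assuming the metric keeps the upstream smartmon script's name `smartmon_device_smart_healthy` and reports 1 for a healthy drive; the group name, alert name, severity label, and annotation text are hypothetical:

```yaml
groups:
  - name: smart  # hypothetical group name
    rules:
      - alert: StorageDeviceUnhealthy  # hypothetical alert name
        # Assumes the upstream metric name; 1 means the drive passed
        # its SMART health check, so anything below 1 is unhealthy.
        expr: smartmon_device_smart_healthy < 1
        # Only fire once the condition has held for 10 minutes.
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "SMART reports an unhealthy drive on {{ $labels.instance }}"
```

The `for: 10m` clause is what implements the "more than 10 minutes" condition: the alert stays pending until the expression has been true continuously for that long.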

Added a Grafana dashboard to display the number of healthy and unhealthy drives reported in Prometheus.

@technowhizz technowhizz requested a review from a team as a code owner November 25, 2022 17:23
Member

@dougszumski dougszumski left a comment

This has been tested on sms-lab and can be demoed there. Good job @technowhizz

@dougszumski dougszumski self-requested a review November 28, 2022 11:45
Member

@dougszumski dougszumski left a comment

Just noticed we are missing the change to globals.yml, @technowhizz, please could you add it?

Eg. this bit:

# Additional command line flags for node exporter to enable the textfile collector for disk metrics and create the textfile Docker volume
prometheus_node_exporter_extra_volumes: "textfile:/var/lib/node_exporter/textfile_collector"
prometheus_node_exporter_cmdline_extras: "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector"

@technowhizz
Contributor Author

Ah ok. I guess that answers my question to Mark about including the changes in this PR or not. :) Thank you!

Enabled Textfile collector in node exporter in kolla/globals.yml

@technowhizz
Contributor Author

@dougszumski @markgoddard Added in the changes to globals.yml

@markgoddard
Contributor

docs now in stackhpc/yoga

Contributor

@markgoddard markgoddard left a comment

Nice job!

@technowhizz technowhizz removed the request for review from dougszumski November 30, 2022 15:21
@technowhizz technowhizz dismissed dougszumski’s stale review November 30, 2022 15:27

The force push means the commit is gone, and the only way to resolve this comment is to dismiss the review. Sorry! :)

Member

@dougszumski dougszumski left a comment

Excellent job @technowhizz

@technowhizz technowhizz merged commit edad38b into stackhpc/yoga Dec 1, 2022
@technowhizz technowhizz deleted the feature/smartmon branch December 1, 2022 11:14
@technowhizz technowhizz self-assigned this Dec 1, 2022
technowhizz added a commit that referenced this pull request Dec 21, 2022
4 participants