Add SMART Monitoring with dash and alerts #251

technowhizz · 2022-11-25T17:23:01Z

Added the smartmon script as-is from the prometheus-community GitHub organisation, then removed NVMe support from it in favour of the nvme-cli script, which has also been added. This is because the nvme-cli script provides better metrics than the smartmon script does. The script also adds the serial number of the disk as a label to all SMART metrics.

Added a Kayobe custom playbook to easily deploy the scripts and the associated cron job. This playbook installs smartmontools and nvme-cli, copies the scripts over to the hosts, and sets up a cron job which runs the scripts and stores the metrics in the Docker volume for node exporter. The playbook changes the way the metrics are saved: the output is written to a separate file first and then moved into place with the mv command, which is atomic. This was needed because Prometheus would at times read a partially written file.
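For anyone unfamiliar with the write-then-rename pattern, here is a minimal Python sketch of the same idea. It is only an illustration: the playbook itself uses the shell mv command from cron, and the function name, file names and sample metric line below are hypothetical. The key assumption is that the temporary file lives in the same directory (and so the same filesystem) as the destination, which is what makes the final rename atomic.

```python
import os
import tempfile


def write_metrics_atomically(metrics_text: str, dest_path: str) -> None:
    """Write metrics so that a reader never sees a partially written file.

    Hypothetical helper for illustration; not part of the playbook.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temporary file next to the destination so the rename
    # below stays within one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(metrics_text)
        # mkstemp creates the file with mode 0600; relax it so another
        # process (such as node exporter) is able to read the result.
        os.chmod(tmp_path, 0o644)
        # Atomically swap the complete file into place.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    target = os.path.join(demo_dir, "smartmon.prom")
    write_metrics_atomically("example_metric 1\n", target)
    print(open(target).read(), end="")
```

The textfile collector only reads files ending in `.prom`, so the `.tmp` file is ignored while it is being written.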

Added a Prometheus alert that fires when a drive is reported as not healthy for more than 10 minutes.

Added a Grafana dashboard to display the number of healthy and unhealthy drives reported in Prometheus.

dougszumski

This has been tested on sms-lab and can be demoed there. Good job @technowhizz

etc/kayobe/ansible/smartmon-tools.yml

dougszumski

Just noticed we are missing the change to globals.yml, @technowhizz, please could you add it?

Eg. this bit:

# Additional command line flags for node exporter to enable textfile collector for disk metrics and create textfile docker volume
prometheus_node_exporter_extra_volumes: "textfile:/var/lib/node_exporter/textfile_collector"
prometheus_node_exporter_cmdline_extras: "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector"

technowhizz · 2022-11-28T11:49:18Z

Ah ok. I guess that answers my question to Mark about including the changes in this PR or not. :) Thank you!

Enabled the textfile collector in node exporter in kolla/globals.yml.

Added the smartmon script as-is from the prometheus-community GitHub organisation, then removed NVMe support from it in favour of the nvme-cli script, which has also been added. This is because the nvme-cli script provides better metrics than the smartmon script does. The script also adds the serial number of the disk as a label to all SMART metrics.

Added a Kayobe custom playbook to easily deploy the scripts and the associated cron job. This playbook installs smartmontools and nvme-cli, copies the scripts over to the hosts, and sets up a cron job which runs the scripts and stores the metrics in the Docker volume for node exporter. The playbook changes the way the metrics are saved: the output is written to a separate file first and then moved into place with the mv command, which is atomic. This was needed because Prometheus would at times read a partially written file.

Added a Prometheus alert that fires when a drive is reported as not healthy for more than 10 minutes.

Added a Grafana dashboard to display the number of healthy and unhealthy drives reported in Prometheus.

technowhizz · 2022-11-29T10:52:05Z

@dougszumski @markgoddard Added in the changes to globals.yml

markgoddard · 2022-11-29T12:42:14Z

docs now in stackhpc/yoga

markgoddard

Nice job!

The force push means the commit is gone, and the only way to resolve this comment is to dismiss the review. Sorry! :)

dougszumski

Excellent job @technowhizz

technowhizz requested a review from a team as a code owner November 25, 2022 17:23

technowhizz force-pushed the feature/smartmon branch from 8f37c1c to 6de74ee November 25, 2022 17:40

dougszumski approved these changes Nov 25, 2022

markgoddard reviewed Nov 28, 2022

etc/kayobe/ansible/smartmon-tools.yml (resolved)

dougszumski self-requested a review November 28, 2022 11:45

dougszumski previously requested changes Nov 28, 2022

technowhizz force-pushed the feature/smartmon branch from 6de74ee to d83ecde November 29, 2022 10:50

technowhizz requested a review from dougszumski November 29, 2022 11:19

technowhizz closed this Nov 29, 2022

technowhizz reopened this Nov 29, 2022

Merge branch 'stackhpc/yoga' into feature/smartmon

49f0d39

technowhizz added 4 commits November 29, 2022 13:05

Merge remote-tracking branch 'origin/stackhpc/yoga' into feature/smartmon

5baac36

Merge branch 'feature/smartmon' of github.com:stackhpc/stackhpc-kayobe-config into feature/smartmon

a4ed336

Merge remote-tracking branch 'origin/stackhpc/yoga' into feature/smartmon

3e14df7

Add docs for SMART Monitoring

595429a

jovial reviewed Nov 29, 2022

doc/source/configuration/monitoring.rst (outdated, resolved)

doc/source/configuration/monitoring.rst (outdated, resolved)

doc/source/configuration/monitoring.rst (outdated, resolved)

Merge remote-tracking branch 'origin/stackhpc/yoga' into feature/smartmon

6dcff3d

markgoddard reviewed Nov 30, 2022

etc/kayobe/kolla/config/prometheus/system.rules (outdated, resolved)

doc/source/configuration/monitoring.rst (outdated, resolved)

doc/source/configuration/monitoring.rst (outdated, resolved)

markgoddard reviewed Nov 30, 2022

doc/source/configuration/monitoring.rst (resolved)

technowhizz and others added 5 commits November 30, 2022 13:53

Update doc/source/configuration/monitoring.rst

9a5fc53

Fix kayobe command Co-authored-by: Will Szumski <[email protected]>

Update doc/source/configuration/monitoring.rst

ef25d6f

Fix Spelling Co-authored-by: Will Szumski <[email protected]>

Add release note

3d4d011

Amend docs and add release note

b6cb511

Move SMART prometheus alert to own file

b353fd3

markgoddard suggested changes Nov 30, 2022

doc/source/configuration/monitoring.rst (outdated, resolved)

Fix typo

611f2fb

markgoddard approved these changes Nov 30, 2022

technowhizz removed the request for review from dougszumski November 30, 2022 15:21

dougszumski approved these changes Nov 30, 2022

technowhizz merged commit edad38b into stackhpc/yoga Dec 1, 2022

technowhizz deleted the feature/smartmon branch December 1, 2022 11:14

technowhizz self-assigned this Dec 1, 2022

technowhizz added a commit that referenced this pull request Dec 21, 2022

Update system.rules to add back Smart Disk status

5e52498

Original PR here: #251

technowhizz mentioned this pull request Dec 21, 2022

Update system.rules to add back Smart Disk status #310

Closed
