Enterprise monitoring is a living, breathing entity that requires constant evaluation and adjustment to produce reliable, accurate results. In practice, once monitoring is defined it is rarely reviewed, and edge-case checks are bolted on as new systems evolve. The result is false alerts, noise, and a loss of trust in the monitoring system.
Alert or Notify?
An alert should be triggered only if it warrants action and a defined action exists to address it. When designing monitoring, people often get carried away with the number of metrics available and end up generating a lot of noise. Monitoring data is good to have for troubleshooting, but that does not mean every metric should alert!
When should you alert?
- Service impact detected and human intervention required.
- Breaching a threshold that is likely to cause a service outage.
- Critical failures that have defined actions to rectify the issue.
- Failures that have the potential to cause an outage, such as a host being down.
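The criteria above can be sketched as a small routing function. This is only an illustration: the `Event` fields, the `Route` names, and the `route` function are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ALERT = "alert"    # page the on-call engineer
    NOTIFY = "notify"  # record for troubleshooting, no page


@dataclass
class Event:
    # All fields are hypothetical flags set by whatever produces the event.
    name: str
    service_impact: bool = False      # users are affected right now
    needs_human: bool = False         # cannot recover without intervention
    outage_risk: bool = False         # threshold breach or failure (e.g. host down) likely to cause an outage
    critical: bool = False            # critical failure
    has_defined_action: bool = False  # a documented fix exists


def route(event: Event) -> Route:
    """Alert only when one of the listed criteria is met; otherwise just notify."""
    if event.service_impact and event.needs_human:
        return Route.ALERT
    if event.outage_risk:
        return Route.ALERT
    if event.critical and event.has_defined_action:
        return Route.ALERT
    return Route.NOTIFY
```

For example, `route(Event("short CPU spike"))` returns `Route.NOTIFY`, while `route(Event("host down", outage_risk=True))` returns `Route.ALERT`.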
What to monitor
Quote from the SRE Book:
"The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four."
Not all metrics require monitoring and even fewer require alerting.
There are usually two areas to consider when defining monitoring:
- Events that can cause service disruption and require manual intervention (on-call).
- Events and thresholds recorded to help fault finding and long-term analysis.
The four golden signals are especially suitable for defining monitoring for events that generate alerts. For example, high CPU for a short period does not warrant an alert unless it affects latency or errors, or the resource is reaching saturation.
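A minimal sketch of the CPU example follows. It assumes the monitoring system can supply current CPU, p99 latency, error rate, and how long CPU has been high. The `Signals` type, the function name, and every threshold value are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # Hypothetical snapshot of one service's metrics.
    cpu_pct: float           # current CPU utilisation, 0-100
    p99_latency_ms: float    # 99th percentile request latency
    error_rate: float        # failed requests / total requests
    high_cpu_minutes: float  # how long CPU has stayed above the high mark


def high_cpu_should_alert(
    s: Signals,
    cpu_high_pct: float = 90.0,       # assumed threshold
    latency_limit_ms: float = 500.0,  # assumed threshold
    error_limit: float = 0.01,        # assumed threshold
    sustained_minutes: float = 15.0,  # assumed threshold
) -> bool:
    """Alert on high CPU only if users are affected or the CPU is saturated."""
    if s.cpu_pct < cpu_high_pct:
        return False  # this check only concerns high CPU
    user_impact = s.p99_latency_ms > latency_limit_ms or s.error_rate > error_limit
    saturated = s.high_cpu_minutes >= sustained_minutes
    return user_impact or saturated
```

With these assumed thresholds, a two-minute spike to 95% CPU with normal latency and errors is recorded but does not alert. The same spike alongside a p99 latency of 900 ms does.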
How many alerts?
The number of alerts for the same incident should be minimal. For example, when an underlying host fails, every check on that server fails too. There should be a single alert for the host, and the remaining checks should depend on it so that they are suppressed while the host is down.
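The dependency idea can be sketched as a filter over the set of failing checks. Many monitoring tools provide check dependencies natively. The function below only illustrates the logic, and the check names and the `depends_on` mapping are hypothetical.

```python
from typing import Dict, List, Set


def alerts_to_send(failing: Set[str], depends_on: Dict[str, str]) -> List[str]:
    """Keep a failing check only if none of its ancestors is also failing.

    `depends_on` maps a check to the single check it depends on
    (for example a service check to its host check).
    """
    result = []
    for check in sorted(failing):
        parent = depends_on.get(check)
        visited = set()  # guards against accidental dependency cycles
        suppressed = False
        while parent is not None and parent not in visited:
            if parent in failing:
                suppressed = True
                break
            visited.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            result.append(check)
    return result


# Hypothetical checks: three service checks that depend on the host check.
DEPENDS_ON = {
    "web01/cpu": "web01",
    "web01/disk": "web01",
    "web01/http": "web01",
}
```

If `web01` and all three of its checks are failing, `alerts_to_send` returns only `["web01"]`. If just `web01/disk` is failing, that check alerts on its own.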
Repeat alerts are another major source of noise. Once an alert has been acknowledged or is being actioned, further repeats only disrupt the investigation and increase recovery time, because the engineer has to break their train of thought to attend to each one.
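One way to picture this is a small state tracker that stops notifying once an alert is acknowledged and re-arms when it resolves. This is a sketch under assumptions: the `AlertTracker` class and its method names are hypothetical, and real alerting tools usually offer acknowledgement or silencing features that do this.

```python
class AlertTracker:
    """Suppress repeat notifications for alerts that are already acknowledged."""

    def __init__(self) -> None:
        self._acknowledged = set()

    def should_notify(self, alert_key: str) -> bool:
        """Return True if a firing alert should produce a notification."""
        return alert_key not in self._acknowledged

    def acknowledge(self, alert_key: str) -> None:
        """An engineer has picked the alert up; stop repeating it."""
        self._acknowledged.add(alert_key)

    def resolve(self, alert_key: str) -> None:
        """The problem is fixed; a future failure should notify again."""
        self._acknowledged.discard(alert_key)
```

Before acknowledgement, `should_notify("web01 down")` returns `True` on every evaluation. After `acknowledge("web01 down")` it returns `False` until `resolve("web01 down")` is called.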