Configure Alerts for Anypoint Platform PCE
Anypoint Platform Private Cloud Edition (Anypoint Platform PCE) provides built-in alerts that are triggered when a condition specified in any alert definition is detected.
Measurements are stored in Prometheus and read by Alertmanager. Alertmanager sends emails when an alert is triggered.
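If you want to confirm that the monitoring stack is available before working with alerts, you can list its pods. The following is a minimal sketch; it assumes the stack runs in the `monitoring` namespace and uses the `app=alertmanager` label referenced by the commands later in this guide:

```
# List all monitoring pods, including Prometheus and Alertmanager
# (assumes the bundled stack runs in the "monitoring" namespace).
kubectl get pods -n monitoring

# Narrow the output to the Alertmanager pods
# (label taken from the restart step later in this guide).
kubectl get pods -n monitoring -l app=alertmanager
```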
Alert Definitions
Default alerts are listed in the following table:
Component | Alert | Description |
---|---|---|
CPU | High CPU usage | Triggers a warning when > 75% used; triggers a critical error when > 90% used |
Memory | High memory usage | Triggers a warning when > 80% used; triggers a critical error when > 90% used |
Systemd | Overall systemd health | Triggers an error when systemd detects a failed service |
Systemd | Individual systemd unit health | Triggers an error when a systemd unit is not loaded or active |
Filesystem | High disk space usage | Triggers a warning when > 80% used; triggers a critical error when > 90% used |
Filesystem | High inode usage | Triggers a warning when > 90% used; triggers a critical error when > 95% used |
System | Uptime | Triggers a warning when the uptime for a node is less than five minutes |
System | Kernel parameters | Triggers an error if a parameter is not set. See the value matrix |
Etcd | Etcd instance health | Triggers an error when an etcd leader is down longer than five minutes |
Etcd | Etcd latency check | Triggers a warning when follower-to-leader latency exceeds 500 ms; triggers an error when it exceeds one second over a period of one minute |
Docker | Docker daemon health | Triggers an error when the Docker daemon is down |
Kubernetes | Kubernetes node readiness | Triggers an error when the node is not ready |
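If you want to see the Prometheus rules behind these built-in alerts, the following is a minimal sketch; it assumes the bundled kube-prometheus operator exposes them as PrometheusRule resources in the `monitoring` namespace:

```
# List the rule bundles managed by the Prometheus operator
# (assumes the PrometheusRule custom resource is installed).
kubectl get prometheusrules -n monitoring

# Dump one bundle to inspect individual alerting rules and thresholds.
# <RULE_NAME> is a placeholder for a name returned by the previous command.
kubectl get prometheusrule <RULE_NAME> -n monitoring -o yaml
```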
Configure Alert Definitions
You define new alerts using a gravity resource called `alert`, as shown in the following example:
```yaml
kind: alert
version: v2
metadata:
  name: cpu-alert
spec:
  # the alert name
  alert_name: CPUAlert
  # the rule group the alert belongs to
  group_name: test-group
  # the alert expression
  formula: |
    node:cluster_cpu_utilization:ratio * 100 > 80
  # the alert labels
  labels:
    severity: info
  # the alert annotations
  annotations:
    description: |
      Cluster CPU usage exceeds 80%.
```
See the Alerting Rules documentation for more details about Prometheus alerts.
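The `severity` label controls how an alert is classified. As a sketch of how the warning/critical pairs in the table above could be expressed, the following hypothetical definitions split high CPU usage into two alerts using the same recording rule as the example above; the resource names, alert names, and group name are illustrative:

```yaml
# Hypothetical warning/critical pair mirroring the CPU thresholds in the table
# above (warning > 75%, critical > 90%); adjust names and thresholds as needed.
kind: alert
version: v2
metadata:
  name: cpu-warning
spec:
  alert_name: CPUHighWarning
  group_name: test-group
  formula: |
    node:cluster_cpu_utilization:ratio * 100 > 75
  labels:
    severity: warning
  annotations:
    description: |
      Cluster CPU usage exceeds 75%.
---
kind: alert
version: v2
metadata:
  name: cpu-critical
spec:
  alert_name: CPUHighCritical
  group_name: test-group
  formula: |
    node:cluster_cpu_utilization:ratio * 100 > 90
  labels:
    severity: critical
  annotations:
    description: |
      Cluster CPU usage exceeds 90%.
```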
- To create an alert, run:

  ```
  gravity resource create alert.yaml
  ```

- To view existing alerts, run:

  ```
  gravity resource get alerts
  ```

- To remove an alert, run:

  ```
  gravity resource rm alert cpu-alert
  ```
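Before creating an alert, it can help to evaluate its expression directly against Prometheus to confirm it returns the series you expect. The following is a minimal sketch; the Prometheus service name and port are assumptions based on the Alertmanager service name used later in this guide, so adjust them to match your cluster:

```
# Evaluate the alert expression via the Prometheus HTTP API.
# The service name "monitoring-kube-prometheus-prometheus" and port 9090 are
# assumptions; substitute the Prometheus endpoint in your cluster.
curl -G 'http://monitoring-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090/api/v1/query' \
  --data-urlencode 'query=node:cluster_cpu_utilization:ratio * 100 > 80'
```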
Configure Alerts Delivery
To configure Alertmanager to send email alerts, create the following gravity resources:
- Using the following spec, create a file named `smtp-config.yaml` inside gravity, and replace the placeholder values with those of your SMTP configuration:

  ```yaml
  kind: smtp
  version: v2
  metadata:
    name: smtp
  spec:
    host: <SMTP_HOST>
    port: <SMTP_PORT>
    username: <SMTP_USERNAME>
    password: <SMTP_PASSWORD>
  ---
  kind: alerttarget
  version: v2
  metadata:
    name: email-alerts
  spec:
    # email address of the alerts recipient
    email: <RECIPIENT_EMAIL>
  ```
- Run `gravity resource create smtp-config.yaml`. You should see the following output:

  ```
  Created cluster SMTP configuration
  Created monitoring alert target "email-alerts"
  ```
- Add a default route to the Alertmanager configuration inside gravity:

  ```
  kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq --arg foo "$(kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq -r '.data["alertmanager.yaml"]' | base64 -d | yq r - --tojson | jq -r '.route.routes[1] |= . + {"match":{"alertname": "Watchdog", "receiver": "default", "continue": true}}' | jq -r '.route.routes[0].match += {"continue":true}' | yq r - -P | base64 | tr -d '\n')" '.data["alertmanager.yaml"]=$foo' | kubectl apply -f -
  ```
- Configure the FROM email address by replacing the `<SMTP_FROM>` value:

  ```
  kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq --arg foo "$(kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq -r '.data["alertmanager.yaml"]' | base64 -d | yq w - 'global.smtp_from' <SMTP_FROM> | base64 | tr -d '\n')" '.data["alertmanager.yaml"]=$foo' | kubectl apply -f -
  ```
- Restart the Alertmanager pods:

  ```
  kubectl delete pod -n monitoring -l app=alertmanager
  ```
- Test Alertmanager by running the following command inside gravity. To confirm that the test alert was received, see the verification sketch after this procedure.

  ```
  curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"test-alert","state":"firing"}}]' http://monitoring-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093/api/v1/alerts
  ```
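After completing these steps, you can verify that your changes were applied and that the test alert reached Alertmanager. This is a minimal sketch that reuses the secret, namespace, and service names from the commands above:

```
# Dump the rendered Alertmanager configuration and check that the SMTP settings
# and default route are present.
kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json \
  | jq -r '.data["alertmanager.yaml"]' | base64 -d

# List the alerts currently known to Alertmanager; the "test-alert" fired above
# should appear in the output.
curl -s http://monitoring-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093/api/v1/alerts
```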
Troubleshooting Alerts
Common troubleshooting tasks include the following:
- Verify that your SMTP server can send and receive emails using the addresses you defined as the `FROM` and `TO` addresses when you configured alerts delivery.

- Verify that your cluster nodes can communicate with your SMTP server. For example, use `telnet` to connect to your SMTP server from one of your cluster nodes (if `telnet` is not available, see the sketch after this list):

  ```
  telnet my.smtp.server.com 587
  Trying XXX.XXX.XXX.XXX...
  Connected to my.smtp.server.com.
  Escape character is '^]'.
  220 my.smtp.server.com ESMTP
  ^]
  telnet> quit
  Connection closed.
  ```
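If `telnet` is not installed on your nodes, the following is an alternative sketch that uses `openssl` to open the same connection and additionally negotiate STARTTLS on the submission port:

```
# Connect to the SMTP server and negotiate STARTTLS; a successful handshake
# prints the certificate details followed by a "250 ..." response.
# Replace my.smtp.server.com with your SMTP server.
openssl s_client -starttls smtp -connect my.smtp.server.com:587
```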