Configure Alerts for Anypoint Platform PCE

Anypoint Platform Private Cloud Edition (Anypoint Platform PCE) provides built-in alerts that are triggered when a condition specified in any alert definition is detected. Measurements are stored in Prometheus and read by Alertmanager, which sends an email when an alert is triggered.

Alert Definitions

Default alerts are listed in the following table:

| Component | Alert | Description |
|---|---|---|
| CPU | High CPU usage | Triggers a warning when > 75% used; triggers a critical error when > 90% used |
| Memory | High memory usage | Triggers a warning when > 80% used; triggers a critical error when > 90% used |
| Systemd | Overall systemd health | Triggers an error when systemd detects a failed service |
| Systemd | Individual systemd unit health | Triggers an error when a systemd unit is not loaded or active |
| Filesystem | High disk space usage | Triggers a warning when > 80% used; triggers a critical error when > 90% used |
| Filesystem | High inode usage | Triggers a warning when > 90% used; triggers a critical error when > 95% used |
| System | Uptime | Triggers a warning when the uptime for a node is less than five minutes |
| System | Kernel parameters | Triggers an error if a parameter is not set; see the value matrix for details |
| Etcd | Etcd instance health | Triggers an error when an etcd master is down longer than five minutes |
| Etcd | Etcd latency check | Triggers a warning when follower <-> leader latency exceeds 500 ms; triggers an error when it exceeds one second over a period of one minute |
| Docker | Docker daemon health | Triggers an error when the Docker daemon is down |
| Kubernetes | Kubernetes node readiness | Triggers an error when the node is not ready |

Configure Alert Definitions

You define new alerts using a gravity resource called `alert`, as shown in the following example:

```yaml
kind: alert
version: v2
metadata:
  name: cpu-alert
spec:
  # the alert name
  alert_name: CPUAlert
  # the rule group the alert belongs to
  group_name: test-group
  # the alert expression
  formula: |
    node:cluster_cpu_utilization:ratio * 100 > 80
  # the alert labels
  labels:
    severity: info
  # the alert annotations
  annotations:
    description: |
      Cluster CPU usage exceeds 80%.
```
See the Alerting Rules documentation for more details about Prometheus alerts.
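For illustration, here is a second alert that follows the same schema and raises a warning when memory usage crosses the 80% threshold listed in the table above. This is only a sketch: the resource name `memory-alert` and the metric `node:cluster_memory_utilization:ratio` are assumptions chosen for the example; substitute a recording rule or raw PromQL expression that exists in your Prometheus installation.

```yaml
kind: alert
version: v2
metadata:
  # hypothetical resource name used for this sketch
  name: memory-alert
spec:
  alert_name: MemoryAlert
  group_name: test-group
  # assumed recording rule; replace with an expression available in your cluster
  formula: |
    node:cluster_memory_utilization:ratio * 100 > 80
  labels:
    severity: warning
  annotations:
    description: |
      Cluster memory usage exceeds 80%.
```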
- To create an alert, run `gravity resource create alert.yaml`.

- To view existing alerts, run `gravity resource get alerts`.

- To remove an alert, run `gravity resource rm alert cpu-alert`.
Configure Alerts Delivery
To configure Alertmanager to send email alerts, create the following gravity resources:
- Using the following spec, create a file named `smtp-config.yaml` inside gravity, and replace the placeholder values with those of your SMTP configuration:

  ```yaml
  kind: smtp
  version: v2
  metadata:
    name: smtp
  spec:
    host: <SMTP_HOST>
    port: <SMTP_PORT>
    username: <SMTP_USERNAME>
    password: <SMTP_PASSWORD>
  ---
  kind: alerttarget
  version: v2
  metadata:
    name: email-alerts
  spec:
    # email address of the alerts recipient
    email: <RECIPIENT_EMAIL>
  ```
- Run `gravity resource create smtp-config.yaml`. You should see the following output:

  ```
  Created cluster SMTP configuration
  Created monitoring alert target "email-alerts"
  ```
- Add a default route to the Alertmanager configuration inside gravity (a step-by-step sketch of the decode/edit/re-encode pattern used by this command and the next one follows this procedure):

  ```sh
  kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq --arg foo "$(kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq -r '.data["alertmanager.yaml"]' | base64 -d | yq r - --tojson | jq -r '.route.routes[1] |= . + {"match":{"alertname": "Watchdog", "receiver": "default", "continue": true}}' | jq -r '.route.routes[0].match += {"continue":true}' | yq r - -P | base64 | tr -d '\n')" '.data["alertmanager.yaml"]=$foo' | kubectl apply -f -
  ```
- Configure the FROM email address by replacing the `<SMTP_FROM>` value:

  ```sh
  kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq --arg foo "$(kubectl get secret -n monitoring alertmanager-monitoring-kube-prometheus-alertmanager -o json | jq -r '.data["alertmanager.yaml"]' | base64 -d | yq w - 'global.smtp_from' <SMTP_FROM> | base64 | tr -d '\n')" '.data["alertmanager.yaml"]=$foo' | kubectl apply -f -
  ```
- Restart Alertmanager pods:

  ```sh
  kubectl delete pod -n monitoring -l app=alertmanager
  ```
- Test Alertmanager by running the following command inside gravity:

  ```sh
  curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"test-alert","state":"firing"}}]' http://monitoring-kube-prometheus-alertmanager.monitoring.svc.cluster.local:9093/api/v1/alerts
  ```
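The one-line commands in the route and FROM-address steps above all follow the same pattern: read the Alertmanager configuration out of its Kubernetes secret, decode it, edit it, re-encode it, and apply the result. If you prefer to make such changes in discrete, inspectable steps, the following sketch shows the same pattern. It assumes the secret name and namespace used above and leaves the actual edit of `alertmanager.yaml` to you (for example, with `yq` or a text editor); it is an illustration, not the documented procedure.

```sh
# Sketch of the decode/edit/re-encode pattern used by the one-liners above.
SECRET=alertmanager-monitoring-kube-prometheus-alertmanager

# 1. Extract and decode the current Alertmanager configuration from the secret.
kubectl get secret -n monitoring "$SECRET" \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml

# 2. Edit alertmanager.yaml as needed (routes, global.smtp_from, and so on).

# 3. Re-encode the edited file and write it back into the secret.
kubectl create secret generic "$SECRET" -n monitoring \
  --from-file=alertmanager.yaml \
  --dry-run=client -o yaml | kubectl apply -f -

# 4. Restart the Alertmanager pods so they pick up the new configuration.
kubectl delete pod -n monitoring -l app=alertmanager
```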
Troubleshooting Alerts
Common troubleshooting tasks include the following:
- Verify that your SMTP server can send and receive emails using the addresses you defined as the `FROM` and `TO` addresses when you configured alerts delivery.

- Verify that your cluster nodes can communicate with your SMTP server. For example, use `telnet` to connect to your SMTP server from one of your cluster nodes:

  ```
  telnet my.smtp.server.com 587
  Trying XXX.XXX.XXX.XXX...
  Connected to my.smtp.server.com.
  Escape character is '^]'.
  220 my.smtp.server.com ESMTP
  ^]
  telnet> quit
  Connection closed.
  ```
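Note that many SMTP servers require STARTTLS on the submission port (587), so a plain `telnet` session may not get past the server banner. In that case, a TLS-aware client can be used to verify connectivity; the sketch below uses `openssl s_client` with the same hypothetical hostname as the example above.

```sh
# Open a STARTTLS-capable SMTP session to the (hypothetical) server from a cluster node.
openssl s_client -starttls smtp -connect my.smtp.server.com:587 -crlf
# After the handshake, type "EHLO test" to list server capabilities, then "QUIT" to exit.
```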