Maximizing Multi-Cluster Setups: Prometheus Monitoring Tips

This tip is for those using Prometheus federation to monitor multiple clusters.
How should Alertmanager be configured for multiple clusters? For example, if there is an issue in cluster A, an alert should be sent only for cluster A.

    alerting_rules.yml:
      groups:
        - name: Instances
          rules:
            - alert: TEST ALERT FROM PROMETHEUS PLEASE ACKNOWLEDGE
              expr: prometheus_build_info{instance="localhost:9090"} == 1
              for: 10s
              labels:
                severity: page
              annotations:
                description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
                summary: 'Instance {{ $labels.instance }} down'
                action: TESTING PLEASE ACKNOWLEDGE, NO FURTHER ACTION REQUIRED ONLY A TEST

![]({{ 'assets/images/prometheus-alert-1.png' | relative_url }})

In such cases, every alert should be routed to the proper team based on its labels (if there is a problem with application A on cluster B, the team responsible should be notified). In the above case, two alerts are triggered by the same rule, so you'll have to deduplicate them. And if you don't wish to be alerted on each trigger of very similar alerts, you can treat them as a group.
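
Label-based routing like this might be sketched in `alertmanager.yml` as follows. This is only an illustration under stated assumptions: it assumes every alert carries a `cluster` label (for example, set through `external_labels` in each cluster's Prometheus configuration) and an `app` label, and the receiver names and webhook URLs are hypothetical placeholders. The `matchers` syntax needs a reasonably recent Alertmanager; older releases use `match` instead.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are placeholders)
route:
  # Fallback receiver when no child route matches.
  receiver: default-team
  routes:
    # Problems with application A on cluster B go to the owning team.
    # Listed first because the first matching route wins.
    - matchers:
        - 'cluster = "cluster-b"'
        - 'app = "application-a"'
      receiver: application-a-team
    # Everything else from cluster A goes to the cluster A team.
    - matchers:
        - 'cluster = "cluster-a"'
      receiver: cluster-a-team

receivers:
  - name: default-team
    webhook_configs:
      - url: http://alerts.example.internal/default
  - name: application-a-team
    webhook_configs:
      - url: http://alerts.example.internal/application-a
  - name: cluster-a-team
    webhook_configs:
      - url: http://alerts.example.internal/cluster-a
```

With a tree like this, an alert labeled `cluster="cluster-a"` reaches only the cluster A receiver, and anything without a matching route falls back to the default one.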

If you know that an app on node A has disk issues, and all the other apps on that node have the same issue (the same cause), you probably don't want to receive 10 alerts. You'd rather be informed once when the conditions are met (for example, the alerts were triggered by similar rules, in a similar place, and within a given time interval).
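
A grouping setup for the node example could look roughly like the sketch below. It assumes that the `instance` label identifies the node and that alerts carry a `cluster` label. The receiver name and URL are hypothetical, and the timing values are only starting points to tune.

```yaml
# alertmanager.yml (sketch; receiver name and URL are placeholders)
route:
  receiver: default-team
  # Alerts sharing these label values are bundled into one notification,
  # even if other labels (such as the app name) differ.
  group_by: ['cluster', 'instance', 'alertname']
  # Wait this long for related alerts before sending the first notification.
  group_wait: 30s
  # Minimum wait before notifying about new alerts added to an existing group.
  group_interval: 5m
  # Minimum wait before re-sending a notification that is still firing.
  repeat_interval: 4h

receivers:
  - name: default-team
    webhook_configs:
      - url: http://alerts.example.internal/default
```

Here ten apps on the same node firing the same alert would produce a single grouped notification rather than ten separate ones.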

Do read up on the Alertmanager docs for more information on alert grouping.

Looking for an end-to-end incident alerting, on-call scheduling and response orchestration platform?

Sign up for a 14-day free trial of Zenduty. No CC required. Implement modern incident response and SRE best practices within your production operations and provide industry-leading SLAs to your customers.

Deepak Kumar
