Introduction
Opsgenie, Atlassian's incident management solution, shines in advanced SRE environments thanks to its REST APIs and official Terraform provider. In 2026, with IaC maturity, provisioning teams, alert policies, integrations (Prometheus, Slack), and heartbeats via code is standard for scaling without manual errors.
This advanced tutorial guides you step by step through implementing a complete Opsgenie setup, from Terraform initialization to production-ready resources. Think of it like an electrical circuit: the provider is the power source, and resources are interconnected components. The result? A reproducible, auditable, version-controlled stack that can cut MTTR by as much as 40% in production. Ideal for teams handling 100+ alerts per day. Ready to turn your on-call rotations into a well-oiled machine?
Prerequisites
- Opsgenie Enterprise account (trial OK for testing)
- Terraform 1.9+ installed
- Opsgenie API token (generated via UI > Settings > API Key Integrations)
- Advanced knowledge of HCL/Terraform and REST APIs
- Tools: curl for API calls, Git for versioning
Install the Opsgenie Terraform Provider
```hcl
terraform {
  required_providers {
    opsgenie = {
      source  = "opsgenie/opsgenie"
      version = "~> 0.6"
    }
  }
}

provider "opsgenie" {
  api_key = var.opsgenie_api_key
}

variable "opsgenie_api_key" {
  type        = string
  description = "Opsgenie API key"
  sensitive   = true
}

variable "team_name" {
  type    = string
  default = "SRE-Team"
}
```

This block pins the official Opsgenie provider, published under the `opsgenie` namespace on the Terraform Registry. The `api_key` variable is marked `sensitive` so it never appears in plan output or CI/CD logs. Set it via `terraform.tfvars` or the `TF_VAR_opsgenie_api_key` environment variable. Avoid hardcoding: a leaked key exposes every resource it can manage.
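One way to supply the key without ever writing it to disk is the environment variable route mentioned above; a minimal sketch (the key value is a placeholder, not a real credential):

```bash
# Terraform automatically maps TF_VAR_<name> environment variables
# onto the matching input variables.
export TF_VAR_opsgenie_api_key="00000000-0000-0000-0000-000000000000"

terraform init   # downloads the pinned Opsgenie provider
terraform plan   # the key stays redacted in output (sensitive = true)
```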
Provision an SRE Team
Why start with a team? Opsgenie routes alerts through teams, acting as a central hub. After provisioning, assign users and schedules.
Create the Team and a User
```hcl
resource "opsgenie_user" "sre_oncall" {
  username  = "oncall@example.com"
  full_name = "SRE On-Call"
  role      = "Owner"
}

resource "opsgenie_team" "sre_team" {
  name        = var.team_name
  description = "SRE team for critical incidents"

  members {
    id = opsgenie_user.sre_oncall.id
  }
}
```

The SRE team includes the on-call user via a dynamic ID reference. The `Owner` role grants admin rights. Because the `members` block references `opsgenie_user.sre_oncall.id`, Terraform infers the creation order implicitly; no `depends_on` is needed here, but always verify with `terraform plan` before applying.
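Real teams rarely have a single member. A hedged sketch of scaling user provisioning with `for_each` (the e-mail addresses and the `sre_users` variable are hypothetical, not part of the setup above):

```hcl
variable "sre_users" {
  type = map(string) # username (e-mail) => full name
  default = {
    "alice@example.com" = "Alice Martin"
    "bob@example.com"   = "Bob Chen"
  }
}

resource "opsgenie_user" "sre" {
  for_each  = var.sre_users
  username  = each.key
  full_name = each.value
  role      = "User"
}
```

Each generated user can then be attached to the team with a `dynamic "members"` block, keeping membership entirely data-driven.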
Configure a Prometheus Integration
Integrations capture external alerts. Prometheus pushes via Opsgenie webhook: filter by severity to avoid alert fatigue.
Add a Generic API Integration
```hcl
resource "opsgenie_api_integration" "prometheus" {
  name          = "Prometheus Alerts"
  type          = "Prometheus"
  owner_team_id = opsgenie_team.sre_team.id
  enabled       = true
}
```

The native `Prometheus` integration type generates an API key (exposed as the resource's `api_key` attribute) that Alertmanager uses to push rich alerts, labels and annotations included. By default every alert routes to the owning team; refine routing with integration actions or alert filters before production. Keeping `enabled = true` activates the integration without downtime.
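To act on the "filter by severity" advice above, the provider's `opsgenie_integration_action` resource can restrict alert creation to high-severity events. A sketch under two assumptions: the integration is an `opsgenie_api_integration` named `prometheus`, and Prometheus attaches a `severity:critical` tag:

```hcl
resource "opsgenie_integration_action" "prometheus_critical" {
  integration_id = opsgenie_api_integration.prometheus.id

  create {
    name = "Create only critical alerts"

    filter {
      type = "match-all-conditions"
      conditions {
        field          = "tags"
        operation      = "contains"
        expected_value = "severity:critical"
      }
    }
  }
}
```

Non-matching alerts are simply dropped at the integration boundary, which is the cheapest place to fight alert fatigue.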
Trigger an Alert via API
```bash
#!/bin/bash
API_KEY="your-api-key-here"
TEAM_ID="$(terraform output -raw team_id)"

curl -X POST "https://api.opsgenie.com/v2/alerts" \
  -H "Authorization: GenieKey $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "message": "CPU > 90% on prod-app",
    "alias": "cpu-high-prod",
    "description": "Prometheus metric: cpu_usage=95%",
    "responders": [{"id": "'"$TEAM_ID"'", "type": "team"}],
    "priority": "P2",
    "tags": ["prometheus", "cpu"],
    "details": {
      "host": "app-01",
      "value": "95"
    }
  }'
```

A working Bash test script: it routes the alert to the Terraform-provisioned team via the `responders` field (the v2 alerts API expects responder objects, not a bare `teams` array). The `alias` makes the call idempotent, since Opsgenie deduplicates alerts sharing an alias. Replace the placeholder key; in production, Prometheus pushes through the integration webhook instead. Pitfall: without a responder, the alert lands unassigned.
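Thanks to the alias, the same alert can later be acknowledged or closed without looking up its ID; a sketch using the v2 alerts close endpoint (the key is a placeholder):

```bash
API_KEY="your-api-key-here"

# Close by alias instead of alert ID -- no lookup round-trip needed.
curl -X POST "https://api.opsgenie.com/v2/alerts/cpu-high-prod/close?identifierType=alias" \
  -H "Authorization: GenieKey $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"note": "Resolved: CPU back under threshold"}'
```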
Define an Escalation Policy
Advanced setups chain responders and schedules with conditional delays. Analogy: like a relay race, the baton passes to the next runner if the alert stays unresolved.
Alert Policy with Escalations
```hcl
resource "opsgenie_notification_policy" "sre_policy" {
  name               = "SRE Notification Policy"
  team_id            = opsgenie_team.sre_team.id
  policy_description = "Delays noisy notifications for 5 minutes"
  enabled            = true

  filter {
    type = "match-all"
  }

  delay_action {
    delay_option = "for-duration"
    duration {
      time_amount = 5
      time_unit   = "minutes"
    }
  }
}

resource "opsgenie_escalation" "sre_escalation" {
  name          = "SRE On-Call Escalation"
  description   = "Escalate P1 after 5 minutes"
  owner_team_id = opsgenie_team.sre_team.id

  rules {
    condition   = "if-not-acked"
    notify_type = "default"
    delay       = 5

    recipient {
      type = "user"
      id   = opsgenie_user.sre_oncall.id
    }
  }
}
```

The notification policy matches the team's alerts and holds noisy notifications for five minutes; the escalation re-notifies the on-call user whenever an alert sits unacknowledged past the same window. The `if-not-acked` condition prevents unnecessary escalations for alerts that are handled quickly. Implicit references keep the dependency graph acyclic; if `apply` complains about ordering, review the references rather than stacking `depends_on`.
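Escalations usually target an on-call schedule rather than a fixed user. A hedged sketch of a weekly rotation using the provider's schedule resources (timezone and start date are illustrative):

```hcl
resource "opsgenie_schedule" "sre_oncall" {
  name          = "SRE On-Call Schedule"
  timezone      = "Europe/Paris"
  enabled       = true
  owner_team_id = opsgenie_team.sre_team.id
}

resource "opsgenie_schedule_rotation" "weekly" {
  schedule_id = opsgenie_schedule.sre_oncall.id
  name        = "weekly-rotation"
  type        = "weekly"
  start_date  = "2026-01-05T09:00:00Z"
  length      = 1

  participant {
    type = "user"
    id   = opsgenie_user.sre_oncall.id
  }
}
```

Pointing an escalation rule's `recipient` at the schedule (type `schedule`) then pages whoever is currently on rotation instead of a hardcoded person.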
Configure a Heartbeat
```hcl
resource "opsgenie_heartbeat" "sre_heartbeat" {
  name           = "sre-prod-heartbeat"
  description    = "Checks Prometheus -> Opsgenie connectivity"
  enabled        = true
  interval       = 5
  interval_unit  = "minutes"
  owner_team_id  = opsgenie_team.sre_team.id
  alert_message  = "Missing heartbeat: is SRE Prod down?"
  alert_priority = "P3"
  alert_tags     = ["sre", "monitoring"]
}
```

Opsgenie expects a ping at least once per interval (the provider takes `interval` plus `interval_unit`, with minutes as the smallest unit) and raises the configured P3 alert to the owning team when the heartbeat goes silent. In production, ping `/v2/heartbeats/{name}/ping` from a cron job; Terraform manages the resource lifecycle.
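The heartbeat only stays green if something actually pings it. A minimal cron-able sketch against the v2 heartbeat ping endpoint (the key is a placeholder, and the heartbeat name must match the Terraform resource's `name`):

```bash
#!/bin/bash
# Ping the Opsgenie heartbeat; schedule from cron more often than the
# configured interval, e.g.:  * * * * * /usr/local/bin/opsgenie-ping.sh
API_KEY="your-api-key-here"
HEARTBEAT_NAME="sre-prod-heartbeat"

curl -s -X GET "https://api.opsgenie.com/v2/heartbeats/${HEARTBEAT_NAME}/ping" \
  -H "Authorization: GenieKey $API_KEY"
```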
Apply and Outputs
Run `terraform init`, then `terraform plan`, then `terraform apply`. Useful outputs: `team_id`, `policy_id`, `heartbeat_name`. Version everything in Git and drive CI/CD with Atlantis.
Outputs and Apply Script
```hcl
output "team_id" {
  value = opsgenie_team.sre_team.id
}

output "policy_id" {
  value = opsgenie_notification_policy.sre_policy.id
}

output "heartbeat_name" {
  value = opsgenie_heartbeat.sre_heartbeat.name
}
```

Outputs expose IDs for scripts and orchestration. Mark anything secret with `sensitive = true` so CI runners such as GitHub Actions don't print it. After `apply`, consume the values with `terraform output -json` in pipelines.
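In a pipeline, the JSON output form is the easiest to consume; a sketch assuming `jq` is available on the runner:

```bash
# Extract a single output value for downstream scripts.
TEAM_ID=$(terraform output -raw team_id)

# Or take the whole output map as JSON and pick fields with jq.
terraform output -json | jq -r '.heartbeat_name.value'
```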
Best Practices
- Modular IaC: split teams and policies into reusable modules.
- Secrets management: use Vault or Terraform Cloud; never commit `tfvars` files to the repo.
- State backend: S3 + DynamoDB for multi-developer collaboration and state locking.
- Testing: `terratest` to validate alerts post-apply.
- Drift detection: a weekly `terraform plan` via cron to catch changes made in the UI.
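The state-backend advice above can be sketched as follows (bucket, region, and table names are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"   # placeholder bucket name
    key            = "opsgenie/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "terraform-locks"           # enables state locking
    encrypt        = true
  }
}
```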
Common Errors to Avoid
- Expired API key: regenerate via the UI (rotation is not retroactive), then run `terraform refresh`.
- Rate limits: roughly 100 req/min; batch resources with `count`/`for_each` to stay under them.
- Policy V1 vs V2: V2 is required in 2026; migrate legacy setups.
- Missing `depends_on`: add it explicitly for user → team ordering only when no implicit reference exists.
Next Steps
Dive deeper with the Opsgenie provider documentation on the Terraform Registry. Integrate hybrid PagerDuty setups. Check our Learni DevOps training courses for SRE certification. And read Atlassian's 2026 documentation on advanced webhooks.