How to Conduct Blameless Postmortems in 2026

Introduction

Blameless postmortems are a key practice in SRE (Site Reliability Engineering) and DevOps, popularized by giants like Google. Unlike traditional reviews that point fingers at human errors, they focus on systemic issues to turn every incident into a collective lesson.

Why adopt this approach in 2026? With the growing complexity of microservices, hybrid clouds, and generative AI, incidents are inevitable. But blaming stifles innovation: 70% of engineers stay silent out of fear (source: Google SRE Book). A blameless postmortem encourages transparency, cuts recurrences by 50%, and builds resilience.

This beginner tutorial guides you through a complete process: from structured templates to automation scripts. By the end, you'll produce actionable reports ready for your GitHub or Notion workflow. Estimated time: 30 minutes for your first postmortem.

Prerequisites

  • Basic knowledge of DevOps or SRE (helpful, not required).
  • Tools: Git, a Markdown editor (VS Code), plus Bash, yq, and Python 3 with PyYAML for the optional scripts.
  • Access to a recent incident (or use our simulated example).
  • Team of 3-5 people for the meeting (optional for solo testing).

Create the Structured YAML Template

postmortem-template.yaml
title: "[INCIDENT] Postmortem title"
lead: "Lead author"
date: "YYYY-MM-DD"
severity: "SEV1" # SEV1 to SEV4
timeline:
  - time: "YYYY-MM-DD HH:MM"
    event: "Incident start"
    who: "Team / System"
  - time: "YYYY-MM-DD HH:MM"
    event: "Detection"
    who: "Tool / Person"
impact:
  users_affected: 10000
  duration: "2h30"
  slo_breached: "Availability < 99.9%"
root_cause: "Blameless description"
lessons_learned:
  - "What worked well"
  - "Systemic improvements"
actions:
  - task: "Implement alert X"
    owner: "@user"
    due: "YYYY-MM-DD"
    status: "TODO"
blameless_notes: "Focus on processes, not people"

This YAML template defines the standard structure for a blameless postmortem, inspired by the Google SRE Workbook. It keeps timeline, impact, and actions in separate sections so the facts stay distinct from the conclusions. Use it as a skeleton, and check that no field contains blaming terms such as 'human error'.

Step 1: Understand and Adapt the Template

Start by copying this YAML into a file. It enforces a chronological view (timeline) to avoid retrospective biases, like a police investigation: facts first, then root causes.

Adapt it to your stack: add fields like affected_services: ['api-v1', 'db-cluster']. Analogy: It's like a medical form—structured to diagnose without judging the patient.
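
Once fields are added, it can help to check that a draft still contains everything the template expects. The following is a minimal sketch, assuming the YAML has already been parsed into a Python dict (for example with PyYAML). The function name missing_keys and the draft data are hypothetical; the key names come from the template above, and affected_services is the optional field suggested in this step.

```python
# Top-level keys used by the template in this tutorial.
REQUIRED_KEYS = {
    "title", "lead", "date", "severity", "timeline",
    "impact", "root_cause", "lessons_learned", "actions", "blameless_notes",
}


def missing_keys(postmortem, extra_required=()):
    """Return the sorted required keys that are absent from a parsed postmortem."""
    required = REQUIRED_KEYS | set(extra_required)
    return sorted(required - set(postmortem))


# Hypothetical draft with only a few fields filled in so far.
draft = {"title": "[SEV2] API v1 unavailable", "severity": "SEV2", "timeline": []}

for key in missing_keys(draft, extra_required=["affected_services"]):
    print(f"missing field: {key}")
```

Passing the stack-specific fields through extra_required keeps the base list identical across teams while letting each team enforce its own additions.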

Generate Markdown from YAML

generate-md.sh
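#!/bin/bash
# Render a postmortem YAML file as Markdown. Requires yq v4 (mikefarah/yq).
set -euo pipefail

if [ "$#" -ne 1 ]; then
  echo "Usage: $0 postmortem.yaml" >&2
  exit 1
fi

YAML_FILE=$1
MD_FILE="${YAML_FILE%.yaml}.md"

# Evaluate one yq expression against the input file.
field() { yq eval "$1" "$YAML_FILE"; }

{
  echo "# $(field '.title')"
  echo
  echo "**Lead:** $(field '.lead')  | **Date:** $(field '.date')  | **Severity:** $(field '.severity')"
  echo
  echo "## Timeline"
  field '.timeline[] | "- **" + .time + "**: " + .event + " (" + .who + ")"'
  echo
  echo "## Impact"
  # join() is used so that numeric values such as users_affected are rendered as text.
  field '.impact | to_entries | .[] | "- " + ([.key, .value] | join(": "))'
  echo
  echo "## Root Cause"
  field '.root_cause'
  echo
  echo "## Lessons Learned"
  field '.lessons_learned[] | "- " + .'
  echo
  echo "## Actions"
  field '.actions[] | "- task: " + .task + " (owner: " + .owner + ", due: " + .due + ", status: " + .status + ")"'
  echo
  echo "## Blameless Notes"
  field '.blameless_notes'
} > "$MD_FILE"

echo "Markdown generated: $MD_FILE"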

This Bash script uses yq (mikefarah's yq v4; install via brew install yq) to convert the YAML into readable Markdown. It automates report generation and avoids manual copy-paste. Run bash generate-md.sh postmortem-template.yaml to test; the output is written next to the input as postmortem-template.md.

Step 2: Fill with a Sample Incident

Let's simulate an incident: API down due to a failed deployment. Fill the YAML without blaming ('bad config' → 'missing auto-validation').

Golden rule: every event is factual (who/what/when), not opinionated. Gather the team for a max 60-minute brainstorm.
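
The who/what/when rule can be checked mechanically before the meeting. The following is a minimal sketch, assuming the timeline has already been parsed into a list of dicts with the template's time, event, and who keys. The function name timeline_problems and the sample entries are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # the template's "YYYY-MM-DD HH:MM"


def timeline_problems(timeline):
    """Return readable problems: missing who/what/when, bad timestamps, wrong order."""
    problems = []
    previous = None
    for index, entry in enumerate(timeline, start=1):
        missing = [key for key in ("time", "event", "who") if not entry.get(key)]
        problems.extend(f"entry {index}: missing '{key}'" for key in missing)
        if "time" in missing:
            continue
        try:
            current = datetime.strptime(str(entry["time"]), TIME_FORMAT)
        except ValueError:
            problems.append(f"entry {index}: time is not YYYY-MM-DD HH:MM")
            continue
        if previous is not None and current < previous:
            problems.append(f"entry {index}: earlier than the previous entry")
        previous = current
    return problems


# Hypothetical draft: the second entry has no 'who' and is out of order.
sample_timeline = [
    {"time": "2026-01-15 14:05", "event": "404 error rate at 50%", "who": "Prometheus Alert"},
    {"time": "2026-01-15 14:00", "event": "Deployment of v1.2.3", "who": ""},
]

for problem in timeline_problems(sample_timeline):
    print(problem)
```

A check like this only confirms that each entry is complete and in order; whether the wording is factual rather than opinionated still needs a human read.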

Filled YAML for API Incident

api-incident.yaml
title: "[SEV2] API v1 unavailable for 2h30"
lead: "Alice Dupont"
date: "2026-01-15"
severity: "SEV2"
timeline:
  - time: "2026-01-15 14:00"
    event: "Deployment of v1.2.3 to the prod cluster"
    who: "CI/CD Pipeline"
  - time: "2026-01-15 14:05"
    event: "404 error rate at 50%"
    who: "Prometheus Alert"
  - time: "2026-01-15 14:10"
    event: "Manual rollback initiated"
    who: "On-call Engineer"
  - time: "2026-01-15 16:30"
    event: "Service restored"
    who: "Pipeline"
impact:
  users_affected: 5000
  duration: "2h30"
  slo_breached: "99.5% → 98.2%"
root_cause: "DB migration shipped without schema validation in staging"
lessons_learned:
  - "Early Prometheus alerts worked well"
  - "Add a pre-deploy schema test"
actions:
  - task: "Implement schema validation in CI"
    owner: "@bob"
    due: "2026-02-01"
    status: "TODO"
  - task: "Update the rollback playbook"
    owner: "@alice"
    due: "2026-01-20"
    status: "IN PROGRESS"
blameless_notes: "No individual fault: gap in the validation process"

Concrete API incident example: note the systemic focus ('pipeline' instead of 'faulty dev'). This YAML is complete—copy it and generate MD with the previous script to visualize.
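
The duration field should agree with the timeline rather than be typed from memory. The following is a minimal sketch that derives it from two timestamps of the example above; the helper name elapsed is hypothetical, and it assumes both timestamps use the template's YYYY-MM-DD HH:MM format.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"


def elapsed(start, end):
    """Format the time between two timeline timestamps, e.g. '2h30'."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    minutes = int(delta.total_seconds() // 60)
    if minutes < 0:
        raise ValueError("end is earlier than start")
    return f"{minutes // 60}h{minutes % 60:02d}"


# Deployment to service restored: total incident duration.
print(elapsed("2026-01-15 14:00", "2026-01-15 16:30"))  # 2h30
# Deployment to first alert: time to detect.
print(elapsed("2026-01-15 14:00", "2026-01-15 14:05"))  # 0h05
```

The first result matches the duration recorded in the impact section; the second gives a time-to-detect figure that could be added as an extra impact field.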

Python Script to Validate Blameless

validate-blameless.py
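import re
import sys

import yaml  # PyYAML: pip install pyyaml

# Phrases that pin an outcome on a person rather than on the system.
# Illustrative list: single words such as 'error' or 'fault' are avoided because
# they also appear in neutral text (for example '404 error rate').
BLAME_PHRASES = ['human error', 'forgot', 'should have', 'failed to', 'careless', 'negligent']


def load_yaml(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return yaml.safe_load(f)


def is_blameless(data):
    # Flatten the whole document to lowercase text and search it phrase by phrase.
    text = str(data).lower()
    for phrase in BLAME_PHRASES:
        # re.escape keeps any regex metacharacters in a phrase literal.
        if re.search(rf'\b{re.escape(phrase)}\b', text):
            return False, f'Blaming phrase detected: {phrase}'
    return True, 'OK'


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Usage: python validate-blameless.py file.yaml')
        sys.exit(1)
    data = load_yaml(sys.argv[1])
    ok, msg = is_blameless(data)
    print(f'Blameless validation: {msg}')
    if not ok:
        sys.exit(1)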

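This Python script scans the YAML for blaming phrases to enforce the blameless rule. Install the dependency with pip install pyyaml, then run python validate-blameless.py api-incident.yaml. The script exits with status 1 when it finds a match, so a biased report can be stopped before sharing. Keyword matching is only a first-pass filter and will not catch every form of blame.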
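
The script above stops at the first match and does not say where it occurred. One possible extension, sketched here under assumptions, walks the parsed structure and reports the field path of every hit. The function name find_blame, the short phrase list, and the sample report are all hypothetical.

```python
import re

# Illustrative phrase list (an assumption); tune it to your team's vocabulary.
BLAME_PHRASES = ["human error", "forgot", "should have", "failed to"]


def find_blame(node, path=""):
    """Yield (field_path, phrase) for every string that contains a blame phrase."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_blame(value, f"{path}.{key}" if path else str(key))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_blame(item, f"{path}[{index}]")
    elif isinstance(node, str):
        lowered = node.lower()
        for phrase in BLAME_PHRASES:
            if re.search(rf"\b{re.escape(phrase)}\b", lowered):
                yield path, phrase


# Hypothetical report containing two blaming statements.
report = {
    "root_cause": "The dev forgot to run the migration check",
    "timeline": [{"event": "Manual rollback initiated", "who": "On-call Engineer"}],
    "lessons_learned": ["The reviewer should have caught it"],
}

for field_path, phrase in find_blame(report):
    print(f"{field_path}: '{phrase}'")
```

Reporting the path lets the author reword one specific field instead of rereading the whole document.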

Step 3: Automate and Share

Integrate with GitHub: create a postmortems repo with an ISSUE_TEMPLATE, and run the scripts in GitHub Actions for automatic validation.

Meeting: allow 5 minutes per timeline point and vote on actions (for example via Mentimeter). Publish an anonymized version on the wiki.

GitHub Actions Config for Validation

.github/workflows/validate-postmortem.yml
name: Validate Blameless Postmortem

on:
  pull_request:
    paths: ['**/*.yaml']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install deps
        run: pip install pyyaml
      - name: Validate YAML
        run: python validate-blameless.py postmortem-file.yaml
      - name: Generate MD
        run: bash generate-md.sh postmortem-file.yaml

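This GitHub workflow runs the validation and generates the Markdown on every PR that touches a YAML file. Copy it to .github/workflows/ and replace postmortem-file.yaml with the path to your report. To actually block non-blameless merges, mark the validate job as a required status check in the repository's branch protection rules.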

Final Generated Markdown Template

api-incident.md
# [SEV2] API v1 unavailable for 2h30

**Lead:** Alice Dupont  | **Date:** 2026-01-15  | **Severity:** SEV2

## Timeline
- **2026-01-15 14:00**: Deployment of v1.2.3 to the prod cluster (CI/CD Pipeline)
- **2026-01-15 14:05**: 404 error rate at 50% (Prometheus Alert)
- **2026-01-15 14:10**: Manual rollback initiated (On-call Engineer)
- **2026-01-15 16:30**: Service restored (Pipeline)

## Impact
- users_affected: 5000
- duration: 2h30
- slo_breached: 99.5% → 98.2%

## Root Cause
DB migration shipped without schema validation in staging

## Lessons Learned
- Early Prometheus alerts worked well
- Add a pre-deploy schema test

## Actions
- task: Implement schema validation in CI (owner: @bob, due: 2026-02-01, status: TODO)
- task: Update the rollback playbook (owner: @alice, due: 2026-01-20, status: IN PROGRESS)

## Blameless Notes
No individual fault: gap in the validation process

Automatically generated Markdown: ready for GitHub Wiki or Confluence. It's complete, visual, and emphasizes trackable actions.

Best Practices

  • Always time the meeting: Max 60 min to stay focused.
  • Anonymize when sharing: Remove names to encourage honesty.
  • Track actions in Jira/Tickets: Link back to YAML.
  • Review SLOs: Include metrics to quantify impact.
  • Culture: Share success stories to build adoption.
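
The anonymization practice above can be sketched as a small transformation on the parsed report. This is a minimal sketch, not a complete scrubber: it assumes names only appear in the lead field and in each action's owner field, and the function name anonymize and the replacement labels are hypothetical.

```python
import copy


def anonymize(postmortem):
    """Return a copy with the lead and action owners replaced by neutral labels."""
    shared = copy.deepcopy(postmortem)  # leave the internal version untouched
    if "lead" in shared:
        shared["lead"] = "Incident lead"
    labels = {}
    for action in shared.get("actions", []):
        owner = action.get("owner")
        if owner:
            # The same person always maps to the same label.
            labels.setdefault(owner, f"Owner {len(labels) + 1}")
            action["owner"] = labels[owner]
    return shared


internal = {
    "lead": "Alice Dupont",
    "actions": [
        {"task": "Implement schema validation in CI", "owner": "@bob"},
        {"task": "Update the rollback playbook", "owner": "@alice"},
    ],
}

public = anonymize(internal)
print(public["lead"], [action["owner"] for action in public["actions"]])
```

Names that appear in free text, such as timeline events, would still need a manual pass before publishing.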

Common Mistakes to Avoid

  • Implicit blaming: 'The dev forgot' → caught by script.
  • No timeline: Leads to biased narratives; enforce chronology.
  • Forgetting actions: 80% of postmortems fail without owners/dates.
  • Meetings too long: >60 min dilutes energy; strict agenda.
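
The "forgetting actions" mistake is the easiest one to catch automatically. The following is a minimal sketch, assuming actions are parsed into dicts with the template's task, owner, due, and status keys and that due dates are ISO formatted. The function name action_problems is hypothetical, and "DONE" is an assumed status value, since the template only shows TODO and IN PROGRESS.

```python
from datetime import date


def action_problems(actions, today):
    """Flag actions with no owner, no due date, or a due date that has passed."""
    problems = []
    for action in actions:
        task = action.get("task", "<untitled>")
        if not action.get("owner"):
            problems.append(f"{task}: no owner")
        due = action.get("due")
        if not due:
            problems.append(f"{task}: no due date")
        elif date.fromisoformat(due) < today and action.get("status") != "DONE":
            problems.append(f"{task}: overdue since {due}")
    return problems


# Hypothetical data: the second action lost its owner and is past due.
actions = [
    {"task": "Implement schema validation in CI", "owner": "@bob",
     "due": "2026-02-01", "status": "TODO"},
    {"task": "Update the rollback playbook", "owner": "",
     "due": "2026-01-20", "status": "IN PROGRESS"},
]

for problem in action_problems(actions, today=date(2026, 1, 25)):
    print(problem)
```

Passing today as a parameter keeps the check deterministic; in a scheduled job it would typically be date.today().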

Next Steps

Read the Google SRE Workbook for advanced cases. Integrate with PagerDuty or Opsgenie via webhooks.
