How to Implement Chaos Mesh in Kubernetes in 2026

Introduction

Chaos Mesh is a leading open-source platform for chaos engineering on Kubernetes and a CNCF incubating project. Rather than relying on ad-hoc scripts, it exposes a Kubernetes-native API of CRDs (Custom Resource Definitions) that model faults as resources: PodChaos, NetworkChaos, StressChaos, and more. Faults are injected at runtime by a per-node chaos daemon through the container runtime and kernel facilities (tc, cgroups, FUSE), so nothing in your workloads has to be rebuilt or redeployed.

Why use it? A large share of production incidents trace back to failing dependencies, network degradation, and resource exhaustion; Chaos Mesh simulates outages, latencies, and overloads so you can validate resilience before your users find the gaps. Imagine your e-commerce API: a NetworkChaos injecting 500ms of latency quickly exposes a mishandled timeout. This advanced tutorial covers deployment, complex experiments, and workflows, with working YAML manifests you can wire into your CI/CD pipelines.

Prerequisites

  • Kubernetes 1.30+ cluster (Minikube, Kind, or EKS with 4+ nodes for advanced tests)
  • kubectl 1.30+ configured
  • Helm 3.14+
  • Dedicated namespace chaos-testing
  • Test application: Simple Nginx Deployment (exposed via Service)
  • Access to Chaos Mesh Dashboard via port-forward

Installing Chaos Mesh via Helm

install-chaos-mesh.sh
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create namespace chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-testing \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
  --set dashboard.create=true

This script adds the official Helm repo, creates the namespace, and deploys Chaos Mesh 2.6+ with the chaos-daemon configured for containerd (the default runtime on most clusters today); the chart installs the CRDs, and dashboard.create=true enables the web UI. Pitfall: set chaosDaemon.runtime and chaosDaemon.socketPath to match your actual runtime (CRI-O uses /var/run/crio/crio.sock), otherwise PodChaos injections fail.
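
If your nodes run CRI-O instead of containerd, the same install with adjusted overrides might look like this (a sketch; verify the socket path on one of your nodes before running it):

install-chaos-mesh-crio.sh
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace=chaos-testing \
  --set chaosDaemon.runtime=crio \
  --set chaosDaemon.socketPath=/var/run/crio/crio.sock \
  --set dashboard.create=true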

Verification and Dashboard Access

Run the script above, then kubectl get pods -n chaos-testing to confirm the components: chaos-controller-manager, chaos-daemon (one per node), and chaos-dashboard. Port-forward the dashboard with kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333 and open http://localhost:2333. With securityMode enabled (the default), the dashboard asks for a Kubernetes service account token rather than a username and password; for throwaway clusters you can skip authentication by adding --set dashboard.securityMode=false at install time. The UI lists the chaos CRDs and visualizes experiments in real time with per-experiment events, while the controller and daemon expose Prometheus metrics for external dashboards.
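
With token sign-in, any service account token works as a credential, and its RBAC decides what the dashboard may show or mutate. The UI can generate the manifest for you; a minimal read-only sketch (the chaos-viewer name is arbitrary) looks like this:

chaos-viewer-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-viewer
  namespace: chaos-testing
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-viewer
rules:
- apiGroups: [""]
  resources: ["pods", "namespaces"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["chaos-mesh.org"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-viewer
subjects:
- kind: ServiceAccount
  name: chaos-viewer
  namespace: chaos-testing

Apply it, generate a token with kubectl create token chaos-viewer -n chaos-testing, and paste that token into the login form. To create experiments from the UI, the role also needs create, update, and delete verbs on chaos-mesh.org resources.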

Deploying a Test App (Nginx)

nginx-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.27-alpine
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
  namespace: default
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP

This manifest deploys 3 Nginx replicas to simulate a scalable app. Apply with kubectl apply -f nginx-app.yaml and check with kubectl get pods -l app=nginx. It is the target of every chaos experiment below, selected via labelSelectors with app: nginx.
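
Before injecting any chaos, capture a baseline request time to compare against later experiments. A quick in-cluster check using a throwaway curl pod (the curl-client name is arbitrary):

kubectl run curl-client --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w 'nginx-svc total time: %{time_total}s\n' \
  http://nginx-svc.default.svc.cluster.local/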

PodChaos: Simulating Pod Crashes

pod-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: nginx
  gracePeriod: 0

This PodChaos kills one randomly chosen Nginx pod (mode: one); the Deployment controller immediately schedules a replacement, which is exactly the recovery path you want to observe. gracePeriod: 0 means a hard SIGKILL; a higher value gives the container a graceful SIGTERM window first. Watch the flapping and the recovery (and HPA behaviour if configured) in the dashboard. To repeat the kill on a timer, wrap the experiment in a Schedule as shown below.
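
A standalone PodChaos fires once and is done. For recurring runs, Chaos Mesh v2 wraps the experiment in a Schedule resource that embeds the same spec; a minimal sketch that kills a pod every 30 seconds (historyLimit keeps old runs from piling up):

pod-kill-schedule.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-every-30s
  namespace: default
spec:
  schedule: "@every 30s"
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: nginx
    gracePeriod: 0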

Monitoring PodChaos

Test it: kubectl apply -f pod-chaos.yaml, then kubectl get pods -l app=nginx -w to watch the kill and the replacement, and kubectl describe podchaos pod-kill-example for the injection events. In the dashboard, click the experiment: the charts show the kill and the recovery. Keep at least 3 replicas so the service absorbs the loss. Analogy: like a storm felling one tree in a forest, it tests whether the rest of the canopy holds.
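
If a kill gets in the way while you debug, you can pause the experiment in place rather than deleting it; Chaos Mesh reads a pause annotation on the experiment object (shown here for the PodChaos above):

kubectl annotate podchaos pod-kill-example -n default experiment.chaos-mesh.org/pause=true
# Remove the annotation to resume the experiment
kubectl annotate podchaos pod-kill-example -n default experiment.chaos-mesh.org/pause-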

NetworkChaos: Latency and Packet Loss

network-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-loss
  namespace: default
spec:
  action: delay
  mode: one
  selector:
    namespaces:
    - default
    labelSelectors:
      app: nginx
  delay:
    latency: "500ms"
    jitter: "100ms"
    correlation: "80"
  direction: to
  duration: "30s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-example
  namespace: default
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: nginx
  loss:
    loss: "25"
    correlation: "25"
  direction: to
  duration: "20s"

Two NetworkChaos experiments: delay adds 500ms of latency with 100ms jitter (80% correlated) to traffic leaving the Nginx pods, and loss drops 25% of outgoing packets from one pod. Under the hood, the chaos-daemon shapes traffic with tc/netem inside the target pod's network namespace. If you need full isolation instead, the partition action with a target selector cuts traffic entirely between two groups of pods. Pitfall: direction: to affects traffic leaving the selected pods; use from or both to also cover the return path, and add a target selector to restrict which peers are impacted.
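
Re-run the baseline check from the Nginx section while the delay experiment is active; response time should jump by roughly the injected latency plus jitter. With mode: one only a single pod is affected, so repeat the request a few times (or switch mode to all) to hit the slow replica:

kubectl apply -f network-chaos.yaml
kubectl run curl-client --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s -o /dev/null -w 'nginx-svc total time: %{time_total}s\n' \
  http://nginx-svc.default.svc.cluster.local/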

StressChaos: CPU/Memory Overload

stress-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: stress-cpu-mem
  namespace: default
spec:
  mode: all
  selector:
    labelSelectors:
      app: nginx
  containerNames:
  - nginx
  stressors:
    cpu:
      workers: 2
      load: 80
    memory:
      workers: 1
      size: "512MB"
  duration: "60s"

StressChaos runs 2 CPU workers at 80% load plus one memory worker allocating 512MB inside the nginx container of every matching pod. The chaos-daemon starts stress-ng workers inside the target container's cgroups, so the pressure counts against the pod's own limits; with a memory limit below the allocation you will see an OOMKill. Monitor with kubectl top pods. Pitfall: keep cpu.workers in line with the vCPUs actually available to the pod, otherwise the load percentage loses its meaning.
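
The stress only produces visible symptoms if the pods have limits to push against. One way to add them to the test Deployment (illustrative values; the 256Mi memory limit sits deliberately below the 512MB allocation so an OOMKill becomes observable):

kubectl set resources deployment nginx-app -c nginx \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=256Mi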

IOChaos: Disk Latency and Faults

io-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-fill
  namespace: default
spec:
  action: latency
  mode: one
  selector:
    namespaces:
    - default
    labelSelectors:
      app: nginx
  volumePath: /var/cache/nginx
  path: /var/cache/nginx/**/*
  delay: "2s"
  percent: 50
  duration: "45s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: disk-fill
  namespace: default
spec:
  action: fault
  mode: all
  selector:
    labelSelectors:
      app: nginx
  volumePath: /var/cache/nginx
  path: /var/cache/nginx/**/*
  errno: 28
  percent: 100
  duration: "30s"

The latency experiment adds a 2s delay to 50% of I/O operations under /var/cache/nginx on one pod; the fault experiment makes every I/O call on that path fail with errno 28 (ENOSPC, "no space left on device") on all pods, simulating a full disk without actually filling it. Injection works through a FUSE layer the chaos-daemon mounts over the volume, which makes this pattern especially useful for StatefulSet-backed databases. The dashboard shows the resulting IOPS drop. Pitfall: volumePath must be a real volume mount in the pod, otherwise the injection is skipped; see the patch below for backing /var/cache/nginx with an emptyDir.
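
For the stock Nginx image, /var/cache/nginx is just a directory on the container filesystem, not a volume, so IOChaos would have nothing to hook. One way to fix that is to back it with an emptyDir; a sketch using a JSON patch against the existing Deployment (the pods restart with the new volume):

kubectl patch deployment nginx-app --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/volumes", "value": [{"name": "nginx-cache", "emptyDir": {}}]},
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts", "value": [{"name": "nginx-cache", "mountPath": "/var/cache/nginx"}]}
]'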

Advanced Chaos Workflows

To chain experiments, use the Workflow CRD: Serial and Parallel templates compose individual chaos templates (PodChaos, NetworkChaos, and so on), Suspend templates add wait steps, and Task templates allow conditional branches. Keep the blast radius small with namespace and label selectors, and commit workflows to Git so ArgoCD or Flux can run chaos as part of CI.

Sequential ChaosWorkflow Example

chaos-workflow.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: sequential-chaos
  namespace: default
spec:
  entry: entry
  templates:
  - name: entry
    templateType: Serial
    deadline: 10m
    children:
    - pod-kill-task
    - network-delay-task
  - name: pod-kill-task
    templateType: PodChaos
    deadline: 15s
    podChaos:
      action: pod-kill
      mode: one
      selector:
        labelSelectors:
          app: nginx
      gracePeriod: 0
  - name: network-delay-task
    templateType: NetworkChaos
    deadline: 20s
    networkChaos:
      action: delay
      mode: one
      selector:
        labelSelectors:
          app: nginx
      delay:
        latency: "1s"
      direction: to

This Workflow's entry template runs its children serially: first the pod kill, then 1s of added latency, with the overall 10m deadline bounding the whole run and each template's deadline bounding its own injection. Launch it and track progress under 'Workflows' in the dashboard. Pitfall: the embedded podChaos and networkChaos blocks must follow exactly the same schema as the standalone CRDs.
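
To follow the run from the CLI (resource names come from the manifest above; the controller creates one WorkflowNode per template to record per-step state):

kubectl apply -f chaos-workflow.yaml
kubectl get workflow sequential-chaos -n default
kubectl get workflownodes -n default
kubectl describe workflow sequential-chaos -n default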

Best Practices

  • Controlled blast radius: always use precise namespace and label selectors, test in staging first, and enable the controller's namespace filtering so only explicitly annotated namespaces can be targeted.
  • Scheduling: use the Schedule CRD for recurring experiments and keep durations short, so injections look like realistic bursts (e.g., peak-hour spikes).
  • Observability: scrape the Prometheus metrics exposed by chaos-controller-manager and chaos-daemon and graph them in Grafana next to your SLO dashboards, so every experiment can be correlated with user impact.
  • Protect the tooling: give chaos-daemon the tolerations it needs to run on every node and a high priorityClass for the controller, so you can always pause or delete experiments mid-incident.
  • CI/CD: keep experiment YAML in Git and let Flux or ArgoCD apply it per environment on tags or merges; a sketch of an ArgoCD Application follows this list.
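
A minimal sketch of the ArgoCD side, assuming the experiment manifests live under a staging/ folder in a Git repo (the repoURL below is hypothetical):

chaos-argocd-app.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/chaos-experiments.git  # hypothetical repo
    targetRevision: main
    path: staging
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true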

Common Errors to Avoid

  • DaemonSet not ready: Forgot runtime socketPath → ChaosDaemon pending; debug with kubectl logs -n chaos-testing daemonset/chaos-daemon.
  • Selector mismatch: No matching pods → experiment 'Running' but no-op; validate kubectl get pods -l app=nginx.
  • Slow or graceful kills when you wanted abrupt ones: a non-zero gracePeriod gives the container a SIGTERM window; set gracePeriod: 0 in PodChaos for an immediate SIGKILL.
  • Workflow deadline too short: templates get aborted before their chaos completes; size the entry deadline to cover the sum of all child deadlines (30 minutes or more for complex chains).

Next Steps