Introduction
Chaos Mesh is a widely used open-source chaos engineering platform for Kubernetes and a CNCF incubating project. Rather than driving experiments through ad-hoc scripts, it exposes a Kubernetes-native API built on CRDs (Custom Resource Definitions) that model fault types such as PodChaos, NetworkChaos, and StressChaos. Fault injection is performed by a chaos-daemon DaemonSet that manipulates target pods at the container-runtime and kernel level (tc, iptables, cgroups), so no changes to your application are required.
Why use it? A large share of production incidents stem from failing dependencies: networks, disks, and neighboring services. Chaos Mesh simulates outages, latency, and resource pressure so you can validate resilience before real failures do. Imagine your e-commerce API: a NetworkChaos injecting 500 ms of latency quickly exposes a mishandled timeout. This advanced tutorial covers deployment, complex experiments, and workflows, with working YAML manifests you can adapt for your CI/CD pipelines.
Prerequisites
- Kubernetes 1.30+ cluster (Minikube, Kind, or EKS with 4+ nodes for advanced tests)
- kubectl 1.30+ configured
- Helm 3.14+
- Dedicated namespace chaos-testing
- Test application: simple Nginx Deployment (exposed via a Service)
- Access to Chaos Mesh Dashboard via port-forward
Installing Chaos Mesh via Helm
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create namespace chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh \
--namespace=chaos-testing \
--set chaosDaemon.runtime=containerd \
--set chaosDaemon.socketPath=/run/containerd/containerd.sock \
--set dashboard.create=true

This adds the official Helm repo, creates the chaos-testing namespace, and deploys Chaos Mesh with the chaos-daemon configured for containerd (the default runtime on current Kubernetes clusters). The dashboard is enabled for the web UI; Helm installs the CRDs automatically, so no separate flag is needed for them. Pitfall: chaosDaemon.socketPath must match your container runtime (CRI-O: /var/run/crio/crio.sock), otherwise PodChaos injections fail.
Verification and Dashboard Access
Run the commands above, then kubectl get pods -n chaos-testing to confirm the components: chaos-controller-manager, chaos-daemon, and chaos-dashboard. Port-forward the dashboard with kubectl port-forward -n chaos-testing svc/chaos-dashboard 2333:2333 and open http://localhost:2333. Since Chaos Mesh 2.x the dashboard uses token-based login: generate a service-account token following the instructions shown on the login page (or disable authentication for local testing with --set dashboard.securityMode=false at install time). The UI lists the chaos CRDs and lets you create and visualize experiments in real time, with event logs and integrated Prometheus metrics.
Deploying a Test App (Nginx)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27-alpine
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
  namespace: default
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP

This manifest deploys three Nginx replicas to simulate a scalable app. Apply it with kubectl apply -f nginx-app.yaml and check with kubectl get pods -l app=nginx. It is the target for all the chaos experiments below, selected via the label app=nginx.
PodChaos: Simulating Pod Crashes
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    labelSelectors:
      app: nginx
  gracePeriod: 0

This PodChaos kills one randomly selected Nginx pod; the Deployment's ReplicaSet then recreates it, which is exactly the recovery behavior you want to observe. mode: one targets a single matching pod (other modes include all, fixed, fixed-percent, and random-max-percent). Note that in Chaos Mesh 2.x the old in-spec scheduler/cron fields no longer exist; recurring experiments are defined with the separate Schedule CRD. Monitor via the dashboard: watch the pod flapping and, if configured, HPA recovery. Pitfall: gracePeriod: 0 forces an immediate hard kill; otherwise pods get a slow SIGTERM shutdown.
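In Chaos Mesh 2.x, recurring experiments are expressed with a dedicated Schedule resource that wraps the chaos spec (in-spec cron fields belong to the pre-2.0 API). A minimal sketch of a pod kill repeated every 30 seconds; the resource name and history settings here are illustrative:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-kill-every-30s
  namespace: default
spec:
  schedule: "@every 30s"     # cron-style expression
  type: PodChaos             # which chaos kind each run creates
  concurrencyPolicy: Forbid  # do not start a run while one is active
  historyLimit: 5            # keep the last 5 finished runs
  podChaos:                  # the embedded PodChaos spec
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: nginx
    gracePeriod: 0
```

Delete the Schedule to stop the recurrence; individual runs show up as child PodChaos objects in the dashboard.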
Monitoring PodChaos
Test it: kubectl apply -f pod-chaos.yaml, then watch recovery with kubectl get pods -l app=nginx -w. In the dashboard, click the experiment: the event timeline shows each kill and recovery. Run at least 3 replicas so the Service can absorb the loss. Analogy: like a storm felling one tree in a forest, it tests whether the forest survives.
NetworkChaos: Latency and Packet Loss
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-latency-loss
  namespace: default
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  delay:
    latency: "500ms"
    jitter: "100ms"
    correlation: "80"
  direction: to
  duration: "30s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss-example
  namespace: default
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: nginx
  loss:
    loss: "25"
    correlation: "25"
  direction: both
  duration: "20s"

Two NetworkChaos experiments: delay injects 500 ms of latency with 100 ms jitter (80% correlated) on one Nginx pod, and loss drops 25% of its packets in both directions. Under the hood, Chaos Mesh applies Linux traffic control (tc netem) rules through the chaos-daemon. The dashboard shows the affected pods and the experiment timeline. Pitfall: direction 'to' shapes traffic leaving the selected pods, 'from' shapes inbound traffic, and 'both' shapes both; recurring runs again go through a Schedule resource.
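A network partition, cutting a pod off from its peers, is another common NetworkChaos action. A minimal sketch, where the target block names the peers to isolate the selected pod from (selectors here are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition-example
  namespace: default
spec:
  action: partition
  mode: one              # isolate one matching pod
  selector:
    labelSelectors:
      app: nginx
  direction: both        # drop traffic in both directions
  target:                # the peers to partition the pod from
    mode: all
    selector:
      labelSelectors:
        app: nginx
  duration: "20s"
```

This simulates a split-brain between one replica and the rest of the Deployment; watch whether clients of nginx-svc notice.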
StressChaos: CPU/Memory Overload
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: stress-cpu-mem
  namespace: default
spec:
  mode: all
  selector:
    labelSelectors:
      app: nginx
  containerNames:
    - nginx
  stressors:
    cpu:
      workers: 2
      load: 80
    memory:
      workers: 1
      size: "512MB"
  duration: "60s"

StressChaos spawns 2 CPU workers at 80% load plus one worker allocating 512 MB of memory in the nginx container of every matching pod; the chaos-daemon drives stress-ng-style stressors inside the target containers. If the pods have memory limits, the allocation can push them into an OOMKill, which is itself a useful resilience test. Monitor with kubectl top pods. Pitfall: keep cpu.workers at or below the pod's CPU limit (or the node's vCPU count) for realistic pressure; note that containerNames sits at the spec level, not inside stressors.
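For the memory stressor to actually trigger an OOMKill, the target containers need memory limits. A sketch of the resources block you might add to the nginx container in nginx-app.yaml (the values are illustrative):

```yaml
# Add under spec.template.spec.containers[0] in nginx-app.yaml
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"   # below the 512MB stressor, so the OOM killer fires
```

With limits in place, kubectl get pods -l app=nginx will show OOMKilled restarts during the experiment window.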
IOChaos: Injecting Disk Latency and Faults
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-latency-example
  namespace: default
spec:
  action: latency
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx
  volumePath: /var/cache/nginx
  path: "/var/cache/nginx/**/*"
  delay: "2s"
  percent: 50
  duration: "45s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: IOChaos
metadata:
  name: io-fault-example
  namespace: default
spec:
  action: fault
  mode: all
  selector:
    labelSelectors:
      app: nginx
  volumePath: /var/cache/nginx
  path: "/var/cache/nginx/**/*"
  errno: 5        # EIO: input/output error
  percent: 30
  duration: "30s"

The first IOChaos adds a 2 s delay to 50% of file operations under /var/cache/nginx on one pod; the second makes 30% of file operations on all pods fail with errno 5 (EIO). Chaos Mesh intercepts file I/O through a FUSE overlay mounted at volumePath, which is why volumePath must be an actual volume mount in the pod; a plain directory on the container's root filesystem is skipped. Ideal for StatefulSet databases; the dashboard tracks the resulting IOPS drops. Note there is no disk-fill action in IOChaos; filling disks is a Chaosd (physical-node) feature.
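Because IOChaos hooks a volume mount, the stock nginx image needs /var/cache/nginx backed by a volume. A sketch of the additions to the Deployment's pod template (the volume name is illustrative):

```yaml
# Additions to spec.template.spec in nginx-app.yaml
containers:
  - name: nginx
    image: nginx:1.27-alpine
    volumeMounts:
      - name: nginx-cache
        mountPath: /var/cache/nginx
volumes:
  - name: nginx-cache
    emptyDir: {}
```

An emptyDir is enough for this demo; for a database you would point IOChaos at the PersistentVolumeClaim's mount path instead.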
Advanced Chaos Workflows
To chain experiments, create a Workflow that runs a PodChaos and then a NetworkChaos. Workflows compose templates with Serial and Parallel nodes, Suspend pauses, and conditional branches, and every node carries a deadline that bounds its runtime. Keep the blast radius small with precise namespace and label selectors, and apply workflows from Git (e.g., via ArgoCD) to run chaos in CI.
Sequential ChaosWorkflow Example
apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: sequential-chaos
  namespace: default
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Serial
      deadline: 600s
      children:
        - pod-kill-task
        - network-task
    - name: pod-kill-task
      templateType: PodChaos
      deadline: 10s
      podChaos:
        action: pod-kill
        mode: one
        selector:
          labelSelectors:
            app: nginx
    - name: network-task
      templateType: NetworkChaos
      deadline: 20s
      networkChaos:
        action: delay
        mode: one
        selector:
          labelSelectors:
            app: nginx
        delay:
          latency: "1s"

This Workflow's entry template runs its children serially: the pod kill first, then 20 s of injected 1 s latency; the whole run is bounded by the entry's 600 s deadline. Apply it with kubectl and track progress in the dashboard's 'Workflows' tab. Pitfall: the embedded podChaos/networkChaos blocks must match the corresponding CRD specs exactly, and templateType must agree with the name of the embedded field.
Best Practices
- Controlled blast radius: always use precise namespaces and labelSelectors; rehearse every experiment in staging before running it in production.
- Scheduling: use the Schedule CRD with cron expressions to model realistic bursts (e.g., peak-hour spikes), and set concurrencyPolicy: Forbid to avoid overlapping runs.
- Observability: scrape the controller's Prometheus metrics into Grafana and correlate experiment windows with your SLO dashboards.
- Safe operation: give the chaos-controller-manager a high priorityClass and set tolerations on the chaos-daemon DaemonSet so injection keeps working on tainted nodes.
- CI/CD: GitOps with Flux/ArgoCD, applying experiment YAMLs conditionally (e.g., on Git tags or a pre-promotion stage).
Common Errors to Avoid
- DaemonSet not ready: a wrong runtime socketPath leaves chaos-daemon pods pending; debug with kubectl logs -n chaos-testing daemonset/chaos-daemon.
- Selector mismatch: with no matching pods the experiment shows 'Running' but is a no-op; validate with kubectl get pods -l app=nginx.
- Graceful kill surprises: without gracePeriod: 0, pod-kill respects the pod's termination grace period and the shutdown is slow; set gracePeriod: 0 for an abrupt SIGKILL.
- Workflow deadline too short: tasks time out and the workflow fails early; budget 30min+ deadlines for complex chains.
Next Steps
- Official docs: Chaos Mesh GitHub
- Advanced eBPF: Chaos Mesh 2.6+ Features
- Litmus integration: Migration Guide
- Expert training: check our Kubernetes and chaos engineering courses at Learni.