Introduction
Soda is an open-source tool for validating data quality through declarative checks. In 2026, data pipelines need robust automated controls that detect anomalies before they reach dashboards or ML models. This tutorial guides you step by step through an advanced setup including dynamic variables, multi-source checks, and structured reporting.
Prerequisites
- Python 3.10+
- Soda CLI installed (pip install soda-core)
- Access to a database (Snowflake, BigQuery or PostgreSQL)
- Solid knowledge of YAML and SQL
- Soda Cloud account with an API key (optional but recommended)
Installation and Initial Setup
python -m pip install -U soda-core-postgres
soda --version
mkdir soda_project && cd soda_project
soda init
These commands install the PostgreSQL connector and initialize the project folder with a default configuration.yml file. Always verify the version to ensure compatibility with new 2026 features.
Create the Main Configuration File
data_source my_postgres:
  type: postgres
  host: ${PG_HOST}
  port: "5432"
  username: ${PG_USER}
  password: ${PG_PASSWORD}
  database: analytics
  schema: public
Use environment variables to avoid hardcoding credentials. This file centralizes all connections and will be referenced in every scan.
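Because the configuration depends on ${NAME} placeholders, it can help to confirm that every referenced variable is set before a scan starts. The following is a minimal sketch using only the Python standard library; the helper names find_placeholders and missing_variables are hypothetical and not part of Soda.

```python
import os
import re

# Matches ${NAME} placeholders as used in the configuration file above.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_placeholders(config_text: str) -> list[str]:
    """Return the distinct variable names referenced as ${NAME}, in order."""
    seen: list[str] = []
    for name in PLACEHOLDER.findall(config_text):
        if name not in seen:
            seen.append(name)
    return seen


def missing_variables(config_text: str, env=None) -> list[str]:
    """Return referenced variables that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in find_placeholders(config_text) if not env.get(name)]


# Hypothetical sample mirroring the configuration shown above.
SAMPLE_CONFIG = """\
data_source my_postgres:
  type: postgres
  host: ${PG_HOST}
  username: ${PG_USER}
  password: ${PG_PASSWORD}
"""

missing = missing_variables(SAMPLE_CONFIG, env={"PG_HOST": "localhost"})
print(missing)  # ['PG_USER', 'PG_PASSWORD']
```

In a real project you would read configuration.yml from disk and stop with a clear message when the returned list is not empty.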
Define Advanced Checks with Variables
checks for orders:
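  - row_count > 1000:
      name: Minimum order volume
  - missing_count(email) = 0
  - invalid_count(order_date) = 0:
      valid format: date iso 8601
  - freshness(order_date) < 24h
  - invalid_count(status) = 0:
      valid values: [pending, shipped, delivered]
  - avg(revenue) between 50 and 5000
Save this as checks.yml. It combines volume, completeness, format, freshness, allowed-value, and range checks. Note that valid format applies to text columns, whereas freshness expects a date or timestamp column, so keep whichever order_date check matches the column's actual type. To reuse the same file across environments, reference scan variables as ${NAME} in the checks and pass their values at scan time with -v NAME=value.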
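To make the intent of these checks concrete, here is a rough plain-Python sketch of what each one measures, run on a handful of in-memory rows. It is not how Soda evaluates checks (Soda generates SQL against the data source). The sample rows, the evaluate helper, the lowered row threshold, and the inclusive reading of "between" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

ALLOWED_STATUS = {"pending", "shipped", "delivered"}


def evaluate(rows, now, min_rows=1000):
    """Return check name -> True (pass) / False (fail) for a list of order dicts."""
    newest = max(r["order_date"] for r in rows)
    invalid_status = sum(
        1 for r in rows
        if r["status"] is not None and r["status"] not in ALLOWED_STATUS
    )
    return {
        "row_count": len(rows) > min_rows,
        "missing_email": sum(1 for r in rows if r["email"] is None) == 0,
        "valid_status": invalid_status == 0,
        # Freshness: age of the most recent order_date relative to scan time.
        "freshness": now - newest < timedelta(hours=24),
        # Bounds treated as inclusive in this sketch.
        "avg_revenue": 50 <= mean(r["revenue"] for r in rows) <= 5000,
    }


utc = timezone.utc
rows = [
    {"email": "a@example.com", "status": "shipped", "revenue": 120.0,
     "order_date": datetime(2026, 1, 2, 9, 0, tzinfo=utc)},
    {"email": None, "status": "pending", "revenue": 80.0,
     "order_date": datetime(2026, 1, 1, 18, 0, tzinfo=utc)},
    {"email": "c@example.com", "status": "returned", "revenue": 100.0,
     "order_date": datetime(2026, 1, 1, 8, 0, tzinfo=utc)},
]

# min_rows is lowered to 2 so the tiny sample can pass the volume check.
results = evaluate(rows, now=datetime(2026, 1, 2, 12, 0, tzinfo=utc), min_rows=2)
for name, passed in results.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

On this sample the missing email and the unexpected "returned" status fail their checks, while volume, freshness, and average revenue pass.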
Run a Scan Using Environment Variables
export PG_HOST=prod.db.internal
export PG_USER=analytics_ro
export PG_PASSWORD=$(aws ssm get-parameter --name /prod/db/password --with-decryption --query Parameter.Value --output text)
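soda scan -d my_postgres -c configuration.yml checks.yml
Here -d names the data source and -c points to the configuration file; the checks file is passed as a positional argument. Supplying credentials through environment variables and retrieving secrets from a secure store (AWS SSM in this example) is the recommended approach in production to prevent credential leaks.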
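If scans are launched from a Python orchestration script rather than a shell, the command can be assembled programmatically. This is a minimal sketch that assumes the Soda Core 3.x CLI options -d (data source), -c (configuration file), and -v NAME=value (scan variable); build_scan_command, run_scan, and the ENV variable are hypothetical names.

```python
import subprocess


def build_scan_command(data_source, configuration, checks_files, variables=None):
    """Assemble the argv list for `soda scan` (option names are assumptions)."""
    command = ["soda", "scan", "-d", data_source, "-c", configuration]
    for name, value in (variables or {}).items():
        command += ["-v", f"{name}={value}"]
    command += list(checks_files)
    return command


def run_scan(command):
    """Run the scan and return its exit code; requires the soda CLI on PATH."""
    return subprocess.run(command, check=False).returncode


command = build_scan_command(
    "my_postgres",
    "configuration.yml",
    ["checks.yml"],
    variables={"ENV": "prod"},  # hypothetical scan variable
)
print(" ".join(command))
# soda scan -d my_postgres -c configuration.yml -v ENV=prod checks.yml
```

Passing the arguments as a list avoids shell quoting problems, and the database credentials still come from the exported environment variables rather than from the command line.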
CI/CD Integration with GitHub Actions
name: Data Quality Checks
on: [push]
jobs:
  soda:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Soda
        env:
          PG_HOST: ${{ secrets.PG_HOST }}
          PG_USER: ${{ secrets.PG_USER }}
          PG_PASSWORD: ${{ secrets.PG_PASSWORD }}
        run: |
          pip install soda-core-postgres
          soda scan -d my_postgres -c configuration.yml checks.yml
When checks fail, the scan exits with a non-zero code and the job fails. To block merges, also trigger the workflow on pull requests and mark it as a required status check. GitHub secrets protect the database credentials.
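A pipeline may want to treat warnings differently from failures. The sketch below maps the scan's exit code to a merge decision. The code meanings (0 pass, 1 warn, 2 fail, 3 runtime error) reflect my understanding of the Soda Core documentation and should be verified against your installed version; should_block_merge and describe are hypothetical helpers.

```python
# Assumed exit-code meanings; confirm them for your Soda version.
EXIT_MEANINGS = {
    0: "all checks passed",
    1: "at least one check warned",
    2: "at least one check failed",
    3: "the scan hit a runtime error",
}


def describe(exit_code: int) -> str:
    """Return a human-readable label for a scan exit code."""
    return EXIT_MEANINGS.get(exit_code, "unknown exit code")


def should_block_merge(exit_code: int, block_on_warn: bool = False) -> bool:
    """Decide whether the CI job should fail for a given scan exit code."""
    if exit_code == 0:
        return False
    if exit_code == 1:
        return block_on_warn
    # Failures, runtime errors, and unknown codes always block.
    return True


for code in (0, 1, 2, 3):
    print(code, describe(code), "-> block" if should_block_merge(code) else "-> allow")
```

Unknown codes block by default, so an unexpected CLI failure cannot pass silently.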
Best Practices
- Always version your checks.yml files in Git
- Use explicit names for each check to simplify debugging
- Configure Slack or email alerts via Soda Cloud for critical failures
- Separate checks by business domain (finance, marketing)
- Add statistical distribution checks for sensitive numeric columns
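One way to keep checks separated by business domain is a folder per domain, each holding its own check files. The sketch below lists such a layout so that each domain can be scanned on its own. The checks/<domain>/*.yml structure, the file names, and the checks_by_domain helper are assumptions for illustration, not a Soda convention.

```python
import tempfile
from pathlib import Path


def checks_by_domain(root):
    """Map each domain folder under root to its sorted list of check files."""
    result = {}
    for domain_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(domain_dir.glob("*.yml"))
        if files:
            result[domain_dir.name] = [f.name for f in files]
    return result


# Build a throwaway layout to demonstrate the helper.
with tempfile.TemporaryDirectory() as tmp:
    for relative in ("finance/orders.yml", "finance/invoices.yml", "marketing/campaigns.yml"):
        path = Path(tmp) / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("checks for placeholder_table:\n  - row_count > 0\n")
    layout = checks_by_domain(tmp)

print(layout)
# {'finance': ['invoices.yml', 'orders.yml'], 'marketing': ['campaigns.yml']}
```

Each domain's file list can then be handed to a separate scan, so a failure in marketing checks does not obscure the finance results.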
Common Errors to Avoid
- Forgetting to export environment variables before the scan
- Hardcoding credentials in YAML files
- Not testing checks locally before CI deployment
- Ignoring connection error messages related to firewalls or permissions