How to Configure Soda for Advanced Data Quality Checks in 2026

Introduction

Soda is an open-source tool for validating data quality through declarative checks. In 2026, data pipelines need automated controls that catch anomalies before they reach dashboards or ML models. This tutorial walks step by step through an advanced setup: connection settings driven by environment variables, a checks file covering volume, validity, and freshness, secure scan execution, and CI/CD integration.

Prerequisites

  • Python 3.10+
  • Soda CLI installed together with a connector package (for example, pip install soda-core-postgres)
  • Access to a database (Snowflake, BigQuery or PostgreSQL)
  • Solid knowledge of YAML and SQL
  • Soda Cloud account with an API key (optional but recommended)

Installation and Initial Setup

terminal
python -m pip install -U soda-core-postgres
soda --version
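mkdir soda_project && cd soda_project

These commands install Soda Core with the PostgreSQL connector and create an empty project folder. The Soda Core 3.x CLI does not scaffold a project, so you create configuration.yml and checks.yml by hand in the next steps. Check the reported version before continuing: the checks in this tutorial use the SodaCL syntax of Soda Core 3.x.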

Create the Main Configuration File

configuration.yml
data_source my_postgres:
  type: postgres
  host: ${PG_HOST}
  port: "5432"
  username: ${PG_USER}
  password: ${PG_PASSWORD}
  database: analytics
  schema: public

Use environment variables to avoid hardcoding credentials. This file centralizes all connections and will be referenced in every scan.
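Because configuration.yml resolves its `${...}` placeholders from the environment, an unset variable usually surfaces only as a connection error at scan time. The sketch below is one possible pre-flight check, not part of Soda itself: the helper name `missing_env_vars` is hypothetical, and the variable names are the ones used in the configuration above.

```python
import os

# Variable names referenced in configuration.yml above.
REQUIRED_VARS = ("PG_HOST", "PG_USER", "PG_PASSWORD")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the real process environment; passing a dict
    makes the function easy to test. (Hypothetical helper, not a Soda API.)
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling `missing_env_vars()` before the scan and stopping when the returned list is non-empty gives a clearer message than a failed database connection.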

Define Advanced Checks with Variables

checks.yml
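checks for orders:
  # Volume: fail when the table holds 1000 rows or fewer
  - row_count > 1000:
      name: Minimum order volume
  # Completeness: every order needs an email
  - missing_count(email) = 0
  # Validity: "valid format" applies to text columns only;
  # drop this check if order_date is a native date or timestamp
  - invalid_count(order_date) = 0:
      valid format: date iso 8601
  # Freshness: requires order_date to be a date or timestamp column
  - freshness(order_date) < 24h
  # Allowed values are expressed through invalid_count plus "valid values"
  - invalid_count(status) = 0:
      valid values: ['pending', 'shipped', 'delivered']
  - avg(revenue) between 50 and 5000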

This YAML file combines volume, completeness, validity, freshness, allowed-value, and range checks. To reuse the same file across environments, reference variables in the checks with the ${NAME} syntax and supply their values at scan time with -v NAME=VALUE.
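As a rough illustration of how variables reach a scan, the sketch below assembles a `soda scan` command line with `-v NAME=VALUE` pairs. The function `build_scan_command` and the variable `ENV=staging` are hypothetical; only the `-d`, `-c`, and `-v` options come from the Soda CLI, and the data source and file names are the ones used in this tutorial.

```python
import shlex


def build_scan_command(data_source, config_path, checks_path, variables=None):
    """Build the argument list for `soda scan`, adding one -v pair per variable.

    Hypothetical helper: it only assembles arguments and does not run Soda.
    """
    cmd = ["soda", "scan", "-d", data_source, "-c", config_path]
    for name, value in (variables or {}).items():
        cmd += ["-v", f"{name}={value}"]
    cmd.append(checks_path)
    return cmd


cmd = build_scan_command(
    "my_postgres",
    "configuration.yml",
    "checks.yml",
    variables={"ENV": "staging"},  # hypothetical variable name and value
)
print(shlex.join(cmd))
# prints: soda scan -d my_postgres -c configuration.yml -v ENV=staging checks.yml
```

The resulting list can be handed to `subprocess.run(cmd)` once Soda is installed; keeping it as a list avoids shell quoting issues with variable values.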

Run a Scan Using Environment Variables

terminal
export PG_HOST=prod.db.internal

export PG_USER=analytics_ro
export PG_PASSWORD=$(aws ssm get-parameter --name /prod/db/password --with-decryption --query Parameter.Value --output text)

soda scan -d my_postgres -c configuration.yml checks.yml

Passing credentials through environment variables and fetching the password from a secret store (AWS SSM here) keeps secrets out of the repository and the YAML files, which is the recommended approach in production.

CI/CD Integration with GitHub Actions

.github/workflows/data-quality.yml
name: Data Quality Checks
on: [push]
jobs:
  soda:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - name: Run Soda
      env:
        PG_HOST: ${{ secrets.PG_HOST }}
        PG_USER: ${{ secrets.PG_USER }}
        PG_PASSWORD: ${{ secrets.PG_PASSWORD }}
      run: |
        pip install soda-core-postgres
        soda scan -d my_postgres -c configuration.yml checks.yml

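CI/CD integration allows blocking merges if checks fail: soda scan returns a non-zero exit code when a check fails, which fails the job, and a branch protection rule that requires this job then blocks the merge. GitHub secrets keep the database credentials out of the repository.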
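To make the gating logic explicit, here is a minimal sketch of how a wrapper script might interpret the scan result. It assumes the exit codes documented for Soda Core 3.x (0 pass, 1 warn, 2 fail, 3 runtime error), so verify them against your installed version. The function name `should_block_merge` and the `block_on_warn` option are hypothetical.

```python
# Exit codes as documented for `soda scan` in Soda Core 3.x (assumption:
# confirm against the version you run).
EXIT_MEANINGS = {
    0: "all checks passed",
    1: "at least one check issued a warning",
    2: "at least one check failed",
    3: "Soda encountered a runtime error",
}


def should_block_merge(exit_code, block_on_warn=False):
    """Decide whether a scan result should fail the CI job.

    Warnings block only when block_on_warn is True; failures, runtime
    errors, and any unknown code always block.
    """
    if exit_code == 0:
        return False
    if exit_code == 1:
        return block_on_warn
    return True
```

Note that calling `soda scan` directly in a workflow step fails the job on any non-zero code, including warnings. A wrapper like this is only needed if warnings should be tolerated.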

Best Practices

  • Always version your checks.yml files in Git
  • Use explicit names for each check to simplify debugging
  • Configure Slack or email alerts via Soda Cloud for critical failures
  • Separate checks by business domain (finance, marketing)
  • Add statistical distribution checks for sensitive numeric columns
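
One way to separate checks by business domain is to keep one folder per domain and collect its files before each scan. The sketch below assumes a hypothetical layout such as checks/finance/*.yml and checks/marketing/*.yml. The function name `checks_by_domain` is illustrative and not part of Soda.

```python
from pathlib import Path


def checks_by_domain(root):
    """Map each domain folder under `root` to its sorted list of checks files.

    Assumes a hypothetical layout of one sub-folder per business domain,
    each holding SodaCL files with a .yml extension.
    """
    root = Path(root)
    return {
        domain.name: sorted(str(path) for path in domain.glob("*.yml"))
        for domain in sorted(root.iterdir())
        if domain.is_dir()
    }
```

Each list can then be passed to its own soda scan invocation, so a failure in the finance checks is reported separately from one in marketing.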

Common Errors to Avoid

  • Forgetting to export environment variables before the scan
  • Hardcoding credentials in YAML files
  • Not testing checks locally before CI deployment
  • Ignoring connection error messages related to firewalls or permissions

Going Further

Discover our advanced training on data quality and modern tools like Soda at Learni Group.