How to Ensure Data Quality in 2026

Introduction

Data quality has become a critical pillar of modern systems. Corrupted data leads to erroneous decisions and high costs. This tutorial guides you step by step in setting up an automated validation solution with Great Expectations. You will learn to define clear expectations, execute validations, and integrate these controls into a data pipeline. The approach is progressive and production-oriented.

Prerequisites

  • Python 3.10+
  • Basic knowledge of pandas
  • A configured virtual environment
  • Basic familiarity with ETL pipelines

Project Initialization

terminal
python -m venv venv
source venv/bin/activate
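pip install "great_expectations<1.0" pandas

These commands create and activate an isolated environment, then install Great Expectations along with pandas for data manipulation. The version pin is an assumption based on the API used in this tutorial (context.sources, get_validator), which belongs to the pre-1.0 releases.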

Creating the GE Context

terminal
great_expectations init

This command initializes the folder structure and configuration files that Great Expectations needs. The great_expectations CLI is part of the pre-1.0 releases.

Defining the Datasource

create_datasource.py
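import great_expectations as gx

# Load the project's Data Context (created by the init step above).
context = gx.get_context()
# add_pandas() registers the datasource on the context, so no separate
# context.add_datasource() call is needed.
datasource = context.sources.add_pandas(name="my_pandas_datasource")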

This script registers a pandas data source in the Great Expectations context so that validations can be applied to it.

Creating an Expectation Suite

create_expectation_suite.py
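import pandas as pd

# Continues from create_datasource.py: reuses `context` and `datasource`.
# Assumption: "data.csv" is a hypothetical input file with an "id" column.
df = pd.read_csv("data.csv")
asset = datasource.add_dataframe_asset(name="my_asset")
suite = context.add_expectation_suite(expectation_suite_name="data_quality_suite")
validator = context.get_validator(
    # Assumption: 0.18-style API, where the DataFrame is passed to the batch request.
    batch_request=asset.build_batch_request(dataframe=df),
    expectation_suite_name="data_quality_suite",
)
validator.expect_column_values_to_not_be_null(column="id")
validator.expect_column_values_to_be_unique(column="id")
validator.save_expectation_suite(discard_failed_expectations=False)  # keep rules that failed on this batch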

Here we define two simple quality rules: no null values and uniqueness on the id column.
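
To make the two rules concrete, here is a minimal pandas-only sketch of what they check. It is not how Great Expectations implements them: check_id_rules and the sample DataFrame are hypothetical, and the uniqueness check skips nulls on the assumption that missing values are reported by the not-null rule instead.

```python
import pandas as pd


def check_id_rules(df: pd.DataFrame, column: str = "id") -> dict:
    """Plain-pandas approximation of the two rules (hypothetical helper)."""
    values = df[column]
    # Nulls are excluded from the uniqueness check; the not-null rule covers them.
    non_null = values.dropna()
    return {
        "not_null": bool(values.notna().all()),
        "unique": bool(not non_null.duplicated().any()),
    }


# Illustrative data only: one repeated id and one missing id.
sample = pd.DataFrame({"id": [1, 2, 2, None]})
report = check_id_rules(sample)
print(report)
```

Both entries come back False for this sample, because the column contains a duplicate and a missing value.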

Running the Checkpoint

run_checkpoint.py
checkpoint = context.add_or_update_checkpoint(
    name="my_checkpoint",
    validator=validator
)
results = checkpoint.run()
print(results.success)

The checkpoint runs every expectation in the suite and returns a result object. Its success attribute is a boolean that is True only when the data satisfies all of the quality rules.
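
In a pipeline, the success flag usually decides whether downstream steps run. The following is a minimal sketch that takes only that flag as input. require_valid and DataQualityError are hypothetical names, not part of Great Expectations.

```python
class DataQualityError(RuntimeError):
    """Raised when a validation run reports failure (hypothetical type)."""


def require_valid(success: bool, checkpoint_name: str) -> None:
    """Stop the pipeline by raising if validation did not succeed."""
    if not success:
        raise DataQualityError(f"Checkpoint {checkpoint_name!r} failed validation")


# In the tutorial's script this would be called as:
#   require_valid(results.success, "my_checkpoint")
require_valid(True, "my_checkpoint")  # a successful run passes silently
```

Raising an exception, rather than only printing the flag, makes most schedulers and CI runners mark the step as failed.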

Best Practices

  • Version your expectation suites in Git
  • Use explicit names for checkpoints
  • Integrate validations into your CI/CD pipeline
  • Monitor success rates with alerts
  • Document the business rules associated with each expectation
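
The monitoring advice above can be sketched without any Great Expectations calls. This assumes you record one boolean per checkpoint run. success_rate, should_alert, and the 0.95 threshold are hypothetical choices to adapt to your own alerting setup.

```python
from typing import Sequence


def success_rate(run_outcomes: Sequence[bool]) -> float:
    """Fraction of recorded checkpoint runs that succeeded."""
    if not run_outcomes:
        raise ValueError("no runs recorded")
    return sum(run_outcomes) / len(run_outcomes)


def should_alert(run_outcomes: Sequence[bool], threshold: float = 0.95) -> bool:
    """True when the success rate drops below the chosen threshold."""
    return success_rate(run_outcomes) < threshold


# Illustrative history: three successful runs and one failure.
history = [True, True, False, True]
print(success_rate(history), should_alert(history))
```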

Common Errors to Avoid

  • Forgetting to update the batch request after schema changes
  • Defining too many strict expectations from the start
  • Not planning for large datasets (for example, by validating a sample)
  • Ignoring validation results in logs
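
For the large-data case, one option is to validate a reproducible random sample instead of the full table. This is a minimal pandas sketch. sample_for_validation, the 100,000-row cap, and the seed are hypothetical choices.

```python
import pandas as pd


def sample_for_validation(
    df: pd.DataFrame, max_rows: int = 100_000, seed: int = 42
) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a fixed-seed random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Illustrative data only: 1,000 rows cut down to 100.
big = pd.DataFrame({"id": range(1_000)})
subset = sample_for_validation(big, max_rows=100)
print(len(subset))
```

Sampling trades coverage for speed. Row-level rules such as not-null still give a useful signal on a sample, but a uniqueness check can miss duplicates that fall outside it.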

To Go Further

Deepen your skills with our data engineering training courses: https://learni-group.com/formations
