Skip to content
Learni
View all tutorials
Bases de données

How to Build a Simple Data Warehouse with PostgreSQL in 2026

Lire en français

Introduction

A data warehouse centralizes your company's data for analysis. Unlike a transactional database, it is optimized for complex queries and historical tracking. This tutorial guides you step by step to create a simple data warehouse with PostgreSQL in 2026.

Prerequisites

  • PostgreSQL 16+
  • Basic SQL knowledge
  • Python 3.11+
  • Docker Desktop installed

Launch PostgreSQL with Docker

terminal
docker run --name dw-postgres -e POSTGRES_PASSWORD=secret -p 5432:5432 -d postgres:16

This command starts a persistent PostgreSQL container. Port 5432 is exposed for local connections.

Create the Database and Star Schema

schema.sql
CREATE DATABASE datawarehouse;
\c datawarehouse;

CREATE TABLE dim_client (
  client_id SERIAL PRIMARY KEY,
  nom VARCHAR(100),
  ville VARCHAR(50)
);

CREATE TABLE dim_produit (
  produit_id SERIAL PRIMARY KEY,
  nom VARCHAR(100),
  categorie VARCHAR(50)
);

CREATE TABLE fact_ventes (
  vente_id SERIAL PRIMARY KEY,
  client_id INT REFERENCES dim_client(client_id),
  produit_id INT REFERENCES dim_produit(produit_id),
  montant DECIMAL(10,2),
  date_vente DATE
);

We create a classic star schema structure: two dimension tables and one fact table. This optimizes analytical joins.

Simple Python ETL Script

etl.py
import psycopg2
conn = psycopg2.connect("host=localhost dbname=datawarehouse user=postgres password=secret")
cur = conn.cursor()
cur.execute("INSERT INTO dim_client (nom, ville) VALUES ('Jean Dupont', 'Paris');")
cur.execute("INSERT INTO dim_produit (nom, categorie) VALUES ('Laptop', 'Informatique');")
cur.execute("INSERT INTO fact_ventes (client_id, produit_id, montant, date_vente) VALUES (1, 1, 999.99, '2026-01-15');")
conn.commit()
cur.close()
conn.close()

This script inserts test data into the tables. In production, replace the values with data extracted from external sources.

Analysis Query

analyse.sql
SELECT 
  dc.ville,
  dp.categorie,
  SUM(fv.montant) AS total_ventes
FROM fact_ventes fv
JOIN dim_client dc ON fv.client_id = dc.client_id
JOIN dim_produit dp ON fv.produit_id = dp.produit_id
GROUP BY dc.ville, dp.categorie
ORDER BY total_ventes DESC;

This query aggregates sales by city and category. It leverages the star schema for optimal performance.

Configure the .env File

.env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=datawarehouse
DB_USER=postgres
DB_PASSWORD=secret

Store connection credentials in a .env file to avoid hardcoding them in source code.

Best Practices

  • Always use surrogate primary keys
  • Separate raw data (staging) from modeled data
  • Document every table and column
  • Automate ETL loads with tools like Airflow

Common Mistakes to Avoid

  • Forgetting indexes on foreign keys
  • Loading data without prior cleaning
  • Using overly generic data types (TEXT instead of VARCHAR)
  • Ignoring date and timezone management

Going Further

Check out our Learni courses on data engineering and advanced data warehouses.