
How to Build a Scalable Data Lake in 2026


Introduction

A data lake is a centralized repository for raw data at massive scale, following the schema-on-read principle: data arrives without an imposed schema and is interpreted only when read. Unlike a structured data warehouse (schema-on-write), it handles huge volumes of heterogeneous data: JSON logs, images, CSV files, videos. In 2026, with the explosion of AI data (LLM fine-tuning, RAG), data lakes feed ML pipelines with rich historical data.
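To make schema-on-read concrete, here is a minimal sketch (the filename and sample events are illustrative): raw JSON lines land in storage untouched, and a schema is inferred only when a consumer reads them.

schema-on-read-demo.py
import json
import pandas as pd

# Raw events are accepted as-is: no schema is enforced at write time
raw_events = [
    '{"user": "a", "action": "click"}',
    '{"user": "b", "action": "buy", "amount": 42.0}',  # extra field, still accepted
]

# The schema is inferred only at read time (schema-on-read)
df = pd.DataFrame(json.loads(e) for e in raw_events)
print(df)  # 'amount' shows as NaN for rows that lack it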

This intermediate tutorial guides you through building a scalable local data lake with MinIO (S3-compatible API), Python for partitioned ingestion into optimized Parquet, and DuckDB for vectorized SQL queries. Think of it like a natural lake where water (data) accumulates freely and is filtered on demand. By the end, you'll have a working, extensible setup ready to point at AWS S3 or Azure. Estimated time: 30 minutes. Perfect for data engineers to bookmark.

Prerequisites

  • Docker and Docker Compose installed (version 20+).
  • Python 3.10+ with pip.
  • Internet access for Docker images and PyPI packages.
  • Basic data engineering knowledge: Parquet, partitioning, S3-like storage.
  • Python libraries: pip install minio pandas pyarrow duckdb.
  • mc (MinIO Client), downloaded in a later step using the provided script.

Start MinIO with Docker Compose

docker-compose.yml
version: '3.8'

services:
  minio:
    image: minio/minio:latest
    container_name: minio-server
    ports:
      - "9000:9000"  # API S3
      - "9001:9001"  # Console UI
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio_data:/data
    command: server /data --console-address ":9001"

volumes:
  minio_data:

networks:
  default:
    name: datalake-net

This Docker Compose file launches MinIO as an S3-compatible server with data persistence. It exposes port 9000 for the S3 API and port 9001 for the web console. Run docker compose up -d to start. Access the UI at http://localhost:9001 (login: minioadmin/minioadmin). Pitfall: without the named volume, data is lost when the container is removed.

Access the MinIO Interface

Run docker compose up -d in the file's directory. Check the logs with docker logs minio-server. Open http://localhost:9001 for the intuitive UI to explore buckets and monitor usage. MinIO closely emulates the AWS S3 API, making it ideal for dev/testing without cloud costs. Next step: the mc CLI client for automation.
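Beyond the UI, it is worth confirming that the S3 API itself responds before scripting against it. A minimal check with the minio Python SDK (credentials match the docker-compose.yml above):

check-minio.py
from minio import Minio

# Point the SDK at the local instance; secure=False because there is no TLS in dev
client = Minio("localhost:9000",
               access_key="minioadmin",
               secret_key="minioadmin",
               secure=False)

# list_buckets() round-trips through the S3 API; an empty list is fine on a fresh install
print([b.name for b in client.list_buckets()])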

Install and Configure the MinIO mc Client

setup-mc.sh
#!/bin/bash

# Download mc (Linux/macOS; adapt the URL for Windows)
curl -O https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/mc

# Configure an alias for the local MinIO instance
mc alias set myminio http://localhost:9000 minioadmin minioadmin

# Verify the connection
echo "Available buckets:"
mc ls myminio

# No sudo? Leave the binary in the working directory and invoke it as ./mc

This script installs mc (MinIO Client, the equivalent of the AWS CLI for S3) and creates an alias 'myminio' pointing to your local instance. Run bash setup-mc.sh. Verify with mc admin info myminio. Pitfall: a firewall blocking port 9000; allow it. mc is essential for scripting buckets and policies.

Create the Data Lake Bucket

Buckets are logical containers in S3/MinIO, like root directories. For a data lake, create a 'datalake' bucket with a public read policy (local dev only). This enables direct queries without heavy auth. Use idempotent mc flags so the script can be rerun without errors.

Create Bucket and Read Policy

create-bucket.sh
#!/bin/bash

BUCKET=datalake

# Create the bucket if it does not exist (idempotent)
mc mb --ignore-existing myminio/$BUCKET

# Policy JSON for public read (local dev only)
cat > public-read.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::$BUCKET/*"]
    }
  ]
}
EOF

mc anonymous set-json public-read.json myminio/$BUCKET

# Verify
echo "Policy:"
mc anonymous get-json myminio/$BUCKET
mc ls myminio

Creates the 'datalake' bucket and applies a JSON policy allowing anonymous reads (acceptable locally). Run bash create-bucket.sh, then check the MinIO UI. Pitfall: a malformed policy is rejected, so the JSON must be valid; in production, use fine-grained IAM instead. This makes the lake queryable without credentials.

Ingest Partitioned Data

Simulate sales data: 1000 rows (id, date, product, amount, region). Partitioning by year/month optimizes queries via partition pruning. Write the data as Parquet (columnar, compressed, schema embedded) rather than CSV. Use the MinIO Python SDK; each upload is a single atomic PUT.

Python Script: Generate and Upload Parquet Data

ingest-data.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from minio import Minio
from io import BytesIO
import numpy as np
from datetime import datetime, timedelta

client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False
)

# Generate sample data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='h')[:1000]
data = pd.DataFrame({
    'id': range(1000),
    'date': dates,
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 1000),
    'amount': np.random.uniform(100, 1000, 1000),
    'region': np.random.choice(['EU', 'US', 'ASIA'], 1000)
})

# Partition by year/month
data['year'] = data['date'].dt.year.astype(str)
data['month'] = data['date'].dt.month.astype(str).str.zfill(2)

for (year, month), group in data.groupby(['year', 'month']):
    table = pa.Table.from_pandas(group)
    buffer = BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)
    client.put_object(
        "datalake",
        f"sales/year={year}/month={month}/sales.parquet",
        buffer,
        length=buffer.getbuffer().nbytes,
        content_type="application/parquet"
    )
    print(f"Uploaded: year={year}/month={month}")

print("Ingestion terminée. Vérifiez UI MinIO.")

Generates a partitioned sales dataset, writes it as Parquet (typically several times smaller and far faster to scan than CSV), and uploads it via the MinIO SDK. Year/month partitioning enables automatic pruning. Run python ingest-data.py. Note: pyarrow is what preserves column types in the Parquet metadata, and the in-memory buffer avoids temp files. The same pattern scales to terabytes with Dask.
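To confirm the hive-style layout landed as expected, you can list the uploaded keys with the same SDK (a quick sanity check, reusing the connection settings from ingest-data.py):

verify-layout.py
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

# Recursively list everything under the sales/ prefix;
# expect keys like sales/year=2023/month=01/sales.parquet
for obj in client.list_objects("datalake", prefix="sales/", recursive=True):
    print(obj.object_name, obj.size)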

Query the Data Lake with SQL

DuckDB is an embedded OLAP engine that reads Parquet directly from S3-compatible storage, no ETL required. Configure the endpoint and credentials for MinIO. Filters are pushed down: partitions are pruned and aggregations run vectorized. Analogy: SQL over files as if they were a database.

Python Script: SQL Queries with DuckDB

query-lake.py
import duckdb

con = duckdb.connect()

# Install/load the S3 extension
con.execute("INSTALL httpfs; LOAD httpfs;")

# Configure MinIO as an S3 endpoint (local dev: no TLS, path-style URLs)
con.execute("""
SET s3_endpoint='localhost:9000';
SET s3_access_key_id='minioadmin';
SET s3_secret_access_key='minioadmin';
SET s3_region='us-east-1';
SET s3_use_ssl=false;
SET s3_url_style='path';
""")

# Partitioned query: EU sales, 2023
result = con.execute(
    """
    SELECT 
        product, 
        AVG(amount) as avg_amount,
        COUNT(*) as sales_count
    FROM 's3://datalake/sales/year=2023/*/*.parquet'
    WHERE region = 'EU'
    GROUP BY product
    ORDER BY avg_amount DESC
    """
).fetchdf()

print(result)

# Optional export
result.to_csv('query-result.csv', index=False)
print("Résultat exporté en CSV.")

Configures DuckDB against MinIO/S3 and runs SQL over the partitioned Parquet files (the year=2023 path prunes to matching partitions). Aggregations benefit from pushdown and vectorized execution. Run python query-lake.py. Pitfall: forgetting LOAD httpfs causes endpoint errors, and the bucket's read policy (or credentials) must allow access. For petabyte scale, the same SQL moves to a distributed engine such as Trino or Spark.
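DuckDB can also surface the year=/month= directories as query columns via read_parquet's hive_partitioning option, letting the optimizer prune partitions from the WHERE clause alone. A small variation (excerpt; reuses the configured con from query-lake.py):

# Treat partition directories as columns; DuckDB skips partitions that cannot match
monthly = con.execute("""
    SELECT year, month, SUM(amount) AS revenue
    FROM read_parquet('s3://datalake/sales/*/*/*.parquet', hive_partitioning=true)
    WHERE year = '2023' AND month <= '06'
    GROUP BY year, month
    ORDER BY month
""").fetchdf()
print(monthly)

Note that partition values read as strings, which is why the zero-padded month comparison works lexicographically.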

Add Incremental Data

Data lakes evolve: new batches are appended without downtime. Reuse the ingest pattern for each batch; the Parquet metadata carries the evolved schema. Each put_object is a single atomic PUT, so readers never see half-written files.

Append 2024 Data Script

append-data.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from minio import Minio
from io import BytesIO
import numpy as np
from datetime import datetime

client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

# New 2024 data (schema evolution: adds 'category')
np.random.seed(123)
dates = pd.date_range('2024-01-01', '2024-03-31', freq='h')[:500]
data_new = pd.DataFrame({
    'id': range(1000, 1500),
    'date': dates,
    'product': np.random.choice(['Laptop Pro', 'Phone Ultra'], 500),
    'amount': np.random.uniform(500, 2000, 500),
    'region': np.random.choice(['EU', 'US'], 500),
    'category': np.random.choice(['Electronics', 'Mobile'], 500)  # Schema evolution
})

data_new['year'] = data_new['date'].dt.year.astype(str)
data_new['month'] = data_new['date'].dt.month.astype(str).str.zfill(2)

for (year, month), group in data_new.groupby(['year', 'month']):
    table = pa.Table.from_pandas(group)
    buffer = BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)
    client.put_object(
        "datalake",
        f"sales/year={year}/month={month}/sales.parquet",
        buffer,
        length=buffer.getbuffer().nbytes,
        content_type="application/parquet"
    )
    print(f"Appended: year={year}/month={month}")

print("Append terminé. Re-query pour vérifier.")

Appends 2024 data with a new 'category' column (schema evolution, supported by Parquet metadata). Same partitioning pattern; run it after the initial ingest. Pitfall: writing to an existing key overwrites it, so use timestamped filenames in production. DuckDB can merge the old and new schemas at read time, as shown below.
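Because the 2023 files lack the category column, a plain glob read across both years can fail on the schema mismatch; read_parquet's union_by_name option aligns columns by name and fills the gaps with NULL. A sketch (excerpt; reuses the configured con from query-lake.py):

# union_by_name merges differing Parquet schemas; 2023 rows get NULL 'category'
evolved = con.execute("""
    SELECT year, category, COUNT(*) AS row_count
    FROM read_parquet('s3://datalake/sales/*/*/*.parquet',
                      hive_partitioning=true, union_by_name=true)
    GROUP BY year, category
    ORDER BY year, category
""").fetchdf()
print(evolved)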

Best Practices

  • Always partition: year/month/day for pruning (often orders-of-magnitude query speedups).
  • Use table formats: migrate to Iceberg/Delta Lake for ACID transactions and time travel (vs raw Parquet).
  • Central catalog: integrate Hive Metastore or AWS Glue for metadata.
  • Data quality: validate schemas/NULLs on ingestion, e.g. with Great Expectations (a lightweight sketch follows this list).
  • Security: least-privilege policies, SSE-S3 encryption; no public buckets in production.
  • Monitoring: Prometheus + MinIO metrics for usage/throughput.
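Frameworks like Great Expectations go much further, but even a minimal pre-upload check catches the worst offenders. A lightweight sketch in plain pandas (the validate helper is hypothetical; column names match this tutorial's dataset):

validate-batch.py
import pandas as pd

REQUIRED_COLUMNS = {'id', 'date', 'product', 'amount', 'region'}

def validate(df: pd.DataFrame) -> None:
    """Reject a batch before it reaches the lake; raises ValueError on failure."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if df[['id', 'amount']].isnull().any().any():
        raise ValueError("NULLs in required columns")
    if (df['amount'] < 0).any():
        raise ValueError("Negative amounts found")

# Call validate(data) just before the upload loop in ingest-data.py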

Common Pitfalls to Avoid

  • Data swamp: without governance or a catalog, data becomes unmanageable; enforce tagging and zones (raw/curated).
  • No partitioning/compression: full scans mean slow queries and high costs; always compress with Snappy or Zstd (see the sketch after this list).
  • Forced schema-on-write: kills the lake's flexibility; stick to schema-on-read and fix types at consumption.
  • No versioning: loses audit trails; use Iceberg snapshots or S3 versioning.
  • Local vs production mismatch: a public policy is fine for dev, but use VPC/IAM in the cloud.
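The ingest scripts above rely on pyarrow's default codec (Snappy); passing compression explicitly makes the size/speed trade-off visible. A small sketch comparing the two codecs named above:

compression-demo.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import BytesIO

df = pd.DataFrame({'amount': range(100_000)})
table = pa.Table.from_pandas(df)

# Snappy: fast with a moderate ratio (pyarrow's default); Zstd: slower, smaller files
for codec in ('snappy', 'zstd'):
    buffer = BytesIO()
    pq.write_table(table, buffer, compression=codec)
    print(f"{codec}: {buffer.getbuffer().nbytes} bytes")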

Next Steps

  • Production: Migrate to AWS S3 + Athena + Glue Catalog (swap endpoint).
  • Advanced: Apache Iceberg with Spark/Trino for transactional tables.
  • Tools: Airflow for orchestration, dbt for transformations.
  • Resources: MinIO Docs, DuckDB S3.
Check out our Data Engineering Training for Iceberg/Databricks masterclasses.