Introduction
A data lake is a centralized storage for raw data at massive scale, following the schema-on-read principle: data arrives without an imposed schema and is interpreted only when read. Unlike structured data warehouses (schema-on-write), it handles huge volumes of heterogeneous data: JSON logs, images, CSV files, videos. In 2026, with the explosion of AI data (LLM fine-tuning, RAG), data lakes power ML pipelines with rich historical data.
This intermediate tutorial guides you through building a scalable local data lake with MinIO (a 100% S3-compatible API), Python for partitioned ingestion into optimized Parquet, and DuckDB for vectorized SQL queries. Think of it like a natural lake where water (data) accumulates freely and is filtered on demand. By the end, you'll have a working, extensible setup that can later point at AWS S3 or Azure. Estimated time: 30 minutes. Perfect for data engineers to bookmark.
Prerequisites
- Docker and Docker Compose installed (version 20+).
- Python 3.10+ with pip.
- Internet access for Docker images and PyPI packages.
- Basic data engineering knowledge: Parquet, partitioning, S3-like storage.
- Python packages: pip install minio pandas pyarrow duckdb. Download mc (MinIO Client) using the script provided below.
Start MinIO with Docker Compose
version: '3.8'

services:
  minio:
    image: minio/minio:latest
    container_name: minio-server
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # Console UI
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    volumes:
      - minio_data:/data
    command: server /data --console-address ":9001"

volumes:
  minio_data:

networks:
  default:
    name: datalake-net

This Docker Compose file launches MinIO as an S3-compatible server with persistent storage. Port 9000 exposes the S3 API and port 9001 the web console. Run docker compose up -d to start. Access the UI at http://localhost:9001 (login: minioadmin/minioadmin). Pitfall: without a named volume, data is lost when the container is removed.
Access the MinIO Interface
Run docker compose up -d in the file's directory. Check logs: docker logs minio-server. Open http://localhost:9001 for the intuitive UI to explore buckets and monitor usage. MinIO perfectly emulates AWS S3, ideal for dev/testing without cloud costs. Next step: CLI client mc for automation.
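Before moving on to mc, you can also sanity-check the S3 API from Python. A minimal sketch using the minio SDK, assuming the default minioadmin credentials from the Compose file above:

from minio import Minio

# Connect to the local MinIO S3 API (plain HTTP, so secure=False)
client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False
)

# list_buckets() fails fast if the server is unreachable or credentials are wrong
buckets = client.list_buckets()
print("MinIO reachable. Buckets:", [b.name for b in buckets])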
Install and Configure the MinIO mc Client
#!/bin/bash
# Download mc (Linux/macOS; adapt for Windows)
curl -O https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/mc

# Configure an alias for our local MinIO
mc alias set myminio http://localhost:9000 minioadmin minioadmin

# Verify the connection
echo "Available buckets:"
mc ls myminio
# Optional cleanup: rm mc if you don't have sudo

This script installs mc (MinIO Client, the equivalent of the AWS CLI for S3) and creates an alias 'myminio' pointing to your local instance. Run bash setup-mc.sh. Verify with mc admin info myminio. Pitfall: a firewall blocking port 9000; allow it. mc is essential for scripting buckets and policies.
Create the Data Lake Bucket
Buckets are logical containers in S3/MinIO, comparable to root directories. For a data lake, create a 'datalake' bucket with a public-read policy (local development only). This enables direct queries without heavy authentication. Use mc so the operation is idempotent: the script can be rerun without errors.
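If you prefer to stay in Python, the same idempotent creation can be sketched with the minio SDK (an alternative to the mc script below, using the tutorial's 'datalake' bucket and local credentials):

from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

BUCKET = "datalake"

# Idempotent: only create the bucket if it does not already exist
if not client.bucket_exists(BUCKET):
    client.make_bucket(BUCKET)
    print(f"Created bucket '{BUCKET}'")
else:
    print(f"Bucket '{BUCKET}' already exists")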
Create Bucket and Read Policy
#!/bin/bash
BUCKET=datalake

# Create the bucket if it does not exist (idempotent)
mc mb --ignore-existing myminio/$BUCKET

# JSON policy for public read (local dev only)
cat > public-read.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::$BUCKET/*"]
    }
  ]
}
EOF

# Apply the anonymous read policy (older mc releases use mc policy set-json)
mc anonymous set-json public-read.json myminio/$BUCKET

# Verify
echo "Policy:"
mc anonymous get-json myminio/$BUCKET
mc ls myminio

Creates the 'datalake' bucket and applies a JSON policy allowing anonymous reads (acceptable locally). Run bash create-bucket.sh, then check the MinIO UI. Pitfall: a malformed policy is rejected, so the JSON must be valid; in production, use fine-grained IAM instead. This makes the lake queryable without credentials.
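To convince yourself that anonymous reads work, here is a small sketch: upload a test object with authenticated credentials, then fetch it back with a plain unauthenticated HTTP GET (the test.txt key is only for illustration):

from io import BytesIO
from urllib.request import urlopen
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

# Upload a tiny object using authenticated credentials
payload = b"hello data lake"
client.put_object("datalake", "test.txt", BytesIO(payload), length=len(payload))

# Fetch it back anonymously: the public-read policy makes this succeed
with urlopen("http://localhost:9000/datalake/test.txt") as resp:
    print(resp.status, resp.read().decode())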
Ingest Partitioned Data
Simulate sales data: 1000 rows (id, date, product, amount, region). Partitioning by year/month optimizes queries through pruning. Write the data as Parquet (columnar, compressed, embedded schema) instead of CSV. Use the MinIO Python SDK for the uploads; each object PUT is atomic.
Python Script: Generate and Upload Parquet Data
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from minio import Minio
from io import BytesIO
import numpy as np

client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False
)

# Generate sample data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2023-12-31', freq='H')[:1000]
data = pd.DataFrame({
    'id': range(1000),
    'date': dates,
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 1000),
    'amount': np.random.uniform(100, 1000, 1000),
    'region': np.random.choice(['EU', 'US', 'ASIA'], 1000)
})

# Partition by year/month
data['year'] = data['date'].dt.year.astype(str)
data['month'] = data['date'].dt.month.astype(str).str.zfill(2)

for (year, month), group in data.groupby(['year', 'month']):
    # Write each partition to an in-memory Parquet buffer
    table = pa.Table.from_pandas(group)
    buffer = BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)
    client.put_object(
        "datalake",
        f"sales/year={year}/month={month}/sales.parquet",
        buffer,
        length=buffer.getbuffer().nbytes,
        content_type="application/parquet"
    )
    print(f"Uploaded: year={year}/month={month}")

print("Ingestion complete. Check the MinIO UI.")

Generates a partitioned sales dataset, converts it to Parquet (typically several times smaller and much faster to scan than CSV), and uploads it via the MinIO SDK. Year/month partitioning enables partition pruning in downstream queries. Run python ingest-data.py. Pitfall: pyarrow is needed to preserve the schema; the in-memory BytesIO buffer avoids temporary files. Scales to terabytes when paired with Dask.
Query the Data Lake with SQL
DuckDB is an embedded OLAP engine that reads Parquet directly from S3 with no ETL step. Configure the endpoint and credentials so it talks to MinIO. Queries benefit from filter pushdown (partition pruning) and vectorized aggregation. Analogy: SQL over files as if they were database tables.
Python Script: SQL Queries with DuckDB
import duckdb

con = duckdb.connect()

# Install/load the S3 extension
con.execute("INSTALL httpfs; LOAD httpfs;")

# Configure MinIO as an S3 endpoint (plain HTTP, path-style URLs)
con.execute("""
    SET s3_endpoint='localhost:9000';
    SET s3_access_key_id='minioadmin';
    SET s3_secret_access_key='minioadmin';
    SET s3_region='us-east-1';
    SET s3_use_ssl=false;
    SET s3_url_style='path';
""")

# Partitioned query: 2023 EU sales
result = con.execute(
    """
    SELECT
        product,
        AVG(amount) AS avg_amount,
        COUNT(*) AS sales_count
    FROM 's3://datalake/sales/year=2023/*/*.parquet'
    WHERE region = 'EU'
    GROUP BY product
    ORDER BY avg_amount DESC
    """
).fetchdf()
print(result)

# Optional export
result.to_csv('query-result.csv', index=False)
print("Result exported to CSV.")

Configures DuckDB for MinIO/S3 and runs SQL on the partitioned Parquet files (only year=2023 paths are read). Aggregations are pushed down and vectorized. Run python query-lake.py. Pitfall: forgetting LOAD httpfs leads to an endpoint error, and the configured keys (or the public-read policy) must allow the read. Scales comfortably on a laptop; for petabyte workloads you would move to a distributed engine.
Add Incremental Data
Data lakes evolve: new batches are appended without downtime. Reuse the ingest pattern for new data; Parquet's embedded schema lets columns be added over time (schema evolution). Each object upload is atomic, so readers never see partial files.
Append 2024 Data Script
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from minio import Minio
from io import BytesIO
import numpy as np

client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

# New 2024 data (schema evolves: adds 'category')
np.random.seed(123)
dates = pd.date_range('2024-01-01', '2024-03-31', freq='H')[:500]
data_new = pd.DataFrame({
    'id': range(1000, 1500),
    'date': dates,
    'product': np.random.choice(['Laptop Pro', 'Phone Ultra'], 500),
    'amount': np.random.uniform(500, 2000, 500),
    'region': np.random.choice(['EU', 'US'], 500),
    'category': np.random.choice(['Electronics', 'Mobile'], 500)  # Schema evolution
})

data_new['year'] = data_new['date'].dt.year.astype(str)
data_new['month'] = data_new['date'].dt.month.astype(str).str.zfill(2)

for (year, month), group in data_new.groupby(['year', 'month']):
    table = pa.Table.from_pandas(group)
    buffer = BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)
    client.put_object(
        "datalake",
        f"sales/year={year}/month={month}/sales.parquet",
        buffer,
        length=buffer.getbuffer().nbytes,
        content_type="application/parquet"
    )
    print(f"Appended: year={year}/month={month}")

print("Append complete. Re-run the queries to verify.")

Appends 2024 data with a new 'category' column (schema evolution carried by the Parquet files). Same partitioning pattern; run it after the initial ingest. Pitfall: accidental overwrite of an existing partition file; use timestamped or batch-specific filenames in production. To read old and new files together, ask DuckDB to align schemas by name, as shown below.
Best Practices
- Always partition: year/month/day keys enable pruning, often yielding order-of-magnitude faster scans.
- Use table formats: Migrate to Iceberg/Delta Lake for ACID, time-travel (vs raw Parquet).
- Central catalog: Integrate Hive Metastore or AWS Glue for metadata.
- Data quality: Validate schemas and NULLs on ingestion (e.g. Great Expectations); a lightweight sketch follows this list.
- Security: Least-privilege policies, SSE-S3 encryption; no public in production.
- Monitoring: Prometheus + MinIO metrics for usage/throughput.
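As a lightweight illustration of ingestion-time validation (not a substitute for Great Expectations), a sketch that checks an incoming batch against an expected pyarrow schema before uploading; the expected_schema below is an assumption based on this tutorial's sales data:

import pandas as pd
import pyarrow as pa

# Expected schema for the sales data used in this tutorial (assumed)
expected_schema = pa.schema([
    ("id", pa.int64()),
    ("date", pa.timestamp("ns")),
    ("product", pa.string()),
    ("amount", pa.float64()),
    ("region", pa.string()),
])

def validate_batch(df: pd.DataFrame) -> pa.Table:
    """Cast the batch to the expected schema and reject NULLs in key columns."""
    table = pa.Table.from_pandas(df, preserve_index=False)
    # Raises if a column is missing or a value cannot be cast to the expected type
    table = table.select(expected_schema.names).cast(expected_schema)
    for col in ("id", "date", "amount"):
        if table.column(col).null_count > 0:
            raise ValueError(f"NULL values found in required column '{col}'")
    return table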
Common Pitfalls to Avoid
- Data swamp: Without governance/catalog, data becomes unmanageable—enforce tagging/zones (raw/curated).
- No partitioning/compression: Full scans mean slow queries and high costs; always use Snappy or Zstd (see the snippet after this list).
- Forced schema-on-write: Kills lake flexibility—stick to schema-on-read, fix at consumption.
- No versioning: Loses audits—use Iceberg snapshots or S3 versioning.
- Local vs production mismatch: Public policy OK for dev, but VPC/IAM in cloud.
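For reference, compression is a one-argument change in the ingestion scripts above; a minimal sketch with pyarrow (assuming zstd is available in your pyarrow build):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from io import BytesIO

df = pd.DataFrame({"id": range(1000), "amount": [1.0] * 1000})
table = pa.Table.from_pandas(df, preserve_index=False)

# Columnar compression: pass compression= to write_table (the default is snappy)
buffer = BytesIO()
pq.write_table(table, buffer, compression="zstd")
print(f"zstd-compressed Parquet size: {buffer.getbuffer().nbytes} bytes")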
Next Steps
- Production: Migrate to AWS S3 + Athena + Glue Catalog (swap endpoint).
- Advanced: Apache Iceberg with Spark/Trino for transactional tables.
- Tools: Airflow for orchestration, dbt for transformations.
- Resources: MinIO Docs, DuckDB S3.