How to Create an ETL Job with Talend in 2026


Introduction

Talend Open Studio remains a go-to open-source ETL tool for data engineers in 2026, thanks to a drag-and-drop interface that speeds up development while generating native, scalable Java code. Why use it? It handles large data volumes (Big Data with Spark), integrates AI features for auto-documenting jobs, and connects readily to cloud platforms (AWS, Azure) and NoSQL databases. Data teams famously spend most of their time on data plumbing; a visual ETL tool like Talend can reclaim a large share of that time.

This intermediate tutorial walks you through building a full ETL job: read a customers CSV, transform it (cleaning, aggregation), and load it into MySQL. We'll cover installation, context setup, key components (tFileInputDelimited, tMap, tMySQLOutput), code generation, and CLI execution. By the end, you'll have a production-ready job; bookmark it for your real projects. Ready to supercharge your pipelines?

Prerequisites

  • Java 8 or 11 installed (OpenJDK recommended; check with java -version).
  • A local MySQL 8+ server (user root, password root).
  • 2 GB of disk space for Talend Open Studio 8.0+.
  • Basic knowledge of SQL, Java, and CSV.
  • Download Talend Open Studio for Data Integration from talend.com (free).

Download and Install Talend

install-talend.sh
#!/bin/bash

# Download Talend Open Studio 8.0 (adapt the URL to the latest 2026 release)
URL="https://github.com/Talend/open-studio-dist/releases/download/8.0.2/TOS_DI-20230220_1200-V8.0.2.zip"
wget "$URL"

# Unzip into /opt (needs write access there; pick a user-writable directory to avoid sudo)
unzip TOS_DI-*.zip -d /opt/talend/

# Launch (Linux/Mac; for Windows: double-click)
cd /opt/talend/TOS_DI-*/
./Talend-Studio

# Verify: Studio opens with Repository and Integration perspectives

This script automates downloading and launching Talend. It unpacks the archive into /opt for a system-wide install, which requires write access there; unzip into a directory you own (e.g. under your home directory) if you want to avoid sudo. Note: on Windows, use PowerShell or 7-Zip; verify checksums before any production install.

Create a New Project and Job

Launch Talend Studio: accept the license, create a new project 'MonProjetETL' via File > New > Project. In Repository (left view), right-click Jobs > Create job named 'CSV_to_MySQL'.

The interface opens: component palette on the right (search for tFileInputDelimited), design area in the center, schema editor at the bottom. Think of it as Lego for data flows: each block is a reusable component. Drag tFileInputDelimited onto the canvas to read the CSV. Configure: File name = "/path/to/clients.csv" (create a sample CSV with the header id,nom,email and a row such as 1,Durand,durand@test.com).
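Under the hood, tFileInputDelimited splits each line on the field separator and converts the fields to the schema types. A minimal sketch of that parsing step, assuming the three-column schema above (the `DelimitedParseDemo` class and `Client` struct are illustrative, not Talend API):

```java
public class DelimitedParseDemo {
    // Plain struct mirroring the 3-column schema: id (Integer), nom, email
    public static class Client {
        public final int id;
        public final String nom;
        public final String email;
        public Client(int id, String nom, String email) {
            this.id = id; this.nom = nom; this.email = email;
        }
    }

    // Split one CSV line the way tFileInputDelimited does with "," as field separator
    public static Client parseLine(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields, got " + fields.length);
        }
        return new Client(Integer.parseInt(fields[0].trim()), fields[1].trim(), fields[2].trim());
    }

    public static void main(String[] args) {
        Client c = parseLine("1,Durand,durand@test.com");
        System.out.println(c.id + " / " + c.nom + " / " + c.email); // 1 / Durand / durand@test.com
    }
}
```

In the Studio you never write this loop yourself; the component generates the equivalent code from the schema you define.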

Set Up Context (Global Variables)

context.properties
db_host=localhost
db_port=3306
db_name=etl_db
db_user=root
db_pass=root
csv_input_path=/tmp/clients.csv
csv_output_path=/tmp/clients_clean.csv
batch_size=1000
log_level=INFO

Contexts externalize configs (dev/prod) without recompiling. In Talend, go to Repository > Contexts > Create context group and add these vars (String/Int types). Import the group into the job via its Contexts tab, then reference variables as context.db_host; Ctrl+Space autocompletes them in expressions. Pitfall: forgetting quotes around paths that contain spaces.
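Outside the Studio, this file is an ordinary .properties file, so an exported job can reload it with java.util.Properties. A sketch of that pattern (the `ContextLoader` class name is ours, not generated by Talend):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class ContextLoader {
    // Load context variables from a .properties source (dev/prod swap = different file)
    public static Properties load(Reader source) throws IOException {
        Properties ctx = new Properties();
        ctx.load(source);
        return ctx;
    }

    public static void main(String[] args) throws IOException {
        // Inline sample standing in for context.properties
        String devConfig = "db_host=localhost\ndb_port=3306\nbatch_size=1000\n";
        Properties ctx = load(new StringReader(devConfig));
        // Properties stores Strings only, so parse numeric values explicitly
        int batchSize = Integer.parseInt(ctx.getProperty("batch_size", "1000"));
        System.out.println(ctx.getProperty("db_host") + ":" + ctx.getProperty("db_port")
                + " (batch " + batchSize + ")");
    }
}
```

Switching environments then means pointing the job at a different properties file, with no recompilation.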

Define Input Schema and Add tMap

Input schema (double-click tFileInput): Edit schema > +3 cols: id (Integer), nom (String 50), email (String 100). Use Guess schema if CSV has headers.

Connect it to tMap (drag from the palette). tMap is the transformation heart: input table on the left, output on the right. Drag cols to the output and add Java expressions, e.g. email_clean = row1.email.toUpperCase(). Row filter: !row1.email.contains("spam"). Output: 3 cols + total_records (Integer, via globalMap). Analogy: Excel formulas at industrial scale.

SQL Script to Prepare MySQL Database

create_table.sql
CREATE DATABASE IF NOT EXISTS etl_db;
USE etl_db;

CREATE TABLE IF NOT EXISTS clients_clean (
  id INT PRIMARY KEY,
  nom VARCHAR(50) NOT NULL,
  email VARCHAR(100) UNIQUE,
  processed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert sample for testing
INSERT INTO clients_clean (id, nom, email) VALUES (1, 'Dupont', 'dupont@test.com');

Run this SQL in MySQL Workbench or the CLI before running the job. It creates the DB/table with constraints. In production, add indexes on email. Pitfall: schema type mismatches between Talend and the DB cause a RuntimeException at load time.
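Before wiring tMySQLOutput, it helps to verify the JDBC URL and credentials outside Talend. A small standalone check, assuming the root/root credentials from the prerequisites (the `JdbcUrlCheck` class name and the URL parameters are our choice):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class JdbcUrlCheck {
    // Build the JDBC URL the job will use from context values
    public static String buildUrl(String host, int port, String db) {
        return "jdbc:mysql://" + host + ":" + port + "/" + db
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        String url = buildUrl("localhost", 3306, "etl_db");
        System.out.println("Testing " + url);
        // Requires a running MySQL server and the Connector/J jar on the classpath
        try (Connection conn = DriverManager.getConnection(url, "root", "root")) {
            System.out.println("Connection OK: " + conn.getMetaData().getDatabaseProductVersion());
        } catch (SQLException e) {
            System.err.println("Connection failed: " + e.getMessage());
        }
    }
}
```

If this prints "Connection failed", fix the server or credentials before touching the job; it saves a debugging round-trip inside the Studio.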

Add Database Output and Run the Job

Connect tMap output to tMySQLOutput (config: DB connection > Repository > Create MySQL > host/context.db_host, etc.; Action: Insert; Sync schema with tMap output).

Run: toolbar play ▶️. Logs at bottom: stats on rows processed. Error? Check properties. Export job: right-click > Export Java Project (for CLI).

Java Expression Examples in tMap

tmap_expression.java
// Expression to clean an email (copy into the tMap output cell)
row1.email != null ? row1.email.toLowerCase().replaceAll("\\s+", "") : ""

// Row filter (If condition) -- null-safe
row1.email != null && !row1.email.contains("@fake.com") && row1.nom != null && row1.nom.length() > 2

// Global variable for a running counter
globalMap.get("total_records") == null ? 1 : ((Integer) globalMap.get("total_records")) + 1

These snippets go into tMap cells and filters. tMap expressions are plain Java, which is where the power comes from: regex, TalendDate, custom routines. Pitfall: null checks are mandatory, or an NPE crashes the job.
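To avoid repeating those null checks in every cell, the logic can move into a custom routine (Repository > Routines) and be called from tMap as a one-liner. A minimal sketch; the class and method names below are illustrative, not Talend built-ins:

```java
public class ValidationRoutines {
    // Null-safe email cleanup: trim, lowercase, strip inner whitespace
    public static String cleanEmail(String email) {
        if (email == null) {
            return "";
        }
        return email.trim().toLowerCase().replaceAll("\\s+", "");
    }

    // Null-safe row filter, usable directly as a tMap If condition
    public static boolean isAcceptable(String email, String nom) {
        return email != null && !email.contains("@fake.com")
                && nom != null && nom.length() > 2;
    }

    public static void main(String[] args) {
        System.out.println(cleanEmail("  Durand@Test.COM ")); // durand@test.com
        System.out.println(isAcceptable(null, "Durand"));     // false
    }
}
```

In the tMap cell you would then write ValidationRoutines.cleanEmail(row1.email), keeping the mapping readable.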

Generated Java Code for the Complete Job (Simplified)

CSV_to_MySQL_0_1.java
import routines.Numeric;
import routines.DataOperations;
import routines.TalendDataGenerator;
import routines.TalendString;
import routines.system.*;
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.*;

public class CSV_to_MySQL_0_1 {

  protected static void logIgnoredError(String message, Throwable cause) {
    System.err.println(message);
    if (cause != null) { cause.printStackTrace(); }
  }

  public static class row1Struct {
    public int id;
    public String nom;
    public String email;
  }

  public static class out1Struct {
    public int id;
    public String nom;
    public String email_clean;
  }

  private static class Context {
    public String db_host = "localhost";
    // ... other contexts
  }

  public static void main(String[] args) {
    Context ctx = new Context();
    java.util.Map<String, Object> globalMap = java.util.Collections.synchronizedMap(new java.util.HashMap<String, Object>());
    globalMap.put("ctx", ctx);

    try {
      // Simulate tMySQLOutput: one connection and prepared statement for the
      // whole job, not one per row
      Class.forName("com.mysql.cj.jdbc.Driver");
      try (Connection conn = DriverManager.getConnection("jdbc:mysql://" + ctx.db_host + "/etl_db", "root", "root");
           PreparedStatement pstmt = conn.prepareStatement("INSERT INTO clients_clean (id, nom, email) VALUES (?, ?, ?)");
           BufferedReader br = new BufferedReader(new FileReader(new File("/tmp/clients.csv")))) {

        // Simulate tFileInputDelimited
        String line;
        int rowCount = 0;
        int inserted = 0;
        while ((line = br.readLine()) != null) {
          if (rowCount++ == 0) continue; // skip header
          String[] fields = line.split(",");
          row1Struct row1 = new row1Struct();
          row1.id = Integer.parseInt(fields[0]);
          row1.nom = fields[1];
          row1.email = fields[2];

          // tMap transformation
          out1Struct out1 = new out1Struct();
          out1.id = row1.id;
          out1.nom = row1.nom.toUpperCase();
          out1.email_clean = (row1.email != null ? row1.email.toLowerCase() : "");

          // tMap row filter
          if (!out1.email_clean.contains("fake")) {
            pstmt.setInt(1, out1.id);
            pstmt.setString(2, out1.nom);
            pstmt.setString(3, out1.email_clean);
            pstmt.executeUpdate();
            inserted++;
            System.out.println("Inserted: " + out1.id);
          }
        }
        globalMap.put("total_records", inserted); // rows inserted, not lines read
        System.out.println("Job complete: " + globalMap.get("total_records") + " rows.");
      }
    } catch (Exception e) {
      logIgnoredError("Main job error", e);
    }
  }
}

This Java class is a simplified version of what a Talend job export produces (generate via Export > Standalone Job). It reads the CSV, applies the tMap logic (uppercase/lowercase, filter), and inserts into MySQL when the filter passes. Add the MySQL Connector/J jar to lib. Compile: javac -cp mysql-connector.jar CSV_to_MySQL_0_1.java. Pitfall: handle UTF-8 encoding for accented characters.

Run the Job from Command Line

run-job.sh
#!/bin/bash

# After Talend export (job zip), unzip and cd
cd CSV_to_MySQL_0_1/

# Compile (add Talend jars + MySQL)
javac -cp "lib/*:mysql-connector-j-8.0.33.jar" *.java

# Run with contexts (-D JVM flags must come BEFORE the class name;
# exported Talend jobs take --context_param for context variables)
java -Dfile.encoding=UTF-8 -cp "lib/*:mysql-connector-j-8.0.33.jar:.:$CLASSPATH" CSV_to_MySQL_0_1 --context_param db_host=localhost

# Logs: rows processed, errors in stderr

For production, export as a Standalone Job and run it headless. Add the Talend routine and DB driver jars to the classpath. Override context variables with --context_param. Pitfall: use absolute paths for CSV/DB, and test with nohup for background runs.

Best Practices

  • Modularize: Reusable Java routines (Repository > Routines) for custom logging/validation.
  • Error handling: tWarn/tDie + OnComponentError for reject flows.
  • Performance: Enable parallelization (tParallelize), batchSize=10000 on DB output.
  • Versioning: Native Git integration, branches per environment.
  • Security: Encrypt context passwords (Talend Crypt), avoid hardcoded creds.
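The batchSize recommendation above corresponds to JDBC's addBatch()/executeBatch() on the PreparedStatement instead of one executeUpdate() per row. The buffering idea can be sketched generically (the `BatchWriter` class is illustrative, not a Talend API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> flusher; // e.g. binds rows and calls executeBatch()
    private final List<T> buffer = new ArrayList<>();
    private int flushCount = 0;

    public BatchWriter(int batchSize, Consumer<List<T>> flusher) {
        this.batchSize = batchSize;
        this.flusher = flusher;
    }

    // Buffer one row; hand a full batch to the flusher when the threshold is hit
    public void add(T row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Flush any remaining rows (call once after the last add)
    public void flush() {
        if (!buffer.isEmpty()) {
            flusher.accept(new ArrayList<T>(buffer));
            buffer.clear();
            flushCount++;
        }
    }

    public int getFlushCount() { return flushCount; }

    public static void main(String[] args) {
        BatchWriter<Integer> w = new BatchWriter<>(3, batch ->
                System.out.println("flush " + batch.size() + " rows"));
        for (int i = 0; i < 7; i++) w.add(i);
        w.flush(); // final partial batch of 1
        System.out.println("flushes: " + w.getFlushCount()); // 3
    }
}
```

With 7 rows and batchSize 3, the database sees 3 round-trips instead of 7; at batchSize=10000 the savings on large loads are substantial.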

Common Errors to Avoid

  • Schema mismatch: Always sync input/output schemas, re-run Guess.
  • NullPointer in tMap: Use TalendNullHandling routines.
  • OutOfMemory: Add -Xmx4g JVM args, limit rows in tests.
  • DB connection: Test JDBC URL manually before job.

Next Steps