How to Master Amazon Polly TTS in 2026

Introduction

Amazon Polly, AWS's text-to-speech (TTS) service, offers realistic Neural voices, SSML support, and lexicons for precise pronunciation. This expert tutorial walks you through integrating it into a Node.js app, from the basics to advanced use cases like audio streaming and phonetic customization. Why does it matter? Voice apps (AI assistants, audiobooks, accessibility) require low latency and human-like quality, for example converting dynamic text to smooth MP3 within a few hundred milliseconds. Drawing on 15 years of experience, I share production-oriented configurations in 6 coded steps. Ready to give your data a voice?

Prerequisites

  • AWS account with AmazonPollyFullAccess permissions (AWS managed IAM policy)
  • Node.js 20+ and npm/yarn
  • AWS keys: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION=us-east-1 as environment variables
  • Advanced knowledge of async/await and Node.js streams
  • Audio player like VLC to test MP3 outputs

Installation and AWS SDK v3 Setup

terminal
mkdir polly-expert && cd polly-expert
npm init -y
npm install @aws-sdk/client-polly dotenv
npm install -D @types/node typescript ts-node

cat > .env << EOF
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
EOF

cat > tsconfig.json << 'EOF'
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "strict": true,
    "esModuleInterop": true
  }
}
EOF

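This script sets up a TypeScript project with AWS SDK v3, whose modular packages let you install only the Polly client. Environment variables keep credentials out of the source code: never hardcode them. Pitfall: a .env file is only suitable for local development, so add it to .gitignore and, in production, rely on an IAM role or a secrets store such as SSM Parameter Store instead.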
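A missing variable usually surfaces later as an opaque credentials or region error, so it can help to fail fast at startup. The sketch below is one way to do that; `missingEnvVars` and `REQUIRED_VARS` are hypothetical names, and the list assumes key-based credentials as in the .env file above (with an IAM role you would only check the region).

```typescript
// Hypothetical helper: report which required variables are unset or blank.
const REQUIRED_VARS: readonly string[] = [
  'AWS_ACCESS_KEY_ID',
  'AWS_SECRET_ACCESS_KEY',
  'AWS_REGION',
];

function missingEnvVars(
  env: Record<string, string | undefined>,
  required: readonly string[] = REQUIRED_VARS,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    // Treat undefined, empty, and whitespace-only values as missing.
    return value === undefined || value.trim() === '';
  });
}

// Typical use, right after dotenv.config():
// const missing = missingEnvVars(process.env);
// if (missing.length > 0) throw new Error(`Missing env vars: ${missing.join(', ')}`);
```

The function takes the environment as a parameter rather than reading process.env directly, which keeps it easy to test.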

First Call: Basic TTS Synthesis

Before diving into expert features, let's validate the connection. This code generates a simple MP3 with a standard voice; MP3 is chosen as the output format for web compatibility.

Basic Functional TTS Script

basic-polly.ts
import { PollyClient, SynthesizeSpeechCommand, OutputFormat, Engine, type SynthesizeSpeechCommandInput } from '@aws-sdk/client-polly';
import * as fs from 'fs';
import type { Readable } from 'stream';
import * as dotenv from 'dotenv';
dotenv.config();

const client = new PollyClient({ region: process.env.AWS_REGION });

const synthesize = async (text: string) => {
  // Typing the input lets the compiler check the voice, engine, and format values.
  const input: SynthesizeSpeechCommandInput = {
    Text: text,
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Joanna', // en-US voice
    Engine: Engine.STANDARD,
  };
  const command = new SynthesizeSpeechCommand(input);
  const response = await client.send(command);
  if (response.AudioStream) {
    // In Node.js, the SDK returns AudioStream as a Readable.
    const audioBuffer = await streamToBuffer(response.AudioStream as Readable);
    fs.writeFileSync('output.mp3', audioBuffer);
    console.log('✅ File generated: output.mp3');
  }
};

// Collects the audio stream into one Buffer (fine for short clips).
const streamToBuffer = (stream: Readable): Promise<Buffer> =>
  new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on('data', (chunk: Buffer) => chunks.push(chunk));
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });

synthesize('Hello, this is an Amazon Polly test in 2026.').catch(console.error);

// Run: npx ts-node basic-polly.ts

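This complete script synthesizes text to MP3 using SynthesizeSpeechCommand. streamToBuffer collects the binary stream into a single in-memory Buffer, which is fine for short clips; for long audio, pipe the stream to its destination instead. Pitfall: Engine defaults to standard when omitted, so neural-only voices fail unless you request the neural engine explicitly. Always test locally first.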
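As an alternative to wiring up 'data', 'end', and 'error' listeners by hand, a Node.js Readable can be consumed with for await, which rejects automatically if the stream errors. The sketch below assumes the stream yields byte chunks; `collectStream` and `demo` are hypothetical names, and the in-memory stream merely stands in for Polly's AudioStream.

```typescript
import { Readable } from 'node:stream';

// Collect any async-iterable byte stream into a single Buffer.
// A stream error makes the for-await loop throw, so the promise rejects.
async function collectStream(stream: AsyncIterable<Uint8Array>): Promise<Buffer> {
  const chunks: Buffer[] = [];
  for await (const chunk of stream) {
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Local demo: an in-memory stream replaces the network response.
async function demo(): Promise<string> {
  const fake = Readable.from([Buffer.from('ID3'), Buffer.from('-audio')]);
  const buffer = await collectStream(fake);
  return buffer.toString('utf8');
}
```

Like the listener-based version, this still buffers the whole clip in memory, so it suits short syntheses only.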

Advanced Level: SSML for Expert Prosody

SSML (Speech Synthesis Markup Language) lets you control intonation, pauses, and emphasis, like a vocal conductor. Example: Slow down numbers for clarity.

TTS with Custom SSML

ssml-polly.ts
import { PollyClient, SynthesizeSpeechCommand, OutputFormat, Engine, type SynthesizeSpeechCommandInput } from '@aws-sdk/client-polly';
import * as fs from 'fs';
import type { Readable } from 'stream';
import * as dotenv from 'dotenv';
dotenv.config();

const client = new PollyClient({ region: process.env.AWS_REGION });

const synthesizeSSML = async (ssml: string) => {
  const input: SynthesizeSpeechCommandInput = {
    Text: ssml,
    TextType: 'ssml',
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Mathieu',
    // Mathieu is a standard voice, and <emphasis> and prosody pitch need the standard engine.
    Engine: Engine.STANDARD,
  };
  const command = new SynthesizeSpeechCommand(input);
  const response = await client.send(command);
  if (response.AudioStream) {
    // In Node.js, the SDK returns AudioStream as a Readable.
    const audioBuffer = await streamToBuffer(response.AudioStream as Readable);
    fs.writeFileSync('ssml-output.mp3', audioBuffer);
    console.log('✅ SSML generated: ssml-output.mp3');
  }
};

const streamToBuffer = (stream: Readable): Promise<Buffer> =>
  new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on('data', (chunk: Buffer) => chunks.push(chunk));
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });

// Advanced SSML: emphasis, pause, rate (French text for the fr-FR voice)
synthesizeSSML(`<speak>
  Bonjour !<break time="500ms"/>
  <prosody rate="slow">Le prix est <emphasis level="strong">99,99 €</emphasis>.</prosody>
  <prosody pitch="high">Parfait pour 2026 !</prosody>
</speak>`).catch(console.error);

// Run: npx ts-node ssml-polly.ts

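Setting TextType: 'ssml' tells Polly to parse the markup rather than read it aloud. <break>, <prosody>, and <emphasis> structure the speech. Neural voices support only a subset of SSML (<emphasis> and the pitch attribute of <prosody> are not available with them), so use the standard engine when you rely on those tags. Pitfall: Invalid SSML is rejected with InvalidSsmlException, so validate the markup first, for example by testing it in the Polly console.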
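When the spoken text comes from users, raw characters such as & or < either break the SSML or let callers inject their own tags. A minimal sketch of escaping before wrapping text in <speak> follows; `escapeSsml` and `buildSsml` are hypothetical helpers, not part of the AWS SDK.

```typescript
// Escape the five XML special characters so user text cannot inject SSML tags.
// The ampersand is replaced first so later entities are not double-escaped.
function escapeSsml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical builder: wraps escaped sentences in <speak>,
// inserting a <break> of pauseMs milliseconds between them.
function buildSsml(sentences: string[], pauseMs = 500): string {
  const pause = `<break time="${Math.max(0, Math.round(pauseMs))}ms"/>`;
  return `<speak>${sentences.map(escapeSsml).join(pause)}</speak>`;
}

// Example: buildSsml(['Tom & Jerry', 'Bye'], 300)
```

The resulting string can be passed as Text with TextType: 'ssml'; only the tags the builder itself emits reach Polly as markup.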

Custom Lexicons for Pronunciation

Lexicons fix phonetics (e.g., acronyms, proper names). Create one with PutLexicon for reuse across calls.

Lexicon Creation and Usage

lexicon-polly.ts
import { PollyClient, PutLexiconCommand, SynthesizeSpeechCommand, OutputFormat, Engine, DeleteLexiconCommand, type SynthesizeSpeechCommandInput } from '@aws-sdk/client-polly';
import * as fs from 'fs';
import type { Readable } from 'stream';
import * as dotenv from 'dotenv';
dotenv.config();

const client = new PollyClient({ region: process.env.AWS_REGION });

const lexiconName = 'ExpertLexicon2026';

const createLexicon = async () => {
  const lexicon = {
    Name: lexiconName,
    // The XML declaration must be the very first characters of the content.
    Content: `<?xml version="1.0" encoding="UTF-8"?>
    <lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="ipa" xml:lang="fr-FR">
      <lexeme>
        <grapheme>Learni</grapheme>
        <alias>le ar ni</alias>
      </lexeme>
      <lexeme>
        <grapheme>API</grapheme>
        <alias>a p i</alias>
      </lexeme>
    </lexicon>`,
  };
  const command = new PutLexiconCommand(lexicon);
  await client.send(command);
  console.log('✅ Lexicon created');
};

const synthesizeWithLexicon = async () => {
  const input: SynthesizeSpeechCommandInput = {
    Text: 'Learni Dev publie une API Polly en 2026.',
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Lea', // fr-FR voice available on the neural engine (Celine is standard-only)
    Engine: Engine.NEURAL,
    LexiconNames: [lexiconName],
  };
  const command = new SynthesizeSpeechCommand(input);
  const response = await client.send(command);
  if (response.AudioStream) {
    // In Node.js, the SDK returns AudioStream as a Readable.
    const audioBuffer = await streamToBuffer(response.AudioStream as Readable);
    fs.writeFileSync('lexicon-output.mp3', audioBuffer);
    console.log('✅ With lexicon: lexicon-output.mp3');
  }
};

const cleanup = async () => {
  const delCmd = new DeleteLexiconCommand({ Name: lexiconName });
  await client.send(delCmd);
  console.log('🧹 Lexicon deleted');
};

const streamToBuffer = (stream: Readable): Promise<Buffer> =>
  new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on('data', (chunk: Buffer) => chunks.push(chunk));
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });

// Wrapped in a function because top-level await is unavailable in CommonJS output.
const main = async () => {
  await createLexicon();
  try {
    await synthesizeWithLexicon();
  } finally {
    await cleanup(); // delete the lexicon even if synthesis fails
  }
};

main().catch(console.error);

// Run: npx ts-node lexicon-polly.ts

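The XML lexicon maps 'Learni' and 'API' to spelled-out aliases; the alphabet attribute only matters for <phoneme> entries, which are written in IPA or X-SAMPA. LexiconNames applies the lexicon at synthesis time. Cleaning up afterwards keeps you under the lexicon quota (100 lexicons per account and region at the time of writing; check Service Quotas). Pitfall: a phoneme written in the wrong alphabet ('ipa' vs 'x-sampa') is mispronounced or dropped, and a lexicon only applies when its xml:lang matches the voice's language. Inspect the stored lexicon with GetLexicon.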
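Hand-written lexicon XML gets error-prone once the word list grows. The sketch below generates an alias-only lexicon document from a list of entries; `AliasEntry`, `escapeXml`, and `buildAliasLexicon` are hypothetical names, and the output mirrors the structure of the lexicon in the script above.

```typescript
// Hypothetical entry type: a written form and the alias Polly should read instead.
interface AliasEntry {
  grapheme: string;
  alias: string;
}

// Escape characters that would otherwise break the XML document.
const escapeXml = (s: string): string =>
  s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');

// Build a PLS lexicon document containing only <alias> lexemes.
function buildAliasLexicon(entries: AliasEntry[], lang = 'fr-FR'): string {
  const lexemes = entries.map((e) =>
    [
      '  <lexeme>',
      `    <grapheme>${escapeXml(e.grapheme)}</grapheme>`,
      `    <alias>${escapeXml(e.alias)}</alias>`,
      '  </lexeme>',
    ].join('\n'),
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<lexicon version="1.0"',
    '  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"',
    `  alphabet="ipa" xml:lang="${lang}">`,
    ...lexemes,
    '</lexicon>',
  ].join('\n');
}
```

The returned string can be passed as Content to PutLexiconCommand. Building the document from an array also guarantees the XML declaration comes first, with no leading whitespace.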

Real-Time Streaming for Low Latency

For live apps (chatbots), stream the audio directly to the client, over a WebSocket or an HTTP response, instead of writing it to disk first.

TTS Streaming Server (Next.js API)

app/api/tts/route.ts
import { PollyClient, SynthesizeSpeechCommand, OutputFormat, Engine } from '@aws-sdk/client-polly';
import { NextRequest, NextResponse } from 'next/server';

// Next.js loads .env files itself, so dotenv is not needed here.
const client = new PollyClient({ region: process.env.AWS_REGION });

export async function POST(request: NextRequest) {
  try {
    const { text } = await request.json();
    // SynthesizeSpeech is meant for short inputs; reject anything else early.
    if (typeof text !== 'string' || text.length === 0 || text.length > 3000) {
      return NextResponse.json({ error: 'Invalid text' }, { status: 400 });
    }

    // Passing the literal inline lets TypeScript check the voice and engine values.
    const command = new SynthesizeSpeechCommand({
      Text: text,
      OutputFormat: OutputFormat.MP3,
      VoiceId: 'Lea',
      Engine: Engine.NEURAL,
    });
    const response = await client.send(command);

    if (response.AudioStream) {
      // Convert the SDK stream to a web ReadableStream, which Response accepts as a body.
      return new NextResponse(response.AudioStream.transformToWebStream(), {
        headers: {
          'Content-Type': 'audio/mpeg',
          'Cache-Control': 'no-cache',
        },
      });
    }
    return NextResponse.json({ error: 'Synthesis failed' }, { status: 500 });
  } catch (error) {
    console.error(error);
    return NextResponse.json({ error: 'Server error' }, { status: 500 });
  }
}

// Usage: POST /api/tts { "text": "Text to speak" } → MP3 stream

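This Next.js 15+ route handler streams AudioStream to the client as it arrives (~150ms latency, depending on region and text length), so playback can start before synthesis finishes. The audio/mpeg Content-Type lets browsers play the response directly. Pitfall: Polly throttles requests per account, with limits that depend on the engine (see Service Quotas), and without the try/catch a throttling error would crash the handler. Add rate limiting in front of the route, for example with Upstash Redis.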
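The route rejects inputs above 3,000 characters. An alternative is to split longer text into chunks and synthesize them one after another. A minimal sketch follows, assuming that string length is an acceptable approximation of Polly's character count and that splitting at sentence-ending punctuation is good enough; `splitForPolly` is a hypothetical helper.

```typescript
// Split text into chunks of at most maxLen characters, preferring sentence boundaries.
function splitForPolly(text: string, maxLen = 3000): string[] {
  // Each match is a run ending in ., ! or ?, or the unpunctuated tail of the text.
  const sentences = (text.match(/[^.!?]*[.!?]+|[^.!?]+$/g) || [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = '';
  for (const sentence of sentences) {
    if (sentence.length > maxLen) {
      // A single oversized sentence is hard-cut into fixed-size slices.
      if (current) {
        chunks.push(current);
        current = '';
      }
      for (let i = 0; i < sentence.length; i += maxLen) {
        chunks.push(sentence.slice(i, i + maxLen));
      }
      continue;
    }
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (candidate.length > maxLen) {
      chunks.push(current); // current is non-empty here, since sentence alone fits
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Example: splitForPolly('One. Two! Three?', 9) returns ['One. Two!', 'Three?']
```

Each chunk can then be sent through SynthesizeSpeechCommand in order. For very long documents, the asynchronous task API remains the better fit.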

S3 Integration for Scalable Storage

For massive audiobooks, upload to S3 instead of local disks. Use StartSpeechSynthesisTask for long async tasks (>3k chars).

Async S3 Task with Polly

s3-task-polly.ts
import { PollyClient, StartSpeechSynthesisTaskCommand, OutputFormat, Engine, GetSpeechSynthesisTaskCommand } from '@aws-sdk/client-polly';
import * as dotenv from 'dotenv';
dotenv.config();

const polly = new PollyClient({ region: process.env.AWS_REGION });

const bucket = 'your-polly-bucket-2026'; // Create it beforehand

const startTask = async (text: string, outputS3KeyPrefix: string) => {
  const command = new StartSpeechSynthesisTaskCommand({
    OutputS3BucketName: bucket,
    OutputS3KeyPrefix: outputS3KeyPrefix, // a prefix: Polly derives the final object key from it
    Text: text,
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Lea', // fr-FR neural voice, matching the French sample text
    Engine: Engine.NEURAL,
  });
  const task = await polly.send(command);
  return task.SynthesisTask?.TaskId;
};

const pollTask = async (taskId: string) => {
  while (true) {
    const statusCmd = new GetSpeechSynthesisTaskCommand({ TaskId: taskId });
    const status = await polly.send(statusCmd);
    if (status.SynthesisTask?.TaskStatus === 'completed') {
      // OutputUri holds the location of the generated file.
      console.log('✅ Task completed, file on S3:', status.SynthesisTask.OutputUri);
      break;
    } else if (status.SynthesisTask?.TaskStatus === 'failed') {
      throw new Error(`Task failed: ${status.SynthesisTask.TaskStatusReason}`);
    }
    await new Promise(r => setTimeout(r, 2000));
  }
};

// Wrapped in a function because top-level await is unavailable in CommonJS output.
const main = async () => {
  const longText = 'Texte très long pour audiobook... (répétez 10k mots)'.repeat(100);
  const taskId = await startTask(longText, 'audiobook');
  if (!taskId) throw new Error('Polly returned no task ID');
  await pollTask(taskId);
};

main().catch(console.error);

// Bonus: download from S3 if needed (requires npm install @aws-sdk/client-s3)
// const s3 = new S3Client({ region: process.env.AWS_REGION });
// const getCmd = new GetObjectCommand({ Bucket: bucket, Key: '<object key taken from OutputUri>' });
// const s3obj = await s3.send(getCmd);

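StartSpeechSynthesisTaskCommand handles texts longer than 3,000 characters and writes the result straight to S3. Polling with GetSpeechSynthesisTask tracks the task status. Pitfall: the task fails if the calling identity cannot write to the bucket, so grant it s3:PutObject on the target bucket. The bucket does not need public read access; configure S3 CORS only if browsers fetch the audio directly.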
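A fixed 2-second loop with no upper bound can spin forever if a task stalls. The sketch below shows one way to bound the polling and back off exponentially; `pollUntilDone` and its options are hypothetical, and the status callback is injected so the logic can be exercised without calling AWS.

```typescript
// Status values reported for Polly synthesis tasks.
type TaskStatus = 'scheduled' | 'inProgress' | 'completed' | 'failed';

interface PollOptions {
  maxAttempts?: number; // give up after this many status checks
  baseDelayMs?: number; // delay after the first check
  maxDelayMs?: number; // upper bound for the growing delay
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Resolves with the number of checks needed; rejects on failure or timeout.
async function pollUntilDone(
  getStatus: () => Promise<TaskStatus | undefined>,
  opts: PollOptions = {},
): Promise<number> {
  const { maxAttempts = 30, baseDelayMs = 1000, maxDelayMs = 15000 } = opts;
  const sleep =
    opts.sleep || ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await getStatus();
    if (status === 'completed') return attempt;
    if (status === 'failed') throw new Error('Synthesis task failed');
    if (attempt < maxAttempts) {
      // The delay doubles on each attempt: 1s, 2s, 4s, ... capped at maxDelayMs.
      await sleep(Math.min(maxDelayMs, baseDelayMs * Math.pow(2, attempt - 1)));
    }
  }
  throw new Error(`Task still pending after ${maxAttempts} checks`);
}
```

With the script above, getStatus would send a GetSpeechSynthesisTaskCommand and return SynthesisTask?.TaskStatus.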

Best Practices

  • Cache audio files: Use Redis to reuse identical syntheses, so repeated text is billed only once.
  • Choose Neural voices: Neural voices sound more realistic than Standard ones but cost more per character, so run A/B tests.
  • Handle quotas: Polly throttles requests per second and the free tier caps monthly characters; rely on the SDK's built-in retries (maxAttempts) with exponential backoff.
  • Secure SSML: Escape user input before embedding it in SSML to prevent markup injection.
  • Monitor usage: Track the CloudWatch RequestCharacters metric to optimize billing.
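For the caching practice, two requests should share a cache entry only when every parameter that affects the audio is identical. A minimal sketch of deriving such a key follows; `SynthesisParams` and `cacheKey` are hypothetical names, and the 'polly:' prefix is an arbitrary namespace choice.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape: every field that changes the resulting audio.
interface SynthesisParams {
  text: string;
  voiceId: string;
  engine: string;
  outputFormat: string;
  textType?: string;
  lexiconNames?: string[];
}

// Derive a stable cache key by hashing the parameters in a fixed order.
function cacheKey(p: SynthesisParams): string {
  // A JSON array avoids ambiguity between adjacent fields.
  const canonical = JSON.stringify([
    p.text,
    p.voiceId,
    p.engine,
    p.outputFormat,
    p.textType || 'text',
    p.lexiconNames || [],
  ]);
  return 'polly:' + createHash('sha256').update(canonical, 'utf8').digest('hex');
}
```

The key can serve as a Redis key or an S3 object name: look it up before calling Polly and store the audio under it afterwards.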

Common Errors to Avoid

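  • LexiconNotFoundException: Lexicon not found. List the available lexicons with ListLexiconsCommand before use.
  • ThrottlingException: Too many requests. Add rate limiting on your side, for example with AWS API Gateway throttling.
  • InvalidSsmlException: Malformed SSML. Check that the XML is well formed with a lib like xmldom.
  • Unhandled stream: Forgetting stream.on('error') leaves an unhandled 'error' event that crashes the process. Always wrap the stream in a promise that rejects on error.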

Next Steps