Introduction
Amazon Polly, AWS's text-to-speech (TTS) service, has evolved in 2026 with ultra-realistic neural voices, expanded SSML support, and lexicons for precise pronunciation. This expert tutorial walks you through integrating it into a Node.js app, from the basics to advanced use cases like audio streaming and phonetic customization. Why does it matter? Voice apps (AI assistants, audiobooks, accessibility tools) demand minimal latency and human-like quality: imagine converting dynamic text to smooth MP3 in under 200 ms. Drawing on 15 years of experience, I share production-ready configs across six coded, scalable steps. By the end, your TTS API will hold its own against open-source alternatives. Ready to give your data a voice?
Prerequisites
- AWS account with PollyFullAccess permissions (IAM policy)
- Node.js 20+ and npm/yarn
- AWS keys as environment variables: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION=us-east-1
- Advanced knowledge of async/await and Node.js streams
- Audio player like VLC to test MP3 outputs
Installation and AWS SDK v3 Setup
mkdir polly-expert && cd polly-expert
npm init -y
npm install @aws-sdk/client-polly dotenv
npm install -D @types/node typescript ts-node
cat > .env << EOF
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
EOF
cat > tsconfig.json << 'EOF'
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "strict": true,
    "esModuleInterop": true
  }
}
EOF

This script sets up a TypeScript project with AWS SDK v3 for Polly (modular, tree-shakeable clients). Environment variables keep credentials out of the codebase; never hardcode them. Pitfall: a .env file that ships to production can leak your keys, so use SSM Parameter Store there instead.
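To fail fast when a credential variable is missing (instead of getting a cryptic SDK error later), a startup guard can help. A minimal sketch (`requireEnv` is a hypothetical helper, not part of the AWS SDK):

```typescript
// Throws at startup if any required variable is absent or empty.
const requireEnv = (
  names: string[],
  env: Record<string, string | undefined> = process.env
): void => {
  const missing = names.filter((n) => !env[n] || env[n]!.trim() === '');
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
};

// Usage: call once after dotenv.config(), before constructing the PollyClient.
// requireEnv(['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'AWS_REGION']);
```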
First Call: Basic TTS Synthesis
Before diving into expert features, let's validate the connection. This code generates a simple MP3 with a standard voice. Note OutputFormat.MP3 for web compatibility.
Basic Functional TTS Script
import { PollyClient, SynthesizeSpeechCommand, OutputFormat } from '@aws-sdk/client-polly';
import * as fs from 'fs';
import * as dotenv from 'dotenv';

dotenv.config();

const client = new PollyClient({ region: process.env.AWS_REGION });

const synthesize = async (text: string) => {
  const input = {
    Text: text,
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Joanna',
    Engine: 'standard',
  };
  const command = new SynthesizeSpeechCommand(input);
  const response = await client.send(command);
  if (response.AudioStream) {
    const audioBuffer = await streamToBuffer(response.AudioStream);
    fs.writeFileSync('output.mp3', audioBuffer);
    console.log('✅ File generated: output.mp3');
  }
};

const streamToBuffer = (stream: NodeJS.ReadableStream): Promise<Buffer> =>
  new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });

synthesize('Hello, this is an Amazon Polly test in 2026.');
// Run: npx ts-node basic-polly.ts

This complete script synthesizes text to MP3 using SynthesizeSpeechCommand. streamToBuffer collects the binary response stream into a single Buffer. Pitfall: the engine must match the voice; a voice that only supports the standard engine fails if you request 'neural', so test each voice/engine pair locally first.
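To avoid the voice/engine mismatch mentioned in the pitfall, you can check which voices support an engine before synthesizing. The `VoiceSummary` shape below mirrors entries in the `Voices` array returned by `DescribeVoicesCommand`; `voicesSupporting` is my own helper sketch, not an SDK API:

```typescript
// Minimal shape of one entry from DescribeVoicesCommand's Voices array.
interface VoiceSummary {
  Id?: string;
  SupportedEngines?: string[]; // e.g. ['neural', 'standard']
}

// Return the IDs of voices that support the requested engine.
const voicesSupporting = (voices: VoiceSummary[], engine: string): string[] =>
  voices
    .filter((v) => v.SupportedEngines?.includes(engine))
    .map((v) => v.Id ?? '');

// With the real SDK you would feed it like this:
// const res = await client.send(new DescribeVoicesCommand({ LanguageCode: 'en-US' }));
// console.log(voicesSupporting(res.Voices ?? [], 'neural'));
```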
Advanced Level: SSML for Expert Prosody
SSML (Speech Synthesis Markup Language) lets you control intonation, pauses, and emphasis, like a vocal conductor. Example: Slow down numbers for clarity.
TTS with Custom SSML
import { PollyClient, SynthesizeSpeechCommand, OutputFormat, Engine } from '@aws-sdk/client-polly';
import * as fs from 'fs';
import * as dotenv from 'dotenv';

dotenv.config();

const client = new PollyClient({ region: process.env.AWS_REGION });

const synthesizeSSML = async (ssml: string) => {
  const input = {
    Text: ssml,
    TextType: 'ssml',
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Remi', // neural French voice; standard-only voices like Mathieu reject the neural engine
    Engine: Engine.NEURAL,
  };
  const command = new SynthesizeSpeechCommand(input);
  const response = await client.send(command);
  if (response.AudioStream) {
    const audioBuffer = await streamToBuffer(response.AudioStream);
    fs.writeFileSync('ssml-output.mp3', audioBuffer);
    console.log('✅ SSML generated: ssml-output.mp3');
  }
};

const streamToBuffer = (stream: NodeJS.ReadableStream): Promise<Buffer> =>
  new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });

// Advanced SSML: emphasis, pause, rate
synthesizeSSML(`<speak>
  Bonjour !<break time="500ms"/>
  <prosody rate="slow">Le prix est <emphasis level="strong">99,99 €</emphasis>.</prosody>
  <prosody pitch="high">Parfait pour 2026 !</prosody>
</speak>`);
// Run: npx ts-node ssml-polly.ts

SSML with the neural engine and TextType: 'ssml' enables human-like prosody. The <break>, <prosody>, and <emphasis> tags structure the speech. Pitfall: malformed SSML returns an InvalidSsmlException; validate the XML before sending it.
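If the SSML wraps user-provided text, escape XML special characters before interpolating it, or a stray `<` or `&` will make the request fail as invalid SSML. A minimal sketch (`escapeForSsml` is a hypothetical helper):

```typescript
// Escape the five XML special characters so user text is safe inside <speak>.
const escapeForSsml = (text: string): string =>
  text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');

// Usage:
// const ssml = `<speak>${escapeForSsml(userInput)}</speak>`;
```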
Custom Lexicons for Pronunciation
Lexicons fix phonetics (e.g., acronyms, proper names). Create one with PutLexicon for reuse across calls.
Lexicon Creation and Usage
import { PollyClient, PutLexiconCommand, SynthesizeSpeechCommand, OutputFormat, Engine, DeleteLexiconCommand } from '@aws-sdk/client-polly';
import * as fs from 'fs';
import * as dotenv from 'dotenv';

dotenv.config();

const client = new PollyClient({ region: process.env.AWS_REGION });
const lexiconName = 'ExpertLexicon2026';

const createLexicon = async () => {
  const lexicon = {
    Name: lexiconName,
    Content: `<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  alphabet="ipa" xml:lang="fr-FR">
  <lexeme>
    <grapheme>Learni</grapheme>
    <alias>le ar ni</alias>
  </lexeme>
  <lexeme>
    <grapheme>API</grapheme>
    <alias>a p i</alias>
  </lexeme>
</lexicon>`,
  };
  const command = new PutLexiconCommand(lexicon);
  await client.send(command);
  console.log('✅ Lexicon created');
};

const synthesizeWithLexicon = async () => {
  const input = {
    Text: 'Learni Dev publie une API Polly en 2026.',
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Lea', // neural French voice; 'Celine' is standard-only
    Engine: Engine.NEURAL,
    LexiconNames: [lexiconName],
  };
  const command = new SynthesizeSpeechCommand(input);
  const response = await client.send(command);
  if (response.AudioStream) {
    const audioBuffer = await streamToBuffer(response.AudioStream);
    fs.writeFileSync('lexicon-output.mp3', audioBuffer);
    console.log('✅ With lexicon: lexicon-output.mp3');
  }
};

const cleanup = async () => {
  const delCmd = new DeleteLexiconCommand({ Name: lexiconName });
  await client.send(delCmd);
  console.log('🧹 Lexicon deleted');
};

const streamToBuffer = (stream: NodeJS.ReadableStream): Promise<Buffer> =>
  new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    stream.on('data', (chunk) => chunks.push(chunk));
    stream.on('end', () => resolve(Buffer.concat(chunks)));
    stream.on('error', reject);
  });

const main = async () => {
  await createLexicon();
  await synthesizeWithLexicon();
  await cleanup();
};
main().catch(console.error);
// Run: npx ts-node lexicon-polly.ts

The PLS lexicon replaces 'Learni' and 'API' with spoken aliases, and LexiconNames applies it to the request. Auto-cleanup keeps you under the per-Region lexicon quota. Pitfall: a wrong alphabet value ('ipa' vs 'x-sampa') breaks phoneme entries; verify the stored content with GetLexiconCommand.
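Rather than hand-writing the PLS XML, a small builder keeps entries consistent. `buildLexiconXml` and `LexemeEntry` are my own sketch names; the output matches the `<lexicon>`/`<lexeme>`/`<alias>` structure used above:

```typescript
interface LexemeEntry {
  grapheme: string;
  alias: string;
}

// Build a PLS lexicon document from grapheme/alias pairs.
const buildLexiconXml = (entries: LexemeEntry[], lang = 'fr-FR'): string => {
  const lexemes = entries
    .map(
      (e) =>
        `  <lexeme>\n    <grapheme>${e.grapheme}</grapheme>\n    <alias>${e.alias}</alias>\n  </lexeme>`
    )
    .join('\n');
  return `<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  alphabet="ipa" xml:lang="${lang}">
${lexemes}
</lexicon>`;
};

// Usage: pass the result as Content to PutLexiconCommand.
```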
Real-Time Streaming for Low Latency
For live apps (chatbots), stream the audio directly to the client over a WebSocket or an HTTP response instead of writing files to disk.
TTS Streaming Server (Next.js API)
import { PollyClient, SynthesizeSpeechCommand, OutputFormat, Engine } from '@aws-sdk/client-polly';
import { NextRequest, NextResponse } from 'next/server';
import * as dotenv from 'dotenv';

dotenv.config();

const client = new PollyClient({ region: process.env.AWS_REGION as string });

export async function POST(request: NextRequest) {
  try {
    const { text } = await request.json();
    if (!text || text.length > 3000) {
      return NextResponse.json({ error: 'Invalid text' }, { status: 400 });
    }
    const input = {
      Text: text,
      OutputFormat: OutputFormat.MP3,
      VoiceId: 'Lea',
      Engine: Engine.NEURAL,
    };
    const command = new SynthesizeSpeechCommand(input);
    const response = await client.send(command);
    if (response.AudioStream) {
      // transformToWebStream converts the SDK stream into the web
      // ReadableStream that NextResponse expects.
      return new NextResponse(response.AudioStream.transformToWebStream(), {
        headers: {
          'Content-Type': 'audio/mpeg',
          'Cache-Control': 'no-cache',
        },
      });
    }
    return NextResponse.json({ error: 'Synthesis failed' }, { status: 500 });
  } catch (error) {
    console.error(error);
    return NextResponse.json({ error: 'Server error' }, { status: 500 });
  }
}
// Usage: POST /api/tts { "text": "Text to vocalize" } → streams MP3

A Next.js 15+ route handler streams AudioStream straight to the client (~150 ms perceived latency). The audio/mpeg header plays natively in browsers. Pitfall: without the try/catch, Polly throttling errors crash the route; add rate-limiting (e.g. with Upstash Redis) in front of it.
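For the rate-limiting concern, a simple exponential-backoff wrapper around `client.send` smooths out throttling errors. `backoffDelay` and `withRetry` are hypothetical helpers, not SDK features:

```typescript
// Delay before retry attempt n (0-based): 100 ms, 200 ms, 400 ms, ... capped at 5 s.
const backoffDelay = (attempt: number, baseMs = 100, capMs = 5000): number =>
  Math.min(baseMs * 2 ** attempt, capMs);

// Retry an async operation up to maxAttempts times with exponential backoff.
const withRetry = async <T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> => {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
  throw lastErr;
};

// Usage: const response = await withRetry(() => client.send(command));
```

The SDK also retries transient errors on its own (the `maxAttempts` client option), so treat this wrapper as an extra layer for sustained throttling.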
S3 Integration for Scalable Storage
For massive audiobooks, write the output to S3 instead of local disk. Use StartSpeechSynthesisTask for long asynchronous jobs that exceed the roughly 3,000-character limit of synchronous SynthesizeSpeech.
Async S3 Task with Polly
import { PollyClient, StartSpeechSynthesisTaskCommand, OutputFormat, Engine, GetSpeechSynthesisTaskCommand } from '@aws-sdk/client-polly';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import * as dotenv from 'dotenv';

dotenv.config();

const polly = new PollyClient({ region: process.env.AWS_REGION });
const s3 = new S3Client({ region: process.env.AWS_REGION });
const bucket = 'your-polly-bucket-2026'; // create it beforehand

const startTask = async (text: string, outputS3KeyPrefix: string) => {
  const input = {
    OutputS3BucketName: bucket,
    OutputS3KeyPrefix: outputS3KeyPrefix,
    Text: text,
    OutputFormat: OutputFormat.MP3,
    VoiceId: 'Joanna', // a voice that supports the neural engine
    Engine: Engine.NEURAL,
  };
  const command = new StartSpeechSynthesisTaskCommand(input);
  const task = await polly.send(command);
  return task.SynthesisTask?.TaskId;
};

const pollTask = async (taskId: string) => {
  while (true) {
    const statusCmd = new GetSpeechSynthesisTaskCommand({ TaskId: taskId });
    const status = await polly.send(statusCmd);
    if (status.SynthesisTask?.TaskStatus === 'completed') {
      // Polly derives the real key (prefix + task ID + extension); read it from OutputUri.
      console.log('✅ Task finished, file on S3:', status.SynthesisTask.OutputUri);
      break;
    } else if (status.SynthesisTask?.TaskStatus === 'failed') {
      throw new Error('Task failed');
    }
    await new Promise((r) => setTimeout(r, 2000));
  }
};

const longText = 'Very long audiobook text... '.repeat(100);

const main = async () => {
  const taskId = await startTask(longText, 'audiobook');
  await pollTask(taskId!);
};
main().catch(console.error);

// Bonus: download from S3 if needed (derive Bucket/Key from OutputUri)
// const s3obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));

StartSpeechSynthesisTaskCommand writes texts beyond the synchronous limit straight to S3 (cost-effective for audiobooks). Polling with GetSpeechSynthesisTaskCommand tracks the status, and the finished file's location is reported in SynthesisTask.OutputUri. Pitfall: the caller's IAM identity needs s3:PutObject on the bucket, and browsers fetching the audio directly need S3 CORS enabled.
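Since Polly derives the final S3 key itself (prefix plus task ID and extension), read the location from `SynthesisTask.OutputUri` rather than guessing the key. A sketch of a parser for the path-style URI Polly returns (`parseOutputUri` is my own helper, and the URI shape is an assumption):

```typescript
// Split an S3 path-style URI like
//   https://s3.us-east-1.amazonaws.com/my-bucket/audiobook.<taskId>.mp3
// into bucket and key, ready for GetObjectCommand.
const parseOutputUri = (uri: string): { bucket: string; key: string } => {
  const { pathname } = new URL(uri);
  const [, bucket, ...rest] = pathname.split('/');
  return { bucket, key: rest.join('/') };
};

// Usage:
// const { bucket, key } = parseOutputUri(status.SynthesisTask!.OutputUri!);
// const obj = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
```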
Best Practices
- Cache audio files: use Redis to reuse identical syntheses (up to 80% savings on Polly costs).
- Prefer neural voices: standard voices sound dated; neural voices are more realistic but pricier, so run A/B tests.
- Handle quotas: Polly enforces per-second request limits and free-tier character caps; rely on the SDK's built-in retry or add your own exponential backoff.
- Secure SSML: sanitize user input to prevent SSML injection.
- Monitor usage: track Polly's CloudWatch character metrics (e.g. RequestCharacters) to optimize billing.
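The caching advice above needs a stable key; hashing the synthesis parameters gives a deterministic one. A minimal sketch using Node's built-in crypto (`ttsCacheKey` is my own naming):

```typescript
import { createHash } from 'crypto';

// Deterministic cache key for a synthesis request: identical
// text/voice/engine combinations always map to the same hash.
const ttsCacheKey = (text: string, voiceId: string, engine: string): string =>
  createHash('sha256').update(`${engine}|${voiceId}|${text}`).digest('hex');

// Usage: look the key up in Redis before calling SynthesizeSpeechCommand,
// and store the resulting MP3 buffer under it on a cache miss.
```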
Common Errors to Avoid
- LexiconNotFoundException: the named lexicon doesn't exist; list existing lexicons with ListLexiconsCommand before use.
- Throttling errors: no rate-limiting in front of Polly; use AWS API Gateway throttling.
- Marks out of range: malformed SSML; validate the XML with a library such as xmldom.
- Unhandled streams: forgetting stream.on('error') causes memory leaks; always promisify with an error handler.
Next Steps
- AWS Docs: Amazon Polly Developer Guide
- 2026 Voices: Neural French Voices List
- Advanced: Integrate with Lex/Transcribe for full voice pipelines.
- Learni AWS Expert Training: Master Bedrock + Polly for voice AI.