Introduction
Document AI pipelines automate the extraction of information from unstructured documents such as PDFs, invoices, and contracts. In 2026, they combine modern OCR, language models, and orchestration to produce reliable results. This tutorial walks you step by step through building a complete pipeline, from ingestion to structured export. You will learn to handle errors, optimize performance, and integrate reusable AI components.
Prerequisites
- Python 3.11 or higher
- Intermediate knowledge of Python and Pydantic
- Google Cloud account (optional for Document AI)
- Basic knowledge of PDF and JSON formats
Installing Dependencies
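pip install pymupdf pytesseract pillow pydantic python-dotenv
pip install google-cloud-documentai

These packages provide PDF manipulation, OCR, and data validation. pytesseract is only a wrapper: the Tesseract binary, along with the French language data used later in this tutorial, must be installed separately. google-cloud-documentai is optional and is included for real-world use with the Google Document AI API.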
Pydantic Data Model
from pydantic import BaseModel, Field
from typing import List

class ExtractedField(BaseModel):
    key: str
    value: str
    # Confidence is constrained to the closed interval [0.0, 1.0]
    confidence: float = Field(ge=0.0, le=1.0)

class DocumentResult(BaseModel):
    document_id: str
    extracted_fields: List[ExtractedField]
    raw_text: str

This model ensures that pipeline outputs are always structured and validated at construction time: an out-of-range confidence or a missing field raises a ValidationError instead of propagating silently.
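To make the validation behaviour concrete, here is a minimal sketch that builds one valid field and one with an out-of-range confidence. The helper name try_build_field and the sample values are illustrative assumptions, not part of the tutorial's pipeline; ExtractedField is repeated so the snippet runs on its own.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ExtractedField(BaseModel):
    key: str
    value: str
    confidence: float = Field(ge=0.0, le=1.0)


def try_build_field(key: str, value: str, confidence: float) -> Optional[ExtractedField]:
    """Hypothetical helper: return a validated field, or None if validation fails."""
    try:
        return ExtractedField(key=key, value=value, confidence=confidence)
    except ValidationError as exc:
        # exc.errors() holds one entry per violated constraint
        print(f"rejected {key!r}: {len(exc.errors())} validation error(s)")
        return None


ok = try_build_field("client", "Exemple SA", 0.92)
bad = try_build_field("total", "1200.00", 1.4)  # confidence above 1.0 is rejected
```

Returning None here is only one possible policy; a real pipeline might instead re-raise the error or route the document to manual review.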
PDF Ingestion Module
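import fitz  # PyMuPDF
from pathlib import Path

def extract_text_from_pdf(pdf_path: Path) -> str:
    """Return the embedded (native) text of every page in the PDF."""
    # The context manager closes the document even if a page fails to parse.
    with fitz.open(pdf_path) as doc:
        pages = [page.get_text() for page in doc]
    return "".join(pages).strip()

Uses PyMuPDF to extract raw text quickly. Pages are read one at a time, but the full text is accumulated in memory, so very large documents may need to be processed page by page instead.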
OCR Component
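import pytesseract
from PIL import Image
import fitz  # PyMuPDF

def ocr_pdf_page(pdf_path: str, page_num: int = 0) -> str:
    """OCR a single page (0-indexed) of the PDF with Tesseract."""
    with fitz.open(pdf_path) as doc:
        page = doc[page_num]
        # Render at 300 DPI; the default pixmap has no alpha channel, so it maps to RGB.
        pix = page.get_pixmap(dpi=300)
        img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    # lang="fra" requires the French Tesseract language data to be installed.
    return pytesseract.image_to_string(img, lang="fra")

Converts the page to a 300 DPI image, then applies Tesseract with the French language model. Well suited to scanned documents. Note that each call processes a single page (the first one by default), so multi-page scans need one call per page.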
Main Pipeline
from pathlib import Path

from ingestion import extract_text_from_pdf
from ocr import ocr_pdf_page
from models import DocumentResult, ExtractedField

def run_document_pipeline(pdf_path: str, use_ocr: bool = False) -> DocumentResult:
    if use_ocr:
        # ocr_pdf_page only processes the first page by default
        raw = ocr_pdf_page(pdf_path)
    else:
        raw = extract_text_from_pdf(Path(pdf_path))
    # Simplified extraction (to be replaced by an LLM)
    fields = [ExtractedField(key="client", value="Exemple SA", confidence=0.92)]
    return DocumentResult(document_id=pdf_path, extracted_fields=fields, raw_text=raw)

Orchestrates the ingestion and extraction steps. The imports assume the previous snippets are saved as models.py, ingestion.py, and ocr.py. The use_ocr flag switches between native text extraction and OCR.
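Rather than setting use_ocr by hand, the flag can be derived from the native text itself. The sketch below is one possible heuristic, not the tutorial's method: the function name needs_ocr and both thresholds are assumptions that should be tuned on your own documents.

```python
def needs_ocr(native_text: str, min_chars: int = 50, min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether a PDF's embedded text is too poor to use, so OCR is needed.

    Assumed heuristic: scanned PDFs usually yield almost no native text,
    or text dominated by non-alphanumeric noise.
    """
    stripped = native_text.strip()
    if len(stripped) < min_chars:
        # Empty or near-empty text layer: most likely a scanned document
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    # Mostly symbols or punctuation suggests a broken text layer
    return alnum / len(stripped) < min_alnum_ratio


clean = "Invoice 2024-001 issued to Exemple SA for consulting services in March."
noisy = "~~~ ... --- ### " * 5

print(needs_ocr(""))     # empty text layer
print(needs_ocr(clean))  # readable native text
print(needs_ocr(noisy))  # long but meaningless text layer
```

In the pipeline, this could be used as use_ocr = needs_ocr(extract_text_from_pdf(path)), which avoids paying the OCR cost for documents that already carry a usable text layer.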
Best Practices
- Always validate data with Pydantic
- Use flags to enable or disable OCR
- Log each step with metadata
- Version prompts and models used
- Test with real and noisy documents
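As an illustration of logging each step with metadata, the following sketch wraps a pipeline step in a context manager that records its status, duration, and any extra fields. The name logged_step, the logger name, and the metadata keys (document_id, model_version, prompt_version) are hypothetical; only the standard library is used.

```python
import logging
import time
from contextlib import contextmanager
from typing import Iterator

logger = logging.getLogger("document_pipeline")


@contextmanager
def logged_step(step: str, **metadata: object) -> Iterator[dict]:
    """Log one pipeline step with its metadata, status, and duration."""
    record = {"step": step, **metadata}
    start = time.perf_counter()
    try:
        # The caller can add more metadata to the record inside the block
        yield record
        record["status"] = "ok"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info("pipeline_step %s", record)


logging.basicConfig(level=logging.INFO)

with logged_step(
    "ingestion",
    document_id="invoice-001.pdf",
    model_version="v1",
    prompt_version="2026-01",
) as info:
    info["chars"] = len("sample text")
```

Passing the prompt and model versions as metadata keeps every log line traceable to the exact configuration that produced it.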
Common Errors to Avoid
- Forgetting to handle exceptions when opening corrupted PDFs
- Not normalizing text before AI extraction
- Ignoring model confidence scores
- Running OCR on all documents without checking for native text
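For the normalization point, a minimal sketch of a cleanup pass applied before AI extraction might look like this. The function name normalize_text and the specific rules are assumptions; in particular, rejoining words split by a hyphen at a line break can wrongly merge genuine hyphenated compounds, so check it against your own corpus.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Normalize OCR or PDF text before sending it to an extraction model."""
    # NFKC folds ligatures and non-breaking spaces into plain characters
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break (assumed safe for this corpus)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop control and format characters, keeping newlines and tabs
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs, then trim spaces around newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Invoi-\nce  No.\u00a042\r\n\n\n\nTotal:\t100"
print(normalize_text(sample))
```

Running the same normalization on every document, whether the text came from PyMuPDF or from Tesseract, gives the extraction step a consistent input.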
Further Reading
Discover our advanced training on document AI and pipeline orchestration at Learni.