Complete Series

Building Production OCR Pipelines: What I Learned the Hard Way

Five posts documenting a real OCR project — from the first rabbit hole to a fully automated document pipeline running for less than a penny per page.

5-post series · ~25 min total · Andrew Judd, developer and consultant

1. What OCR Is and When to Use It

OCR — Optical Character Recognition — converts images or scanned documents into machine-readable text. That definition sounds simple, but building a reliable OCR pipeline in production is a different problem entirely.

Most of the complexity isn't in the OCR step itself. It's in everything around it: preprocessing images to improve accuracy, cleaning the raw text output, mapping extracted values to a schema, handling edge cases, and deciding when the confidence is high enough to trust the result without human review.
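To make "mapping extracted values to a schema" concrete, here is a minimal sketch in Python. The `InvoiceFields` schema, the two regular expressions, and the sample field names are all hypothetical and chosen only for illustration; real documents need rules tuned to their own layout.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvoiceFields:
    """Hypothetical target schema; a real project would define its own."""
    invoice_number: Optional[str]
    total: Optional[float]


# Illustrative patterns only (assumptions, not taken from any real document set).
INVOICE_NO = re.compile(
    r"\binvoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
)
# \b keeps "Subtotal" from matching as "total".
TOTAL = re.compile(
    r"\btotal\s*(?:due)?\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
)


def map_to_schema(raw_text: str) -> InvoiceFields:
    """Pull known fields out of raw OCR text; missing fields become None."""
    number = INVOICE_NO.search(raw_text)
    total = TOTAL.search(raw_text)
    return InvoiceFields(
        invoice_number=number.group(1) if number else None,
        total=float(total.group(1).replace(",", "")) if total else None,
    )
```

Returning `None` for a missing field, rather than raising, leaves the decision about what to do with incomplete documents to a later stage of the pipeline.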

OCR is the right solution when:

  • You're receiving documents (PDFs, scans, photos) that contain data you need to process programmatically
  • Manual data entry is the current solution and it's creating a bottleneck or introducing errors
  • The documents follow a consistent enough structure that extraction rules are viable
  • Volume is high enough that automation has clear ROI, or speed matters and humans can't keep up

OCR is probably not the answer when:

  • Documents are purely machine-generated (export the underlying data instead)
  • Structure varies wildly with no consistent pattern
  • The cost of errors exceeds the cost of manual review
  • You have fewer than a hundred documents and a person can handle them in an hour

2. The Five-Post Arc

These posts follow a real project from first experiment to production system. Read them in order for the full narrative, or jump to the one that matches where you are.

3. Hard-Won Lessons

Across five posts and a production deployment, a few lessons kept showing up. These apply regardless of which OCR tool you choose:

The OCR step is rarely the hard part

Every team starting an OCR project thinks the challenge is getting the text out. It's not. The hard parts are cleaning noisy output, handling the 10% of documents that look different from the other 90%, and deciding what "good enough" accuracy means for your use case.
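As one example of what cleaning noisy output can look like, the sketch below collapses stray whitespace and repairs look-alike characters in a field that is known to hold a number. The substitution table and both function names are assumptions for illustration; which confusions actually occur depends on your engine and your documents.

```python
import re

# Common OCR look-alike confusions. Only safe to apply where digits are
# expected; applied to free text this table would corrupt real words.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs, strip each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def clean_numeric_field(value: str) -> str:
    """Repair look-alike characters, then keep only digits, '.' and '-'."""
    repaired = value.translate(DIGIT_LOOKALIKES)
    return re.sub(r"[^\d.\-]", "", repaired)
```

For instance, `clean_numeric_field("1,2O4.5l")` gives `"1204.51"`. The key design choice is scoping: the repair runs per field, after you know the field should be numeric, never across the whole page.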

Build a confidence threshold from day one

OCR APIs return confidence scores for a reason. A pipeline that silently accepts low-confidence extractions will create data integrity problems that are painful to debug later. Route low-confidence results to a review queue early — even if nobody checks it at first.
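A minimal sketch of that routing step follows. The `Extraction` type, the `route` function, and the 0.90 cut-off are all assumptions for illustration; confidence is assumed to be normalized to 0.0–1.0, so scores reported on a 0–100 scale would need dividing by 100 first.

```python
from dataclasses import dataclass
from typing import List, Tuple

REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune against your own error tolerance


@dataclass
class Extraction:
    field: str
    value: str
    confidence: float  # assumed to be normalized to 0.0-1.0


def route(
    extractions: List[Extraction], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[Extraction], List[Extraction]]:
    """Split extractions into (accepted, needs_review) by confidence."""
    accepted: List[Extraction] = []
    needs_review: List[Extraction] = []
    for item in extractions:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review
```

Even if the review list is only written to a log at first, having the split in place means the threshold can be tightened or loosened later without restructuring the pipeline.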

Preprocessing pays off more than switching APIs

Before blaming your OCR engine for poor accuracy, check the input. Deskewing, denoising, increasing contrast, and converting to the right colour space can improve accuracy by 20–40% on scanned documents. Garbage in, garbage out — this is always true in OCR.
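To show the idea behind two of those steps, here is a dependency-free sketch of colour-space conversion and contrast stretching on an image held as nested lists. This is an illustration only: a real pipeline would normally use an image library for these operations, and deskewing and denoising are not shown. The function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def to_grayscale(rgb_rows: List[List[Pixel]]) -> List[List[int]]:
    """Convert RGB pixels to 8-bit luminance using the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def stretch_contrast(gray: List[List[int]]) -> List[List[int]]:
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:
        return [row[:] for row in gray]  # flat image: nothing to stretch
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in gray]
```

A washed-out scan whose pixel values sit between 100 and 160 ends up spanning the full 0–255 range after `stretch_contrast`, which gives the OCR engine a clearer separation between ink and paper.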

"Free" open-source has a real cost

Tesseract is free to run. But making it reliable on a diverse document set takes significant engineering time — preprocessing pipelines, exception handling, output normalization, server maintenance. At real business document volumes, a cloud API at $1–2 per thousand pages is almost always cheaper than the engineering cost of maintaining a self-hosted alternative.
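The comparison is easy to run for your own numbers. The sketch below is a back-of-the-envelope model, not a pricing calculator: the function names are hypothetical, and the hours, hourly rate, and server cost used in the note that follows are assumptions picked purely for illustration.

```python
def cloud_cost(pages_per_month: int, price_per_1k: float) -> float:
    """Monthly cloud API spend at a flat per-thousand-pages price."""
    return pages_per_month / 1000 * price_per_1k


def self_hosted_cost(
    engineer_hours: float, hourly_rate: float, server_cost: float
) -> float:
    """Monthly cost of keeping a self-hosted OCR stack running."""
    return engineer_hours * hourly_rate + server_cost


def break_even_pages(
    engineer_hours: float,
    hourly_rate: float,
    server_cost: float,
    price_per_1k: float,
) -> float:
    """Monthly page volume at which both options cost the same."""
    monthly = self_hosted_cost(engineer_hours, hourly_rate, server_cost)
    return monthly / price_per_1k * 1000
```

With assumed figures of 10 maintenance hours a month at $100 an hour plus a $50 server, self-hosting costs $1,050 a month. At $1.50 per thousand pages, 50,000 pages a month on a cloud API costs $75, and the break-even volume is 700,000 pages a month.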

Design for humans in the exception path

A fully automated pipeline still needs a clear path for when automation fails. Build the exception UI — the review queue, the correction interface, the reprocessing mechanism — before you think you need it. You'll need it sooner than you expect.
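The data model behind an exception path can start very small. The sketch below is an in-memory stand-in, assuming three hypothetical states (pending, corrected, reprocess); a real system would back this with a database table or a message queue and put a UI on top.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    PENDING = "pending"      # waiting for a human
    CORRECTED = "corrected"  # a reviewer supplied the right value
    REPROCESS = "reprocess"  # send back through the pipeline


@dataclass
class ReviewItem:
    document_id: str
    reason: str
    status: Status = Status.PENDING
    corrected_value: Optional[str] = None


class ReviewQueue:
    """In-memory stand-in for a real queue or database table."""

    def __init__(self) -> None:
        self._items: Dict[str, ReviewItem] = {}

    def add(self, document_id: str, reason: str) -> ReviewItem:
        item = ReviewItem(document_id, reason)
        self._items[document_id] = item
        return item

    def pending(self) -> List[ReviewItem]:
        return [i for i in self._items.values() if i.status is Status.PENDING]

    def correct(self, document_id: str, value: str) -> None:
        item = self._items[document_id]
        item.corrected_value = value
        item.status = Status.CORRECTED

    def request_reprocess(self, document_id: str) -> None:
        self._items[document_id].status = Status.REPROCESS
```

Recording a `reason` with every item is the part that pays off later: it turns the queue into a running list of the document types and failure modes the pipeline does not yet handle.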

4. Which OCR Tool Should You Use?

The three tools that matter for most production workloads are Tesseract (open source), AWS Textract, and Google Vision API. They solve meaningfully different problems.

Tesseract

Free / Self-hosted

Best for clean, printed documents at high volume where you want zero per-page cost and are willing to manage the infrastructure and post-processing work.

Struggles with: handwriting, low-quality scans, complex layouts, skewed documents

AWS Textract

~$1.50–$15 / 1k pages

Best for structured documents — forms, invoices, tax documents, applications. Understands document layout and returns key-value pairs from forms, not just raw text.

Struggles with: handwriting, extremely low-quality scans

Google Vision API

~$1.50 / 1k images

Best for photos of documents, handwritten text, and multilingual content. Excellent at reading text at odd angles or in real-world conditions.

Struggles with: structured form extraction (use Document AI for that)

Full comparison: Tesseract vs Textract vs Google Vision

Detailed breakdown of accuracy, cost, use cases, and decision guide
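The guidance above can be condensed into a small decision helper. This is a sketch of the rules of thumb in this section, not a benchmark result: the `DocumentProfile` fields are hypothetical, and the order in which the checks run (handwriting and photos before form structure) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    handwritten: bool = False
    photographed: bool = False       # phone photos, odd angles
    multilingual: bool = False
    structured_form: bool = False    # forms, invoices, tax documents
    clean_print: bool = False
    willing_to_self_host: bool = False


def suggest_tool(profile: DocumentProfile) -> str:
    """Return a starting point for evaluation, not a final answer."""
    if profile.handwritten or profile.photographed or profile.multilingual:
        return "Google Vision API"
    if profile.structured_form:
        return "AWS Textract"
    if profile.clean_print and profile.willing_to_self_host:
        return "Tesseract"
    # Plain printed text with no appetite for self-hosting: either cloud API.
    return "AWS Textract or Google Vision API"
```

Whatever the helper suggests, the next step is the same: run a representative sample of your own documents through the candidate and measure the results.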

5. Frequently Asked Questions

What is OCR and when should I use it?

OCR converts images or scanned documents into machine-readable text. Use it when you need to extract data from PDFs, document photos, or scanned forms — and manual data entry is the current alternative.

Should I use Tesseract or a cloud OCR API?

Tesseract is free and great for clean documents at high volume if you're willing to manage the infrastructure. Cloud APIs (Textract, Google Vision) are easier to integrate, more accurate on challenging documents, and cost less than a penny per page at typical business volumes — which makes them the right default for most teams.

How much does OCR cost in production?

Cloud APIs run $1–$2 per 1,000 pages for basic text extraction, which works out to $0.001–$0.002 per page, well under a penny. Tesseract is free per-page but requires servers and significant engineering time to make reliable. The real cost of "free" OCR is usually measured in developer days, not dollars.

What is the best OCR API for PDFs?

AWS Textract for structured forms and invoices — it understands document layout and extracts key-value pairs. Google Vision for PDFs containing photos, handwriting, or multilingual text. For plain printed text, either works well.

Can OCR handle handwritten text?

Google Vision handles handwriting best. AWS Textract has limited handwriting support. Tesseract is not suitable for handwriting. If handwriting is a core requirement, Google Vision or a specialized handwriting OCR is the right choice.

Need an OCR Pipeline Built?

I've been building document automation systems for over 18 years.

If you're looking at a document automation problem and want to skip the rabbit hole — reach out. We can usually scope the right approach in a single conversation.