8-Post Cluster

Building Production OCR Pipelines: What I Learned the Hard Way

Q: What is OCR and when should I use it?

OCR (Optical Character Recognition) converts images or scanned documents into machine-readable text. Use it when you need to extract structured data from PDFs, photos of documents, scanned forms, or any image that contains text you need to process programmatically.

Q: Should I use Tesseract or a cloud OCR API?

Tesseract is free and runs locally, making it ideal for high-volume workloads where cost is a constraint and document quality is good. Cloud APIs like AWS Textract and Google Vision are better for low-quality scans, structured forms, handwriting, or when you want to avoid managing infrastructure. For most production workloads, the per-page cost of cloud APIs is negligible compared to the engineering time saved.

Q: How much does OCR cost in production?

Tesseract is free to run but requires servers and engineering time to maintain. Cloud APIs typically cost $1–$2 per 1,000 pages for basic text extraction, which is under a penny per document at most business volumes. AWS Textract charges more (roughly $15 per 1,000 pages, depending on the features used) when you need structured form and table extraction.

Q: What is the best OCR API for PDFs?

AWS Textract excels at structured PDFs — invoices, tax forms, applications — because it understands document layout and can extract key-value pairs from forms. Google Vision is better for photos of documents and handwritten text. For plain printed text in PDFs, all three options (Tesseract, Textract, Vision) perform well.

Q: Can OCR handle handwritten text?

Google Vision API handles handwriting best among the major options. AWS Textract has mediocre handwriting support. Tesseract struggles significantly with handwriting and is not recommended for it. If handwriting is a core requirement, Google Vision or a specialized handwriting OCR API is the right choice.

Eight posts on OCR in production — from the first rabbit hole to a fully automated pipeline, plus applied examples from a real recipe digitization project.

8 posts · ~40 min total · Andrew Judd, developer and consultant

In This Series

1. What OCR Is and When to Use It

OCR — Optical Character Recognition — converts images or scanned documents into machine-readable text. That definition sounds simple, but building a reliable OCR pipeline in production is a different problem entirely.

Most of the complexity isn't in the OCR step itself. It's in everything around it: preprocessing images to improve accuracy, cleaning the raw text output, mapping extracted values to a schema, handling edge cases, and deciding when the confidence is high enough to trust the result without human review.

OCR is the right solution when:

You're receiving documents (PDFs, scans, photos) that contain data you need to process programmatically
Manual data entry is the current solution and it's creating a bottleneck or introducing errors
The documents follow a consistent enough structure that extraction rules are viable
Volume is high enough that automation has clear ROI, or speed matters and humans can't keep up

OCR is probably not the answer when:

Documents are purely machine-generated (export as data instead)
Structure varies wildly with no consistent pattern
The cost of errors exceeds the cost of manual review
You have fewer than a hundred documents and a person can handle them in an hour

2. The Five-Post Arc

These posts follow a real project from first experiment to production system. Read them in order for the full narrative, or jump to the one that matches where you are.

Post 1 Exploration

The OCR Rabbit Hole

The first experiment: what happens when you point an OCR library at a real document and see what comes back. This post covers the initial discovery — what worked, what surprised me, and why the problem turned out to be much harder than "just read the text."

Post 2 Post-processing

When the Cleanup Code Becomes the Project

Raw OCR output is almost never what you need. This post is about the second phase of every OCR project: normalizing, correcting, and structuring the extracted text. How the "quick cleanup step" became the most complex part of the pipeline.

Post 3 Turning point

One API Call Changed Everything

Switching from a self-hosted OCR solution to a managed cloud API. What changed, what got dramatically simpler, and why the per-document cost turned out to be almost irrelevant compared to the engineering time saved.

Post 4 Cost analysis

Less Than a Penny Per Document

The actual numbers: what cloud OCR costs at real business volumes, how to think about the build-vs-buy decision, and why "free" open-source OCR often isn't free once you account for the time to make it reliable.

Post 5 Full pipeline

From Inbox to Database Without a Human in the Middle

The complete automated pipeline: email arrives, attachment is extracted, OCR runs, data is validated and stored — with no manual step unless confidence falls below threshold. How all the pieces fit together end to end.
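
The first hop of that flow, getting attachments out of an incoming message, can be sketched with Python's standard `email` package. This is an illustrative sketch, not the code from the post: the function name `extract_attachments` is hypothetical, and the OCR, validation, and storage steps are left out.

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def extract_attachments(raw_email: bytes) -> List[Tuple[str, bytes]]:
    """Return (filename, content) pairs for every attachment in a raw email message."""
    # policy.default makes the parser return an EmailMessage,
    # which provides iter_attachments().
    msg = message_from_bytes(raw_email, policy=policy.default)
    attachments = []
    for part in msg.iter_attachments():
        content = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if content is not None:
            attachments.append((part.get_filename() or "unnamed", content))
    return attachments
```

Each returned pair can then be handed to whichever OCR engine the pipeline uses.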

Applied Series Recipe Digitization

OCR applied to a specific, hard problem: digitizing shoeboxes of handwritten recipe cards. Three posts from building Flour Power — covering what breaks, what it actually costs, and what process works.

Recipe Post 1 Failure modes

What OCR Actually Gets Wrong on Handwritten Recipes

OCR demos always use a clean printed card. Real recipe cards are faded and handwritten, and errors in quantities and abbreviations quietly ruin the dish. Here's what actually breaks.

Recipe Post 2 Cost breakdown

What It Really Costs to Digitize a Box of Handwritten Recipes

Everyone asks the price per card. It's the wrong question. The OCR runs about thirty cents a box; the real bill is ten hours of human review. Here's the honest math and three ways to handle it.

Recipe Post 3 Full process

The Best Way to Convert Handwritten Recipes to Digital

Search results sell scanners. This covers the actual process: what "digital" means, how to capture well, which engine handles handwriting, and why the confirm-and-correct step is the one that determines success.

3. Hard-Won Lessons

Across five posts and a production deployment, a few lessons kept showing up. These apply regardless of which OCR tool you choose:

The OCR step is rarely the hard part

Every team starting an OCR project thinks the challenge is getting the text out. It's not. The hard parts are cleaning noisy output, handling the 10% of documents that look different from the other 90%, and deciding what "good enough" accuracy means for your use case.
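
To make "cleaning noisy output" concrete, here is a minimal sketch that normalizes a numeric field in which OCR has confused letters and digits. The function name `normalize_amount` and the substitution table are assumptions chosen for illustration. A real pipeline would need rules per field type.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical table of letters that OCR often returns in place of digits.
# Only safe to apply to fields that are expected to be numeric.
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Clean a noisy amount string; return None if it still is not a number."""
    text = raw.strip().translate(DIGIT_LOOKALIKES)
    text = re.sub(r"[^\d.\-]", "", text)  # drop currency symbols, commas, spaces
    if not text:
        return None
    try:
        return Decimal(text)
    except InvalidOperation:
        # Leave the decision to a reviewer instead of storing a guess.
        return None
```

Returning `None` rather than a best guess keeps bad values out of the database and gives the caller an explicit signal to route the document to review.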

Build a confidence threshold from day one

OCR APIs return confidence scores for a reason. A pipeline that silently accepts low-confidence extractions will create data integrity problems that are painful to debug later. Route low-confidence results to a review queue early — even if nobody checks it at first.
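
A minimal sketch of that routing, assuming each extracted field carries a confidence normalized to the 0.0 to 1.0 range (engines report confidence on different scales, so normalize first). The `Field` type, the `route` function, and the 0.90 threshold are all hypothetical starting points, not values from a specific API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # assumed to be normalized to 0.0-1.0


REVIEW_THRESHOLD = 0.90  # assumed default; tune per field and document type


def route(fields: List[Field], threshold: float = REVIEW_THRESHOLD) -> Tuple[str, List[str]]:
    """Return ("accept" or "review", names of the fields below the threshold)."""
    low = [f.name for f in fields if f.confidence < threshold]
    return ("review" if low else "accept", low)
```

Recording which fields fell below the threshold, not just the decision, makes the review queue useful later: a reviewer can jump straight to the doubtful values.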

Preprocessing pays off more than switching APIs

Before blaming your OCR engine for poor accuracy, check the input. Deskewing, denoising, increasing contrast, and converting to the right colour space can improve accuracy by 20–40% on scanned documents. Garbage in, garbage out — this is always true in OCR.
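
As one example of these steps, the sketch below stretches contrast and then binarizes a grayscale image represented as nested lists of 0 to 255 values. It is deliberately dependency-free to show the idea. A production pipeline would more likely use an image library, and deskewing and denoising are not shown. The function names and the threshold value are assumptions.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, 0 (black) to 255 (white)


def stretch_contrast(pixels: Image) -> Image:
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        return [row[:] for row in pixels]  # flat image: nothing to stretch
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in pixels]


def binarize(pixels: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black or pure white around a fixed threshold."""
    return [[255 if p >= threshold else 0 for p in row] for row in pixels]
```

For a washed-out scan such as `[[100, 150], [125, 200]]`, stretching spreads the values across the full range before the threshold is applied, which is why the order of the two steps matters.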

"Free" open-source has a real cost

Tesseract is free to run. But making it reliable on a diverse document set takes significant engineering time — preprocessing pipelines, exception handling, output normalization, server maintenance. At real business document volumes, a cloud API at $1–2 per thousand pages is almost always cheaper than the engineering cost of maintaining a self-hosted alternative.
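
The comparison is simple arithmetic, sketched below. The function names are hypothetical, and the monthly self-hosted figure in the usage note is a placeholder to replace with your own estimate of server and engineering cost.

```python
def monthly_cloud_cost(pages_per_month: int, price_per_1k_pages: float) -> float:
    """Cloud OCR spend per month at a flat per-thousand-page price."""
    return pages_per_month / 1000 * price_per_1k_pages


def break_even_pages(monthly_self_hosted_cost: float, price_per_1k_pages: float) -> float:
    """Pages per month at which cloud spend equals the self-hosted cost."""
    return monthly_self_hosted_cost / price_per_1k_pages * 1000
```

For example, 20,000 pages a month at $1.50 per thousand comes to $30. If keeping a self-hosted setup reliable costs an assumed $1,500 a month in servers and engineering time, the break-even point is one million pages a month.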

Design for humans in the exception path

A fully automated pipeline still needs a clear path for when automation fails. Build the exception UI — the review queue, the correction interface, the reprocessing mechanism — before you think you need it. You'll need it sooner than you expect.

4. Which OCR Tool Should You Use?

The three tools that matter for most production workloads are Tesseract (open source), AWS Textract, and Google Vision API. They solve meaningfully different problems.

Tesseract

Free / Self-hosted

Best for clean, printed documents at high volume where you want zero per-page cost and are willing to manage the infrastructure and post-processing work.

Struggles with: handwriting, low-quality scans, complex layouts, skewed documents

AWS Textract

~$1.50–$15 / 1k pages

Best for structured documents — forms, invoices, tax documents, applications. Understands document layout and returns key-value pairs from forms, not just raw text.

Struggles with: handwriting, extremely low-quality scans

Google Vision API

~$1.50 / 1k images

Best for photos of documents, handwritten text, and multilingual content. Excellent at reading text at odd angles or in real-world conditions.

Struggles with: structured form extraction (use Document AI for that)
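
The three summaries above can be condensed into a rough first-pass rule. This sketch only encodes the guidance in this section. The function name and its parameters are hypothetical, and the result is a starting point for testing on your own documents, not a verdict.

```python
def suggest_engine(
    *,
    handwriting: bool,
    structured_forms: bool,
    clean_print: bool,
    can_self_host: bool,
) -> str:
    """Suggest a first engine to try, based on the traits described in this section."""
    if handwriting:
        return "Google Vision API"
    if structured_forms:
        return "AWS Textract"
    if clean_print and can_self_host:
        return "Tesseract"
    # Low-quality or mixed input with no appetite for infrastructure.
    return "Cloud API (AWS Textract or Google Vision API)"
```

Handwriting is checked first because it is the trait that rules out the most options.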

Full comparison: Tesseract vs Textract vs Google Vision

Detailed breakdown of accuracy, cost, use cases, and decision guide

5. Frequently Asked Questions

What is OCR and when should I use it?

OCR converts images or scanned documents into machine-readable text. Use it when you need to extract data from PDFs, document photos, or scanned forms — and manual data entry is the current alternative.

Should I use Tesseract or a cloud OCR API?

Tesseract is free and great for clean documents at high volume if you're willing to manage the infrastructure. Cloud APIs (Textract, Google Vision) are easier to integrate, more accurate on challenging documents, and cost less than a penny per page at typical business volumes — which makes them the right default for most teams.

How much does OCR cost in production?

Cloud APIs run $1–$2 per 1,000 pages for basic text extraction — under a penny per document. Tesseract is free per-page but requires servers and significant engineering time to make reliable. The real cost of "free" OCR is usually measured in developer days, not dollars.

What is the best OCR API for PDFs?

AWS Textract for structured forms and invoices — it understands document layout and extracts key-value pairs. Google Vision for PDFs containing photos, handwriting, or multilingual text. For plain printed text, either works well.

Can OCR handle handwritten text?

Google Vision handles handwriting best. AWS Textract has limited handwriting support. Tesseract is not suitable for handwriting. If handwriting is a core requirement, Google Vision or a specialized handwriting OCR is the right choice.

Need an OCR Pipeline Built?

I've been building document automation systems for over 18 years.

OCR is often one step in a larger small-business automation pipeline. If you're looking at a document problem and want to skip the rabbit hole, reach out. We can usually scope the right approach in a single conversation.

