May 22, 2026

Less Than a Penny Per Document

People hear "I replaced my OCR pipeline with a vision model" and the first thing they ask about is cost. Fair question. I assumed it would be expensive too. Under a penny per document.

People hear "vision model" and assume expensive.

Fair. I assumed the same thing.

The Bill

Under a penny per document.

GPT-4o charges about $2.50 per million input tokens right now. A document photo is maybe 1,000-2,000 tokens for the image plus a few hundred for the prompt and response. That's $0.003 to $0.008.

Less than one cent.

What Nobody Compares

Textract is cheap per page too. About $1.50 per thousand pages. Per-unit, it's actually cheaper than the vision API.

But per-unit API cost is a terrible comparison.

Here's what the Textract approach actually cost:

My entire Saturday. Pipeline, pre-processing, regex parsers, manual review queue. At any reasonable hourly rate, that's thousands of dollars. Before a single document is correctly processed.

70% manual review. Half the time faster to just retype the thing than hunt for all the errors.

And the vision API approach? Two hours on Sunday morning. Write the integration, test a few documents, tweak the prompt. Done. 5-10% flagged for review, and those are quick fixes. A digit, an abbreviation. Not a full retype.

Numbers

500 documents:

	Textract	Vision API
API cost	~$0.75	~$2.50
Dev time	~40 hrs @ $100/hr = $4,000	~2 hrs @ $100/hr = $200
Manual review	~350 docs @ 5 min = 29 hrs	~35 docs @ 2 min = 1.2 hrs
Maintenance (3 months)	~20 hrs	~0 hrs
Total	~$6,000+	~$320

Higher per API call. Lower in every other way.

When To Use Which

I'm not going to pretend the vision API is always right. Traditional OCR still makes sense when you've got millions of identical documents from the same template, same layout, same fields in the same spots. Template matching works great there. No need to pay for a model that understands context when there's nothing to understand.

Same thing if you can't make external API calls. Air-gapped networks, edge devices, strict data residency. Tesseract locally and that's that.

And compliance. Your OCR provider might already have the certs you need. Your vision API provider might not.

But handwritten documents? Mixed layouts? Documents where you need structure and not just characters? Anything where time-to-value matters? Vision API every time.

The Quick Test

Look at one of your documents.

Could you hand it to a random person and they'd get it in a few seconds?

If yes - vision model. Less than a penny.

If a template could extract the data - traditional OCR is cheaper at volume.

If a human would struggle with it too - neither approach saves you. That's a data quality problem, not a tool problem.

If you're sitting on a stack of documents that need digitizing and you've either already been down the OCR road or you've been putting it off because you know how it goes - this is worth looking at.

Less than a penny per document. That's what I'm actually paying.

Part of the OCR pipelines series.

Share this article

Less Than a Penny Per Document

The Bill

What Nobody Compares

Numbers

When To Use Which

The Quick Test

More from the Blog

DNS for People Who've Been Faking It

Load Times and Your Websites

The Best Way to Convert Handwritten Recipes to Digital (What Actually Works)

Want to Work Together?