Less Than a Penny Per Document
People hear "I replaced my OCR pipeline with a vision model" and the first thing they ask about is cost. Fair question. I assumed it would be expensive too. Under a penny per document.
People hear "vision model" and assume expensive.
Fair. I assumed the same thing.
The Bill
Under a penny per document.
GPT-4o charges about $2.50 per million input tokens right now. A document photo is maybe 1,000-2,000 tokens for the image plus a few hundred for the prompt and response. That's $0.003 to $0.008.
Less than one cent.
What Nobody Compares
Textract is cheap per page too. About $1.50 per thousand pages. Per-unit, it's actually cheaper than the vision API.
But per-unit API cost is a terrible comparison.
Here's what the Textract approach actually cost:
My entire Saturday. Pipeline, pre-processing, regex parsers, manual review queue. At any reasonable hourly rate, that's thousands of dollars. Before a single document is correctly processed.
70% manual review. Half the time faster to just retype the thing than hunt for all the errors.
And the vision API approach? Two hours on Sunday morning. Write the integration, test a few documents, tweak the prompt. Done. 5-10% flagged for review, and those are quick fixes. A digit, an abbreviation. Not a full retype.
Numbers
500 documents:
| | Textract | Vision API | |--|---------|------------| | API cost | ~$0.75 | ~$2.50 | | Dev time | ~40 hrs @ $100/hr = $4,000 | ~2 hrs @ $100/hr = $200 | | Manual review | ~350 docs @ 5 min = 29 hrs | ~35 docs @ 2 min = 1.2 hrs | | Maintenance (3 months) | ~20 hrs | ~0 hrs | | Total | ~$6,000+ | ~$320 |
Higher per API call. Lower in every other way.
When To Use Which
I'm not going to pretend the vision API is always right. Traditional OCR still makes sense when you've got millions of identical documents from the same template, same layout, same fields in the same spots. Template matching works great there. No need to pay for a model that understands context when there's nothing to understand.
Same thing if you can't make external API calls. Air-gapped networks, edge devices, strict data residency. Tesseract locally and that's that.
And compliance. Your OCR provider might already have the certs you need. Your vision API provider might not.
But handwritten documents? Mixed layouts? Documents where you need structure and not just characters? Anything where time-to-value matters? Vision API every time.
The Quick Test
Look at one of your documents.
Could you hand it to a random person and they'd get it in a few seconds?
If yes - vision model. Less than a penny.
If a template could extract the data - traditional OCR is cheaper at volume.
If a human would struggle with it too - neither approach saves you. That's a data quality problem, not a tool problem.
If you're sitting on a stack of documents that need digitizing and you've either already been down the OCR road or you've been putting it off because you know how it goes - this is worth looking at.
Less than a penny per document. That's what I'm actually paying.