One API Call Changed Everything
Sunday morning. I'm about ready to type everything by hand and call it a weekend.
But I want to try one more thing. Instead of running OCR to extract characters and then writing code to figure out what those characters mean, what if I just send the image to a vision model and ask what the document says?
The Code
import openai
import base64
import json
from pathlib import Path


def extract_document_vision(image_path):
    # Read the image and base64-encode it so it can travel inline as a data URL.
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """Extract handwritten documents into structured
                JSON. Identify:
                - title: the document title or heading
                - items: array of {quantity, unit, item} for any
                  listed items with measurements
                - instructions: array of step strings for any
                  procedural content
                - notes: any additional annotations or side notes
                If something is crossed out, ignore it.
                If you can't read something clearly, make your
                best interpretation and add a "uncertain": true
                flag to that field."""
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_data}"
                        }
                    }
                ]
            }
        ],
        # JSON mode: the reply is a single JSON object.
        response_format={"type": "json_object"}
    )

    message = response.choices[0].message
    # A refusal or a cut-off response can come back with no content at all.
    if message.content is None:
        raise ValueError(
            f"No content returned. "
            f"Finish reason: {response.choices[0].finish_reason}. "
            f"Refusal: {message.refusal}"
        )
    return json.loads(message.content)


if __name__ == "__main__":
    # Process every file in images/ in name order.
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        try:
            result = extract_document_vision(image_file)
            print(json.dumps(result, indent=2))
        except ValueError as e:
            # Covers the empty-content case and malformed JSON
            # (json.JSONDecodeError is a ValueError subclass).
            print(f"Skipped: {e}")
No pre-processing. No regex. No parser. One API call and a prompt in English.
Compare that to yesterday.
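One wrinkle in the script above: the data URL always says `image/jpeg`, while the loop globs every file in the folder, PNGs and stray non-image files included. Here is a minimal sketch of a helper that picks the MIME type from the file extension instead. The name `image_to_data_url` is mine, not part of any library, and guessing from the extension is an assumption that the files are named honestly.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path):
    """Build a base64 data URL whose MIME type matches the file extension.

    Hypothetical helper for illustration; not part of the openai package.
    """
    path = Path(image_path)
    # guess_type returns (type, encoding); type is None for unknown extensions.
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Because it raises `ValueError`, a non-image file would land in the same "Skipped" branch the main loop already has, rather than being sent to the API.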
What Came Back
The document that Textract turned into "2 1/4 c fleur, 1 tso bokrig sado":
{
  "title": "Chocolate Chip Cookies",
  "items": [
    {"quantity": "2 1/4", "unit": "cups", "item": "flour"},
    {"quantity": "1", "unit": "tsp", "item": "baking soda"},
    {"quantity": "1", "unit": "tsp", "item": "salt"},
    {"quantity": "1", "unit": "cup", "item": "butter", "notes": "2 sticks"}
  ],
  "instructions": [
    "Preheat oven to 375°F",
    "Combine flour, baking soda and salt in small bowl",
    "Beat butter, granulated sugar, brown sugar and vanilla extract in large mixer bowl until creamy",
    "Add eggs, beating well",
    "Gradually beat in flour mixture",
    "Stir in chocolate chips",
    "Drop rounded tablespoon of dough onto ungreased baking sheets",
    "Bake for 9 to 11 minutes or until golden brown"
  ],
  "notes": null
}
First try. "c" became "cups." "tsp" stayed "tsp" because that's already standard. It caught "(2 sticks)" as a note on the butter and put it in the right field.
Why
OCR asks "what characters are in this image?" Hard problem when the characters are messy handwriting.
The vision model asks "what does this document say?" Sounds like the same question. It's not.
Think about how you read someone's handwriting. You don't decode each letter and build words from shapes. You look at the whole thing, and between context, layout, and your knowledge of language, you just know. Even when individual letters are a mess.
That's what's happening here. The model isn't a better letter-recognizer. It's skipping that problem entirely.
The Stuff That Broke OCR
The crossed-out line that killed my parser? Vision model saw the strikethrough, ignored it, read the correction. No code for that. Just worked.
Marginal notes Textract mixed into the main text? Identified as supplementary. Put in the "notes" field.
Abbreviations Tesseract turned into garbage? Interpreted from context.
The layout I spent 200 lines of regex on? Figured out on its own. Titles in "title." Items in "items." Steps in "instructions."
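"Figured out on its own" still deserves a cheap sanity check. JSON mode gets you valid JSON, but nothing forces the model to use exactly the keys the prompt asked for. A rough sketch of a shape check, assuming the key names from the system prompt; `check_shape` is a made-up name, and this is not a full schema validator.

```python
def check_shape(result):
    """Return a list of problems in an extraction result; empty means OK.

    Illustrative helper only. Key names come from the system prompt.
    """
    if not isinstance(result, dict):
        return ["top level is not a JSON object"]

    problems = []
    for key in ("title", "items", "instructions", "notes"):
        if key not in result:
            problems.append(f"missing key: {key}")

    # A missing or null "items" is treated as an empty list.
    items = result.get("items")
    if items is None:
        items = []
    if not isinstance(items, list):
        problems.append("items is not a list")
        items = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"items[{index}] is not an object")
            continue
        for field in ("quantity", "unit", "item"):
            if field not in item:
                problems.append(f"items[{index}] missing {field}")

    # A missing or null "instructions" is treated as an empty list.
    steps = result.get("instructions")
    if steps is None:
        steps = []
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        problems.append("instructions is not a list of strings")

    return problems
```

Anything it reports goes into the manual review pile instead of silently flowing downstream.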
Three Approaches, Same Document
Tesseract:
Chocohite Ch p Cookes
2 114 cps flcar
1 tso bokrg sado
l tsp sit
1 c (2 stcks) btter
Textract (after all the pre-processing and parsing):
Title: Chocokite Chtp Cookes (confidence: 0.67)
Items:
- 2 1/4 c fleur
- 1 tso bokrig sado
- l tsp slt
- 1 c (2 stcks) btter
[MANUAL REVIEW REQUIRED - 4 items below confidence threshold]
Vision API:
{
  "title": "Chocolate Chip Cookies",
  "items": [
    {"quantity": "1", "unit": "tsp", "item": "baking soda"},
    {"quantity": "2 1/4", "unit": "cups", "item": "flour"},
    {"quantity": "1", "unit": "tsp", "item": "salt"},
    {"quantity": "1", "unit": "cup", "item": "butter", "notes": "2 sticks"}
  ]
}
| Metric | Tesseract | Textract | Vision API |
|--------|-----------|----------|------------|
| Character accuracy | 30-40% | 40-60% | 95%+ |
| Structure accuracy | N/A | ~30% | ~90% |
| Manual review needed | ~90% | ~70% | ~5-10% |
| Pre-processing | Yes | Yes (6 params) | None |
| Lines of code | ~50 | 300+ | ~30 |
| Dev time | ~4 hours | ~40 hours | ~2 hours |
The vision model's mistakes are small. A "3" that might be an "8." An abbreviation it flagged as uncertain. Stuff you catch in seconds. Not garbled output you have to retype.
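To find those flagged spots without reading every JSON file, you can walk the result and list where `"uncertain": true` shows up. A small sketch, assuming the model puts the flag on the object that holds the doubtful value. The prompt only says "add a flag to that field", so the exact placement is a guess, and `find_uncertain` is a name I made up.

```python
def find_uncertain(node, path="$"):
    """Yield a JSON-style path for every object carrying "uncertain": true.

    Illustrative helper; assumes the flag sits on the enclosing object.
    """
    if isinstance(node, dict):
        if node.get("uncertain") is True:
            yield path
        # Recurse into every value so nested flags are found too.
        for key, value in node.items():
            yield from find_uncertain(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_uncertain(value, f"{path}[{index}]")


# Example with a made-up result (not real model output).
sample = {
    "title": "Example",
    "items": [
        {"quantity": "3", "unit": "tbsp", "item": "sugar", "uncertain": True},
        {"quantity": "1", "unit": "tsp", "item": "salt"},
    ],
}
for location in find_uncertain(sample):
    print("Review:", location)
```

Each path points straight at the entry to compare against the original page.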
By Sunday afternoon, everything is processed. The thing I spent all of Saturday failing to do took a couple of hours once I changed the approach.