May 20, 2026

When the Cleanup Code Becomes the Project

Tesseract can't do handwriting. Time to spend money.

AWS Textract. Cloud service, built-in handwriting detection, pay per page. If I'm paying for it, the output should at least be usable.

Textract

import boto3
from pathlib import Path

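def extract_document(image_path):
    """Send one image to Textract and return its lines with confidence scores."""
    client = boto3.client('textract')
    
    # The synchronous API takes the raw image bytes directly.
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    
    response = client.detect_document_text(
        Document={'Bytes': image_bytes}
    )
    
    # Blocks come back as PAGE, LINE, and WORD; keep only whole lines.
    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']  # 0-100
            })
    
    return lines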

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        for line in extract_document(image_file):
            print(f"  [{line['confidence']:5.1f}%] {line['text']}")

Confidence scores are a nice touch. Accuracy is better too: maybe 40-60% on a good document.

But "better" isn't "good enough." "2 1/4 cups flour" comes back as "2 1/4 c fleur." "1 tsp baking soda" becomes "1 tso bokrig sado."
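One thing the confidence scores are good for is deciding which lines a human should look at first. A minimal sketch, assuming lines in the same shape extract_document() returns. The function name split_by_confidence and the 80.0 cutoff are placeholders I made up for illustration, not tuned values, and the sample scores are invented.

```python
def split_by_confidence(lines, threshold=80.0):
    """Separate OCR lines into ones to keep and ones to send to a human.

    `lines` is a list of {'text': str, 'confidence': float} dicts.
    The 80.0 default is an arbitrary placeholder, not a tuned value.
    """
    trusted, needs_review = [], []
    for line in lines:
        bucket = trusted if line['confidence'] >= threshold else needs_review
        bucket.append(line)
    return trusted, needs_review


if __name__ == "__main__":
    # Made-up confidence values, for illustration only.
    sample = [
        {'text': 'Sugar Cookies', 'confidence': 96.2},
        {'text': '2 1/4 c fleur', 'confidence': 71.5},
        {'text': '1 tso bokrig sado', 'confidence': 43.8},
    ]
    trusted, needs_review = split_by_confidence(sample)
    print("keep:  ", [line['text'] for line in trusted])
    print("review:", [line['text'] for line in needs_review])
```

A high score is not a guarantee that the text is right, so this only narrows down what a person has to check. It doesn't replace the check.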

The Real Problem

Even when it gets the words right, Textract doesn't know what any of it means. Flat text. Lines in reading order. My documents have titles, ingredient lists, instruction paragraphs, notes in margins. Textract sees none of that. Just characters on a page.
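Each block in the response does carry a Geometry.BoundingBox, with Top and Height given as fractions of the page. That's position, not meaning, but it's the only layout signal available. A rough sketch of one way to use it: start a new chunk wherever the vertical gap between lines is large. The function name group_by_vertical_gap and the gap_factor knob are hypothetical, and sorting by Top assumes a single column, so margin notes would still land in the wrong place.

```python
def group_by_vertical_gap(blocks, gap_factor=1.5):
    """Group LINE blocks into chunks separated by unusually large vertical gaps.

    Each LINE block needs 'Text' and Geometry -> BoundingBox with 'Top' and
    'Height' (fractions of the page height).
    gap_factor is a hypothetical tuning knob, not a recommended value.
    """
    line_blocks = [b for b in blocks if b.get('BlockType') == 'LINE']
    line_blocks.sort(key=lambda b: b['Geometry']['BoundingBox']['Top'])

    groups = []
    prev_bottom = None
    for block in line_blocks:
        box = block['Geometry']['BoundingBox']
        # Start a new group when the whitespace above this line is much
        # taller than the line itself.
        if prev_bottom is None or box['Top'] - prev_bottom > gap_factor * box['Height']:
            groups.append([])
        groups[-1].append(block['Text'])
        prev_bottom = box['Top'] + box['Height']
    return groups
```

This only says which lines sit together. It still has no idea which chunk is the title and which is the ingredient list.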

So now I'm writing parsers.

import boto3
import json
import re
from pathlib import Path

def extract_document(image_path):
    client = boto3.client('textract')
    
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    
    response = client.detect_document_text(
        Document={'Bytes': image_bytes}
    )
    
    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']
            })
    
    return lines

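def parse_structured_data(raw_lines):
    """Sort OCR lines into a title, ingredient items, and instructions."""
    title = None
    items = []
    instructions = []
    
    # Quantity (whole, fraction, decimal, or mixed like "2 1/4"), then unit, then item.
    quantity_pattern = r'^(\d+(?:\s+\d+/\d+|/\d+|\.\d+)?)\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)'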
    
    for line in raw_lines:
        text = line['text'].strip()
        match = re.match(quantity_pattern, text, re.IGNORECASE)
        
        if match:
            items.append({
                'quantity': match.group(1),
                'unit': match.group(2),
                'item': match.group(3)
            })
        elif not title:
            title = text
        else:
            instructions.append(text)
    
    return {'title': title, 'items': items, 'instructions': instructions}

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        raw_lines = extract_document(image_file)
        result = parse_structured_data(raw_lines)
        print(json.dumps(result, indent=2))

Works on 30% of the documents. The other 70% break at least one assumption. Title not on line one. Quantities written backwards. Abbreviations I've never seen. Crossed-out text mixed into the content. Multi-line entries split apart.

Every new document, a new edge case. Every new edge case, another if, another regex.
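To make the failure modes concrete, here is a small sketch that runs a simplified stand-in for the parser over four invented lines. The parse function, the QUANTITY pattern, and the sample page are all made up for illustration. They mimic the assumptions described above rather than reproduce the full script.

```python
import re

# A simplified stand-in for the parser: quantity + unit + item. The first
# non-matching line becomes the title, everything else is instructions.
QUANTITY = re.compile(
    r'^(\d+(?:/\d+)?)\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)',
    re.IGNORECASE,
)


def parse(texts):
    title, items, instructions = None, [], []
    for text in texts:
        match = QUANTITY.match(text.strip())
        if match:
            items.append(match.groups())
        elif title is None:
            title = text
        else:
            instructions.append(text)
    return {'title': title, 'items': items, 'instructions': instructions}


if __name__ == "__main__":
    # Invented lines that mimic the failure cases listed above.
    page = [
        "From the kitchen of Aunt May",  # a header, not the title
        "Sugar Cookies",                 # the real title
        "flour, 2 cups",                 # quantity written after the item
        "1 tsp baking soda",             # the one line that fits the pattern
    ]
    result = parse(page)
    print(result['title'])         # the header wins, because it came first
    print(result['items'])         # only the baking soda line
    print(result['instructions'])  # the real title and the flour end up here
```

One line out of four lands where it should. Fixing the other three means special-casing headers and reversed quantities, and each of those fixes is a new assumption the next document can break.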

Saturday Night

Here's where I'm at:

Pre-processing with 6 configurable parameters
200+ lines of regex and heuristics
70% of documents still need a human
Accuracy that I'm being generous in calling 30%

The parser is now more work than just typing things by hand.

And every time I fix one document's output, three others break. The heuristics are fragile. Interconnected. Basically untestable because no two documents look alike.

One document has a crossed-out line. Original text scratched out, correction written above. Any person glances at it and reads the correction. Half a second.

Textract returns both lines. Jumbled. My parser doesn't know what a strikethrough is. Teaching it would mean analyzing the spatial layout of ink strokes. That's not a text problem anymore. That's a computer vision problem.

I'm a full day in. The system I'm building reads worse than I do, and the code to make it slightly less bad is growing faster than the documents it's supposed to process.

Part of the OCR pipelines series.
