May 19, 2026

The OCR Rabbit Hole

Long weekend. Pile of handwritten documents on the desk. They need to become structured data - searchable fields in an app, not just scanned images.

OCR's been around forever. This should take a couple of hours.

The Documents

They're handwritten. Different people, different paper, different decades. Faded ink on some. Crossed-out words on others. Notes crammed into margins where there wasn't really room.

Any person can read these in about five seconds. Remember that.

Tesseract

Saturday morning. Tesseract is free, it's open source, I can run it on my own machine. Good enough to start.

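import pytesseract
from PIL import Image
from pathlib import Path

# Assumed set of scan formats; adjust to match what is actually in images/.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def extract_document(image_path):
    # Run Tesseract with its default settings and return the raw text.
    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        # Skip stray non-image files (e.g. .DS_Store) that PIL cannot open.
        if image_file.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        print(f"\n--- {image_file.name} ---")
        print(extract_document(image_file))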

Printed text? Fine. Handwriting? "2 1/4 cups flour" comes back as "2 114 cps flcar."

I try different config options. Page segmentation modes. OCR engine modes. Character allowlists. Each one helps somewhere and breaks something else.
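Roughly what that kind of sweep looks like, as a sketch. The helper names (build_configs, run_sweep) and the particular mode numbers are illustrative assumptions, not the exact values tried here. --psm, --oem, and tessedit_char_whitelist are real Tesseract options, but how strictly the allowlist is honored depends on the Tesseract version and engine.

```python
from itertools import product


def build_configs(psm_modes=(4, 6, 11), oem_modes=(1, 3), allowlist=None):
    """Build one Tesseract config string per (psm, oem) combination.

    The default mode numbers are hypothetical picks for illustration:
    psm 4 = single column, 6 = uniform block, 11 = sparse text;
    oem 1 = LSTM engine only, 3 = whatever is available (default).
    """
    configs = []
    for psm, oem in product(psm_modes, oem_modes):
        config = f"--psm {psm} --oem {oem}"
        if allowlist:
            # Restrict recognition to these characters (no spaces in the value).
            config += f" -c tessedit_char_whitelist={allowlist}"
        configs.append(config)
    return configs


def run_sweep(image_path, configs):
    """Run Tesseract once per config and return {config: recognized text}."""
    # Imported here so build_configs works without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return {c: pytesseract.image_to_string(img, config=c) for c in configs}


if __name__ == "__main__":
    for config in build_configs(allowlist="0123456789/"):
        print(config)
```

Six configs per image, each with its own failure pattern, is where the comparing-by-eye starts to eat the morning.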

Pre-Processing

Maybe the images are the problem.

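import pytesseract
from PIL import Image, ImageFilter, ImageEnhance, ImageOps
import numpy as np
from pathlib import Path

# Assumed set of scan formats; adjust to match what is actually in images/.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}

def preprocess_document(image_path):
    img = Image.open(image_path)
    img = img.convert('L')                         # grayscale
    img = ImageOps.autocontrast(img, cutoff=2)     # stretch histogram, clip 2% at each end
    img = img.filter(ImageFilter.SHARPEN)
    img = ImageEnhance.Contrast(img).enhance(1.8)  # boost contrast by 1.8x

    # Binarize against a single global threshold: the page's mean brightness.
    img_array = np.array(img)
    threshold = np.mean(img_array)
    img_array = ((img_array > threshold) * 255).astype(np.uint8)

    return Image.fromarray(img_array)

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        # Skip stray non-image files that PIL cannot open.
        if image_file.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        print(f"\n--- {image_file.name} ---")
        img = preprocess_document(image_file)
        print(pytesseract.image_to_string(img))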

Contrast boosting helps the faded ones. Hurts the clear ones. Binarization destroys the subtle parts of the handwriting. Sharpening makes some letters readable and others bleed together.
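A toy illustration of why the binarization step is so touchy. It applies the same "brighter than the mean becomes white" rule to two made-up rows of grayscale values. The pixel values and row layouts are invented for the example, not measured from the scans, and plain lists stand in for the NumPy array.

```python
from statistics import mean


def binarize_global_mean(pixels):
    """Same rule as the NumPy step: above the mean -> 255 (white), else 0 (black)."""
    threshold = mean(pixels)
    return [255 if p > threshold else 0 for p in pixels]


# Hypothetical grayscale values (0 = black, 255 = white).
PAPER, FADED_INK, DARK_INK = 230, 170, 40

# Mostly blank paper with a little faded writing: mean is 215.
clean_row = [PAPER] * 12 + [FADED_INK] * 4

# Same faded writing next to heavy dark ink: mean drops to 143.75.
crowded_row = [PAPER] * 6 + [FADED_INK] * 4 + [DARK_INK] * 6

if __name__ == "__main__":
    # Faded ink (170) is below 215, so it survives as black.
    print(binarize_global_mean(clean_row)[12:])
    # Faded ink (170) is above 143.75, so it is wiped to white.
    print(binarize_global_mean(crowded_row)[6:10])
```

Same ink, same rule, opposite result. The threshold depends on everything else on the page.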

Every fix is a trade-off. I'm tuning parameters per document. That's not automation.

30-40% accuracy on the handwriting. Saturday morning gone.
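One way to put a number like that on it, sketched under assumptions: score each document by the fraction of hand-typed reference words that Tesseract got exactly right, in order. This is not necessarily how the figure above was arrived at, and the function name word_accuracy is made up for the example.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, ocr_output):
    """Fraction of reference words that appear, in order, in the OCR output.

    Words are split on whitespace and must match exactly, so "cps" does
    not count as "cups". Returns 0.0 for an empty reference.
    """
    ref_words = reference.split()
    ocr_words = ocr_output.split()
    if not ref_words:
        return 0.0
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    # Only "2" survives intact out of four words.
    print(word_accuracy("2 1/4 cups flour", "2 114 cps flcar"))  # 0.25
```

Exact word matching is harsh, but for structured data a nearly-right quantity is still a wrong quantity.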

Tesseract is great at printed text. That's not what I have.

This post is part of the OCR series. Next: When the Cleanup Code Becomes the Project
