The OCR Rabbit Hole
Long weekend. Pile of handwritten documents on the desk. They need to become structured data: searchable fields in an app, not just scanned images.
OCR's been around forever. It's a solved problem, right? This should take a couple of hours.
The Documents
They're handwritten. Different people, different paper, different decades. Faded ink on some. Crossed-out words on others. Notes crammed into margins where there wasn't really room.
Any person can read these in about five seconds. Remember that.
Tesseract
Saturday morning. Tesseract is free, it's open source, I can run it on my own machine. Good enough to start.
import pytesseract
from PIL import Image
from pathlib import Path

def extract_document(image_path):
    # Run Tesseract with its default settings and return the raw text.
    img = Image.open(image_path)
    text = pytesseract.image_to_string(img)
    return text

if __name__ == "__main__":
    # OCR every file in images/, in name order.
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        print(extract_document(image_file))
Printed text? Fine. Handwriting? "2 1/4 cups flour" comes back as "2 114 cps flcar."
I try different config options. Page segmentation modes. OCR engine modes. Character allowlists. Each one helps somewhere and breaks something else.
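A sketch of what that kind of sweep can look like. The specific mode numbers and the allowlist below are illustrative assumptions, and build_configs and ocr_with_each_config are hypothetical helper names. The flags themselves (--psm, --oem, tessedit_char_whitelist) are real Tesseract options, passed through the config argument of pytesseract.image_to_string. How well the allowlist is honored may depend on the Tesseract version and engine.

```python
from itertools import product

# Illustrative search space (assumed values, not a recommendation).
PSM_MODES = [4, 6, 11]   # 4: single column, 6: uniform block, 11: sparse text
OEM_MODES = [1, 3]       # 1: LSTM engine only, 3: default engine selection
ALLOWLIST = "0123456789/abcdefghijklmnopqrstuvwxyz"  # no spaces: config is split on whitespace


def build_configs(psm_modes=PSM_MODES, oem_modes=OEM_MODES, allowlist=None):
    """Return one Tesseract config string per (oem, psm) combination."""
    configs = []
    for oem, psm in product(oem_modes, psm_modes):
        config = f"--oem {oem} --psm {psm}"
        if allowlist:
            config += f" -c tessedit_char_whitelist={allowlist}"
        configs.append(config)
    return configs


def ocr_with_each_config(image_path, configs):
    """Run OCR once per config and return {config: recognized text}."""
    # Imported lazily so build_configs can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    img = Image.open(image_path)
    return {c: pytesseract.image_to_string(img, config=c) for c in configs}


if __name__ == "__main__":
    for config in build_configs(allowlist=ALLOWLIST):
        print(config)
```

With the defaults above that is six configurations per image, each of which still has to be compared by eye or against a known transcription.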
Pre-Processing
Maybe the images are the problem.
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance, ImageOps
import numpy as np
from pathlib import Path

def preprocess_document(image_path):
    img = Image.open(image_path)
    img = img.convert('L')                         # grayscale
    img = ImageOps.autocontrast(img, cutoff=2)     # stretch histogram, clip 2% at each end
    img = img.filter(ImageFilter.SHARPEN)
    img = ImageEnhance.Contrast(img).enhance(1.8)  # boost contrast for faded ink
    # Binarize: pixels brighter than the image mean become white, the rest black.
    img_array = np.array(img)
    threshold = np.mean(img_array)
    img_array = ((img_array > threshold) * 255).astype(np.uint8)
    return Image.fromarray(img_array)

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        img = preprocess_document(image_file)
        print(pytesseract.image_to_string(img))
Contrast boosting helps the faded ones. Hurts the clear ones. Binarization destroys the subtle parts of the handwriting. Sharpening makes some letters readable and others bleed together.
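To see why a single global threshold is so lossy, here is a toy example in plain Python. The pixel values are made up, and binarize_by_mean is a hypothetical stand-in for the mean-threshold step in preprocess_document: one row of pixels with three bold strokes, light paper, and a single faint stroke.

```python
def binarize_by_mean(pixels):
    """Global threshold at the mean: brighter than the mean -> 255, else 0."""
    threshold = sum(pixels) / len(pixels)
    return [255 if p > threshold else 0 for p in pixels]


# Made-up grayscale row (0 = black, 255 = white).
# 235 = paper, 30 = bold ink, 215 = a faint pencil stroke.
row = [235, 30, 30, 235, 215, 235, 30, 235]

binary = binarize_by_mean(row)
print(sum(row) / len(row))  # 155.625
print(binary)               # [255, 0, 0, 255, 255, 255, 0, 255]
```

The bold strokes survive. The faint stroke at 215 sits above the mean, so it is pushed to pure white and is gone before Tesseract ever sees it.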
Every fix is a trade-off. I'm tuning parameters per document. That's not automation.
30-40% accuracy on the handwriting. Saturday morning gone.
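How to score OCR output is a choice in itself. One simple option, assumed here for illustration rather than taken from any standard benchmark, is character-level similarity against a hand-typed transcription, using difflib from the standard library. char_similarity is a hypothetical helper name, and the sample strings are the flour example from earlier.

```python
from difflib import SequenceMatcher


def char_similarity(expected, actual):
    """Return a 0.0-1.0 similarity ratio between two strings.

    This is difflib's ratio (2 * matching characters / total characters),
    a rough proxy for accuracy, not a formal character error rate.
    """
    return SequenceMatcher(None, expected, actual).ratio()


if __name__ == "__main__":
    truth = "2 1/4 cups flour"
    ocr_output = "2 114 cps flcar"
    print(f"similarity: {char_similarity(truth, ocr_output):.2f}")
```

A character-level score is forgiving. A line can score well and still be useless as structured data if the one wrong character is in the quantity.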
Tesseract is great at printed text. That's not what I have.