Easily convert scanned or photographed documents into editable, machine-readable text using Optical Character Recognition (OCR).

From this:

Image description

To this:

Image description

Method 1: Using AWS Textract API

This method uses Amazon Textract, AWS's managed OCR service, to extract text from images. It works well for printed text, especially on structured documents.

Features:

  • Batch process multiple files
  • Automatically saves extracted content to .txt files
  • Supports line-by-line text extraction
  • Easy to extend and integrate

Sample code:

Image description
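A minimal sketch of the batch workflow described above might look like the following. It assumes boto3 is installed and AWS credentials are already configured; the function names (`lines_from_response`, `extract_to_txt`, `process_folder`) and the list of accepted file extensions are illustrative choices, not taken from the original sample.

```python
from pathlib import Path

# Assumption: only PNG and JPEG inputs are picked up by this sketch.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def lines_from_response(response):
    """Return the text of every LINE block in a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def extract_to_txt(image_path, out_dir, client):
    """Send one image to Textract and save its lines to <stem>.txt."""
    image_path = Path(image_path)
    response = client.detect_document_text(
        Document={"Bytes": image_path.read_bytes()}
    )
    out_path = Path(out_dir) / (image_path.stem + ".txt")
    out_path.write_text("\n".join(lines_from_response(response)), encoding="utf-8")
    return out_path


def process_folder(in_dir, out_dir, client=None):
    """Batch-process every supported image in in_dir."""
    if client is None:
        import boto3  # imported lazily so the helpers work without AWS access

        client = boto3.client("textract")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    images = sorted(
        p for p in Path(in_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    return [extract_to_txt(p, out_dir, client) for p in images]
```

Calling `process_folder("scans", "output")` would write one .txt file per image. Passing your own `client` makes it easy to set a region or to swap in a stub while testing.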

Method 2: Pytesseract

Tools We Use

  • Python
  • OpenCV – For image processing
  • Pytesseract – A Python wrapper for Google’s Tesseract-OCR Engine
  • Pillow (PIL) – For additional image support
  • CSV – To export results

Before diving into the code, make sure Tesseract is installed and that pytesseract knows where to find the binary. On macOS with a Homebrew install on Apple silicon, the path is typically:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/opt/homebrew/bin/tesseract'
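If you would rather not hard-code the path, one option is to look the binary up at runtime. The helper below is a hypothetical sketch: `find_tesseract` is not part of pytesseract, and the fallback locations are assumptions based on common Homebrew install prefixes.

```python
import shutil
from pathlib import Path

# Assumed fallback locations: Homebrew on Apple silicon and on Intel Macs.
DEFAULT_CANDIDATES = (
    "/opt/homebrew/bin/tesseract",
    "/usr/local/bin/tesseract",
)


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return a path to the tesseract binary, or None if it cannot be found."""
    on_path = shutil.which("tesseract")  # honours the current PATH first
    if on_path:
        return on_path
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


# Usage (requires pytesseract):
#   import pytesseract
#   path = find_tesseract()
#   if path is None:
#       raise RuntimeError("Tesseract is not installed or not on PATH")
#   pytesseract.pytesseract.tesseract_cmd = path
```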

Step 1: Image Preprocessing Pipeline

OCR engines work best when the input text is clear, contrast-rich, and isolated from background noise. Here's the sequence of transformations applied to each image:

  1. Grayscale Conversion
    Converting the image to grayscale helps reduce complexity and makes it easier to isolate text.

  2. Sharpening
    We apply a Laplacian filter to enhance edges and make the text pop.

  3. Inversion
    OCR engines often perform better when text is black on a white background, so images with light text on a dark background are inverted.

  4. Thresholding
    We binarise the image using Otsu’s method, separating text from background.

  5. Denoising
    We remove small artifacts with a median blur filter.

  6. Font Smoothing (Optional)
    If text appears too thin or thick, you can apply morphological operations (dilation/erosion).

Step 2: OCR Text Extraction

Once the image is processed, we convert it back to RGB (the channel order pytesseract assumes for NumPy arrays) and extract the text.

Step 3: Save to CSV

Finally, we store the extracted text in a CSV file for later use.
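Exporting needs only Python's built-in csv module. In this sketch the column names `filename` and `text`, and the function name `save_to_csv`, are assumptions chosen for illustration.

```python
import csv
from pathlib import Path


def save_to_csv(results, csv_path):
    """Write (filename, text) pairs to a CSV file with a header row."""
    csv_path = Path(csv_path)
    # newline="" lets the csv module handle line endings inside fields.
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "text"])
        for name, text in results:
            writer.writerow([name, text])
    return csv_path
```

Multi-line OCR output is quoted automatically, so each image still occupies a single CSV record.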

Sample code:

Image description

Image description

What's Next?

Once your text is extracted, check out my companion repository:
https://github.com/TheOxfordDeveloper/Parsing-unstructured-data.git

It shows how to clean and convert raw OCR text into structured, tabular formats, like this:

Image description