Hey there, tech enthusiasts! 👋 Ever wanted to extract text from PDF documents but found traditional OCR solutions lacking in accuracy and context understanding? That's exactly the challenge I decided to tackle in my recent project. In this article, I'll take you through my journey of comparing the OCR capabilities of two powerhouse Large Language Models available through Amazon Bedrock: Claude 3.7 Sonnet and Amazon's own Nova Pro.

The PDF Challenge: Beyond Traditional OCR

PDF documents present a unique challenge for text extraction. While they may look like simple text documents to human eyes, they're actually complex containers that can include various elements:

  • Text layers that may or may not be selectable
  • Images with embedded text
  • Complex layouts with tables and multi-column formats
  • Mixed font styles and sizes
  • Potential scanning artifacts

Traditional OCR tools like Tesseract often struggle to maintain original formatting, understand tables, or handle lower-quality scans. This is where modern multimodal LLMs enter the picture, offering a more interpretative approach to text extraction.
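
For a sense of the gap, here is what a traditional pass over a page image might look like with pytesseract (a sketch for contrast, not part of this project; it assumes Tesseract is installed locally):

import pytesseract
from PIL import Image

# Plain character recognition: no table awareness, no layout reasoning
raw_text = pytesseract.image_to_string(Image.open("./scanned_page.jpeg"))
print(raw_text)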

Project Overview: A Bedrock-Powered PDF Reader

My llms-ocr-comparation project aims to answer a specific question: how do two of Amazon Bedrock's most capable models—Claude 3.7 Sonnet and Amazon Nova Pro—compare when extracting text from PDF documents?

The project structure is straightforward:

├── documents.ipynb   # The main notebook with all code
├── documents/        # Input PDF files
├── images/           # Converted PDF pages as images
├── texts/            # Extracted text results
└── README.md

How It Works: The Technical Deep Dive

Looking at the code in documents.ipynb, we can see a well-structured pipeline for PDF text extraction:

Step 1: PDF to Image Conversion

The first step uses the PyMuPDF (fitz) library to convert each page of a PDF into a high-resolution image:

import fitz  # PyMuPDF

document = fitz.open("./documents/sample.pdf")

for page_number, page in enumerate(document):
    document_image = f"./images/page_{page_number + 1}.jpeg"
    pix = page.get_pixmap(alpha=False, dpi=300)  # render at 300 DPI, no alpha channel
    pix.save(document_image)

This conversion is crucial because it normalizes the input for both models—whether the original PDF had selectable text or not, we're converting everything to an image to test the pure OCR capabilities of these LLMs.
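
The extraction functions in the next step take a base64-encoded image, so between these two steps each page image needs to be encoded; a minimal sketch of that bridge:

import base64

# Encode the rendered page for the Bedrock request payload
with open(document_image, "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")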

Step 2: Setting Up the Models

The notebook defines two functions, one for each model:

def extract_text_with_claude_3_7_sonnet(base64_image):
    start_time = time.time()
    model_id = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
    # Function code...

def extract_text_with_nova_pro(base64_image):
    start_time = time.time()
    model_id = "us.amazon.nova-pro-v1:0"
    # Function code...
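
The notebook elides the function bodies, but judging from the signatures and the metrics collected later, each one wraps a boto3 call to the Bedrock runtime. Here is a minimal sketch of what the Claude variant might look like using invoke_model with the Anthropic Messages payload; the return shape is my own assumption, and instructions is the prompt shown next:

import json
import time
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def extract_text_with_claude_3_7_sonnet(base64_image):
    start_time = time.time()
    model_id = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": base64_image}},
                {"type": "text", "text": instructions},
            ],
        }],
    })
    response = bedrock_runtime.invoke_model(
        modelId=model_id, body=body, contentType="application/json"
    )
    result = json.loads(response["body"].read())
    end_time = time.time()
    # Return the text plus the numbers reported in the metrics section
    return {
        "text": result["content"][0]["text"],
        "input_tokens": result["usage"]["input_tokens"],
        "output_tokens": result["usage"]["output_tokens"],
        "start_time": start_time,
        "end_time": end_time,
    }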

What's particularly interesting is the carefully crafted prompt used for both models:

instructions = """Please extract and format the readable text from the provided image, respecting the original structure as much as possible. Follow these instructions:

- For continuous text, keep the original separation by line breaks.
- If there are tables, use Markdown syntax to present them in an organized way:

Example of expected output:

- For plain text: [Extracted text with line breaks as necessary]

- For documents with tables:
| Header1 | Header2 | Header3 |
|---------|---------|---------|
|  Data1  |  Value1 |  Value1 |
|  Data2  |  Value2 |  Value2 |

Note: Avoid adding additional interpretations or comments to the extracted content."""

This prompt does something crucial that traditional OCR tools can't do: it provides context and instructions about how to interpret and format the extracted text, particularly for tables.

Step 3: Parallel Processing with asyncio

One of the clever aspects of this implementation is the use of asyncio to run both model calls concurrently:

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
loop = asyncio.get_event_loop()

async def parallel_process():
    # Run both blocking Bedrock calls in worker threads and await them together
    return await asyncio.gather(
        loop.run_in_executor(executor, extract_text_with_claude_3_7_sonnet, base64_image),
        loop.run_in_executor(executor, extract_text_with_nova_pro, base64_image)
    )

claude_3_7_sonnet_result, nova_pro_result = await parallel_process()

This approach maximizes efficiency by sending the same image to both models simultaneously, rather than waiting for one to complete before starting the next.

The LLM Advantage in OCR: Beyond Character Recognition

Looking at the code and the README, it's clear that this project is exploring how modern LLMs are transforming what we traditionally think of as OCR. While traditional OCR tools focus on character recognition, these LLMs are doing something much more sophisticated:

1. Contextual Understanding

Traditional OCR operates on a character-by-character or word-by-word basis. LLMs, however, can "read" the document more like a human would, using context to improve accuracy. If a character is partially obscured or ambiguous, the model can make an educated guess based on surrounding words and the overall context of the document.

2. Format Preservation

The prompt specifically instructs the models to preserve formatting, including tables. This is evident in how the models are asked to convert tables to Markdown format, maintaining the relationships between data cells—something traditional OCR often fails at.

3. Intelligent Interpretation

LLMs can distinguish between different document elements—headings, body text, tables, etc.—and format them appropriately. This level of document understanding goes well beyond simple text extraction.

The Results: Claude 3.7 Sonnet vs. Nova Pro

Looking at the output metrics from the notebook:

Claude 3.7 Sonnet:
  Input Tokens  : 1666
  Output Tokens : 1036
  Start Time    : 1746124263.978323
  End Time      : 1746124292.999416

Amazon Nova Pro:
  Input Tokens  : 2223
  Output Tokens : 971
  Start Time    : 1746124263.98382
  End Time      : 1746124279.478841
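
Converting those epoch timestamps into wall-clock durations makes the comparison concrete:

claude_duration = 1746124292.999416 - 1746124263.978323  # ~29.0 seconds
nova_duration = 1746124279.478841 - 1746124263.98382     # ~15.5 seconds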

We can observe some interesting differences:

Performance Metrics

  • Processing Speed: Nova Pro completed the task about 13.5 seconds faster (roughly 15.5 seconds vs. 29 seconds for Claude)
  • Token Efficiency: Claude used fewer input tokens (1666 vs. 2223) but produced slightly more output tokens (1036 vs. 971)

While the README doesn't include qualitative comparisons of the actual text extraction results, these metrics alone highlight an interesting tradeoff: Nova Pro offers faster processing, while Claude appears to be more token-efficient on the input side.

The Speed vs. Accuracy Tradeoff

Based on the implementation and metrics, we can infer that there's likely a speed vs. accuracy tradeoff between these models:

  • Nova Pro appears optimized for speed, processing the same image in roughly half the time
  • Claude 3.7 Sonnet takes longer but might be doing more thorough analysis of the content

This type of comparison is exactly what makes this project valuable—understanding these tradeoffs is crucial for developers choosing the right model for their specific use case.

Practical Applications and Use Cases

The ability to accurately extract and interpret text from PDFs has numerous applications across industries:

Document Processing Automation

  • Legal Document Analysis: Extract clauses, terms, and key information from contracts
  • Financial Document Processing: Parse statements, invoices, and reports
  • Healthcare Records Management: Extract patient information and medical data from forms

Knowledge Management

  • Research Paper Analysis: Extract text and data from academic papers
  • Technical Documentation: Convert PDF manuals into searchable knowledge bases
  • Archival Digitization: Make historical documents accessible and searchable

Data Entry and Form Processing

  • Automated Form Data Extraction: Pull information from filled forms into databases
  • Receipt and Invoice Processing: Extract line items, totals, and vendor information
  • Business Card Information Extraction: Populate CRM systems from scanned business cards

Beyond Basic OCR: The Future is Interpretative

What makes this approach revolutionary is the shift from character recognition to document understanding. These LLMs aren't just identifying letters and words—they're interpreting documents holistically.

Intelligent Document Processing (IDP)

As mentioned in the README's contributing section, this project could be extended to include IDP comparison. This is where the real power of LLM-based OCR shines—not just extracting text but understanding document types, identifying key fields, and extracting structured information without predefined templates.

For example, given an invoice, these models could (see the prompt sketch after this list):

  1. Recognize it as an invoice (document classification)
  2. Extract structured data (invoice number, date, line items)
  3. Identify relationships between data elements
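
None of this is implemented in the notebook yet, but extending the existing prompt in that direction could start with simply asking for structured output. A hypothetical prompt variant (the field names are illustrative, not from the project):

idp_instructions = """Classify the document in the image. If it is an invoice,
return only a JSON object with these fields, with no extra commentary:
{
  "document_type": "invoice",
  "invoice_number": "...",
  "date": "...",
  "line_items": [{"description": "...", "quantity": 0, "amount": 0.0}],
  "total": 0.0
}"""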

Handling Edge Cases

Traditional OCR systems often fail with:

  • Handwritten notes
  • Low-quality scans
  • Unusual layouts
  • Mixed languages

LLMs excel in these scenarios because they bring human-like interpretative capabilities to the task. They can fill in gaps using context and make educated guesses about unclear content.

Implementation Best Practices from the Project

Looking at the code implementation, there are several best practices worth highlighting:

1. Well-Crafted Prompts

The detailed instructions given to both models demonstrate the importance of clear, specific prompting. By explicitly asking for table formatting in Markdown, the prompt guides the models toward a specific output format.

2. High-Resolution Image Processing

Using a 300 DPI setting for PDF page conversion ensures that the models have high-quality images to work with, improving extraction accuracy:

pix = page.get_pixmap(alpha=False, dpi=300)

3. Parallel Processing for Efficiency

The asyncio implementation allows both models to run concurrently, making the comparison more efficient:

async def parallel_process():
    return await asyncio.gather(
        loop.run_in_executor(executor, extract_text_with_claude_3_7_sonnet, base64_image),
        loop.run_in_executor(executor, extract_text_with_nova_pro, base64_image)
    )

4. Comprehensive Metrics Tracking

The project tracks not just the extracted text but also performance metrics like processing time and token usage, enabling quantitative comparison.
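
Assuming the return shape from the earlier function sketch, producing the metrics block shown above takes only a few lines of formatting:

for name, result in [("Claude 3.7 Sonnet", claude_3_7_sonnet_result),
                     ("Amazon Nova Pro", nova_pro_result)]:
    print(f"{name}:")
    print(f"  Input Tokens  : {result['input_tokens']}")
    print(f"  Output Tokens : {result['output_tokens']}")
    print(f"  Duration      : {result['end_time'] - result['start_time']:.2f}s")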

Future Extensions and Improvements

As mentioned in the README's contributing section, this project lays the groundwork for several potential improvements:

1. Evaluation Metrics

Adding formal evaluation metrics like BLEU or ROUGE would provide quantitative measures of extraction quality, especially when ground truth is available.
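
With a hand-transcribed reference page, the rouge-score package offers a quick starting point (ground_truth and extracted_text here are placeholders):

from rouge_score import rouge_scorer

# ground_truth: manual transcription; extracted_text: a model's output
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(ground_truth, extracted_text)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")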

2. Post-Processing Optimization

The extracted text could be further processed to improve formatting, correct common OCR errors, or extract structured data into specific formats like JSON.

3. Expanded Model Comparison

Testing against other models like GPT-4 Vision or Gemini would provide a more comprehensive comparison across the LLM landscape.

4. Specialized Fine-Tuning

For specific document types or domains, fine-tuning these models could yield even better results.

Conclusion: The New Frontier of Document Understanding

This project demonstrates that we're entering a new era of document processing—one where AI doesn't just recognize text but truly understands documents. The comparison between Claude 3.7 Sonnet and Amazon Nova Pro highlights the impressive capabilities of modern LLMs in this space, while also revealing the tradeoffs developers need to consider.

For those working with document processing pipelines, the message is clear: traditional OCR is being rapidly surpassed by these more interpretative, context-aware approaches. By leveraging the document understanding capabilities of LLMs, we can create more accurate, more resilient text extraction systems.

Whether you're working with legal contracts, financial statements, or research papers, this LLM-powered approach to document processing offers significant advantages over traditional OCR. And as these models continue to improve, so too will their ability to understand and extract information from the documents that power our businesses and institutions.

Want to try it yourself? Check out the full project on GitHub and see how these powerful Amazon Bedrock models compare on your own PDFs!