The Problem: Unsearchable PDFs

We've all been there. You receive an important document as a PDF, but when you try to search for specific text, nothing happens. That's because many PDFs, especially scanned documents, are essentially just images of text rather than actual text content.

This creates several problems:

  • You can't search for specific information
  • You can't copy and paste text
  • You can't use screen readers or other accessibility tools
  • You can't easily extract or analyze the content

Introducing PDF-OCR CLI

To solve this problem, I created PDF-OCR CLI, an open-source tool that turns scanned PDFs into fully searchable documents. It's built with TypeScript and uses Mistral AI's OCR API to extract text, with optional text verification by an LLM running on Together.ai.

How It Works

The tool follows a four-step pipeline:

  1. Takes your PDF as input
  2. Processes each page with Mistral API's OCR
  3. Optionally verifies and improves text quality with an LLM
  4. Reassembles everything into a searchable PDF
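To make the flow concrete, here is a minimal TypeScript sketch of those four steps. It is an illustration, not the tool's actual source: the `Stages` interface, `makeSearchable`, and the stub stages are hypothetical names, and the stubs stand in for the real PDF handling and API calls so the example runs offline.

```typescript
// Hypothetical stage signatures; the real tool's internals may differ.
type Page = { index: number; image: string };

interface Stages {
  split(pdf: string): Page[];              // PDF -> individual pages
  ocr(page: Page): Promise<string>;        // page -> extracted text
  verify?(text: string): Promise<string>;  // optional LLM clean-up pass
  merge(texts: string[]): string;          // page texts -> one document
}

async function makeSearchable(pdf: string, stages: Stages): Promise<string> {
  const pages = stages.split(pdf);         // 1. take the PDF as input
  const texts: string[] = [];
  for (const page of pages) {
    let text = await stages.ocr(page);     // 2. OCR each page
    if (stages.verify) {
      text = await stages.verify(text);    // 3. optional verification
    }
    texts.push(text);
  }
  return stages.merge(texts);              // 4. reassemble the result
}

// Stub stages so the sketch runs without any API calls.
const stubStages: Stages = {
  split: (pdf) => pdf.split("|").map((image, index) => ({ index, image })),
  ocr: async (page) => `text(${page.image})`,
  verify: async (text) => text.toUpperCase(),
  merge: (texts) => texts.join("\n"),
};
```

Leaving `verify` out of the stages object skips step 3, which mirrors how verification is opt-in in the CLI.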

Getting Started in 2 Minutes

Installation

# Install globally
npm install -g pdf-ocr-cli

# Or use without installing
npx pdf-ocr-cli --input input.pdf --output output.pdf

Set Up API Keys

Create a .env file in your working directory:

echo "MISTRAL_API_KEY=your_mistral_api_key_here" > .env
echo "TOGETHER_API_KEY=your_together_api_key_here" >> .env
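If you wrap the tool in your own scripts, it can help to check for the keys before starting a long run. The sketch below uses the two variable names from the `.env` example; everything else is an assumption, including the `missingKeys` helper and the idea that the Together.ai key only matters when verification is on.

```typescript
// Env var names come from the .env example; the validation logic is an assumption.
type Env = Record<string, string | undefined>;

// Returns the names of required keys that are absent or blank.
function missingKeys(env: Env, verify: boolean): string[] {
  const required = ["MISTRAL_API_KEY"];
  // Assumption: the Together.ai key is only needed for the optional verify step.
  if (verify) {
    required.push("TOGETHER_API_KEY");
  }
  return required.filter((name) => !env[name]?.trim());
}

// Example with a plain object standing in for the real environment.
const missing = missingKeys({ MISTRAL_API_KEY: "your_mistral_api_key_here" }, true);
if (missing.length > 0) {
  console.log(`Missing API keys: ${missing.join(", ")}`);
}
```

In a Node script you would pass `process.env` instead of the literal object.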

Basic Usage

# Process a PDF file
pdf-ocr --input input.pdf --output output.pdf

# With verification to improve OCR quality
pdf-ocr --input input.pdf --output output.pdf --verify

Real-World Use Cases

1. Digitizing Research Papers

As a developer who reads a lot of research papers, I often encounter PDFs that are scanned copies. With PDF-OCR CLI, I can quickly make these papers searchable, allowing me to find specific sections or references without scrolling through the entire document.

2. Processing Legal Documents

Legal documents often come as scanned PDFs. By making them searchable, lawyers and paralegals can quickly find relevant clauses or terms, saving hours of manual reading.

3. Archiving Historical Documents

Libraries and archives can use this tool to make historical documents more accessible and searchable, preserving knowledge while making it more usable.

Advanced Features

Handling Large Documents

For large documents, you can control the processing with options like:

# Process 3 pages at a time
pdf-ocr --input input.pdf --output output.pdf --concurrency 3

# Process only the first 10 pages
pdf-ocr --input input.pdf --output output.pdf --max-pages 10
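For a feel of what these two options mean, here is a small TypeScript sketch of bounded concurrency plus a page cap. It is a generic worker-pool pattern under my own naming (`mapWithConcurrency` is hypothetical), not necessarily how the tool implements `--concurrency` and `--max-pages`.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight,
// returning results in the original order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim an item before awaiting
      results[i] = await worker(items[i] as T, i);
    }
  }

  const poolSize = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: poolSize }, () => run()));
  return results;
}

// Example: cap at the first 10 pages, then process 3 at a time.
const allPages = Array.from({ length: 25 }, (_, i) => i + 1);
const firstTen = allPages.slice(0, 10); // analogous to --max-pages 10
mapWithConcurrency(firstTen, 3, async (page) => `page ${page} done`).then(
  (done) => console.log(done.length),
);
```

A higher limit finishes sooner but sends more simultaneous requests to the OCR API, so rate limits are the practical ceiling.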

Improving OCR Quality

The --verify flag uses an LLM to check and improve the OCR results:

pdf-ocr --input input.pdf --output output.pdf --verify

This is particularly useful for documents with complex layouts, poor scan quality, or unusual fonts.

The Technical Details

PDF-OCR CLI is built with TypeScript and follows a modular architecture:

  1. PDF Splitter: Divides PDFs into individual pages
  2. OCR Module: Extracts text using Mistral API
  3. Content Verification: Improves text with LLM (optional)
  4. Text-to-PDF Converter: Converts text back to PDF
  5. PDF Merger: Combines processed pages

The tool is designed to be robust, with features like:

  • Configurable retry mechanisms for API calls
  • Adjustable concurrency for processing multiple pages
  • Detailed logging options for troubleshooting
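As an illustration of the retry idea, here is a minimal TypeScript sketch of retrying a failing async call with exponential backoff. The `withRetry` helper and its option names are hypothetical, and exponential backoff is my assumption about a reasonable strategy rather than a description of the tool's code.

```typescript
interface RetryOptions {
  retries: number;     // extra attempts allowed after the first failure
  baseDelayMs: number; // wait before the first retry; doubles each time
}

async function withRetry<T>(
  call: () => Promise<T>,
  { retries, baseDelayMs }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Back off: baseDelayMs, then 2x, 4x, ...
        const delay = baseDelayMs * Math.pow(2, attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed, so surface the last error to the caller.
  throw lastError;
}

// Example: a fake API call that fails twice before succeeding.
let attempts = 0;
withRetry(
  async () => {
    attempts++;
    if (attempts < 3) throw new Error("temporary failure");
    return "ocr text";
  },
  { retries: 3, baseDelayMs: 10 },
).then((text) => console.log(text, "after", attempts, "attempts"));
```

Wrapping each page's API call this way means one flaky request doesn't sink a whole multi-page run.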

Why I Built This

As a developer who works with a lot of documentation, I was frustrated by the limitations of scanned PDFs. The existing OCR solutions I found were expensive, closed-source, or difficult to integrate into my workflow.

I wanted a simple CLI tool that I could use in scripts or automation pipelines, and that leveraged the latest AI capabilities for high-quality text extraction. PDF-OCR CLI is the result of that need.

Open Source and Contributions

PDF-OCR CLI is open source under the ISC license. Contributions are welcome! Whether it's adding new features, improving documentation, or reporting bugs, every contribution helps make the tool better.

Check out the GitHub repository to get started.

Conclusion

PDF-OCR CLI makes scanned documents as searchable and accessible as natively digital PDFs. Give it a try!


Have you encountered problems with unsearchable PDFs? What solutions have you tried? Let me know in the comments!