My experience testing a demonstration of Docling's "Visual Grounding" feature on an Intel CPU.

Introduction

In discussions with our business partners, we always try to propose tools and solutions that address specific business requirements. During a recent talk about a specific use case, I discovered "visual grounding" among the other capabilities of the Docling toolset.

"Visual grounding", in general terms, is a task that aims at locating objects in an image based on a natural-language query. Along with image captioning, visual question answering, and content-based image retrieval, it links image data with the text modality. This feature exists in Docling, and a sample notebook is provided as an example.

Implementation

To test the visual grounding use case, I ran the sample notebook, but as a standalone Python application.

Excerpt from the official documentation:

This example showcases Docling’s visual grounding capabilities, which can be combined with any agentic AI / RAG framework.
In this instance, we illustrate these capabilities leveraging the LangChain Docling integration, along with a Milvus vector store, as well as sentence-transformers embeddings.

It is also mentioned that… 👉 For best conversion speed, use GPU acceleration whenever available; e.g. if running on Colab, use a GPU-enabled runtime.

What is needed to run the application locally?

  • First of all, you need a vector database (Milvus here), and to run Milvus locally, I use Podman.
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh

mkdir -p volumes/milvus

bash ./standalone_embed.sh start
  • If you want to test your local Milvus, here is some sample code (a collection-listing check follows right after it)!
from pymilvus import connections

try:
    connections.connect(host="127.0.0.1", port="19530")
    print("Milvus connection successful!")
    connections.disconnect("default")
except Exception as e:
    print(f"Milvus connection failed: {e}")
  • Ideally you’ll need a virtual environment for the Python code.
python3.12 -m venv myenv
source myenv/bin/activate
  • If you want to check your Python version (in code or on the command line) to make sure you are actually using the chosen interpreter:
import sys
print(sys.version)
  • Installing the application's dependencies (a quick import sanity check follows these commands).
pip install --upgrade pip
pip install langchain-docling langchain-core langchain-huggingface langchain_milvus langchain matplotlib python-dotenv
pip install hf_xet
pip install --upgrade langchain-huggingface
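To make sure everything resolved correctly inside the virtual environment, a quick optional sanity check (not part of the original example):

# Verify that the key dependencies are importable from the active venv
import docling
import langchain_docling
import langchain_huggingface
import langchain_milvus
import matplotlib

print("All imports OK")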
  • An environment (.env) file.
HF_TOKEN="YOUR-HUGGINGFACE-TOKEN"
TOKENIZERS_PARALLELISM=true

The TOKENIZERS_PARALLELISM variable is not mentioned in the example, but during execution I got a message about it on the terminal; setting it to false works better on my configuration with my hardware specifications.
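If you prefer to set this from code rather than from the .env file, a minimal sketch (not part of the original example) is to export the variable before any tokenizer-related imports:

import os

# Must be set before transformers/tokenizers are imported; "false" is what
# worked better on my hardware, adjust as needed.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")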

  • And finally the code 🧑‍💻
import os
from pathlib import Path
from tempfile import mkdtemp
from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint
import json
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from langchain_docling import DoclingLoader
from docling.chunking import HybridChunker, DocMeta
from docling.datamodel.document import DoclingDocument

def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata
        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)

load_dotenv()

# --- Configuration ---
HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
#SOURCES = ["https://arxiv.org/pdf/2408.09869"] # from the original code
SOURCES = ["./input/2408.00986v1.pdf"]  # Docling Technical Report downloaded locally
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1" # or whatever model available and of your choice
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
DOC_STORE = {}
DOC_STORE_ROOT = Path(mkdtemp())

def clip_text(text, threshold=100):
    return f"{text[:threshold]}..." if len(text) > threshold else text

def main():
    global DOC_STORE

    # --- Document Converter Setup ---
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=PdfPipelineOptions(
                    generate_page_images=True,
                    images_scale=2.0,
                ),
            )
        }
    )

    # --- Document Loading and Saving to Doc Store ---
    for source in SOURCES:
        dl_doc = converter.convert(source=source).document
        file_path = Path(DOC_STORE_ROOT / f"{dl_doc.origin.binary_hash}.json")
        dl_doc.save_as_json(file_path)
        DOC_STORE[dl_doc.origin.binary_hash] = file_path

    # --- Document Loading and Chunking ---
    loader = DoclingLoader(
        file_path=SOURCES,
        converter=converter,
        export_type=ExportType.DOC_CHUNKS,
        chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
    )
    docs = loader.load()
    print("Documents loaded and chunked.")

    # --- Ingestion ---
    embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)
    vectorstore = Milvus.from_documents(
        documents=docs,
        embedding=embedding,
        collection_name="docling_demo",
        connection_args={"uri": MILVUS_URI},
        index_params={"index_type": "FLAT"},
        drop_old=True,
    )
    print("Documents embedded and stored in Milvus.")

    # --- RAG ---
    retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})
    llm = HuggingFaceEndpoint(
        repo_id=GEN_MODEL_ID,
        huggingfacehub_api_token=HF_TOKEN,
        task="text-generation"  
    )
    question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)
    resp_dict = rag_chain.invoke({"input": QUESTION})
    clipped_answer = clip_text(resp_dict["answer"], threshold=200)
    print(f"\nQuestion:\n{resp_dict['input']}\n\nAnswer:\n{clipped_answer}")

    # --- Visual Grounding ---
    print("\nVisual Grounding:")
    for i, doc in enumerate(resp_dict["context"][:]):
        image_by_page = {}
        print(f"Source {i+1}:")
        print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}")
        meta = DocMeta.model_validate(doc.metadata["dl_meta"])

        # loading the full DoclingDocument from the document store:
        dl_doc = DoclingDocument.load_from_json(DOC_STORE.get(meta.origin.binary_hash))

        for doc_item in meta.doc_items:
            if doc_item.prov:
                prov = doc_item.prov[0]  # here we only consider the first provenance item
                page_no = prov.page_no
                # look up the page of *this* provenance item (not a previously cached one)
                page = dl_doc.pages[page_no]
                img = image_by_page.get(page_no)
                if img is None:
                    print(f"  page: {page_no}")
                    img = page.image.pil_image
                    image_by_page[page_no] = img
                bbox = prov.bbox.to_top_left_origin(page_height=page.size.height)
                bbox = bbox.normalized(page.size)
                thickness = 2
                padding = thickness + 2
                bbox.l = round(bbox.l * img.width - padding)
                bbox.r = round(bbox.r * img.width + padding)
                bbox.t = round(bbox.t * img.height - padding)
                bbox.b = round(bbox.b * img.height + padding)
                draw = ImageDraw.Draw(img)
                draw.rectangle(
                    xy=bbox.as_tuple(),
                    outline="blue",
                    width=thickness,
                )
        for p in image_by_page:
            img = image_by_page[p]
            plt.figure(figsize=[15, 15])
            plt.imshow(img)
            plt.axis("off")
            plt.title(f"Page {p}")  # page_no from Docling is already 1-based
            plt.show() # Consider plt.savefig(f"page_{p+1}.png") to save images instead of showing

if __name__ == "__main__":
    main()
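One detail worth noting: MILVUS_URI above points to a local docling.db file in a temporary directory, so langchain_milvus uses the embedded Milvus Lite engine rather than the standalone instance started with Podman. If you want the vectors to land in that standalone Milvus instead, a minimal variation (not in the original example) is to point the URI at the server and keep the rest of the ingestion code unchanged:

# Use the standalone Milvus started with standalone_embed.sh instead of the
# embedded Milvus Lite file created in a temporary directory:
MILVUS_URI = "http://localhost:19530"

vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embedding,
    collection_name="docling_demo",
    connection_args={"uri": MILVUS_URI},
    index_params={"index_type": "FLAT"},
    drop_old=True,
)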

The application takes an extremely long time to run on CPU (even with my configuration, which is decent). As mentioned in the documentation, GPU acceleration should be preferred.
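On a CPU-only machine, you can at least make sure Docling uses all the available threads. Here is a minimal sketch (not part of the original example, and the import path of the accelerator classes may differ slightly depending on your docling version) that extends the converter setup with explicit accelerator options:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    generate_page_images=True,
    images_scale=2.0,
)
# AUTO picks CUDA/MPS when available and falls back to CPU otherwise;
# num_threads controls CPU parallelism for the conversion pipeline.
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8,
    device=AcceleratorDevice.AUTO,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    }
)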

After a while you'll get output on the terminal as well as the generated graphical results 😀.

Token indices sequence length is longer than the specified maximum sequence length for this model (528 > 512). Running this sequence through the model will result in indexing errors
Documents loaded and chunked.
Documents embedded and stored in Milvus.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.
/Users/alainairom/Devs/dataprepkit-work/docling-visualgrounding/myenv/lib/python3.12/site-packages/huggingface_hub/utils/_deprecation.py:131: FutureWarning: 'post' (from 'huggingface_hub.inference._client') is deprecated and will be removed from version '0.31.0'. Making direct POST requests to the inference server is not supported anymore. Please use task methods instead (e.g. `InferenceClient.chat_completion`). If your use case is not supported, please open an issue in https://github.com/huggingface/huggingface_hub.
  warnings.warn(warning_message, FutureWarning)

Question:
Which are the main AI models in Docling?

Answer:
The main AI models mentioned in the document are Bayesian networks (BNs) and neural networks.

Visual Grounding:
Source 1:
  text: "1 Introduction\nIn recent years, artificial intelligence (AI) has attracted significant research interest, fueled by its potential to revolutionize various practical applications. Among the many AI models, Bayesian networks (BNs) [6] stand out in fields that demand extensive expert knowledge. One of the most impactful areas for BNs research is healt..."
  page: 1
  page: 2
...
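The "Token indices sequence length is longer than the specified maximum sequence length" line seems to be harmless in this context: it comes from the embedding model's tokenizer being run on intermediate text before the final split, while the produced chunks respect the model limit. If you want to constrain chunk sizes explicitly, HybridChunker accepts a maximum token count; a small sketch, assuming the docling version in use still takes max_tokens directly (newer releases may route this through a tokenizer wrapper):

from docling.chunking import HybridChunker

# Cap chunk size to the embedding model's context window
# (512 tokens for all-MiniLM-L6-v2)
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",
    max_tokens=512,
)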

The generated figures show the retrieved source pages with the matching passages highlighted by blue bounding boxes.

Et voilà ✌️

Conclusion

Visual grounding is a pivotal concept in document analysis, serving as a bridge between textual content and its corresponding visual elements. Its utility lies in enabling machines to understand and reason about documents in a more human-like manner by explicitly linking words and phrases to specific regions within the document image. This capability is crucial for tasks such as information extraction from visually rich documents like invoices or scientific papers with figures, question answering that requires identifying and reasoning about visual components, and enhancing document accessibility by providing textual descriptions of visual elements. By grounding text in the visual layout, AI systems can achieve a deeper, multimodal understanding of documents, leading to more accurate and context-aware processing and interpretation.

Docling offers this capability, among others 🏅.

Links