Hands-on experience building a RAG with Docling and LangChain.

Introduction

This article and its sample notebook demonstrate how the LangChain Docling integration, combined with a Milvus vector store and sentence-transformers embeddings, can be used to build sophisticated large language model (LLM) applications. At the core of the example is the DoclingLoader component, a versatile tool that streamlines the incorporation of diverse document types into LLM workflows while harnessing Docling's rich document format for advanced, document-native grounding. DoclingLoader offers two export modes: ExportType.MARKDOWN for processing entire documents and the default ExportType.DOC_CHUNKS for handling individual document segments. Switching between the two is controlled by the EXPORT_TYPE parameter, which tailors the downstream pipeline accordingly.
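
Before diving into the implementation, here is a minimal sketch of how the EXPORT_TYPE switch drives the loader. The file path below is a placeholder rather than the notebook's; the full code follows in the implementation section.

from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

EXPORT_TYPE = ExportType.DOC_CHUNKS  # or ExportType.MARKDOWN

loader = DoclingLoader(
    file_path="my_doc.pdf",  # placeholder document path
    export_type=EXPORT_TYPE,
)
docs = loader.load()  # one Document per chunk (DOC_CHUNKS) or per file (MARKDOWN)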

In this article I give a practical account of my hands-on experience implementing the LangChain Docling example in its original form. My development environment, with its particular software versions and system configuration, presented its own set of challenges. I detail the steps I took to adapt the example, highlighting the modifications needed to reach a working implementation on my local setup, and offer some insights into the practicalities of deploying such integrations.

Implementation

I had to make slight changes to the code so that the notebook would run on my laptop (an Intel CPU machine, using VS Code).

Outside the notebook

  • Preparing a virtual environment.
python3 -m venv venv
source venv/bin/activate

pip install ipykernel jupyter
python -m ipykernel install --user --name=app --display-name="Python (app)"
  • Milvus setup if you don’t have it!
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh

mkdir -p volumes/milvus

bash ./standalone_embed.sh start
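
Once the script reports that Milvus is running, a quick check from Python confirms the server is reachable (a sketch, assuming the default port 19530 and that pymilvus is installed):

from pymilvus import connections, utility

connections.connect(alias="default", uri="http://localhost:19530")
print(utility.get_server_version())  # prints the running Milvus version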

The notebook 📓!

  • Installing the required packages.
%pip install langchain-docling langchain-core langchain-huggingface langchain_milvus langchain python-dotenv
%pip install sentence-transformers
%pip install langchain_community
%pip install huggingface_hub
#### output
Requirement already satisfied: langchain-docling in /Users/alainairom/miniforge3/lib/python3.12/site-packages (0.2.0)
Requirement already satisfied: langchain-core in /Users/alainairom/miniforge3/lib/python3.12/site-packages (0.3.51)
Requirement already satisfied: langchain-huggingface in /Users/alainairom/miniforge3/lib/python3.12/site-packages (0.1.2)
Requirement already satisfied: langchain_milvus in /Users/alainairom/miniforge3/lib/python3.12/site-packages (0.1.9)
Requirement already satisfied: langchain in /Users/alainairom/miniforge3/lib/python3.12/site-packages (0.3.23)
Requirement already satisfied: python-dotenv in /Users/alainairom/miniforge3/lib/python3.12/site-packages (1.1.0)
Requirement already satisfied: docling<3.0.0,>=2.14.0 in /Users/alainairom/miniforge3/lib/python3.12/site-packages (from langchain-docling) (2.30.0)
...
...
  • Preparation and setup.
import os
from pathlib import Path
from tempfile import mkdtemp

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_docling.loader import ExportType
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline  # Import pipeline
from langchain_community.llms import HuggingFacePipeline



def _get_env_from_colab_or_os(key):
    try:
        from google.colab import userdata

        try:
            return userdata.get(key)
        except userdata.SecretNotFoundError:
            pass
    except ImportError:
        pass
    return os.getenv(key)


load_dotenv()

# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

HF_TOKEN = _get_env_from_colab_or_os("HF_TOKEN")
print(f"The value of HF_TOKEN is: '{HF_TOKEN}'")


FILE_PATH = ["https://arxiv.org/pdf/2408.09869"]  # Docling Technical Report
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
GEN_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"


# Added modifications: load the generation model locally via transformers
tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL_ID, token=HF_TOKEN)
model = AutoModelForCausalLM.from_pretrained(GEN_MODEL_ID, token=HF_TOKEN)

# Create the text-generation pipeline (named so it does not shadow transformers.pipeline)
text_gen_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Initialize the LangChain LLM using the pipeline
llm = HuggingFacePipeline(pipeline=text_gen_pipeline)
### END of added modifications

EXPORT_TYPE = ExportType.DOC_CHUNKS
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
TOP_K = 3
MILVUS_URI = str(Path(mkdtemp()) / "docling.db")
  • And the output… (the model downloads require a lot of disk space).
The value of HF_TOKEN is: xxxxx
tokenizer_config.json: 100%
 2.10k/2.10k [00:00<00:00, 100kB/s]
tokenizer.model: 100%
 493k/493k [00:00<00:00, 8.37MB/s]
tokenizer.json: 100%
 1.80M/1.80M [00:00<00:00, 4.73MB/s]
special_tokens_map.json: 100%
 414/414 [00:00<00:00, 59.4kB/s]
config.json: 100%
 720/720 [00:00<00:00, 92.1kB/s]
....

When you are done, you can remove the downloaded model files from /Users/yourprofile/.cache/huggingface/hub
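
To see what the cache holds before deleting anything, huggingface_hub (installed above) ships a cache scanner; a small sketch:

from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()  # scans ~/.cache/huggingface/hub
print(f"Total cache size: {cache_info.size_on_disk / 1e9:.1f} GB")
for repo in cache_info.repos:
    print(repo.repo_id, f"{repo.size_on_disk / 1e9:.2f} GB")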

  • Document Loading step (and the output).
from langchain_docling import DoclingLoader

from docling.chunking import HybridChunker

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=EXPORT_TYPE,
    chunker=HybridChunker(tokenizer=EMBED_MODEL_ID),
)

docs = loader.load()
Note: a message saying "Token indices sequence length is longer than the specified maximum sequence length…" can be ignored in this case.
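
At this point a quick look at what the loader returned is helpful (a sketch using the variables defined above):

print(f"{len(docs)} documents loaded")
print(docs[0].metadata)  # Docling attaches document-native metadata (headings, provenance) to each chunk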
  • Determining and inspecting the splits.
if EXPORT_TYPE == ExportType.DOC_CHUNKS:
    splits = docs
elif EXPORT_TYPE == ExportType.MARKDOWN:
    from langchain_text_splitters import MarkdownHeaderTextSplitter

    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "Header_1"),
            ("##", "Header_2"),
            ("###", "Header_3"),
        ],
    )
    splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
else:
    raise ValueError(f"Unexpected export type: {EXPORT_TYPE}")
for d in splits[:3]:
    print(f"- {d.page_content=}")
print("...")
- d.page_content='arXiv:2408.09869v5  [cs.CL]  9 Dec 2024'
- d.page_content='Docling Technical Report\nVersion 1.0\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Fabian Lindlbauer Kasper Dinkla Lokesh Mishra Yusik Kim Shubham Gupta Rafael Teixeira de Lima Valery Weber Lucas Morin Ingmar Meijer Viktor Kuropiatnyk Peter W. J. Staar\nAI4K Group, IBM Research R¨uschlikon, Switzerland'
- d.page_content='Abstract\nThis technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.'
...
  • Information ingestion step.
import json
from pathlib import Path
from tempfile import mkdtemp

from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embedding = HuggingFaceEmbeddings(model_name=EMBED_MODEL_ID)


milvus_uri = str(Path(mkdtemp()) / "docling.db")  # or set as needed
vectorstore = Milvus.from_documents(
    documents=splits,
    embedding=embedding,
    collection_name="docling_demo",
    connection_args={"uri": milvus_uri},
    index_params={"index_type": "FLAT"},
    drop_old=True,
)
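
Before wiring up the full chain, a retrieval sanity check confirms the ingestion worked (a sketch; similarity_search is the standard LangChain vector-store method):

hits = vectorstore.similarity_search(QUESTION, k=TOP_K)
for hit in hits:
    print(f"- {hit.page_content[:120]}...")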

  • RAG implementation.

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_huggingface import HuggingFaceEndpoint

retriever = vectorstore.as_retriever(search_kwargs={"k": TOP_K})
# Note: this reassigns llm, replacing the local HuggingFacePipeline built earlier
# with a remote Hugging Face Inference endpoint.
llm = HuggingFaceEndpoint(
    repo_id=GEN_MODEL_ID,
    huggingfacehub_api_token=HF_TOKEN,
)


def clip_text(text, threshold=100):
    return f"{text[:threshold]}..." if len(text) > threshold else text
Note: if the `HF_TOKEN` environment variable is set, it is used as the active token, independently of the token you have just configured.
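
To verify which token huggingface_hub will actually use, you can query it directly (a sketch; get_token is available in recent huggingface_hub versions):

from huggingface_hub import get_token

# Resolves the active token (HF_TOKEN env var, stored login, ...) or None
print("Token configured:", get_token() is not None)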
  • And the final part with the expected output!
question_answer_chain = create_stuff_documents_chain(llm, PROMPT)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
resp_dict = rag_chain.invoke({"input": QUESTION})

clipped_answer = clip_text(resp_dict["answer"], threshold=200)
print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{clipped_answer}")
for i, doc in enumerate(resp_dict["context"]):
    print()
    print(f"Source {i + 1}:")
    print(f"  text: {json.dumps(clip_text(doc.page_content, threshold=350))}")
    for key in doc.metadata:
        if key != "pk":
            val = doc.metadata.get(key)
            clipped_val = clip_text(val) if isinstance(val, str) else val
            print(f"  {key}: {clipped_val}")
Question:
Which are the main AI models in Docling?

Answer:
Docling initially releases two AI models, a layout analysis model and TableFormer. The layout analysis model is an accurate object-detector for page elements, and TableFormer is a state-of-the-art tab...

Source 1:
  text: "3.2 AI models\nAs part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis model, an accurate object-detector for page elements [13]. The second model is TableFormer [12, 9], a state-of-the-art table structure re..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/50', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 108.0, 't': 405.1419982910156, 'r': 504.00299072265625, 'b': 330.7799987792969, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

Source 2:
  text: "3 Processing pipeline\nDocling implements a linear pipeline of operations, which execute sequentially on each given document (see Fig. 1). Each document is first parsed by a PDF backend, which retrieves the programmatic text tokens, consisting of string content and its coordinates on the page, and also renders a bitmap image of each page to support ..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/26', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 2, 'bbox': {'l': 108.0, 't': 273.01800537109375, 'r': 504.00299072265625, 'b': 176.83799743652344, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 796]}]}], 'headings': ['3 Processing pipeline'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

Source 3:
  text: "6 Future work and contributions\nDocling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecognition model, a code-recognition model and more. This will help improve the quality of conversion for specific types of ..."
  dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/76', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 322.468994140625, 'r': 504.00299072265625, 'b': 259.0169982910156, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}, {'self_ref': '#/texts/77', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 108.0, 't': 251.6540069580078, 'r': 504.00299072265625, 'b': 198.99200439453125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 402]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 11465328351749295394, 'filename': '2408.09869v5.pdf'}}
  source: https://arxiv.org/pdf/2408.09869

It works 🏆

Conclusion

Implementing the LangChain Docling example underscored both the potential and the practical considerations of advanced LLM integrations. The hands-on experience detailed here highlights the importance of understanding system dependencies, troubleshooting environment-specific issues, and iteratively refining the implementation until it works. Integrating LangChain Docling with tools like Milvus and sentence-transformers offers a powerful path to building sophisticated, document-aware LLM applications, provided the practical realities of implementation are thoughtfully addressed.

Links