Docling “Enrichment Features”

Using Docling enrichment features!

Image description

Introduction

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. In short, Docling prepares any types of documents for AI 🦾!

Among other features, Docling allows to enrich the conversion pipeline with additional steps which process specific document components, e.g. code blocks, pictures, etc. The extra steps usually require extra models executions which may increase the processing time consistently. For this reason most enrichment models are disabled by default.

Image description

The documentation for the features above are available on Hugging Face. for example;

ds4sd/DocumentFigureClassifier: https://huggingface.co/ds4sd/DocumentFigureClassifier
ds4sd/CodeFormula: https://huggingface.co/ds4sd/CodeFormula

Using and implementation

Implementing the enrichment features is quite straightforward and examples are provided.

docling --enrich-code FILE

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()
pipeline_options.do_code_enrichment = True

converter = DocumentConverter(format_options={
    InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
})

result = converter.convert("https://arxiv.org/pdf/2501.17887")
doc = result.document

But… for one of my tasks I wanted to discover the “document’s” properties beforehand, because not all documents have all the sections which are described in the documentation!

So I wrote a simple application which extracts the “doc” object properties which I could use for my usage.

Preparation 👨‍🍳

python3.12 -m venv myenv
source myenv/bin/activate

pip install --upgrade pip
pip install docling

The main code 🪬

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

def main():
    """
    This Python app demonstrates converting a PDF from a URL using the docling library
    with code enrichment enabled and writes the document properties to a Markdown file.
    """
    output_filename = "doc_properties.md"

    try:
        pipeline_options = PdfPipelineOptions()
        pipeline_options.generate_picture_images = True
        pipeline_options.images_scale = 2
        pipeline_options.do_picture_classification = True

        converter = DocumentConverter(format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        })

        try:
            result = converter.convert("https://arxiv.org/pdf/2501.17887")
            doc = result.document

            with open(output_filename, "w", encoding="utf-8") as md_file:
                md_file.write("# Document Properties\n\n")
                md_file.write("This file contains the properties of the converted document.\n\n")

                for attr_name in dir(doc):
                    if not attr_name.startswith("__"):
                        md_file.write(f"## `{attr_name}`\n\n")
                        try:
                            attr_value = getattr(doc, attr_name)
                            md_file.write(f"```
{% endraw %}
\n{attr_value}\n
{% raw %}
```\n\n")
                        except Exception as e:
                            md_file.write(f"*Error accessing property:* `{e}`\n\n")

            print(f"\nDocument properties written to '{output_filename}'")

        except Exception as conversion_error:
            print(f"Error during conversion: {conversion_error}")

    except Exception as e:
        print(f"An error occurred during setup: {e}")


if __name__ == "__main__":
    main()

And the excerpt of the output markdown 📄

Image description

Et voilà 🦆

Conclusion

The Docling enrichment features enhance the document conversion process by adding specific processing steps for different document components like code blocks and formulas, enabling advanced parsing and extraction of information such as code language and LaTeX representations, thereby creating a richer and more structured document model.

Introduction

Using and implementation

Conclusion

Links

Comments (0)

Read More

#reading

#popular

Docling “Enrichment Features”

Introduction

Using and implementation

Conclusion

Links

Comments (0)

Read More

Model routing for function calling with Arcee Conductor

Remote Development with Cursor?

Top 15 Builder.ai Alternatives for 2025: Explore the Best App Development Platforms

What is Deep Learning

#reading

#popular