After getting tired of writing endless boilerplate to extract structured data from documents with LLMs, I built ContextGem - a free, open-source framework that makes this radically easier.

What makes it different?

✅ Automated dynamic prompts and data modeling
✅ Precise reference mapping to source content
✅ Built-in justifications for extractions
✅ Nested context extraction
✅ Works with any LLM provider
and more built-in abstractions that save developer time (reference mapping and justifications are sketched after the basic example below).

Simple LLM extraction in just a few lines:

from contextgem import Aspect, Document, DocumentLLM

# Define what to extract
doc = Document(raw_text="Your document text here...")
doc.aspects = [
    Aspect(
        name="Intellectual property",
        description="Clauses on intellectual property rights",
    )
]

# Extract with any LLM
llm = DocumentLLM(model="<llm_provider>/<llm_model>", api_key="<your_api_key>")  # e.g. "openai/gpt-4o-mini"
doc = llm.extract_all(doc)

# Get results
print(doc.aspects[0].extracted_items)
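
Reference mapping and justifications work the same way, opted in per aspect or concept. Here's a rough sketch building on the example above (illustrative only: StringConcept, the add_justifications / add_references / reference_depth parameters, and the item attributes are approximate names, so check the docs for exact signatures):

from contextgem import Document, DocumentLLM, StringConcept

doc = Document(raw_text="Your document text here...")

# A concept extracts a specific data point; references and justifications are opt-in
doc.concepts = [
    StringConcept(
        name="Anomalies",
        description="Anomalies in the document",
        add_justifications=True,      # LLM explains why each item was extracted
        add_references=True,          # map each item back to the source text
        reference_depth="sentences",  # reference granularity (approximate name)
    )
]

llm = DocumentLLM(model="<llm_provider>/<llm_model>", api_key="<your_api_key>")
doc = llm.extract_all(doc)

for item in doc.concepts[0].extracted_items:
    print(item.value)                                      # extracted value
    print(item.justification)                              # why it was extracted
    print([s.raw_text for s in item.reference_sentences])  # source sentences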

It also ships with a native DOCX converter, support for multiple LLM providers, and full serialization of extraction results - all under the permissive Apache 2.0 license.
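
For DOCX sources, the converter gives you a Document directly, and results can be saved and reloaded. A rough sketch (DocxConverter.convert() and the to_json() / from_json() methods are approximate names - see the docs):

from contextgem import Document, DocxConverter

# Convert a DOCX file into a ContextGem Document (approximate API)
converter = DocxConverter()
doc = converter.convert("path/to/your/document.docx")

# ... run extraction as above, then serialize and restore later
saved = doc.to_json()
restored = Document.from_json(saved)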

View project on GitHub: https://github.com/shcherbak-ai/contextgem

Try it out and let me know your thoughts!