Hello Dev community! 👋

Today I want to share the capstone project I built for the Google Gen AI course on Kaggle. It was a great learning journey! My mission? To solve a problem many of us face with complex machinery or software: those giant, intimidating PDF manuals.

The Problem: Drowning in Documentation 😩

Imagine this: you are a technician in a factory. A machine, let's call it the "Boccaspruzzo 2024" (my fictional machine!), suddenly shows error code E-07. Production stops. Maybe it's overheating. Safety is on the line! You need to know fast: What does E-07 mean? What are the first steps? Is it safe to approach?

Where is the answer? Buried somewhere in a 300-page PDF manual. You start scrolling... searching... maybe the PDF search function doesn't work well. Maybe the manual is an old version. Maybe it's not even in your first language! Minutes feel like hours. This is stressful, inefficient, and potentially dangerous.

This isn't just about factories. Think about complex software documentation, API guides... information overload is real! We need a better way to access specific information quickly from large documents.

My Solution: The "Smart Manual" - Giving Docs a Voice! 🤖💬

I thought: what if we could just ask the manual a question and get a clear, direct answer? This is where Generative AI comes in! The core idea is RAG - Retrieval Augmented Generation.

It sounds fancy, but the concept is logical:

  1. Teach the AI: Feed the entire content of the manual to an AI system, but in a smart way.
  2. Understand the Question: When a user asks something ("What is error E-07?"), the AI understands the meaning of the question.
  3. Retrieve Relevant Info: The AI system searches through its knowledge of the manual and finds the exact small pieces of text that are most relevant to answering that specific question. This is the "Retrieval" part. It's like finding the perfect paragraph without reading the whole book.
  4. Generate a Grounded Answer: The AI then uses only those retrieved text pieces, along with the original question, to generate a helpful, human-readable answer. This is the "Generation" part. We explicitly tell the AI: do not use any other knowledge, only the text I gave you! This is called grounding, and it's super important to prevent the AI from making things up (hallucinating).

So, the AI doesn't just know the manual, it uses the manual dynamically to answer questions based on the actual, official content.

How I Built It: The Tech Stack and Code Insights

I used Python on Kaggle, leveraging Google's Gemini models and some great open-source libraries. Here's a peek under the hood:

1. Reading the PDF (PyMuPDF)

First, we need the text. PyMuPDF (imported as fitz) is great for this. It lets you open a PDF, loop through pages, and extract raw text accurately.

import fitz # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extracts text page by page from PDF."""
    # ... (open PDF, check if exists) ...
    doc = fitz.open(pdf_path)
    extracted_pages = []
    print(f"Extracting text from {doc.page_count} pages...")
    for page_index in range(doc.page_count):
        page = doc.load_page(page_index)
        page_text = page.get_text("text", sort=True) # Get text, sorted
        cleaned_text = "\n".join(line.strip() for line in page_text.splitlines() if line.strip()) # Basic cleaning
        if cleaned_text:
            extracted_pages.append((page_index + 1, cleaned_text)) # Store page number + text
    doc.close()
    return extracted_pages

2. Chunking: Making Text Digestible

LLMs (Large Language Models) have limits on how much text they can process at once (context window). Feeding a whole 300-page manual is impossible. Solution? Chunking! Break the text into smaller, overlapping pieces.

Why overlap? Imagine a critical instruction starts at the very end of chunk 1 and finishes at the beginning of chunk 2. Without overlap, the AI might miss the full context when retrieving either chunk. Overlap ensures complete thoughts are more likely captured within a single chunk.

def split_into_chunks(text, chunk_size, overlap):
    """Splits text into chunks with overlap."""
    if overlap >= chunk_size: raise ValueError("Overlap must be less than chunk size")
    chunks = []
    start_index = 0
    text_length = len(text)
    while start_index < text_length:
        chunks.append(text[start_index : start_index + chunk_size])
        # Move forward, ensuring overlap
        start_index += max(1, chunk_size - overlap)
    # Remove empty chunks if any
    return [chunk for chunk in chunks if chunk.strip()]

I chose a CHUNK_SIZE of 800 characters and a CHUNK_OVERLAP of 80. These values probably need tuning, depending on the document.
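
Tying those two steps together, the indexing prep can look something like this (a small sketch; the file name and the all_chunks structure are just for illustration):

# Hypothetical file name, just for illustration
pages = extract_text_from_pdf("boccaspruzzo_manual.pdf")

all_chunks = []
for page_number, page_text in pages:
    for chunk in split_into_chunks(page_text, chunk_size=800, overlap=80):
        all_chunks.append({"text": chunk, "page": page_number})  # keep the page for citations
print(f"Created {len(all_chunks)} chunks")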

3. Embeddings: Capturing Meaning (google-genai)

This is where the AI magic starts. How does the system find relevant text if it doesn't understand meaning? Embeddings! An embedding model (I used Google's text-embedding-004) reads a text chunk and converts it into a list of numbers (a vector). This vector represents the semantic meaning of the text. Chunks with similar meanings will have mathematically similar vectors.

# The custom class I built to integrate Gemini embeddings with ChromaDB
# (needs a configured genai_client from the google-genai SDK)
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.genai import types

class GeminiEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # Important: use task_type="retrieval_document" for indexing
        response = genai_client.models.embed_content(
            model="text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(task_type="retrieval_document"))
        return [e.values for e in response.embeddings]

4. Vector Database: Storing & Searching Meanings (ChromaDB)

Okay, we have chunks and their meaning vectors (embeddings). Where to store them efficiently? A Vector Database! I used ChromaDB. It's designed specifically to store these vectors and perform lightning-fast similarity searches.
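
Setting up the collection and indexing the chunks can look roughly like this (a minimal sketch: the collection name is made up, and it reuses the all_chunks list and the GeminiEmbeddingFunction from the steps above):

import chromadb

# The collection uses the Gemini embedding function (task_type="retrieval_document")
chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(
    name="boccaspruzzo_manual",  # hypothetical name
    embedding_function=GeminiEmbeddingFunction(),
)

if db.count() == 0:  # index only once, as in the notebook
    db.add(
        documents=[c["text"] for c in all_chunks],
        metadatas=[{"page": c["page"]} for c in all_chunks],
        ids=[f"chunk_{i}" for i in range(len(all_chunks))],
    )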

When a user asks a question, we create an embedding for the question too (using task_type="retrieval_query"). Then we ask ChromaDB: "Find the top 5 text chunks whose vectors are closest (most similar) to this question vector."

# After setting up ChromaDB (db variable) and embedding the question...
question_embedding = query_embedding_function([user_question]) # task_type="retrieval_query"

if question_embedding:
    query_results = db.query(
        query_embeddings=question_embedding,
        n_results=5, # Find top 5 most similar chunks
        include=['documents', 'metadatas'] # Get the text and page number back
    )
    retrieved_documents = query_results.get('documents', [[]])[0]
    # Now retrieved_documents contains the most relevant text snippets!
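
Since we also asked for the metadatas, we can pull the page numbers stored at indexing time back out too (a small sketch, continuing from the query above and assuming the {"page": ...} metadata layout from my indexing sketch):

# Continues from query_results above
retrieved_metadatas = query_results.get('metadatas', [[]])[0]
source_pages = sorted({meta["page"] for meta in retrieved_metadatas if meta})
print(f"Most relevant manual pages: {source_pages}")  # handy for citing sources in the answer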

5. Generation: Crafting the Answer (Gemini Flash & Prompting)

This is the RAG core. We take the retrieved_documents (the relevant context) and the user_question and feed them to a powerful generative model (like gemini-1.5-flash-latest). The prompt is EVERYTHING here.

My prompt tells Gemini:

  • You are an assistant for the "Boccaspruzzo" machine.
  • Answer ONLY using the "Manual Excerpts" provided below.
  • If the answer isn't in the excerpts, say so clearly. DO NOT MAKE STUFF UP. (This is grounding).
  • Answer in the user's language (and also in Italian). Explain simply.
# The prompt template is crucial for grounding!
final_prompt = f"""You are a technical assistant ...
Answer the user's question based EXCLUSIVELY on the provided excerpts...
...

Manual Excerpts:
{retrieved_context} <--- The relevant text found by ChromaDB goes here

User Question: {user_question} <--- The original question

Detailed Answer (based ONLY on the excerpts above):"""

# Then generate content with the model
# llm_response = generation_model.generate_content(contents=[final_prompt], ...)
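
Spelled out, that last call can look roughly like this (a minimal sketch, reusing the genai_client from the embedding step; the chunk separator is just one possible choice):

# Note: retrieved_context must be built from the ChromaDB results *before* the
# f-string prompt above, e.g. retrieved_context = "\n\n---\n\n".join(retrieved_documents)
llm_response = genai_client.models.generate_content(
    model="gemini-1.5-flash-latest",
    contents=final_prompt,
)
print(llm_response.text)  # the grounded answer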

6. Bonus: Structured Data Logging (Few-Shot Prompting)

To show another Gen AI capability, I added a final step. Using few-shot prompting (giving the AI several examples of input and desired output), I asked Gemini to analyze the user question and the generated answer, then output a structured JSON object. This JSON contains metadata like language, topics, urgency, manual sections referenced, etc. This could be logged to a database (like MongoDB) for later analysis of how the system is used.
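
To make that concrete, here is roughly what such a few-shot prompt can look like (a sketch: the example pairs and their values are purely illustrative, and it reuses the user_question, llm_response, and genai_client from the previous steps):

import json

few_shot_prompt = f"""Analyze the question and answer below and return ONLY a JSON object
with the keys: language, topics, urgency, manual_sections.

Question: "What does error E-07 mean?"
Answer: "E-07 signals overheating. Stop the machine and check the cooling circuit."
JSON: {{"language": "en", "topics": ["error codes", "overheating"], "urgency": "high", "manual_sections": ["4.2"]}}

Question: "How often should the main shaft be lubricated?"
Answer: "Every 500 operating hours, as described in section 6.1."
JSON: {{"language": "en", "topics": ["maintenance"], "urgency": "low", "manual_sections": ["6.1"]}}

Question: "{user_question}"
Answer: "{llm_response.text}"
JSON:"""

log_response = genai_client.models.generate_content(
    model="gemini-1.5-flash-latest",
    contents=few_shot_prompt,
)
log_record = json.loads(log_response.text)  # may need stripping of markdown fences first
print(log_record)  # ready to be stored in MongoDB or similar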

My Kaggle Notebook Workflow

The Kaggle notebook puts all this together:

  1. Setup: Install/uninstall chaos, API keys, warning filters.
    • (After wrestling with pip to get the packages sorted, a stray warning kept popping up, so the notebook silences it with a warning filter.)
  2. PDF Handling: Download and extract text.
  3. Indexing: Chunk text, embed chunks, add to ChromaDB (only if DB is empty).
  4. RAG Pipeline: Ask a question, embed it, retrieve context from ChromaDB, build the grounded prompt, generate answer with Gemini Flash.
  5. JSON Output: Use few-shot prompt to generate the structured JSON log.

Hitting Roadblocks: The Not-So-Easy Parts

It wasn't all smooth sailing!

  • Dependency Hell on Kaggle: Seriously! Getting Python packages to play nicely together in the pre-built Kaggle environment was a nightmare. google-genai, chromadb, protobuf, google-api-core, torch, datasets... they all have specific needs. I spent hours trying different combinations of pip uninstall and pip install to find a setup that worked without critical errors. I fought with pip a lot! It takes a lot of patience!
  • Ensuring Grounding: Just telling the AI "don't hallucinate" isn't enough. The prompt structure is vital: explicitly provide the context and tell Gemini clearly to use ONLY that. It can still fail sometimes, especially if the retrieved context is ambiguous. Constant vigilance is needed!
  • Chunking Strategy: Finding the right chunk size and overlap is more art than science. Too small, you lose context. Too big, you might exceed model limits or include irrelevant info in the retrieved chunk. This likely needs tuning for different types of documents.
  • PDF Limitations: My solution works for PDFs with selectable text. If the PDF is just scanned images of pages, PyMuPDF won't extract any text. You'd need Optical Character Recognition (OCR) first, which adds complexity and potential errors; a rough sketch of that fallback is shown after this list.
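
For the scanned-PDF case, one common route is to rasterize each page with PyMuPDF and run it through Tesseract. This is just an illustration of that fallback, not part of my notebook (it assumes pytesseract and Pillow are installed):

import io
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_page(page):
    """Render a PDF page to an image and run Tesseract OCR on it."""
    pix = page.get_pixmap(dpi=300)                       # rasterize the page
    image = Image.open(io.BytesIO(pix.tobytes("png")))   # convert to a PIL image
    return pytesseract.image_to_string(image)            # extracted text (may contain OCR errors)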

Dreaming Bigger: Where Could This Go?

This project feels like a solid foundation. Here are some ideas for the future:

  • Real Agent Behavior: Make it conversational. Allow follow-up questions ("Okay, I checked that, what's next?"). Use memory to understand the context of the whole interaction (Long context window).
  • Full MLOps Integration: Automate the whole process. Set up a pipeline that watches for new manual versions, automatically processes them, updates the vector database, and maybe even monitors the quality of the answers.
  • Smarter Feedback Loop: Don't just generate JSON logs, use them! Analyze the questions people ask. Which parts of the manual are most queried? Which questions does the system fail to answer? Use this data to improve the manual itself, the RAG system's retrieval, or technician training.
  • Multimodality (Images & Diagrams): Modern models like Gemini Pro can understand images! Imagine asking: "Point to component X in Figure 3.4" or "What does this warning light in the photo mean?". Integrating image understanding would be amazing.
  • Function Calling: Allow the AI to trigger actions, like "Okay, create a maintenance ticket for this issue" by calling an external API.

Check Out the Code!

Building this "Smart Manual" was a fantastic way to apply different Gen AI techniques (RAG, embeddings, vector search, prompting, grounding, structured output) to solve a real-world problem.

I invite you to explore the complete code and run it yourself on Kaggle:
➡️ d3p4rt - Smart Manual on Kaggle

I'd love to hear what you think! Is this kind of project useful? What would you add? Drop a comment below or on Kaggle!

Thanks for reading this deep dive! 😊