RAG stands for Retrieval Augmented Generation. It is an AI technique that retrieves information from an integrated knowledge base; the retrieved information is then passed to the LLM as context to generate a response to the user's input.

Image generated by AI (Grok AI)

No tricks behind the title of this story: it is exactly what I did. As a beginner, I did not know where to start, so I explored a few options to learn about RAG. I came across LangChain, FAISS, Langflow and other such libraries, yet I wasn't able to tie all the ends together, so I sought the knowledge of ChatGPT :D
It responded with the complete code to implement RAG using open-source libraries and then pass the retrieved context to an LLM (via the Groq API, with llama-3-70b). Let us go through the code and understand it.

Folder Structure

rag_groq/
│
├── main.py                  # Entry point
├── ingest.py                # Load & embed PDF content
├── query.py                 # Ask questions using Groq API
├── config.py                # Configuration (Groq key, model, etc.)
├── sample.pdf               # Your test document
└── requirements.txt         # Required Python libraries
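
requirements.txt

The requirements file is listed in the structure above but its contents were not shown in the response; the entries below are only a sketch inferred from the imports used in the scripts (no specific versions are assumed).

# requirements.txt (inferred from the imports used below)
PyPDF2
faiss-cpu
sentence-transformers
tqdm
groq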

config.py

# config.py
GROQ_API_KEY = "gsk_*"
GROQ_MODEL = "llama3-70b-8192"  # or another model hosted on Groq
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
VECTOR_DB_PATH = "vector_store"

The Groq API key is required to run inference with the Llama 3 model. We also define the embedding model that converts the document content into embeddings, and the path where both the FAISS index and the text chunks will be stored.
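
Hardcoding the key works for a quick test, but a slightly safer variant (my own tweak, not part of the original response) is to read it from an environment variable:

# config.py (variant): read the key from the environment instead of hardcoding it
import os

GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")  # export GROQ_API_KEY=gsk_... before running
GROQ_MODEL = "llama3-70b-8192"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
VECTOR_DB_PATH = "vector_store"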

main.py

from ingest import build_vector_store
from query import query_pdf

if __name__ == "__main__":
    print("📥 Ingesting PDF and building vector store...")
    build_vector_store("sample.pdf")

    print("✅ Ingest complete. You can now ask questions.")
    while True:
        query = input("🔍 Ask a question (or type 'exit'): ")
        if query.lower() == "exit":
            break
        response = query_pdf(query)
        print("\n💬 Response:", response, "\n")

This is the main driver code that executes the entire flow. We import build_vector_store from ingest.py to ingest the user's document and build the custom knowledge base for the LLM (covered in detail in the next snippet). Then, in an infinite loop, we read a question from the user (until the user types 'exit') and answer it against the document using query_pdf() from query.py.

ingest.py

import os
import PyPDF2
import faiss
import pickle
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from config import VECTOR_DB_PATH, EMBEDDING_MODEL

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        return "\n".join(page.extract_text() or '' for page in reader.pages)

def chunk_text(text, chunk_size=300, overlap=50):
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size - overlap)
    ]

def embed_chunks(chunks, model_name):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks, show_progress_bar=True)
    return embeddings

def build_vector_store(pdf_path):
    raw_text = extract_text_from_pdf(pdf_path)
    chunks = chunk_text(raw_text)
    embeddings = embed_chunks(chunks, EMBEDDING_MODEL)

    index = faiss.IndexFlatL2(len(embeddings[0]))  # flat L2 index sized to the embedding dimension
    index.add(embeddings)  # encode() returns a float32 numpy array, which FAISS accepts directly

    os.makedirs(VECTOR_DB_PATH, exist_ok=True)
    faiss.write_index(index, os.path.join(VECTOR_DB_PATH, "index.faiss"))

    with open(os.path.join(VECTOR_DB_PATH, "chunks.pkl"), "wb") as f:
        pickle.dump(chunks, f)

    print(f"Stored {len(chunks)} chunks in vector DB.")

if __name__ == "__main__":
    build_vector_store("sample.pdf")

To extract the content from the document, we use the PyPDF2 library and return the text of all pages as a single string. That string is then split into word-based chunks of 300 words (the default), with an overlap of 50 words between consecutive chunks. The chunk list is passed to a SentenceTransformer model (all-MiniLM-L6-v2) and encoded into embeddings. The embeddings are indexed with faiss.IndexFlatL2 and written to the vector DB path, and the chunks themselves are stored alongside in a pickle file.
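
To see how the overlap works, here is a tiny illustration of chunk_text with deliberately small parameters (the numbers are chosen only for this demo; the defaults in the script are 300 and 50):

# chunking demo: 8 words, chunks of 5 with an overlap of 2
from ingest import chunk_text

text = " ".join(f"w{i}" for i in range(1, 9))   # "w1 w2 ... w8"
print(chunk_text(text, chunk_size=5, overlap=2))
# ['w1 w2 w3 w4 w5', 'w4 w5 w6 w7 w8', 'w7 w8']

Each chunk repeats the last two words of the previous one, so a sentence that straddles a chunk boundary is still retrievable from at least one chunk.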

query.py

import faiss
import pickle
from sentence_transformers import SentenceTransformer
from config import GROQ_API_KEY, GROQ_MODEL, VECTOR_DB_PATH, EMBEDDING_MODEL
from groq import Groq

client = Groq(api_key=GROQ_API_KEY)

def load_vector_store():
    index = faiss.read_index(f"{VECTOR_DB_PATH}/index.faiss")
    with open(f"{VECTOR_DB_PATH}/chunks.pkl", "rb") as f:
        chunks = pickle.load(f)
    return index, chunks

def get_top_k_chunks(query, index, chunks, k=5):
    embed_model = SentenceTransformer(EMBEDDING_MODEL)
    query_vec = embed_model.encode([query])
    D, I = index.search(query_vec, k)  # D: L2 distances, I: indices of the nearest chunks
    return [chunks[i] for i in I[0]]

def ask_llm(query, context):
    prompt = f"""Answer the question based on the context below.

Context:
{context}

Question:
{query}
"""
    response = client.chat.completions.create(
        model=GROQ_MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

def query_pdf(query):
    index, chunks = load_vector_store()
    relevant_chunks = get_top_k_chunks(query, index, chunks)
    combined_context = "\n\n".join(relevant_chunks)
    return ask_llm(query, combined_context)

if __name__ == "__main__":
    while True:
        user_q = input("Ask a question (or type 'exit'): ")
        if user_q.lower() == "exit":
            break
        answer = query_pdf(user_q)
        print("\n🧠 Answer:", answer, "\n")

This is the part where querying happens. The stored FAISS index and the chunks are loaded, the user's question is embedded with the same SentenceTransformer model, and the top-k most similar chunks are retrieved. Those chunks are joined into the context of a simple prompt template, which is then sent to the Llama 3 model via the Groq API.
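
If you want to inspect how relevant the retrieved chunks actually are, a small variant of get_top_k_chunks (my own sketch, not part of the original code; the function name is made up) can also return the L2 distances that FAISS computes during the search:

# query.py (optional variant): return the retrieved chunks together with their L2 distances
def get_top_k_chunks_with_scores(query, index, chunks, k=5):
    embed_model = SentenceTransformer(EMBEDDING_MODEL)
    query_vec = embed_model.encode([query])
    D, I = index.search(query_vec, k)  # D: distances, I: indices into the chunk list
    return [(chunks[i], float(d)) for i, d in zip(I[0], D[0])]

Smaller distances mean closer matches, so printing these scores is an easy way to sanity-check the retrieval step before blaming the LLM for a bad answer.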

Voilà! That’s it!

Output Screenshot

Happy learning! Happy coding!!