RAG is an architecture where an LLM (Large Language Model) is not asked to answer purely from its own parameters but is given external information at runtime to use while answering.

Goal:

  • Reduce hallucination.
  • Answer based on updated, private, or niche data without re-training the model.

Image description


Breaking Down RAG — Technical View

1. Retriever

  • Input: A user query (like "What are the side effects of aspirin?")
  • Goal: Find the most relevant information from an external knowledge source (documents, database, etc.)
  • How:
    • The user query is first passed through an embedding model (like OpenAI text-embedding-ada-002) that turns the text into a high-dimensional vector (like a list of 1536 numbers).
    • Stored documents are already embedded beforehand into the same vector space.
    • Use vector similarity search (cosine similarity, Euclidean distance, etc.) to find top k nearest documents.

This is exactly like "nearest neighbors search" but in high-dimensional space.

📚 Key Tech Terminologies:

  • Embedding: Mapping text into numerical vectors capturing semantic meaning.
  • Vector Database: Special database optimized for fast nearest-neighbor search (e.g., FAISS, Pinecone, ChromaDB).
  • Cosine Similarity: Mathematical measure of "how similar" two vectors are.
  • Approximate Nearest Neighbor (ANN): Search algorithm used to make vector search fast in high dimensions.

2. Augmentation

  • What: After retrieving top relevant documents (say top 5 chunks), we prepare a prompt that includes both:
    • The retrieved knowledge.
    • The original question.

Example:

Context:
- Aspirin is known to cause stomach ulcers if taken in high doses.
- Patients with asthma may experience worsened symptoms with aspirin.

Question:
What are the side effects of aspirin?

Answer:
  • This entire constructed text is fed into the LLM.

📚 Key Tech Terminologies:

  • Prompt Engineering: Crafting the right input text for a model.
  • Context Window: The maximum number of tokens (words/pieces) the model can take at once. (e.g., GPT-4 Turbo has 128k tokens)

3. Generator

  • Model's Job: Generate a coherent, fluent, and grounded answer using the provided context.
  • LLM behavior:
    • Reads the question and retrieved context.
    • Predicts the next token at each step (one token at a time) until the output is complete.

📚 Key Tech Terminologies:

  • Autoregressive Language Model: A model that predicts the next word/token given previous words.
  • Token: A word piece, not always a full word (e.g., "un-", "happi-", "-ness" could be tokens).

What Exactly is a Model (in LLMs)?

✅ Technically:

A model is a huge mathematical function:

$$ f(\text{input}) \to \text{output} $$

More specifically for LLMs:

  • It's a neural network.
  • With hundreds of layers of matrix multiplications and nonlinear transformations.
  • Trained to minimize a loss function (how wrong it was when predicting the next token).

Formula View:

  • If ( x ) = input tokens, and ( y ) = output tokens,
  • The model learns a function ( f_{\theta}(x) ) parameterized by θ (weights) such that:
    • ( \text{Loss} = \text{CrossEntropy}(f_{\theta}(x), y) ) is minimized.

✅ The model's weights θ are massive matricesGPT-3 had 175 billion parameters (numbers).


How are LLMs Trained Fastly?

Good question — because LLMs training seems impossible, yet it’s done!

Ingredients to fast LLM training:

  1. Parallelization:

    • Data Parallelism: Split the batch across GPUs.
    • Model Parallelism: Split parts of the model across GPUs.
    • Tensor Parallelism: Split individual matrices inside layers.
    • Pipeline Parallelism: Different GPUs work on different sequential layers.
  2. Hardware:

    • Specialized hardware: NVIDIA A100, H100 GPUs.
    • Use TPUs (Tensor Processing Units) in Google's case.
    • Very high memory bandwidth (HBM memory).
  3. Optimization Techniques:

    • Mixed Precision Training: Use float16 (half-precision) instead of float32 to make it faster without much loss.
    • Gradient Checkpointing: Save memory by recomputing certain activations during backprop.
    • AdamW Optimizer: Special gradient descent variant that helps converge faster.
  4. Training Tricks:

    • Curriculum Learning: Start easy, go harder.
    • Masked Language Modeling (for pretraining): Randomly mask some words to predict.
  5. Distributed Systems:

    • Use frameworks like DeepSpeed, Megatron-LM, FSDP (Fully Sharded Data Parallel).
    • Train on thousands of GPUs at the same time across multiple datacenters.

Training Dataset Size:

  • Trillion tokens (Common Crawl, books, code, forums, wikipedia).
  • LLMs learn token-by-token prediction, very, very slowly but over giant amounts of data.

Summary

Concept Description
Retriever Finds relevant data from external source
Augmentation Adds found documents to the user prompt
Generator LLM generates an answer based on the prompt
Model Giant neural network = mathematical function ( f_\theta(x) )
Training Fast Parallelization, hardware acceleration, optimizer tricks

⚡ Visual View of RAG ⚡

+-----------------+
  | User's Query (Q) |
  +-----------------+
           |
           v
 +----------------------+
 | Query Encoder (Embed) |
 | (ex: OpenAI Embedding)|
 +----------------------+
           |
           v
 +-------------------------+
 | Similarity Search Engine |
 | (Vector Database: FAISS, |
 | Pinecone, ChromaDB)      |
 +-------------------------+
           |
           v
 +------------------------------+
 | Retrieved Top-K Context Docs |
 +------------------------------+
           |
           v
 +-------------------------------------------------+
 | Construct Final Prompt:                         |
 | [Retrieved Context] + [Original User Question]   |
 +-------------------------------------------------+
           |
           v
 +--------------------------+
 |  Large Language Model (LLM)|
 | (ex: GPT-4, Llama 3)       |
 +--------------------------+
           |
           v
 +---------------------+
 | Generated Final Answer |
 +---------------------+

Waiting for comments!