📌 Problem Statement: Video Overload, Information Lost

In today’s digital age, we consume a massive amount of video content—tutorials, lectures, interviews, podcasts, and more. But revisiting a 30-minute YouTube video just to find one key idea? That’s inefficient.

Wouldn’t it be powerful if you could just ask a question and instantly get an answer from a video?

That’s exactly the problem I tackled in my recent Kaggle notebook using Generative AI + Embeddings + vector search.


🧠 The Solution: Gen AI + Embedding Search on YouTube Transcripts

The goal was to build a pipeline that:

  1. Fetches a YouTube video's transcript
  2. Splits it into manageable chunks
  3. Embeds those chunks using a Gemini embedding model
  4. Stores them in ChromaDB for fast similarity search
  5. Uses a Generative AI model to answer natural language queries from the user

📄 Implementation Breakdown

1. Fetching the Transcript

We use youtube_transcript_api to extract the transcript directly from a YouTube video:

from youtube_transcript_api import YouTubeTranscriptApi

video_url = 'https://www.youtube.com/watch?v=pTB0EiLXUC8'
video_id = video_url.split('v=')[1]  # naive parse; assumes a .../watch?v=<id> URL
# Returns a list of caption segments, each a dict with 'text', 'start', and 'duration'
transcript_text = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])

2. Chunking the Transcript

To keep each piece small enough to embed and process, we break the transcript into chunks of roughly 500 characters:

def chunk_transcript(transcript, chunk_size=500):
    """Greedily merge caption segments into chunks of at most ~chunk_size characters."""
    chunks, current_chunk = [], ""
    for item in transcript:
        text = item['text']
        if len(current_chunk) + len(text) <= chunk_size:
            current_chunk += " " + text
        else:
            chunks.append(current_chunk.strip())
            current_chunk = text
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
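
Applying the chunker to the transcript fetched in step 1 gives the chunks list used in the next step:

chunks = chunk_transcript(transcript_text)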

3. Embedding and Storing in ChromaDB
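
The snippets in this step and the next use a Gemini client (client) and a ChromaDB collection (collection) without showing how they are created. A minimal setup sketch, assuming the google-genai and chromadb packages, with an illustrative collection name:

from google import genai
import chromadb

# Assumed setup: the names client and collection match the snippets below
client = genai.Client(api_key="YOUR_GEMINI_API_KEY")
chroma_client = chromadb.Client()  # in-memory ChromaDB instance
collection = chroma_client.get_or_create_collection(name="youtube_transcript_chunks")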

We embed each chunk using Google’s Gemini embedding model and store them in a ChromaDB collection for retrieval later:

# Embed all chunks in a single batch with Gemini's text embedding model
embedding_response = client.models.embed_content(
    model="models/text-embedding-004",
    contents=chunks
)

embeddings = [e.values for e in embedding_response.embeddings]

# Store each chunk alongside its embedding in the ChromaDB collection
for i, (emb, chunk) in enumerate(zip(embeddings, chunks)):
    collection.add(
        ids=[str(i)],
        embeddings=[emb],
        metadatas=[{'chunk': chunk}],
        documents=[chunk]
    )

4. Querying with Generative AI

When a user enters a question, we:

  • Embed the query
  • Search ChromaDB for the most relevant chunks (see the sketch below)
  • Feed the results into Gemini to generate a concise answer
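
The retrieval helper isn't shown above, so here is a minimal sketch of what get_relevant_chunks (called in step 5) could look like: it embeds the question with the same Gemini embedding model and asks ChromaDB for the nearest stored chunks (top_k is an illustrative parameter):

def get_relevant_chunks(query, collection, top_k=3):
    # Embed the user's question with the same model used for the transcript chunks
    query_embedding = client.models.embed_content(
        model="models/text-embedding-004",
        contents=query
    ).embeddings[0].values
    # Ask ChromaDB for the closest stored chunks to the query embedding
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    return results['documents'][0] if results['documents'] else []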

The retrieved chunks are then stitched into a prompt and passed to Gemini:

def generate_answer(query, relevant_chunks):
    # Ground the prompt in the retrieved transcript chunks
    context = " ".join(relevant_chunks)
    prompt = f"Question: {query}\n\nContext: {context}"
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )
    return response.candidates[0].content.parts[0].text

5. User Query Execution

This final step runs the full RAG pipeline and returns an LLM-generated answer that is directly grounded in the original transcript content:

user_query = 'what problem did object oriented programming come to solve?'
relevant_chunks = get_relevant_chunks(user_query, collection)
if relevant_chunks:
    answer = generate_answer(user_query, relevant_chunks)
    print("Answer:", answer)
else:
    print("No relevant chunks found to generate an answer.")

🔮 Future Possibilities

  • Multilingual support: Handle transcripts in multiple languages, with a translation layer where needed.
  • Transcript availability: Not all YouTube videos have transcripts or English subtitles; for those, the audio could be extracted and converted into a transcript with a speech-to-text model.
  • Summarization layer: Automatically summarize full videos.

🏁 Conclusion

This project demonstrates how Gen AI and vector databases can work together to transform passive video content into an interactive knowledge base.

With just a YouTube link, you can now ask intelligent questions and get answers backed by the video transcript—all powered by embeddings and Gemini.

Check out the full notebook on Kaggle and try it out with your favorite videos!
If you found it helpful, please upvote the notebook!
Kaggle notebook link