📌 Problem Statement: Video Overload, Information Lost
In today’s digital age, we consume a massive amount of video content—tutorials, lectures, interviews, podcasts, and more. But revisiting a 30-minute YouTube video just to find one key idea? That’s inefficient.
Wouldn’t it be powerful if you could just ask a question and instantly get an answer from a video?
That’s exactly the problem I tackled in my recent Kaggle notebook using Generative AI + Embeddings + vector search.
🧠 The Solution: Gen AI + Embedding Search on YouTube Transcripts
The goal was to build a pipeline that:
- Fetches a YouTube video's transcript
- Splits it into manageable chunks
- Embeds those chunks using a Gemini embedding model
- Stores them in ChromaDB for fast similarity search
- Uses a Generative AI model to answer natural language queries from the user
📄 Implementation Breakdown
1. Fetching the Transcript
We use the youtube_transcript_api library to extract the transcript directly from a YouTube video:
from youtube_transcript_api import YouTubeTranscriptApi

video_url = 'https://www.youtube.com/watch?v=pTB0EiLXUC8'
video_id = video_url.split('v=')[1]  # everything after 'v=' is the video ID

# Returns a list of snippets, each a dict with 'text', 'start', and 'duration'
transcript_text = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
2. Chunking the Transcript
To make it easier to process and embed, we break the transcript into smaller parts:
def chunk_transcript(transcript, chunk_size=500):
    chunks, current_chunk = [], ""
    for item in transcript:
        text = item['text']
        # Append snippets until the current chunk reaches ~chunk_size characters
        if len(current_chunk) + len(text) <= chunk_size:
            current_chunk += " " + text
        else:
            chunks.append(current_chunk.strip())
            current_chunk = text
    # Don't forget the final, partially filled chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
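Before embedding, the fetched transcript is passed through this function; in the notebook that presumably looks like:

chunks = chunk_transcript(transcript_text)
print(f"Created {len(chunks)} chunks")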
3. Embedding and Storing in ChromaDB
We embed each chunk using Google’s Gemini embedding model and store them in a ChromaDB collection for retrieval later:
# Embed all chunks in a single call to Gemini's embedding model
embedding_response = client.models.embed_content(
    model="models/text-embedding-004",
    contents=chunks
)
embeddings = [e.values for e in embedding_response.embeddings]

# Store each chunk, its embedding, and its metadata in the ChromaDB collection
for emb, meta in zip(embeddings, [{'chunk': c} for c in chunks]):
    collection.add(
        ids=[str(meta['chunk'])],  # the chunk text itself serves as the ID
        embeddings=[emb],
        metadatas=[meta],
        documents=[meta['chunk']]
    )
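Note that the client (google-genai) and collection (ChromaDB) objects used above are set up earlier in the notebook and not shown here. A minimal setup sketch, assuming an in-memory Chroma instance, a hypothetical collection name, and an API key read from the environment:

import os
import chromadb
from google import genai

# Gemini client (assumes GOOGLE_API_KEY is set in the environment)
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

# In-memory ChromaDB collection to hold the transcript chunks
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(name="youtube_transcript")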
4. Querying with Generative AI
When a user enters a question, we:
- Embed the query
- Search ChromaDB for the most relevant chunks (a sketch of this helper follows below)
- Feed the results into Gemini to generate a concise answer
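The retrieval helper get_relevant_chunks is called in the final step but isn't reproduced in this post. A minimal sketch of what it could look like, reusing the same client and collection objects and an assumed default of three results:

def get_relevant_chunks(query, collection, n_results=3):
    # Embed the query with the same Gemini embedding model used for the chunks
    query_embedding = client.models.embed_content(
        model="models/text-embedding-004",
        contents=query
    ).embeddings[0].values

    # Ask ChromaDB for the stored chunks most similar to the query
    results = collection.query(query_embeddings=[query_embedding], n_results=n_results)
    return results['documents'][0] if results['documents'] else []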
def generate_answer(query, relevant_chunks):
    # Join the retrieved chunks into a single context block for the prompt
    context = " ".join(relevant_chunks)
    prompt = f"Question: {query}\n\nContext: {context}"
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=prompt
    )
    return response.candidates[0].content.parts[0].text
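(In the google-genai SDK, response.text is a convenient shorthand for that same nested field.)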
5. User Query Execution
This final step runs the full RAG pipeline and returns an LLM-generated answer that is directly grounded in the original transcript content.
user_query = 'What problem did object-oriented programming come to solve?'
relevant_chunks = get_relevant_chunks(user_query, collection)
if relevant_chunks:
    answer = generate_answer(user_query, relevant_chunks)
    print("Answer:", answer)
else:
    print("No relevant chunks found to generate an answer.")
🔮 Future Possibilities
- Multilingual support: Transcripts in multiple languages and translation layers.
- Transcript availability: Not all YouTube videos have transcripts or English subtitles; the audio could be extracted and converted into a transcript with a speech-to-text model.
- Summarization layer: Automatically summarize full videos.
🏁 Conclusion
This project demonstrates how Gen AI and vector databases can work together to transform passive video content into an interactive knowledge base.
With just a YouTube link, you can now ask intelligent questions and get answers backed by the video transcript—all powered by embeddings and Gemini.
Check out the full notebook on Kaggle and try it out with your favorite videos!
Upvote the notebook if you found it useful!
Kaggle notebook link