Chatbots are getting scary good — but evaluating them? That’s still a pain. BLEU and ROUGE scores feel like trying to judge a movie by its subtitles. Human evaluation is time-consuming, inconsistent, and honestly… nobody has time for that.
So here’s the question I tackled in this project:
Can we let an LLM evaluate other LLMs?
Spoiler: Yep. And it’s shockingly effective.
The Big Idea: An LLM Rating LLMs 💡
We built an Auto-Eval system using Google’s Gemini 2.0 Flash model to rate chatbot responses on:
✅ Relevance – Does it actually answer the question?
✅ Helpfulness – Is the answer useful or just fluff?
✅ Clarity – Can a human actually understand it?
✅ Factual Accuracy – Is it hallucinating or nah?
And we didn’t just invent our own data — we pulled real conversations from the OpenAssistant dataset (OASST1). These are crowdsourced human-assistant chats, so it’s the real deal.
Setup: Let’s Get Nerdy ⚙️
Step 1: Load the Dataset
We used Hugging Face’s datasets library to load OpenAssistant’s training data and converted it to a Pandas DataFrame.
from datasets import load_dataset
oasst = load_dataset("OpenAssistant/oasst1")
df = oasst["train"].to_pandas()
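Each row is a single message in a conversation tree. A quick sanity check on the columns we filter on in the next step, just to confirm the schema lines up:
# Peek at the message tree columns and the role distribution
print(df[['message_id', 'parent_id', 'role', 'lang']].head())
print(df['role'].value_counts())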
Step 2: Extract Prompt-Response Pairs
We filtered English conversations and merged assistant replies with the prompts that triggered them.
# Keep English prompter/assistant messages and only the columns we need
df = df[df['role'].isin(['prompter', 'assistant'])][['message_id', 'parent_id', 'text', 'role', 'lang']]
df = df[df['lang'] == 'en']
# Join each message to its parent: left side = reply, right side = the prompt it answers
merged = df.merge(df, left_on="parent_id", right_on="message_id", suffixes=("_reply", "_prompt"))
merged = merged[(merged['role_reply'] == 'assistant') & (merged['role_prompt'] == 'prompter')]  # assistant answers to user prompts only
merged = merged[['text_prompt', 'text_reply']].rename(columns={'text_prompt': 'prompt', 'text_reply': 'response'})
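Quick gut check: print one pair to make sure the join lines up the way we expect.
# Inspect a random prompt-response pair (truncated for readability)
pair = merged.sample(1).iloc[0]
print("PROMPT:", pair['prompt'][:200])
print("RESPONSE:", pair['response'][:200])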
Prompt Engineering + Gemini Setup 🤖
We used a structured instruction prompt (no examples needed) to make Gemini behave like an evaluator and return scores in JSON format.
Here’s the eval prompt we send:
def build_eval_prompt(prompt, response):
    return f"""
You are an evaluator. Rate this response to a user prompt.
Rate from 1 to 5 on:
- Relevance
- Helpfulness
- Clarity
- Factual accuracy
Return ONLY valid JSON:
{{
  "relevance": X,
  "helpfulness": X,
  "clarity": X,
  "factuality": X
}}
Prompt: {prompt}
Response: {response}
"""
And then we just hit Gemini with that:
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Try it on one pair from the merged DataFrame
prompt, response = merged.iloc[0]['prompt'], merged.iloc[0]['response']
result = model.generate_content(build_eval_prompt(prompt, response))
Running the Evaluation Loop 🧪
We ran the model on a sample of 15 prompt-response pairs and parsed the scores:
import json

sampled = merged.sample(15)  # the 15 prompt-response pairs mentioned above

ratings = []
for _, row in sampled.iterrows():
    try:
        res = model.generate_content(build_eval_prompt(row['prompt'], row['response']))
        # Grab just the JSON block, in case the model wraps it in extra text
        json_block = res.text[res.text.find('{'):res.text.rfind('}') + 1]
        score = json.loads(json_block)
        score.update(row.to_dict())  # keep the original prompt/response next to the scores
        ratings.append(score)
    except Exception as e:
        print("Error:", e)
Boom. Now we have an LLM scoring LLMs. Matrix-style.
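One practical note before you scale this up: the Gemini API may throttle you if you fire off requests back to back. Here’s a minimal sketch of one way to pace the calls; the one-second delay is a guess, not a documented quota, so tune it to whatever limits you actually hit.
import time

def rate_limited_eval(prompt, response, delay=1.0):
    # Call the Gemini judge, then pause so we don't hammer the API
    res = model.generate_content(build_eval_prompt(prompt, response))
    time.sleep(delay)  # delay is a guess; adjust to your quota
    return res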
Visualizing the Scores 📊
We saved everything to a CSV and used a Seaborn boxplot to get the vibes:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ratings_df = pd.DataFrame(ratings)
ratings_df.to_csv("auto_eval_scores.csv", index=False)  # any filename works

sns.boxplot(data=ratings_df[['relevance', 'helpfulness', 'clarity', 'factuality']])
plt.title("LLM Auto-Eval Score Distribution")
plt.show()
And the results? Pretty solid. Some outliers, but Gemini gave reasonable scores across all four dimensions.
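If you want numbers instead of vibes, a quick summary of the score columns does the job:
# Per-dimension mean, spread, and min/max
print(ratings_df[['relevance', 'helpfulness', 'clarity', 'factuality']].describe().round(2))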
Takeaways 🔍
✅ This works. Gemini can evaluate chatbot responses consistently.
🎯 It scales. No need to bug your friends to rate 200 replies.
🤖 Model comparisons just got easier. Want to compare GPT vs Claude vs Mistral? Auto-eval it (rough sketch below).
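Here’s what that comparison loop could look like. The candidate_responses dict is made up, so swap in real outputs from whichever models you want to pit against each other; the judge stays the same Gemini model from above.
# Hypothetical: the same prompt answered by different models
prompt = merged.iloc[0]['prompt']
candidate_responses = {
    "model_a": "Answer from the first model...",
    "model_b": "Answer from the second model...",
}

comparison = []
for name, reply in candidate_responses.items():
    res = model.generate_content(build_eval_prompt(prompt, reply))
    json_block = res.text[res.text.find('{'):res.text.rfind('}') + 1]
    scores = json.loads(json_block)
    scores["model"] = name
    comparison.append(scores)

print(pd.DataFrame(comparison))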
What’s Next? 🛣️
📈 Add more examples and multiple models for A/B testing.
🤯 Detect hallucinations automatically.
🧑‍⚖️ Compare LLM vs human evaluations: who rates better?
🧪 Try It Yourself
Want to peek under the hood or run it with your own data?
👉 Check out the full notebook on Kaggle
Clone it, tweak it, break it (just don’t blame me 😅).
P.S.: This post was rated 5/5 on clarity by my cat. And 2/5 on factuality by my anxiety.