Chatbots are getting scary good — but evaluating them? That’s still a pain. BLEU and ROUGE scores feel like trying to judge a movie by its subtitles. Human evaluation is time-consuming, inconsistent, and honestly… nobody has time for that.

So here’s the question I tackled in this project:
Can we let an LLM evaluate other LLMs?

Spoiler: Yep. And it’s shockingly effective.


The Big Idea: LLM Rating LLM 💡

We built an Auto-Eval system using Google’s Gemini 2.0 Flash model to rate chatbot responses on:

✅ Relevance – Does it actually answer the question?

✅ Helpfulness – Is the answer useful or just fluff?

✅ Clarity – Can a human actually understand it?

✅ Factual Accuracy – Is it hallucinating or nah?

And we didn’t just invent our own data — we pulled real conversations from the OpenAssistant dataset (OASST1). These are crowdsourced human-assistant chats, so it’s the real deal.


Setup: Let’s Get Nerdy ⚙️

Step 1: Load the Dataset

We used Hugging Face’s datasets library to load OpenAssistant’s training data and converted it to a Pandas DataFrame.

from datasets import load_dataset

oasst = load_dataset("OpenAssistant/oasst1")
df = oasst["train"].to_pandas()

Step 2: Extract Prompt-Response Pairs

We filtered English conversations and merged assistant replies with the prompts that triggered them.

df = df[df['role'].isin(['prompter', 'assistant'])][['message_id', 'parent_id', 'text', 'role', 'lang']]
df = df[df['lang'] == 'en']

# Join each message to its parent: the left side is the reply, the right side is the message it answers
merged = df.merge(df, left_on="parent_id", right_on="message_id", suffixes=("_reply", "_prompt"))

# Keep only assistant replies to prompter messages
merged = merged[(merged['role_reply'] == 'assistant') & (merged['role_prompt'] == 'prompter')]
merged = merged[['text_prompt', 'text_reply']].rename(columns={'text_prompt': 'prompt', 'text_reply': 'response'})
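
Quick sanity check (your first row will differ depending on the dataset snapshot):

# Peek at one extracted pair to make sure the merge did what we think it did
example = merged.iloc[0]
print("Prompt:", example['prompt'][:120])
print("Response:", example['response'][:120])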

Prompt Engineering + Gemini Setup 🤖

We used prompt engineering to make Gemini behave like an evaluator and return structured scores (in JSON format): the rubric and the expected JSON schema are baked right into the prompt.

Here’s the eval prompt we send:

def build_eval_prompt(prompt, response):
    return f"""
You are an evaluator. Rate this response to a user prompt.

Rate from 1 to 5 on:
- Relevance
- Helpfulness
- Clarity
- Factual accuracy

Return ONLY valid JSON:
{{
  "relevance": X,
  "helpfulness": X,
  "clarity": X,
  "factuality": X
}}

Prompt: {prompt}
Response: {response}
"""

And then we just hit Gemini with that:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # better: load the key from an environment variable
model = genai.GenerativeModel("gemini-2.0-flash")

# Use a new name here; `response` already holds the chatbot reply we're grading
eval_result = model.generate_content(build_eval_prompt(prompt, response))

Running the Evaluation Loop 🧪

We ran the model on a sample of 15 prompt-response pairs and parsed the scores:

import json

sampled = merged.sample(15, random_state=42)  # 15 pairs, as described above
ratings = []

for _, row in sampled.iterrows():
    try:
        res = model.generate_content(build_eval_prompt(row['prompt'], row['response']))
        # Slice out the {...} block in case Gemini wraps the JSON in extra text
        json_block = res.text[res.text.find('{'):res.text.rfind('}') + 1]
        score = json.loads(json_block)
        score.update(row.to_dict())  # keep the prompt and response alongside the scores
        ratings.append(score)
    except Exception as e:
        print("Error:", e)

Boom. Now we have an LLM scoring LLMs. Matrix-style.


Visualizing the Scores 📊

We saved everything to a CSV and used a Seaborn boxplot to get the vibes:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ratings_df = pd.DataFrame(ratings)
ratings_df.to_csv("auto_eval_scores.csv", index=False)  # filename is arbitrary

sns.boxplot(data=ratings_df[['relevance', 'helpfulness', 'clarity', 'factuality']])
plt.title("LLM Auto-Eval Score Distribution")
plt.show()
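
If you want hard numbers next to the boxplot, the per-dimension averages are a one-liner (same column names as the JSON keys):

# Mean score per dimension across the sampled pairs
print(ratings_df[['relevance', 'helpfulness', 'clarity', 'factuality']].mean().round(2))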

And the results? Pretty solid. Some outliers, but Gemini gave reasonable scores across all four dimensions.


Takeaways 🔍

✅ This works. Gemini can evaluate chatbot responses consistently.

🎯 It scales. No need to bug your friends to rate 200 replies.

🤖 Model comparisons just got easier. Want to compare GPT vs Claude vs Mistral? Auto-eval it.
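
For the curious, here's roughly what that comparison could look like. It's a sketch, assuming you run the same evaluation loop once per model over the same prompts and keep each model's list of score dicts (the names below are placeholders):

import pandas as pd

DIMS = ['relevance', 'helpfulness', 'clarity', 'factuality']

def compare_models(ratings_by_model):
    # ratings_by_model maps a model name to its list of score dicts (hypothetical input)
    return pd.DataFrame({
        name: pd.DataFrame(scores)[DIMS].mean()
        for name, scores in ratings_by_model.items()
    }).round(2)

# e.g. print(compare_models({"model_a": ratings_a, "model_b": ratings_b}))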

What’s Next? 🛣️

📈 Add more examples and multiple models for A/B testing.

🤯 Detect hallucinations automatically.

🧑‍⚖️ Compare LLM vs human evaluations — who rates better?


🧪 Try It Yourself

Want to peek under the hood or run it with your own data?

👉 Check out the full notebook on Kaggle

Clone it, tweak it, break it (just don’t blame me 😅).

P.S.: This post was rated 5/5 on clarity by my cat. And 2/5 on factuality by my anxiety.