I'm almost embarrassed to admit this, but for the longest time, I was using open-source LLMs completely wrong. It wasn’t until I started working on projects and diving into real-world deployments that I realized why my local setup was constantly hitting walls.
Here’s the tea 🫖 — and what I wish I knew months ago. Hopefully, this post helps you skip some headaches and build faster.
🚨 Mistake #1: Fine-Tuning Chat Models Instead of Base Models
If you're trying to fine-tune a model, don't start from the chat (instruct) version. Always start with the base model.
Why? Because chat models are already instruction-tuned. Stacking your custom instructions on top leads to weird behavior and overfitting. Base models are like a blank canvas — perfect for targeted fine-tuning without the baked-in assumptions.
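To make it concrete, here's a minimal sketch of the difference using Hugging Face transformers (the meta-llama model IDs are just examples; swap in whatever family you're actually using):

```python
# A rough sketch, assuming Hugging Face transformers is installed and you
# have access to the meta-llama checkpoints (example IDs only).
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-3.1-8B"           # base: no instruction tuning
CHAT_ID = "meta-llama/Llama-3.1-8B-Instruct"  # chat: already instruction-tuned

# Fine-tune from the base checkpoint, not the chat one.
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID)
```

The only real change is which checkpoint you point at, but it saves you from fighting instructions that were already baked in.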
🤯 Mistake #2: Using the Wrong Model for the Wrong Job
I used to throw Llama 3.2 at everything:
- Chatbot? ✅
- Code generation? ✅
- Long document summarization? ✅
Terrible idea.
Here’s what I learned:
- Llama-ChatQA is best for instruction following and dialogue.
- Code Llama is better for code generation and reasoning.
- Base models are best for custom fine-tuning.
Knowing this made a massive difference. My outputs improved instantly when I matched the model to the task.
Pro tip: Fine-tune base models for more precise results.
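If it helps, I now keep a tiny task-to-model map in my code so I stop reaching for one model out of habit. A sketch (the model IDs are examples from the Hugging Face Hub; adjust to taste):

```python
# Illustrative only: map each job to a model that's actually built for it.
TASK_TO_MODEL = {
    "dialogue": "nvidia/Llama3-ChatQA-1.5-8B",  # instruction following / chat QA
    "code": "codellama/CodeLlama-7b-hf",        # code generation and reasoning
    "fine_tuning": "meta-llama/Llama-3.1-8B",   # base model for custom training
}

def pick_model(task: str) -> str:
    # Fail loudly instead of silently falling back to the wrong model.
    if task not in TASK_TO_MODEL:
        raise ValueError(f"No model configured for task: {task}")
    return TASK_TO_MODEL[task]

print(pick_model("code"))  # codellama/CodeLlama-7b-hf
```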
✨ Mistake #3: Not Formatting Prompts Correctly
Prompt formatting is crucial, especially with chat-style models like Llama-Chat.
If you’re not wrapping your instructions properly, the model can get confused or keep generating unnecessary outputs.
How to format prompts correctly: use the [INST] and [/INST] tags:
[INST] Explain the difference between a hash map and an array. [/INST]
This structure helps the model understand exactly what you want, preventing it from auto-completing the prompt and giving you a clear response instead.
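In code, that can be as simple as wrapping the instruction yourself. And if your model ships a chat template, `tokenizer.apply_chat_template` will produce the right format for that family so you don't have to memorize the tags. A sketch (assuming a Llama-2-style chat model; the model ID is just an example):

```python
# Sketch: two ways to get a correctly formatted chat prompt.
from transformers import AutoTokenizer

# 1) Manual wrapping with [INST]/[/INST] (Llama-2-chat-style models).
def format_inst(prompt: str) -> str:
    return f"[INST] {prompt} [/INST]"

print(format_inst("Explain the difference between a hash map and an array."))

# 2) Let the tokenizer's chat template handle it (works across model families).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # example ID
messages = [{"role": "user", "content": "Explain the difference between a hash map and an array."}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(formatted)
```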
💰 Mistake #4: Not Using Base Models for Cheap Fine-Tuning
Want to train on your own dataset without burning cash?
Use the base model (not the instruct/chat model) combined with Lamini. This gives you more control and reduces costs.
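Lamini handles the training loop for you. If you want a feel for why base-model fine-tuning can be cheap in general, the same idea in open tooling is a parameter-efficient LoRA fine-tune. Here's a rough sketch with Hugging Face PEFT as an alternative route (the model ID and hyperparameters are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_ID = "meta-llama/Llama-3.1-8B"  # base model, not the -Instruct variant
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID)

# LoRA trains a small set of adapter weights instead of the full model,
# which is what keeps the cost down.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the weights
```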
🧠 Mistake #5: Skipping RAG (Retrieval-Augmented Generation)
Most hallucinations happen when you ask the model for information it doesn’t “know.”
The solution? Use a RAG (Retrieval-Augmented Generation) pipeline. Think of it like giving your model a cheat sheet during inference.
Examples:
- Ask questions over long PDFs → index docs, search, and inject into the prompt.
- Dynamic FAQ bots → search your knowledge base and generate answers on top.
Hallucinations drop, and accuracy rises.
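Here's a bare-bones sketch of the "cheat sheet" idea, using sentence-transformers for retrieval (the chunks are placeholders, and the final generation call is left out; a real pipeline would add chunking, a vector store, and so on):

```python
# Minimal RAG sketch: embed chunks, retrieve the closest ones, and paste
# them into the prompt before generation.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday, 9am-5pm.",
    # ...the rest of your document chunks
]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 2) -> str:
    # Embed the question and grab the most similar chunks.
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, chunk_embeddings, top_k=top_k)[0]
    context = "\n".join(chunks[h["corpus_id"]] for h in hits)
    # Inject the retrieved context so the model answers from it, not from memory.
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How long do refunds take?"))
```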
🖥 Mistake #6: Only Running Models Locally
At first, I hosted everything locally, because it was free and felt "hackable." But I quickly hit some walls:
- Limited VRAM = can’t run larger models
- Can’t easily scale or share
- Harder to monitor/secure for production use
Now, I’m exploring hosted API services. Yes, they cost money. But:
- You can use larger models
- You can plug into real apps
- You can deploy publicly
It’s time to level up!
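Most hosted providers for open models expose an OpenAI-compatible endpoint, so the client code stays small. A sketch (the base URL, environment variable, and model name are placeholders for whichever provider you pick):

```python
# Sketch of calling a hosted, OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.your-provider.com/v1",   # provider-specific
    api_key=os.environ["PROVIDER_API_KEY"],        # placeholder env var
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # larger than what fits locally
    messages=[{"role": "user", "content": "Summarize this document for me."}],
)
print(response.choices[0].message.content)
```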
Final Thought
The open-source LLM ecosystem is evolving rapidly. It’s never been easier to get models running, but making them run well takes a bit of extra work.
Let me know if this helped or if you’re running into similar hurdles. I’ll be sharing more tips as I explore hosted APIs and production-ready RAG pipelines.
Hope this helps you avoid the same mistakes I made and helps you build better, faster!
This post was adapted from my original article on Medium. If you're interested in more insights and tips on working with local LLMs, feel free to check it out on Medium!