Have you ever wanted an AI assistant that understands your entire website and answers questions about it the way ChatGPT does? In this tutorial, we'll show you how to build one without training your own LLM or managing any backend.
We’ll use:
✅ Olostep to crawl and extract website content
✅ ChromaDB to store and search content embeddings with metadata
✅ OpenAI (Python SDK v1.7.6) for embeddings and GPT-4 summarization
✅ Streamlit to build a live chatbot UI
Perfect for product sites, documentation portals, and landing pages.
🔧 What You'll Need
pip install streamlit openai==1.7.6 chromadb requests
🧠 How It Works
- Crawl website pages using Olostep’s API
- Retrieve each page as Markdown and clean it
- Embed each page with OpenAI embeddings
- Store everything in ChromaDB (including metadata)
- Let users ask questions via Streamlit
- Query top matches and summarize answers with GPT
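The crawl-and-index half of this flow can be wired together in one function. The sketch below is one possible arrangement rather than the original code: `build_index` is a hypothetical name, and the four callables it receives are assumed to behave like the `start_crawl`, `wait_for_crawl`, `get_pages`, and `index_content` functions defined in the steps that follow.

```python
def build_index(url, start_crawl, wait_for_crawl, get_pages, index_content):
    """Run crawl -> wait -> list pages -> index, and return the page count.

    The step functions are passed in as arguments so the flow can be tried
    with stand-ins before any API keys are configured.
    """
    crawl_id = start_crawl(url)   # kick off the crawl and keep its ID
    wait_for_crawl(crawl_id)      # block until the crawl has finished
    pages = get_pages(crawl_id)   # list of crawled pages
    index_content(pages)          # embed and store each page
    return len(pages)
```

With the real functions in scope, a call would look like `build_index("https://example.com", start_crawl, wait_for_crawl, get_pages, index_content)`.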
🧩 Step-by-Step Implementation
1. Crawl Website
import requests

def start_crawl(url):
    # Crawl settings: which URL patterns to include, plus page and depth limits
    payload = {
        "start_url": url,
        "include_urls": ["/**"],
        "max_pages": 10,
        "max_depth": 3
    }
    headers = {"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"}
    res = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    res.raise_for_status()
    return res.json()["id"]
2. Wait and Retrieve Pages
import time

def wait_for_crawl(crawl_id):
    # Poll the crawl every 30 seconds until it reports "completed"
    while True:
        res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}", headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"})
        if res.json()["status"] == "completed":
            break
        time.sleep(30)

def get_pages(crawl_id):
    # List the pages collected by a finished crawl
    res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages", headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"})
    return res.json()["pages"]
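`wait_for_crawl` loops until the status is `completed`, so a crawl that never reaches that status would block forever. One way to bound the wait is a small generic polling helper. This is a sketch under assumptions: `poll_until` and its parameters are hypothetical names, and the helper knows nothing about Olostep's status values; the caller supplies the check.

```python
import time


def poll_until(check, interval=30.0, timeout=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True.

    Returns True if check() succeeded, or False if `timeout` seconds passed
    first. `sleep` and `clock` are injectable so the helper can be exercised
    without actually waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For the crawl above, `check` could be a function that fetches the crawl and returns whether its `status` field equals `"completed"`.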
3. Clean Markdown Content
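import re

def clean_markdown(markdown):
    # Drop heading, bullet, and blockquote markers, but only at the start of
    # a line, so text such as "a > b" or "2 * 3" is left alone
    markdown = re.sub(r'^[ \t]*(?:#+|\*|>) ', '', markdown, flags=re.MULTILINE)
    # Replace [text](url) links with just the link text
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)
    # Strip inline-code and emphasis markers
    markdown = re.sub(r'`|\*\*|_', '', markdown)
    # Collapse runs of blank lines into a single newline
    markdown = re.sub(r'\n{2,}', '\n', markdown)
    return markdown.strip()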
4. Initialize ChromaDB
import chromadb
from chromadb.utils import embedding_functions

# In-memory client: indexed data is lost when the process exits
chroma_client = chromadb.Client()
# Embed documents and queries with OpenAI's text-embedding-ada-002 model
openai_embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002"
)
collection = chroma_client.get_or_create_collection(
    name="website-content",
    embedding_function=openai_embed_fn
)
5. Index Content
def retrieve_markdown(retrieve_id):
    # Fetch the Markdown version of a crawled page by its retrieve_id
    res = requests.get("https://api.olostep.com/v1/retrieve",
                       headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"},
                       params={"retrieve_id": retrieve_id, "formats": ["markdown"]})
    return res.json().get("markdown_content", "")

def index_content(pages):
    for page in pages:
        try:
            markdown = retrieve_markdown(page["retrieve_id"])
            text = clean_markdown(markdown)
            # Skip pages that are empty or nearly empty after cleaning
            if len(text) > 20:
                collection.add(
                    documents=[text],
                    metadatas=[{"url": page["url"]}],
                    ids=[page["retrieve_id"]]
                )
                print(f"✅ Indexed: {page['url']}")
        except Exception as e:
            print(f"⚠️ Error: {e}")
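The Streamlit frontend in step 7 calls `query_website(question)`, which none of the steps define. A minimal sketch follows, assuming the `collection` created in step 4 is in scope; the `n_results` and `store` parameters are additions for illustration. It relies on ChromaDB's `collection.query`, which returns one list of documents per query text.

```python
def query_website(question, n_results=3, store=None):
    """Return the text of the indexed pages most similar to `question`.

    `store` defaults to the module-level `collection` from step 4; passing it
    explicitly lets the function work with any object exposing `.query()`.
    """
    if store is None:
        store = collection  # assumed: the ChromaDB collection from step 4
    results = store.query(query_texts=[question], n_results=n_results)
    # "documents" holds one list per query text; only one question was sent.
    documents = results.get("documents") or [[]]
    return documents[0]
```

Returning a plain list of strings keeps it compatible with `summarize_with_gpt`, which expects `chunks` to be a list of text passages.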
6. Summarize with GPT
import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def summarize_with_gpt(question, chunks):
    if not chunks:
        return "Sorry, I couldn't find enough information."
    # Separate chunks with blank lines so pages don't run together in the prompt
    context = "\n\n".join(chunks)
    prompt = f'''
Use the following website content to answer this question:
{context}
Q: {question}
A:
'''
    res = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )
    return res.choices[0].message.content
7. Streamlit Frontend
import streamlit as st

st.set_page_config(page_title="Website Chatbot", page_icon="💬")
st.title("💬 Website Chatbot")
st.caption("Ask anything based on your website content.")
question = st.chat_input("Ask your question...")
if question:
    with st.chat_message("user"):
        st.markdown(question)
    with st.spinner("Thinking..."):
        chunks = query_website(question)
        final_answer = summarize_with_gpt(question, chunks)
    with st.chat_message("assistant"):
        st.markdown(final_answer)
✅ Live Demo Preview
- Ask: What services do you offer?
- Ask: Where is your pricing page?
- Ask: How can I contact support?
The assistant will generate answers using real indexed content from your website.
🧠 Next Steps
- Save/load your ChromaDB collection
- Split large documents into smaller chunks
- Include source URLs in GPT responses
- Add memory to handle multi-turn chat
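As one example from the list above, splitting a long page into overlapping chunks before indexing can be sketched as follows. `chunk_text` and its default sizes are illustrative assumptions, not part of the original code; sizes are measured in characters rather than tokens to keep the sketch dependency-free.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so text near a boundary
    appears in both neighboring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Each chunk could then be added to the collection with its own ID (for example, the page's `retrieve_id` plus the chunk index) and the page URL as metadata.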
🎯 Conclusion
Congratulations! You've now built a fully functional AI-powered chatbot that can answer questions from your website using ChromaDB, Olostep, and OpenAI — all wrapped in a beautiful Streamlit app.
Whether for internal docs, support, or public knowledge, this gives you ChatGPT power without managing any LLMs.
Happy building! 🚀
https://gist.github.com/mdehsan873/f69481997f487e23b1d1282c82ce00f5