Hey there! 👋 I've been working with AI and web scraping for a while now, and I want to share some practical tips on how to make your documentation more AI-friendly. Instead of trying to block all bots (which never really works), let's focus on creating structured, reliable channels for beneficial AI while keeping the bad actors out.
📅 The LLM Knowledge Gap
Here's the thing about Large Language Models - they're only as good as their training data. Most LLMs have a knowledge cutoff date, meaning they don't know anything that happened after their last training session.
This creates a real problem when users ask about:
- New features released after the cutoff
- Breaking changes in recent updates
- Security patches and bug fixes
- Current best practices
That's why it's crucial to give LLMs direct access to your latest documentation. When an LLM can pull in your current docs on demand, it can provide accurate, up-to-date information instead of relying on potentially outdated training data.
🤔 Why AI Can't Read Your Docs
Before we dive into solutions, let's look at the most common roadblocks I've seen that prevent AI from properly accessing documentation:
- Client-Side Rendering: Some AI crawlers can't execute JavaScript, so if your content is rendered client-side, it's invisible to them
- Complex Navigation: JavaScript-based routing and hash navigation break traditional crawling
- Inconsistent Structure: Varying HTML layouts and tags make it hard for AI to understand content relationships
- Authentication Walls: Login requirements block automated access completely
- Rate Limiting: Aggressive rate limits prevent thorough documentation crawling
- Anti-Bot Measures: CAPTCHAs and similar tools block all automated access, good or bad
- Performance Issues: Slow loading times and timeouts interrupt the crawling process
The good news? All of these problems have solutions, and that's what we're going to cover next. Let's make your docs AI-friendly!
🚀 Roll Out the Welcome Mat for Good AI
Instead of making bots guess their way through your site, let's make it crystal clear who's welcome. Here's how:
- Allowlisting: Use robots.txt to explicitly allow trusted bots, and set up your WAF or firewall to recognize these good actors by their user agents and IP ranges.
- Double-Check: Don't just trust the user agent string (there's a quick verification sketch after this list). Verify bots by:
  - Checking their IP ownership
  - Using verified bot lists (like Cloudflare's)
  - Looking out for new verification standards
- Clear Signals: Use robots.txt to publicly declare which bots you're cool with, even if you're enforcing rules elsewhere.
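As a concrete example of that double-check step, here's a minimal Node.js sketch of the reverse-plus-forward DNS check that operators like Google document for their crawlers. The hostname suffixes are examples rather than an official list, and some operators publish static IP ranges instead - for those, compare against the published list rather than DNS.

```javascript
// Minimal sketch: verify a crawler's claimed identity with a reverse + forward DNS round trip.
// The allowed hostname suffixes are examples - use whatever each bot operator documents.
const dns = require('dns').promises;

async function verifyCrawlerIp(ip, allowedSuffixes) {
  try {
    const hostnames = await dns.reverse(ip); // IP -> hostname(s)
    for (const hostname of hostnames) {
      if (!allowedSuffixes.some((suffix) => hostname.endsWith(suffix))) continue;
      const { address } = await dns.lookup(hostname); // hostname -> IP
      if (address === ip) return true; // round trip matches: likely the real bot
    }
  } catch {
    // DNS errors just mean we can't verify the claim
  }
  return false;
}

// Example: is this really Googlebot?
verifyCrawlerIp('66.249.66.1', ['.googlebot.com', '.google.com']).then(console.log);
```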
🤖 Common AI Crawlers You'll Encounter
First things first - you need to know who's knocking at your door. While robots.txt isn't perfect, it's still the standard way to signal your intentions. Here's a quick reference of the main players:
Bot | User-Agent | What They Do | Allow Example | Block Example |
---|---|---|---|---|
OpenAI (ChatGPT) | GPTBot | Web crawling for model improvement | User-agent: GPTBot + Allow: / | User-agent: GPTBot + Disallow: / |
OpenAI (Plugin) | ChatGPT-User | Plugin actions | User-agent: ChatGPT-User + Allow: / | User-agent: ChatGPT-User + Disallow: / |
Google AI | Google-Extended | Gemini, Vertex AI | User-agent: Google-Extended + Allow: / | User-agent: Google-Extended + Disallow: / |
Anthropic | anthropic-ai | General crawling | User-agent: anthropic-ai + Allow: / | User-agent: anthropic-ai + Disallow: / |
Perplexity | PerplexityBot | Web crawling | User-agent: PerplexityBot + Allow: / | User-agent: PerplexityBot + Disallow: / |
Common Crawl | CCBot | Public web data | User-agent: CCBot + Allow: / | User-agent: CCBot + Disallow: / |
Meta (Facebook) | FacebookBot | Platform needs | User-agent: FacebookBot + Allow: / | User-agent: FacebookBot + Disallow: / |
Note: This list changes often, and remember - robots.txt is more of a suggestion than a rule.
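Putting the table into practice, here's what a minimal robots.txt sketch could look like. Which bots you welcome or block is entirely your call - the choices below are placeholders, not recommendations:

```
# Documentation-friendly crawlers we welcome
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

# A crawler we've chosen to keep out (example only)
User-agent: CCBot
Disallow: /

# Everyone else
User-agent: *
Allow: /
```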
📦 Give AI Direct Access to Your Data
The best way to ensure AI gets your docs right? Give them direct access to structured data. Here's how to do it:
1. Use llms.txt
The llms.txt standard is a game-changer for making your site AI-friendly. It's like a robots.txt for LLMs, but way more powerful. Here's why you should use it:
- Standardized Format: A simple markdown file that tells LLMs exactly how to use your site
- Easy to Implement: Just add a /llms.txt file to your root directory
- Human & Machine Readable: Works for both humans and AI
- Context-Aware: Helps LLMs understand your site's structure and purpose
- Markdown Support: Automatically serves markdown versions of your pages (just add .md to URLs)
Here's a quick example of what your llms.txt might look like:
# Your Project Name
> A brief description of what your project does and why it's awesome
## Documentation
- [Getting Started](https://your-site.com/getting-started.html.md): Quick start guide
- [API Reference](https://your-site.com/api.html.md): Complete API documentation
## Optional
- [Contributing Guide](https://your-site.com/contributing.html.md): How to contribute
The best part? It's super simple to implement and works with existing tools. Check out the llms.txt GitHub repo for more details and examples.
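If your docs platform doesn't handle the "add .md to the URL" trick for you, rolling a basic version yourself is straightforward. Here's a minimal Express sketch, assuming your markdown sources live in a docs/ folder that mirrors your page routes (both assumptions - adjust to your setup):

```javascript
// Minimal sketch: serve a markdown twin of any docs page when the URL ends in .md.
// Assumes markdown sources live in ./docs and mirror your HTML routes.
const express = require('express');
const path = require('path');
const fs = require('fs');

const app = express();

app.get(/\.md$/, (req, res) => {
  // e.g. "/getting-started.html.md" -> "docs/getting-started.md"
  const slug = req.path.replace(/\.html\.md$/, '.md').replace(/^\//, '');
  if (slug.includes('..')) return res.status(400).send('Bad request'); // no path traversal

  fs.readFile(path.join(__dirname, 'docs', slug), 'utf8', (err, markdown) => {
    if (err) return res.status(404).send('Not found');
    res.type('text/markdown').send(markdown);
  });
});

app.listen(3000);
```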
2. Build an API and MCP server
Create a public API specifically for your documentation. This gives you:
- Clean, structured data in JSON/Markdown
- More reliable than HTML scraping
- Better control over access
- Easier for AI devs to find and use
But here's the cool part - you can take it a step further by setting up an MCP (Model Context Protocol) server. This lets AI models request documentation directly from your source, ensuring they always get the latest version. Here's how it works:
// Example MCP server setup
const express = require('express');
const app = express();

// MCP endpoint for documentation
app.get('/mcp/docs', async (req, res) => {
  const { version, section } = req.query;

  // Fetch latest docs from your source
  const docs = await fetchLatestDocs(version, section);

  // Return in a format AI can easily consume
  res.json({
    content: docs,
    metadata: {
      lastUpdated: new Date(),
      version: version,
      source: 'your-docs-repo'
    }
  });
});

// Start the server
app.listen(3000, () => {
  console.log('MCP server running on port 3000');
});
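Strictly speaking, the snippet above is a plain HTTP endpoint that serves docs - useful, but not yet the Model Context Protocol itself. To actually speak MCP, you'd typically reach for the official SDK. Here's a rough sketch with @modelcontextprotocol/sdk (run as an ES module); fetchLatestDocs is stubbed as a hypothetical helper, and the SDK's API is still evolving, so check its docs for the exact signatures:

```javascript
// Rough sketch of an MCP server using the official SDK (plain JavaScript, ESM).
// fetchLatestDocs() is a stand-in for however you load docs from your source of truth.
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';

// Hypothetical helper - replace with a real lookup against your docs repo.
const fetchLatestDocs = async (version, section) =>
  `# Docs for ${section} (version ${version})`;

const server = new McpServer({ name: 'your-docs', version: '1.0.0' });

// Expose the docs as a tool the model can call with a version and section.
server.tool(
  'get_docs',
  { version: z.string(), section: z.string() },
  async ({ version, section }) => ({
    content: [{ type: 'text', text: await fetchLatestDocs(version, section) }],
  })
);

// stdio suits local clients (e.g. desktop AI apps); HTTP-based transports exist too.
await server.connect(new StdioServerTransport());
```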
The benefits of using an MCP server:
- Always Fresh: AI gets the latest docs directly from your source
- Version Control: Specify which version of docs you want
- Structured Access: Clean, consistent data format
- Analytics: Track which parts of your docs are most useful to AI
You can even combine this with your llms.txt file to point AI to your MCP endpoints:
# Your Project Docs
> Latest documentation available via MCP
## API Endpoints
- [MCP Documentation Server](https://your-site.com/mcp/docs): Get the latest docs
- [MCP Search](https://your-site.com/mcp/search): Search across all versions
This approach gives you the best of both worlds - structured access through an API and real-time updates through MCP. The AI can request exactly what it needs, when it needs it, and you maintain complete control over the content.
Just look at how many companies and third-party tool builders are already leveraging MCP servers in the MCP Server Directory.
3. Provide Direct Markdown Access
Sometimes the simplest solution is the best. You can make your documentation directly available as markdown files that AI can easily consume. Here's how:
- Single File Approach: Create a comprehensive docs.md that links to all your documentation
- Downloadable Archive: Offer a .zip of all your docs in markdown format
- GitHub/GitLab: Host your docs in a public repository (many LLMs can read these directly)
Here's a quick example of a well-structured markdown index:
# Project Documentation
## Core Features
- [Getting Started](getting-started.md)
- [API Reference](api.md)
- [Configuration](config.md)
## Advanced Topics
- [Architecture](architecture.md)
- [Performance Tuning](performance.md)
- [Security](security.md)
However, there are some challenges to watch out for:
- Version Control: Markdown files can get out of sync with your main docs
- Context Limits: Some docs might exceed LLM context windows (e.g., GPT-4's 128k limit)
- Formatting Issues: Complex markdown features might not render correctly in all LLMs
- Link Management: Relative links might break when files are moved or downloaded
To mitigate these issues:
- Use a docs-as-code approach with automated sync
- Split large docs into smaller, focused files
- Keep markdown simple and well-structured
- Use absolute URLs for external links
- Include a last-updated timestamp in each file (see the sketch below)
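For the sync-and-timestamp points, even a tiny script in CI goes a long way. Here's a minimal sketch that stamps every markdown file in a docs/ folder with a last-updated comment - the folder name and comment format are just assumptions:

```javascript
// Minimal sketch: stamp each markdown file in ./docs with a last-updated comment.
// Intended to run from CI whenever docs change. Top-level files only; recurse if you nest folders.
const fs = require('fs');
const path = require('path');

const DOCS_DIR = path.join(__dirname, 'docs');

for (const name of fs.readdirSync(DOCS_DIR)) {
  if (!name.endsWith('.md')) continue;

  const file = path.join(DOCS_DIR, name);
  const body = fs
    .readFileSync(file, 'utf8')
    .replace(/^<!-- last-updated: .* -->\n/, ''); // drop any previous stamp

  fs.writeFileSync(file, `<!-- last-updated: ${new Date().toISOString()} -->\n` + body);
}

console.log('Stamped all markdown files in docs/');
```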
🏗️ Structure Your Content for Machines
Make your docs easy for machines to understand:
- Use Semantic HTML: <article>, <section>, <nav> - not just endless <div>s
- Add Schema.org: Give machines explicit semantic info
- Keep It Clean: Remove ads, repeated nav links, and cookie banners from the main content
🔧 Fix Technical Roadblocks
Make sure bots can actually get to your content:
- Rendering: If you're heavy on JavaScript, use SSR or Dynamic Rendering (there's a rough sketch after this list)
- Navigation: Use standard <a> tags and provide sitemaps
- Code Examples: Make them accessible text with clear language tags
- Versioning: Make it clear in URLs and metadata
- Speed: Optimize for quick loading
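Here's that rendering sketch: if full SSR isn't in the cards, a common stopgap is dynamic rendering - detect known crawler user agents and hand them prerendered HTML. The user-agent list and renderPageToHtml() below are placeholders for your own prerender setup:

```javascript
// Rough sketch of dynamic rendering: prerendered HTML for known crawlers,
// the normal client-side app for everyone else.
const express = require('express');
const app = express();

const CRAWLER_UAS = /GPTBot|ChatGPT-User|Google-Extended|anthropic-ai|PerplexityBot|CCBot/i;

async function renderPageToHtml(url) {
  // Placeholder: call your prerender service or headless browser here.
  return `<html><body><h1>Prerendered ${url}</h1></body></html>`;
}

app.use(async (req, res, next) => {
  if (CRAWLER_UAS.test(req.get('user-agent') || '')) {
    return res.send(await renderPageToHtml(req.originalUrl));
  }
  next(); // regular users fall through to the normal client-side app
});

app.use(express.static('dist')); // the usual SPA bundle

app.listen(3000);
```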
🛡️ Anti-Scraping vs. AI: What Works?
Here's the lowdown on different techniques and how they stack up against AI scrapers:
Technique | How It Works | Against Basic Bots | Against AI | What to Do |
---|---|---|---|---|
robots.txt | Voluntary rules | Moderate | Low | Use Allow for good bots; combine with other methods |
Meta Tags | Page-level instructions | N/A | N/A | Use sparingly; doesn't stop scraping |
CAPTCHA | Human verification | High | Moderate | Avoid on public docs; allowlist good bots |
Rate Limiting | IP-based limits | Moderate | Low | Higher limits for verified bots |
UA Filtering | Block by user agent | Low-Moderate | Very Low | Mainly for allowing good bots |
JS Rendering | Client-side content | High | Moderate | Use SSR for bots; provide APIs |
Complex Nav | Non-standard links | Moderate-High | Low-Moderate | Stick to standard links |
Inconsistent HTML | Varying structure | Low | Low-Moderate | Use consistent templates |
Login Walls | Authentication | Very High | High | Keep public access; API for trusted AI |
Fingerprinting | Browser ID | High | Moderate-High | Part of layered defense |
Behavior Analysis | Pattern detection | High | High | Tune carefully |
Bot Management | Multi-signal detection | Very High | High-Very High | Configure allowlisting |

Remember: The goal isn't to block all bots - it's to welcome the good ones while keeping the bad ones out. With these strategies, you can make your documentation both AI-friendly and secure.
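"Higher limits for verified bots" is easier said than done, so here's a bare-bones sketch of tiered rate limiting in Express. The in-memory counter, the numbers, and the isVerifiedBot() stub (which could reuse the DNS check from earlier) are all placeholder choices - a production setup would lean on your WAF or a shared store:

```javascript
// Bare-bones sketch of tiered rate limiting: verified bots get a larger budget.
const express = require('express');
const app = express();

const WINDOW_MS = 60_000; // 1-minute window
const hits = new Map(); // ip -> { count, windowStart }

function isVerifiedBot(req) {
  // Placeholder: combine user-agent checks with the reverse-DNS verification above.
  return false;
}

app.use((req, res, next) => {
  const limit = isVerifiedBot(req) ? 600 : 60; // requests per window (arbitrary numbers)
  const now = Date.now();

  const entry = hits.get(req.ip) || { count: 0, windowStart: now };
  if (now - entry.windowStart > WINDOW_MS) {
    entry.count = 0;
    entry.windowStart = now;
  }
  entry.count += 1;
  hits.set(req.ip, entry);

  if (entry.count > limit) return res.status(429).send('Too many requests');
  next();
});

app.listen(3000);
```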
🛠️ Tools That Bypass Restrictions
While we've focused on making your docs AI-friendly, it's worth noting that some companies are building tools to bypass these restrictions. For example, Firecrawl offers a service that can:
- Handle JavaScript-heavy sites
- Bypass rate limiting
- Parse dynamic content
- Convert web pages to LLM-ready data
However, these services come with a subscription cost and can be blocked by more sophisticated anti-bot measures. That's why implementing the solutions we've discussed is still the best long-term approach - it's more reliable, cost-effective, and maintains control over how your documentation is accessed.
Thanks for reading! If you found this guide helpful, give it a ⭐️, and feel free to share it with others. If you notice any inaccuracies, please do let me know. I welcome all feedback. Hit me up on X/Twitter.
P.S. Don't forget to check out the MCP Server Directory to see how others are implementing these strategies in the wild.