Making Your Documentation AI-Friendly: A Practical Guide

Hey there! 👋 I've been working with AI and web scraping for a while now, and I want to share some practical tips on how to make your documentation more AI-friendly. Instead of trying to block all bots (which never really works), let's focus on creating structured, reliable channels for beneficial AI while keeping the bad actors out.

 

📅 The LLM Knowledge Gap

Here's the thing about Large Language Models - they're only as good as their training data. Most LLMs have a knowledge cutoff date, meaning they don't know anything that happened after their last training session.

This creates a real problem when users ask about:

  • New features released after the cutoff
  • Breaking changes in recent updates
  • Security patches and bug fixes
  • Current best practices

That's why it's crucial to give LLMs direct access to your latest documentation. When an LLM can pull in your current docs on demand, it can provide accurate, up-to-date information instead of relying on potentially outdated training data.

 

🤔 Why AI Can't Read Your Docs

Before we dive into solutions, let's look at the most common roadblocks I've seen that prevent AI from properly accessing documentation:

  • Client-Side Rendering: Some AI crawlers can't execute JavaScript, so if your content is rendered client-side, it's invisible to them
  • Complex Navigation: JavaScript-based routing and hash navigation break traditional crawling
  • Inconsistent Structure: Varying HTML layouts and tags make it hard for AI to understand content relationships
  • Authentication Walls: Login requirements block automated access completely
  • Rate Limiting: Aggressive rate limits prevent thorough documentation crawling
  • Anti-Bot Measures: CAPTCHAs and similar tools block all automated access, good or bad
  • Performance Issues: Slow loading times and timeouts interrupt the crawling process

The good news? All of these problems have solutions, and that's what we're going to cover next. Let's make your docs AI-friendly!

 

🚀 Roll Out the Welcome Mat for Good AI

Instead of making bots guess their way through your site, let's make it crystal clear who's welcome. Here's how:

  • Allowlisting: Use robots.txt to explicitly allow trusted bots. Set up your WAF or firewall to recognize these good actors by their user agents and IP ranges.
  • Double-Check: Don't just trust the user agent string. Verify bots (see the sketch after this list) by:
    • Checking their IP ownership
    • Using verified bot lists (like Cloudflare's)
    • Looking out for new verification standards
  • Clear Signals: Use robots.txt to publicly declare which bots you're cool with, even if you're enforcing rules elsewhere.
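
To make that verification step concrete, here's a minimal Node.js sketch of the reverse-plus-forward DNS check. It only works for crawlers that support DNS verification (Googlebot and Bingbot do); for others, such as GPTBot, compare the client IP against the ranges the vendor publishes in its bot documentation. The TRUSTED_SUFFIXES list below is illustrative, not exhaustive.

// verify-bot.js - reverse/forward DNS check for crawlers that support it
const dns = require('dns').promises;

// Illustrative suffixes only - check each vendor's docs for the real list
const TRUSTED_SUFFIXES = ['.googlebot.com', '.google.com', '.search.msn.com'];

async function isVerifiedCrawler(ip) {
  try {
    // 1. Reverse DNS: which hostname does this IP map back to?
    const hostnames = await dns.reverse(ip);
    const hostname = hostnames.find(h =>
      TRUSTED_SUFFIXES.some(suffix => h.endsWith(suffix))
    );
    if (!hostname) return false;

    // 2. Forward-confirm: does that hostname resolve back to the same IP?
    const addresses = await dns.resolve4(hostname);
    return addresses.includes(ip);
  } catch {
    return false; // lookup failed - treat as unverified
  }
}

// Usage (with an example IP): isVerifiedCrawler('66.249.66.1').then(console.log);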

 

🤖 Common AI Crawlers You'll Encounter

First things first - you need to know who's knocking at your door. While robots.txt isn't perfect, it's still the standard way to signal your intentions. Here's a quick reference of the main players:

| Bot | User-Agent | What They Do | Allow Example | Block Example |
|-----|-----------|--------------|---------------|---------------|
| OpenAI (ChatGPT) | GPTBot | Web crawling for model improvement | User-agent: GPTBot<br>Allow: / | User-agent: GPTBot<br>Disallow: / |
| OpenAI (Plugin) | ChatGPT-User | Plugin actions | User-agent: ChatGPT-User<br>Allow: / | User-agent: ChatGPT-User<br>Disallow: / |
| Google AI | Google-Extended | Gemini, Vertex AI | User-agent: Google-Extended<br>Allow: / | User-agent: Google-Extended<br>Disallow: / |
| Anthropic | anthropic-ai | General crawling | User-agent: anthropic-ai<br>Allow: / | User-agent: anthropic-ai<br>Disallow: / |
| Perplexity | PerplexityBot | Web crawling | User-agent: PerplexityBot<br>Allow: / | User-agent: PerplexityBot<br>Disallow: / |
| Common Crawl | CCBot | Public web data | User-agent: CCBot<br>Allow: / | User-agent: CCBot<br>Disallow: / |
| Facebook | FacebookBot | Platform needs | User-agent: FacebookBot<br>Allow: / | User-agent: FacebookBot<br>Disallow: / |

Note: This list changes often, and remember - robots.txt is more of a suggestion than a rule.
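
For reference, here's how those per-bot rules can sit together in a single robots.txt. The exact mix of allowed and blocked bots is just an example policy - adjust it to your own needs:

# Welcome the AI assistants you trust
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

# Example of opting one crawler out
User-agent: CCBot
Disallow: /

# Everyone else: crawl the docs, stay out of internal areas
User-agent: *
Disallow: /admin/

Sitemap: https://your-site.com/sitemap.xml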

 

📦 Give AI Direct Access to Your Data

The best way to ensure AI gets your docs right? Give them direct access to structured data. Here's how to do it:

1. Use llms.txt

The llms.txt standard is a game-changer for making your site AI-friendly. It's like a robots.txt for LLMs, but way more powerful. Here's why you should use it:

  • Standardized Format: A simple markdown file that tells LLMs exactly how to use your site
  • Easy to Implement: Just add a /llms.txt file to your root directory
  • Human & Machine Readable: Works for both humans and AI
  • Context-Aware: Helps LLMs understand your site's structure and purpose
  • Markdown Support: The spec encourages serving markdown versions of your pages (typically by appending .md to the URL)

Here's a quick example of what your llms.txt might look like:

# Your Project Name

> A brief description of what your project does and why it's awesome

## Documentation

- [Getting Started](https://your-site.com/getting-started.html.md): Quick start guide
- [API Reference](https://your-site.com/api.html.md): Complete API documentation

## Optional

- [Contributing Guide](https://your-site.com/contributing.html.md): How to contribute

The best part? It's super simple to implement and works with existing tools. Check out the llms.txt GitHub repo for more details and examples.

2. Build an API and MCP server

Create a public API specifically for your documentation. This gives you:

  • Clean, structured data in JSON/Markdown
  • More reliable than HTML scraping
  • Better control over access
  • Easier for AI devs to find and use

But here's the cool part - you can take it a step further by setting up an MCP (Model Context Protocol) server. This lets AI models request documentation directly from your source, ensuring they always get the latest version. Here's a simplified sketch of the idea as a plain HTTP endpoint (a full MCP server would expose this same lookup as an MCP tool):

// Simplified docs endpoint - the kind of lookup an MCP server would expose as a tool
const express = require('express');
const app = express();

// Documentation endpoint
app.get('/mcp/docs', async (req, res) => {
  const { version, section } = req.query;

  // fetchLatestDocs() is a placeholder for however you load docs
  // (your git repo, CMS, database, ...)
  const docs = await fetchLatestDocs(version, section);

  // Return in a format AI can easily consume
  res.json({
    content: docs,
    metadata: {
      lastUpdated: new Date(),
      version: version,
      source: 'your-docs-repo'
    }
  });
});

// Start the server
app.listen(3000, () => {
  console.log('Docs server running on port 3000');
});

The benefits of using an MCP server:

  • Always Fresh: AI gets the latest docs directly from your source
  • Version Control: Specify which version of docs you want
  • Structured Access: Clean, consistent data format
  • Analytics: Track which parts of your docs are most useful to AI (see the logging sketch below)
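
For that analytics point, even a small piece of Express middleware in front of the docs endpoint will tell you which sections AI clients ask for most. A minimal sketch, building on the app from the example above (the user-agent list and log format are placeholders for your own setup):

// Log requests to the docs endpoint that come from known AI user agents
// Register this before the /mcp/docs route so it runs first
const AI_AGENTS = ['GPTBot', 'ChatGPT-User', 'Google-Extended', 'PerplexityBot', 'anthropic-ai', 'CCBot'];

app.use('/mcp/docs', (req, res, next) => {
  const ua = req.get('User-Agent') || '';
  const agent = AI_AGENTS.find(name => ua.includes(name));
  if (agent) {
    // Swap console.log for your real analytics pipeline
    console.log(JSON.stringify({
      agent,
      section: req.query.section || 'unknown',
      version: req.query.version || 'latest',
      at: new Date().toISOString()
    }));
  }
  next();
});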

You can even combine this with your llms.txt file to point AI to your MCP endpoints:

# Your Project Docs

> Latest documentation available via MCP

## API Endpoints

- [MCP Documentation Server](https://your-site.com/mcp/docs): Get the latest docs
- [MCP Search](https://your-site.com/mcp/search): Search across all versions

This approach gives you the best of both worlds - structured access through an API and real-time updates through MCP. The AI can request exactly what it needs, when it needs it, and you maintain complete control over the content.

Just look at how many companies and third-party tool builders are already leveraging MCP servers - browse the MCP Server Directory for examples.

3. Provide Direct Markdown Access

Sometimes the simplest solution is the best. You can make your documentation directly available as markdown files that AI can easily consume. Here's how:

  • Single File Approach: Create a comprehensive docs.md that links to all your documentation
  • Downloadable Archive: Offer a .zip of all your docs in markdown format
  • GitHub/GitLab: Host your docs in a public repository (many LLMs can read these directly)

Here's a quick example of a well-structured markdown index:

# Project Documentation

## Core Features
- [Getting Started](getting-started.md)
- [API Reference](api.md)
- [Configuration](config.md)

## Advanced Topics
- [Architecture](architecture.md)
- [Performance Tuning](performance.md)
- [Security](security.md)

However, there are some challenges to watch out for:

  • Version Control: Markdown files can get out of sync with your main docs
  • Context Limits: Some docs might exceed LLM context windows (e.g., GPT-4 Turbo's 128k-token window)
  • Formatting Issues: Complex markdown features might not render correctly in all LLMs
  • Link Management: Relative links might break when files are moved or downloaded

To mitigate these issues:

  • Use a docs-as-code approach with automated sync
  • Split large docs into smaller, focused files
  • Keep markdown simple and well-structured
  • Use absolute URLs for external links
  • Include a last-updated timestamp in each file (see the sketch after this list)
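
As one way to automate that last point, here's a small Node.js sketch that stamps every markdown file in a docs folder with a last-updated line as part of a build step. The docs/ directory and the "_Last updated_" format are assumptions - adapt them to your pipeline (and in a real docs-as-code setup you'd likely take the date from git history rather than the file's mtime):

// stamp-docs.js - prepend a last-updated line to each markdown file in ./docs
const fs = require('fs');
const path = require('path');

const DOCS_DIR = path.join(__dirname, 'docs');

for (const file of fs.readdirSync(DOCS_DIR)) {
  if (!file.endsWith('.md')) continue;

  const filePath = path.join(DOCS_DIR, file);
  const updated = fs.statSync(filePath).mtime.toISOString().slice(0, 10);

  // Strip any stamp from a previous run so they don't pile up
  const body = fs.readFileSync(filePath, 'utf8')
    .replace(/^_Last updated: \d{4}-\d{2}-\d{2}_\n\n/, '');

  fs.writeFileSync(filePath, `_Last updated: ${updated}_\n\n${body}`);
}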

 

🏗️ Structure Your Content for Machines

Make your docs easy for machines to understand:

  • Use Semantic HTML: structure pages with elements like <article>, <section>, and <nav> - not just endless <div>s
  • Add Schema.org: Give machines explicit semantic info (a markup sketch follows this list)
  • Keep It Clean: Remove ads, repeated nav links, and cookie banners from the main content
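
To make the first two points concrete, here's a small illustrative snippet: semantic elements for the page structure, plus a Schema.org TechArticle block as JSON-LD. The property values are placeholders:

<main>
  <article>
    <h1>Getting Started</h1>
    <section id="installation">
      <h2>Installation</h2>
      <p>Install the package with your package manager of choice.</p>
    </section>
  </article>
  <nav aria-label="Docs navigation">
    <a href="/docs/getting-started">Getting Started</a>
    <a href="/docs/api">API Reference</a>
  </nav>

  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    "headline": "Getting Started",
    "dateModified": "2024-01-01",
    "author": { "@type": "Organization", "name": "Your Project" }
  }
  </script>
</main>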

🔧 Fix Technical Roadblocks

Make sure bots can actually get to your content:

  • Rendering: If you're heavy on JavaScript, use SSR or dynamic rendering so crawlers get real HTML (see the sketch after this list)
  • Navigation: Use standard <a> links and provide sitemaps
  • Code Examples: Make them accessible as plain text with clear language tags
  • Versioning: Make it clear in URLs and metadata
  • Speed: Optimize for quick loading
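
One low-effort way to cover the rendering point is dynamic rendering: detect known crawler user agents and serve them a prerendered (or markdown) version of the page, while regular visitors get the JavaScript app. A rough Express sketch - the bot list and the prerendered/ and dist/ paths are assumptions about your build setup:

// Serve prerendered pages to known crawlers, the app shell to everyone else
const express = require('express');
const path = require('path');
const app = express();

const BOT_AGENTS = ['GPTBot', 'ChatGPT-User', 'Google-Extended', 'PerplexityBot', 'CCBot'];

app.get('/docs/:page', (req, res) => {
  const ua = req.get('User-Agent') || '';
  const isBot = BOT_AGENTS.some(name => ua.includes(name));

  if (isBot) {
    // Static HTML built ahead of time by your docs pipeline
    res.sendFile(path.join(__dirname, 'prerendered', `${req.params.page}.html`));
  } else {
    // Regular users get the client-side rendered app
    res.sendFile(path.join(__dirname, 'dist', 'index.html'));
  }
});

app.listen(3000);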

     

🛡️ Anti-Scraping vs. AI: What Works?

Here's the lowdown on different techniques and how they stack up against AI scrapers:

| Technique | How It Works | Effectiveness vs. Basic Bots | Effectiveness vs. AI Scrapers | What to Do |
|-----------|--------------|------------------------------|-------------------------------|------------|
| robots.txt | Voluntary rules | Moderate | Low | Use Allow for good bots; combine with other methods |
| Meta Tags | Page-level instructions | N/A | N/A | Use sparingly; doesn't stop scraping |
| CAPTCHA | Human verification | High | Moderate | Avoid on public docs; allowlist good bots |
| Rate Limiting | IP-based limits | Moderate | Low | Higher limits for verified bots |
| UA Filtering | Block by user agent | Low-Moderate | Very Low | Mainly for allowing good bots |
| JS Rendering | Client-side content | High | Moderate | Use SSR for bots; provide APIs |
| Complex Nav | Non-standard links | Moderate-High | Low-Moderate | Stick to standard links |
| Inconsistent HTML | Varying structure | Low | Low-Moderate | Use consistent templates |
| Login Walls | Authentication | Very High | High | Keep public access; API for trusted AI |
| Fingerprinting | Browser ID | High | Moderate-High | Part of layered defense |
| Behavior Analysis | Pattern detection | High | High | Tune carefully |
| Bot Management | Multi-signal detection | Very High | High-Very High | Configure allowlisting |

     

Remember: The goal isn't to block all bots - it's to welcome the good ones while keeping the bad ones out. With these strategies, you can make your documentation both AI-friendly and secure.

     

🛠️ Tools That Bypass Restrictions

While we've focused on making your docs AI-friendly, it's worth noting that some companies are building tools to bypass these restrictions. For example, Firecrawl offers a service that can:

  • Handle JavaScript-heavy sites
  • Bypass rate limiting
  • Parse dynamic content
  • Convert web pages to LLM-ready data

However, these services come with a subscription cost and can be blocked by more sophisticated anti-bot measures. That's why implementing the solutions we've discussed is still the best long-term approach - it's more reliable, cost-effective, and keeps you in control of how your documentation is accessed.

     


Thanks for reading! If you found this guide helpful, give it a ⭐️ and feel free to share it with others. If you notice any inaccuracies, please let me know - I welcome all feedback. Hit me up on X/Twitter.

P.S. Don't forget to check out the MCP Server Directory to see how others are implementing these strategies in the wild.