We've built a system that extends LLM conversations by intelligently compacting conversation history, which can also cut token usage and improve response times. Here's how context compaction works under the hood. #LLM #AI #DevTools

While debugging an API integration, I hit the familiar "context window limit" error in my LLM assistant. With valuable error analysis and partial solutions already in the conversation, I was forced to start a new session and lose all of that context.

This common frustration inspired us to develop a solution that could extend LLM conversations indefinitely without losing essential information. Today, I'm sharing Automatic Context Compaction in Forge, a system that reduces conversation history size while maintaining essential semantic information.

The Challenge of Context Management

When working on complex coding tasks, your conversation with an AI assistant can quickly grow to include:

  • Multiple rounds of questions and answers
  • Code snippets and explanations
  • Tool calls and their results
  • Debugging sessions and error analysis

As this context grows, you face several issues:

  • You hit token limits, forcing you to start new conversations
  • The cost of API calls increases with token usage
  • Response times slow down with larger contexts
  • The assistant loses focus on the most recent and relevant parts of the conversation

Enter Automatic Context Compaction

Forge has implemented an elegant solution to this problem with the Automatic Context Compaction feature. This mechanism intelligently manages your conversation history, ensuring you get the most out of your LLM interactions without sacrificing quality.

How It Works: The Technical Implementation

The context compaction system operates on these core principles (a minimal sketch of how they fit together follows the list):

  1. Efficient Token Monitoring: Our token counter estimates conversation size using a logarithmic sampling approach, avoiding the performance hit of counting every token.

  2. Pattern-Based Sequence Identification: The algorithm identifies compactible message sequences using a sliding window approach that looks for specific patterns:

[Assistant Message] → [Tool Call] → [Tool Result] → [Assistant Message]

  3. Context-Aware Summarization: Rather than summarizing the entire conversation, we only compact specific sequences. The compaction uses a specialized prompt that instructs the model to create a comprehensive assessment including:

    • Primary objectives and success criteria
    • Information categorization and key elements
    • File changes tracking
    • Action logs of important operations
    • Technical details and relationships

  4. Semantic Structure Preservation: User messages remain untouched, maintaining the conversational structure while only compressing assistant outputs.

  5. Controlled Information Retention: Each summary undergoes an entropy analysis to ensure information density stays within acceptable parameters.
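
To make the flow concrete, here is the minimal orchestration sketch promised above. It is deliberately generic: the message type and the three strategy functions are parameters, because Forge's concrete types and helpers aren't shown in this post. Treat it as an illustration of how the principles fit together, not the actual implementation.

// Illustrative only: `M` stands in for Forge's message type, and the three
// closures stand in for the token estimator, sequence detector, and summarizer.
fn maybe_compact<M, E, I, S>(
    messages: &mut Vec<M>,
    token_threshold: usize,
    retention_window: usize,
    estimate_tokens: E,    // principle 1: cheap token estimate
    identify_sequences: I, // principle 2: pattern-based sequence detection
    summarize: S,          // principles 3-5: turn one sequence into one summary message
) where
    E: Fn(&[M]) -> usize,
    I: Fn(&[M], usize) -> Vec<(usize, usize)>, // inclusive (start, end) ranges
    S: Fn(&[M]) -> M,
{
    // Nothing to do while the estimated context size is under the threshold.
    if estimate_tokens(messages.as_slice()) < token_threshold {
        return;
    }

    // The detector skips user messages and leaves the most recent
    // `retention_window` messages untouched.
    let sequences = identify_sequences(messages.as_slice(), retention_window);

    // Replace each compactible run with a single summary message, working
    // back to front so earlier indices stay valid.
    for (start, end) in sequences.into_iter().rev() {
        let summary = summarize(&messages[start..=end]);
        messages.drain(start..=end);
        messages.insert(start, summary);
    }
}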

Visual Representation of the Process:

BEFORE COMPACTION:
┌─────────────────────────────┐
│ User: Initial question      │
├─────────────────────────────┤
│ Assistant: First response   │◄──┐
├─────────────────────────────┤   │
│ Assistant: Tool call        │   │
├─────────────────────────────┤   │ Compactible
│ System: Tool result (300KB) │   │ Sequence
├─────────────────────────────┤   │
│ Assistant: Tool analysis    │◄──┘
├─────────────────────────────┤
│ User: Follow-up question    │
├─────────────────────────────┤
│ Assistant: Latest response  │ ◄── In retention window (preserved)
└─────────────────────────────┘

AFTER COMPACTION:
┌─────────────────────────────┐
│ User: Initial question      │
├─────────────────────────────┤
│ System: Compressed Summary  │ ◄── ~90% token reduction
│ - Key code patterns found   │
│ - Fixed authentication issue│
│ - Found 3 vulnerabilities   │
├─────────────────────────────┤
│ User: Follow-up question    │
├─────────────────────────────┤
│ Assistant: Latest response  │ ◄── Preserved in retention window
└─────────────────────────────┘

Key Features

  • Multiple Trigger Options:

    • Token threshold: Compacts when the estimated token count exceeds a limit
    • Turn threshold: Compacts after a certain number of conversation turns
    • Message threshold: Compacts when the message count exceeds a limit

  • Configurable Retention Window: Preserves the most recent messages by keeping them out of the compaction process

  • Smart Selective Compaction: Only compresses sequences of consecutive assistant messages and tool results, while preserving user messages

  • Tag-Based Extraction: Supports extracting specific content from summaries using tags (see the short extraction sketch after this list)

  • Model Selection: Use a different (potentially cheaper and faster) model for compaction than your primary conversation model
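
Tag-based extraction is the easiest of these to picture in code. The sketch below assumes the summarizer wraps its output in a configurable tag (here <summary>), with the tag name supplied by the summary_tag setting described later; the exact tag format is an assumption for illustration, not Forge's wire format.

/// Extracts the content wrapped in `<tag>...</tag>` from the summarizer's raw
/// output, falling back to the full text if no well-formed tag pair is found.
fn extract_tagged_summary(raw: &str, tag: &str) -> String {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");

    match (raw.find(&open), raw.rfind(&close)) {
        (Some(start), Some(end)) if start + open.len() <= end => {
            raw[start + open.len()..end].trim().to_string()
        }
        // No tag pair: keep the whole response rather than lose information.
        _ => raw.trim().to_string(),
    }
}

fn main() {
    let raw = "Some preamble.\n<summary>Fixed auth bug; 3 issues remain.</summary>";
    assert_eq!(
        extract_tagged_summary(raw, "summary"),
        "Fixed auth bug; 3 issues remain."
    );
}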

How to Try It Out

Ready to give it a spin? It takes only a small addition to your forge.yaml configuration file. Here's a sample configuration:

commands:
  - name: fixme
    description: Looks for all the fixme comments in the code and attempts to fix them
    value: |
      Find all the FIXME comments in source-code files and attempt to fix them.

agents:
  - id: software-engineer
    max_walker_depth: 1024
    subscribe:
      - fixme
    compact:
      max_tokens: 2000
      token_threshold: 80000
      model: google/gemini-2.0-flash-001
      retention_window: 6
      prompt: "{{> system-prompt-context-summarizer.hbs }}"

Let's break down the compaction configuration:

  • max_tokens: Maximum allowed tokens for the summary (2000)
  • token_threshold: Triggers compaction when the context exceeds 80K tokens
  • model: Uses Gemini 2.0 Flash for compaction (efficient and cost-effective)
  • retention_window: Preserves the 6 most recent messages from compaction
  • prompt: Uses the built-in summarizer template for generating summaries

Configuration Options

The compact configuration section supports these parameters (an illustrative snippet combining several of them follows the list):

  • max_tokens: Maximum token limit for the summary
  • token_threshold: Token count that triggers compaction
  • turn_threshold: Conversation turn count that triggers compaction
  • message_threshold: Message count that triggers compaction
  • retention_window: Number of recent messages to preserve
  • model: Model to use for compaction
  • prompt: Custom prompt template for summarization
  • summary_tag: Tag name to extract content from when summarizing
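
As an illustration, a compact section (inside an agent definition) that leans on the turn and message triggers and tag-based extraction might look like the snippet below. The values are made up for the example, and showing several triggers together is purely illustrative; pick whichever trigger best fits your workflow.

compact:
  max_tokens: 2000
  turn_threshold: 20         # compact after 20 conversation turns
  message_threshold: 200     # or once 200 messages have accumulated
  retention_window: 6
  model: google/gemini-2.0-flash-001
  summary_tag: summary       # keep only the content inside <summary>...</summary>
  prompt: "{{> system-prompt-context-summarizer.hbs }}"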

Expected Benefits

Automatic Context Compaction offers several potentially significant advantages for LLM-assisted development tasks. While we're still gathering comprehensive metrics from early users, these are the key benefits we anticipate:

  • Extended conversation sessions: Continue complex debugging or development tasks without hitting context limits
  • Reduced token consumption: Lower API costs by eliminating redundant or less relevant context
  • Improved response times: Smaller context windows typically lead to faster model responses
  • Better context management: Focus the model on the most relevant parts of the conversation
  • More coherent assistance: Reduce the need to repeat information across multiple sessions

As we collect more user data, we'll share concrete metrics on how these benefits translate to real-world improvements. Initial feedback has been promising, with users reporting they can work through entire debugging sessions without the frustrating context resets that previously interrupted their workflow.

One user working on refactoring a legacy authentication system noted that what previously required multiple separate conversations could be completed in a single extended session with compaction enabled. The continuity significantly improved problem-solving, as the assistant maintained awareness of earlier discoveries throughout the debugging process.

Early User Feedback

Initial feedback from developers has been encouraging:

  1. Extended work sessions: "I've been able to work through debugging sessions without interruption - no more starting over due to context limits."

  2. Potential cost savings: Some users report they're using fewer tokens overall when working on complex tasks.

  3. Subjective speed improvements: Users note that responses often arrive more quickly with compacted contexts.

  4. Better context retention: "The assistant remained coherent throughout my debugging session - it remembered key information discussed earlier without repetition."

We're actively collecting more structured data on these benefits and will share detailed metrics in future updates as our user base expands.

Under The Hood: Engineering Challenges & Solutions

Building an effective context compaction system presented several non-trivial engineering challenges:

1. Determining What to Compact

We initially experimented with three approaches to sequence identification:

// Approach 1: Simple token-based chunking (rejected)
fn chunk_by_token_count(messages: &[Message], chunk_size: usize) -> Vec<MessageChunk> {
    // Split messages into fixed-size chunks
    // Problem: Breaks semantic units, disrupting conversation flow
}

// Approach 2: Time-based windowing (rejected)
fn chunk_by_time_window(messages: &[Message], window_hours: f64) -> Vec<MessageChunk> {
    // Group messages by time periods
    // Problem: Conversation intensity varies, leading to uneven chunks
}

// Approach 3: Pattern-based sequence detection (implemented)
fn identify_compactible_sequences(messages: &[Message]) -> Vec<MessageSequence> {
    // Identify patterns like: [Assistant] → [Tool Call] → [Tool Result] → [Assistant]
    // Benefit: Preserves semantic units and conversational flow
}

The pattern-based approach proved most effective as it preserved the semantic integrity of the conversation while maximizing compressibility.
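
A simplified, self-contained version of the implemented approach is sketched below. The Role and Message types are stand-ins for Forge's internal types, and the exact pattern rules (minimum run length, how tool calls are represented) are assumptions for illustration.

#[derive(Clone, Copy, PartialEq)]
enum Role {
    User,
    Assistant,  // includes assistant messages that carry tool calls
    ToolResult,
}

struct Message {
    role: Role,
    content: String,
}

/// Returns inclusive (start, end) ranges of maximal runs of non-user messages:
/// consecutive assistant outputs and tool results that can be summarized as a
/// unit. User messages are never included, and the most recent
/// `retention_window` messages are excluded up front.
fn identify_compactible_sequences(
    messages: &[Message],
    retention_window: usize,
) -> Vec<(usize, usize)> {
    let limit = messages.len().saturating_sub(retention_window);
    let mut sequences = Vec::new();
    let mut run_start: Option<usize> = None;

    for (i, message) in messages[..limit].iter().enumerate() {
        match message.role {
            // Non-user messages open or extend the current run.
            Role::Assistant | Role::ToolResult => {
                run_start.get_or_insert(i);
            }
            // A user message closes the current run, if any.
            Role::User => {
                if let Some(start) = run_start.take() {
                    if i - start >= 2 {
                        // Runs shorter than two messages aren't worth summarizing.
                        sequences.push((start, i - 1));
                    }
                }
            }
        }
    }
    // A run may extend right up to the edge of the retention window.
    if let Some(start) = run_start {
        if limit - start >= 2 {
            sequences.push((start, limit - 1));
        }
    }
    sequences
}

A function with this shape is what the orchestration sketch earlier in the post would plug in for its sequence-detection step.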

2. Token Estimation

Token counting in large contexts can become a performance bottleneck. For efficient token estimation, we implemented a progressive sampling approach that estimates token counts without processing the entire text, achieving significant performance improvements while maintaining accuracy.
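
The idea can be sketched as follows: tokenize only a handful of fixed-size slices taken at exponentially spaced offsets, measure the average tokens-per-byte density, and extrapolate. The sampling schedule, slice size, and the tokenize_count parameter are illustrative assumptions rather than Forge's exact implementation.

/// Estimates the token count of `text` without tokenizing all of it: tokenize
/// a few fixed-size slices at exponentially spaced offsets, then extrapolate
/// the measured density to the full length.
fn estimate_tokens<F>(text: &str, tokenize_count: F) -> usize
where
    F: Fn(&str) -> usize, // an exact (and relatively expensive) tokenizer
{
    const SAMPLE_LEN: usize = 512; // bytes per sampled slice (assumption)

    let len = text.len();
    if len <= SAMPLE_LEN * 4 {
        // Small inputs are cheap enough to tokenize directly.
        return tokenize_count(text);
    }

    // Exponentially spaced offsets: 0, len/16, len/8, len/4, len/2.
    let mut offsets = vec![0usize];
    let mut step = len / 16;
    while step < len {
        offsets.push(step);
        step *= 2;
    }

    let mut sampled_tokens = 0usize;
    let mut sampled_bytes = 0usize;
    for &offset in &offsets {
        // Snap slice boundaries to char boundaries so UTF-8 slicing never panics.
        let start = floor_char_boundary(text, offset);
        let end = floor_char_boundary(text, (offset + SAMPLE_LEN).min(len));
        if start < end {
            sampled_tokens += tokenize_count(&text[start..end]);
            sampled_bytes += end - start;
        }
    }

    // Extrapolate the sampled tokens-per-byte density to the whole text.
    ((sampled_tokens as f64 / sampled_bytes as f64) * len as f64).ceil() as usize
}

fn floor_char_boundary(s: &str, mut index: usize) -> usize {
    while index > 0 && !s.is_char_boundary(index) {
        index -= 1;
    }
    index
}

With a handful of 512-byte slices the estimation cost stays roughly constant regardless of context size, at the price of some error, which is acceptable because the threshold check only needs a ballpark figure.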

3. Preserving Critical Information

The most challenging aspect was ensuring that summarized information retained critical details. We developed a specialized prompt template that instructs the compaction model to:

  1. Prioritize executable code snippets
  2. Preserve error messages and their context
  3. Maintain reference to key files and locations
  4. Track ongoing debugging progress

Our template includes specific extraction directives like:

Preserve all code blocks completely if they are less than 50 lines.
For larger code blocks, focus on the modified sections and their immediately surrounding context.
Maintain all error messages verbatim with their stack traces summarized.
Ensure all file paths and line numbers are preserved exactly.

4. Implementation in Rust

The core compaction logic operates asynchronously, ensuring the main conversation remains responsive during compaction operations.
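
A minimal sketch of that pattern, using tokio and hypothetical stand-in types (a simple Context struct behind an Arc<Mutex<...>> and a placeholder compaction step), might look like this; it is not Forge's actual API:

use std::sync::Arc;
use tokio::sync::Mutex;

#[derive(Default)]
struct Context {
    messages: Vec<String>,
}

// Placeholder for the real work: detect sequences, call the summarizer model,
// and splice the summaries back into the history.
async fn compact_messages(messages: Vec<String>) -> Vec<String> {
    messages
}

/// Schedules a compaction pass in the background so the user-facing
/// conversation loop never blocks on the summarizer model.
/// Must be called from within a running Tokio runtime.
fn schedule_compaction(context: Arc<Mutex<Context>>) {
    tokio::spawn(async move {
        // Snapshot the history under the lock, then do the slow work without it.
        let snapshot = context.lock().await.messages.clone();
        let compacted = compact_messages(snapshot).await;

        // Swap the compacted history back in. A real implementation would also
        // reconcile any messages that arrived while the summary was being built.
        context.lock().await.messages = compacted;
    });
}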

Repository and Contributing

Forge is an open-source project developed by Antinomy; the source code and issue tracker are available on GitHub.

We welcome contributions from the community, including improvements to the context compaction system. If you're interested in contributing, check out our open issues or submit a pull request with your enhancements.

What's Next for Context Compaction

We're planning several enhancements for future releases:

  1. Adaptive Compaction Thresholds: The system will learn from your usage patterns and automatically adjust compaction parameters based on conversation characteristics.

  2. Multi-Mode Compaction: Different summarization strategies for different types of development tasks (debugging vs. feature development vs. code review).

  3. User-Guided Retention: Ability for users to mark specific messages as "never compact" to ensure critical information is preserved exactly as stated.

Take Action: Implementing Context Compaction

Context compaction isn't just a feature - it's a fundamental shift in how we can work with LLMs for development. Here's how to get started:

  1. Update your Forge installation: npm install -g @antinomyhq/forge

  2. Add compaction configuration to your forge.yaml file (see examples above)

  3. Experiment with different thresholds to find the optimal balance for your workflow

  4. Share your experiences with the community - we're collecting usage patterns to further optimize the system

If you find this useful, consider:

  • ⭐ Starring the Forge GitHub repository
  • 📢 Sharing this post with colleagues facing similar context management challenges
  • 🛠️ Contributing parameters that work well for specific development scenarios

The potential of large language models is only beginning to be realized, and solving the context limitation problem removes a significant barrier to their effectiveness as development partners.

Want to try Forge with context compaction for free? We're offering free access to readers of this blog post! Just comment on this GitHub issue and we'll set you up.