Tools like LLMLingua (from Microsoft) use language models to compress prompts by learning which parts can be dropped while preserving meaning. It's powerful, but it relies on one LLM to optimize prompts for another.

I wanted to try something different: a lightweight, rule-based semantic compressor that doesn't require training or GPUs — just smart heuristics, NLP tools like spaCy, and a deep respect for meaning.

The Challenge: Every Token Costs

In the world of Large Language Models (LLMs), every token comes with a price tag. For organizations running thousands of prompts daily, these costs add up quickly. But what if we could reduce these costs without sacrificing the quality of interactions?

Real Results: Beyond Theory

Our experimental Semantic Prompt Compressor has shown promising results in real-world testing. Analyzing 135 diverse prompts, we achieved:

  • 22.42% average compression ratio
  • Reduction from 4,986 → 3,868 tokens
  • 1,118 tokens saved while maintaining meaning
  • Over 95% preservation of named entities and technical terms

Example 1

Original (33 tokens):
"I've been considering the role of technology in mental health treatment.
How might virtual therapy and digital interventions evolve?
I'm interested in both current applications and future possibilities."
Compressed (12 tokens):
"I've been considering role of technology in mental health treatment."

Compression ratio: 63.64%
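
If you want to verify ratios like this yourself, a few lines of Python will do it. The article doesn't state which tokenizer produced its counts; cl100k_base (GPT-4's encoding, via OpenAI's tiktoken) is an assumption here, so your counts may differ slightly from the figures above.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding (assumed)

def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of tokens removed: (orig - comp) / orig."""
    n_orig = len(enc.encode(original))
    n_comp = len(enc.encode(compressed))
    return (n_orig - n_comp) / n_orig

# With the article's stated counts: (33 - 12) / 33
print(f"{(33 - 12) / 33:.2%}")  # -> 63.64%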

Example 2

Original (29 tokens):
"All these apps keep asking for my location.
What are they actually doing with this information?
I'm curious about the balance between convenience and privacy."

Compressed (14 tokens):
"apps keep asking for my location. What are they doing with information."

Compression ratio: 51.72%

The Cost Impact

Let’s translate these results into real business scenarios.

Customer Support AI (100,000 queries/day):

  • Avg. 200 tokens per query
  • GPT-4 API cost: $0.03 / 1K tokens

Without compression:

  • 20M tokens/day → $600/day → $18,000/month

With 22.42% compression:

  • 15.5M tokens/day → $465/day
  • Monthly savings: $4,050
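
For readers who want to plug in their own numbers, here is the same arithmetic as a short script. All figures are from the hypothetical scenario above, not measured data.

QUERIES_PER_DAY = 100_000
TOKENS_PER_QUERY = 200
PRICE_PER_1K = 0.03          # GPT-4 API, $ per 1K tokens
SAVED = 0.2242               # 22.42% of tokens removed

daily_tokens = QUERIES_PER_DAY * TOKENS_PER_QUERY        # 20,000,000
daily_cost = daily_tokens / 1000 * PRICE_PER_1K          # $600.00
compressed_cost = daily_cost * (1 - SAVED)               # $465.48
monthly_savings = (daily_cost - compressed_cost) * 30    # ~$4,036
# (the article rounds to $465/day, i.e. $135 x 30 = $4,050/month)

print(f"${daily_cost:.0f}/day -> ${compressed_cost:.0f}/day, "
      f"saving ${monthly_savings:,.0f}/month")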

How It Works: A Three-Layer Approach

Rules Layer

We implemented a configurable rule system instead of using a black-box ML model. For example:

  • Replace “Could you explain” with “explain”
  • Replace “Hello, I was wondering” with “I wonder”

rule_groups:
  remove_fillers:
    enabled: true
    patterns:
      - pattern: "Could you explain"
        replacement: "explain"
  remove_greetings:
    enabled: true
    patterns:
      - pattern: "Hello, I was wondering"
        replacement: "I wonder"
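
To make this concrete, here's a minimal sketch of how a config like the one above could be applied. The apply_rules function and rules.yaml path are illustrative names, not the project's actual API; only PyYAML and the standard library are assumed.

import re
import yaml

def apply_rules(text: str, config_path: str = "rules.yaml") -> str:
    """Apply every enabled pattern as a case-insensitive replacement."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for group in config["rule_groups"].values():
        if not group.get("enabled", False):
            continue
        for rule in group["patterns"]:
            pattern = re.escape(rule["pattern"])
            text = re.sub(pattern, rule["replacement"], text,
                          flags=re.IGNORECASE)
    return text

# "Hello, I was wondering: could you explain quantization?"
# -> "I wonder: explain quantization?"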

spaCy NLP Layer

We leverage spaCy’s linguistic analysis for intelligent compression (sketched in code after this list):

  • Named Entity Recognition to preserve key terms
  • Dependency parsing for sentence structure
  • POS tagging to remove non-essential parts
  • Compound-word preservation for technical terms
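
Here's that sketch: protect any token that belongs to a named entity, and drop a deliberately conservative set of parts of speech. The DROPPABLE_POS set and compress function are illustrative choices, not the project's exact pipeline, which also uses dependency parsing and the rules layer above.

import spacy

nlp = spacy.load("en_core_web_sm")
DROPPABLE_POS = {"DET", "INTJ"}  # determiners ("all", "these"), interjections

def compress(text: str) -> str:
    doc = nlp(text)
    kept = [t for t in doc
            if t.ent_type_            # never drop named-entity tokens
            or t.pos_ not in DROPPABLE_POS]
    return "".join(t.text_with_ws for t in kept).strip()

print(compress("All these apps keep asking for my location."))
# -> "apps keep asking for my location."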

Entity Preservation Layer

We ensure critical information is not lost (illustrated after the list):

  • Technical terms (e.g., "5G", "TCP/IP")
  • Named entities (companies, people, places)
  • Numerical values and measurements
  • Domain-specific vocabulary
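
And the illustration: one way to build a set of protected spans before any compression runs, so later rules can be vetoed if they would delete them. The TECH_TERM regex is a toy stand-in for the real term lists, covering cases like "5G" and "TCP/IP".

import re
import spacy

nlp = spacy.load("en_core_web_sm")
TECH_TERM = re.compile(r"\b(?:\d+G|[A-Z]{2,}(?:/[A-Z]{2,})?)\b")

def protected_spans(text: str) -> set:
    doc = nlp(text)
    spans = {ent.text for ent in doc.ents}          # companies, people, places
    spans |= {t.text for t in doc if t.like_num}    # numerical values
    spans |= set(TECH_TERM.findall(text))           # "5G", "TCP/IP", ...
    return spans

print(protected_spans("Our 5G rollout at Nokia uses TCP/IP and 12 towers."))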

Real-World Applications

Customer Support

  • Compress user queries while maintaining context
  • Preserve product-specific language
  • Reduce support costs, maintain quality

Content Moderation

  • Efficiently process user reports
  • Maintain critical context
  • Cost-effective scaling

Technical Documentation

  • Compress API or doc queries
  • Preserve code snippets and terms
  • Cut costs without losing accuracy

Beyond Simple Compression

What makes our approach unique?

Intelligent Preservation — Maintains technical accuracy and key data

Configurable Rules — Domain-adaptable, transparent, and editable

Transparent Processing — Understandable and debuggable

Current Limitations

  • Requires domain-specific tuning
  • Conservative in technical contexts
  • Manual rule editing still helpful
  • Entity preservation may be overly cautious

Future Development

  • ML-based adaptive compression
  • Domain-specific profiles
  • Real-time compression
  • LLM platform integrations
  • Custom vocabulary modules

Conclusion

The results from our testing show that intelligent semantic prompt compression is not only possible — it's practical.

With a 22.42% average compression ratio and high semantic preservation, systems built on LLM APIs can cut token costs while keeping clarity and intent intact.

Whether you're building support bots, moderation tools, or technical assistants, prompt compression could be a key layer in your stack.

Project on GitHub:
github.com/metawake/prompt_compressor
(Open source, transparent, and built for experimentation.)