Tools like LLMLingua (by Microsoft) use language models to compress prompts by learning which parts can be dropped while preserving meaning. It’s powerful — but also relies on another LLM to optimize prompts for the LLM.
I wanted to try something different: a lightweight, rule-based semantic compressor that doesn't require training or GPUs — just smart heuristics, NLP tools like spaCy, and a deep respect for meaning.
The Challenge: Every Token Costs
In the world of Large Language Models (LLMs), every token comes with a price tag. For organizations running thousands of prompts daily, these costs add up quickly. But what if we could reduce these costs without sacrificing the quality of interactions?
Real Results: Beyond Theory
Our experimental Semantic Prompt Compressor has shown promising results in real-world testing. Analyzing 135 diverse prompts, we achieved:
- 22.42% average compression ratio
- Reduction from 4,986 → 3,868 tokens
- 1,118 tokens saved while maintaining meaning
- Over 95% preservation of named entities and technical terms
Example 1
Original (33 tokens):
"I've been considering the role of technology in mental health treatment.
How might virtual therapy and digital interventions evolve?
I'm interested in both current applications and future possibilities."
Compressed (12 tokens):
"I've been considering role of technology in mental health treatment."
Compression ratio: 63.64%
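(Here the compression ratio is the fraction of tokens removed: (33 - 12) / 33 ≈ 63.64%.)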
Example 2
Original (29 tokens):
"All these apps keep asking for my location.
What are they actually doing with this information?
I'm curious about the balance between convenience and privacy."
Compressed (14 tokens):
"apps keep asking for my location. What are they doing with information."
Compression ratio: 51.72%
The Cost Impact
Let’s translate these results into real business scenarios.
Customer Support AI
(100,000 queries/day):
- Avg. 200 tokens per query
- GPT-4 API cost: $0.03 / 1K tokens
Without compression:
- 20M tokens/day → $600/day → $18,000/month
With 22.42% compression:
- 15.5M tokens/day → $465/day
- Monthly savings: $4,050
How It Works: A Three-Layer Approach
Rules Layer
We implemented a configurable rule system instead of using a black-box ML model. For example:
- Replace “Could you explain” with “explain”
- Replace “Hello, I was wondering” with “I wonder”
rule_groups:
  remove_fillers:
    enabled: true
    patterns:
      - pattern: "Could you explain"
        replacement: "explain"
  remove_greetings:
    enabled: true
    patterns:
      - pattern: "Hello, I was wondering"
        replacement: "I wonder"
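As a rough sketch of how a rule group like this could be applied (assuming the config is loaded with PyYAML; apply_rules and rules.yaml are illustrative names, not the project's actual API):

```python
import re
import yaml  # PyYAML

def apply_rules(prompt: str, config_path: str = "rules.yaml") -> str:
    """Run every enabled rule group as a case-insensitive find-and-replace."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for group in config["rule_groups"].values():
        if not group.get("enabled", False):
            continue
        for rule in group["patterns"]:
            prompt = re.sub(rule["pattern"], rule["replacement"],
                            prompt, flags=re.IGNORECASE)
    return prompt

# "Hello, I was wondering, could you explain TCP/IP handshakes?"
# -> "I wonder, explain TCP/IP handshakes?"
```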
spaCy NLP Layer
We leverage spaCy’s linguistic analysis for intelligent compression:
- Named Entity Recognition to preserve key terms
- Dependency parsing for sentence structure
- POS tagging to remove non-essential parts
- Compound-word preservation for technical terms
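A condensed sketch of what this layer does, using spaCy's standard pipeline. The droppable POS set and the compress function are simplified illustrations, not the project's full logic.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# POS tags treated as droppable in this sketch; the real rule set is more nuanced.
DROPPABLE_POS = {"DET", "INTJ"}

def compress(text: str) -> str:
    doc = nlp(text)

    def keep(token) -> bool:
        # Never drop named entities or compound modifiers (e.g. "mental health").
        if token.ent_type_ or token.dep_ == "compound":
            return True
        return token.pos_ not in DROPPABLE_POS

    return "".join(t.text_with_ws for t in doc if keep(t)).strip()

print(compress("All these apps keep asking for my location."))
# e.g. -> "apps keep asking for my location."
```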
Entity Preservation Layer
We ensure critical information is not lost:
- Technical terms (e.g., "5G", "TCP/IP")
- Named entities (companies, people, places)
- Numerical values and measurements
- Domain-specific vocabulary
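A sketch of how these protected terms might be collected before compression. The regexes are rough illustrations; the project's actual rules may differ.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Rough, illustrative patterns for terms that must survive compression.
TECH_TERM = re.compile(r"\b(?:[A-Z0-9]+(?:/[A-Z0-9]+)+|\d+[A-Za-z]+|[A-Z]{2,})\b")
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def protected_terms(text: str) -> set[str]:
    doc = nlp(text)
    protected = {ent.text for ent in doc.ents}                      # companies, people, places
    protected.update(m.group() for m in TECH_TERM.finditer(text))   # "TCP/IP", "5G", "GPT"
    protected.update(m.group() for m in NUMBER.finditer(text))      # numerical values
    return protected

print(protected_terms("Acme ships 5G routers; latency dropped 30% over TCP/IP."))
```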
Real-World Applications
Customer Support
- Compress user queries while maintaining context
- Preserve product-specific language
- Reduce support costs, maintain quality
Content Moderation
- Efficiently process user reports
- Maintain critical context
- Cost-effective scaling
Technical Documentation
- Compress API or doc queries
- Preserve code snippets and terms
- Cut costs without losing accuracy
Beyond Simple Compression
What makes our approach unique?
Intelligent Preservation — Maintains technical accuracy and key data
Configurable Rules — Domain-adaptable, transparent, and editable
Transparent Processing — Understandable and debuggable
Current Limitations
- Requires domain-specific tuning
- Conservative in technical contexts
- Manual rule editing still helpful
- Entity preservation may be overly cautious
Future Development
- ML-based adaptive compression
- Domain-specific profiles
- Real-time compression
- LLM platform integrations
- Custom vocabulary modules
Conclusion
The results from our testing show that intelligent semantic prompt compression is not only possible — it's practical.
With a 22.42% average compression ratio and high semantic preservation, LLM-based systems can reduce API costs while maintaining clarity and intent.
Whether you're building support bots, moderation tools, or technical assistants, prompt compression could be a key layer in your stack.
Project on GitHub:
github.com/metawake/prompt_compressor
(Open source, transparent, and built for experimentation.)