Tools like LLMLingua (by Microsoft) use language models to compress prompts by learning which parts can be dropped while preserving meaning. It’s powerful — but also relies on another LLM to optimize prompts for the LLM.
I wanted to try something different: a lightweight, rule-based semantic compressor that doesn't require training or GPUs — just smart heuristics, NLP tools like spaCy, and a deep respect for meaning.
The Challenge: Every Token Costs
In the world of Large Language Models (LLMs), every token comes with a price tag. For organizations running thousands of prompts daily, these costs add up quickly. But what if we could reduce these costs without sacrificing the quality of interactions?
Real Results: Beyond Theory
Our experimental Semantic Prompt Compressor has shown promising results in real-world testing. Analyzing 135 diverse prompts, we achieved:
- 22.42% average compression ratio
- Reduction from 4,986 → 3,868 tokens
- 1,118 tokens saved while maintaining meaning
- Over 95% preservation of named entities and technical terms
Example 1
Original (33 tokens):
"I've been considering the role of technology in mental health treatment.
How might virtual therapy and digital interventions evolve?
I'm interested in both current applications and future possibilities."
Compressed (12 tokens):
"I've been considering role of technology in mental health treatment."
Compression ratio: 63.64%
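(Here the compression ratio is the fraction of tokens removed: (33 - 12) / 33 ≈ 63.64%.)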
Example 2
Original (29 tokens):
"All these apps keep asking for my location.
What are they actually doing with this information?
I'm curious about the balance between convenience and privacy."
Compressed (14 tokens):
"apps keep asking for my location. What are they doing with information."
Compression ratio: 51.72%
The Cost Impact
Let’s translate these results into real business scenarios.
Customer Support AI
(100,000 queries/day):
- Avg. 200 tokens per query
- GPT-4 API cost: $0.03 / 1K tokens
Without compression:
- 20M tokens/day → $600/day → $18,000/month
With 22.42% compression:
- 15.5M tokens/day → $465/day
- Monthly savings: $4,050
How It Works: A Three-Layer Approach
Rules Layer
We implemented a configurable rule system instead of using a black-box ML model. For example:
- Replace “Could you explain” with “explain”
- Replace “Hello, I was wondering” with “I wonder”
rule_groups:
  remove_fillers:
    enabled: true
    patterns:
      - pattern: "Could you explain"
        replacement: "explain"
  remove_greetings:
    enabled: true
    patterns:
      - pattern: "Hello, I was wondering"
        replacement: "I wonder"
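As a rough sketch of how a rule group like this could be applied (assuming the config is loaded with PyYAML; apply_rules and rules.yaml are illustrative names, not the project's actual API):

```python
import re
import yaml  # PyYAML

def apply_rules(prompt: str, config_path: str = "rules.yaml") -> str:
    """Run every enabled rule group as a case-insensitive find-and-replace."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for group in config["rule_groups"].values():
        if not group.get("enabled", False):
            continue
        for rule in group["patterns"]:
            prompt = re.sub(rule["pattern"], rule["replacement"],
                            prompt, flags=re.IGNORECASE)
    return prompt

# "Hello, I was wondering, could you explain TCP/IP handshakes?"
# -> "I wonder, explain TCP/IP handshakes?"
```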
spaCy NLP Layer
We leverage spaCy’s linguistic analysis for intelligent compression:
- Named Entity Recognition to preserve key terms
- Dependency parsing for sentence structure
- POS tagging to remove non-essential parts
- Compound-word preservation for technical terms
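A condensed sketch of what this layer does, using spaCy's standard pipeline. The droppable POS set and the compress function are simplified illustrations, not the project's full logic.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# POS tags treated as droppable in this sketch; the real rule set is more nuanced.
DROPPABLE_POS = {"DET", "INTJ"}

def compress(text: str) -> str:
    doc = nlp(text)

    def keep(token) -> bool:
        # Never drop named entities or compound modifiers (e.g. "mental health").
        if token.ent_type_ or token.dep_ == "compound":
            return True
        return token.pos_ not in DROPPABLE_POS

    return "".join(t.text_with_ws for t in doc if keep(t)).strip()

print(compress("All these apps keep asking for my location."))
# e.g. -> "apps keep asking for my location."
```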
Entity Preservation Layer
We ensure critical information is not lost:
- Technical terms (e.g., "5G", "TCP/IP")
- Named entities (companies, people, places)
- Numerical values and measurements
- Domain-specific vocabulary
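A sketch of how these protected terms might be collected before compression. The regexes are rough illustrations; the project's actual rules may differ.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Rough, illustrative patterns for terms that must survive compression.
TECH_TERM = re.compile(r"\b(?:[A-Z0-9]+(?:/[A-Z0-9]+)+|\d+[A-Za-z]+|[A-Z]{2,})\b")
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def protected_terms(text: str) -> set[str]:
    doc = nlp(text)
    protected = {ent.text for ent in doc.ents}                      # companies, people, places
    protected.update(m.group() for m in TECH_TERM.finditer(text))   # "TCP/IP", "5G", "GPT"
    protected.update(m.group() for m in NUMBER.finditer(text))      # numerical values
    return protected

print(protected_terms("Acme ships 5G routers; latency dropped 30% over TCP/IP."))
```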
Real-World Applications
Customer Support
- Compress user queries while maintaining context
- Preserve product-specific language
- Reduce support costs, maintain quality
Content Moderation
- Efficiently process user reports
- Maintain critical context
- Cost-effective scaling
Technical Documentation
- Compress API or doc queries
- Preserve code snippets and terms
- Cut costs without losing accuracy
Beyond Simple Compression
What makes our approach unique?
Intelligent Preservation — Maintains technical accuracy and key data
Configurable Rules — Domain-adaptable, transparent, and editable
Transparent Processing — Understandable and debuggable
Current Limitations
- Requires domain-specific tuning
- Conservative in technical contexts
- Manual rule editing still helpful
- Entity preservation may be overly cautious
Future Development
- ML-based adaptive compression
- Domain-specific profiles
- Real-time compression
- LLM platform integrations
- Custom vocabulary modules
Conclusion
The results from our testing show that intelligent semantic prompt compression is not only possible — it's practical.
With a 22.42% average compression ratio and high semantic preservation, LLM-based systems can reduce API costs while maintaining clarity and intent.
Whether you're building support bots, moderation tools, or technical assistants, prompt compression could be a key layer in your stack.
Project on GitHub:
github.com/metawake/prompt_compressor
(Open source, transparent, and built for experimentation.)