If your AI application sends every user query to an LLM, you are paying full price for answers you have already generated. Semantic caching matches incoming queries against previous ones by meaning, not just exact string match, and returns cached responses instantly. The Nayan Cache API uses trigram similarity matching with zero external dependencies. No embeddings. No vector database. Cache hits resolve in roughly 5 milliseconds instead of 2,000+.
Here is a number that should bother you: the average AI-powered SaaS application sends between 40% and 70% of its queries to an LLM when a semantically identical answer already exists in a previous response. That is money burning in a furnace.
GPT-4o costs $2.50 per million input tokens and $10 per million output tokens. Claude 3.5 Sonnet runs $3 and $15. At scale, a chatbot handling 100,000 conversations per month can easily spend $3,000 to $8,000 on LLM calls alone. And a significant chunk of those calls are answering the same questions with slightly different phrasing.
The Problem With Exact-Match Caching
Exact-match caching is what most teams try first. Hash the user query, check if the hash exists, return the cached response if it does. Simple. Fast. And nearly useless for natural language.
"What's the capital of France?" and "what is the capital of france" and "France's capital city?" and "capital of france" are all the same question. An exact-match cache treats them as four different queries and makes four separate LLM calls.
The solution is semantic matching: comparing queries by meaning rather than by characters. But the traditional approach to semantic matching requires an embedding model, a vector database, and a similarity search pipeline. That is a lot of infrastructure for a cache.
Trigram Similarity: The No-Deps Approach
The Nayan Cache API takes a different approach. Instead of converting text to embeddings and running cosine similarity, it uses trigram similarity matching. A trigram is a sequence of three consecutive characters. The string "hello" produces the trigrams: "hel", "ell", "llo". Two strings are similar if they share a high proportion of their trigrams.
This sounds crude, and for long documents it would be. But for the kind of short queries that make up 90% of chatbot traffic, trigram matching is surprisingly effective. "What's the capital of France" and "What is the capital of france" share most of their trigrams and score above the similarity threshold.
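The mechanics fit in a few lines. This is a generic sketch of trigram matching — lowercase the text, take every three-character window, score with Jaccard similarity — and the exact normalization and scoring rules the Nayan Cache API applies may differ:

```python
def trigrams(text: str) -> set[str]:
    # Lowercase, then take every 3-character sliding window.
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def similarity(a: str, b: str) -> float:
    # Jaccard similarity: shared trigrams / total distinct trigrams.
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(trigrams("hello"))  # the three trigrams: hel, ell, llo
score = similarity("What's the capital of France",
                   "What is the capital of france")
print(round(score, 2))    # above a 0.7 threshold
```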
The advantage: zero external dependencies. No embedding model to call. No vector database to host. No latency from a secondary ML inference. The cache lookup is a string comparison, and it resolves in single-digit milliseconds.
How It Works in Practice
Step 1: After your LLM generates a response, store the query-response pair in the cache.
Step 2: Before making an LLM call, check the cache.
If a match is found, the response includes the cached answer, the similarity score, and the original query it matched against. If no match is found, you proceed to the LLM as usual and store the result for next time.
The threshold parameter controls how similar queries need to be to count as a match. A value of 0.7 is a good starting point. Lower values catch more matches but risk returning answers to different questions. Higher values are more conservative.
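The whole flow can be modeled with a small in-memory class. This is an illustrative sketch of the pattern, not the actual HTTP API; the class, its method names, and the `call_llm` stub are all assumptions:

```python
class SemanticCache:
    """Toy in-memory semantic cache using trigram Jaccard similarity."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.entries: list[tuple[str, str]] = []  # (query, response)

    @staticmethod
    def _trigrams(text: str) -> set:
        t = text.lower()
        return {t[i:i + 3] for i in range(len(t) - 2)}

    def store(self, query: str, response: str) -> None:
        self.entries.append((query, response))

    def lookup(self, query: str):
        # Return (response, score, matched_query) for the best match
        # at or above the threshold, or None on a miss.
        q = self._trigrams(query)
        best = None
        for stored_query, response in self.entries:
            s = self._trigrams(stored_query)
            score = len(q & s) / len(q | s) if (q | s) else 0.0
            if score >= self.threshold and (best is None or score > best[1]):
                best = (response, score, stored_query)
        return best

def call_llm(query: str) -> str:
    # Stand-in for your real LLM call.
    return f"LLM answer to: {query}"

cache = SemanticCache(threshold=0.7)

def answer(query: str) -> str:
    hit = cache.lookup(query)
    if hit:                        # cache hit: no LLM cost
        return hit[0]
    response = call_llm(query)     # cache miss: pay for the call once
    cache.store(query, response)
    return response

print(answer("What's the capital of France?"))   # miss -> LLM
print(answer("what is the capital of france"))   # hit -> cached response
```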
The Numbers
| Metric | LLM Call | Cache Hit |
|---|---|---|
| Latency | 1,500 - 3,000ms | 3 - 8ms |
| Cost per query (GPT-4o) | $0.002 - $0.01 | $0.00001 |
| Infrastructure | API key + network call | API key + network call |
| Accuracy | Model-dependent | Identical to original |
At a 50% cache hit rate, which is conservative for most chatbot workloads, you cut your LLM costs in half and your average response latency by 40%. At a 70% hit rate, which is typical for customer support bots with a bounded domain, your LLM spend drops by 70%.
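A quick back-of-the-envelope model makes the arithmetic concrete; the per-call cost and query volume below are placeholder assumptions drawn from the ranges quoted earlier:

```python
def monthly_costs(queries_per_month: int,
                  cost_per_llm_call: float,
                  hit_rate: float) -> tuple[float, float]:
    """Return (baseline_cost, cost_with_cache), ignoring the small
    per-lookup cost of the cache itself."""
    baseline = queries_per_month * cost_per_llm_call
    with_cache = queries_per_month * (1 - hit_rate) * cost_per_llm_call
    return baseline, with_cache

# 100,000 queries at an assumed $0.005 per LLM call (mid-range for GPT-4o)
for rate in (0.5, 0.7):
    base, cached = monthly_costs(100_000, 0.005, rate)
    print(f"hit rate {rate:.0%}: ${base:,.0f}/mo -> ${cached:,.0f}/mo")
```

At these assumed numbers, a 50% hit rate turns a $500/month LLM bill into $250, and a 70% hit rate turns it into $150.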
The Namespace System
Every cache entry belongs to a namespace. This means you can run multiple applications, multiple environments, or multiple tenants on the same API key without cache pollution. Your staging environment does not return cached responses from production. Your customer support bot does not return cached responses from your content generator.
Namespaces also let you set different TTL (time-to-live) values for different use cases. A knowledge base bot might cache responses for 24 hours. A news bot might cache for 30 minutes. A real-time pricing bot might not cache at all.
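As an illustrative sketch — using exact keys for brevity, and invented namespace names — a namespaced cache is just a compound key plus a per-namespace TTL check at lookup time:

```python
import time

# Hypothetical per-namespace TTLs, in seconds.
TTL = {
    "kb-bot": 24 * 3600,   # knowledge base: cache for a day
    "news-bot": 30 * 60,   # news: 30 minutes
    "pricing-bot": 0,      # real-time pricing: never serve from cache
}

store = {}  # (namespace, query) -> (response, stored_at)

def put(namespace: str, query: str, response: str) -> None:
    store[(namespace, query)] = (response, time.time())

def get(namespace: str, query: str):
    entry = store.get((namespace, query))
    if entry is None:
        return None
    response, stored_at = entry
    ttl = TTL.get(namespace, 0)
    # A TTL of 0 means "never serve from cache" for this namespace.
    if ttl == 0 or time.time() - stored_at > ttl:
        return None
    return response

put("kb-bot", "reset password", "Go to Settings > Security.")
put("pricing-bot", "price of X", "$9.99")
print(get("kb-bot", "reset password"))    # served from cache
print(get("news-bot", "reset password"))  # different namespace: None
print(get("pricing-bot", "price of X"))   # TTL 0: None
```

Note the isolation: the same query stored under `kb-bot` is invisible to `news-bot`, which is exactly the staging-versus-production property described above.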
Who Should Use This
AI app developers building chatbots, copilots, or any product that sends user queries to an LLM. If your users ask similar questions, you are leaving money on the table.
SaaS companies with AI features. That AI-powered search, that smart assistant, that automated report generator. All of them produce repeat queries that semantic caching can intercept.
Chatbot builders deploying customer support bots. Support queries are highly repetitive by nature. Cache hit rates of 60-80% are common in this domain.
Getting Started
The free tier includes 1,000 cache lookups per month. That is enough to instrument your existing application, measure your actual hit rate, and calculate your cost savings before committing to a paid plan. Sign up at api.nayanleadership.com and you will have an API key in under a minute.
Key Takeaways
- 40-70% of chatbot queries are semantically identical to previously answered questions. Every duplicate is a wasted LLM call.
- Trigram similarity matching compares queries by meaning without requiring embeddings or a vector database. Zero external dependencies.
- Cache hits resolve in roughly 5ms compared to 1,500-3,000ms for an LLM call. That is a 300x or better improvement in latency.
- Namespace isolation prevents cache pollution across applications, environments, and tenants.
- Free tier: 1,000 lookups/month. Enough to measure your hit rate and calculate savings. Get your API key.