Response Caching
Identical chat completions return cached responses in ~23ms with zero provider quota burn. The cache keys on (model, messages, temperature, max_tokens, top_p, stop) via SHA-256, uses LRU eviction, and respects per-entry TTL (default 1 hour).
Why this matters for free-tier developers
During development you typically iterate on the same prompt 10-20 times. Tweaking the system message, comparing outputs, debugging. Without caching, each iteration burns a provider request and a slice of your daily token quota. With caching, only the first iteration costs anything. The rest are free hits.
A typical dev session of 100 chat completions might collapse to 30-40 unique prompts. That’s a 60-70% reduction in provider quota usage with no code changes on your end.
Verified end-to-end
```
Call A (cold)              cached=false  latency=200ms  tokens=43+2  → Groq
Call B (same prompt)       cached=true   latency=23ms   tokens=0     ← cache
Call C (same prompt)       cached=true   latency=23ms   tokens=0     ← cache
Call D (different prompt)  cached=false  latency=200ms  tokens=new   → Groq
```

That’s a 9× speedup on duplicate requests, and the Groq quota only gets charged for unique prompts.
How it works
- Cache key. `sha256(JSON.stringify({model, messages, temperature, max_tokens, top_p, stop}))`
- Cache lookup. Happens BEFORE the routing loop, on every non-streaming request.
- Cache hit. Short-circuits the entire flow: no provider call, no token tracker increment, no rate limiter increment.
- Cache miss. Runs the normal failover loop, then writes the successful response back to cache.
- LRU eviction. At capacity, the oldest entry is dropped (Map re-insertion keeps recently-used entries at the end).
- TTL expiry. Each entry has its own `expiresAt`, checked on read.
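The key derivation and the LRU/TTL mechanics described above can be sketched as follows. This is an illustrative reconstruction, not FreeLLM's actual `ResponseCache` source; the names `cacheKey`, `get`, and `set` are assumptions.

```typescript
import { createHash } from "node:crypto";

interface CacheEntry {
  response: unknown;
  expiresAt: number; // per-entry TTL, checked on read
}

const MAX_ENTRIES = 1000;
const TTL_MS = 3_600_000; // 1 hour default

const cache = new Map<string, CacheEntry>();

// Deterministic SHA-256 key over the request fields that affect the completion.
function cacheKey(body: Record<string, unknown>): string {
  const { model, messages, temperature, max_tokens, top_p, stop } = body;
  return createHash("sha256")
    .update(JSON.stringify({ model, messages, temperature, max_tokens, top_p, stop }))
    .digest("hex");
}

function get(key: string): unknown | undefined {
  const entry = cache.get(key);
  if (!entry) return undefined;
  if (Date.now() > entry.expiresAt) { // TTL expiry on read
    cache.delete(key);
    return undefined;
  }
  // Re-insert so recently used entries sit at the end of the Map's
  // insertion order; the first key is always the least recently used.
  cache.delete(key);
  cache.set(key, entry);
  return entry.response;
}

function set(key: string, response: unknown): void {
  if (cache.size >= MAX_ENTRIES) {
    // LRU eviction: drop the oldest (first-inserted) key.
    cache.delete(cache.keys().next().value as string);
  }
  cache.set(key, { response, expiresAt: Date.now() + TTL_MS });
}
```

The Map-as-LRU trick works because JavaScript `Map` preserves insertion order, so deleting and re-inserting on every hit keeps the eviction candidate at the front with no extra bookkeeping.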
Configuration
```
CACHE_ENABLED=true       # set to "false" to disable
CACHE_TTL_MS=3600000     # 1 hour default
CACHE_MAX_ENTRIES=1000   # LRU eviction at 1000
```

All optional. Sensible defaults. You don’t need to set anything.
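A minimal sketch of reading these variables with the documented defaults (the parsing shape is an assumption, not FreeLLM's actual code):

```typescript
// Assumed config-loading shape; the env var names and defaults match the
// documentation above. Anything unset falls back to the default.
const cacheConfig = {
  enabled: process.env.CACHE_ENABLED !== "false",            // on unless explicitly "false"
  ttlMs: Number(process.env.CACHE_TTL_MS ?? 3_600_000),      // 1 hour default
  maxEntries: Number(process.env.CACHE_MAX_ENTRIES ?? 1000), // LRU cap
};
```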
Rules
- Streaming requests are never cached. The OpenAI SSE protocol doesn’t fit the cache abstraction cleanly.
- Errors are never cached. Only successful 2xx responses get stored.
- Cache hits don’t count against the per-provider token quota. Real cost = 0.
- Cached responses are marked. Look for `x_freellm_cached: true` in the response (alongside `x_freellm_provider`).
- Cache survives within the process. Restart loses the cache, which warms up again in seconds.
Cache stats on /v1/status
```json
{
  "cache": {
    "enabled": true,
    "ttlMs": 3600000,
    "maxEntries": 1000,
    "currentSize": 12,
    "hits": 47,
    "misses": 8,
    "sets": 8,
    "evictions": 0,
    "hitRate": 0.8545
  }
}
```

The dashboard surfaces this as a cyan “Cache Hits” metric card with the hit-rate percentage as a sub-line, and rows in the recent requests table get a CACHE badge when served from cache.
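The `hitRate` field follows directly from the counters, assuming the usual definition hits / (hits + misses):

```typescript
// With the counters shown above: 47 hits, 8 misses.
const stats = { hits: 47, misses: 8 };
const hitRate = stats.hits / (stats.hits + stats.misses); // 47 / 55
const rounded = Number(hitRate.toFixed(4)); // 0.8545, as reported on /v1/status
```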
Why in-memory instead of SQLite
The original plan called for better-sqlite3, but it was rejected because:
- Native compilation risk. `better-sqlite3` needs `node-gyp`, Python, and a C++ toolchain at install time. Railway’s slim image likely lacks them, which would break the published Railway template’s build.
- Ephemeral filesystem on free tiers. Railway and Render free tiers don’t have persistent disk by default. A SQLite cache file would be wiped on every restart anyway, requiring a paid persistent volume.
- Architectural consistency. Every other observability piece in FreeLLM (request log, rate limiter, circuit breaker, usage tracker) is in-memory. Adding DB-backed storage for one feature would break the pattern.
Cold cache warms up in seconds, restart loss is acceptable for a free-tier gateway, and the entire feature ships with zero new dependencies (uses Node’s built-in crypto.createHash). The ResponseCache class lives behind a clean interface, so swapping the storage to SQLite later is a one-file change if persistence becomes a priority.
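The "one-file change" claim rests on the storage sitting behind a narrow interface. A hypothetical sketch of that shape (the names here are illustrative, not FreeLLM's actual API):

```typescript
// If the handler depends only on this interface, swapping the Map-backed
// store for a SQLite-backed one later touches a single file.
interface CacheStore {
  get(key: string): string | undefined;
  set(key: string, value: string): void;
}

// Today's storage: a plain in-process Map.
class MemoryStore implements CacheStore {
  private m = new Map<string, string>();
  get(key: string) { return this.m.get(key); }
  set(key: string, value: string) { this.m.set(key, value); }
}

// A later SqliteStore would implement the same two methods; callers
// would not change.
```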