# Benchmarks
This page publishes two numbers that most LLM gateway comparisons handwave: how fast the gateway boots from cold, and how much latency it adds on top of the upstream provider response.
Both are measured with a reproducible script in the repo, against a fake in-process HTTP server that returns a canned OpenAI-compatible 200 body as fast as Node can push bytes. That isolates FreeLLM’s own contribution from any upstream variance, so the numbers below reflect the routing layer, validation, rate limiter, logger, error handler, and cache path, and nothing else.
## Latest numbers
| Measurement | Value |
|---|---|
| Cold start (spawn to first /healthz 200) | ~130 ms |
| Cache-miss overhead, p50 | ~0.7 ms |
| Cache-miss overhead, p99 | ~1.4 ms |
| Cache-hit overhead, p50 | ~0.35 ms |
| Cache-hit overhead, p99 | ~0.9 ms |
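The p50/p99 figures in the table are percentiles over raw per-request timings. The real computation lives in `scripts/bench.mjs`; the sketch below shows the common nearest-rank method, which may or may not match the script's exact approach.

```javascript
// Nearest-rank percentile: sort the samples, take the value at rank ceil(p% * n).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1; // 0-based index
  return sorted[Math.max(0, rank)];
}

// Hypothetical overhead samples in milliseconds, not real measurements.
const overheadMs = [0.6, 0.7, 0.7, 0.8, 1.4, 0.65, 0.7, 0.72, 0.9, 0.68];
console.log(percentile(overheadMs, 50)); // prints 0.7 (the median)
console.log(percentile(overheadMs, 99)); // prints 1.4 (the tail)
```

Note that p99 over a small sample is dominated by a single outlier, which is why the published run records the raw percentiles alongside the summary.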
The numbers above are ballpark figures, measured on a developer laptop. The exact run lives at docs/benchmarks.json in the repo; it is regenerated by the benchmark script and includes the Node version, platform, and raw percentiles.
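For orientation, an entry in docs/benchmarks.json might look something like the following. This shape is a guess for illustration only; the field names are assumptions, not the file's actual schema, so always defer to the generated file.

```json
{
  "node": "v20.11.0",
  "platform": "darwin arm64",
  "coldStartMs": 130,
  "overhead": {
    "cacheMiss": { "p50": 0.7, "p99": 1.4 },
    "cacheHit": { "p50": 0.35, "p99": 0.9 }
  }
}
```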
## Why these are the numbers we publish
Cold start matters because FreeLLM is designed to run on free tiers of Railway, Render, and Fly, all of which spin containers down when idle. A gateway that takes four seconds to boot is unusable in that mode. At 130 ms, the first request from a cold container pays a penalty of roughly a tenth of a second, and everything after it is warm.
Overhead matters because it is what the gateway itself is costing you. If the upstream provider’s completion is 400 ms, 0.7 ms of that is FreeLLM doing its job. The rest is the model thinking. You do not pay a tax for the routing and failover machinery; you pay it when the machinery actually triggers on a 429.
Cache-hit overhead is separate because the response cache short-circuits the entire routing flow. If your app hammers the same prompt (common in tests, dev loops, and cron jobs), you get the 0.35 ms path and zero upstream quota burn.
## Caveats
These numbers come from a single host running a single Node process hitting a loopback socket. Real deployments have:
- Network latency between the caller and the gateway (usually 5 to 50 ms)
- Network latency between the gateway and the upstream provider (the dominant factor, usually 50 to 500 ms)
- Cloud hypervisor noise, which adds variance but usually moves p50 very little
- Cold starts that are longer on free tiers because container boot adds its own overhead before the gateway process starts
What these factors will not change: FreeLLM's own contribution to the total stays at a few milliseconds, and cold start stays sub-second.
## Reproduce it yourself
```sh
git clone https://github.com/Devansh-365/freellm
cd freellm
pnpm install
pnpm -r build
node scripts/bench.mjs --print
```

The script writes docs/benchmarks.json and also prints the JSON to stdout so you can diff it against the published version. Open an issue if your numbers differ materially; we want to hear about it.