Engineering · Mar 28, 2026

Building an OpenAI-compatible API gateway from scratch

The OpenAI API has become the de facto interface for LLM applications. Developers have built entire toolchains — LangChain, LlamaIndex, Vercel AI SDK, countless internal frameworks — that speak OpenAI. Making Cloudach a drop-in replacement meant we had to implement the spec precisely. Here's what that took.

Why compatibility is harder than it looks

The OpenAI API spec is partially documented. The parts that matter most to developers — streaming behavior, error codes, token counting, finish reasons — have subtle behaviors that aren't fully specified in the docs but are relied on in production code everywhere.

We found this out early: we passed the basic "change base URL" test but broke LangChain streaming, because we emitted data: [DONE] directly after a final delta chunk that still carried content. OpenAI always sends a final chunk that carries no content, only finish_reason: stop, before the [DONE] sentinel. We didn't, and LangChain's streaming parser silently dropped the last token.

The gateway architecture

Our API gateway sits between the client and the vLLM inference backends. It handles:

Auth: API key validation, rate limit enforcement, usage tracking
Request translation: OpenAI chat completions format → vLLM sampling params
Model routing: map model field to the right backend instance
Response translation: vLLM token stream → OpenAI SSE format
Token counting: tiktoken for OpenAI-compatible usage metadata (even though vLLM tokenizes with each model's own tokenizer internally)

We built it in Go. Node.js was tempting for the streaming ergonomics, but Go's goroutine-per-request model handles the long-lived SSE connections much more efficiently. At scale, idle connections are memory, not CPU — Go handles 50k concurrent idle SSE connections in ~200MB RSS. Node would struggle past 10k.

The tricky parts

Streaming with correct token deltas. vLLM streams whole tokens, while OpenAI sometimes streams deltas smaller than a token. Most SDKs don't care, since they concatenate everything, but some do byte-level streaming for progressive rendering. We emit token-by-token and document this limitation.

Function calling. vLLM supports tool use via constrained decoding on supported models. Translating OpenAI's tools format to vLLM's guided JSON schema required writing a recursive schema translator, because OpenAI allows anyOf / $ref in tool parameter schemas that vLLM's constrained decoder doesn't support natively.

System prompt handling. Llama 2 uses an [INST] / <<SYS>> chat template, while Llama 3 uses header tokens such as <|start_header_id|> and <|eot_id|>. Mistral uses a different template. Qwen uses another. We maintain a template registry keyed on model family and apply the right one when constructing the raw prompt, so the messages format works identically across all models.
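A template registry of this kind can be sketched as a map from model family to a render function. The names below (`Message`, `templates`, `buildPrompt`) are hypothetical. The two templates are simplified versions of the published Llama 3 header-token format and the ChatML format used by Qwen. Real templates also handle tool messages, whitespace trimming, and other details omitted here.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is one entry of the OpenAI-style messages array.
type Message struct {
	Role    string
	Content string
}

type renderFunc func(msgs []Message) string

// renderLlama3 wraps each message in Llama 3 header tokens and ends with an
// open assistant header so the model continues from there.
func renderLlama3(msgs []Message) string {
	var sb strings.Builder
	sb.WriteString("<|begin_of_text|>")
	for _, m := range msgs {
		sb.WriteString("<|start_header_id|>" + m.Role + "<|end_header_id|>\n\n")
		sb.WriteString(m.Content + "<|eot_id|>")
	}
	sb.WriteString("<|start_header_id|>assistant<|end_header_id|>\n\n")
	return sb.String()
}

// renderChatML produces the ChatML layout used by the Qwen family.
func renderChatML(msgs []Message) string {
	var sb strings.Builder
	for _, m := range msgs {
		sb.WriteString("<|im_start|>" + m.Role + "\n" + m.Content + "<|im_end|>\n")
	}
	sb.WriteString("<|im_start|>assistant\n")
	return sb.String()
}

// templates is the registry, keyed on model family.
var templates = map[string]renderFunc{
	"llama3": renderLlama3,
	"qwen":   renderChatML,
}

// buildPrompt looks up the family's template and renders the raw prompt.
func buildPrompt(family string, msgs []Message) (string, error) {
	render, ok := templates[family]
	if !ok {
		return "", fmt.Errorf("no chat template registered for family %q", family)
	}
	return render(msgs), nil
}

func main() {
	msgs := []Message{
		{Role: "system", Content: "Be brief."},
		{Role: "user", Content: "Hi"},
	}
	for _, family := range []string{"llama3", "qwen"} {
		prompt, err := buildPrompt(family, msgs)
		if err != nil {
			panic(err)
		}
		fmt.Printf("--- %s ---\n%s\n", family, prompt)
	}
}
```

Because callers only ever pass a messages slice, adding a new model family means registering one more render function and touching nothing else.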

Error codes. We map vLLM errors (context length exceeded, CUDA OOM, model not loaded) to the matching OpenAI error identifiers: context_length_exceeded, server_error, and model_not_found. This matters because many SDKs do structured error handling based on the error's type and code strings.

Testing compatibility

We have a compatibility test suite that runs every deployment against the real OpenAI API and diffs the response shape. It covers: streaming completions, function calling, embeddings, error shapes, usage metadata fields, and finish reason values. Any divergence fails the deploy.

We also run integration tests against LangChain, LlamaIndex, and the Vercel AI SDK on every commit. These are the most valuable tests in our suite — they catch the undocumented behavioral dependencies that pure API spec tests miss.

The result

Most applications genuinely require only one change to migrate: swap the base URL and API key. If you're building on the OpenAI SDK and want to cut costs while keeping data in-house, try Cloudach. Migration takes 5 minutes.