How to choose the right open-source LLM
There are now dozens of capable open-source models. Mistral 7B, Llama 3 8B, Llama 3 70B, Mixtral 8×7B, Code Llama — and that list grows every month. The good news: most production use cases map cleanly to one model family. This guide gives you a decision framework so you stop A/B testing models indefinitely and ship.
Start with the constraint, not the benchmark
Most developers open a leaderboard, sort by MMLU, and pick the top result. That is the wrong starting point. Leaderboard rankings measure general capability on a fixed set of benchmark tasks; they say nothing about whether the model fits your latency budget, token budget, or the specific task you are actually building.
Instead, start by answering three questions:
- What is my latency requirement? Is there a human waiting for the response? If yes, anything over 150 ms TTFT feels slow. 70B models clock in at 142–156 ms TTFT p50 on our cluster — already at the edge.
- What is my context window requirement? RAG and chat apps that stay under 8 K tokens have 6 models to choose from. Long-document tasks need Mistral 7B (32 K), Mixtral 8×7B (32 K), or Llama 3.1 (128 K).
- Can I fine-tune? A fine-tuned 8B model routinely beats a prompted 70B model on domain tasks. If you have even 500 labeled examples, fine-tuning usually wins over model-switching.
Only after you have eliminated models that can't satisfy these constraints should you look at quality benchmarks to pick between what's left.
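The elimination step can be sketched as a simple filter. The p99 TTFT and context figures are copied from the TTFT table in this post; `ModelSpec` and `eligible_models` are hypothetical names used for illustration, not part of any Cloudach SDK, and treating "K" as 1,000 tokens is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelSpec:
    name: str
    ttft_p99_ms: int     # p99 time to first token at concurrency 1
    context_tokens: int  # maximum context window (assumes 1 K = 1,000 tokens)


# Figures copied from the TTFT table in this post.
CATALOG = [
    ModelSpec("mistral-7b", 79, 32_000),
    ModelSpec("llama3-8b", 88, 8_000),
    ModelSpec("llama31-8b", 94, 128_000),
    ModelSpec("mixtral-8x7b", 163, 32_000),
    ModelSpec("llama3-70b", 287, 8_000),
    ModelSpec("llama31-70b", 304, 128_000),
]


def eligible_models(max_ttft_p99_ms: int, min_context_tokens: int,
                    catalog: List[ModelSpec] = CATALOG) -> List[str]:
    """Return the names of models that satisfy both hard constraints."""
    return [
        m.name
        for m in catalog
        if m.ttft_p99_ms <= max_ttft_p99_ms
        and m.context_tokens >= min_context_tokens
    ]


# A real-time app that needs 16 K tokens of context:
print(eligible_models(max_ttft_p99_ms=100, min_context_tokens=16_000))
# → ['mistral-7b', 'llama31-8b']
```

Only the survivors of this filter need to be compared on quality benchmarks.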
The decision tree
Walk through this tree top-to-bottom. Stop at the first branch that matches your situation.
Use case matrix
Here is how common production use cases map to model choices, based on our benchmarks and customer deployments.
| Use case | Recommended | Why |
|---|---|---|
| Customer support chatbot | llama3-8b | Fast, low cost, handles common Q&A well |
| Code generation / review | codellama-13b | Purpose-trained on code, strong fill-in-middle |
| Document summarisation | llama3-8b | Short-context summaries stay within 8 K limit |
| Long-doc summarisation (>8 K) | llama31-8b | 128 K context, comparable speed to llama3-8b |
| Translation | gemma-7b | Trained with multilingual data, compact footprint |
| RAG pipeline | mistral-7b | Low latency for fast retrieval → response cycles |
| Agents / function calling | mixtral-8x7b | Strong instruction-following, longer context |
| High-throughput batch jobs | mistral-7b | Highest tok/s at concurrency 8 (4,820 tok/s) |
| Max quality, zero-shot | llama3-70b | Best MMLU (79.5) and MT-Bench (9.0) scores |
Benchmark numbers you can trust
We ran every model in the Cloudach catalog on our production GKE cluster using vLLM v0.4.2. Here are the numbers that matter for shipping:
Time to first token — concurrency 1
| Model | p50 (ms) | p99 (ms) | Context |
|---|---|---|---|
| mistral-7b | 35 | 79 | 32 K |
| llama3-8b | 38 | 88 | 8 K |
| llama31-8b | 41 | 94 | 128 K |
| mixtral-8x7b | 74 | 163 | 32 K |
| llama3-70b | 142 | 287 | 8 K |
| llama31-70b | 156 | 304 | 128 K |
Quality benchmarks
| Model | MMLU | HumanEval | MT-Bench |
|---|---|---|---|
| mistral-7b | 62.5 | 30.5 | 6.84 |
| llama3-8b | 66.6 | 33.0 | 7.10 |
| codellama-13b | 35.1 | 62.0 | 6.01 |
| mixtral-8x7b | 70.6 | 40.2 | 8.30 |
| llama3-70b | 79.5 | 50.4 | 9.00 |
MMLU = 5-shot accuracy. HumanEval = pass@1 (Python). MT-Bench = GPT-4-as-judge, 1–10 scale. Full methodology in the April 2026 benchmark report.
The 7B vs 70B question
This is the most common question we get from new users. The answer is almost always: start with a 7B/8B model.
Here's why. A 70B model costs 5× more per token than an 8B model on Cloudach. It is also ~4× slower. In most production workloads, you will process far more tokens than you expect — 10 M tokens/month is typical for a small-to-medium SaaS product. At those volumes, the cost difference is not a rounding error; it is the difference between a product that is economically viable and one that is not.
The cases where 70B genuinely wins:
- Multi-step reasoning: complex math, multi-hop question answering, code that requires understanding entire codebases
- Zero-shot performance: you have no training data and cannot fine-tune
- High-stakes accuracy: medical, legal, or financial text where a small factual error has real consequences
- Complex agent loops: planning tasks that require the model to self-correct over many steps
If none of those apply, start with llama3-8b and benchmark your actual task. A fine-tuned 8B model will likely beat a prompted 70B model, and it will do so at one-fifth the per-token cost.
The fine-tuning multiplier
One thing that benchmark tables consistently understate: fine-tuning has a larger effect on real-task quality than moving from 8B to 70B. We have seen this repeatedly in customer deployments:
- A support chatbot fine-tuned on 1,000 examples of ideal responses outperforms a prompted 70B model on domain-specific Q&A in 9 out of 10 customer evaluations.
- A fine-tuned Mistral 7B for SQL generation reliably outperforms a prompted GPT-4-class model when the schema has unusual naming conventions.
- Translation quality on low-resource languages improves dramatically with 500–2,000 parallel examples, regardless of base model size.
The practical recommendation: if you are choosing between “upgrade from 8B to 70B” and “collect 500 examples and fine-tune your 8B model,” choose fine-tuning first. The quality ceiling of a fine-tuned 8B model at task-specific work is higher than most developers expect.
Cloudach supports full fine-tuning and LoRA for all major model families. See the Fine-Tuning Guide for a walkthrough from dataset to deployed adapter in under 30 minutes.
Practical patterns for common architectures
RAG pipelines: use two models
A common mistake in RAG is using the same large model for both retrieval-side reranking and synthesis. A better pattern:
- Retrieval + reranking: mistral-7b (fast, low latency, good enough for relevance scoring)
- Final synthesis: llama3-8b or mixtral-8x7b (only called once per user query, worth spending a bit more on quality)
This hybrid pattern cuts overall cost by 60–70% compared to using a 70B model for both steps, with minimal quality loss on the answer.
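A minimal sketch of the two-model pattern follows. It assumes a `complete(model, prompt)` callable that you supply; that signature, the prompt wording, and the 0 to 10 scoring scheme are all hypothetical choices for illustration and do not describe a specific Cloudach API.

```python
from typing import Callable, List

# Hypothetical signature: (model_name, prompt) -> generated text.
Complete = Callable[[str, str], str]


def answer_with_rag(question: str, passages: List[str], complete: Complete,
                    rerank_model: str = "mistral-7b",
                    synthesis_model: str = "llama3-8b",
                    top_k: int = 3) -> str:
    """Score every passage with the small model, then synthesize once."""
    scored = []
    for passage in passages:
        prompt = (
            "Rate from 0 to 10 how relevant the text is to the question. "
            "Reply with a number only.\n"
            f"Question: {question}\nPassage: {passage}\nScore:"
        )
        reply = complete(rerank_model, prompt)
        try:
            score = float(reply.strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0  # unparseable reply: treat the passage as irrelevant
        scored.append((score, passage))

    # Stable sort on the score only; the highest-scoring passages come first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    context = "\n\n".join(passage for _, passage in scored[:top_k])

    final_prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    # The synthesis model is called exactly once per user query.
    return complete(synthesis_model, final_prompt)
```

The reranking model runs once per candidate passage, so its per-call latency and cost dominate; the synthesis model runs once, which is why it can afford to be the stronger of the two.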
Agents: bigger context, better instruction-following
Agents live or die by instruction-following quality. You need a model that reliably formats tool calls correctly, respects chain-of-thought instructions, and does not hallucinate tool names. Our recommendation is mixtral-8x7b for most agent workloads — it has strong MT-Bench scores (8.30) and a 32 K context window that fits most tool schemas + conversation history without truncation.
For agents that require deeper planning or long codebases in context, use llama3-70b. The quality gap at complex planning is measurable.
Real-time user-facing UX: every millisecond counts
If a human is watching a cursor blink, streaming is non-negotiable. Enable stream: true on every user-facing call: it moves the perceived response start from the full generation time down to TTFT, roughly 35–40 ms at p50 for the small models. Use mistral-7b or llama3-8b for these paths; both clear the 100 ms p99 TTFT bar that users perceive as instant.
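To check what streaming buys you, it helps to record TTFT and total time on the client side. The sketch below works on any iterator of text chunks; the actual streaming client call is not shown because this post does not describe it, so `fake_stream` is a stand-in and `consume_stream` is a hypothetical helper name.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def _show(chunk: str) -> None:
    print(chunk, end="", flush=True)


def consume_stream(chunks: Iterable[str],
                   on_chunk: Callable[[str], None] = _show
                   ) -> Tuple[Optional[float], float, str]:
    """Forward chunks as they arrive; return (ttft_s, total_s, full_text).

    The timer starts when iteration begins, so pass in a lazy stream that
    issues the request on first read, or the measured TTFT will be too low.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk has arrived
        parts.append(chunk)
        on_chunk(chunk)
    total = time.perf_counter() - start
    return ttft, total, "".join(parts)


def fake_stream(delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming response: yields a chunk after each delay."""
    for piece in ["Hello", ", ", "world"]:
        time.sleep(delay_s)
        yield piece


ttft, total, text = consume_stream(fake_stream())
print(f"\nTTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

With streaming the user sees output at `ttft`; without it they wait until `total`.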
Checklist: picking your model
- Latency requirement < 100 ms p99 TTFT → only 7B/8B models qualify
- Context > 8 K tokens → Mistral 7B (32 K), Mixtral 8×7B (32 K), or Llama 3.1 (128 K)
- Primary task is code → start with Code Llama 13B
- Multilingual → Gemma 7B or Mixtral 8×7B
- Have training data → fine-tune first before upgrading model size
- Cost-sensitive + high volume → 7B/8B models, at one-fifth the per-token cost of 70B
- Complex reasoning / zero-shot / high-stakes → 70B models worth the premium
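The checklist can also be written as a small function that reports which items apply to a project. This is only a sketch: `Requirements` and `matching_rules` are hypothetical names, 8 K is treated as 8,000 tokens, and the function deliberately does not resolve conflicts between items, because the checklist itself does not rank them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Requirements:
    p99_ttft_budget_ms: Optional[float] = None  # None = no hard latency budget
    context_tokens: int = 0
    primary_task_is_code: bool = False
    multilingual: bool = False
    has_training_data: bool = False
    cost_sensitive_high_volume: bool = False
    complex_reasoning_or_high_stakes: bool = False


def matching_rules(req: Requirements) -> List[str]:
    """Return the checklist items that apply, in checklist order."""
    hits = []
    if req.p99_ttft_budget_ms is not None and req.p99_ttft_budget_ms < 100:
        hits.append("Latency: only 7B/8B models qualify")
    if req.context_tokens > 8_000:
        hits.append("Context: Mistral 7B (32 K), Mixtral 8x7B (32 K), "
                    "or Llama 3.1 (128 K)")
    if req.primary_task_is_code:
        hits.append("Code: start with Code Llama 13B")
    if req.multilingual:
        hits.append("Multilingual: Gemma 7B or Mixtral 8x7B")
    if req.has_training_data:
        hits.append("Training data: fine-tune before upgrading model size")
    if req.cost_sensitive_high_volume:
        hits.append("Cost: 7B/8B models")
    if req.complex_reasoning_or_high_stakes:
        hits.append("Reasoning: 70B models are worth the premium")
    return hits


for line in matching_rules(Requirements(p99_ttft_budget_ms=80,
                                        context_tokens=20_000,
                                        has_training_data=True)):
    print(line)
```

When several items apply and point at different models, the constraints (latency, context) should narrow the field before the preference items are weighed.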
Start with llama3-8b and escalate intentionally
The default recommendation for new Cloudach projects is llama3-8b. It is fast enough for real-time UX, cheap enough to scale, and capable enough for the majority of production use cases. Build your evaluation suite on 8B, measure the failure modes, and only escalate to a larger model if the data says you should.
Model selection is not a one-time decision. As your product matures and your dataset grows, fine-tuning your chosen base model is almost always a better investment than switching to a larger one.
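An evaluation suite does not need to be elaborate to support the escalate-on-evidence approach. The sketch below assumes a `generate(prompt)` callable that wraps whichever model you are testing and uses a simple substring check as the default grader; both are placeholders for illustration, and most real tasks will need a stricter, task-specific grader.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]           # (prompt, expected answer)
Failure = Tuple[str, str, str]   # (prompt, expected answer, model output)


def contains_expected(output: str, expected: str) -> bool:
    """Default grader: case-insensitive substring match."""
    return expected.lower() in output.lower()


def run_eval(generate: Callable[[str], str], cases: List[Case],
             is_correct: Callable[[str, str], bool] = contains_expected
             ) -> Tuple[float, List[Failure]]:
    """Run every case through the model; return (pass_rate, failures)."""
    failures = []
    for prompt, expected in cases:
        output = generate(prompt)
        if not is_correct(output, expected):
            failures.append((prompt, expected, output))
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures
```

Run the same cases against the 8B model first and read through the failures. Escalate to a larger model, or collect fine-tuning data, only for the failure modes that actually show up.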