ML · Apr 14, 2026

How to choose the right open-source LLM

There are now dozens of capable open-source models, including Mistral 7B, Llama 3 8B, Llama 3 70B, Mixtral 8×7B, and Code Llama, and the list grows every month. The good news: most production use cases map cleanly to one model family. This guide gives you a decision framework so you stop A/B testing models indefinitely and ship.

Start with the constraint, not the benchmark

Most developers open a leaderboard, sort by MMLU, and pick the top result. That is the wrong starting point. Benchmark rankings summarize a model's aggregate capability on a fixed set of academic tasks; they say nothing about whether the model fits your latency budget, token budget, or the specific task you are actually building.

Instead, start by answering three questions:

What is my latency requirement? Is there a human waiting for the response? If yes, anything over 150 ms TTFT feels slow. 70B models clock in at 142–156 ms TTFT p50 on our cluster — already at the edge.
What is my context window requirement? RAG and chat apps that stay under 8 K tokens have 6 models to choose from. Long-document tasks need Mistral 7B or Mixtral 8×7B (32 K), or Llama 3.1 (128 K).
Can I fine-tune? A fine-tuned 8B model routinely beats a prompted 70B model on domain tasks. If you have even 500 labeled examples, fine-tuning usually wins over model-switching.

Only after you have eliminated models that can't satisfy these constraints should you look at quality benchmarks to pick between what's left.
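
The elimination step can be sketched as a simple filter. This is a minimal illustration, not Cloudach tooling: the Model class and the shortlist function are hypothetical names, and the figures are copied from the benchmark tables later in this guide (MMLU is left as None where the quality table has no entry).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    name: str
    ttft_p99_ms: int        # time to first token, p99, concurrency 1
    context_k: int          # context window in thousands of tokens
    mmlu: Optional[float]   # None where no score is published in this guide


# Figures taken from the tables in this guide.
CATALOG = [
    Model("mistral-7b", 79, 32, 62.5),
    Model("llama3-8b", 88, 8, 66.6),
    Model("llama31-8b", 94, 128, None),
    Model("mixtral-8x7b", 163, 32, 70.6),
    Model("llama3-70b", 287, 8, 79.5),
    Model("llama31-70b", 304, 128, None),
]


def shortlist(max_ttft_p99_ms: int, min_context_k: int) -> List[Model]:
    """Drop models that break a hard constraint, then rank survivors by MMLU."""
    survivors = [
        m for m in CATALOG
        if m.ttft_p99_ms <= max_ttft_p99_ms and m.context_k >= min_context_k
    ]
    # Models with a known score come first, highest score at the top.
    return sorted(survivors, key=lambda m: (m.mmlu is None, -(m.mmlu or 0.0)))


if __name__ == "__main__":
    for m in shortlist(max_ttft_p99_ms=100, min_context_k=8):
        print(m.name, m.mmlu)
```

Constraints run first and quality only breaks ties among the survivors, which mirrors the order of operations described above.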

The decision tree

Walk through this tree top-to-bottom. Stop at the first branch that matches your situation.

What is your primary requirement?
│
├─ Lowest latency / highest throughput?
│   └─ → mistral-7b  (35 ms TTFT p50, 1,560 tok/s)
│
├─ Best quality for general English tasks?
│   ├─ Budget: low → llama3-8b
│   └─ Budget: flexible → llama3-70b or mixtral-8x7b
│
├─ Code generation or debugging?
│   └─ → codellama-13b
│
├─ Long context window (>8 K tokens)?
│   ├─ Up to 32 K → mistral-7b or mixtral-8x7b
│   └─ Up to 128 K → llama31-8b or llama31-70b
│
├─ Best quality regardless of cost?
│   └─ → llama3-70b or llama31-70b
│
└─ Mixed workload (quality + reasonable speed)?
    └─ → mixtral-8x7b
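
If you want the tree in a form you can call from a config script, one possible encoding is below. The requirement labels ("latency", "general_quality", and so on) and the pick_model function are hypothetical names chosen for this sketch; only the model identifiers come from the tree.

```python
from typing import List


def pick_model(requirement: str, budget: str = "low", context_k: int = 8) -> List[str]:
    """Return the candidate models for the first matching branch of the tree."""
    if requirement == "latency":
        return ["mistral-7b"]
    if requirement == "general_quality":
        if budget == "low":
            return ["llama3-8b"]
        return ["llama3-70b", "mixtral-8x7b"]
    if requirement == "code":
        return ["codellama-13b"]
    if requirement == "long_context":
        if context_k <= 32:
            return ["mistral-7b", "mixtral-8x7b"]
        if context_k <= 128:
            return ["llama31-8b", "llama31-70b"]
        raise ValueError("no model in the tree covers more than 128 K tokens")
    if requirement == "max_quality":
        return ["llama3-70b", "llama31-70b"]
    if requirement == "mixed":
        return ["mixtral-8x7b"]
    raise ValueError(f"unknown requirement: {requirement!r}")


if __name__ == "__main__":
    print(pick_model("long_context", context_k=64))
```

The branches are checked in the same top-to-bottom order as the tree, so the first match wins.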

Use case matrix

Here is how common production use cases map to model choices, based on our benchmarks and customer deployments.

Use case	Recommended	Why
Customer support chatbot	`llama3-8b`	Fast, low cost, handles common Q&A well
Code generation / review	`codellama-13b`	Purpose-trained on code, strong fill-in-middle
Document summarisation	`llama3-8b`	Short-context summaries stay within 8 K limit
Long-doc summarisation (>8 K)	`llama31-8b`	128 K context, comparable speed to llama3-8b
Translation	`gemma-7b`	Trained with multilingual data, compact footprint
RAG pipeline	`mistral-7b`	Low latency for fast retrieval → response cycles
Agents / function calling	`mixtral-8x7b`	Strong instruction-following, longer context
High-throughput batch jobs	`mistral-7b`	Highest tok/s at concurrency 8 (4,820 tok/s)
Max quality, zero-shot	`llama3-70b`	Best MMLU (79.5) and MT-Bench (9.0) scores

Benchmark numbers you can trust

We ran every model in the Cloudach catalog on our production GKE cluster using vLLM v0.4.2. Here are the numbers that matter for shipping:

Time to first token — concurrency 1

Model	p50 (ms)	p99 (ms)	Context
`mistral-7b`	35	79	32 K
`llama3-8b`	38	88	8 K
`llama31-8b`	41	94	128 K
`mixtral-8x7b`	74	163	32 K
`llama3-70b`	142	287	8 K
`llama31-70b`	156	304	128 K

Quality benchmarks

Model	MMLU	HumanEval	MT-Bench
`mistral-7b`	62.5	30.5	6.84
`llama3-8b`	66.6	33.0	7.10
`codellama-13b`	35.1	62.0	6.01
`mixtral-8x7b`	70.6	40.2	8.30
`llama3-70b`	79.5	50.4	9.00

MMLU = 5-shot accuracy. HumanEval = pass@1 (Python). MT-Bench = GPT-4-as-judge, 1–10 scale. Full methodology in the April 2026 benchmark report.

The 7B vs 70B question

This is the most common question we get from new users. The answer is almost always: start with a 7B/8B model.

Here's why. A 70B model costs 5× more per token than an 8B model on Cloudach. It is also ~4× slower. In most production workloads, you will process far more tokens than you expect — 10 M tokens/month is typical for a small-to-medium SaaS product. At those volumes, the cost difference is not a rounding error; it is the difference between a product that is economically viable and one that is not.
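
To make the arithmetic concrete, here is a small sketch of how the gap scales with volume. The unit price is a placeholder (1.0 abstract unit per million tokens), not a real Cloudach price; only the 5× ratio and the 10 M tokens/month figure come from the text above. Substitute your own price to get currency amounts.

```python
from typing import Dict

PRICE_8B_PER_MILLION = 1.0  # placeholder unit price, NOT an actual Cloudach rate
COST_RATIO_70B = 5.0        # 70B vs 8B per-token cost ratio quoted in this guide


def monthly_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Cost of processing a monthly token volume at a given per-million price."""
    return tokens_per_month / 1_000_000 * price_per_million


def compare(tokens_per_month: int) -> Dict[str, float]:
    """Monthly spend on 8B and 70B, plus the extra paid for choosing 70B."""
    small = monthly_cost(tokens_per_month, PRICE_8B_PER_MILLION)
    large = monthly_cost(tokens_per_month, PRICE_8B_PER_MILLION * COST_RATIO_70B)
    return {"8b": small, "70b": large, "extra": large - small}


if __name__ == "__main__":
    print(compare(10_000_000))
```

Because the ratio is fixed, the extra spend is always four times the 8B bill, so it grows linearly with traffic.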

The cases where 70B genuinely wins:

Multi-step reasoning: complex math, multi-hop question answering, code that requires understanding entire codebases
Zero-shot performance: you have no training data and cannot fine-tune
High-stakes accuracy: medical, legal, or financial text where a small factual error has real consequences
Complex agent loops: planning tasks that require the model to self-correct over many steps

If none of those apply, start with llama3-8b and benchmark your actual task. A fine-tuned 8B model will likely beat a prompted 70B model, and it will do so at one-fifth the per-token cost.

The fine-tuning multiplier

One thing that benchmark tables consistently understate: fine-tuning has a larger effect on real-task quality than moving from 8B to 70B. We have seen this repeatedly in customer deployments:

A support chatbot fine-tuned on 1,000 examples of ideal responses outperforms a prompted 70B model on domain-specific Q&A in 9 out of 10 customer evaluations.
A fine-tuned Mistral 7B for SQL generation reliably outperforms a prompted GPT-4-class model when the schema has unusual naming conventions.
Translation quality on low-resource languages improves dramatically with 500–2,000 parallel examples, regardless of base model size.

The practical recommendation: if you are choosing between “upgrade from 8B to 70B” and “collect 500 examples and fine-tune your 8B model,” choose fine-tuning first. The quality ceiling of a fine-tuned 8B model at task-specific work is higher than most developers expect.

Cloudach supports full fine-tuning and LoRA for all major model families. See the Fine-Tuning Guide for a walkthrough from dataset to deployed adapter in under 30 minutes.

Practical patterns for common architectures

RAG pipelines: use two models

A common mistake in RAG is using the same large model for both retrieval-side reranking and synthesis. A better pattern:

Retrieval + reranking: mistral-7b — fast, low latency, good enough for relevance scoring
Final synthesis: llama3-8b or mixtral-8x7b — only called once per user query, worth spending a bit more on quality

This hybrid pattern cuts overall cost by 60–70% compared to using a 70B model for both steps, with minimal quality loss on the answer.
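
A minimal sketch of the two-model split follows. It assumes a hypothetical complete(model, prompt) callable that wraps whatever client you use; the prompts, the 0–10 scoring convention, and the answer function are illustrative choices, not part of the Cloudach API.

```python
from typing import Callable, List, Tuple

# Hypothetical client signature: (model_name, prompt) -> generated text.
Complete = Callable[[str, str], str]


def answer(
    query: str,
    passages: List[str],
    complete: Complete,
    rerank_model: str = "mistral-7b",
    synth_model: str = "llama3-8b",
    top_k: int = 3,
) -> str:
    """Score every passage with the small model, synthesize once with the other."""
    scored: List[Tuple[float, str]] = []
    for passage in passages:
        prompt = (
            "Rate from 0 to 10 how relevant the passage is to the question.\n"
            f"Question: {query}\nPassage: {passage}\nScore:"
        )
        raw = complete(rerank_model, prompt)
        try:
            score = float(raw.strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0  # treat unparseable output as irrelevant
        scored.append((score, passage))

    # Keep the top_k highest-scoring passages for the final prompt.
    best = [p for _, p in sorted(scored, key=lambda sp: sp[0], reverse=True)[:top_k]]
    context = "\n\n".join(best)
    return complete(
        synth_model,
        f"Answer using only this context:\n{context}\n\nQuestion: {query}",
    )
```

The cheap model is called once per candidate passage and the synthesis model exactly once per query, which is where the saving comes from.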

Agents: bigger context, better instruction-following

Agents live or die by instruction-following quality. You need a model that reliably formats tool calls correctly, respects chain-of-thought instructions, and does not hallucinate tool names. Our recommendation is mixtral-8x7b for most agent workloads — it has strong MT-Bench scores (8.30) and a 32 K context window that fits most tool schemas + conversation history without truncation.

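For agents that require deeper planning, use llama3-70b. The quality gap at complex planning is measurable. If the agent also needs long codebases in context, note that llama3-70b is limited to 8 K tokens, so llama31-70b (128 K) is the better fit.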

Real-time user-facing UX: every millisecond counts

If a human is watching a cursor blink, streaming is non-negotiable. Enable stream: true on every user-facing call: it moves the perceived response start from the end of generation to the first token, roughly 35–40 ms at p50 on the small models. Use mistral-7b or llama3-8b for these paths; they both clear the 100 ms p99 TTFT bar that users perceive as instant.
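
To check what streaming buys you on your own traffic, you can time the first chunk separately from the full response. The sketch below works on any iterable of text chunks; fake_stream is a stand-in generator with made-up delays so the example runs without a network call, and consume_stream is a hypothetical helper name.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def consume_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Return (full_text, seconds_to_first_chunk, seconds_to_last_chunk)."""
    start = time.perf_counter()
    first = None
    parts: List[str] = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start  # what the user perceives
        parts.append(chunk)
    total = time.perf_counter() - start          # what a non-streaming call costs
    return "".join(parts), (first if first is not None else total), total


def fake_stream(tokens: List[str], ttft_s: float = 0.04,
                per_token_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a streaming response; the delays are illustrative only."""
    time.sleep(ttft_s)
    for token in tokens:
        yield token
        time.sleep(per_token_s)


if __name__ == "__main__":
    text, first, total = consume_stream(fake_stream(["Hello", ",", " world"]))
    print(f"{text!r}: first chunk after {first * 1000:.0f} ms, done after {total * 1000:.0f} ms")
```

In a real integration you would pass the chunk iterator returned by your client in place of fake_stream.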

Checklist: picking your model

Latency requirement < 100 ms p99 TTFT → only 7B/8B models qualify
Context > 8 K tokens → Mistral 7B (32 K), Mixtral 8×7B (32 K), or Llama 3.1 (128 K)
Primary task is code → start with Code Llama 13B
Multilingual → Gemma 7B or Mixtral 8×7B
Have training data → fine-tune first before upgrading model size
Cost-sensitive + high volume → 7B/8B models at 5× lower cost than 70B
Complex reasoning / zero-shot / high-stakes → 70B models worth the premium

Start with llama3-8b and escalate intentionally

The default recommendation for new Cloudach projects is llama3-8b. It is fast enough for real-time UX, cheap enough to scale, and capable enough for the majority of production use cases. Build your evaluation suite on 8B, measure the failure modes, and only escalate to a larger model if the data says you should.
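
One way to make "escalate only if the data says so" operational is a small ladder runner. Everything here is an assumption made for illustration: the ladder order (with mixtral-8x7b as the middle rung), the 0.9 pass-rate target, and the run and passes callables, which you would supply from your own client and grading logic.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Assumed escalation order; adjust to the models you are willing to pay for.
LADDER = ["llama3-8b", "mixtral-8x7b", "llama3-70b"]

Case = Tuple[str, str]  # (prompt, expected output)


def pass_rate(model: str, cases: Sequence[Case],
              run: Callable[[str, str], str],
              passes: Callable[[str, str], bool]) -> float:
    """Fraction of evaluation cases the model passes."""
    if not cases:
        raise ValueError("evaluation suite is empty")
    ok = sum(1 for prompt, expected in cases if passes(run(model, prompt), expected))
    return ok / len(cases)


def choose(cases: Sequence[Case],
           run: Callable[[str, str], str],
           passes: Callable[[str, str], bool],
           target: float = 0.9) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Walk the ladder and stop at the first model that meets the target."""
    results: List[Tuple[str, float]] = []
    for model in LADDER:
        rate = pass_rate(model, cases, run, passes)
        results.append((model, rate))
        if rate >= target:
            return model, results
    return None, results  # nothing met the target; inspect the failures
```

The returned results list keeps the pass rate of every model tried, so a miss on the smallest model can be weighed against fine-tuning before paying for a larger one.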

Model selection is not a one-time decision. As your product matures and your dataset grows, fine-tuning your chosen base model is almost always a better investment than switching to a larger one.
