Fine-tune Llama 3 on your own data with Cloudach
The most impactful thing you can do to improve LLM output quality for a specific domain is fine-tuning — not writing longer prompts, not switching models, not RAG alone. Fine-tuning rewires the model to match your data distribution at the weight level. Here is why it matters and how to do it on Cloudach in under 30 minutes.
Why fine-tuning, not just prompting?
Prompting is fast and flexible. But it has a hard ceiling. A general-purpose Llama 3 8B prompted with a system message like “You are a friendly Acme Corp support agent” will still default to generic phrasing, make up product details it was never told, and occasionally refuse or over-hedge in ways that frustrate users.
Fine-tuning solves these problems structurally. When you train on 500 examples of your ideal responses, the model internalises:
- Your tone: concise and direct, or warm and verbose — the model learns by example, not instruction
- Your facts: product names, prices, policies — baked into weights, not retrieved at runtime
- Your format: whether to use bullet points, how to handle unanswerable questions, when to escalate
- Your refusal boundaries: what's in-scope and out-of-scope for your use case
In our internal benchmarks, a fine-tuned Llama 3 8B consistently outperforms a prompted Llama 3 70B on domain-specific tasks — at one-eighth the inference cost.
LoRA: fine-tuning without training the whole model
Full fine-tuning, which updates all 8 billion parameters, is expensive and often unnecessary. LoRA (Low-Rank Adaptation) is a parameter-efficient method that achieves comparable results by freezing the original weights and training a small set of added ones.
The core idea: instead of updating a full weight matrix W (shape d × k), LoRA adds a pair of low-rank matrices A (shape d × r) and B (shape r × k), where the rank r is much smaller than d or k. During training, W stays frozen and only A and B are updated. The forward pass becomes:
output = x @ (W + A @ B * scale)
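As a minimal sketch of that forward pass, the plain-Python example below uses toy shapes (d = 3, k = 2, r = 1) and assumes the common convention scale = alpha / r. The helper names are illustrative only. It also shows that running the adapter as a separate low-rank path gives the same output as merging A @ B into W ahead of time.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (x @ A) @ B * scale without ever forming A @ B."""
    scale = alpha / r  # assumed convention: scale = alpha / rank
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


def merged_weight(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W + A @ B * scale."""
    scale = alpha / r
    low_rank = matmul(A, B)
    return [[w + scale * ab for w, ab in zip(w_row, ab_row)]
            for w_row, ab_row in zip(W, low_rank)]


# Toy example: d = 3 inputs, k = 2 outputs, rank r = 1.
x = [[1.0, 2.0, 3.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # frozen base weight, d x k
A = [[1.0], [0.0], [2.0]]                  # trainable, d x r
B = [[0.5, -1.0]]                          # trainable, r x k

unmerged = lora_forward(x, W, A, B, alpha=2, r=1)
merged = matmul(x, merged_weight(W, A, B, alpha=2, r=1))
print(unmerged, merged)
```

Keeping the adapter as a separate path is what makes per-request swapping possible, since W never changes.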
For Llama 3 8B with rank 16, the LoRA adapter has roughly 10 million trainable parameters — less than 0.15% of the 8 billion total. Training takes minutes rather than hours, costs a fraction of a full fine-tune, and the adapter is tiny enough (≈ 40 MB) that it can be swapped per-request at inference time.
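The parameter count can be sanity-checked with a little arithmetic. The sketch below uses the published Llama 3 8B attention shapes. Which projections the adapter targets is an assumption here (the post does not say), so two common choices are shown, and both land in the same range as the roughly 10 million figure.

```python
# Llama 3 8B attention shapes from the published model config.
HIDDEN = 4096      # model width
KV_DIM = 1024      # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32

# (input dim, output dim) of each attention projection.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(d, k, r):
    """Parameters added by one LoRA pair: A is d x r, B is r x k."""
    return r * (d + k)


def adapter_params(targets, r, num_layers=NUM_LAYERS):
    """Total trainable parameters when `targets` are adapted in every layer."""
    per_layer = sum(lora_params(*PROJECTIONS[name], r) for name in targets)
    return per_layer * num_layers


def adapter_megabytes(num_params, bytes_per_param):
    """Approximate on-disk size, ignoring file metadata."""
    return num_params * bytes_per_param / 1e6


# Assumption: the target-module choice is illustrative, not Cloudach's setting.
qv_only = adapter_params(["q_proj", "v_proj"], r=16)
all_attention = adapter_params(list(PROJECTIONS), r=16)

print(f"q/v only:      {qv_only:,} params")
print(f"all attention: {all_attention:,} params")
print(f"size at 16-bit: {adapter_megabytes(all_attention, 2):.1f} MB")
print(f"size at 32-bit: {adapter_megabytes(all_attention, 4):.1f} MB")
```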
How Cloudach serves LoRA adapters
Cloudach uses vLLM's native LoRA multi-adapter support. When your fine-tuning job completes:
- The adapter weights (adapter_config.json + adapter_model.safetensors) are stored in our object store.
- On first request, vLLM loads the base model once and registers your adapter. Adapter loading takes < 50 ms; it stays warm thereafter.
- Multiple adapters for the same base model share a single base model replica. You pay for base model GPU hours plus a small hosting fee per adapter, not a separate GPU per fine-tune.
- vLLM's lora_request mechanism routes each request to the right adapter with zero extra latency compared to the base model.
This architecture means you can maintain dozens of fine-tuned variants — one per customer, one per language, one per product line — all served from the same GPU cluster at the same sub-100ms TTFT we offer on base models.
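To make the sharing model concrete, here is a small illustrative sketch, not Cloudach's or vLLM's actual routing code. It assumes fine-tuned model IDs follow the base:ft:suffix:job pattern seen in the inference example later in this post, and the second adapter name is made up. Requests are grouped by base model, so many adapters map onto one replica.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRef:
    base: str
    adapter: Optional[str] = None   # the "suffix" chosen at job creation
    job_id: Optional[str] = None


def parse_model_id(model_id):
    """Split a model ID into base model and optional adapter (assumed format)."""
    parts = model_id.split(":")
    if len(parts) == 1:
        return ModelRef(base=parts[0])
    if len(parts) == 4 and parts[1] == "ft":
        return ModelRef(base=parts[0], adapter=parts[2], job_id=parts[3])
    raise ValueError(f"unrecognised model id: {model_id!r}")


def group_by_base(model_ids):
    """Map each base model to the adapters that would share its replica."""
    replicas = {}
    for model_id in model_ids:
        ref = parse_model_id(model_id)
        adapters = replicas.setdefault(ref.base, [])
        if ref.adapter is not None and ref.adapter not in adapters:
            adapters.append(ref.adapter)
    return replicas


incoming = [
    "llama3-8b",
    "llama3-8b:ft:my-model:ftjob-a1b2c3",
    "llama3-8b:ft:acme-de:ftjob-d4e5f6",   # hypothetical second adapter
]
replicas = group_by_base(incoming)
print(replicas)
```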
Choosing base models for fine-tuning
Not every base model is equally good for fine-tuning. Here is our practical guidance:
| Model | Best for | LoRA rank recommendation |
|---|---|---|
| llama3-8b | Most tasks; best cost/quality ratio for fine-tuning | 16 (start here) |
| llama3-70b | Complex reasoning, nuanced tone, multilingual | 16–32 |
| llama31-8b | Long-context tasks (RAG, document Q&A) | 16 |
| mistral-7b | Fast inference, European data residency required | 16–32 |
| mixtral-8x7b | High accuracy, mixture-of-experts efficiency | 16 (LoRA only) |
Our recommendation for most teams: start with llama3-8b and rank 16. Only move to a larger model if the 8B fine-tune doesn't meet your quality bar, because the size difference has a 5–8× cost impact on training and inference.
What makes a good training dataset
Data quality is the biggest lever. We have seen teams get excellent results from 200 carefully curated examples and poor results from 5,000 noisy ones. A few principles that matter most:
Write assistant turns in your exact production voice
The model learns by example. If 10% of your training examples use bullet points and 90% use prose, the model will be inconsistent. Decide on your format before you start, and apply it uniformly.
Cover your long tail
Happy path examples are easy. The real value comes from edge cases: unanswerable questions, out-of-scope requests, ambiguous inputs, and situations where the right answer is to escalate rather than guess. Edge cases should make up at least 15–20% of your examples.
Use the same system prompt in training and inference
Include your production system prompt in every training example. The model will learn to condition on it. If you train without a system prompt and then add one at inference, you will get inconsistent results.
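A quick pre-upload check can catch system prompt drift. The sketch below assumes the chat-style JSONL format used in the walkthrough (one object per line with a messages list). The function name and the sample conversations are made up for illustration.

```python
import json


def check_dataset(lines, min_examples=100):
    """Return a list of human-readable problems found in JSONL training lines."""
    problems = []
    system_prompts = set()
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        count += 1
        try:
            messages = json.loads(line)["messages"]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append(f"line {number}: not a JSON object with a 'messages' key")
            continue
        if not isinstance(messages, list) or not all(isinstance(m, dict) for m in messages):
            problems.append(f"line {number}: 'messages' is not a list of objects")
            continue
        roles = [m.get("role") for m in messages]
        if roles[:1] == ["system"]:
            system_prompts.add(str(messages[0].get("content")))
        else:
            problems.append(f"line {number}: first message is not a system prompt")
        if roles[-1:] != ["assistant"]:
            problems.append(f"line {number}: last message is not an assistant turn")
    if len(system_prompts) > 1:
        problems.append(f"{len(system_prompts)} different system prompts found")
    if count < min_examples:
        problems.append(f"only {count} examples, expected at least {min_examples}")
    return problems


def example(system_prompt):
    """Build one made-up training line with the given system prompt."""
    return json.dumps({"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings, then Security."},
    ]})


good = example("You are a friendly Acme Corp support agent")
drifted = example("You are a helpful assistant")
print(check_dataset([good, drifted], min_examples=2))
```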
Balance your categories
If billing questions make up 40% of your examples but only 5% of real traffic, the model will over-index on billing phrasing. Aim for class balance that reflects real usage distribution.
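One way to check balance is to compare each category's share of the dataset with its share of real traffic. The sketch below assumes you keep a category label per example outside the training file. The labels, traffic shares, and the 10-point tolerance are all hypothetical.

```python
from collections import Counter


def category_skew(dataset_labels, traffic_share, tolerance=0.10):
    """Return categories whose dataset share is off from traffic by more than tolerance."""
    counts = Counter(dataset_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dataset_labels is empty")
    skewed = {}
    for category in sorted(set(counts) | set(traffic_share)):
        dataset_share = counts[category] / total
        expected = traffic_share.get(category, 0.0)
        if abs(dataset_share - expected) > tolerance:
            skewed[category] = (round(dataset_share, 2), expected)
    return skewed


# Hypothetical labels: billing is 40% of the dataset but 5% of traffic.
labels = ["billing"] * 40 + ["shipping"] * 35 + ["returns"] * 25
traffic = {"billing": 0.05, "shipping": 0.40, "returns": 0.55}

for name, (in_dataset, in_traffic) in category_skew(labels, traffic).items():
    print(f"{name}: {in_dataset:.0%} of dataset vs {in_traffic:.0%} of traffic")
```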
Step-by-step: from data to deployed model
Here is the complete workflow with curl. See the Fine-tuning Tutorial for Python and a guided walkthrough.
1. Prepare your dataset
# Each line: {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
# Minimum 100 examples, JSONL format
2. Upload
curl https://api.cloudach.com/v1/fine-tuning/datasets \
-H "Authorization: Bearer $CLOUDACH_API_KEY" \
-F "file=@training_data.jsonl" \
-F "purpose=fine-tune"
# → {"id": "ds-8f3a2b1c", ...}
3. Create job
curl https://api.cloudach.com/v1/fine-tuning/jobs \
-H "Authorization: Bearer $CLOUDACH_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3-8b",
"training_file": "ds-8f3a2b1c",
"method": {"type": "lora", "lora": {"rank": 16, "alpha": 32}},
"hyperparameters": {"n_epochs": 3},
"suffix": "my-model"
}'
# → {"id": "ftjob-a1b2c3", "status": "queued"}
4. Monitor + infer
# Poll until status = "succeeded"
curl https://api.cloudach.com/v1/fine-tuning/jobs/ftjob-a1b2c3 \
-H "Authorization: Bearer $CLOUDACH_API_KEY"
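The polling step can also be scripted. The sketch below is a generic polling loop with a timeout. It takes any function that returns the job's current status, so the HTTP call itself is left to you. Only "queued" and "succeeded" appear in this post; the "running", "failed", and "cancelled" status names are assumptions.

```python
import time

# Assumption: only "succeeded" is confirmed by the post; the others are guesses.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, interval_s=15.0, timeout_s=1800.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} s")
        sleep(interval_s)


# Stand-in for the real GET request, so the loop can be exercised offline.
fake_responses = iter(["queued", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_responses), interval_s=0.0)
print(final_status)
```

In real use, fetch_status would wrap the GET request shown above and return the status field from the response.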
# Run inference
curl https://api.cloudach.com/v1/chat/completions \
-H "Authorization: Bearer $CLOUDACH_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "llama3-8b:ft:my-model:ftjob-a1b2c3", "messages": [...]}'
What to expect: timings and cost
For a 500-example dataset, 3 epochs, rank 16 LoRA on Llama 3 8B:
- Training time: 8–12 minutes
- Training cost: approximately $0.45 (at $0.003 / 1K training tokens)
- Adapter hosting: $2/month
- Inference: same as base model — $0.10 / 1M tokens
A team running 10M inference tokens per month on a fine-tuned 8B model pays roughly $1 for inference plus $2 for adapter hosting, about $3/month in total, versus ~$1,000/month if they were trying to achieve similar quality by switching to a 70B model.
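The arithmetic behind these figures is easy to reproduce. In the sketch below, the prices come from the list above. The 100 tokens per example and the assumption that every epoch is billed are chosen only because they reproduce the $0.45 training figure; your own examples may be longer.

```python
TRAIN_PRICE_PER_1K = 0.003    # dollars per 1K training tokens
INFER_PRICE_PER_1M = 0.10     # dollars per 1M inference tokens
ADAPTER_HOSTING = 2.00        # dollars per month


def training_cost(examples, tokens_per_example, epochs):
    """Training cost in dollars, assuming every epoch's tokens are billed."""
    billed_tokens = examples * tokens_per_example * epochs
    return billed_tokens / 1_000 * TRAIN_PRICE_PER_1K


def monthly_cost(inference_tokens):
    """Monthly inference plus adapter hosting, in dollars."""
    return inference_tokens / 1_000_000 * INFER_PRICE_PER_1M + ADAPTER_HOSTING


# Assumption: 100 tokens per example is illustrative, not a measured average.
print(f"training: ${training_cost(500, 100, 3):.2f}")
print(f"monthly:  ${monthly_cost(10_000_000):.2f}")
```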
Get started
Fine-tuning is available to all Cloudach users on the Pro plan and above. The quickest path:
- Sign up or log in to Cloudach
- Download our sample 50-example dataset and run through the tutorial end-to-end
- Replace the sample data with your own examples and re-run
Have questions about your specific use case? Email ml@cloudach.com — our ML team reviews every inbound and responds within one business day.