ML · Apr 14, 2026

Fine-tune Llama 3 on your own data with Cloudach

The most impactful thing you can do to improve LLM output quality for a specific domain is fine-tuning — not writing longer prompts, not switching models, not RAG alone. Fine-tuning rewires the model to match your data distribution at the weight level. Here is why it matters and how to do it on Cloudach in under 30 minutes.

Why fine-tuning, not just prompting?

Prompting is fast and flexible. But it has a hard ceiling. A general-purpose Llama 3 8B prompted with a system message like “You are a friendly Acme Corp support agent” will still default to generic phrasing, make up product details it was never told, and occasionally refuse or over-hedge in ways that frustrate users.

Fine-tuning solves these problems structurally. When you train on 500 examples of your ideal responses, the model internalises:

  • Your tone: concise and direct, or warm and verbose — the model learns by example, not instruction
  • Your facts: product names, prices, policies — baked into weights, not retrieved at runtime
  • Your format: whether to use bullet points, how to handle unanswerable questions, when to escalate
  • Your refusal boundaries: what's in-scope and out-of-scope for your use case

In our internal benchmarks, a fine-tuned Llama 3 8B consistently outperforms a prompted Llama 3 70B on domain-specific tasks — at one-eighth the inference cost.

LoRA: fine-tuning without training the whole model

Full fine-tuning — updating all 8 billion parameters — is expensive and often unnecessary. LoRA (Low-Rank Adaptation) is a parameter-efficient method that achieves comparable results by training only a tiny fraction of new weights.

The core idea: instead of updating a full weight matrix W (shape d × k), LoRA adds a pair of low-rank matrices A (shape d × r) and B (shape r × k), where the rank r is much smaller than d or k. During training, only A and B are updated. The forward pass becomes:

output = x @ (W + A @ B * scale)
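To make the shapes concrete, here is a minimal pure-Python sketch of that forward pass on tiny integer matrices. The helper names (`matmul`, `add_scaled`, `lora_forward`, `merged_forward`) are made up for illustration, and treating `scale` as alpha / r is an assumption based on common LoRA implementations, not something this post specifies.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(base, delta, scale):
    """Return base + scale * delta, element by element."""
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


def lora_forward(x, W, A, B, scale):
    """x @ W + (x @ A @ B) * scale, without ever building the d x k update."""
    frozen = matmul(x, W)               # (n x d) @ (d x k), W stays untouched
    adapter = matmul(matmul(x, A), B)   # (n x d) @ (d x r) @ (r x k)
    return add_scaled(frozen, adapter, scale)


def merged_forward(x, W, A, B, scale):
    """The formula as written: x @ (W + A @ B * scale)."""
    return matmul(x, add_scaled(W, matmul(A, B), scale))


# Toy sizes: d = 3, k = 2, r = 1. Assumed scale = alpha / r = 2.
x = [[1, 2, 3]]
W = [[1, 0], [0, 1], [2, 2]]
A = [[1], [0], [1]]
B = [[2, 3]]
scale = 2

print(lora_forward(x, W, A, B, scale))    # [[23, 32]]
print(merged_forward(x, W, A, B, scale))  # [[23, 32]]
```

Both paths give the same result. The first never materialises the full d × k update, which is what makes keeping the adapter separate from the base weights cheap.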

For Llama 3 8B with rank 16, the LoRA adapter has roughly 10 million trainable parameters — less than 0.15% of the 8 billion total. Training takes minutes rather than hours, costs a fraction of a full fine-tune, and the adapter is tiny enough (≈ 40 MB) that it can be swapped per-request at inference time.
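The parameter count can be sanity-checked with a little arithmetic. The sketch below uses the published Llama 3 8B attention dimensions (hidden size 4096, 32 layers, 1024-wide key/value projections). Which projections the adapter targets is an assumption here, since this post does not say, so two common choices are shown.

```python
HIDDEN = 4096   # model width
KV_DIM = 1024   # 8 key/value heads x 128 dims (grouped-query attention)
LAYERS = 32
RANK = 16

# (d, k) shape of each attention projection in one transformer layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(d, k, r):
    """A is d x r and B is r x k, so one adapted matrix adds r * (d + k) weights."""
    return r * (d + k)


def adapter_params(targets, rank=RANK):
    """Total trainable parameters when `targets` are adapted in every layer."""
    per_layer = sum(lora_params(*PROJECTIONS[name], rank) for name in targets)
    return per_layer * LAYERS


print(adapter_params(["q_proj", "v_proj"]))                      # 6815744
print(adapter_params(["q_proj", "k_proj", "v_proj", "o_proj"]))  # 13631488
```

Both totals sit on either side of the "roughly 10 million" figure. At 4 bytes per weight, 10 million parameters come to about 40 MB.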

How Cloudach serves LoRA adapters

Cloudach uses vLLM's native LoRA multi-adapter support. When your fine-tuning job completes:

  1. The adapter weights (adapter_config.json + adapter_model.safetensors) are stored in our object store.
  2. On first request, vLLM loads the base model once and registers your adapter. Adapter loading takes < 50 ms — warm thereafter.
  3. Multiple adapters for the same base model share a single base model replica. You pay for base model GPU hours plus a small hosting fee per adapter, not a separate GPU per fine-tune.
  4. vLLM's lora_request mechanism routes each request to the right adapter with zero extra latency compared to the base model.

This architecture means you can maintain dozens of fine-tuned variants — one per customer, one per language, one per product line — all served from the same GPU cluster at the same sub-100ms TTFT we offer on base models.
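As a mental model only, the routing can be pictured like the toy sketch below. This is not vLLM or Cloudach code: `AdapterRouter` is a hypothetical class, and the `base:ft:suffix:job` id layout is inferred from the example model name used later in this post.

```python
class AdapterRouter:
    """Toy model of multi-adapter serving: one shared base, adapters loaded lazily."""

    def __init__(self):
        self.loads = []    # adapter ids fetched so far (stand-in for object-store reads)
        self._cache = {}   # adapter id -> loaded adapter handle

    def resolve(self, model_id):
        """Split a model id into (base model, adapter handle or None)."""
        base, sep, adapter = model_id.partition(":ft:")
        if not sep:
            return base, None              # plain base-model request
        if model_id not in self._cache:
            self.loads.append(model_id)    # first request: load once
            self._cache[model_id] = adapter
        return base, self._cache[model_id]  # later requests: already warm


router = AdapterRouter()
print(router.resolve("llama3-8b:ft:my-model:ftjob-a1b2c3"))
print(router.resolve("llama3-8b:ft:my-model:ftjob-a1b2c3"))
print(router.resolve("llama3-8b"))
print(len(router.loads))  # 1
```

Every request resolves to the same base model, and only the small adapter handle differs per request. That is why extra fine-tunes do not need extra GPUs.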

Choosing base models for fine-tuning

Not every base model is equally good for fine-tuning. Here is our practical guidance:

Model | Best for | LoRA rank recommendation
llama3-8b | Most tasks; best cost/quality ratio for fine-tuning | 16 (start here)
llama3-70b | Complex reasoning, nuanced tone, multilingual | 16–32
llama31-8b | Long-context tasks (RAG, document Q&A) | 16
mistral-7b | Fast inference, European data residency required | 16–32
mixtral-8x7b | High accuracy, mixture-of-experts efficiency | 16 (LoRA only)

Our recommendation for most teams: start with llama3-8b and rank 16. Only move to a larger model if the 8B fine-tune doesn't meet your quality bar, because the size difference has a 5–8× cost impact on training and inference.

What makes a good training dataset

Data quality is the biggest lever. We have seen teams get excellent results from 200 carefully curated examples and poor results from 5,000 noisy ones. A few principles that matter most:

Write assistant turns in your exact production voice

The model learns by example. If 10% of your training examples use bullet points and 90% use prose, the model will be inconsistent. Decide on your format before you start, and apply it uniformly.

Cover your long tail

Happy path examples are easy. The real value comes from edge cases: unanswerable questions, out-of-scope requests, ambiguous inputs, and situations where the right answer is to escalate rather than guess. Include at least 15–20% edge case examples.

Use the same system prompt in training and inference

Include your production system prompt in every training example. The model will learn to condition on it. If you train without a system prompt and then add one at inference, you will get inconsistent results.

Balance your categories

If billing questions make up 40% of your examples but only 5% of real traffic, the model will over-index on billing phrasing. Aim for a category mix that mirrors your real traffic distribution.
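One rough way to spot this kind of skew before training is to compare the category shares in your dataset with the shares in production traffic. The sketch below assumes you keep a category label per example, which is a hypothetical bookkeeping field and not part of the training format. The 10-point tolerance is likewise arbitrary.

```python
from collections import Counter


def category_skew(train_labels, traffic_share, tolerance=0.10):
    """Return categories whose training share differs from traffic share by more than `tolerance`.

    train_labels:  one category label per training example
    traffic_share: category -> fraction of real traffic (fractions sum to about 1.0)
    """
    counts = Counter(train_labels)
    total = sum(counts.values())
    skew = {}
    for category in sorted(set(counts) | set(traffic_share)):
        gap = counts[category] / total - traffic_share.get(category, 0.0)
        if abs(gap) > tolerance:
            skew[category] = round(gap, 2)  # positive = over-represented in training
    return skew


labels = ["billing"] * 40 + ["other"] * 60
print(category_skew(labels, {"billing": 0.05, "other": 0.95}))
```

A positive gap means the category is over-represented in training relative to traffic, and a negative gap means you probably need more examples of it.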

Step-by-step: from data to deployed model

Here is the complete workflow with curl. See the Fine-tuning Tutorial for Python and a guided walkthrough.

1. Prepare your dataset

# Each line: {"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}
# Minimum 100 examples, JSONL format
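Before uploading, it can help to run a quick local check against the format described above. This is a minimal sketch, not Cloudach's server-side validation: `validate_jsonl` is a hypothetical helper, and the rules it enforces (known roles, assistant turn last, one identical system prompt) are drawn from the guidance in this post.

```python
import json

ALLOWED_ROLES = ("system", "user", "assistant")


def validate_jsonl(text, min_examples=100):
    """Return a list of human-readable problems found in JSONL training data."""
    problems = []
    system_prompts = set()
    count = 0
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        count += 1
        try:
            messages = json.loads(line)["messages"]
            roles = [m["role"] for m in messages]
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append(f"line {lineno}: expected an object with a 'messages' list")
            continue
        if not roles or any(role not in ALLOWED_ROLES for role in roles):
            problems.append(f"line {lineno}: unexpected roles {roles}")
        elif roles[-1] != "assistant":
            problems.append(f"line {lineno}: last message should be the assistant turn")
        # Track the system prompt (None when missing) to check it is uniform.
        has_system = bool(roles) and roles[0] == "system"
        system_prompts.add(messages[0].get("content") if has_system else None)
    if count < min_examples:
        problems.append(f"only {count} examples, expected at least {min_examples}")
    if len(system_prompts) > 1:
        problems.append("system prompt is not identical across examples")
    return problems
```

Typical usage would be `validate_jsonl(open("training_data.jsonl", encoding="utf-8").read())`, with an empty list meaning the checks passed.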

2. Upload

curl https://api.cloudach.com/v1/fine-tuning/datasets \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -F "file=@training_data.jsonl" \
  -F "purpose=fine-tune"
# → {"id": "ds-8f3a2b1c", ...}

3. Create job

curl https://api.cloudach.com/v1/fine-tuning/jobs \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "training_file": "ds-8f3a2b1c",
    "method": {"type": "lora", "lora": {"rank": 16, "alpha": 32}},
    "hyperparameters": {"n_epochs": 3},
    "suffix": "my-model"
  }'
# → {"id": "ftjob-a1b2c3", "status": "queued"}
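If you would rather script this step, the same request can be assembled in Python. The sketch below uses only the standard library and mirrors the curl call above field for field. `build_job_request` is a hypothetical helper, not part of any official Cloudach SDK, and it builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.cloudach.com/v1"


def build_job_request(api_key, dataset_id, suffix, rank=16, alpha=32, n_epochs=3):
    """Build (but do not send) the POST request that creates a LoRA fine-tuning job."""
    payload = {
        "model": "llama3-8b",
        "training_file": dataset_id,
        "method": {"type": "lora", "lora": {"rank": rank, "alpha": alpha}},
        "hyperparameters": {"n_epochs": n_epochs},
        "suffix": suffix,
    }
    return urllib.request.Request(
        f"{API_BASE}/fine-tuning/jobs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_job_request("YOUR_API_KEY", "ds-8f3a2b1c", "my-model")
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would submit the job. Keeping construction separate from sending makes the payload easy to inspect or unit-test first.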

4. Monitor + infer

# Poll until status = "succeeded"
curl https://api.cloudach.com/v1/fine-tuning/jobs/ftjob-a1b2c3 \
  -H "Authorization: Bearer $CLOUDACH_API_KEY"

# Run inference
curl https://api.cloudach.com/v1/chat/completions \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b:ft:my-model:ftjob-a1b2c3", "messages": [...]}'
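The polling step is easy to get subtly wrong (no timeout, no handling of failed jobs), so here is a small sketch of a polling loop. Only "queued" and "succeeded" appear in this post. The failure statuses, the 15-second interval and the `fetch_status` callable (whatever function you use to GET the job and return its status string) are assumptions for illustration.

```python
import time

# Assumed terminal failure states; check the API reference for the real list.
TERMINAL_FAILURES = {"failed", "cancelled"}


def wait_for_job(fetch_status, job_id, interval_s=15.0, timeout_s=1800.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch_status(job_id)` until the job succeeds, fails, or times out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status == "succeeded":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout_s} s")
        sleep(interval_s)  # wait before asking again


# Demo with a fake status source instead of real HTTP calls.
fake_statuses = iter(["queued", "running", "succeeded"])
print(wait_for_job(lambda job_id: next(fake_statuses), "ftjob-a1b2c3",
                   sleep=lambda seconds: None))  # succeeded
```

Injecting `sleep` and `clock` keeps the loop testable without waiting in real time. In production you would leave them at their defaults.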

What to expect: timings and cost

For a 500-example dataset, 3 epochs, rank 16 LoRA on Llama 3 8B:

  • Training time: 8–12 minutes
  • Training cost: approximately $0.45 (at $0.003 / 1K training tokens)
  • Adapter hosting: $2/month
  • Inference: same as base model — $0.10 / 1M tokens

A team running 10M inference tokens per month on a fine-tuned 8B model pays roughly $1 for inference plus $2 for adapter hosting, about $3/month in total, versus ~$1,000/month if they were trying to achieve similar quality by switching to a 70B model.
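The arithmetic behind the 8B figures can be reproduced from the listed prices. The sketch below uses exact fractions to avoid floating-point noise. The roughly 100 tokens per example is only what the $0.45 estimate implies, not a figure stated in this post.

```python
from fractions import Fraction

TRAIN_PRICE_PER_1K = Fraction("0.003")  # dollars per 1K training tokens
INFER_PRICE_PER_1M = Fraction("0.10")   # dollars per 1M inference tokens
HOSTING_PER_MONTH = Fraction(2)         # dollars per adapter per month


def implied_training_tokens(total_cost):
    """Training tokens that a given training bill corresponds to."""
    return int(Fraction(total_cost) / TRAIN_PRICE_PER_1K * 1000)


def monthly_cost(inference_tokens):
    """Inference plus adapter hosting for one month, in dollars."""
    return Fraction(inference_tokens, 1_000_000) * INFER_PRICE_PER_1M + HOSTING_PER_MONTH


tokens = implied_training_tokens("0.45")
print(tokens)                        # 150000 training tokens in total
print(tokens // (500 * 3))           # 100 tokens per example per epoch
print(float(monthly_cost(10_000_000)))  # 3.0 dollars per month
```

If your examples are longer than about 100 tokens, training cost scales up proportionally.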

Get started

Fine-tuning is available to all Cloudach users on the Pro plan and above. The quickest path:

  1. Sign up or log in to Cloudach
  2. Download our sample 50-example dataset and run through the tutorial end-to-end
  3. Replace the sample data with your own examples and re-run

Have questions about your specific use case? Email ml@cloudach.com — our ML team reviews every inbound and responds within one business day.


Related

  • Tutorial: Fine-tune Llama 3 step by step
  • Docs: Fine-Tuning API Reference
  • Guide: Data Preparation Guide