Beginner · ~30 min

Fine-tune Llama 3 on your own data

In this tutorial you'll fine-tune Llama 3 8B on a customer support dataset using LoRA. By the end you'll have a custom model that replies in your brand's voice, deployed and ready for inference on Cloudach's API.

Overview

What you'll do:

  1. Prepare a JSONL training dataset with chat-format examples
  2. Upload the dataset to Cloudach
  3. Launch a LoRA fine-tuning job on Llama 3 8B
  4. Monitor training loss in real time
  5. Run inference on your deployed custom model
  6. Evaluate the results against the base model

You'll need a Cloudach API key. Sign up free (no credit card required).

Fine-tuning with LoRA trains only a small set of adapter weights (≈ 0.1–1% of model parameters) rather than the full model. This is faster and cheaper than full fine-tuning, and for domain adaptation tasks it often achieves equal or better results. Cloudach serves the adapter on top of the base model using vLLM, so you get the same sub-100ms latency as the base model.
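
To make the ≈ 0.1–1% figure concrete, here is a rough parameter count. Treat it as a sketch under assumptions: the shapes are those of the published Llama 3 8B architecture, the total of about 8.03 billion parameters is approximate, and this tutorial does not say which weight matrices Cloudach adapts, so applying LoRA to the four attention projections is a hypothetical choice.

```python
# Rough LoRA adapter size for Llama 3 8B.
# Assumption: adapters on the four attention projections only (q, k, v, o);
# the modules Cloudach actually targets are not documented in this tutorial.

HIDDEN = 4096          # model hidden size
KV_DIM = 1024          # key/value projection width (8 KV heads x 128 dims)
N_LAYERS = 32          # transformer blocks
TOTAL_PARAMS = 8.03e9  # approximate parameter count of the base model


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two matrices: (rank x d_in) and (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_size(rank: int) -> int:
    """Total trainable adapter parameters across all layers."""
    per_layer = (
        lora_params(HIDDEN, HIDDEN, rank)    # q_proj
        + lora_params(HIDDEN, KV_DIM, rank)  # k_proj
        + lora_params(HIDDEN, KV_DIM, rank)  # v_proj
        + lora_params(HIDDEN, HIDDEN, rank)  # o_proj
    )
    return per_layer * N_LAYERS


trainable = adapter_size(rank=16)
print(f"{trainable:,} trainable parameters "
      f"({trainable / TOTAL_PARAMS:.2%} of the base model)")
```

Under these assumptions, rank 16 gives about 13.6 million trainable parameters, roughly 0.17% of the model. Adapting more matrices or raising the rank moves the figure toward the upper end of the range.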

Prerequisites

  • A Cloudach account and API key (sign up)
  • Python 3.9+ or a terminal with curl
  • At least 100 training examples in JSONL format for a real fine-tune (the 50-example sample below is enough to test the workflow)

Step 1 — Prepare your dataset

Training data must be a .jsonl file where each line is a JSON object with a messages array. The format is identical to the OpenAI fine-tuning format.

Example line (wrapped here for readability; in the .jsonl file each object must sit on a single line)

{"messages": [
  {"role": "system", "content": "You are a helpful support agent for Acme Corp."},
  {"role": "user",   "content": "How do I reset my password?"},
  {"role": "assistant", "content": "Go to Settings → Security → Reset Password. You'll receive a reset link by email within 2 minutes."}
]}
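
If you are generating examples from your own records, serializing each one with json.dumps is a simple way to guarantee one object per line. The following is a minimal sketch; the helper names make_example and write_jsonl and the output filename my_examples.jsonl are illustrative, not part of the Cloudach API.

```python
import json


def make_example(system: str, user: str, assistant: str) -> dict:
    """Build one chat-format training example."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }


def write_jsonl(examples: list[dict], path: str) -> None:
    """Write one JSON object per line.

    json.dumps escapes newlines inside strings, so a multi-line reply
    still ends up on a single line of the file.
    """
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


examples = [
    make_example(
        "You are a helpful support agent for Acme Corp.",
        "How do I reset my password?",
        "Go to Settings → Security → Reset Password.\n"
        "You'll receive a reset link by email within 2 minutes.",
    )
]
write_jsonl(examples, "my_examples.jsonl")
```

Passing ensure_ascii=False keeps characters such as "→" readable in the file instead of writing them as \u escapes; both forms are valid JSON.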

Download our sample dataset

We provide a 50-example customer support dataset to use as a starting point or to test the workflow end-to-end before using your own data.

# Download the Cloudach sample dataset (50 customer-support examples)
curl -L https://raw.githubusercontent.com/cloudach/examples/main/fine-tuning/sample_dataset.jsonl \
  -o training_data.jsonl

Validate your file

Run this quick check before uploading to catch format errors:

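python3 - <<'EOF'
import json, sys

errors = 0
with open("training_data.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        try:
            obj = json.loads(line)
            # Each example needs a "messages" list with a user and an assistant turn.
            assert isinstance(obj.get("messages"), list), 'missing "messages" array'
            roles = [m["role"] for m in obj["messages"]]
            assert "user" in roles and "assistant" in roles, "need user and assistant turns"
        except Exception as e:
            # Include the exception type so bare errors are still informative.
            print(f"Line {i}: {type(e).__name__}: {e}")
            errors += 1
print(f"Done. {errors} error(s).")
sys.exit(1 if errors else 0)  # non-zero exit code lets scripts detect failures
EOF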

Aim for at least 200–500 examples for meaningful domain adaptation. See the Data Preparation Guide for collection and cleaning tips.
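
Step 6 evaluates on a held-out test set, so it can help to set some examples aside before uploading. The following is one possible way to do that, not a Cloudach requirement; the function names, the 10% default, and the output filenames are all illustrative choices.

```python
import random


def split_lines(lines, test_fraction=0.1, seed=0):
    """Shuffle deterministically and split into (train, test) lists."""
    lines = [ln for ln in lines if ln.strip()]  # ignore blank lines
    rng = random.Random(seed)                   # fixed seed: reproducible split
    rng.shuffle(lines)
    n_test = max(1, round(len(lines) * test_fraction))
    return lines[n_test:], lines[:n_test]


def split_file(path, train_path, test_path, **kwargs):
    """Split a JSONL file on disk and return (train_count, test_count)."""
    with open(path, encoding="utf-8") as f:
        train, test = split_lines([ln.rstrip("\n") for ln in f], **kwargs)
    for out_path, rows in ((train_path, train), (test_path, test)):
        with open(out_path, "w", encoding="utf-8") as out:
            out.write("\n".join(rows) + "\n")
    return len(train), len(test)
```

For example, split_file("training_data.jsonl", "train.jsonl", "test.jsonl") on the 50-example sample would leave 45 examples for training and 5 for evaluation. You would then upload train.jsonl in Step 2 and keep test.jsonl for Step 6.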

Step 2 — Upload your dataset

Upload the JSONL file to get a dataset ID to reference in the fine-tuning job.

curl https://api.cloudach.com/v1/fine-tuning/datasets \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -F "file=@training_data.jsonl" \
  -F "purpose=fine-tune"

The response looks like:

{
  "id": "ds-8f3a2b1c",
  "object": "dataset",
  "filename": "training_data.jsonl",
  "bytes": 142891,
  "line_count": 50,
  "status": "processed"
}

Save the id field (e.g. ds-8f3a2b1c) — you'll need it in the next step.
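
Before launching a job, it can be worth checking that the upload response agrees with your local file. This sketch assumes that line_count counts training examples and that "processed" is the status a usable dataset reports, as in the sample response; the helper names are hypothetical.

```python
def count_examples(path: str) -> int:
    """Count non-blank lines, i.e. training examples, in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_upload(response: dict, local_count: int) -> list[str]:
    """Return a list of problems found in the upload response (empty if none)."""
    problems = []
    if response.get("status") != "processed":
        problems.append(f"unexpected status: {response.get('status')!r}")
    if response.get("line_count") != local_count:
        problems.append(
            f"line_count {response.get('line_count')} != local {local_count}"
        )
    return problems


# Fields copied from the sample response above.
response = {"id": "ds-8f3a2b1c", "status": "processed", "line_count": 50}
print(check_upload(response, local_count=50) or "upload looks consistent")
```

In practice you would pass count_examples("training_data.jsonl") as local_count and the parsed JSON from the upload call as response.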

Step 3 — Launch the fine-tuning job

Create a LoRA fine-tuning job targeting llama3-8b. Replace ds-8f3a2b1c with your dataset ID.

curl https://api.cloudach.com/v1/fine-tuning/jobs \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "training_file": "ds-8f3a2b1c",
    "method": {
      "type": "lora",
      "lora": { "rank": 16, "alpha": 32, "dropout": 0.05 }
    },
    "hyperparameters": {
      "n_epochs": 3,
      "batch_size": 16
    },
    "suffix": "support-bot"
  }'

Response:

{
  "id": "ftjob-a1b2c3",
  "object": "fine_tuning.job",
  "model": "llama3-8b",
  "status": "queued",
  "created_at": 1712001200,
  "estimated_finish": null,
  "fine_tuned_model": null
}

Save the job id (e.g. ftjob-a1b2c3). The job enters the queued state and starts within a few minutes.
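
If you launch jobs from a script, building the request body in code avoids hand-editing JSON. The sketch below mirrors the curl payload above; build_job_payload is a hypothetical helper, and its default of alpha = 2 × rank simply follows the convention described in this tutorial.

```python
import json


def build_job_payload(dataset_id, suffix, rank=16, alpha=None,
                      dropout=0.05, n_epochs=3, batch_size=16):
    """Assemble the JSON body for POST /v1/fine-tuning/jobs."""
    if alpha is None:
        alpha = 2 * rank  # convention used in this tutorial: alpha = 2 x rank
    return {
        "model": "llama3-8b",
        "training_file": dataset_id,
        "method": {
            "type": "lora",
            "lora": {"rank": rank, "alpha": alpha, "dropout": dropout},
        },
        "hyperparameters": {"n_epochs": n_epochs, "batch_size": batch_size},
        "suffix": suffix,
    }


payload = build_job_payload("ds-8f3a2b1c", "support-bot")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent with any HTTP client, using the same Authorization and Content-Type headers as the curl command.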

Key parameters explained

Parameter   | Our value   | What it does
------------|-------------|-------------------------------------------------------------
lora.rank   | 16          | Controls adapter capacity. Higher = more expressive, higher cost. 16 is a good default.
lora.alpha  | 32          | Scaling factor (typically 2× rank).
n_epochs    | 3           | Training passes over the dataset. Start with 3; increase if loss is still falling.
batch_size  | 16          | Examples per gradient update. Larger = faster but uses more GPU memory.
suffix      | support-bot | Appended to the deployed model ID so you can identify it.
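
Two of these settings are easier to reason about with a little arithmetic. The sketch below assumes the trainer keeps the final partial batch and does not use gradient accumulation, neither of which this tutorial confirms; the alpha / rank scaling is the standard LoRA formulation.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, n_epochs: int) -> int:
    """Gradient updates in a run, assuming the last partial batch is kept
    and there is no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * n_epochs


def lora_scale(rank: int, alpha: int) -> float:
    """Standard LoRA multiplies the adapter update by alpha / rank."""
    return alpha / rank


print(optimizer_steps(50, 16, 3))   # 50-example sample with the values above
print(optimizer_steps(500, 16, 3))  # a 500-example dataset
print(lora_scale(16, 32))
```

With the values above, the 50-example sample yields only 12 gradient updates, while 500 examples yield 96. Keeping alpha at 2× rank holds the scale at 2.0, which is one reason to raise both together when you increase the rank.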

Step 4 — Monitor training

Poll the job endpoint to see status, training loss, and estimated finish time. Training on a 50-example dataset typically takes 3–5 minutes; training time grows roughly linearly with dataset size.

# Check the job status (repeat every 30 seconds or so)
curl https://api.cloudach.com/v1/fine-tuning/jobs/ftjob-a1b2c3 \
  -H "Authorization: Bearer $CLOUDACH_API_KEY"

Response while the job is running:

{
  "id": "ftjob-a1b2c3",
  "status": "running",
  "trained_tokens": 48200,
  "estimated_finish": 1712003600,
  "fine_tuned_model": null,
  "events": [
    { "step": 0,   "train_loss": 2.38, "train_mean_token_accuracy": 0.44 },
    { "step": 50,  "train_loss": 1.62, "train_mean_token_accuracy": 0.63 },
    { "step": 100, "train_loss": 1.14, "train_mean_token_accuracy": 0.74 }
  ]
}

When status changes to succeeded, the fine_tuned_model field will contain your deployed model ID.
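
The curl command above makes a single request, so polling needs a loop around it. Here is a minimal sketch of one. It takes the HTTP call as a function argument so that it works with any client; "succeeded" comes from this tutorial, while the "failed" and "cancelled" statuses are assumptions about the API.

```python
import time


def wait_for_job(fetch_job, interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll until the job reaches a terminal status, then return the job dict.

    fetch_job: zero-argument callable returning the parsed job JSON.
    """
    terminal = {"succeeded", "failed", "cancelled"}  # last two are assumed
    waited = 0.0
    while True:
        job = fetch_job()
        if job["status"] in terminal:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job['status']!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Demonstration with canned responses instead of real HTTP calls.
responses = iter([
    {"status": "queued", "fine_tuned_model": None},
    {"status": "running", "fine_tuned_model": None},
    {"status": "succeeded",
     "fine_tuned_model": "llama3-8b:ft:support-bot:ftjob-a1b2c3"},
])
job = wait_for_job(lambda: next(responses), sleep=lambda seconds: None)
print(job["fine_tuned_model"])
```

In real use, fetch_job would issue the GET request shown above and return the decoded JSON, and you would leave sleep at its default.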

Stream live events

For a real-time loss curve, stream events instead of polling:

curl "https://api.cloudach.com/v1/fine-tuning/jobs/ftjob-a1b2c3/events?stream=true" \
  -H "Authorization: Bearer $CLOUDACH_API_KEY"

A healthy training run shows train_loss falling from ~2.0–2.5 at step 0 to 0.5–1.2 by the final step. If it plateaus above 1.5, try more epochs or a higher LoRA rank. If it drops below 0.2, you may be overfitting; add more diverse examples.
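
These rules of thumb can be applied automatically to the events array once a job finishes. The sketch below encodes only the two thresholds stated above (1.5 and 0.2) and treats everything in between as acceptable, which is a simplification; the function name is hypothetical.

```python
def diagnose(events):
    """Classify a finished run from its final train_loss."""
    if not events:
        return "no events yet"
    final = events[-1]["train_loss"]
    if final > 1.5:
        return "plateaued high: try more epochs or a higher LoRA rank"
    if final < 0.2:
        return "possible overfitting: add more diverse examples"
    return "looks healthy"


# Events in the shape returned by the job endpoint in this step.
events = [
    {"step": 0, "train_loss": 2.38},
    {"step": 50, "train_loss": 1.62},
    {"step": 100, "train_loss": 1.14},
]
print(diagnose(events))
```

The check is only meaningful at the end of training, since a run that is still in progress will naturally report a higher loss.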

Step 5 — Run inference on your model

Once status is succeeded, use the fine_tuned_model ID exactly like any other model. No code changes needed beyond swapping the model ID.

# Replace with your fine_tuned_model value
curl https://api.cloudach.com/v1/chat/completions \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b:ft:support-bot:ftjob-a1b2c3",
    "messages": [
      {"role": "system", "content": "You are a helpful support agent for Acme Corp."},
      {"role": "user", "content": "What is your return policy?"}
    ]
  }'

The response is identical to a standard chat completion:

{
  "id": "chatcmpl-xyz",
  "object": "chat.completion",
  "model": "llama3-8b:ft:support-bot:ftjob-a1b2c3",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "We accept returns within 30 days of purchase. Items must be unopened and in original packaging. To start a return, email support@acme.com with your order number."
    },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 42, "completion_tokens": 38, "total_tokens": 80 }
}

Step 6 — Evaluate results

Compare your fine-tuned model against the base model on a held-out test set:

import os

import openai

# Read the key from the environment, matching the curl examples above.
client = openai.OpenAI(
    api_key=os.environ["CLOUDACH_API_KEY"],
    base_url="https://api.cloudach.com/v1",
)

test_questions = [
    "How do I cancel my subscription?",
    "I was charged twice, what do I do?",
    "Do you have a free trial?",
]

for q in test_questions:
    # Send both models the same messages so the only variable is the model.
    messages = [
        {"role": "system", "content": "You are a helpful support agent for Acme Corp."},
        {"role": "user", "content": q},
    ]
    base = client.chat.completions.create(model="llama3-8b", messages=messages)
    finetuned = client.chat.completions.create(
        model="llama3-8b:ft:support-bot:ftjob-a1b2c3",  # your fine_tuned_model value
        messages=messages,
    )
    print(f"Q: {q}")
    print(f"Base:       {base.choices[0].message.content[:120]}")
    print(f"Fine-tuned: {finetuned.choices[0].message.content[:120]}")
    print()

Things to look for:

  • Tone: Does the model respond in your brand's voice?
  • Factual accuracy: Does it cite correct product details?
  • Format: Does it follow your preferred response structure?
  • Hallucination rate: Does it fabricate information less often than the base model?
  • Refusal rate: Does it handle out-of-scope questions gracefully?

If results are not satisfactory, the most effective improvements are: (1) adding more diverse training examples, (2) increasing n_epochs to 5, (3) increasing lora.rank to 32 or 64 for more complex tasks.

Next steps