Intermediate · ~20 min

Build a customer support bot with Cloudach and Llama 3

In this tutorial, you'll build a production-ready customer support chatbot that streams responses in real time, maintains conversation context across turns, and handles errors gracefully. We'll use Llama 3 70B, served through the Cloudach streaming API, for high-quality responses.

Overview

What you'll build:

  • A chatbot that answers questions about your product using a custom system prompt
  • Streaming responses so users see text as it's generated (no waiting for the full reply)
  • A conversation history that keeps context across multiple turns
  • Graceful error handling and retry logic

You'll need a Cloudach API key. Sign up free; no credit card required.

Prerequisites

  • A Cloudach API key (from your dashboard)
  • Python 3.9+ with the openai SDK, version 1.0 or later (the examples use the OpenAI client class): pip install openai

Step 1 — Write the system prompt

The system prompt defines who the bot is and what it knows. Keep it concise and factual. Llama 3 follows instructions well — you don't need to over-engineer it.

SYSTEM_PROMPT = """You are Aria, a helpful customer support assistant for Cloudach.

Cloudach is an OpenAI-compatible LLM API that hosts Llama 3, Mistral, and Mixtral models.
You help developers with API questions, billing, rate limits, and debugging.

Rules:
- Be concise and direct. Developers prefer short answers.
- Always include a code example when explaining an API concept.
- If you don't know something, say so and link to docs.cloudach.com.
- Never make up pricing or SLA numbers — refer users to the pricing page.
"""

Step 2 — Manage conversation context

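LLMs are stateless: every request is independent. To maintain a conversation, you pass the message history on each call. To stay within the context window, keep only the most recent N turns; the code below does this with a fixed-length deque that discards the oldest message once it is full.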

from openai import OpenAI
from collections import deque

client = OpenAI(
    api_key="sk-cloudach-YOUR_KEY",
    base_url="https://api.cloudach.com/v1",
)

MAX_HISTORY = 10  # number of user+assistant turn pairs to keep

history = deque(maxlen=MAX_HISTORY * 2)  # *2 because each turn = user + assistant

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})

    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + list(history)

    response = client.chat.completions.create(
        model="llama3-70b",
        messages=messages,
        stream=False,
    )

    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
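
One subtlety of the fixed-length deque above: it evicts a single message at a time, so once it is full the retained window can begin with an assistant reply whose question has already been dropped. The sketch below shows one way to handle that. It assumes a hypothetical helper named `build_messages` (not part of the tutorial code or the SDK) that skips any leading non-user messages before prepending the system prompt.

```python
from collections import deque


def build_messages(system_prompt: str, history) -> list:
    """Return the system prompt followed by the retained turns, oldest first."""
    turns = list(history)
    # Skip leading messages until the window starts on a user turn.
    start = 0
    while start < len(turns) and turns[start]["role"] != "user":
        start += 1
    return [{"role": "system", "content": system_prompt}] + turns[start:]


# Demo with a window of two turn pairs (4 messages).
demo_history = deque(maxlen=4)
for i in range(2):
    demo_history.append({"role": "user", "content": f"question {i}"})
    demo_history.append({"role": "assistant", "content": f"answer {i}"})
# A third question evicts "question 0", leaving "answer 0" at the front.
demo_history.append({"role": "user", "content": "question 2"})

demo_messages = build_messages("You are Aria.", demo_history)
print([m["content"] for m in demo_messages])
# ['You are Aria.', 'question 1', 'answer 1', 'question 2']
```

Whether to drop the orphaned reply or keep it is a design choice; dropping it keeps every retained answer paired with the question that produced it.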

Step 3 — Stream the response

Streaming makes your bot feel instant. Instead of waiting for the full reply, you print each token as it arrives. This is especially important for long answers.

def chat_stream(user_message: str):
    """Stream tokens to stdout as they arrive."""
    history.append({"role": "user", "content": user_message})

    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + list(history)

    stream = client.chat.completions.create(
        model="llama3-70b",
        messages=messages,
        stream=True,
    )

    full_reply = ""
    for chunk in stream:
        token = chunk.choices[0].delta.content or ""
        print(token, end="", flush=True)
        full_reply += token

    print()  # newline after stream ends
    history.append({"role": "assistant", "content": full_reply})
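
To make the accumulation step easier to follow without calling the API, here is a small offline sketch. The `fake_chunk` helper and the `iter_tokens` generator are hypothetical names introduced for illustration; the stand-in objects only mimic the `chunk.choices[0].delta.content` shape used above. The sketch also skips chunks with an empty `choices` list, which some OpenAI-compatible servers may send; whether Cloudach does is an assumption.

```python
from types import SimpleNamespace


def iter_tokens(stream):
    """Yield the non-empty text pieces from a stream of chat completion chunks."""
    for chunk in stream:
        if not chunk.choices:  # defensive: tolerate chunks that carry no choices
            continue
        token = chunk.choices[0].delta.content
        if token:  # content is None on chunks that carry no text
            yield token


def fake_chunk(content):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(None),              # e.g. a role-only first chunk
    fake_chunk("Hel"),
    fake_chunk("lo"),
    SimpleNamespace(choices=[]),   # chunk without choices
    fake_chunk(None),              # final chunk with no text
]

full_reply = "".join(iter_tokens(fake_stream))
print(full_reply)  # Hello
```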

Step 4 — Full working example

Put it all together into a terminal chatbot you can run right now.

#!/usr/bin/env python3
"""Cloudach customer support bot — terminal demo."""
from openai import OpenAI
from collections import deque

SYSTEM_PROMPT = """You are Aria, a helpful customer support assistant for Cloudach.
Be concise. Include code examples when explaining API concepts."""

client = OpenAI(
    api_key="sk-cloudach-YOUR_KEY",
    base_url="https://api.cloudach.com/v1",
)
history = deque(maxlen=20)

def chat_stream(user_message: str):
    history.append({"role": "user", "content": user_message})
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + list(history)

    stream = client.chat.completions.create(
        model="llama3-70b",
        messages=messages,
        stream=True,
    )

    full_reply = ""
    print("\nAria: ", end="", flush=True)
    for chunk in stream:
        token = chunk.choices[0].delta.content or ""
        print(token, end="", flush=True)
        full_reply += token
    print()

    history.append({"role": "assistant", "content": full_reply})

def main():
    print("Cloudach Support Bot (Llama 3 70B) — type 'quit' to exit\n")
    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if not user_input or user_input.lower() in ("quit", "exit"):
            break
        chat_stream(user_input)

if __name__ == "__main__":
    main()
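
The example above hardcodes the API key for brevity. Outside a demo, you would probably read it from the environment instead. A minimal sketch follows, assuming an environment variable named `CLOUDACH_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something the Cloudach docs prescribe.

```python
import os


def load_api_key(env_var: str = "CLOUDACH_API_KEY") -> str:
    """Read the API key from the environment and fail fast if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

You could then construct the client with `api_key=load_api_key()` in place of the literal string, which keeps the key out of source control.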

Step 5 — Production tips

Handle rate limit errors

import time
from openai import RateLimitError

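def chat_with_retry(user_message: str, retries: int = 3):
    for attempt in range(retries):
        try:
            return chat_stream(user_message)
        except RateLimitError:
            # chat_stream appended the user message before the request failed;
            # remove it so a retry does not add a duplicate to the history.
            history.pop()
            if attempt == retries - 1:
                raise
            wait = 2 ** attempt  # exponential backoff: 1s, then 2s with retries=3
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)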
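
The same backoff pattern can be factored into a reusable helper and exercised without hitting the API. The sketch below is one possible shape, not part of the openai SDK: `retry_with_backoff`, its parameters, and the `FakeRateLimit` stand-in error are all hypothetical names. The `sleep` parameter is injectable so the demo can record the delays instead of waiting.

```python
import time


def retry_with_backoff(fn, retryable, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds, sleeping base_delay * 2**attempt between tries."""
    for attempt in range(retries):
        try:
            return fn()
        except retryable:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a stand-in error type and a call that fails twice before succeeding.
class FakeRateLimit(Exception):
    pass


calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimit()
    return "ok"


waits = []
result = retry_with_backoff(flaky, FakeRateLimit, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

The callable you pass in should be safe to repeat; in this tutorial's terms, it should not leave a duplicate user message in `history` when a request fails.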

Keep the system prompt lean

Every token in your system prompt is sent again, and counted as input, on every request. 200–400 tokens is usually enough. If you need to include large knowledge bases (product docs, FAQs), use retrieval-augmented generation (RAG) and inject only the relevant context per request.
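
To illustrate the idea of injecting only the relevant context, here is a deliberately naive sketch that ranks snippets by word overlap with the question. Real RAG systems typically use embeddings and a vector index; word overlap is swapped in here only to keep the example self-contained. The `retrieve` and `build_system_prompt` helpers and the `DOCS` entries are hypothetical placeholders.

```python
import re


def words(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, docs: list, k: int = 2) -> list:
    """Return up to k docs that share the most words with the question."""
    q = words(question)
    scored = [(len(q & words(doc)), doc) for doc in docs]
    scored = [pair for pair in scored if pair[0] > 0]  # drop unrelated docs
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_system_prompt(base_prompt: str, snippets: list) -> str:
    """Append the retrieved snippets to the base prompt, if there are any."""
    if not snippets:
        return base_prompt
    context = "\n".join(f"- {s}" for s in snippets)
    return f"{base_prompt}\n\nRelevant documentation:\n{context}"


# Placeholder knowledge base; real entries would come from your docs or FAQ.
DOCS = [
    "Rate limits: placeholder text about request quotas and the rate limit error.",
    "Billing: placeholder text about invoices and payment methods.",
    "Streaming: placeholder text about setting stream to true.",
]

hits = retrieve("How do I fix a rate limit error?", DOCS, k=1)
prompt = build_system_prompt("You are Aria.", hits)
print(prompt)
```

Only the one matching snippet reaches the model, so the prompt stays small however large the knowledge base grows.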

Model selection

| Use case                  | Recommended model | Why                                              |
|---------------------------|-------------------|--------------------------------------------------|
| High-volume FAQ bot       | llama3-8b         | Fastest, cheapest, great for structured replies  |
| Complex support queries   | llama3-70b        | Better reasoning, handles edge cases             |
| Long conversation history | mistral-7b        | 32K context window                               |
| Highest accuracy          | mixtral-8x7b      | MoE model, best for nuanced tasks                |

Need higher rate limits or a dedicated GPU? Contact sales for enterprise plans.
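
If your application serves more than one kind of traffic, the recommendations above can be kept in one place as a lookup. This is only a sketch: the use-case keys and the `pick_model` helper are hypothetical labels, while the model IDs are the ones listed in the table.

```python
# Hypothetical use-case labels mapped to the model IDs from the table above.
MODEL_BY_USE_CASE = {
    "faq": "llama3-8b",
    "complex": "llama3-70b",
    "long_history": "mistral-7b",
    "accuracy": "mixtral-8x7b",
}


def pick_model(use_case: str, default: str = "llama3-70b") -> str:
    """Return the model ID for a use case, falling back to the default."""
    return MODEL_BY_USE_CASE.get(use_case, default)


print(pick_model("faq"))      # llama3-8b
print(pick_model("unknown"))  # llama3-70b
```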

What's next