Build a customer support bot with Cloudach and Llama 3
In this tutorial you'll build a production-ready customer support chatbot that streams responses in real time, maintains conversation context across turns, and handles errors gracefully. We'll use Llama 3 70B for high-quality responses and the Cloudach streaming API.
Overview
What you'll build:
- A chatbot that answers questions about your product using a custom system prompt
- Streaming responses so users see text as it's generated (no waiting for the full reply)
- A conversation history that keeps context across multiple turns
- Graceful error handling and retry logic
Prerequisites
- A Cloudach API key (from your dashboard)
- Python 3.9+ with the openai SDK: pip install openai
Step 1 — Write the system prompt
The system prompt defines who the bot is and what it knows. Keep it concise and factual. Llama 3 follows instructions well — you don't need to over-engineer it.
SYSTEM_PROMPT = """You are Aria, a helpful customer support assistant for Cloudach.
Cloudach is an OpenAI-compatible LLM API that hosts Llama 3, Mistral, and Mixtral models.
You help developers with API questions, billing, rate limits, and debugging.

Rules:
- Be concise and direct. Developers prefer short answers.
- Always include a code example when explaining an API concept.
- If you don't know something, say so and link to docs.cloudach.com.
- Never make up pricing or SLA numbers — refer users to the pricing page.
"""
Step 2 — Manage conversation context
LLMs are stateless — every request is independent. To maintain a conversation, you pass the full message history on each call. Keep the last N turns to stay within the context window.
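The windowing idea can be tried without any API call. The sketch below is a standalone illustration, not part of any SDK: `build_messages` is a hypothetical helper, and the rule that the window should open on a user message is an assumption about what chat templates tend to expect. It shows that a bounded deque silently discards its oldest entries, and that a new user message can push out the start of a pair and leave an assistant message at the front.

```python
from collections import deque

MAX_TURNS = 2  # small on purpose so the trimming is easy to see


def build_messages(system_prompt, history):
    """Return the system message plus the recent history.

    Hypothetical helper: drops leading assistant messages so the
    window always opens on a user turn (an assumption, not an API rule).
    """
    recent = list(history)
    while recent and recent[0]["role"] != "user":
        recent.pop(0)
    return [{"role": "system", "content": system_prompt}] + recent


history = deque(maxlen=MAX_TURNS * 2)
for i in range(1, 4):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

# A new user message pushes out "question 2", leaving "answer 2" in front.
history.append({"role": "user", "content": "question 4"})

messages = build_messages("You are a support bot.", history)
roles = [m["role"] for m in messages]
contents = [m["content"] for m in messages]
print(roles)
```

The orphaned assistant message is removed only from the list sent to the model; the deque itself is left untouched.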
from openai import OpenAI
from collections import deque

client = OpenAI(
    api_key="sk-cloudach-YOUR_KEY",
    base_url="https://api.cloudach.com/v1",
)

MAX_HISTORY = 10  # number of user+assistant turn pairs to keep
history = deque(maxlen=MAX_HISTORY * 2)  # *2 because each turn = user + assistant


def chat(user_message: str) -> str:
    """Send one user message with the recent history and return the reply."""
    history.append({"role": "user", "content": user_message})
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + list(history)
    response = client.chat.completions.create(
        model="llama3-70b",
        messages=messages,
        stream=False,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

Step 3 — Stream the response
Streaming makes your bot feel instant. Instead of waiting for the full reply, you print each token as it arrives. This is especially important for long answers.
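The chunk handling can be exercised offline before wiring it to the API. The sketch below is an illustration under assumptions: `iter_tokens` and `fake_chunk` are hypothetical helpers, and the stand-in chunks built with `SimpleNamespace` only mimic the `choices[0].delta.content` attribute path that the OpenAI Python SDK exposes for streamed chat completions. It also guards against chunks with an empty `choices` list, which some OpenAI-compatible servers may emit; whether Cloudach does is not stated here.

```python
from types import SimpleNamespace


def iter_tokens(stream):
    """Yield the text fragment of each chunk, skipping chunks without text."""
    for chunk in stream:
        if not chunk.choices:  # chunk that carries no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:  # content may be None, e.g. on a role-only or closing chunk
            yield text


def fake_chunk(content):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Simulated stream: a chunk without text, three text chunks, another
# chunk without text, and one chunk with an empty choices list.
fake_stream = [
    fake_chunk(None),
    fake_chunk("Set "),
    fake_chunk("stream=True "),
    fake_chunk("to stream."),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
]

full_reply = "".join(iter_tokens(fake_stream))
print(full_reply)
```

Because `iter_tokens` yields instead of printing, the same function could feed a terminal loop or a web response.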
def chat_stream(user_message: str):
    """Stream tokens to stdout as they arrive."""
    history.append({"role": "user", "content": user_message})
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + list(history)
    stream = client.chat.completions.create(
        model="llama3-70b",
        messages=messages,
        stream=True,
    )
    full_reply = ""
    for chunk in stream:
        token = chunk.choices[0].delta.content or ""
        print(token, end="", flush=True)
        full_reply += token
    print()  # newline after stream ends
    history.append({"role": "assistant", "content": full_reply})

Step 4 — Full working example
Put it all together into a terminal chatbot you can run right now.
#!/usr/bin/env python3
"""Cloudach customer support bot — terminal demo."""
from openai import OpenAI
from collections import deque

SYSTEM_PROMPT = """You are Aria, a helpful customer support assistant for Cloudach.
Be concise. Include code examples when explaining API concepts."""

client = OpenAI(
    api_key="sk-cloudach-YOUR_KEY",
    base_url="https://api.cloudach.com/v1",
)

history = deque(maxlen=20)  # 10 user+assistant turn pairs


def chat_stream(user_message: str):
    """Send the message with recent history and stream the reply to stdout."""
    history.append({"role": "user", "content": user_message})
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + list(history)
    stream = client.chat.completions.create(
        model="llama3-70b",
        messages=messages,
        stream=True,
    )
    full_reply = ""
    print("\nAria: ", end="", flush=True)
    for chunk in stream:
        token = chunk.choices[0].delta.content or ""
        print(token, end="", flush=True)
        full_reply += token
    print()
    history.append({"role": "assistant", "content": full_reply})


def main():
    print("Cloudach Support Bot (Llama 3 70B) — type 'quit' to exit\n")
    while True:
        try:
            user_input = input("You: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if not user_input or user_input.lower() in ("quit", "exit"):
            break
        chat_stream(user_input)


if __name__ == "__main__":
    main()

Step 5 — Production tips
Handle rate limit errors
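import time
from openai import RateLimitError


def chat_with_retry(user_message: str, retries: int = 3):
    for attempt in range(retries):
        try:
            return chat_stream(user_message)
        except RateLimitError:
            # chat_stream appended the user message before the request failed;
            # remove it so the next attempt does not send it twice.
            if history and history[-1]["role"] == "user":
                history.pop()
            if attempt == retries - 1:
                raise
            wait = 2 ** attempt  # exponential backoff: 1s, then 2s, doubling each attempt
            print(f"Rate limited. Retrying in {wait}s...")
            time.sleep(wait)

Keep the system prompt lean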
Every token in your system prompt counts toward the input tokens of every request, so a long prompt adds cost and latency to each call. 200–400 tokens is usually enough. If you need to include large knowledge bases (product docs, FAQs), use retrieval-augmented generation (RAG) and inject only the relevant context per request.
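As a rough illustration of injecting only the relevant context, the sketch below scores a few snippets by word overlap with the question and appends the best match to a short base prompt. Everything in it is an assumption made for the example: the `SNIPPETS` text, the `pick_context` and `build_system_prompt` helpers, and the overlap scoring, which stands in for a real retriever such as an embedding search.

```python
import re

BASE_PROMPT = "You are Aria, a support assistant. Be concise."

# Hypothetical knowledge base; in practice this would come from your docs.
SNIPPETS = [
    "Models: Cloudach hosts Llama 3, Mistral, and Mixtral models.",
    "Streaming: pass stream=True to receive the reply as it is generated.",
    "Pricing: see the pricing page for current pricing and SLA numbers.",
]


def words(text):
    """Lowercase word set used for the naive overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_context(question, snippets, top_k=1):
    """Return the top_k snippets sharing the most words with the question."""
    q = words(question)
    ranked = sorted(snippets, key=lambda s: len(q & words(s)), reverse=True)
    return ranked[:top_k]


def build_system_prompt(question):
    """Append only the selected snippets to the lean base prompt."""
    context = "\n".join(pick_context(question, SNIPPETS))
    return f"{BASE_PROMPT}\n\nRelevant context:\n{context}"


prompt = build_system_prompt("How do I enable streaming for the reply?")
print(prompt)
```

Only the streaming snippet reaches the model for this question, so the prompt stays short no matter how large the snippet collection grows.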
Model selection
| Use case | Recommended model | Why |
|---|---|---|
| High-volume FAQ bot | llama3-8b | Fastest, cheapest, great for structured replies |
| Complex support queries | llama3-70b | Better reasoning, handles edge cases |
| Long conversation history | mistral-7b | 32K context window |
| Highest accuracy | mixtral-8x7b | MoE model, best for nuanced tasks |
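The table can be turned into a small lookup so the model choice lives in one place. This is only a sketch: the use-case keys and the `pick_model` helper are hypothetical names, and the model identifiers are copied from the table rather than checked against a live model list.

```python
# Use-case keys are hypothetical labels; model names come from the table.
MODEL_BY_USE_CASE = {
    "faq": "llama3-8b",            # high-volume FAQ bot
    "complex": "llama3-70b",       # complex support queries
    "long_history": "mistral-7b",  # long conversation history
    "accuracy": "mixtral-8x7b",    # highest accuracy
}

DEFAULT_MODEL = "llama3-70b"  # the model used throughout this tutorial


def pick_model(use_case: str) -> str:
    """Return the recommended model, falling back to the tutorial default."""
    return MODEL_BY_USE_CASE.get(use_case, DEFAULT_MODEL)


print(pick_model("faq"))
print(pick_model("unknown"))
```

The returned string can be passed as the model argument of client.chat.completions.create.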
What's next
- Read the Streaming docs for details on parsing SSE chunks
- Check the Rate Limits section to plan your retry logic
- See the Changelog for new models and features
- Email support@cloudach.com for help