LlamaIndex integration
Use Cloudach as the LLM backend in LlamaIndex. Cloudach is OpenAI-compatible, so the built-in OpenAI LLM class works with two configuration changes: set api_base and api_key to your Cloudach values. This guide covers completions, chat, streaming, global settings, and a simple RAG pipeline.
Overview
What you'll learn:
- Configure the LlamaIndex OpenAI class to use Cloudach models
- Send chat messages and stream completions
- Set Cloudach as the global default LLM with Settings.llm
- Build a minimal RAG query engine backed by Cloudach
Install
pip install llama-index llama-index-llms-openai
Set your API key in the environment:
export CLOUDACH_API_KEY="sk-cloudach-YOUR_KEY"
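The snippets in this guide read the key with os.environ["CLOUDACH_API_KEY"], which raises a bare KeyError when the variable is unset. A minimal sketch of a friendlier check is shown here. The helper name require_cloudach_key is hypothetical (it is not part of LlamaIndex or any Cloudach SDK), and it only confirms that the variable is present and non-empty, not that Cloudach accepts the key.

```python
import os


def require_cloudach_key(env=os.environ):
    """Return the Cloudach API key from the environment, or fail with a clear message.

    Hypothetical helper for this guide; not a LlamaIndex or Cloudach API.
    """
    key = env.get("CLOUDACH_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CLOUDACH_API_KEY is not set. Export it before running the examples."
        )
    return key
```

If you adopt it, pass api_key=require_cloudach_key() wherever the examples use os.environ["CLOUDACH_API_KEY"].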
Step 1 — Basic completion
Pass api_base and api_key to the OpenAI constructor. Then call .complete() with a prompt string.
import os
from llama_index.llms.openai import OpenAI
llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.7,
)
response = llm.complete("The three laws of robotics are")
print(response.text)
Step 2 — Chat messages
Use ChatMessage objects to send system and user turns. The response is a ChatResponse with a .message.content string.
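For orientation, the sketch below builds the JSON body that an OpenAI-compatible chat endpoint conventionally receives for these two turns. It uses only the standard library and sends nothing over the network. The request LlamaIndex actually builds may carry additional fields, and the helper name to_chat_payload is hypothetical.

```python
import json


def to_chat_payload(model, turns):
    """Build an OpenAI-style chat request body from (role, content) pairs."""
    return {
        "model": model,
        "messages": [{"role": role, "content": content} for role, content in turns],
    }


payload = to_chat_payload(
    "llama3-70b",
    [
        ("system", "You are a concise technical assistant."),
        ("user", "What is retrieval-augmented generation?"),
    ],
)
print(json.dumps(payload, indent=2))
```

MessageRole.SYSTEM and MessageRole.USER correspond to the "system" and "user" role strings used here. The LlamaIndex version of the same exchange follows.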
import os
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage, MessageRole
llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
)
messages = [
    ChatMessage(role=MessageRole.SYSTEM, content="You are a concise technical assistant."),
    ChatMessage(role=MessageRole.USER, content="What is retrieval-augmented generation?"),
]
response = llm.chat(messages)
print(response.message.content)
Step 3 — Streaming
stream_complete and stream_chat return generators. Each item has a .delta attribute containing the new token(s).
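When the full text is needed after streaming finishes, the deltas can be concatenated as they arrive. The sketch below relies only on what is stated above, namely that each item exposes a .delta attribute. collect_stream is a hypothetical helper, and the fake chunks stand in for a real llm.stream_complete(...) generator so the snippet runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=False):
    """Concatenate the .delta of each streamed item and return the full text."""
    parts = []
    for chunk in chunks:
        delta = chunk.delta or ""  # tolerate items that carry no new text
        if echo:
            print(delta, end="", flush=True)
        parts.append(delta)
    if echo:
        print()
    return "".join(parts)


# Stand-in for llm.stream_complete(...): any iterable of objects with .delta works.
fake_stream = (
    SimpleNamespace(delta=d)
    for d in ["Vector ", "databases ", None, "store embeddings."]
)
full_text = collect_stream(fake_stream)
```

With a real client, pass llm.stream_complete(prompt) or llm.stream_chat(messages) in place of fake_stream. The basic streaming calls look like this: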
import os
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage, MessageRole
llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
)
# Stream a completion
stream = llm.stream_complete("Explain vector databases in plain English:")
for chunk in stream:
    print(chunk.delta, end="", flush=True)
print()
# Stream a chat
messages = [
    ChatMessage(role=MessageRole.USER, content="Summarize what a transformer is in 2 sentences."),
]
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)
print()
Step 4 — Global LLM settings
Set Cloudach as the default LLM for all LlamaIndex operations in your session. Any index or query engine you build after this will use it automatically.
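The connection arguments are identical in every step, so one option is to define them once and reuse them for both local instances and Settings.llm. This is only a sketch: cloudach_llm_kwargs is a hypothetical helper, and its defaults simply mirror the values used in this guide.

```python
import os

CLOUDACH_API_BASE = "https://api.cloudach.com/v1"


def cloudach_llm_kwargs(model="llama3-70b", env=os.environ, **overrides):
    """Return keyword arguments for the LlamaIndex OpenAI constructor.

    Hypothetical helper for this guide; not a LlamaIndex or Cloudach API.
    """
    kwargs = {
        "model": model,
        "api_key": env["CLOUDACH_API_KEY"],
        "api_base": CLOUDACH_API_BASE,
    }
    kwargs.update(overrides)  # e.g. temperature=0.1
    return kwargs


# Usage with LlamaIndex installed (not executed here):
#   from llama_index.core import Settings
#   from llama_index.llms.openai import OpenAI
#   Settings.llm = OpenAI(**cloudach_llm_kwargs(temperature=0.1))
```

The direct form, without a helper, is: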
import os
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
Settings.llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.1,  # lower = more deterministic for RAG
)
Step 5 — Simple RAG pipeline
Build a vector index from local documents and query it with Cloudach. For embeddings, use a local HuggingFace model (no extra API key needed).
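To make the retrieve-then-generate flow concrete, here is a toy sketch in plain Python. It is not how LlamaIndex is implemented: the word-count embed function stands in for the HuggingFace embedding model, and the final prompt is printed rather than sent to Cloudach. All function names are illustrative.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real embedder returns dense vectors)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]


docs = [
    "Vector databases store embeddings for similarity search.",
    "Transformers use attention to process sequences.",
]
question = "How do vector databases store embeddings?"
context = retrieve(question, docs)
# The retrieved text is placed in the prompt that would go to the LLM.
prompt = f"Context:\n{context[0]}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

In the pipeline below, VectorStoreIndex and the query engine handle the embedding, retrieval, and prompting steps for you.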
# Requires: pip install llama-index-embeddings-huggingface sentence-transformers
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Configure LLM and embedder
Settings.llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.1,
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Load documents from ./data (plain text, PDF, HTML, etc.)
documents = SimpleDirectoryReader("./data").load_data()
# Build the index (embeds and stores vectors in memory)
index = VectorStoreIndex.from_documents(documents)
# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics covered in these documents?")
print(response)
Step 6 — Complete working script
Save as cloudach_llamaindex.py and run with:
CLOUDACH_API_KEY=sk-cloudach-YOUR_KEY python cloudach_llamaindex.py
#!/usr/bin/env python3
"""Cloudach + LlamaIndex integration demo.
Install:
    pip install llama-index llama-index-llms-openai
Run:
    CLOUDACH_API_KEY=sk-cloudach-YOUR_KEY python cloudach_llamaindex.py
"""
import os
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage, MessageRole
# ── Configure ───────────────────────────────────────────────────────────────
llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.7,
)
Settings.llm = llm
# ── Completion ───────────────────────────────────────────────────────────────
print("=== Completion ===")
response = llm.complete("List three benefits of open-source LLMs:")
print(response.text)
# ── Chat ─────────────────────────────────────────────────────────────────────
print("\n=== Chat ===")
messages = [
    ChatMessage(role=MessageRole.SYSTEM, content="You are a concise technical assistant."),
    ChatMessage(role=MessageRole.USER, content="What is the difference between RAG and fine-tuning?"),
]
chat_response = llm.chat(messages)
print(chat_response.message.content)
# ── Streaming completion ──────────────────────────────────────────────────────
print("\n=== Streaming ===")
stream = llm.stream_complete("Explain embeddings in 2 sentences:")
for chunk in stream:
    print(chunk.delta, end="", flush=True)
print()
Available models
| Model ID | Context | Best for |
|---|---|---|
| llama3-8b | 8K | Fast responses, high-volume pipelines |
| llama3-70b | 8K | Complex reasoning, RAG synthesis |
| mistral-7b | 32K | Long documents, large context windows |
| mixtral-8x7b | 32K | Highest accuracy, complex tasks |
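The table can also be encoded as data when a pipeline needs to choose a model programmatically. A minimal sketch, assuming "8K" and "32K" mean 8,192 and 32,768 tokens (the table does not state exact limits) and using a hypothetical helper name:

```python
# Context sizes are an assumption: "8K" -> 8 * 1024 tokens, "32K" -> 32 * 1024 tokens.
CLOUDACH_MODELS = {
    "llama3-8b": 8 * 1024,
    "llama3-70b": 8 * 1024,
    "mistral-7b": 32 * 1024,
    "mixtral-8x7b": 32 * 1024,
}


def models_with_context(min_tokens):
    """Return model IDs whose context window is at least min_tokens, in table order."""
    return [name for name, ctx in CLOUDACH_MODELS.items() if ctx >= min_tokens]


print(models_with_context(16_000))
```

The returned ID can be passed as the model argument of the OpenAI constructor shown in the steps above.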
What's next
- LangChain integration — build LCEL chains and agents with Cloudach
- Rate limits — plan your retry logic
- SDK compatibility — other frameworks that work with Cloudach
- support@cloudach.com — questions or feedback