Beginner · ~10 min · Python

LlamaIndex integration

Use Cloudach as the LLM backend in LlamaIndex. Cloudach is OpenAI-compatible, so the built-in OpenAI LLM class works with two configuration changes: set api_base and api_key to your Cloudach values. This guide covers completions, chat, streaming, global settings, and a simple RAG pipeline.

Overview

What you'll learn:

  • Configure the LlamaIndex OpenAI class to use Cloudach models
  • Send chat messages and stream completions
  • Set Cloudach as the global default LLM with Settings.llm
  • Build a minimal RAG query engine backed by Cloudach

You need a Cloudach API key. Signing up is free, and no credit card is required.

Install

pip install llama-index llama-index-llms-openai

Set your API key in the environment:

export CLOUDACH_API_KEY="sk-cloudach-YOUR_KEY"
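
The snippets in this guide read the key with os.environ["CLOUDACH_API_KEY"], so a missing variable surfaces as a bare KeyError. A minimal sketch of a friendlier check is shown here; get_cloudach_api_key is a hypothetical helper for illustration and is not part of Cloudach or LlamaIndex.

```python
import os


def get_cloudach_api_key(env=None):
    """Return the Cloudach API key or raise an error that says how to fix it.

    Hypothetical helper: `env` defaults to os.environ and exists only so the
    function can be exercised with a plain dict.
    """
    env = os.environ if env is None else env
    key = env.get("CLOUDACH_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CLOUDACH_API_KEY is not set. Export it before running the examples."
        )
    return key
```

With this in place, api_key=get_cloudach_api_key() can stand in for the direct os.environ lookup in the steps below.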

Step 1 — Basic completion

Pass api_base and api_key to the OpenAI constructor. Then call .complete() with a prompt string.

import os
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.7,
)

response = llm.complete("The three laws of robotics are")
print(response.text)
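
Some releases of the LlamaIndex OpenAI class look up the context window from a table of known OpenAI model names and can raise a ValueError for a name such as llama3-70b. If that happens, the OpenAILike class from the separate llama-index-llms-openai-like package accepts arbitrary model names. The sketch below is one way to wire it up, not the documented Cloudach method: cloudach_llm_kwargs and make_cloudach_llm are hypothetical helpers, and the context_window and is_chat_model values are assumptions (8K read as 8192 tokens, and a chat-completions endpoint).

```python
import os

CLOUDACH_API_BASE = "https://api.cloudach.com/v1"


def cloudach_llm_kwargs(model="llama3-70b", api_key=None, context_window=8192, **overrides):
    """Build constructor keyword arguments for an OpenAI-compatible LLM class."""
    kwargs = {
        "model": model,
        # Fall back to the environment only when no key is passed explicitly.
        "api_key": api_key or os.environ["CLOUDACH_API_KEY"],
        "api_base": CLOUDACH_API_BASE,
        "context_window": context_window,  # assumption: 8K means 8192 tokens
        "is_chat_model": True,  # assumption: requests go to the chat endpoint
    }
    kwargs.update(overrides)  # e.g. temperature=0.1
    return kwargs


def make_cloudach_llm(**kwargs):
    """Create an OpenAILike LLM pointed at Cloudach."""
    # Imported lazily so the helper above works without the extra package.
    # Requires: pip install llama-index-llms-openai-like
    from llama_index.llms.openai_like import OpenAILike

    return OpenAILike(**cloudach_llm_kwargs(**kwargs))
```

Usage mirrors the step above: llm = make_cloudach_llm(temperature=0.7), then llm.complete("The three laws of robotics are").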

Step 2 — Chat messages

Use ChatMessage objects to send system and user turns. The response is a ChatResponse with a .message.content string.

import os

from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
)

messages = [
    ChatMessage(role=MessageRole.SYSTEM, content="You are a concise technical assistant."),
    ChatMessage(role=MessageRole.USER, content="What is retrieval-augmented generation?"),
]

response = llm.chat(messages)
print(response.message.content)

Step 3 — Streaming

stream_complete and stream_chat return generators. Each item has a .delta attribute containing the new token(s).

import os

from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
)

# Stream a completion: print each delta as soon as it arrives
stream = llm.stream_complete("Explain vector databases in plain English:")
for chunk in stream:
    print(chunk.delta, end="", flush=True)
print()

# Stream a chat: same pattern, with a list of messages as input
messages = [
    ChatMessage(role=MessageRole.USER, content="Summarize what a transformer is in 2 sentences."),
]
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)
print()
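
The loops above print each delta and then discard it. When the full text is also needed afterwards (for logging or caching, say), the deltas can be accumulated while they stream. The following is a small sketch: collect_stream is a hypothetical helper, and the SimpleNamespace objects are stand-ins for real response chunks so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks, echo=True):
    """Print streamed deltas as they arrive and return the concatenated text."""
    parts = []
    for chunk in chunks:
        # Treat a missing or None delta as "no new text" rather than failing.
        delta = getattr(chunk, "delta", None) or ""
        if echo:
            print(delta, end="", flush=True)
        parts.append(delta)
    if echo:
        print()
    return "".join(parts)


# Stand-in chunks in place of llm.stream_complete(...) output.
fake_stream = (SimpleNamespace(delta=d) for d in ["Vector ", None, "databases"])
full_text = collect_stream(fake_stream, echo=False)
```

With a real model, pass the generator directly: full_text = collect_stream(llm.stream_complete("...")).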

Step 4 — Global LLM settings

Set Cloudach as the default LLM for all LlamaIndex operations in the current Python process. Any index or query engine you build after this uses it automatically, unless you pass a different llm explicitly.

import os

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.1,  # lower = more deterministic for RAG
)

Step 5 — Simple RAG pipeline

Build a vector index from local documents and query it with Cloudach. For embeddings, use a local HuggingFace model (no extra API key needed).

# pip install llama-index-embeddings-huggingface sentence-transformers
import os

from llama_index.core import SimpleDirectoryReader, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI

# Configure LLM and embedder
Settings.llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.1,
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load documents from ./data (plain text, PDF, HTML, etc.)
documents = SimpleDirectoryReader("./data").load_data()

# Build the index (embeds and stores vectors in memory)
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("What are the main topics covered in these documents?")
print(response)

Tip: For production, replace the in-memory vector store with a persistent one like Chroma, Pinecone, or pgvector. LlamaIndex has first-class support for all three.
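
Short of moving to an external vector store, the default index can also be written to local disk so documents are not re-embedded on every run. The sketch below uses LlamaIndex's storage context for that; load_or_build_index and the ./storage directory are hypothetical names chosen for illustration, and Settings.llm and Settings.embed_model are assumed to be configured as shown above before the function is called.

```python
import os


def load_or_build_index(data_dir="./data", persist_dir="./storage"):
    """Reuse a saved index from persist_dir, or build one from data_dir and save it."""
    # Imported lazily; requires the llama-index package installed earlier.
    from llama_index.core import (
        SimpleDirectoryReader,
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if os.path.isdir(persist_dir):
        # A previous run saved the index: load vectors and metadata from disk.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage_context)

    # First run: embed the documents, then write the index to disk.
    documents = SimpleDirectoryReader(data_dir).load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

Delete the persist directory to force a rebuild after the source documents change.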

Step 6 — Complete working script

Save as cloudach_llamaindex.py and run with:

CLOUDACH_API_KEY=sk-cloudach-YOUR_KEY python cloudach_llamaindex.py

#!/usr/bin/env python3
"""Cloudach + LlamaIndex integration demo.

Install:
    pip install llama-index llama-index-llms-openai

Run:
    CLOUDACH_API_KEY=sk-cloudach-YOUR_KEY python cloudach_llamaindex.py
"""
import os
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage, MessageRole

# ── Configure ───────────────────────────────────────────────────────────────
llm = OpenAI(
    model="llama3-70b",
    api_key=os.environ["CLOUDACH_API_KEY"],
    api_base="https://api.cloudach.com/v1",
    temperature=0.7,
)
Settings.llm = llm

# ── Completion ───────────────────────────────────────────────────────────────
print("=== Completion ===")
response = llm.complete("List three benefits of open-source LLMs:")
print(response.text)

# ── Chat ─────────────────────────────────────────────────────────────────────
print("\n=== Chat ===")
messages = [
    ChatMessage(role=MessageRole.SYSTEM, content="You are a concise technical assistant."),
    ChatMessage(role=MessageRole.USER, content="What is the difference between RAG and fine-tuning?"),
]
chat_response = llm.chat(messages)
print(chat_response.message.content)

# ── Streaming completion ──────────────────────────────────────────────────────
print("\n=== Streaming ===")
stream = llm.stream_complete("Explain embeddings in 2 sentences:")
for chunk in stream:
    print(chunk.delta, end="", flush=True)
print()

Available models

| Model ID | Context | Best for |
|---|---|---|
| llama3-8b | 8K | Fast responses, high-volume pipelines |
| llama3-70b | 8K | Complex reasoning, RAG synthesis |
| mistral-7b | 32K | Long documents, large context windows |
| mixtral-8x7b | 32K | Highest accuracy, complex tasks |
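
The context column matters most when prompts grow, as they do in RAG once retrieved chunks are added. Below is a rough sketch of filtering the models above by prompt size. It assumes 8K and 32K mean 8192 and 32768 tokens, and that the caller supplies its own token count. MODEL_CONTEXT, models_that_fit, and the 512-token reply budget are hypothetical and only illustrate the idea.

```python
# Context windows from the table above (assumes 8K = 8192, 32K = 32768 tokens).
MODEL_CONTEXT = {
    "llama3-8b": 8192,
    "llama3-70b": 8192,
    "mistral-7b": 32768,
    "mixtral-8x7b": 32768,
}


def models_that_fit(prompt_tokens, reserved_for_output=512):
    """Return model IDs whose context window holds the prompt plus a reply budget."""
    needed = prompt_tokens + reserved_for_output
    return [name for name, window in MODEL_CONTEXT.items() if window >= needed]
```

For example, a prompt of about 10,000 tokens rules out both llama3 models and leaves mistral-7b and mixtral-8x7b.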

What's next