API Documentation

Cloudach is an OpenAI-compatible LLM API. Drop in your API key and base URL — no code changes needed.

Quickstart

Get from zero to your first API call in under 5 minutes.

Step 1 — Sign up

  1. Go to app.cloudach.com/signup
  2. Enter your email and create a password
  3. You are now logged in and ready

Step 2 — Create an API key

  1. Open the API Keys page in your dashboard
  2. Click Create new key and give it a name (e.g. my-first-key)
  3. Copy and store the key — it is shown only once. Format: sk-cloudach-...

Step 3 — Make your first call

curl https://api.cloudach.com/v1/chat/completions \
  -H "Authorization: Bearer sk-cloudach-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "messages": [{"role": "user", "content": "Hello from Cloudach!"}]
  }'
Time to first token: ~1 second once your key is in hand.
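
For readers who prefer Python to curl, the same call can be assembled with only the standard library. This is a minimal sketch, not an official client: build_chat_request is a hypothetical helper name, and the request is built but not sent so that it can be inspected first.

```python
import json
import urllib.request

BASE_URL = "https://api.cloudach.com/v1"


def build_chat_request(api_key: str, content: str, model: str = "llama3-8b") -> urllib.request.Request:
    """Build (but do not send) a POST to /chat/completions."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To send it (needs a real key and network access):
#   with urllib.request.urlopen(build_chat_request(key, "Hello from Cloudach!")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```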

Authentication

All requests require an Authorization header with your API key as a Bearer token.

Authorization: Bearer sk-cloudach-YOUR_KEY

API key properties

  • Prefix: sk-cloudach-
  • Stored as a SHA-256 hash server-side — the raw key is never recoverable after creation
  • Can be revoked instantly from the dashboard
  • Multiple keys per account are supported (one per integration recommended)
  • Auth cache TTL: 60 seconds — revocation propagates within 60s

Auth errors

// 401 — missing Authorization header
{"error": {"message": "Missing credentials. Include 'Authorization: Bearer <api-key>'.", "type": "invalid_request_error"}}

// 401 — invalid or revoked key
{"error": {"message": "Invalid or revoked API key.", "type": "authentication_error"}}

Endpoints

Base URL: https://api.cloudach.com/v1

| Endpoint | Method | Description | Auth |
| --- | --- | --- | --- |
| /v1/chat/completions | POST | Chat messages (streaming supported) | Required |
| /v1/completions | POST | Text completions (legacy format) | Required |
| /v1/models | GET | List available models | Required |
| /v1/models/{model_id} | GET | Get a specific model | Required |
| /health | GET | Health check | None |

POST /v1/chat/completions

OpenAI-compatible chat endpoint. Supports streaming via Server-Sent Events.

Request body

{
  "model": "llama3-8b",           // required — see /v1/models for available models
  "messages": [                   // required — non-empty array
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is 2 + 2?"}
  ],
  "stream": false,                // optional — true for SSE streaming
  "temperature": 0.7,             // optional — 0.0–2.0 (default 1.0)
  "max_tokens": 512               // optional — max completion tokens
}

Non-streaming response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1712345678,
  "model": "llama3-8b",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The answer is 4."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 22, "completion_tokens": 8, "total_tokens": 30}
}
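
As an illustration of reading this body, the sketch below pulls out the fields most callers need. extract_reply is a hypothetical helper, and the sample dictionary simply repeats the response shown above.

```python
def extract_reply(response: dict) -> dict:
    """Pick the assistant text, a truncation flag, and the token total from a chat.completion body."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        # finish_reason "length" means the completion was cut off by max_tokens
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": response["usage"]["total_tokens"],
    }


sample = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1712345678,
    "model": "llama3-8b",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The answer is 4."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 22, "completion_tokens": 8, "total_tokens": 30},
}

print(extract_reply(sample))
# {'text': 'The answer is 4.', 'truncated': False, 'total_tokens': 30}
```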

Streaming (SSE)

Set "stream": true. Response is a stream of data: ... lines, ending with data: [DONE].

data: {"id":"chatcmpl-abc","choices":[{"delta":{"role":"assistant","content":""},"index":0}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"The "},"index":0}]}

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"answer is 4."},"index":0}],"usage":{"prompt_tokens":22,"completion_tokens":8,"total_tokens":30}}

data: [DONE]
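
If you consume the stream without an SDK, the lines above can be reassembled roughly as follows. This is a sketch under the assumption that the response has already been split into decoded text lines; collect_stream is a hypothetical name.

```python
import json


def collect_stream(lines):
    """Concatenate delta content from SSE lines until data: [DONE]; also return the last usage seen."""
    parts = []
    usage = None
    for line in lines:
        if not line.startswith("data:"):
            continue  # blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        for choice in event.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
        usage = event.get("usage", usage)
    return "".join(parts), usage
```

Feeding it the four data: lines shown above yields the text "The answer is 4." together with the usage object from the final content chunk.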

POST /v1/completions

Legacy text completion endpoint (OpenAI format). Use chat completions for new integrations.

// Request
{
  "model": "llama3-8b",
  "prompt": "The capital of France is",
  "max_tokens": 20
}

// Response
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "model": "llama3-8b",
  "choices": [{"text": " Paris.", "index": 0, "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 7, "completion_tokens": 3, "total_tokens": 10}
}

GET /v1/models

// Response
{
  "object": "list",
  "data": [
    {"id": "llama3-8b",  "object": "model", "owned_by": "cloudach"},
    {"id": "mistral-7b", "object": "model", "owned_by": "cloudach"}
  ]
}

Available models

| Model ID | Context | Best for |
| --- | --- | --- |
| llama3-8b | 8K | Fast chat, Q&A, summarization |
| llama3-70b | 8K | Complex reasoning, analysis |
| llama31-8b | 128K | Long-context chat, fast inference |
| llama31-70b | 128K | State-of-the-art open model, long context |
| mistral-7b | 32K | Long context, code, EU-hosted |
| mixtral-8x7b | 32K | Best accuracy, complex tasks |
| command-r-plus | 128K | RAG, tool use, multi-step agents |
| dbrx | 32K | Coding, reasoning, MoE efficiency |

Rate Limits

Rate limits apply per API key. RPM limits use a rolling 60-second window; TPD limits reset at midnight UTC.

Limits by plan

| Plan | Requests / min (RPM) | Tokens / day (TPD) | Notes |
| --- | --- | --- | --- |
| Free | 60 | 1,000,000 | Default on sign-up |
| Pro | 600 | 10,000,000 | Available after plan upgrade |
| Enterprise | Custom | Custom | Contact sales@cloudach.com |

Per-key overrides

You can set a custom rate_limit_rpm on individual API keys from the dashboard. Useful for restricting keys used in untrusted environments or increasing limits for high-throughput integrations.

Rate-limit response headers

Every API response includes these headers so you can track your usage proactively:

| Header | Example value | Meaning |
| --- | --- | --- |
| X-RateLimit-Limit-Requests | 60 | Your RPM ceiling |
| X-RateLimit-Remaining-Requests | 42 | Requests left in this 60-second window |
| X-RateLimit-Reset-Requests | 2026-04-14T12:01:00Z | UTC timestamp when the window resets |
| X-RateLimit-Limit-Tokens | 1000000 | Your daily token ceiling |
| X-RateLimit-Remaining-Tokens | 987432 | Tokens left today |
| X-RateLimit-Reset-Tokens | 2026-04-15T00:00:00Z | UTC timestamp of next daily reset |
| Retry-After | 60 | Seconds to wait before retrying (only on 429 responses) |
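
One way to use these headers proactively is to pause before the window is exhausted. The sketch below assumes the headers arrive as a plain dict with the exact capitalisation shown above (real HTTP clients usually offer case-insensitive lookup); seconds_until and should_pause are hypothetical helpers.

```python
from datetime import datetime, timezone


def seconds_until(reset_header: str, now: datetime) -> float:
    """Seconds from now until the UTC timestamp in a reset header (never negative)."""
    # Swap the trailing Z for an explicit offset so fromisoformat also works before Python 3.11
    reset = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    return max(0.0, (reset - now).total_seconds())


def should_pause(headers: dict, min_remaining: int = 1) -> bool:
    """True when fewer than min_remaining requests are left in the current window."""
    return int(headers["X-RateLimit-Remaining-Requests"]) < min_remaining


now = datetime(2026, 4, 14, 12, 0, 30, tzinfo=timezone.utc)
print(seconds_until("2026-04-14T12:01:00Z", now))  # 30.0
```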

429 error responses

Use the type field to distinguish RPM from TPD errors:

// RPM exceeded — type: "requests"
{"error": {"message": "Rate limit exceeded: 60 requests per minute.", "type": "requests", "code": "rate_limit_exceeded"}}

// TPD exceeded — type: "tokens"
{"error": {"message": "You have exceeded your daily token limit of 1,000,000 tokens. Tokens reset at midnight UTC.", "type": "tokens", "code": "rate_limit_exceeded"}}
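
A small sketch of branching on the type field, using the two bodies above as input. rate_limit_kind is a hypothetical helper; the labels it returns are illustrative, not part of the API.

```python
def rate_limit_kind(error_body: dict) -> str:
    """Return 'rpm', 'tpd', or 'other' for a parsed error body."""
    err = error_body.get("error", {})
    if err.get("code") != "rate_limit_exceeded":
        return "other"
    return {"requests": "rpm", "tokens": "tpd"}.get(err.get("type"), "other")


rpm_body = {"error": {"message": "Rate limit exceeded: 60 requests per minute.",
                      "type": "requests", "code": "rate_limit_exceeded"}}
print(rate_limit_kind(rpm_body))  # rpm
```

An rpm result is worth retrying after the Retry-After delay, while a tpd result will keep failing until the daily reset at midnight UTC, so queueing or surfacing the error is usually the better choice there.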

Handling 429s — exponential backoff

Always respect the Retry-After header when present. Fall back to exponential backoff (1 s → 2 s → 4 s → 8 s) when the header is absent.

Python

import time
from openai import OpenAI, APIStatusError

client = OpenAI(api_key="sk-cloudach-YOUR_KEY", base_url="https://api.cloudach.com/v1")

RETRYABLE = {429, 500, 502, 503}

def chat_with_backoff(messages, model="llama3-8b", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except APIStatusError as e:
            if e.status_code not in RETRYABLE or attempt == max_retries - 1:
                raise
            retry_after = e.response.headers.get("Retry-After")
            wait = float(retry_after) if retry_after else 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({e.status_code}). Retrying in {wait}s...")
            time.sleep(wait)

Node.js

import OpenAI from "openai";

const client = new OpenAI({ apiKey: "sk-cloudach-YOUR_KEY", baseURL: "https://api.cloudach.com/v1" });

const RETRYABLE = new Set([429, 500, 502, 503]);

async function chatWithBackoff(messages, model = "llama3-8b", maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({ model, messages });
    } catch (err) {
      if (!(err instanceof OpenAI.APIError) || !RETRYABLE.has(err.status) || attempt === maxRetries - 1) {
        throw err;
      }
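      // Header access differs by SDK version: a Headers object exposes get(), a plain object uses keys
      const retryAfter = err.headers?.get?.("retry-after") ?? err.headers?.["retry-after"];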
      const wait = retryAfter ? parseFloat(retryAfter) * 1000 : Math.pow(2, attempt) * 1000;
      console.log(`Attempt ${attempt + 1} failed (${err.status}). Retrying in ${wait}ms...`);
      await new Promise((r) => setTimeout(r, wait));
    }
  }
}

Need higher limits? Contact sales to discuss enterprise quotas.

Error Codes

All errors follow the OpenAI error schema: {"error": {"message": "...", "type": "...", "code": "...", "param": "..."}}. param is only included when the error is tied to a specific request field.

Error reference

| HTTP | code | type | Cause | Fix |
| --- | --- | --- | --- | --- |
| 400 | invalid_request | invalid_request_error | Malformed JSON body | Validate JSON; set Content-Type: application/json |
| 400 | missing_required_param | invalid_request_error | model or messages missing | Include both model and messages in every request |
| 400 | invalid_param_value | invalid_request_error | temperature out of [0,2], empty messages, etc. | Validate parameter values before sending |
| 400 | context_length_exceeded | invalid_request_error | Prompt + max_tokens exceeds model context | Trim history or switch to a larger-context model |
| 401 | missing_credentials | invalid_request_error | No Authorization header | Add Authorization: Bearer <key> |
| 401 | invalid_api_key | authentication_error | Key is wrong, expired, or revoked | Check or rotate key in the dashboard |
| 403 | insufficient_quota | permission_error | Monthly token cap reached | Upgrade plan or wait for reset |
| 404 | model_not_found | invalid_request_error | Model ID not recognised | Call GET /v1/models for valid IDs |
| 404 | not_found | invalid_request_error | Route does not exist | Check base URL and path |
| 413 | request_too_large | invalid_request_error | Body > 1 MB | Chunk large payloads |
| 429 | rate_limit_exceeded | requests | RPM limit hit | Wait Retry-After seconds; use exponential backoff |
| 429 | rate_limit_exceeded | tokens | Daily token limit hit | Wait until midnight UTC for reset |
| 500 | internal_server_error | api_error | Unexpected server fault | Retry with backoff; contact support if persistent |
| 502 | model_backend_unavailable | api_error | Inference backend down or overloaded | Retry with exponential backoff |
| 503 | service_unavailable | api_error | Maintenance window | Check status.cloudach.com |

Example error responses

400 — Context length exceeded

{
  "error": {
    "message": "This model's maximum context length is 8192 tokens, but your request has 9500 tokens (8100 prompt + 1400 max_tokens). Shorten your messages or reduce max_tokens.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}
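
The arithmetic behind this error is prompt tokens + max_tokens ≤ context limit. As a sketch, a client could clamp max_tokens before sending; max_tokens_that_fit is a hypothetical helper, and the prompt token count is assumed to come from your own tokenizer or from usage.prompt_tokens of an earlier response.

```python
def max_tokens_that_fit(prompt_tokens: int, context_limit: int, requested: int) -> int:
    """Largest max_tokens <= requested such that prompt_tokens + max_tokens <= context_limit."""
    available = context_limit - prompt_tokens
    if available <= 0:
        raise ValueError("prompt alone fills the context window; trim the history")
    return min(requested, available)


# The failing request above: 8100 prompt tokens, 1400 requested, 8192-token context
print(max_tokens_that_fit(8100, 8192, 1400))  # 92
```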

401 — Invalid key

{
  "error": {
    "message": "Invalid or revoked API key.",
    "type": "authentication_error",
    "code": "invalid_api_key"
  }
}

404 — Model not found

{
  "error": {
    "message": "The model 'gpt-4' does not exist or you do not have access to it.",
    "type": "invalid_request_error",
    "code": "model_not_found",
    "param": "model"
  }
}

429 — Rate limited

// RPM exceeded (type: "requests")
{"error": {"message": "Rate limit exceeded: 60 requests per minute.", "type": "requests", "code": "rate_limit_exceeded"}}

// Daily token cap (type: "tokens")
{"error": {"message": "You have exceeded your daily token limit of 1,000,000 tokens. Tokens reset at midnight UTC.", "type": "tokens", "code": "rate_limit_exceeded"}}

Do NOT retry 400, 401, 403, 404, or 413 errors: they indicate a problem with the request, not a transient fault. Retry 429, 500, 502, and 503 with exponential backoff.

Streaming error handling

Errors in streaming responses fall into two categories:

  • Pre-stream errors — the request fails before any data: events are sent. You receive a normal HTTP error response (non-200 status, JSON body). Handle identically to non-streaming errors.
  • Mid-stream errors — the backend fails after the stream has started. The data: sequence is cut short; the final event is an error object instead of data: [DONE].

Mid-stream error event

data: {"id":"chatcmpl-abc","choices":[{"delta":{"content":"The answer is"},"index":0}]}

data: {"error": {"message": "Stream interrupted by server.", "type": "api_error", "code": "stream_error"}}

// Connection closes — [DONE] is NOT sent
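
For clients that parse the stream themselves, a mid-stream failure can be detected in two ways: an event that carries an error object, or a stream that ends without data: [DONE]. The sketch below handles both and keeps the partial text; StreamInterrupted and read_stream are hypothetical names, and the input is assumed to be decoded text lines.

```python
import json


class StreamInterrupted(Exception):
    """Raised when a stream fails after it started; carries the text received so far."""

    def __init__(self, message: str, partial_text: str):
        super().__init__(message)
        self.partial_text = partial_text


def read_stream(lines) -> str:
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return "".join(parts)  # clean finish
        event = json.loads(payload)
        if "error" in event:
            raise StreamInterrupted(event["error"]["message"], "".join(parts))
        for choice in event.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    # The connection closed without [DONE]: treat it as an interruption as well
    raise StreamInterrupted("stream ended without [DONE]", "".join(parts))
```

Keeping partial_text on the exception lets the caller decide whether to show the fragment or discard it before retrying the full request.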

Python — streaming with error handling

from openai import OpenAI, APIError, APIStatusError

client = OpenAI(api_key="sk-cloudach-YOUR_KEY", base_url="https://api.cloudach.com/v1")

collected = []
try:
    stream = client.chat.completions.create(
        model="llama3-8b",
        messages=[{"role": "user", "content": "Tell me a story."}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        delta = chunk.choices[0].delta.content or ""
        collected.append(delta)
        print(delta, end="", flush=True)
except APIStatusError as e:
    # Pre-stream HTTP error: a non-2xx status arrived before any data: event
    print(f"\nStream error ({e.status_code}): {e.message}")
    # Retry the full request if e.status_code in {429, 500, 502, 503}
except APIError as e:
    # Mid-stream error event, or a connection error raised by the SDK; no HTTP status is attached
    print(f"\nStream interrupted: {e.message}")
    # Retry the full request; collected holds the partial output

Node.js — streaming with error handling

import OpenAI from "openai";

const client = new OpenAI({ apiKey: "sk-cloudach-YOUR_KEY", baseURL: "https://api.cloudach.com/v1" });

const collected = [];
try {
  const stream = await client.chat.completions.create({
    model: "llama3-8b",
    messages: [{ role: "user", content: "Tell me a story." }],
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    collected.push(delta);
    process.stdout.write(delta);
  }
} catch (err) {
  if (err instanceof OpenAI.APIError) {
    console.error(`\nStream error (${err.status}): ${err.message}`);
    // Retry if err.status is in [429, 500, 502, 503]
  } else {
    throw err; // Network-level error (TCP reset, proxy timeout)
  }
}

Network-level interruptions (TCP resets, proxy timeouts) surface as connection errors from the HTTP client, not as API error JSON. Always wrap stream consumption in a try/catch and implement a retry strategy for the full request.

Webhooks

Webhooks let you receive real-time HTTP POST notifications when events happen in your Cloudach account. Register an endpoint URL in the dashboard and subscribe to the event types you care about.

Event types

| Event | When it fires |
| --- | --- |
| usage.threshold | Cumulative spend for the billing period crosses a threshold |
| api_key.created | A new API key is created |
| api_key.revoked | An API key is revoked |
| request.failed | An API request returns a 4xx or 5xx status code |

Verifying signatures

Every delivery includes an X-Cloudach-Signature header formatted as sha256=<hex>. Verify it by computing HMAC-SHA256(secret, rawBody) with your webhook signing secret and comparing the result. Reject requests where signatures do not match.

# Python
import hmac, hashlib

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = "sha256=" + hmac.new(
        secret.encode(), body, hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, header)

// Node.js
const crypto = require('crypto');

function verify(secret, rawBody, header) {
  const expected = Buffer.from('sha256=' + crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex'));
  const received = Buffer.from(header || '');
  // timingSafeEqual throws if the lengths differ, so compare lengths first
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}
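
To exercise a handler locally without waiting for a real delivery, you can produce a signature the same way the header is described above. This is a sketch for testing only: the secret and body are made up, and sign is a hypothetical helper.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Build a header value in the documented form: sha256=<hex HMAC-SHA256 of the raw body>."""
    return "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def verify(secret: str, body: bytes, header: str) -> bool:
    """Constant-time comparison of the expected and received header values."""
    return hmac.compare_digest(sign(secret, body), header)


secret = "whsec_test_only"  # hypothetical signing secret
body = b'{"id": "evt_test", "event": "api_key.created", "created": 1713100800, "data": {}}'
header = sign(secret, body)

print(verify(secret, body, header))         # True
print(verify(secret, body + b" ", header))  # False
```

The second call fails because the signature covers the raw bytes: verify against the body exactly as received, before any JSON parsing or re-serialisation.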

Payload structure

{
  "id": "evt_01abc...",
  "event": "api_key.created",
  "created": 1713100800,
  "data": { ... }
}
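
A receiver typically routes on the event field. The sketch below shows one possible shape; the handler functions and their return strings are hypothetical, and nothing is read from data because its fields are not specified here.

```python
import json


def on_key_created(data: dict) -> str:
    return "logged key creation"


def on_key_revoked(data: dict) -> str:
    return "logged key revocation"


HANDLERS = {
    "api_key.created": on_key_created,
    "api_key.revoked": on_key_revoked,
}


def dispatch(raw_body: bytes) -> str:
    """Parse a delivery and route it by event type."""
    payload = json.loads(raw_body)
    handler = HANDLERS.get(payload["event"])
    if handler is None:
        return "ignored"  # acknowledge event types this receiver does not handle
    return handler(payload.get("data", {}))


print(dispatch(b'{"id": "evt_test", "event": "api_key.created", "created": 1713100800, "data": {}}'))
# logged key creation
```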

Retry policy

Non-2xx responses or timeouts (10 s) trigger up to 3 retries with exponential back-off (0.5 s → 1 s → 2 s). View delivery history in the Webhooks dashboard.

SDK Reference

Cloudach is drop-in compatible with any OpenAI SDK. Change two values: base_url and api_key. All request/response shapes are identical.

SDK quickstart guides

| SDK | Install | base_url / baseURL |
| --- | --- | --- |
| Python openai ≥ 1.0 | pip install openai | https://api.cloudach.com/v1 |
| Node.js openai ≥ 4.0 | npm install openai | https://api.cloudach.com/v1 |
| LangChain (Python) | pip install langchain-openai | OPENAI_API_BASE env var (or the openai_api_base parameter) |
| LiteLLM | pip install litellm | api_base config |
| Direct HTTP / curl | (none) | Authorization: Bearer header |

chat.completions.create — parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | Yes | | Model ID. See /v1/models for available IDs. |
| messages | array | Yes | | Non-empty array of {role, content} objects. Roles: system, user, assistant. |
| stream | boolean | No | false | If true, response is SSE stream of delta chunks ending with data: [DONE]. |
| temperature | number | No | 1.0 | Sampling temperature. 0.0 = deterministic, 2.0 = very random. |
| max_tokens | number | No | model max | Maximum tokens to generate. Caps completion length. |
| top_p | number | No | 1.0 | Nucleus sampling. Alternative to temperature. Use one, not both. |
| n | number | No | 1 | Number of completions to generate. Higher values multiply token cost. |
| stop | string or array | No | null | Stop sequence(s). Generation halts when one is produced. |
| presence_penalty | number | No | 0.0 | -2.0 to 2.0. Positive values penalise repeated topics. |
| frequency_penalty | number | No | 0.0 | -2.0 to 2.0. Positive values penalise repeated tokens. |
| user | string | No | | Stable end-user identifier for abuse monitoring. |
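
Checking these ranges before sending avoids a round trip that would end in a 400. The sketch below covers only the constraints stated in the table; validate_chat_params is a hypothetical helper, and the server remains the authority on what is accepted.

```python
def validate_chat_params(params: dict) -> list:
    """Return a list of problems found in a chat.completions request body (empty if none)."""
    problems = []
    if not params.get("model"):
        problems.append("model is required")
    messages = params.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty array")
    temperature = params.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append("temperature must be between 0.0 and 2.0")
    for name in ("presence_penalty", "frequency_penalty"):
        value = params.get(name)
        if value is not None and not -2.0 <= value <= 2.0:
            problems.append(f"{name} must be between -2.0 and 2.0")
    return problems


print(validate_chat_params({"model": "llama3-8b", "messages": [], "temperature": 2.5}))
# ['messages must be a non-empty array', 'temperature must be between 0.0 and 2.0']
```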

chat.completions.create — response (non-streaming)

{
  "id": "chatcmpl-abc123",           // unique completion ID
  "object": "chat.completion",
  "created": 1712345678,             // Unix timestamp
  "model": "llama3-8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help?"
      },
      "finish_reason": "stop"        // "stop" | "length" | "content_filter"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 7,
    "total_tokens": 25
  }
}

chat.completions.create — streaming chunk

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion.chunk",
  "created": 1712345678,
  "model": "llama3-8b",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",  // only in first chunk
        "content": "Hello"   // null in first and last chunks
      },
      "finish_reason": null   // "stop" | "length" in final chunk
    }
  ],
  // usage only in the last content chunk (before [DONE])
  "usage": { "prompt_tokens": 18, "completion_tokens": 1, "total_tokens": 19 }
}

models.list — response

// GET /v1/models
{
  "object": "list",
  "data": [
    {
      "id": "llama3-8b",
      "object": "model",
      "created": 1712000000,
      "owned_by": "cloudach"
    }
    // ... more models
  ]
}

Python — full client reference

from openai import OpenAI, AsyncOpenAI

# Sync client
client = OpenAI(
    base_url="https://api.cloudach.com/v1",
    api_key="sk-cloudach-YOUR_KEY",
    timeout=60.0,       # request timeout in seconds (SDK default: 600)
    max_retries=2,      # automatic retries on 429/5xx (default: 2)
)

# Async client (asyncio / FastAPI)
async_client = AsyncOpenAI(
    base_url="https://api.cloudach.com/v1",
    api_key="sk-cloudach-YOUR_KEY",
)

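# Chat completion (sync)
messages = [{"role": "user", "content": "Hello from Cloudach!"}]
response = client.chat.completions.create(model="llama3-8b", messages=messages)
text = response.choices[0].message.content
usage = response.usage  # .prompt_tokens, .completion_tokens, .total_tokens

# Chat completion (async): await must run inside a coroutine, e.g. asyncio.run(main())
async def main():
    response = await async_client.chat.completions.create(model="llama3-8b", messages=messages)
    return response.choices[0].message.content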

# List models
models = client.models.list()
for m in models.data:
    print(m.id)

# Retrieve a model
model = client.models.retrieve("llama3-8b")

Node.js — full client reference

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.cloudach.com/v1",
  apiKey: "sk-cloudach-YOUR_KEY",
  timeout: 60_000,    // ms (SDK default: 600000)
  maxRetries: 2,      // automatic retries on 429/5xx (default: 2)
});

// Chat completion
const messages = [{ role: "user", content: "Hello from Cloudach!" }];
const response = await client.chat.completions.create({ model: "llama3-8b", messages });
const text = response.choices[0].message.content;
const usage = response.usage; // .prompt_tokens, .completion_tokens, .total_tokens

// Streaming
const stream = await client.chat.completions.create({
  model: "llama3-8b", messages, stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}

// List models
const models = await client.models.list();
for (const m of models.data) console.log(m.id);

Full guides: Python quickstart · Node.js quickstart · Streaming guide · Migrate from OpenAI

Integrations

Use Cloudach with popular LLM frameworks. Drop-in compatible — change the base URL and API key, nothing else.

Fine-Tuning

Fine-tuning lets you adapt a base model to your domain, tone, or task using your own labelled examples. Cloudach exposes fine-tuning through a simple REST API and serves the resulting LoRA adapters on top of vLLM with the same sub-100ms latency as base models.

Quickstart

The workflow has four steps: prepare a JSONL dataset → upload → create job → infer.

# 1. Upload your dataset
curl https://api.cloudach.com/v1/fine-tuning/datasets \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -F "file=@training_data.jsonl" \
  -F "purpose=fine-tune"
# → {"id": "ds-8f3a2b1c", ...}

# 2. Create a LoRA fine-tuning job
curl https://api.cloudach.com/v1/fine-tuning/jobs \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "training_file": "ds-8f3a2b1c",
    "method": {"type": "lora", "lora": {"rank": 16, "alpha": 32}},
    "hyperparameters": {"n_epochs": 3},
    "suffix": "my-model"
  }'
# → {"id": "ftjob-a1b2c3", "status": "queued"}

# 3. Poll until succeeded
curl https://api.cloudach.com/v1/fine-tuning/jobs/ftjob-a1b2c3 \
  -H "Authorization: Bearer $CLOUDACH_API_KEY"
# → {"status": "succeeded", "fine_tuned_model": "llama3-8b:ft:my-model:ftjob-a1b2c3"}

# 4. Infer — use fine_tuned_model as the model ID
curl https://api.cloudach.com/v1/chat/completions \
  -H "Authorization: Bearer $CLOUDACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3-8b:ft:my-model:ftjob-a1b2c3", "messages": [...]}'
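
Step 3 is usually automated with a polling loop. The sketch below keeps the HTTP call out of the loop by accepting any callable that returns the parsed job object, which also makes it easy to test. wait_for_job is a hypothetical helper; only the succeeded status appears above, so the failed and cancelled values are assumptions.

```python
import time

# "succeeded" is documented above; "failed" and "cancelled" are assumed terminal states
TERMINAL = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_job, poll_seconds: float = 30.0, max_polls: int = 120, sleep=time.sleep) -> dict:
    """Call fetch_job() until the job reaches a terminal status, sleeping between polls."""
    for _ in range(max_polls):
        job = fetch_job()
        if job["status"] in TERMINAL:
            return job
        sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")
```

In practice fetch_job would wrap the GET request from step 3, and the caller would read fine_tuned_model from the returned object once status is succeeded.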

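Dataset format: each line of your .jsonl file must be a single JSON object with a messages array (identical to the OpenAI fine-tuning format). Minimum 100 examples. The example below is wrapped across several lines for readability; in the file it occupies one line.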

{"messages": [
  {"role": "system", "content": "You are a helpful support agent for Acme Corp."},
  {"role": "user",   "content": "How do I reset my password?"},
  {"role": "assistant", "content": "Go to Settings → Security → Reset Password. You'll receive a link within 2 minutes."}
]}
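
A quick local check before uploading can catch the most common dataset problems. This sketch tests only the rules stated above (one JSON object per line, a messages array, at least 100 examples) plus the three chat roles; validate_dataset is a hypothetical helper, and the role check is an assumption about what the service accepts.

```python
import json

MIN_EXAMPLES = 100
ROLES = {"system", "user", "assistant"}  # assumed to match the chat roles


def validate_dataset(lines) -> list:
    """Return a list of problems found in JSONL training lines (empty if none)."""
    problems = []
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        count += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing messages array")
            continue
        if any(not isinstance(m, dict) or m.get("role") not in ROLES for m in messages):
            problems.append(f"line {number}: unexpected role")
    if count < MIN_EXAMPLES:
        problems.append(f"only {count} examples; at least {MIN_EXAMPLES} required")
    return problems
```

Pass it the lines of the file, for example validate_dataset(open("training_data.jsonl", encoding="utf-8")), and upload only when the returned list is empty.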

LoRA adapters

Cloudach uses vLLM with multi-adapter LoRA support. LoRA trains lightweight adapter weights (≈ 0.1–1% of model size) rather than updating the full model. Key properties:

  • Adapter loading adds < 50 ms on first request; warm requests have zero overhead
  • Multiple adapters for the same base model share one GPU replica — you pay base model rates, not a new GPU per fine-tune
  • Adapters can be downloaded for self-hosted vLLM deployments

| Parameter | Default | Description |
| --- | --- | --- |
| method.type | lora | Training method: lora or full (7B and 8B models only) |
| lora.rank | 16 | Adapter capacity: 8, 16, 32, or 64. Higher = more expressive, higher cost |
| lora.alpha | 2 × rank | Scaling factor. Usually set to 2× rank |
| lora.target_modules | q_proj, v_proj | Weight matrices to train |
| n_epochs | 3 | Training passes over the dataset |
| batch_size | 16 | Examples per gradient step |
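
To see where the adapter-size figure comes from: LoRA adds two low-rank matrices per adapted weight, so a matrix of shape d_in × d_out gains rank × (d_in + d_out) trainable parameters. The sketch below works this out for the default settings. The layer dimensions are assumptions based on the published Llama 3 8B architecture (hidden size 4096, 32 layers, v_proj output 1024 under grouped-query attention), not values supplied by Cloudach.

```python
def lora_param_count(rank: int, shapes, num_layers: int) -> int:
    """Trainable parameters added by LoRA: rank * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumed Llama 3 8B shapes for the default target modules q_proj and v_proj
shapes = [(4096, 4096), (4096, 1024)]
added = lora_param_count(rank=16, shapes=shapes, num_layers=32)
share = added / 8.03e9 * 100  # percent of roughly 8.03 billion base parameters

print(added)            # 6815744
print(f"{share:.3f}%")  # 0.085%
```

Under these assumptions the default rank of 16 sits just under 0.1% of the base model, and rank 64 gives four times as many parameters (about 0.34%); adding more target modules raises the share further.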

Supported base models

| Model | Method | LoRA rank options |
| --- | --- | --- |
| llama3-8b | Full fine-tune + LoRA | 8, 16, 32, 64 |
| llama3-70b | LoRA only | 8, 16, 32 |
| llama31-8b | Full fine-tune + LoRA | 8, 16, 32, 64 |
| llama31-70b | LoRA only | 8, 16, 32 |
| mistral-7b | Full fine-tune + LoRA | 8, 16, 32, 64 |
| mixtral-8x7b | LoRA only | 8, 16, 32 |

Full reference: Fine-Tuning API Reference — all endpoints, parameters, error codes, and pricing. See also the step-by-step tutorial and the Data Preparation Guide.

Tutorials

Step-by-step guides for common use cases.

API Playground

Try the API directly from your browser — no terminal needed. Fill in the inputs, click Run, and watch the response stream in real time. The code panel stays in sync as you type.

Create a key in your dashboard — it is pre-filled if you are already signed in.

Generated Code

Updates live as you change inputs.

curl https://api.cloudach.com/v1/chat/completions \
  -H "Authorization: Bearer sk-cloudach-YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-8b",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Hello! Tell me about yourself."
        }
    ],
    "temperature": 0.7,
    "max_tokens": 256,
    "stream": true
  }'

FAQ

Is Cloudach fully OpenAI-compatible?

Yes. Cloudach implements the OpenAI REST API spec for chat completions, text completions, and model listing. Any SDK or tool that targets OpenAI works with Cloudach by changing the base URL and API key.

Do I need to change my code to switch from OpenAI?

No. Set base_url to https://api.cloudach.com/v1 and swap your API key. That's it. All request/response shapes are identical.

What are the current rate limits?

Free tier: 60 requests per minute and 1,000,000 tokens per day per API key. Pro: 600 requests per minute and 10,000,000 tokens per day. Contact sales@cloudach.com for enterprise limits.

What happens when I exceed the token quota?

Requests will return a 429 with code rate_limit_exceeded and a Retry-After header. Tokens reset at midnight UTC. You can upgrade or purchase additional token packs from the dashboard.

Can I use Cloudach in production?

Yes. Cloudach is production-ready with 99.9% uptime SLA on paid plans, sub-100ms median TTFT, and autoscaling infrastructure. See the Status page for live metrics.

Is my data private?

Cloudach does not log prompt or completion content. Request metadata (token counts, model, timestamp) is stored for billing. Your data is never used to train models.

Which models are available?

Llama 3 (8B, 70B), Llama 3.1 (8B, 70B), Mistral 7B, Mixtral 8×7B, Command R+, and DBRX are available today; see Available models for IDs and context windows. New models are announced on the blog.