
Quickstart

Knox Chat is a unified, OpenAI-compatible API platform, with support for the Anthropic Messages API format, that gives you access to hundreds of models through one endpoint, with intelligent routing, fallbacks, and transparent pricing. It pairs that with Knox‑MS, a memory‑centric system for long‑running agents and apps that need persistent context.

What you get

  • Smart model routing: performance/cost/balanced strategies with automatic fallback and provider metrics.
  • Multimodal I/O: text, images, PDFs, and audio inputs, plus image generation outputs.
  • Tools and MCP: OpenAI‑compatible tool calling and MCP server bridging.
  • Web search: :online model variant or the web plugin for live citations.
  • Structured & reasoning outputs: JSON Schema enforcement and standardized reasoning tokens.
  • Efficiency features: prompt caching and middle‑out message transforms for long contexts.
  • Reliability: zero‑completion insurance so failed/empty responses aren’t billed.
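
As a concrete sketch of one feature above, web search is enabled per request by appending the :online suffix to the model slug. The payload below is illustrative (the model name and prompt are examples, not official documentation):

```python
import json

# Sketch: a chat-completions payload that opts into web search via the
# ":online" model variant mentioned above. Model slug is illustrative.
payload = {
    "model": "openai/gpt-5:online",  # ":online" requests live web results
    "messages": [
        {"role": "user", "content": "Summarize today's top AI news."}
    ],
}

body = json.dumps(payload)  # ready to send as the request body
```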

Knox‑MS Memory System

Knox‑MS adds multi‑level memory, summaries, vector search, and a knowledge graph to power persistent agents and workflows. It also supports autonomous planning, task orchestration, self‑healing, and realtime progress events. Learn more in the Knox Memory System — Full Feature Deep Dive.

Using the OpenAI SDK

import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.knox.chat/v1',
  apiKey: '<KNOXCHAT_API_KEY>',
});

async function main() {
  const completion = await openai.chat.completions.create({
    model: 'openai/gpt-5',
    messages: [
      {
        role: 'user',
        content: 'What is the meaning of life?',
      },
    ],
  });

  console.log(completion.choices[0].message);
}

main();

Using the Knox.Chat API directly

import requests
import json

response = requests.post(
    url="https://api.knox.chat/v1/chat/completions",
    headers={
        "Authorization": "Bearer <KNOXCHAT_API_KEY>",
        "Content-Type": "application/json",
    },
    data=json.dumps({
        "model": "anthropic/claude-sonnet-4.6",  # Optional
        "messages": [
            {
                "role": "user",
                "content": "What is the meaning of life?"
            }
        ]
    })
)

The API also supports streaming.
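
Streamed responses typically arrive as server-sent events. The sketch below assumes the OpenAI-style SSE framing (`data: {...}` chunks terminated by `data: [DONE]`) and shows how content deltas could be pulled out of each line:

```python
import json

def parse_sse_line(line: str):
    """Return the content delta carried by one SSE data line, or None."""
    if not line.startswith("data: "):
        return None  # comments, keep-alives, and blank lines carry no delta
    data = line[len("data: "):]
    if data == "[DONE]":  # end-of-stream sentinel
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

# With requests, add "stream": True to the JSON body above, then iterate:
#   for raw in response.iter_lines(decode_unicode=True):
#       delta = parse_sse_line(raw)
#       if delta is not None:
#           print(delta, end="")
```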

Using third-party SDKs

For information about using third-party SDKs and frameworks with Knox Chat, please see our frameworks documentation.

Principles

Knox Chat helps teams build reliable, cost‑efficient AI systems across providers. We believe the future is multimodal, multi‑provider, and memory‑centric.

Why Knox Chat?

Price and Performance. Knox Chat scouts for the best prices, lowest latencies, and highest throughput across providers, and lets you choose how to prioritize them.

Standardized API. No code changes when switching between models or providers. Use OpenAI‑compatible SDKs, tools, and structured outputs out of the box.

Memory‑First Apps. Knox‑MS enables long‑running agents with persistent memory, knowledge extraction, and autonomous planning.

Multimodal by Default. Images, PDFs, and audio inputs plus image generation make multimodal workflows straightforward.

Consolidated Billing. Simple and transparent billing, regardless of how many providers you use, with zero‑completion insurance for failed/empty generations.

Higher Availability. Automatic fallback and smart routing keep requests working even when providers go down.

Models

One API for hundreds of models

Explore and browse 300+ models and providers on our website, or with our API.
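
As a sketch, the model list can also be fetched programmatically; the `/v1/models` path is an assumption based on the OpenAI convention the rest of this API follows:

```python
def extract_model_ids(models_response: dict) -> list[str]:
    """Pull every model id out of a /v1/models response body."""
    return [model["id"] for model in models_response["data"]]

def list_model_ids(api_key: str) -> list[str]:
    import requests  # imported lazily so the pure helper above stays dependency-free
    response = requests.get(
        "https://api.knox.chat/v1/models",  # path assumed to mirror OpenAI's API
        headers={"Authorization": f"Bearer {api_key}"},
    )
    response.raise_for_status()
    return extract_model_ids(response.json())
```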

Models API Standard

Our Models API makes the most important information about all LLMs freely available as soon as we confirm it.

API Response Schema

The Models API returns a standardized JSON response format that provides comprehensive metadata for each available model. This schema is cached at the edge and designed for reliable integration into production applications.

Root Response Object

{
  "data": [
    /* Array of Model objects */
  ]
}

Model Object Schema

Each model in the data array contains the following standardized fields:

  • id (string): Unique model identifier used in API requests (e.g., "google/gemini-2.5-pro")
  • object (string): Object type identifier (always "model")
  • created (number): Unix timestamp of when the model was created
  • owned_by (string): Organization that owns the model
  • permission (ModelPermission[]): Array of permission objects defining access controls
  • root (string): Root model identifier
  • parent (string | null): Parent model identifier if this is a fine-tuned version
  • context_length (number): Maximum context window size in tokens
  • architecture (Architecture): Object describing the model's technical capabilities
  • pricing (Pricing): Price structure for using this model
  • top_provider (TopProvider): Configuration details for the primary provider
  • supported_parameters (string[]): Array of supported API parameters for this model
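
These fields make it straightforward to filter the catalog client-side. A minimal sketch, using made-up sample data with the `id`, `context_length`, and `supported_parameters` fields described above:

```python
# Sketch: pick models that support a required parameter and minimum context
# window, based on the Model object fields above (sample data is illustrative).
def pick_models(models: list[dict], needed_param: str, min_context: int) -> list[str]:
    return [
        m["id"]
        for m in models
        if needed_param in m.get("supported_parameters", [])
        and m.get("context_length", 0) >= min_context
    ]

sample = [
    {"id": "google/gemini-2.5-pro", "context_length": 1000000,
     "supported_parameters": ["tools", "structured_outputs"]},
    {"id": "example/small-model", "context_length": 8192,
     "supported_parameters": ["temperature"]},
]
```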

Architecture Object

{
  "modality": string,           // High-level description of input/output flow (e.g., "text+image->text")
  "input_modalities": string[], // Supported input types: ["file", "image", "text", "audio"]
  "output_modalities": string[], // Supported output types: ["text"]
  "tokenizer": string           // Tokenization method used (e.g., "Gemini")
}

Pricing Object

All pricing values are in USD per token/request/unit. A value of "0" indicates the feature is free.

{
  "prompt": string,             // Cost per input token
  "completion": string,         // Cost per output token
  "request": string,            // Fixed cost per API request
  "image": string,              // Cost per image input
  "audio": string,              // Cost per audio input
  "web_search": string,         // Cost per web search operation
  "internal_reasoning": string, // Cost for internal reasoning tokens
  "input_cache_read": string,   // Cost per cached input token read
  "input_cache_write": string   // Cost per cached input token write
}

Top Provider Object

{
  "context_length": number,        // Provider-specific context limit
  "max_completion_tokens": number  // Maximum tokens in response
}

Supported Parameters

The supported_parameters array indicates which OpenAI-compatible parameters work with each model:

  • include_reasoning - Include reasoning in response
  • max_tokens - Response length limiting
  • reasoning - Internal reasoning mode
  • response_format - Output format specification
  • seed - Deterministic outputs
  • stop - Custom stop sequences
  • structured_outputs - JSON schema enforcement
  • temperature - Randomness control
  • tool_choice - Tool selection control
  • tools - Function calling capabilities
  • top_p - Nucleus sampling

Token Counting Differences

Different models use different tokenization methods (as indicated by the tokenizer field in the model schema). Some models break up text into chunks of multiple characters (GPT, Claude, Llama, etc), while others tokenize differently (like Gemini). This means that token counts (and therefore costs) will vary between models, even when inputs and outputs are the same. Costs are displayed and billed according to the tokenizer for the model in use. You can use the usage field in API responses to get the actual token counts for your input and output.
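
Reading those counts is a matter of inspecting the usage field of the response body. A minimal sketch, assuming the OpenAI-style field names (`prompt_tokens`, `completion_tokens`) and using sample data only:

```python
# Sketch: read actual token counts from the usage field of a completion
# response. Field names follow the OpenAI convention; sample data only.
def token_counts(response_body: dict) -> tuple[int, int]:
    usage = response_body["usage"]
    return usage["prompt_tokens"], usage["completion_tokens"]

sample = {"usage": {"prompt_tokens": 12, "completion_tokens": 34, "total_tokens": 46}}
```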

Frequently Asked Questions

Privacy and Data Logging

Please see our Terms of Service and Privacy Policy.