Quickstart
Knox Chat is a unified, OpenAI-compatible API and platform, with support for the Anthropic Messages API format, that gives you access to hundreds of models through one endpoint, with intelligent routing, fallbacks, and transparent pricing. It pairs that with Knox‑MS, a memory‑centric system for long‑running agents and apps that need persistent context.
What you get
- Smart model routing: performance/cost/balanced strategies with automatic fallback and provider metrics.
- Multimodal I/O: text, images, PDFs, and audio inputs, plus image generation outputs.
- Tools and MCP: OpenAI‑compatible tool calling and MCP server bridging.
- Web search: the :online model variant or the web plugin for live citations.
- Structured & reasoning outputs: JSON Schema enforcement and standardized reasoning tokens.
- Efficiency features: prompt caching and middle‑out message transforms for long contexts.
- Reliability: zero‑completion insurance so failed/empty responses aren’t billed.
Knox‑MS Memory System
Knox‑MS adds multi‑level memory, summaries, vector search, and a knowledge graph to power persistent agents and workflows. It also supports autonomous planning, task orchestration, self‑healing, and realtime progress events. Learn more in the Knox Memory System — Full Feature Deep Dive.
Using the OpenAI SDK
- TypeScript
- Python
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'https://api.knox.chat/v1',
  apiKey: '<KNOXCHAT_API_KEY>',
});

async function main() {
  const completion = await openai.chat.completions.create({
    model: 'openai/gpt-5',
    messages: [
      {
        role: 'user',
        content: 'What is the meaning of life?',
      },
    ],
  });

  console.log(completion.choices[0].message);
}

main();
from openai import OpenAI

client = OpenAI(
    base_url="https://api.knox.chat/v1",
    api_key="<KNOXCHAT_API_KEY>",
)

completion = client.chat.completions.create(
    model="openai/gpt-5",
    messages=[
        {
            "role": "user",
            "content": "What is the meaning of life?"
        }
    ]
)

print(completion.choices[0].message.content)
Using the Knox.Chat API directly
- Python
- TypeScript
- Shell
import requests
import json
response = requests.post(
    url="https://api.knox.chat/v1/chat/completions",
    headers={
        "Authorization": "Bearer <KNOXCHAT_API_KEY>",
        "Content-Type": "application/json",
    },
    data=json.dumps({
        "model": "anthropic/claude-sonnet-4.6",  # Optional
        "messages": [
            {
                "role": "user",
                "content": "What is the meaning of life?"
            }
        ]
    })
)
fetch('https://api.knox.chat/v1/chat/completions', {
  method: 'POST',
  headers: {
    Authorization: 'Bearer <KNOXCHAT_API_KEY>',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-sonnet-4.6',
    messages: [
      {
        role: 'user',
        content: 'What is the meaning of life?',
      },
    ],
  }),
});
curl https://api.knox.chat/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $KNOXCHAT_API_KEY" \
  -d '{
    "model": "anthropic/claude-sonnet-4.6",
    "messages": [
      {
        "role": "user",
        "content": "What is the meaning of life?"
      }
    ]
  }'
The API also supports streaming.
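Streamed responses arrive as server-sent events. Below is a minimal, stdlib-only sketch; the chunk shape follows the OpenAI SSE convention (a `data:` line per chunk, terminated by `data: [DONE]`), and `parse_sse_line` and `stream_completion` are illustrative helpers, not part of any SDK:

```python
import json
import urllib.request

API_URL = "https://api.knox.chat/v1/chat/completions"

def parse_sse_line(line: bytes):
    """Return the text delta carried by one SSE data line, or None."""
    line = line.strip()
    if not line.startswith(b"data: "):
        return None
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

def stream_completion(api_key: str, model: str, prompt: str) -> str:
    """Stream a chat completion, printing deltas as they arrive."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    parts = []
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            delta = parse_sse_line(line)
            if delta:
                print(delta, end="", flush=True)
                parts.append(delta)
    return "".join(parts)
```

In practice the OpenAI SDKs shown above handle this for you when you pass `stream=True`; the sketch is only useful if you are working at the raw HTTP level.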
Using third-party SDKs
For information about using third-party SDKs and frameworks with Knox Chat, please see our frameworks documentation.
Principles
Knox Chat helps teams build reliable, cost‑efficient AI systems across providers. We believe the future is multimodal, multi‑provider, and memory‑centric.
Why Knox Chat?
Price and Performance. Knox Chat scouts for the best prices, lowest latencies, and highest throughput across providers, and lets you choose how to prioritize them.
Standardized API. No code changes when switching between models or providers. Use OpenAI‑compatible SDKs, tools, and structured outputs out of the box.
Memory‑First Apps. Knox‑MS enables long‑running agents with persistent memory, knowledge extraction, and autonomous planning.
Multimodal by Default. Images, PDFs, and audio inputs plus image generation make multimodal workflows straightforward.
Consolidated Billing. Simple and transparent billing, regardless of how many providers you use, with zero‑completion insurance for failed/empty generations.
Higher Availability. Automatic fallback and smart routing keep requests working even when providers go down.
Models
One API for hundreds of models
Explore and browse 300+ models and providers on our website, or with our API.
Models API Standard
Our Models API makes the most important information about all LLMs freely available as soon as we confirm it.
API Response Schema
The Models API returns a standardized JSON response format with comprehensive metadata for each available model. The schema is cached at the edge and designed for reliable integration into production applications.
Root Response Object
{
"data": [
/* Array of Model objects */
]
}
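A quick sketch of consuming this response shape. The `GET /v1/models` path is assumed here to mirror the OpenAI convention (verify against the API reference); `list_model_ids` and `fetch_models` are illustrative helpers:

```python
import json
import urllib.request

def list_model_ids(models_response: dict) -> list:
    """Extract model identifiers from a Models API response body."""
    return [model["id"] for model in models_response.get("data", [])]

def fetch_models() -> dict:
    # Endpoint path assumed to follow the OpenAI convention.
    with urllib.request.urlopen("https://api.knox.chat/v1/models") as resp:
        return json.load(resp)

# With a response body shaped like the schema above:
sample = {"data": [{"id": "google/gemini-2.5-pro"}, {"id": "openai/gpt-5"}]}
print(list_model_ids(sample))  # ['google/gemini-2.5-pro', 'openai/gpt-5']
```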
Model Object Schema
Each model in the data array contains the following standardized fields:
| Field | Type | Description |
|---|---|---|
| id | string | Unique model identifier used in API requests (e.g., "google/gemini-2.5-pro") |
| object | string | Object type identifier (always "model") |
| created | number | Unix timestamp of when the model was created |
| owned_by | string | Organization that owns the model |
| permission | ModelPermission[] | Array of permission objects defining access controls |
| root | string | Root model identifier |
| parent | string \| null | Parent model identifier if this is a fine-tuned version |
| context_length | number | Maximum context window size in tokens |
| architecture | Architecture | Object describing the model's technical capabilities |
| pricing | Pricing | Price structure for using this model |
| top_provider | TopProvider | Configuration details for the primary provider |
| supported_parameters | string[] | Array of supported API parameters for this model |
Architecture Object
{
"modality": string, // High-level description of input/output flow (e.g., "text+image->text")
"input_modalities": string[], // Supported input types: ["file", "image", "text", "audio"]
"output_modalities": string[], // Supported output types: ["text"]
"tokenizer": string // Tokenization method used (e.g., "Gemini")
}
Pricing Object
All pricing values are in USD per token/request/unit. A value of "0" indicates the feature is free.
{
"prompt": string, // Cost per input token
"completion": string, // Cost per output token
"request": string, // Fixed cost per API request
"image": string, // Cost per image input
"audio": string, // Cost per audio input
"web_search": string, // Cost per web search operation
"internal_reasoning": string, // Cost for internal reasoning tokens
"input_cache_read": string, // Cost per cached input token read
"input_cache_write": string // Cost per cached input token write
}
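Because every pricing value is a USD-per-unit string, estimating a request's cost is simple arithmetic. A sketch using illustrative prices (not real Knox Chat rates):

```python
def estimate_cost(pricing: dict, prompt_tokens: int,
                  completion_tokens: int, requests_made: int = 1) -> float:
    """Estimate the USD cost of a request from a Pricing object."""
    return (
        float(pricing["prompt"]) * prompt_tokens
        + float(pricing["completion"]) * completion_tokens
        + float(pricing["request"]) * requests_made
    )

# Illustrative per-token prices, as strings per the schema above:
pricing = {"prompt": "0.000003", "completion": "0.000015", "request": "0"}
cost = estimate_cost(pricing, prompt_tokens=1200, completion_tokens=400)
print(f"${cost:.4f}")  # roughly 0.0096 USD
```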
Top Provider Object
{
"context_length": number, // Provider-specific context limit
"max_completion_tokens": number // Maximum tokens in response
}
Supported Parameters
The supported_parameters array indicates which OpenAI-compatible parameters work with each model:
- include_reasoning - Include reasoning in response
- max_tokens - Response length limiting
- reasoning - Internal reasoning mode
- response_format - Output format specification
- seed - Deterministic outputs
- stop - Custom stop sequences
- structured_outputs - JSON schema enforcement
- temperature - Randomness control
- tool_choice - Tool selection control
- tools - Function calling capabilities
- top_p - Nucleus sampling
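As a sketch of how several of these parameters combine in one request body, the helper below builds a schema-enforced JSON request. The `json_schema` shape follows the OpenAI response_format convention (confirm against the models you target), and `build_structured_request` is an illustrative helper:

```python
import json

def build_structured_request(model: str, prompt: str, schema: dict,
                             name: str = "result") -> dict:
    """Build a chat completion body that asks for schema-enforced JSON output."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # favor deterministic output for extraction tasks
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": name, "strict": True, "schema": schema},
        },
    }

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}
body = build_structured_request("openai/gpt-5", "Largest city in France?", schema)
print(json.dumps(body, indent=2))
```

Before sending a body like this, check the target model's supported_parameters array for response_format and structured_outputs.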
Different models use different tokenization methods (as indicated by the tokenizer field in the model schema). Some models break up text into chunks of multiple characters (GPT, Claude, Llama, etc), while others tokenize differently (like Gemini). This means that token counts (and therefore costs) will vary between models, even when inputs and outputs are the same. Costs are displayed and billed according to the tokenizer for the model in use. You can use the usage field in API responses to get the actual token counts for your input and output.
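To compare actual token consumption across models, read the usage field from each response body. A minimal sketch (field names follow the OpenAI usage convention; `usage_summary` is an illustrative helper):

```python
def usage_summary(response: dict) -> dict:
    """Pull token counts from a chat completion response body."""
    usage = response.get("usage", {})
    return {
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }

# Illustrative response fragment:
sample = {"usage": {"prompt_tokens": 12, "completion_tokens": 34, "total_tokens": 46}}
print(usage_summary(sample))
```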
Frequently Asked Questions
Getting started
Models and Providers
API Technical Specifications
Privacy and Data Logging
Please see our Terms of Service and Privacy Policy.