Every AI coding tool, chatbot, document analyzer, and autonomous agent running today is built on the same foundation: an HTTP API that takes a list of messages and returns a completion. Claude's Messages API and OpenAI's Chat Completions API are structurally nearly identical, and mastering either one gives you a transferable mental model for both. But knowing the basic call is a starting point, not the whole story. Production applications need to handle streaming for perceived responsiveness, prompt caching to keep costs in check, structured output for reliable parsing, exponential backoff for transient errors, and tight key management to keep secrets secret. This article covers all of it.
We will build up from the basic API structure, then layer on each production concern in turn. By the end you will have a complete picture of what is required to ship a reliable, cost-effective, low-latency LLM-powered feature — not just get a response back in your terminal.
- The Messages API has three roles (system, user, assistant), and understanding each is the foundation of prompt design. The system role sets persistent instructions; user/assistant alternate to form the conversation.
- Streaming is almost always the right default for user-facing features. First-token latency feels dramatically faster than waiting for the entire response, even if total time is the same.
- Prompt caching can cut costs by 80–90% for workloads with a large, stable prefix (system prompt, document, codebase). Cache hits are also 3–5× faster than full inference.
- Token costs compound fast. Understand your input/output token ratio, measure it empirically, and design your prompt to minimize unnecessary input repetition.
- Structured output (JSON mode or schema-constrained generation) is essential when downstream code parses the model's response. Never regex-parse free text if you can constrain the output format instead.
- Key hygiene matters at day one. API keys are bearer tokens; anyone who has yours can spend your budget and read your prompts. Use environment variables, secret managers, and server-side proxies — never embed keys in client-side code.
The Claude and OpenAI APIs share the same messages-based structure. For production: use streaming for responsiveness, prompt caching for cost on large stable contexts, structured output for reliable parsing, exponential backoff for errors, and a server-side proxy so API keys never reach the client. Model choice is a cost/capability tradeoff — measure empirically.
The Messages API: Structure and Roles
Every call to the Messages API is a list of message objects, each with a role and content. The three roles have distinct semantics:
- system — sent once, before any user turn. Sets the model's persona, task constraints, output format, and any persistent instructions. The model treats this with higher authority than user messages. This is where you put "You are a senior Python engineer. Always return valid JSON. Never include markdown code fences."
- user — messages from the human (or the application posing as the human). Each turn of the conversation adds a user message.
- assistant — the model's previous responses. When you build multi-turn conversations, you include prior assistant messages in the list so the model can maintain context. For single-turn calls you omit these.
Anthropic's API handles system messages as a top-level parameter rather than a message in the list, but the conceptual role is identical. OpenAI includes system as the first object in the messages array. Both behave the same way in practice.
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a concise technical writer. Reply in plain text, no markdown.",
    messages=[
        {"role": "user", "content": "Explain what a context window is in one paragraph."},
    ],
)

print(response.content[0].text)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
```
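To make the role layout concrete, here is a minimal sketch of how one multi-turn conversation might be shaped for each provider. The helper names (`check_alternation`, `to_anthropic_payload`, `to_openai_payload`) are hypothetical and not part of either SDK. The strict alternation check is a convention of this sketch; the APIs themselves may be more lenient.

```python
from typing import Dict, List

Message = Dict[str, str]


def check_alternation(history: List[Message]) -> None:
    """Raise ValueError unless roles alternate user/assistant, starting with user."""
    expected = "user"
    for i, msg in enumerate(history):
        if msg["role"] != expected:
            raise ValueError(f"message {i}: expected {expected!r}, got {msg['role']!r}")
        expected = "assistant" if expected == "user" else "user"


def to_anthropic_payload(system: str, history: List[Message]) -> dict:
    # Anthropic style: the system prompt is a top-level field
    check_alternation(history)
    return {"system": system, "messages": list(history)}


def to_openai_payload(system: str, history: List[Message]) -> dict:
    # OpenAI style: the system prompt is the first message in the list
    check_alternation(history)
    return {"messages": [{"role": "system", "content": system}, *history]}


history = [
    {"role": "user", "content": "What is a token?"},
    {"role": "assistant", "content": "A chunk of text the model reads or writes."},
    {"role": "user", "content": "Roughly how long is one?"},
]
anthropic_payload = to_anthropic_payload("Be brief.", history)
openai_payload = to_openai_payload("Be brief.", history)
```

The two payloads carry the same conversation; only the placement of the system prompt differs.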
Key Parameters and What They Do
| Parameter | Type | What it controls | Typical value |
|---|---|---|---|
| model | string | Which model to use | "claude-opus-4-5", "gpt-4o" |
| max_tokens | int | Maximum output tokens before truncation | 256–4096 depending on task |
| temperature | float, 0–1 on Anthropic (0–2 on OpenAI) | Randomness: 0 = most deterministic, higher = more varied | 0 for code/JSON, 0.7 for prose |
| top_p | float 0–1 | Nucleus sampling: alternative to temperature | Don't use both simultaneously |
| stop_sequences (stop on OpenAI) | list[str] | Stop generating when any string is produced | ["###"] for structured prompts |
| stream | bool | Return tokens as they are generated | true for user-facing features |
Temperature vs. top_p: use temperature for most applications. Set it low (0–0.2) when you want deterministic, factual output (parsing, code generation, JSON extraction). Set it higher (0.6–0.9) when creativity or variation is desirable. Avoid changing both simultaneously — they interact in ways that are hard to reason about.
Streaming: First Token Wins the User's Attention
Without streaming, your app waits for the entire response before displaying anything — which for a long response can be 5–20 seconds of a blank screen. With streaming, tokens arrive as the model generates them, and the user sees text appearing within a few hundred milliseconds of sending the request. Total latency is often identical, but perceived latency is dramatically lower. For any user-facing feature, streaming is almost always the right default.
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a merge sort in Python."}],
) as stream:
    # text_stream is an iterator attribute, not a method
    for text in stream.text_stream:
        print(text, end="", flush=True)  # each chunk arrives as generated
    # Read the accumulated message while the stream context is still open
    final = stream.get_final_message()

print(f"\nTotal tokens: {final.usage.input_tokens + final.usage.output_tokens}")
```
In a web application, you typically proxy the stream from your server to the browser using Server-Sent Events (SSE) or WebSockets. The browser renders each chunk as it arrives. Most LLM API SDKs handle the SSE framing for you on the server side — you just iterate over the stream.
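The server-to-browser hop can be sketched without committing to a web framework. The generator below wraps text chunks in SSE frames. The name `sse_frames`, the JSON payload shape, and the `[DONE]` sentinel are conventions of this sketch rather than part of either SDK, and a real handler would also need to send the `Content-Type: text/event-stream` header.

```python
import json
from typing import Iterable, Iterator


def sse_frames(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each text chunk in a Server-Sent Events frame."""
    for chunk in chunks:
        # JSON-encode so newlines inside a chunk cannot break the frame
        payload = json.dumps({"text": chunk})
        yield f"data: {payload}\n\n"
    # Sentinel so the browser knows the stream has finished
    yield "data: [DONE]\n\n"


frames = list(sse_frames(["Hel", "lo\n"]))
```

In a real endpoint, the `chunks` argument would be the SDK's text stream, and each frame would be written to the HTTP response as soon as it is produced.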
Streaming in JavaScript
```javascript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from env

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  stream: true,
  messages: [{ role: "user", content: "Explain async/await in JavaScript." }],
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  process.stdout.write(delta); // stream to browser via SSE in production
}
```
Prompt Caching: The Single Best Cost Optimization
LLM inference is expensive because the model must process every input token from scratch on every call. But many production workloads have a large, stable prefix: a long system prompt, a document being analyzed, or an entire codebase being processed query-by-query. Prompt caching lets the API pre-process this stable prefix once and reuse the KV cache for subsequent calls with the same prefix — dramatically cutting both cost and latency.
How Cache Economics Work
| Token type | Relative cost | Latency impact |
|---|---|---|
| Regular input tokens | 1× | Full inference time |
| Cache write (first call) | ~1.25× (slight premium to populate) | Slightly higher on first call |
| Cache read (subsequent calls) | ~0.10× (90% discount) | 3–5× faster than full input processing |
| Output tokens | 3–5× input cost | Determined by output length |
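The math is compelling: if your system prompt is 10,000 tokens and you make 100 calls per day, naively that is 1,000,000 input tokens billed at full price. With caching, and assuming calls arrive often enough to keep the cache warm, it is 10,000 cache-write tokens on the first call (12,500 token-equivalents at 1.25×) plus 990,000 cache-read tokens for the other 99 calls (99,000 token-equivalents at 0.10×). That totals about 111,500 token-equivalents, roughly an 89% reduction in input token cost for that portion of the prompt. Every cache expiry adds another write, so sparse traffic saves less.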
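The arithmetic is easy to check in a few lines. This is a back-of-the-envelope sketch using the approximate multipliers from the table above (1.25× for writes, 0.10× for reads); the function name `cached_prefix_cost` and the `cache_writes` parameter are illustrative, and actual multipliers depend on your provider and cache settings.

```python
def cached_prefix_cost(prefix_tokens: int, calls: int, cache_writes: int = 1,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> dict:
    """Compare full-price vs cached cost of a stable prefix, in token-equivalents."""
    if not 1 <= cache_writes <= calls:
        raise ValueError("cache_writes must be between 1 and calls")
    uncached = prefix_tokens * calls
    # Each cache write pays the premium; every other call pays the read rate
    cached = prefix_tokens * (cache_writes * write_mult
                              + (calls - cache_writes) * read_mult)
    return {"uncached": uncached, "cached": cached, "savings": 1 - cached / uncached}


warm = cached_prefix_cost(10_000, 100)                    # cache stays warm all day
cold = cached_prefix_cost(10_000, 100, cache_writes=20)   # cache expires 19 times
print(f"warm: {warm['savings']:.1%}, cold: {cold['savings']:.1%}")
```

With a warm cache the saving on the prefix is about 89%; with 20 cache writes across the same 100 calls it drops to about 67%.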
Implementing Prompt Caching with Claude
```python
import anthropic

client = anthropic.Anthropic()

# A long system prompt that stays stable across many calls
LONG_SYSTEM_PROMPT = """
You are a senior Python engineer reviewing code for a fintech company.
[... 5,000 tokens of detailed coding standards, style rules, security
requirements, and example patterns ...]
"""

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Review this function: def process(x): return x*2"}
    ],
)

usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")
```
Cache TTL on Claude is 5 minutes by default, and each cache hit resets the timer. As long as calls arrive within the TTL, subsequent requests hit the cache. For high-traffic applications this is essentially always; for low-traffic apps, you may need to implement a keep-alive ping to keep the cache warm between real calls.
When Caching Helps Most
- Document Q&A: embed the full document in the prompt once; each user question is a cache-read on the document prefix.
- Code review pipelines: include your entire coding standards document in the system prompt; review each PR against the same cached standards.
- Multi-turn chat with a long system prompt: the system prompt is stable across all turns; cache it and only pay full price for each new user message.
- Batch processing: if you process 10,000 documents through the same template, the template prefix is cached after the first request.
Token Costs and Cost Estimation
Every API call is billed by tokens consumed: input tokens (your prompt + conversation history) and output tokens (the model's response). Output tokens cost 3–5× more than input tokens on most models, reflecting the additional compute required for autoregressive generation. Understanding this ratio is essential for cost-aware application design.
Estimating Costs Before You Build
A rough rule of thumb: 1 token ≈ 0.75 English words. A 1,000-word document is roughly 1,333 tokens. Common input/output ratios by application type:
| Application type | Typical input:output ratio | Cost bottleneck |
|---|---|---|
| Code review / analysis | 10:1 to 20:1 | Input (the code to review) |
| Content generation | 1:5 to 1:10 | Output (the generated text) |
| Classification / extraction | 20:1 to 100:1 | Input (document being processed) |
| Conversational assistant | 3:1 to 5:1 | Mix — depends on history length |
| Agent task execution | 50:1 to 200:1 | Input (tool outputs, context) |
Measure your actual token usage on real inputs before pricing your product. The usage object in the API response always contains precise token counts — log them. Your estimates will often be wrong by 2–3×, and that matters for unit economics.
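Before real usage data exists, a rough estimate can be sketched from the rule of thumb above. The prices in this example are placeholders chosen for illustration, not any provider's actual rates, and the helper names are hypothetical.

```python
def estimate_tokens(words: int, words_per_token: float = 0.75) -> int:
    """Rough token count from a word count (1 token is about 0.75 words)."""
    return round(words / words_per_token)


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost in dollars for one call, given per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Placeholder prices for illustration only; substitute current published rates.
tokens_in = estimate_tokens(1_000)   # a 1,000-word document
tokens_out = tokens_in // 10         # a 10:1 review-style ratio
per_call = estimate_cost(tokens_in, tokens_out,
                         input_price_per_mtok=3.0, output_price_per_mtok=15.0)
print(f"{tokens_in} in, {tokens_out} out, ${per_call:.6f} per call")
```

Once the feature is live, replace the estimates with the logged counts from the usage object and rerun the same calculation.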
Cost Reduction Levers
- Prompt caching — largest single lever for stable-prefix workloads.
- Smaller models for simpler tasks — Haiku or GPT-4o-mini are 10–20× cheaper than Opus or GPT-4o; many classification and extraction tasks don't need the largest model.
- Limit max_tokens — set a tight cap appropriate for your task. A function that extracts a single field from a document does not need 4096 output tokens.
- Truncate conversation history — multi-turn applications accumulate history; trim or summarize old turns to avoid paying for stale context repeatedly.
- Batch API — both Anthropic and OpenAI offer asynchronous batch endpoints at roughly 50% of the synchronous price, suitable for non-real-time workloads.
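The history-truncation lever above can be sketched as a sliding window over recent turns. This sketch assumes a word-count token estimate, which is only approximate; the function name `trim_history` and its parameters are illustrative. Summarizing old turns instead of dropping them is another option not shown here.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(history: List[Message], max_tokens: int,
                 words_per_token: float = 0.75) -> List[Message]:
    """Keep the most recent messages whose estimated size fits within max_tokens."""
    kept: List[Message] = []
    budget = max_tokens
    for msg in reversed(history):
        cost = round(len(msg["content"].split()) / words_per_token)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()
    # Drop leading assistant turns so the trimmed history opens with a user message
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five six"},
    {"role": "user", "content": "seven eight nine"},
]
recent = trim_history(history, max_tokens=8)
```

Here each message is estimated at 4 tokens, so a budget of 8 fits the last two, and the leading assistant turn is then dropped, leaving only the final user message.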
Latency Optimization
Latency has two distinct components: time to first token (TTFT) — how long before the user sees anything — and total completion time — how long before the full response is available. Streaming addresses TTFT. Total time depends on model size, prompt length, and output length.
Reducing Time to First Token
- Stream responses. TTFT drops to the time the model starts generating the first token — typically 200–800ms for major APIs — regardless of total response length.
- Use prompt caching. Cache hits skip input processing, reducing TTFT by 3–5× for large cached prefixes.
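- Use a geographically close API endpoint. Round-trip latency to the API server adds directly to TTFT. Regional availability varies by provider and plan; cloud platforms that host these models (Amazon Bedrock, Google Cloud Vertex AI, Azure OpenAI) let you pick a region, so check what is offered near your servers.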
- Reduce input size. Smaller prompts start processing faster. If you are loading an entire large document, consider retrieval (fetch only relevant chunks) instead.
Reducing Total Completion Time
- Use a smaller, faster model where the task allows it. Haiku generates tokens at roughly 3–4× the speed of Opus.
- Set a tight max_tokens. The model generates until it finishes its answer, emits a stop sequence, or hits max_tokens. A high limit does not make the model write more, but it also does nothing to cut off a response that runs long. Match max_tokens to the realistic output length to bound the worst case.
- Parallelize independent calls. If you need two separate LLM outputs that don't depend on each other, fire both API calls concurrently with asyncio.gather or Promise.all.
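The parallelization point can be illustrated with a stand-in coroutine. `fake_llm_call` is a placeholder that sleeps instead of calling an API, so the sketch runs without a key; in a real application you would swap in an async client such as `anthropic.AsyncAnthropic`.

```python
import asyncio
import time


async def fake_llm_call(prompt: str, delay: float = 0.2) -> str:
    """Stand-in for an async SDK call; sleeps instead of hitting the network."""
    await asyncio.sleep(delay)
    return f"response to: {prompt}"


async def run_concurrently(prompts: list) -> list:
    # gather starts all coroutines together and returns results in input order
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))


start = time.perf_counter()
results = asyncio.run(run_concurrently(["summarize the diff", "draft release notes"]))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Because both calls wait at the same time, the elapsed time is close to one call's latency rather than the sum of the two.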
Error Handling and Retry Logic
LLM APIs are network services and will occasionally return errors. The two most common classes are transient errors (rate limits, temporary overload) and permanent errors (invalid input, authentication failure). Treating them identically — either always retrying or never retrying — is wrong. A well-designed client distinguishes between them.
HTTP Status Codes and What to Do
| Status | Meaning | Action |
|---|---|---|
| 200 | Success | Process response normally |
| 400 | Bad request (invalid params, content policy) | Fix the request — do not retry |
| 401 | Invalid API key | Fix auth — do not retry |
| 429 | Rate limit exceeded | Retry with exponential backoff; respect Retry-After header |
| 500, 529 | Server error / overloaded | Retry with exponential backoff, up to 5 attempts |
| 413 | Request too large | Truncate prompt — do not retry as-is |
```python
import random
import time

from anthropic import APIStatusError, RateLimitError


def call_with_retry(client, max_retries=5, **kwargs):
    """Call messages.create, retrying transient failures with backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except RateLimitError:
            # 429: back off and retry unless this was the last attempt
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # exponential + jitter
            print(f"rate limit hit, retry {attempt + 1} in {wait:.1f}s")
            time.sleep(wait)
        except APIStatusError as e:
            # 500/529: transient server-side errors are worth retrying
            if e.status_code in (500, 529) and attempt < max_retries - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                raise  # other 4xx errors: don't retry
```
Add jitter (the random.uniform(0, 1)) to avoid the thundering herd problem: if 100 clients all hit a rate limit simultaneously and all retry at exactly 2 seconds, 4 seconds, 8 seconds — they all hammer the API together at each retry point. Jitter spreads retries across a window and significantly reduces retry-induced load spikes.
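The status-code table also recommends respecting the Retry-After header. A minimal sketch of a delay calculation that does so might look like the following. The function `backoff_delay` and its `cap` parameter are assumptions of this sketch, and extracting the header value from the SDK's exception is left out because it depends on the client library.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server said how long to wait; add jitter but never wait less
        return retry_after + random.uniform(0, 1)
    # Exponential growth, capped so late retries do not wait unboundedly
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)


delays = [backoff_delay(attempt) for attempt in range(5)]
```

Capping the exponential term keeps a long retry sequence from sleeping for minutes, while the jitter still spreads clients apart.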
Structured Output: Making Responses Parseable
When your application parses the model's response programmatically — extracting fields, populating a database, triggering downstream logic — you need the output in a predictable format. Free text is not that. A model that returns "The answer is 42." when you needed {"answer": 42} breaks your parser. Structured output solves this by constraining what the model can generate.
Three Approaches, by Reliability
- Prompt-only JSON: instruct the model to "return valid JSON" in the system prompt and provide an example. Works 80–95% of the time on strong models; the remaining cases produce JSON with trailing text, missing quotes, or invalid escape sequences. Not acceptable for production without a validation + retry loop.
- JSON mode: a parameter (response_format: {type: "json_object"} on OpenAI) that constrains the model to always produce valid JSON. Solves syntax errors; does not guarantee the schema (keys, types, nesting) matches what you expect.
- Schema-constrained generation (tool use as output): define the expected output as a JSON Schema via the tool definitions; the model is forced to emit a tool call matching that schema. This is the most reliable approach and the one we recommend for production.
```python
import anthropic

client = anthropic.Anthropic()

# Placeholder input for illustration; in practice this is the report to analyze
BUG_REPORT_TEXT = "Login fails with a 500 error after the latest deploy. It worked last week."

# Define the expected output shape as a tool
extract_tool = {
    "name": "extract_bug_report",
    "description": "Extract structured data from a bug report.",
    "input_schema": {
        "type": "object",
        "properties": {
            "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
            "affected_component": {"type": "string"},
            "reproduction_steps": {"type": "array", "items": {"type": "string"}},
            "is_regression": {"type": "boolean"},
        },
        "required": ["severity", "affected_component", "is_regression"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # extraction task → use cheaper model
    max_tokens=512,
    tools=[extract_tool],
    tool_choice={"type": "tool", "name": "extract_bug_report"},  # force tool use
    messages=[{"role": "user", "content": BUG_REPORT_TEXT}],
)

# With tool_choice forcing the tool, the response carries a tool_use block
tool_call = next(block for block in response.content if block.type == "tool_use")
extracted = tool_call.input  # already a parsed dict, no json.loads() needed
print(extracted["severity"], extracted["affected_component"])
```
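By setting tool_choice to force the specific tool, you get the model's output as a parsed JSON object shaped by your schema, with no regex and no json.loads on free text. Keep a lightweight validation step anyway: a response truncated at max_tokens, or a field filled in unexpectedly, should fail loudly rather than flow into downstream code.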
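A small check on the extracted dictionary catches the rare payload that does not fit the schema. The sketch below hand-rolls the checks for the bug-report schema used in this section; `validate_bug_report` is a hypothetical helper, and a schema library such as jsonschema or pydantic would be a reasonable substitute.

```python
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def validate_bug_report(data: dict) -> list:
    """Return a list of problems; an empty list means the payload is usable."""
    problems = []
    for key in ("severity", "affected_component", "is_regression"):
        if key not in data:
            problems.append(f"missing required field: {key}")
    if "severity" in data and data["severity"] not in ALLOWED_SEVERITIES:
        problems.append(f"unexpected severity: {data['severity']!r}")
    if "is_regression" in data and not isinstance(data["is_regression"], bool):
        problems.append("is_regression must be a boolean")
    # reproduction_steps is optional, but must be a list of strings when present
    steps = data.get("reproduction_steps", [])
    if not (isinstance(steps, list) and all(isinstance(s, str) for s in steps)):
        problems.append("reproduction_steps must be a list of strings")
    return problems


good = validate_bug_report(
    {"severity": "high", "affected_component": "auth", "is_regression": True}
)
bad = validate_bug_report({"severity": "urgent", "affected_component": "auth"})
```

An empty result means the payload can be used; a non-empty one is a signal to log the response and retry or fall back.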
API Key Management and Security
API keys are bearer tokens: whoever has one can make API calls billed to your account, read any prompts you send, and potentially access conversation history. Treating them carelessly is the fastest way to get a surprise invoice and a data breach simultaneously.
The Cardinal Rules
- Never embed keys in client-side code. JavaScript bundles are inspectable. Mobile apps are reversible. Anyone who downloads your app can extract hardcoded keys. Always proxy through your server.
- Never commit keys to version control. Even in a private repo, git history is permanent and repos get shared. Use .gitignore for .env files and run a pre-commit hook that blocks accidental commits of secrets.
- Read keys from environment variables. The official Anthropic and OpenAI SDKs read ANTHROPIC_API_KEY / OPENAI_API_KEY from the environment automatically if you do not pass them explicitly. This is the minimum correct practice.
- Use a secrets manager in production. AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, and Doppler all provide centralized, audited, rotatable secret storage. Your application retrieves the key at startup rather than keeping it in a config file.
- Rotate keys regularly and immediately if you suspect exposure. Both APIs allow multiple keys simultaneously, so rotation is non-disruptive: create new key → update all services → revoke old key.
Server-Side Proxy Pattern
For browser applications, the correct architecture is: your backend holds the API key, your frontend sends requests to your own backend, and your backend forwards them to the LLM API. This pattern also lets you add rate limiting per user, content filtering, logging, and cost attribution — all at the proxy layer, without touching client code.
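The proxy layer can be sketched independently of any web framework. In this sketch the class `ChatProxy`, the `forward` callable, and the sliding-window limits are all hypothetical. `forward` stands in for the code that actually calls the LLM API with the server-held key.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class ProxyError(Exception):
    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status


class ChatProxy:
    """Holds the API key server-side and applies a per-user rate limit."""

    def __init__(self, forward: Callable[[str, str], str], api_key: str,
                 max_requests: int = 5, window_seconds: float = 60.0):
        self._forward = forward
        self._api_key = api_key  # stays on the server, never sent to the client
        self._max = max_requests
        self._window = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def handle(self, user_id: str, prompt: str) -> str:
        now = time.monotonic()
        hits = self._hits[user_id]
        while hits and now - hits[0] > self._window:
            hits.popleft()  # forget requests outside the window
        if len(hits) >= self._max:
            raise ProxyError(429, "rate limit exceeded for this user")
        if not prompt.strip():
            raise ProxyError(400, "prompt must not be empty")
        hits.append(now)
        return self._forward(self._api_key, prompt)


def fake_forward(api_key: str, prompt: str) -> str:
    # Stand-in for the real upstream call
    return f"echo: {prompt}"


proxy = ChatProxy(fake_forward, api_key="server-side-key", max_requests=2)
reply = proxy.handle("user-1", "hello")
```

The browser only ever talks to `handle` through your own endpoint, so the key, the rate limit, and any logging all live in one place on the server.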
Choosing the Right Model
Both Anthropic and OpenAI offer a tiered model lineup: a large, most capable model (Opus; GPT-4o for hard tasks), a balanced model (Sonnet, GPT-4o), and a cheap, fast model (Haiku, GPT-4o-mini). The instinct to "always use the best model" is economically wrong. Task-model matching is a real engineering decision.
| Task type | Recommended tier | Why |
|---|---|---|
| Simple classification, extraction, routing | Small (Haiku, mini) | 10–20× cheaper; quality difference negligible for clear-cut tasks |
| Code generation, multi-step reasoning | Medium (Sonnet, GPT-4o) | Balanced capability/cost; handles most engineering tasks well |
| Complex architecture, nuanced judgment, research | Large (Opus, GPT-4o for hard tasks) | Quality difference justifies cost for high-stakes decisions |
| Long-running agents with many tool calls | Medium + caching | Cost accumulates fast; caching reduces input cost dramatically |
The right way to choose is empirically: run both models on 50–100 real examples, compare quality on your specific task, compute the cost difference, and decide whether the quality delta justifies the price delta. Do not assume — measure.
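The comparison can be reduced to a small decision rule once you have graded results. The sketch below assumes you have already scored each model on the same examples; the `EvalResult` fields, the 2-point accuracy threshold, and the numbers in the usage lines are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    model: str
    correct: int
    total: int
    cost_dollars: float

    @property
    def accuracy(self) -> float:
        return self.correct / self.total

    @property
    def cost_per_correct(self) -> float:
        return self.cost_dollars / self.correct if self.correct else float("inf")


def pick_model(small: EvalResult, large: EvalResult,
               max_accuracy_gap: float = 0.02) -> str:
    """Prefer the small model unless the large one wins by more than the gap."""
    if large.accuracy - small.accuracy > max_accuracy_gap:
        return large.model
    return small.model


# Made-up numbers for illustration only
small = EvalResult("small", correct=94, total=100, cost_dollars=0.40)
large = EvalResult("large", correct=95, total=100, cost_dollars=4.00)
choice = pick_model(small, large)
```

The threshold encodes how much quality you are willing to trade for cost; set it from the stakes of the task rather than from a default.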
Building with LLM APIs is not hard, but building well requires understanding the layered concerns: get the message structure right, stream for perceived latency, cache stable prefixes to control costs, constrain output format for reliable parsing, handle errors with exponential backoff, and keep keys server-side. Each layer compounds with the others — an application that does all of these is not just cheaper and faster, it is fundamentally more reliable and maintainable.
What is prompt caching and why does it matter? The API pre-processes a stable input prefix (system prompt, document) and reuses the KV cache for subsequent calls with the same prefix, yielding up to about a 90% cost reduction and a 3–5× latency improvement on the cached portion. Essential for document Q&A, code review pipelines, and any high-volume workload with a fixed context.
How do you guarantee structured output from an LLM? Define the expected schema as a JSON Schema in a tool definition and force the model to call that tool via tool_choice. The response then arrives as a parsed object shaped by the schema rather than as free text; validate it before use. Prompt-only instructions for "return JSON" work most of the time but fail often enough to matter in production, so schema-constrained tool use is the reliable approach.
Why is model selection an engineering decision, not an "always use the best" decision? Output tokens cost 3–5× input tokens; larger models are 10–20× more expensive than smaller ones. For classification, extraction, and routing tasks, smaller models match larger ones on quality while reducing cost by an order of magnitude. Measure quality on your task and compute the cost/quality tradeoff empirically.