Every AI coding tool, chatbot, document analyzer, and autonomous agent running today is built on the same foundation: an HTTP API that takes a list of messages and returns a completion. Claude's Messages API and OpenAI's Chat Completions API are structurally nearly identical, and mastering either one gives you a transferable mental model for both. But knowing the basic call is a starting point, not the whole story. Production applications need to handle streaming for perceived responsiveness, prompt caching to keep costs in check, structured output for reliable parsing, exponential backoff for transient errors, and tight key management to keep secrets secret. This article covers all of it.

We will build up from the basic API structure, then layer on each production concern in turn. By the end you will have a complete picture of what is required to ship a reliable, cost-effective, low-latency LLM-powered feature — not just get a response back in your terminal.

⚡ Quick Takeaways
  • The Messages API has three roles (system, user, assistant), and understanding each is the foundation of prompt design. The system role sets persistent instructions; user and assistant messages alternate to form the conversation.
  • Streaming is almost always the right default for user-facing features. First-token latency feels dramatically faster than waiting for the entire response, even if total time is the same.
  • Prompt caching can cut costs by 80–90% for workloads with a large, stable prefix (system prompt, document, codebase). Cache hits are also 3–5× faster than full inference.
  • Token costs compound fast. Understand your input/output token ratio, measure it empirically, and design your prompt to minimize unnecessary input repetition.
  • Structured output (JSON mode or schema-constrained generation) is essential when downstream code parses the model's response. Never regex-parse free text if you can constrain the output format instead.
  • Key hygiene matters from day one. API keys are bearer tokens; anyone who has yours can spend your budget and may be able to reach data stored under your account. Use environment variables, secret managers, and server-side proxies, and never embed keys in client-side code.
tldr

The Claude and OpenAI APIs share the same messages-based structure. For production: use streaming for responsiveness, prompt caching for cost on large stable contexts, structured output for reliable parsing, exponential backoff for errors, and a server-side proxy so API keys never reach the client. Model choice is a cost/capability tradeoff — measure empirically.

The Messages API: Structure and Roles

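Every call to the Messages API is a list of message objects, each with a role and content. The three roles have distinct semantics:

  • system: persistent instructions that set the model's behavior, tone, and constraints for the whole conversation.
  • user: the input from the person or program calling the model.
  • assistant: the model's own earlier replies, included so the model can see the conversation so far.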

Anthropic's API handles system messages as a top-level parameter rather than a message in the list, but the conceptual role is identical. OpenAI includes system as the first object in the messages array. Both behave the same way in practice.

python — basic Messages API call (Claude)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system="You are a concise technical writer. Reply in plain text, no markdown.",
    messages=[
        {"role": "user", "content": "Explain what a context window is in one paragraph."},
    ],
)

print(response.content[0].text)
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
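The difference in where the system prompt lives is easy to see with plain data. The sketch below converts between the two layouts; the helper names `to_openai_messages` and `to_anthropic_params` are hypothetical, and it assumes every message's content is a plain string (both real APIs also accept richer content blocks, which this does not handle).

```python
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]


def to_openai_messages(system: Optional[str], messages: List[Message]) -> List[Message]:
    """OpenAI layout: the system prompt is the first object in the messages array."""
    prefix = [{"role": "system", "content": system}] if system else []
    return prefix + list(messages)


def to_anthropic_params(messages: List[Message]) -> Tuple[Optional[str], List[Message]]:
    """Anthropic layout: system text moves to a top-level parameter."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    system = "\n\n".join(system_parts) if system_parts else None
    return system, turns


# A multi-turn conversation: user and assistant messages alternate.
conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "What is a token?"},
]
openai_style = to_openai_messages("Be concise.", conversation)
system, turns = to_anthropic_params(openai_style)
```

Because neither API keeps conversation state between calls, the caller resends the whole list on every turn, which is why history length shows up directly in input token counts.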

Key Parameters and What They Do

Parameter | Type | What it controls | Typical value
model | string | Which model to use | "claude-opus-4-5", "gpt-4o"
max_tokens | int | Maximum output tokens before truncation | 256–4096 depending on task
temperature | float 0–1 | Randomness: 0 = deterministic, 1 = creative | 0 for code/JSON, 0.7 for prose
top_p | float 0–1 | Nucleus sampling: alternative to temperature | Don't use both simultaneously
stop_sequences | list[str] | Stop generating when any string is produced | ["", "###"] for structured prompts
stream | bool | Return tokens as they are generated | true for user-facing features

Temperature vs. top_p: use temperature for most applications. Set it low (0–0.2) when you want deterministic, factual output (parsing, code generation, JSON extraction). Set it higher (0.6–0.9) when creativity or variation is desirable. Avoid changing both simultaneously — they interact in ways that are hard to reason about.
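To make the effect of temperature concrete, here is a minimal sketch of the textbook formulation: logits are divided by the temperature before the softmax. Providers do not publish their exact sampling internals, so treat this as an illustration of the idea rather than a description of any vendor's implementation; the function name and the example logits are made up.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
greedy = softmax_with_temperature(logits, 0)
low = softmax_with_temperature(logits, 0.2)
default = softmax_with_temperature(logits, 1.0)
```

Lower temperatures pile probability onto the top candidate, which is why low values suit parsing and code generation, while higher values leave more room for the alternatives.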

Streaming: First Token Wins the User's Attention

Without streaming, your app waits for the entire response before displaying anything — which for a long response can be 5–20 seconds of a blank screen. With streaming, tokens arrive as the model generates them, and the user sees text appearing within a few hundred milliseconds of sending the request. Total latency is often identical, but perceived latency is dramatically lower. For any user-facing feature, streaming is almost always the right default.

python — streaming response (Claude)
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a merge sort in Python."}],
) as stream:
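    for text in stream.text_stream:  # text_stream is an iterator attribute, not a method
        print(text, end="", flush=True)  # each chunk arrives as generated

    # Read the accumulated message while the stream context is still open
    final = stream.get_final_message()

print(f"\nTotal tokens: {final.usage.input_tokens + final.usage.output_tokens}")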

In a web application, you typically proxy the stream from your server to the browser using Server-Sent Events (SSE) or WebSockets. The browser renders each chunk as it arrives. Most LLM API SDKs handle the SSE framing for you on the server side — you just iterate over the stream.
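On the server-to-browser leg, SSE framing is simple enough to sketch by hand: each event is one or more `data:` lines followed by a blank line. The generator below is a minimal illustration, assuming your web framework can stream an iterator of strings with the `text/event-stream` content type. The `{"text": ...}` payload shape and the `done` event name are conventions invented for this example, not part of any SDK.

```python
import json
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks from an LLM stream as Server-Sent Events frames."""
    for chunk in chunks:
        # JSON-encode the chunk so a newline inside it cannot break the framing.
        yield f"data: {json.dumps({'text': chunk})}\n\n"
    # Hypothetical end-of-stream marker so the browser knows when to stop listening.
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["def ", "merge_sort\n"]))
```

In the browser, an `EventSource` (or a `fetch` reader) would parse each frame and append the decoded text to the page.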

Streaming in JavaScript

javascript — streaming with OpenAI SDK
import OpenAI from "openai";

const client = new OpenAI();  // reads OPENAI_API_KEY from env

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  stream: true,
  messages: [{ role: "user", content: "Explain async/await in JavaScript." }],
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  process.stdout.write(delta);  // stream to browser via SSE in production
}

Prompt Caching: The Single Best Cost Optimization

LLM inference is expensive because the model must process every input token from scratch on every call. But many production workloads have a large, stable prefix: a long system prompt, a document being analyzed, or an entire codebase being processed query-by-query. Prompt caching lets the API pre-process this stable prefix once and reuse the KV cache for subsequent calls with the same prefix — dramatically cutting both cost and latency.

How Cache Economics Work

Token type | Relative cost | Latency impact
Regular input tokens | 1× (baseline) | Full inference time
Cache write (first call) | ~1.25× (slight premium to populate) | Slightly higher on first call
Cache read (subsequent calls) | ~0.10× (90% discount) | 3–5× faster than full input processing
Output tokens | 3–5× input cost | Determined by output length

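The math is compelling: if your system prompt is 10,000 tokens and you make 100 calls per day, naively that is 1,000,000 input tokens. With caching, it is 10,000 cache-write tokens (on the first call or TTL refresh) plus 990,000 cache-read tokens for the other 99 calls. At the relative prices above, that costs about as much as 111,500 regular input tokens, just under a 90% reduction in input token cost for that portion of the prompt, provided the calls arrive close enough together to keep the cache warm.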
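The same arithmetic as a small sketch, using the approximate multipliers from the table above (1.25× for a cache write, 0.10× for a cache read). The function names are illustrative, and the result is expressed in regular-input-token equivalents rather than dollars, since actual prices vary by model.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # approximate premium to populate the cache
CACHE_READ_MULTIPLIER = 0.10   # approximate price of a cache hit


def cached_prefix_cost(prefix_tokens: int, calls: int, cache_writes: int = 1) -> float:
    """Cost of the stable prefix across all calls, in regular-input-token equivalents."""
    if not 1 <= cache_writes <= calls:
        raise ValueError("cache_writes must be between 1 and calls")
    cache_reads = calls - cache_writes
    return prefix_tokens * (
        cache_writes * CACHE_WRITE_MULTIPLIER + cache_reads * CACHE_READ_MULTIPLIER
    )


def savings_fraction(prefix_tokens: int, calls: int, cache_writes: int = 1) -> float:
    """Fraction of prefix cost saved compared with sending it uncached every time."""
    uncached = prefix_tokens * calls
    return 1 - cached_prefix_cost(prefix_tokens, calls, cache_writes) / uncached


warm = savings_fraction(10_000, 100)                    # one write, 99 hits
cold = savings_fraction(10_000, 100, cache_writes=100)  # every call misses
```

With one write and 99 hits the prefix costs the equivalent of 111,500 tokens, about 89% less than uncached. If traffic is so sparse that every call misses, the write premium makes caching 25% more expensive than not caching, which is the case the TTL discussion is about.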

Implementing Prompt Caching with Claude

python — prompt caching with cache_control (Claude)
import anthropic

client = anthropic.Anthropic()

# A long system prompt that stays stable across many calls
LONG_SYSTEM_PROMPT = """
You are a senior Python engineer reviewing code for a fintech company.
[... 5,000 tokens of detailed coding standards, style rules, security
 requirements, and example patterns ...]
"""

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Review this function: def process(x): return x*2"}
    ],
)

usage = response.usage
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Cache read tokens:  {usage.cache_read_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")

Cache TTL on Claude is 5 minutes by default. As long as calls arrive within the TTL, subsequent requests hit the cache. For high-traffic applications this is essentially always; for low-traffic apps, you may need to implement a keep-alive ping to keep the cache warm between real calls.

When Caching Helps Most

Token Costs and Cost Estimation

Every API call is billed by tokens consumed: input tokens (your prompt + conversation history) and output tokens (the model's response). Output tokens cost 3–5× more than input tokens on most models, reflecting the additional compute required for autoregressive generation. Understanding this ratio is essential for cost-aware application design.

Estimating Costs Before You Build

A rough rule of thumb: 1 token ≈ 0.75 English words. A 1,000-word document is roughly 1,333 tokens. Common input/output ratios by application type:

Application type | Typical input:output ratio | Cost bottleneck
Code review / analysis | 10:1 to 20:1 | Input (the code to review)
Content generation | 1:5 to 1:10 | Output (the generated text)
Classification / extraction | 20:1 to 100:1 | Input (document being processed)
Conversational assistant | 3:1 to 5:1 | Mix, depends on history length
Agent task execution | 50:1 to 200:1 | Input (tool outputs, context)

Measure your actual token usage on real inputs before pricing your product. The usage object in the API response always contains precise token counts — log them. Your estimates will often be wrong by 2–3×, and that matters for unit economics.
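A minimal sketch of the bookkeeping this implies: estimate tokens from word counts before you build, then price real calls from logged usage. The `Usage` class here is a stand-in for the `usage` object in an API response, and the per-million-token prices are placeholder numbers chosen for the example, not any provider's actual rates.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Stand-in for the token counts reported in an API response."""
    input_tokens: int
    output_tokens: int


def estimate_tokens(word_count: int) -> int:
    """Rule of thumb from the text: 1 token is about 0.75 English words."""
    return round(word_count / 0.75)


def call_cost_usd(usage: Usage, input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Dollar cost of one call given prices per million tokens."""
    return (
        usage.input_tokens * input_price_per_mtok
        + usage.output_tokens * output_price_per_mtok
    ) / 1_000_000


def input_output_ratio(usage: Usage) -> float:
    """Input:output ratio, e.g. 20.0 means 20 input tokens per output token."""
    return usage.input_tokens / usage.output_tokens


# Hypothetical code-review call: a large input and a short verdict.
logged = Usage(input_tokens=10_000, output_tokens=500)
cost = call_cost_usd(logged, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
ratio = input_output_ratio(logged)
```

Logging these two numbers per request is usually enough to see which side of the ratio dominates your bill.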

Cost Reduction Levers

Latency Optimization

Latency has two distinct components: time to first token (TTFT) — how long before the user sees anything — and total completion time — how long before the full response is available. Streaming addresses TTFT. Total time depends on model size, prompt length, and output length.
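Both numbers are easy to capture around any streaming iterator. The sketch below is a generic timing wrapper, with a hypothetical name and an injectable clock so it can be checked without a network call. In practice you would pass it the SDK's text stream and leave `clock` at its default.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, Optional[float], float]:
    """Consume a text stream and return (full_text, ttft_seconds, total_seconds)."""
    start = clock()
    ttft: Optional[float] = None
    parts = []
    for chunk in chunks:
        if ttft is None and chunk:
            ttft = clock() - start  # first non-empty chunk marks time to first token
        parts.append(chunk)
    total = clock() - start
    return "".join(parts), ttft, total


# Deterministic demo: a fake clock that returns scripted timestamps.
ticks = iter([0.0, 0.3, 1.2])
text, ttft, total = measure_stream(["Hel", "lo"], clock=lambda: next(ticks))
```

Tracking TTFT and total time separately shows whether a slow feature needs a shorter prompt (TTFT) or a shorter or faster-generated answer (total time).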

Reducing Time to First Token

Reducing Total Completion Time

Error Handling and Retry Logic

LLM APIs are network services and will occasionally return errors. The two most common classes are transient errors (rate limits, temporary overload) and permanent errors (invalid input, authentication failure). Treating them identically — either always retrying or never retrying — is wrong. A well-designed client distinguishes between them.

HTTP Status Codes and What to Do

Status | Meaning | Action
200 | Success | Process response normally
400 | Bad request (invalid params, content policy) | Fix the request; do not retry
401 | Invalid API key | Fix auth; do not retry
429 | Rate limit exceeded | Retry with exponential backoff; respect Retry-After header
500, 529 | Server error / overloaded | Retry with exponential backoff, up to 5 attempts
413 | Request too large | Truncate prompt; do not retry as-is

python — exponential backoff with jitter
import time, random, anthropic
from anthropic import RateLimitError, APIStatusError

def call_with_retry(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)

        except RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)  # exponential + jitter
            print(f"rate limit hit, retry {attempt+1} in {wait:.1f}s")
            time.sleep(wait)

        except APIStatusError as e:
            if e.status_code in (500, 529) and attempt < max_retries - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                raise  # 4xx errors: don't retry

Add jitter (the random.uniform(0, 1)) to avoid the thundering herd problem: if 100 clients all hit a rate limit simultaneously and all retry at exactly 2 seconds, 4 seconds, 8 seconds — they all hammer the API together at each retry point. Jitter spreads retries across a window and significantly reduces retry-induced load spikes.
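The table above also says to respect the Retry-After header, which the retry function does not yet do. One way to fold that in is to isolate the delay calculation, as in the sketch below. The function name, the 60-second cap, and the injectable `rng` are choices made for this example. The header value is assumed to be passed in as a string (for instance read from the error's HTTP response headers), and only the seconds form is honored, since Retry-After may also be an HTTP date.

```python
import random
from typing import Callable, Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retrying; `attempt` is 0 for the first retry."""
    if retry_after is not None:
        try:
            # The server told us how long to wait: trust it, within sane bounds.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # not a plain number of seconds; fall back to computed backoff
    exponential = min(cap, base * (2 ** attempt))
    return exponential + rng()  # up to one second of jitter


first_retry = backoff_delay(0, rng=lambda: 0.5)
server_hint = backoff_delay(2, retry_after="7")
```

Capping the exponential term keeps a long outage from producing multi-minute sleeps, while the jitter term preserves the spreading effect described above.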

Structured Output: Making Responses Parseable

When your application parses the model's response programmatically — extracting fields, populating a database, triggering downstream logic — you need the output in a predictable format. Free text is not that. A model that returns "The answer is 42." when you needed {"answer": 42} breaks your parser. Structured output solves this by constraining what the model can generate.

Three Approaches, by Reliability

  1. Prompt-only JSON: instruct the model to "return valid JSON" in the system prompt and provide an example. Works 80–95% of the time on strong models; the remaining cases produce JSON with trailing text, missing quotes, or invalid escape sequences. Not acceptable for production without a validation + retry loop.
  2. JSON mode: a parameter (response_format: {type: "json_object"} on OpenAI) that constrains the model to always produce valid JSON. Solves syntax errors; does not guarantee the schema (keys, types, nesting) matches what you expect.
  3. Schema-constrained generation (tool use as output): define the expected output as a JSON Schema via the tool definitions; the model is forced to emit a tool call matching that schema. This is the most reliable approach and the one we recommend for production.

python — schema-constrained structured output via tool use
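import anthropic

client = anthropic.Anthropic()

# Illustrative placeholder; in practice this is the raw bug report text to parse
BUG_REPORT_TEXT = "Checkout page returns a 500 error when the cart is empty. This worked last week."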

# Define the expected output shape as a tool
extract_tool = {
    "name": "extract_bug_report",
    "description": "Extract structured data from a bug report.",
    "input_schema": {
        "type": "object",
        "properties": {
            "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
            "affected_component": {"type": "string"},
            "reproduction_steps": {"type": "array", "items": {"type": "string"}},
            "is_regression": {"type": "boolean"},
        },
        "required": ["severity", "affected_component", "is_regression"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # extraction task → use cheaper model
    max_tokens=512,
    tools=[extract_tool],
    tool_choice={"type": "tool", "name": "extract_bug_report"},  # force tool use
    messages=[{"role": "user", "content": BUG_REPORT_TEXT}],
)

# With tool_choice forcing the tool, the response contains a tool_use block
tool_call = next(block for block in response.content if block.type == "tool_use")
extracted = tool_call.input  # already a parsed dict, no json.loads() needed
print(extracted["severity"], extracted["affected_component"])

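By setting tool_choice to force the specific tool, you make the model answer with a call to that tool, and the SDK hands you its arguments as an already-parsed object shaped by your schema. No regex and no json.loads on free text. Forced tool use on its own is still not a hard guarantee: a response cut off at max_tokens, or an occasional missing required field, can slip through, so validate the result before trusting it downstream.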
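Even with a forced tool call, a cheap check on the extracted dict before it reaches a database is a sensible safety net. The sketch below hand-rolls the checks for the `extract_bug_report` schema using only the standard library. The function name is hypothetical, and a real project might prefer a schema-validation library instead.

```python
from typing import Any, Dict, List

SEVERITIES = {"low", "medium", "high", "critical"}
REQUIRED = ("severity", "affected_component", "is_regression")


def validate_bug_report(data: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload is usable."""
    errors = [f"missing required field: {key}" for key in REQUIRED if key not in data]
    if "severity" in data:
        severity = data["severity"]
        if not isinstance(severity, str) or severity not in SEVERITIES:
            errors.append(f"severity must be one of {sorted(SEVERITIES)}")
    if "affected_component" in data and not isinstance(data["affected_component"], str):
        errors.append("affected_component must be a string")
    if "is_regression" in data and not isinstance(data["is_regression"], bool):
        errors.append("is_regression must be a boolean")
    # reproduction_steps is optional in the schema, so a missing key is fine.
    steps = data.get("reproduction_steps", [])
    if not isinstance(steps, list) or not all(isinstance(s, str) for s in steps):
        errors.append("reproduction_steps must be a list of strings")
    return errors


good = {"severity": "high", "affected_component": "checkout", "is_regression": True}
problems = validate_bug_report({"severity": "urgent", "affected_component": "checkout"})
```

If the list comes back non-empty, the caller can log the payload and retry the extraction rather than writing bad data.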

API Key Management and Security

API keys are bearer tokens: whoever has one can make API calls billed to your account and may be able to reach data stored under it, such as uploaded files or batch results. Treating them carelessly is the fastest way to get a surprise invoice and a data breach simultaneously.

The Cardinal Rules

Server-Side Proxy Pattern

For browser applications, the correct architecture is: your backend holds the API key, your frontend sends requests to your own backend, and your backend forwards them to the LLM API. This pattern also lets you add rate limiting per user, content filtering, logging, and cost attribution — all at the proxy layer, without touching client code.
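The core of such a proxy can be sketched without committing to a web framework. In the example below, `handle_chat_request` and `PerUserRateLimiter` are hypothetical names, and `call_llm` is an injected function that, in a real backend, would wrap the SDK call using a key read from the server's environment. The limits (20 requests per 60 seconds) are arbitrary defaults, and the limiter keeps state in memory, so it only suits a single-process server.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerUserRateLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per user."""

    def __init__(self, limit: int = 20, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # forget requests that fell out of the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def handle_chat_request(user_id: str, user_message: str,
                        limiter: PerUserRateLimiter,
                        call_llm: Callable[[str], str]) -> dict:
    """Backend handler: the browser sends a user id and a message, never a key."""
    if not limiter.allow(user_id):
        return {"status": 429, "error": "rate limit exceeded"}
    if not user_message.strip():
        return {"status": 400, "error": "empty message"}
    reply = call_llm(user_message)  # the API key lives only inside call_llm
    return {"status": 200, "reply": reply}
```

Because the key is confined to `call_llm` on the server, rotating it or swapping providers requires no change to client code.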

Choosing the Right Model

Both Anthropic and OpenAI offer a tiered model lineup: a large capable model (Opus, GPT-4o), a fast balanced model (Sonnet, GPT-4o-mini), and a cheap fast model (Haiku, GPT-4o-mini on discount tiers). The instinct to "always use the best model" is economically wrong. Task-model matching is a real engineering decision.

Task type | Recommended tier | Why
Simple classification, extraction, routing | Small (Haiku, mini) | 10–20× cheaper; quality difference negligible for clear-cut tasks
Code generation, multi-step reasoning | Medium (Sonnet, GPT-4o) | Balanced capability/cost; handles most engineering tasks well
Complex architecture, nuanced judgment, research | Large (Opus, GPT-4o for hard tasks) | Quality difference justifies cost for high-stakes decisions
Long-running agents with many tool calls | Medium + caching | Cost accumulates fast; caching reduces input cost dramatically

The right way to choose is empirically: run both models on 50–100 real examples, compare quality on your specific task, compute the cost difference, and decide whether the quality delta justifies the price delta. Do not assume — measure.
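A minimal sketch of that comparison, assuming you have already scored both models on the same examples (here 1 for an acceptable answer, 0 otherwise) and totaled what each run cost. The function name, the scores, and the dollar figures are made up for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def compare_models(small_scores: Sequence[float], large_scores: Sequence[float],
                   small_cost_usd: float, large_cost_usd: float) -> Dict[str, float]:
    """Summarize a paired evaluation of two models on the same examples."""
    if not small_scores or len(small_scores) != len(large_scores):
        raise ValueError("both models must be scored on the same non-empty example set")
    return {
        "small_quality": mean(small_scores),
        "large_quality": mean(large_scores),
        "quality_delta": mean(large_scores) - mean(small_scores),
        "cost_ratio": large_cost_usd / small_cost_usd,
    }


# Hypothetical results on four examples; a real evaluation would use 50 to 100.
summary = compare_models(
    small_scores=[1, 1, 0, 1],
    large_scores=[1, 1, 1, 1],
    small_cost_usd=0.02,
    large_cost_usd=0.30,
)
```

The decision then comes down to whether the quality delta is worth the cost ratio for your task, which is a product judgment the numbers inform but do not make.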

takeaway

Building with LLM APIs is not hard, but building well requires understanding the layered concerns: get the message structure right, stream for perceived latency, cache stable prefixes to control costs, constrain output format for reliable parsing, handle errors with exponential backoff, and keep keys server-side. Each layer compounds with the others — an application that does all of these is not just cheaper and faster, it is fundamentally more reliable and maintainable.

🎯 interview hot-takes

What is prompt caching and why does it matter? The API pre-processes a stable input prefix (system prompt, document) and reuses the KV cache for subsequent calls with the same prefix, yielding up to roughly a 90% cost reduction and a 3–5× latency improvement on the cached portion. Essential for document Q&A, code review pipelines, and any high-volume workload with a fixed context.

How do you guarantee structured output from an LLM? Define the expected schema as a JSON Schema in a tool definition and force the model to call that tool via tool_choice. The response then arrives as parsed tool arguments shaped by the schema rather than as free text. Prompt-only instructions to "return JSON" work most of the time but fail often enough to matter in production; schema-constrained tool use, backed by validation of the result, is the reliable approach.

Why is model selection an engineering decision, not an "always use the best" decision? Output tokens cost 3–5× input tokens; larger models are 10–20× more expensive than smaller ones. For classification, extraction, and routing tasks, smaller models match larger ones on quality while reducing cost by an order of magnitude. Measure quality on your task and compute the cost/quality tradeoff empirically.
