Prompt Engineering for Code

Every AI coding tool runs the same core loop: you supply text, the model generates code. That sounds simple until you realise that the text you supply — the prompt — is the only lever you have over what comes out. The model is fixed; the context window is finite; the only variable you control is what you put into it. And yet most engineers treat prompting as an afterthought, typing a one-liner and wondering why the output misses the mark.

This article is a deep dive into prompt engineering specifically for code generation and agentic coding tasks. It covers the mechanics of why prompts determine output quality, the concrete techniques that consistently raise the bar — context, constraints, planning, few-shot examples, iterative refinement, test-driven prompting — and the anti-patterns that silently produce mediocre or broken code. By the end you should have a repeatable mental model for prompting any AI coding tool, from Copilot completions to Claude Code agentic tasks.

⚡ Quick Takeaways

Context is the multiplier. Giving the model the right files, the actual error message, and the real constraints is worth more than any other single technique.
Plan before code. Asking the model to outline its approach first catches design errors before they get embedded in 200 lines of implementation.
Few-shot examples collapse ambiguity. One concrete "given X, produce Y" example is clearer than a paragraph of prose describing the same thing.
State acceptance criteria, not just goals. "Add auth" is a goal; "add JWT middleware that rejects requests without a valid token and returns 401 with a JSON error body" is a testable specification.
Iterate with targeted follow-ups. One long mega-prompt is rarely better than a crisp initial prompt plus focused correction rounds.
Vague prompts produce plausible-looking wrong code. The model will never tell you the prompt was ambiguous — it will just hallucinate a reasonable-seeming answer.

tldr

Prompting for code is a skill, not a knack. The core formula is: right context + clear spec + acceptance criteria + plan-first + few-shot examples. Nail those five ingredients and the model output improves dramatically. Skip them and you get plausible-looking code that silently violates your constraints.

Why the Prompt Determines Code Quality

A large language model is, at its core, a next-token predictor conditioned on everything in the context window. It has no background knowledge about your repo, your team's conventions, the production constraint you mentioned in Slack, or the edge case that burned you last sprint. All it knows is what you give it right now.

This has a profound implication: the model's output is bounded by the quality of its input. A frontier model with a bad prompt can produce worse code than a smaller model with a great prompt, because the smaller model is at least working with accurate, complete information. Context assembly (choosing what to include in the prompt) is often the dominant factor in output quality, and it is entirely under your control.

There is also an asymmetry of failure you need to understand: the model will always produce something. It will not tell you the prompt was too vague; it will fill in the gaps with statistically plausible completions. In a code context that means plausible-looking code that may compile, pass a surface read, and still be subtly wrong in ways that only surface in production. This is why a junior engineer who vibe-codes aggressively can look productive for weeks before the technical debt crystallises.

Prompt engineering for code is therefore not about magic incantations. It is about systematically removing the model's uncertainty: giving it the files it needs, the error it must fix, the constraints it must respect, and the examples that show the style it should match. Every technique in this article reduces a different kind of uncertainty.

Giving the Model Enough Context

The single highest-leverage thing you can do is also the most mechanical: paste the right code. Most engineers under-paste. They describe a function in prose when they should paste the function. They mention an error message when they should paste the full stack trace. They say "the auth module" when they should paste the relevant fifty lines from it.

What to include

The function or file being changed — don't describe it, show it. The model needs the actual signatures, variable names, and existing logic.
Closely related code — callers, called functions, data types. If your function returns a UserRecord, paste the UserRecord struct definition.
The actual error message or failing test output — not "it throws an error," but the full stack trace including line numbers.
Relevant configuration — build config, schema, environment constraints that change what valid code looks like.
Prior art in the codebase — "here is how we currently handle pagination in the orders service" grounds the model in your actual conventions.

What not to include

Entire files when only a few functions matter — summarise or excerpt.
Noise that buries important context in the middle of a long prompt (models tend to attend more strongly to the beginning and end of the window).
Secrets, PII, or proprietary data — always sanitise before pasting into any cloud-hosted model.

A useful heuristic: if a new teammate were pairing with you on this exact task, what would you put on a shared screen? Paste that. Nothing less, nothing gratuitously more.
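
The "sanitise before pasting" rule can be partly mechanised. The following is a minimal sketch, not a complete secret scanner: the `redactSecrets` name and both patterns are illustrative assumptions, and a real codebase would need patterns for its own credential formats.

```go
package main

import (
	"fmt"
	"regexp"
)

// secretPatterns is a deliberately small, illustrative set.
// Each pattern captures the key or prefix in group 1 so it can be kept.
var secretPatterns = []*regexp.Regexp{
	// "Bearer <token>" in pasted headers or logs.
	regexp.MustCompile(`(?i)(bearer\s+)[A-Za-z0-9._~+/=-]+`),
	// key=value or key: value pairs whose key looks sensitive.
	regexp.MustCompile(`(?i)((?:password|secret|api[_-]?key|token)\s*[=:]\s*)\S+`),
}

// redactSecrets replaces the value part of each match with a placeholder,
// keeping the key or prefix so the model still sees the structure.
func redactSecrets(s string) string {
	for _, re := range secretPatterns {
		s = re.ReplaceAllString(s, "${1}[REDACTED]")
	}
	return s
}

func main() {
	snippet := "Authorization: Bearer eyJhbGci.abc.def\nDB_PASSWORD=hunter2\nport=8080"
	fmt.Println(redactSecrets(snippet))
}
```

Keeping the key names and replacing only the values preserves the shape of the config or log, which is usually what the model needs to reason about the problem.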

prompt — bad context

# ❌ Vague — model must guess what "the auth middleware" looks like
Fix the bug in our auth middleware where sometimes
tokens are accepted even when expired.

prompt — good context

# ✅ Model has everything it needs to make a precise fix
Here is our JWT middleware (Go):

func AuthMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        tokenStr := r.Header.Get("Authorization")
        claims := &Claims{}
        token, err := jwt.ParseWithClaims(tokenStr, claims, keyFunc)
        if err != nil || !token.Valid {
            http.Error(w, "unauthorized", http.StatusUnauthorized)
            return
        }
        next.ServeHTTP(w, r)
    })
}

Bug: tokens that are expired (Claims.ExpiresAt in the past) are
sometimes accepted. jwt.ParseWithClaims does check expiry, but we
strip the "Bearer " prefix inconsistently — see the raw header value
in the failing test output below:

    Authorization: Bearer eyJhbGci...   <-- has "Bearer " prefix
    jwt: token is malformed              <-- parse fails on the prefix, before expiry is checked

Fix the prefix stripping and make sure an expired token always 401s.
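
With that much context, the model can propose something concrete. One possible shape for the prefix handling is sketched below; the `bearerToken` helper is a hypothetical name, and the JWT parsing and expiry check are assumed to stay with the library call already in the middleware.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an Authorization header value.
// It returns ok=false when the header is empty, uses another scheme,
// or carries no token, so the caller can reject the request up front.
func bearerToken(header string) (token string, ok bool) {
	scheme, rest, found := strings.Cut(strings.TrimSpace(header), " ")
	if !found || !strings.EqualFold(scheme, "Bearer") {
		return "", false
	}
	token = strings.TrimSpace(rest)
	return token, token != ""
}

func main() {
	for _, h := range []string{"Bearer eyJhbGci", "bearer   eyJhbGci", "eyJhbGci", ""} {
		tok, ok := bearerToken(h)
		fmt.Printf("%q -> token=%q ok=%v\n", h, tok, ok)
	}
}
```

In the middleware, a false `ok` would map to the same 401 response as a failed parse, so every request takes a single, consistent path through the prefix handling.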

Specifying Requirements with Acceptance Criteria

There is a category difference between a goal and a specification. "Add rate limiting" is a goal. A specification tells the model what done looks like: which endpoints, what limit, what header carries the remaining count, what status code on exhaustion, what the reset window is, and whether limits are per-IP or per-user. The model cannot read your mind; if you don't specify it, you will get the model's default assumption, which may not match yours.

Good acceptance criteria share two properties: they are concrete (names, numbers, status codes, field names) and they are testable (you can write a test that passes if and only if the criterion is met). If you can't write a test for it, the criterion is probably too vague.

Vague goal	Testable specification
Add input validation to the user endpoint	POST /users must return 422 with `{"error":"email_invalid"}` if email is missing or not RFC 5322 format; return 422 `{"error":"name_too_long"}` if name > 100 chars
Make it faster	The `listProducts` query must return in <50 ms at p99 with 10k rows; add an index on `(category_id, created_at DESC)` if missing
Handle errors better	Wrap all `db.Query` calls to log `query=<sql> err=<msg> duration=<ms>` at ERROR level; propagate the original error up, never swallow it
Add caching	Cache GET /products/:id in Redis with TTL 300 s; use key `product:<id>`; on cache miss, fetch from DB and populate; on 404 from DB, do not cache

The right column is more words, but it produces code that is measurably correct. The left column produces code that is plausibly correct — which is a very different thing.
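
To make "testable" concrete, here is a sketch of the first row of the table turned into a handler plus a small checker. It is an illustration under assumptions: the `createUser` and `check` names, the 400 `invalid_json` response, and the 201 success status are not part of the specification above, and `net/mail` is used as one reasonable stand-in for "RFC 5322 format".

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/mail"
	"strings"
	"unicode/utf8"
)

type createUserRequest struct {
	Email string `json:"email"`
	Name  string `json:"name"`
}

// writeError sends the {"error": code} body named in the criteria.
func writeError(w http.ResponseWriter, status int, code string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(map[string]string{"error": code})
}

// createUser implements only the two validation criteria; persistence is omitted.
func createUser(w http.ResponseWriter, r *http.Request) {
	var req createUserRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		writeError(w, http.StatusBadRequest, "invalid_json") // assumption, not in the spec
		return
	}
	// net/mail parses RFC 5322 addresses; a missing email fails here too.
	// Comparing Address to the input rejects the "Name <addr>" form.
	if addr, err := mail.ParseAddress(req.Email); err != nil || addr.Address != req.Email {
		writeError(w, http.StatusUnprocessableEntity, "email_invalid")
		return
	}
	if utf8.RuneCountInString(req.Name) > 100 {
		writeError(w, http.StatusUnprocessableEntity, "name_too_long")
		return
	}
	w.WriteHeader(http.StatusCreated)
}

// check posts body to the handler and returns the status and error code.
func check(body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/users", strings.NewReader(body))
	createUser(rec, req)
	var resp map[string]string
	json.NewDecoder(rec.Body).Decode(&resp) // body is empty on success
	return rec.Code, resp["error"]
}

func main() {
	fmt.Println(check(`{"name":"Ada"}`))
	fmt.Println(check(`{"email":"ada@example.com","name":"Ada"}`))
}
```

Each criterion maps to one call of `check` with one expected status and error code, which is the property that makes the right-hand column verifiable.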

Ask for the Plan First

One of the most reliable techniques for non-trivial tasks is a two-step sequence: first ask the model to outline its approach, then ask it to implement. This catches design problems before they get embedded in code, and it forces the model to reason about the problem rather than pattern-match to the nearest boilerplate.

The planning prompt is usually short: "Before writing any code, outline the steps you'll take to implement X. List any assumptions you're making. Note any edge cases I should be aware of." Read the plan. If the plan is wrong — wrong approach, wrong library, misunderstood requirement — correct it before implementation. A two-minute plan review saves a twenty-minute debugging session.

two-step prompting

## Step 1 — Plan prompt
I need to add distributed rate limiting to our Go API gateway.
Requirements:
- 100 req/min per API key, sliding window
- Limits stored in Redis; gateway pods are stateless
- On limit exceeded: 429 with Retry-After header (seconds until window resets)
- Keys are passed in X-API-Key header

Before writing any code, outline:
1. The algorithm you'll use (sliding-window log? sliding-window counter?)
2. The Redis data structure and key schema
3. The middleware interface in Go
4. Any edge cases (key missing, Redis down, clock skew)

## Step 2 — Implementation prompt (after reviewing the plan)
The plan looks good. Implement it.
Use go-redis v9. The middleware should be chainable with our existing
http.Handler chain. Do not introduce a global singleton — accept a
*redis.Client as a parameter so it can be injected in tests.

Planning-first is especially valuable for agentic tasks where the model will execute multiple steps autonomously. An agent that starts implementing immediately can go deep down a wrong path before you notice. An agent that surfaces a plan lets you redirect at the cheapest possible moment.
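
For a sense of what the "algorithm" item in such a plan might describe, here is a sliding-window log reduced to an in-memory sketch. It is a stand-in under stated assumptions: a real implementation would keep the timestamps in Redis as the requirements say, and the `slidingLog`, `newSlidingLog`, and `at` names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slidingLog is an in-memory stand-in for the Redis structure a plan
// would choose; it keeps one timestamp per accepted request per key.
type slidingLog struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	hits   map[string][]time.Time
}

func newSlidingLog(limit int, window time.Duration) *slidingLog {
	return &slidingLog{limit: limit, window: window, hits: make(map[string][]time.Time)}
}

// Allow reports whether a request for key at time now is within the limit.
// When it is not, retryAfter is the time until the oldest hit leaves the window.
func (s *slidingLog) Allow(key string, now time.Time) (ok bool, retryAfter time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.limit <= 0 {
		return false, s.window
	}

	// Drop timestamps that have aged out of the window.
	cutoff := now.Add(-s.window)
	hits := s.hits[key]
	i := 0
	for i < len(hits) && !hits[i].After(cutoff) {
		i++
	}
	hits = hits[i:]

	if len(hits) >= s.limit {
		s.hits[key] = hits
		return false, hits[0].Add(s.window).Sub(now)
	}
	s.hits[key] = append(hits, now)
	return true, 0
}

// at builds a fixed reference time so the demo does not depend on the wall clock.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

func main() {
	lim := newSlidingLog(2, time.Minute)
	for _, sec := range []int{0, 10, 20, 61} {
		ok, retry := lim.Allow("key-1", at(sec))
		fmt.Printf("t=%ds allowed=%v retry_after=%v\n", sec, ok, retry)
	}
}
```

Passing `now` in as a parameter rather than calling `time.Now()` inside keeps the sketch deterministic, and it surfaces the clock-skew question from the plan prompt: with several gateway pods, whose clock supplies `now` is a design decision worth settling at the plan stage.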

Few-Shot Examples: Show, Don't Just Tell

Few-shot prompting — providing one or more input/output examples before asking for the real thing — is one of the oldest and most reliable techniques in prompt engineering. For code, it is particularly powerful because code is unambiguous: a single example pins down naming conventions, indentation, error-handling style, return type patterns, and logging format simultaneously, in a way that paragraphs of description never can.

When few-shot pays off most

Boilerplate with a specific shape — "write CRUD endpoints for the Product model, following the same pattern as these existing User endpoints." Paste the User endpoints as the example.
Code style you haven't documented — if your team uses a particular error-wrapping pattern or a non-standard logging format, one example is worth a thousand words.
Data transformations with tricky edge cases — show input data and expected output data; the model infers the mapping including edge cases from the examples.
Test authoring — show one or two existing tests; the model will match table-driven style, assertion library, setup/teardown patterns exactly.

few-shot — generating consistent handlers

// Example handler (existing code — paste as the few-shot example)
func (h *Handler) GetUser(w http.ResponseWriter, r *http.Request) {
    id := chi.URLParam(r, "id")
    user, err := h.store.GetUser(r.Context(), id)
    if errors.Is(err, store.ErrNotFound) {
        h.writeError(w, http.StatusNotFound, "user_not_found")
        return
    }
    if err != nil {
        h.writeError(w, http.StatusInternalServerError, "internal_error")
        return
    }
    h.writeJSON(w, http.StatusOK, user)
}

// Prompt after pasting example:
// Following exactly the same pattern above (chi router, h.store,
// h.writeError / h.writeJSON, errors.Is for not-found),
// write GetProduct and DeleteProduct handlers.

The key discipline: paste real examples from your codebase, not invented ones. Invented examples can accidentally introduce conventions you don't actually use.

Specifying Language, Style, and Boundaries

AI coding models are polyglot. Without explicit instruction they will pick the language, library, and style they consider most common for the task. That may not be your language, your library, or your style. Always state these explicitly when they matter.

Language and runtime

Specify the language version when it matters: Go 1.22, Python 3.12 with type annotations, TypeScript 5 strict mode, Java 21 with records and sealed interfaces. Models generally know which features each version offers and will usually use or avoid them accordingly.

Libraries and frameworks

Name the specific library: use pgx/v5 not database/sql, use Zod for validation not Joi, use React Query v5 not SWR. Without this the model will pick whatever appeared most often in its training data, which may conflict with your existing dependency tree.

What is out of scope

Boundary constraints are as important as positive requirements. "Do not add any new dependencies," "do not change the public interface of this function," "do not add a database migration — this must be handled at the application layer," "do not touch the test file." These negative constraints prevent the model from "helpfully" restructuring things you didn't ask it to touch — a very common failure mode.

style and boundary constraints

# Explicit constraints prevent "helpful" drift
Add a retry wrapper around the S3 upload call in upload.go.

Constraints:
- Language: Go 1.22
- Use only stdlib (context, time, errors) — do NOT add a retry library
- Max 3 attempts, exponential backoff starting at 100 ms, cap at 2 s
- Do not change the function signature of UploadFile
- Do not modify upload_test.go
- Log each retry attempt at WARN level with attempt=N err=<msg> using
  our existing slog.Default() logger
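
A response that respects those constraints might look roughly like the sketch below. It is illustrative only: `withRetry`, `backoff`, and `flakyUpload` are hypothetical names, the real `UploadFile` is not shown in this article, and the simulated failure stands in for the S3 call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"
)

const (
	maxAttempts = 3
	baseDelay   = 100 * time.Millisecond
	maxDelay    = 2 * time.Second
)

// backoff returns the delay before retry number n (1-based):
// 100 ms, 200 ms, 400 ms, ... capped at 2 s.
func backoff(n int) time.Duration {
	d := baseDelay
	for i := 1; i < n && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// withRetry calls op up to maxAttempts times, sleeping between attempts
// and stopping early if ctx is cancelled.
func withRetry(ctx context.Context, op func(context.Context) error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(ctx); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		slog.Default().Warn("upload failed, retrying", "attempt", attempt, "err", err)
		select {
		case <-time.After(backoff(attempt)):
		case <-ctx.Done():
			return errors.Join(ctx.Err(), err)
		}
	}
	return fmt.Errorf("after %d attempts: %w", maxAttempts, err)
}

// flakyUpload simulates an upload that fails the first `failures` times.
func flakyUpload(failures int) (calls int, err error) {
	err = withRetry(context.Background(), func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("simulated S3 timeout")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := flakyUpload(2)
	fmt.Println(calls, err)
}
```

Because the wrapper takes the operation as a closure, the existing `UploadFile` signature can stay untouched, which is exactly what the negative constraints in the prompt were protecting.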

Iterating and Following Up Precisely

Good prompting is a dialogue, not a monologue. The first response is rarely perfect; the question is how to correct it efficiently. The worst approach is to start over with a longer mega-prompt. The best approach is a short, surgical follow-up that names exactly what is wrong.

Anatomy of a good correction

Name the specific problem — "the retry logic doesn't reset the backoff timer between calls" is actionable; "this isn't quite right" is not.
Quote the offending code — "on line 23, attempt := 0 is inside the loop; it should be outside." The model can see its own output but a precise quote removes ambiguity.
State what you want instead — not just what's wrong but what correct looks like.
Ask for only one correction at a time when possible — compound corrections ("fix X, and also Y, and also refactor Z") produce confused diffs where it's hard to verify each part.

One valuable meta-technique: after getting output, ask the model to critique its own work. "What edge cases does this implementation miss?" or "What assumptions did you make that might not hold?" Models are surprisingly good at finding their own holes when asked directly.

Test-Driven Prompting

The most rigorous prompting workflow borrows from TDD: specify the tests first, then ask the model to make them pass. This forces the specification to be precise (tests are unambiguous) and gives the model an automated feedback loop it can use to verify its own output.

test-driven prompting workflow

## Step 1: Write the tests yourself (or prompt for tests first)
func TestParseISO8601Duration(t *testing.T) {
    cases := []struct{ input string; want time.Duration; wantErr bool }{
        {"PT30S",   30 * time.Second,  false},
        {"PT1M30S", 90 * time.Second,  false},
        {"P1D",     24 * time.Hour,    false},
        {"P1Y",     0,               true},  // years not supported
        {"",        0,               true},
        {"garbage", 0,               true},
    }
    for _, c := range cases {
        got, err := ParseISO8601Duration(c.input)
        if (err != nil) != c.wantErr {
            t.Errorf("%q: wantErr=%v got err=%v", c.input, c.wantErr, err)
        }
        if !c.wantErr && got != c.want {
            t.Errorf("%q: want %v got %v", c.input, c.want, got)
        }
    }
}

## Step 2: Prompt to implement against the tests
// Implement ParseISO8601Duration(s string) (time.Duration, error)
// in duration.go so all cases above pass. Do not add dependencies.

When working with an agentic tool like Claude Code, you can take this further: "run the tests after implementing and iterate until they all pass." The agent closes the feedback loop automatically, and you only review the final diff when tests are green.
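
For reference, one implementation that would satisfy the table above is sketched here. It is one of many valid answers a model could return, and it makes assumptions the tests do not pin down: only days, hours, minutes, and whole seconds are supported, and overflow for very large values is not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// durationRE accepts days and time components only; years, months, and
// weeks are rejected because their length is not a fixed time.Duration.
var durationRE = regexp.MustCompile(`^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$`)

// ParseISO8601Duration parses a subset of ISO 8601 durations such as
// "PT30S", "PT1M30S", or "P1DT2H". Fractional values are not supported.
func ParseISO8601Duration(s string) (time.Duration, error) {
	m := durationRE.FindStringSubmatch(s)
	// A trailing "T" with no time components is not a valid duration.
	if m == nil || strings.HasSuffix(s, "T") {
		return 0, fmt.Errorf("invalid or unsupported ISO 8601 duration %q", s)
	}
	units := []time.Duration{24 * time.Hour, time.Hour, time.Minute, time.Second}
	var total time.Duration
	seen := false
	for i, unit := range units {
		if m[i+1] == "" {
			continue
		}
		n, err := strconv.ParseInt(m[i+1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("duration %q: %w", s, err)
		}
		total += time.Duration(n) * unit
		seen = true
	}
	if !seen {
		return 0, fmt.Errorf("duration %q has no components", s)
	}
	return total, nil
}

func main() {
	for _, in := range []string{"PT30S", "PT1M30S", "P1D", "P1Y", ""} {
		d, err := ParseISO8601Duration(in)
		fmt.Printf("%q -> %v, err=%v\n", in, d, err)
	}
}
```

The test table is what forces the decisions here: without the `P1Y` and empty-string cases, a model could reasonably have guessed a different policy for unsupported or missing input.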

Decomposing Large Changes for the Model

Context windows are finite and attention degrades over long, complex tasks. A change that touches fifteen files, reorganises a data model, and updates three API surfaces is not one prompt — it is five or six. Breaking large changes into focused, independently reviewable steps produces better output and makes each step easier to verify.

A decomposition heuristic

One data model change per prompt — if you're changing a schema, do that first and verify it before touching anything that depends on it.
One interface at a time — change the interface definition, then update implementations, then update callers. Each is a separate prompt with the previous output pasted as context.
Tests before implementation — in each step, write or confirm the tests first, then implement.
Vertical slices for feature work — implement one endpoint end-to-end (DB → service → handler → test) before moving to the next, rather than doing all handlers then all services.

When the change is genuinely large, write out the decomposition explicitly and paste it into the first prompt: "I'm going to make this change in four steps. Here is step 1. Implement only step 1." This prevents the model from speculatively implementing steps 2–4 and creating a diff you can't review.

Common Anti-Patterns

Understanding failure modes is as useful as understanding best practices. These are the prompting anti-patterns that consistently produce bad output.

The aspirational vague prompt

"Refactor this to be cleaner and more maintainable." This gives the model unlimited latitude to restructure anything it considers suboptimal. You will get extensive changes that may or may not match your conventions, touching code you didn't mean to touch. Be specific about what "cleaner" means: "extract the three nested if-blocks in processPayment into named helper functions; do not change any other logic."

The copy-paste cargo cult

Pasting a large block of code with "fix the bug." Without a description of the symptom, the reproduction case, or which line the error appears on, the model will guess. It may guess correctly or it may "fix" a different part of the code and introduce a regression. Always include the observable failure: the stack trace, the failing test output, the wrong return value.

The missing negative constraint

Asking for new functionality without specifying what must not change. The model will often "improve" adjacent code, rename variables it finds confusing, or add dependencies it considers standard. These unasked-for changes muddy your diff and can introduce subtle breaks. Always include "do not change X" for anything you need to stay stable.

The one-shot mega-prompt

Trying to fully specify a complex feature in a single enormous prompt. This overloads the model's instruction-following capacity; constraints buried deep in a long prompt tend to be underweighted or dropped. For complex work, iterate: prompt for the plan, approve, prompt for step 1, review, continue. The total quality is higher even if the total number of turns is larger.

anti-pattern vs. correct prompt

## ❌ Anti-pattern: vague + no context + no constraints
Add pagination to the API.

## ✅ Correct: context + spec + constraints + acceptance criteria
File: handlers/products.go (pasted below)
Current GET /products returns all rows, which causes OOM at scale.

Add cursor-based pagination:
- Query params: limit (int, default 20, max 100) and cursor (opaque string)
- Response: add "next_cursor" field to the existing JSON envelope;
  null if no more pages
- Cursor encodes the last row's (created_at, id) as base64 JSON —
  do not use offset
- Return 400 {"error":"invalid_limit"} if limit < 1 or > 100
- Do not change the shape of the Product objects in the response
- Do not add any new SQL queries beyond what listProducts already uses;
  add the WHERE clause to the existing query

[paste handlers/products.go here]
[paste store/products.go here]
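
Because the specification names concrete values, parts of the expected output can be written down ahead of time. The sketch below covers the cursor encoding and limit validation only. It is an illustration under assumptions: the `cursor` type, the helper names, the integer `ID` type, and the choice of URL-safe base64 are not fixed by the prompt.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
	"time"
)

// cursor holds the sort key of the last row on the previous page.
// The int64 ID is an assumption; the real Product ID type may differ.
type cursor struct {
	CreatedAt time.Time `json:"created_at"`
	ID        int64     `json:"id"`
}

// encodeCursor serialises c as URL-safe base64 JSON.
func encodeCursor(c cursor) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(b), nil
}

// decodeCursor reverses encodeCursor. Callers should treat an absent
// cursor parameter as "first page" before calling this.
func decodeCursor(s string) (cursor, error) {
	var c cursor
	b, err := base64.RawURLEncoding.DecodeString(s)
	if err != nil {
		return c, fmt.Errorf("invalid cursor: %w", err)
	}
	if err := json.Unmarshal(b, &c); err != nil {
		return c, fmt.Errorf("invalid cursor: %w", err)
	}
	return c, nil
}

var errInvalidLimit = errors.New("invalid_limit")

// parseLimit applies the default of 20 and the 1..100 bounds from the spec.
func parseLimit(raw string) (int, error) {
	if raw == "" {
		return 20, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 || n > 100 {
		return 0, errInvalidLimit
	}
	return n, nil
}

func main() {
	c := cursor{CreatedAt: time.Date(2024, 5, 1, 12, 0, 0, 0, time.UTC), ID: 42}
	s, _ := encodeCursor(c)
	back, err := decodeCursor(s)
	fmt.Println(s, back.ID, back.CreatedAt.Equal(c.CreatedAt), err)
	fmt.Println(parseLimit(""))
	fmt.Println(parseLimit("500"))
}
```

None of these decisions could be recovered from "Add pagination to the API"; each one traces back to a line in the specification or is a gap the prompt still leaves open.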

Accepting the first output without review

This is not a prompting anti-pattern per se (it is a workflow anti-pattern), but it is worth naming here because it is the failure mode that renders all of the above techniques moot. Every AI-generated code block should be read, understood, and consciously accepted. If you cannot explain what a function does, you are not ready to merge it. The model is your pair programmer; you are still the code reviewer.

Building a Prompting Intuition Over Time

Prompting is a skill that compounds. Engineers who have been using AI coding tools for a year write dramatically better prompts than they did at the start, because they have internalised which techniques close which gaps and which failure modes to pre-empt.

A few habits that accelerate this learning curve:

Keep a prompt log — note prompts that worked unusually well or badly, and what made the difference. You will start to see patterns in what causes failure.
Read your diffs carefully — every time you catch a subtle error in AI output, trace it back to what the prompt was missing. That trace is a prompting lesson.
Experiment on safe ground — use test files or throwaway branches to try new prompting strategies before using them on production-critical changes.
Share good prompts with your team — a shared library of high-quality prompts for common tasks (writing migration scripts, generating gRPC handlers, updating OpenAPI specs) raises the team floor, not just the individual ceiling.

The best mental model for prompt engineering: you are writing a specification that will be executed by a highly capable but completely naive contractor. They will do exactly what you say, interpret ambiguity in whatever way seems most common, and never ask for clarification. Write specifications accordingly — unambiguous, complete, and with explicit constraints on what not to do.

takeaway

Prompt engineering for code is really specification engineering. The techniques — context, acceptance criteria, planning-first, few-shot examples, decomposition, precise follow-up — all serve one goal: reducing the model's uncertainty so its considerable capability is aimed precisely at your actual problem. Master the specification, and the code quality follows.

🎯 interview hot-takes

Why does context matter more than model size in AI coding tools? The context window is finite; the model can only reason about what's in it. Giving it the right files and error messages usually produces better output than a larger model working from vague prose.
What is planning-first prompting and why does it help? Asking the model to outline its approach before coding surfaces design errors at the cheapest possible moment — before they get embedded in implementation — and forces the model to reason rather than pattern-match.
What is the most dangerous AI coding anti-pattern? Accepting the first output without review. The model produces plausible-looking code even when it has subtly violated your constraints, and "plausible but wrong" is harder to catch than an obvious error.