Every AI coding tool runs the same core loop: you supply text, the model generates code. That sounds simple until you realise that the text you supply — the prompt — is the only lever you have over what comes out. The model is fixed; the context window is finite; the only variable you control is what you put into it. And yet most engineers treat prompting as an afterthought, typing a one-liner and wondering why the output misses the mark.
This article is a deep dive into prompt engineering specifically for code generation and agentic coding tasks. It covers the mechanics of why prompts determine output quality, the concrete techniques that consistently raise the bar — context, constraints, planning, few-shot examples, iterative refinement, test-driven prompting — and the anti-patterns that silently produce mediocre or broken code. By the end you should have a repeatable mental model for prompting any AI coding tool, from Copilot completions to Claude Code agentic tasks.
- Context is the multiplier. Giving the model the right files, the actual error message, and the real constraints is worth more than any other single technique.
- Plan before code. Asking the model to outline its approach first catches design errors before they get embedded in 200 lines of implementation.
- Few-shot examples collapse ambiguity. One concrete "given X, produce Y" example is clearer than a paragraph of prose describing the same thing.
- State acceptance criteria, not just goals. "Add auth" is a goal; "add JWT middleware that rejects requests without a valid token and returns 401 with a JSON error body" is a testable specification.
- Iterate with targeted follow-ups. One long mega-prompt is rarely better than a crisp initial prompt plus focused correction rounds.
- Vague prompts produce plausible-looking wrong code. The model will never tell you the prompt was ambiguous — it will just hallucinate a reasonable-seeming answer.
Prompting for code is a skill, not a knack. The core formula is: right context + clear spec + acceptance criteria + plan-first + few-shot examples. Nail those five ingredients and the model output improves dramatically. Skip them and you get plausible-looking code that silently violates your constraints.
Why the Prompt Determines Code Quality
A large language model is, at its core, a next-token predictor conditioned on everything in the context window. It has no background knowledge about your repo, your team's conventions, the production constraint you mentioned in Slack, or the edge case that burned you last sprint. All it knows is what you give it right now.
This has a profound implication: the model's output is bounded by the quality of its input. A frontier model with a bad prompt can easily produce worse code than a smaller model with a great prompt, because the smaller model is at least working with accurate, complete information. Context assembly (choosing what to include in the prompt) is usually the dominant factor in output quality, and it is entirely under your control.
There is also an asymmetry of failure you need to understand: the model will always produce something. It will not tell you the prompt was too vague; it will fill in the gaps with statistically plausible completions. In a code context that means plausible-looking code that may compile, pass a surface read, and still be subtly wrong in ways that only surface in production. This is why a junior engineer who vibe-codes aggressively can look productive for weeks before the technical debt crystallises.
Prompt engineering for code is therefore not about magic incantations. It is about systematically removing the model's uncertainty: giving it the files it needs, the error it must fix, the constraints it must respect, and the examples that show the style it should match. Every technique in this article reduces a different kind of uncertainty.
Giving the Model Enough Context
The single highest-leverage thing you can do is also the most mechanical: paste the right code. Most engineers under-paste. They describe a function in prose when they should paste the function. They mention an error message when they should paste the full stack trace. They say "the auth module" when they should paste the relevant fifty lines from it.
What to include
- The function or file being changed — don't describe it, show it. The model needs the actual signatures, variable names, and existing logic.
- Closely related code — callers, called functions, data types. If your function returns a UserRecord, paste the UserRecord struct definition.
- The actual error message or failing test output — not "it throws an error," but the full stack trace including line numbers.
- Relevant configuration — build config, schema, environment constraints that change what valid code looks like.
- Prior art in the codebase — "here is how we currently handle pagination in the orders service" grounds the model in your actual conventions.
What not to include
- Entire files when only a few functions matter — summarise or excerpt.
- Noise that buries important context in the middle of a long prompt (models tend to attend more strongly to the beginning and end of the window).
- Secrets, PII, or proprietary data — always sanitise before pasting into any cloud-hosted model.
A useful heuristic: if a new teammate were pairing with you on this exact task, what would you put on a shared screen? Paste that. Nothing less, nothing gratuitously more.
# ❌ Vague — model must guess what "the auth middleware" looks like
Fix the bug in our auth middleware where sometimes
tokens are accepted even when expired.
# ✅ Model has everything it needs to make a precise fix
Here is our JWT middleware (Go):
func AuthMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tokenStr := r.Header.Get("Authorization")
		claims := &Claims{}
		token, err := jwt.ParseWithClaims(tokenStr, claims, keyFunc)
		if err != nil || !token.Valid {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
Bug: tokens that are expired (Claims.ExpiresAt in the past) are
sometimes accepted. jwt.ParseWithClaims does check expiry, but we
strip the "Bearer " prefix inconsistently — see the raw header value
in the failing test output below:
Authorization: Bearer eyJhbGci... <-- has "Bearer " prefix
jwt: token is malformed <-- parse error swallowed, falls through
Fix the prefix stripping and make sure an expired token always 401s.
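With that much context, the prefix-handling part of the fix becomes small and checkable. The following is a minimal sketch of a helper the middleware could call before jwt.ParseWithClaims. The name bearerToken and its exact behaviour (case-insensitive scheme, empty token rejected) are assumptions made for illustration, not part of the original middleware.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the raw token from an Authorization header value.
// It returns ok=false when the header is empty, uses a different scheme,
// or has nothing after the scheme. (Hypothetical helper for illustration.)
func bearerToken(header string) (token string, ok bool) {
	scheme, rest, found := strings.Cut(strings.TrimSpace(header), " ")
	if !found || !strings.EqualFold(scheme, "Bearer") {
		return "", false
	}
	rest = strings.TrimSpace(rest)
	if rest == "" {
		return "", false
	}
	return rest, true
}

func main() {
	headers := []string{"Bearer eyJhbGci", "bearer  eyJhbGci", "eyJhbGci", ""}
	for _, h := range headers {
		tok, ok := bearerToken(h)
		fmt.Printf("%q -> token=%q ok=%v\n", h, tok, ok)
	}
}
```

In the middleware, a false ok would map to the same 401 path as a parse failure, so a missing or malformed header can never fall through to next.ServeHTTP.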
Specifying Requirements with Acceptance Criteria
There is a category difference between a goal and a specification. "Add rate limiting" is a goal. A specification tells the model what done looks like: which endpoints, what limit, what header carries the remaining count, what status code on exhaustion, what the reset window is, and whether limits are per-IP or per-user. The model cannot read your mind; if you don't specify it, you will get the model's default assumption, which may not match yours.
Good acceptance criteria share two properties: they are concrete (names, numbers, status codes, field names) and they are testable (you can write a test that passes if and only if the criterion is met). If you can't write a test for it, the criterion is probably too vague.
| Vague goal | Testable specification |
|---|---|
| Add input validation to the user endpoint | POST /users must return 422 with {"error":"email_invalid"} if email is missing or not RFC 5322 format; return 422 {"error":"name_too_long"} if name > 100 chars |
| Make it faster | The listProducts query must return in <50 ms at p99 with 10k rows; add an index on (category_id, created_at DESC) if missing |
| Handle errors better | Wrap all db.Query calls to log query=<sql> err=<msg> duration=<ms> at ERROR level; propagate the original error up, never swallow it |
| Add caching | Cache GET /products/:id in Redis with TTL 300 s; use key product:<id>; on cache miss, fetch from DB and populate; on 404 from DB, do not cache |
The right column is more words, but it produces code that is measurably correct. The left column produces code that is plausibly correct — which is a very different thing.
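To see why the right column is testable, here is a rough sketch of the first row expressed as code. The function name validateUser, the use of net/mail for the RFC 5322 check, and counting name length in characters rather than bytes are all assumptions for illustration; a real handler would wrap the returned code in the 422 JSON body.

```go
package main

import (
	"fmt"
	"net/mail"
	"unicode/utf8"
)

const maxNameLen = 100

// validateUser returns the error code named in the specification,
// or "" when the input is acceptable. (Hypothetical helper.)
func validateUser(email, name string) string {
	addr, err := mail.ParseAddress(email)
	// ParseAddress also accepts display-name forms such as
	// "Ann <ann@example.com>"; requiring an exact match rejects those.
	if err != nil || addr.Address != email {
		return "email_invalid"
	}
	if utf8.RuneCountInString(name) > maxNameLen {
		return "name_too_long"
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", validateUser("", "Ann"))
	fmt.Printf("%q\n", validateUser("not-an-email", "Ann"))
	fmt.Printf("%q\n", validateUser("ann@example.com", "Ann"))
}
```

Each clause of the specification maps to one branch and one test case, which is exactly the property the vague version lacks.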
Ask for the Plan First
One of the most reliable techniques for non-trivial tasks is a two-step sequence: first ask the model to outline its approach, then ask it to implement. This catches design problems before they get embedded in code, and it forces the model to reason about the problem rather than pattern-match to the nearest boilerplate.
The planning prompt is usually short: "Before writing any code, outline the steps you'll take to implement X. List any assumptions you're making. Note any edge cases I should be aware of." Read the plan. If the plan is wrong — wrong approach, wrong library, misunderstood requirement — correct it before implementation. A two-minute plan review saves a twenty-minute debugging session.
## Step 1 — Plan prompt
I need to add distributed rate limiting to our Go API gateway.
Requirements:
- 100 req/min per API key, sliding window
- Limits stored in Redis; gateway pods are stateless
- On limit exceeded: 429 with Retry-After header (seconds until window resets)
- Keys are passed in X-API-Key header
Before writing any code, outline:
1. The algorithm you'll use (token bucket? sliding log? fixed window?)
2. The Redis data structure and key schema
3. The middleware interface in Go
4. Any edge cases (key missing, Redis down, clock skew)
## Step 2 — Implementation prompt (after reviewing the plan)
The plan looks good. Implement it.
Use go-redis v9. The middleware should be chainable with our existing
http.Handler chain. Do not introduce a global singleton — accept a
*redis.Client as a parameter so it can be injected in tests.
Planning-first is especially valuable for agentic tasks where the model will execute multiple steps autonomously. An agent that starts implementing immediately can go deep down a wrong path before you notice. An agent that surfaces a plan lets you redirect at the cheapest possible moment.
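A plan is easier to review if you have a reference point for what the chosen algorithm does. The sketch below is an in-memory sliding-log limiter, offered only as a stand-in for reasoning about the behaviour. It is not the Redis-backed version the prompt asks for, and the names Limiter, NewLimiter, Allow, epoch and window are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Limiter is an in-memory sliding-log rate limiter: it remembers the
// timestamp of every accepted request per key within the window.
type Limiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	hits   map[string][]time.Time
}

// NewLimiter builds a limiter; a limit below 1 is treated as 1.
func NewLimiter(limit int, window time.Duration) *Limiter {
	if limit < 1 {
		limit = 1
	}
	return &Limiter{limit: limit, window: window, hits: make(map[string][]time.Time)}
}

// Allow reports whether a request for key at time now is within the limit.
// When it is not, the second value is how long until the oldest recorded
// request leaves the window (the basis for a Retry-After header).
func (l *Limiter) Allow(key string, now time.Time) (bool, time.Duration) {
	l.mu.Lock()
	defer l.mu.Unlock()

	cutoff := now.Add(-l.window)
	kept := l.hits[key][:0] // filter in place, dropping expired hits
	for _, t := range l.hits[key] {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	if len(kept) >= l.limit {
		l.hits[key] = kept
		return false, kept[0].Add(l.window).Sub(now)
	}
	l.hits[key] = append(kept, now)
	return true, 0
}

// Demo values, chosen only to keep the example deterministic.
var epoch = time.Unix(0, 0)

const window = time.Minute

func main() {
	l := NewLimiter(2, window)
	for i := 0; i < 3; i++ {
		ok, retry := l.Allow("key-a", epoch.Add(time.Duration(i)*time.Second))
		fmt.Println(ok, retry)
	}
}
```

Passing the clock in as a parameter keeps the behaviour testable without sleeping. A plan that proposes a fixed window instead would behave differently at window boundaries, and that is the kind of divergence worth catching at review time.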
Few-Shot Examples: Show, Don't Just Tell
Few-shot prompting — providing one or more input/output examples before asking for the real thing — is one of the oldest and most reliable techniques in prompt engineering. For code, it is particularly powerful because code is unambiguous: a single example pins down naming conventions, indentation, error-handling style, return type patterns, and logging format simultaneously, in a way that paragraphs of description never can.
When few-shot pays off most
- Boilerplate with a specific shape — "write CRUD endpoints for the Product model, following the same pattern as these existing User endpoints." Paste the User endpoints as the example.
- Code style you haven't documented — if your team uses a particular error-wrapping pattern or a non-standard logging format, one example is worth a thousand words.
- Data transformations with tricky edge cases — show input data and expected output data; the model infers the mapping including edge cases from the examples.
- Test authoring — show one or two existing tests; the model will match table-driven style, assertion library, setup/teardown patterns exactly.
// Example handler (existing code — paste as the few-shot example)
func (h *Handler) GetUser(w http.ResponseWriter, r *http.Request) {
	id := chi.URLParam(r, "id")
	user, err := h.store.GetUser(r.Context(), id)
	if errors.Is(err, store.ErrNotFound) {
		h.writeError(w, http.StatusNotFound, "user_not_found")
		return
	}
	if err != nil {
		h.writeError(w, http.StatusInternalServerError, "internal_error")
		return
	}
	h.writeJSON(w, http.StatusOK, user)
}
// Prompt after pasting example:
// Following exactly the same pattern above (chi router, h.store,
// h.writeError / h.writeJSON, errors.Is for not-found),
// write GetProduct and DeleteProduct handlers.
The key discipline: paste real examples from your codebase, not invented ones. Invented examples can accidentally introduce conventions you don't actually use.
Specifying Language, Style, and Boundaries
AI coding models are polyglot. Without explicit instruction they will pick the language, library, and style they consider most common for the task. That may not be your language, your library, or your style. Always state these explicitly when they matter.
Language and runtime
Specify the language version when it matters: Go 1.22, Python 3.12 with type annotations, TypeScript 5 strict mode, Java 21 with records and sealed interfaces. Models generally know which features are available in each version and will use or avoid them accordingly, but only if you tell them which version you are on.
Libraries and frameworks
Name the specific library: use pgx/v5 not database/sql, use Zod for validation not Joi, use React Query v5 not SWR. Without this the model will pick whatever it trained on most heavily, which may conflict with your existing dependency tree.
What is out of scope
Boundary constraints are as important as positive requirements. "Do not add any new dependencies," "do not change the public interface of this function," "do not add a database migration — this must be handled at the application layer," "do not touch the test file." These negative constraints prevent the model from "helpfully" restructuring things you didn't ask it to touch — a very common failure mode.
# Explicit constraints prevent "helpful" drift
Add a retry wrapper around the S3 upload call in upload.go.
Constraints:
- Language: Go 1.22
- Use only stdlib (context, time, errors) — do NOT add a retry library
- Max 3 attempts, exponential backoff starting at 100 ms, cap at 2 s
- Do not change the function signature of UploadFile
- Do not modify upload_test.go
- Log each retry attempt at WARN level with attempt=N err=<msg> using
our existing slog.Default() logger
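For reference, a response that respects those constraints might look roughly like the sketch below. The helper name retry, its parameter list, and the simulateUpload demo are assumptions for illustration; the real change would wrap the S3 call inside UploadFile without altering its signature. The demo uses millisecond delays so it runs quickly, where the prompt's values would be 100 ms and 2 s.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"
)

// retry calls fn up to maxAttempts times. Between failed attempts it sleeps
// with exponential backoff that starts at base and never exceeds maxDelay.
func retry(ctx context.Context, maxAttempts int, base, maxDelay time.Duration, fn func(context.Context) error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no sleep after the final failure
		}
		slog.Default().Warn("upload failed, retrying", "attempt", attempt, "err", err)
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return errors.Join(ctx.Err(), err)
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// simulateUpload runs retry against a fake upload that fails the first
// `failures` calls. It exists only to demonstrate the wrapper.
func simulateUpload(failures int) (calls int, err error) {
	err = retry(context.Background(), 3, time.Millisecond, 4*time.Millisecond, func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("transient S3 error")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := simulateUpload(2)
	fmt.Println("calls:", calls, "err:", err)
}
```

Every constraint in the prompt corresponds to something you can point at in the diff: no third-party import, a bounded attempt count, a capped delay, and a WARN log per retry.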
Iterating and Following Up Precisely
Good prompting is a dialogue, not a monologue. The first response is rarely perfect; the question is how to correct it efficiently. The worst approach is to start over with a longer mega-prompt. The best approach is a short, surgical follow-up that names exactly what is wrong.
Anatomy of a good correction
- Name the specific problem — "the retry logic doesn't reset the backoff timer between calls" is actionable; "this isn't quite right" is not.
- Quote the offending code — "on line 23, attempt := 0 is inside the loop; it should be outside." The model can see its own output but a precise quote removes ambiguity.
- State what you want instead — not just what's wrong but what correct looks like.
- Ask only one correction at a time when possible — compound corrections ("fix X, and also Y, and also refactor Z") produce confused diffs where it's hard to verify each part.
One valuable meta-technique: after getting output, ask the model to critique its own work. "What edge cases does this implementation miss?" or "What assumptions did you make that might not hold?" Models are surprisingly good at finding their own holes when asked directly.
Test-Driven Prompting
The most rigorous prompting workflow borrows from TDD: specify the tests first, then ask the model to make them pass. This forces the specification to be precise (tests are unambiguous) and gives the model an automated feedback loop it can use to verify its own output.
## Step 1: Write the tests yourself (or prompt for tests first)
func TestParseISO8601Duration(t *testing.T) {
	cases := []struct{ input string; want time.Duration; wantErr bool }{
		{"PT30S", 30 * time.Second, false},
		{"PT1M30S", 90 * time.Second, false},
		{"P1D", 24 * time.Hour, false},
		{"P1Y", 0, true}, // years not supported
		{"", 0, true},
		{"garbage", 0, true},
	}
	for _, c := range cases {
		got, err := ParseISO8601Duration(c.input)
		if (err != nil) != c.wantErr {
			t.Errorf("%q: wantErr=%v got err=%v", c.input, c.wantErr, err)
		}
		if !c.wantErr && got != c.want {
			t.Errorf("%q: want %v got %v", c.input, c.want, got)
		}
	}
}
## Step 2: Prompt to implement against the tests
// Implement ParseISO8601Duration(s string) (time.Duration, error)
// in duration.go so all cases above pass. Do not add dependencies.
When working with an agentic tool like Claude Code, you can take this further: "run the tests after implementing and iterate until they all pass." The agent closes the feedback loop automatically, and you only review the final diff when tests are green.
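For the TestParseISO8601Duration table above, one implementation that should satisfy every case is sketched here. The regular-expression approach and the choice to support only days, hours, minutes and seconds are assumptions derived from the test cases, not a complete ISO 8601 parser. Fractional values, weeks and overflow checks are deliberately left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// durationRE matches P[nD][T[nH][nM][nS]]. Years and months are rejected
// because they have no fixed length.
var durationRE = regexp.MustCompile(`^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$`)

// ParseISO8601Duration converts a restricted ISO 8601 duration string
// (days, hours, minutes, seconds) into a time.Duration.
func ParseISO8601Duration(s string) (time.Duration, error) {
	m := durationRE.FindStringSubmatch(s)
	// A trailing "T" means a time section with no components, e.g. "PT".
	if m == nil || strings.HasSuffix(s, "T") {
		return 0, fmt.Errorf("invalid or unsupported ISO 8601 duration %q", s)
	}
	units := []time.Duration{24 * time.Hour, time.Hour, time.Minute, time.Second}
	var total time.Duration
	seen := false
	for i, unit := range units {
		if m[i+1] == "" {
			continue // this component was not present
		}
		n, err := strconv.ParseInt(m[i+1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("invalid number in duration %q: %w", s, err)
		}
		total += time.Duration(n) * unit
		seen = true
	}
	if !seen {
		return 0, fmt.Errorf("duration %q has no components", s)
	}
	return total, nil
}

func main() {
	for _, in := range []string{"PT30S", "PT1M30S", "P1D", "P1Y", ""} {
		d, err := ParseISO8601Duration(in)
		fmt.Printf("%q -> %v, err=%v\n", in, d, err)
	}
}
```

Because the tests came first, reviewing this code reduces to two questions: do the tests pass, and do the tests cover what you care about? Inputs such as "P" or "PT" are not in the original table, so adding them would be a sensible follow-up prompt.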
Decomposing Large Changes for the Model
Context windows are finite and attention degrades over long, complex tasks. A change that touches fifteen files, reorganises a data model, and updates three API surfaces is not one prompt — it is five or six. Breaking large changes into focused, independently reviewable steps produces better output and makes each step easier to verify.
A decomposition heuristic
- One data model change per prompt — if you're changing a schema, do that first and verify it before touching anything that depends on it.
- One interface at a time — change the interface definition, then update implementations, then update callers. Each is a separate prompt with the previous output pasted as context.
- Tests before implementation — in each step, write or confirm the tests first, then implement.
- Vertical slices for feature work — implement one endpoint end-to-end (DB → service → handler → test) before moving to the next, rather than doing all handlers then all services.
When the change is genuinely large, write out the decomposition explicitly and paste it into the first prompt: "I'm going to make this change in four steps. Here is step 1. Implement only step 1." This prevents the model from speculatively implementing steps 2–4 and creating a diff you can't review.
Common Anti-Patterns
Understanding failure modes is as useful as understanding best practices. These are the prompting anti-patterns that consistently produce bad output.
The aspirational vague prompt
"Refactor this to be cleaner and more maintainable." This gives the model unlimited latitude to restructure anything it considers suboptimal. You will get extensive changes that may or may not match your conventions, touching code you didn't mean to touch. Be specific about what "cleaner" means: "extract the three nested if-blocks in processPayment into named helper functions; do not change any other logic."
The copy-paste cargo cult
Pasting a large block of code with "fix the bug." Without a description of the symptom, the reproduction case, or which line the error appears on, the model will guess. It may guess correctly or it may "fix" a different part of the code and introduce a regression. Always include the observable failure: the stack trace, the failing test output, the wrong return value.
The missing negative constraint
Asking for new functionality without specifying what must not change. The model will often "improve" adjacent code, rename variables it finds confusing, or add dependencies it considers standard. These unasked-for changes muddy your diff and can introduce subtle breaks. Always include "do not change X" for anything you need to stay stable.
The one-shot mega-prompt
Trying to fully specify a complex feature in a single enormous prompt. This can overload the model's instruction-following capacity: constraints buried deep in a long prompt tend to be underweighted or dropped. For complex work, iterate: prompt for the plan, approve, prompt for step 1, review, continue. The total quality is usually higher even if the total number of turns is larger.
## ❌ Anti-pattern: vague + no context + no constraints
Add pagination to the API.
## ✅ Correct: context + spec + constraints + acceptance criteria
File: handlers/products.go (pasted below)
Current GET /products returns all rows, which causes OOM at scale.
Add cursor-based pagination:
- Query params: limit (int, default 20, max 100) and cursor (opaque string)
- Response: add "next_cursor" field to the existing JSON envelope;
null if no more pages
- Cursor encodes the last row's (created_at, id) as base64 JSON —
do not use offset
- Return 400 {"error":"invalid_limit"} if limit < 1 or > 100
- Do not change the shape of the Product objects in the response
- Do not add any new SQL queries beyond what listProducts already uses;
add the WHERE clause to the existing query
[paste handlers/products.go here]
[paste store/products.go here]
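A specification this precise also tells you what the supporting code should look like, which makes the review faster. Below is a minimal sketch of the cursor and limit handling the prompt describes. The names cursor, encodeCursor, decodeCursor and parseLimit, the int64 ID type, and the URL-safe base64 variant are assumptions for illustration.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
	"time"
)

// cursor identifies the last row of the previous page.
// The int64 ID is an assumption; use whatever type the table uses.
type cursor struct {
	CreatedAt time.Time `json:"created_at"`
	ID        int64     `json:"id"`
}

// encodeCursor serialises the cursor as base64-encoded JSON so that
// clients treat it as an opaque string.
func encodeCursor(c cursor) (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return base64.URLEncoding.EncodeToString(raw), nil
}

// decodeCursor reverses encodeCursor and rejects malformed input.
func decodeCursor(s string) (cursor, error) {
	var c cursor
	raw, err := base64.URLEncoding.DecodeString(s)
	if err != nil {
		return c, fmt.Errorf("invalid cursor: %w", err)
	}
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, fmt.Errorf("invalid cursor: %w", err)
	}
	return c, nil
}

var errInvalidLimit = errors.New("invalid_limit")

// parseLimit applies the default of 20 and enforces the 1..100 range.
func parseLimit(raw string) (int, error) {
	if raw == "" {
		return 20, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 || n > 100 {
		return 0, errInvalidLimit
	}
	return n, nil
}

func main() {
	c := cursor{CreatedAt: time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC), ID: 42}
	s, _ := encodeCursor(c)
	back, err := decodeCursor(s)
	fmt.Println(s, back.ID, back.CreatedAt.Equal(c.CreatedAt), err)

	_, err = parseLimit("500")
	fmt.Println(err)
}
```

The vague version of the prompt leaves every one of these decisions (offset or cursor, default page size, error shape) to the model's defaults.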
Accepting the first output without review
Strictly speaking this is a workflow anti-pattern rather than a prompting one, but it is worth naming here because it renders all the above techniques moot. Every AI-generated code block should be read, understood, and consciously accepted. If you cannot explain what a function does, you are not ready to merge it. The model is your pair programmer, not your code reviewer.
Building a Prompting Intuition Over Time
Prompting is a skill that compounds. Engineers who have been using AI coding tools for a year write dramatically better prompts than they did at the start, because they have internalised which techniques close which gaps and which failure modes to pre-empt.
A few habits that accelerate this learning curve:
- Keep a prompt log — note prompts that worked unusually well or badly, and what made the difference. You will start to see patterns in what causes failure.
- Read your diffs carefully — every time you catch a subtle error in AI output, trace it back to what the prompt was missing. That trace is a prompting lesson.
- Experiment on safe ground — use test files or throwaway branches to try new prompting strategies before using them on production-critical changes.
- Share good prompts with your team — a shared library of high-quality prompts for common tasks (writing migration scripts, generating gRPC handlers, updating OpenAPI specs) raises the team floor, not just the individual ceiling.
The best mental model for prompt engineering: you are writing a specification that will be executed by a highly capable but completely naive contractor. They will do exactly what you say, interpret ambiguity in whatever way seems most common, and never ask for clarification. Write specifications accordingly — unambiguous, complete, and with explicit constraints on what not to do.
Prompt engineering for code is really specification engineering. The techniques — context, acceptance criteria, planning-first, few-shot examples, decomposition, precise follow-up — all serve one goal: reducing the model's uncertainty so its considerable capability is aimed precisely at your actual problem. Master the specification, and the code quality follows.
Why does context matter more than model size in AI coding tools? The context window is finite; the model can only reason about what's in it. Giving it the right files and error messages produces better output than a larger model working with vague prose.
What is planning-first prompting and why does it help? Asking the model to outline its approach before coding surfaces design errors at the cheapest possible moment — before they get embedded in implementation — and forces the model to reason rather than pattern-match.
What is the most dangerous AI coding anti-pattern? Accepting the first output without review. The model produces plausible-looking code even when it has subtly violated your constraints, and "plausible but wrong" is harder to catch than an obvious error.