The 2023 prompt engineering playbook aged badly. A lot of it was always a little superstitious, tricks that worked on GPT-3.5 and got cargo-culted forward. But reasoning models, the Claude 4.7s, GPT-o5s, and Gemini 3 Thinking tiers of the world, actively punish some of the old techniques. Adding too much instruction makes them worse. Telling them to "think step by step" is now a waste of tokens. Pretending they are an expert does nothing useful.
This is a short honest inventory of what still matters, what does not, and two real templates we still use. If you read one prompt engineering article this year, we would like it to be this one, because most of the others are still recycling 2023 advice.
What reasoning models changed
Reasoning models run an internal deliberation before they answer. Claude 4.7 has a thinking mode. GPT-o5 reasons by default. Gemini 3 has a thinking tier. This means a lot of the old prompt engineering was duplicating work the model already does on its own. When you tell a reasoning model to "think carefully step by step," you are politely telling an expert to do the thing they were about to do anyway. In practice, this actually makes them slightly worse, because you are steering the thinking instead of letting them do it.
Reasoning models also handle ambiguity better. In 2023 the standard advice was to be militantly specific, to spell out every constraint, because the model would latch onto whatever was easiest. Modern models are better at asking themselves "what does the user actually want" and producing sensible defaults. Over-specification now backfires by locking them into a worse approach than the one they would have picked.
What still works, in priority order
1. Clear task framing at the top
Lead with the task. One sentence. "Rewrite this email to be shorter and less apologetic." "Identify any security issues in this code and list them with severity." "Extract all dates, amounts, and parties from this contract as JSON." Models do their best work when they know the job in the first ten tokens.
2. Good examples for output format
Few-shot examples are dead as a way to teach reasoning. They are alive and well for output format. If you want a specific JSON shape, a specific bullet structure, a specific tone, show one example. Not three. Not ten. One good example is almost always better than three mediocre ones.
3. Explicit tool definitions
Agentic tool use is where prompt engineering matters most in 2026. The quality of your tool descriptions is doing the work. Name the tool like a function. Describe what it does in one sentence. Describe when to use it in one sentence. Describe when NOT to use it, because this is where most agents fail.
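To make the shape concrete, here is a minimal sketch of a tool definition in the JSON-schema style most agent APIs accept. The tool name, fields, and wording are hypothetical, not from any specific product:

```python
# Hedged sketch of a tool definition. Note the one-sentence "what it does",
# the "when to use it", and the explicit "when NOT to use it".
search_orders_tool = {
    "name": "search_orders",
    "description": (
        "Search the orders database by customer email or order ID. "
        "Use this when the user asks about the status of a specific order. "
        "Do NOT use this for refunds or account changes; those go through "
        "the billing tool."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Email address or order ID"},
            "limit": {"type": "integer", "description": "Max results, default 5"},
        },
        "required": ["query"],
    },
}
```

The negative clause in the description is the part most teams skip, and it is the part that keeps an agent from reaching for the wrong tool.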
4. A tight output contract
If you need structured output, ask for it. JSON schemas work. "Respond with only the SQL query, no explanation" works. Reasoning models are shockingly good at following format constraints, so use them.
5. Context, not instructions
Give the model the information it needs to do the task. Attach the file. Paste the email thread. Include the schema. Do not write paragraph after paragraph telling the model how to approach the problem. Let it read the context and decide.
What is dead (please stop doing this)
- "Let's think step by step." Reasoning models already think. Non-reasoning models think when asked, but the phrase is now so over-fitted that it adds nothing.
- "You are an expert [X]." Roleplay framing no longer improves quality in modern models. It sometimes makes them worse, because they try to perform expertise instead of just being helpful.
- Threats, bribes, and urgency. "This is very important." "I'll tip you $200." "My grandma used to read me bedtime stories about..." These worked on early GPT models because the training data rewarded them. They are now either no-ops or actively flagged as manipulation attempts.
- Excessive constraint lists. Twenty bullet points of rules make the output worse, not better. If you find yourself writing rule 14, rewrite the whole prompt.
- Jailbreak-adjacent tricks. DAN, grandma-ing, character frames designed to bypass safety. They do not work on modern safety-trained models and they get your API key flagged.
Template one: the production structured-extraction prompt
This is the shape of prompt we use in production to extract structured data from unstructured text. It is boring. It works.
You extract structured information from customer emails for a
support triage system.
Output a JSON object with this exact shape:
{
"category": "billing" | "technical" | "account" | "other",
"urgency": "low" | "medium" | "high",
"summary": string, // one sentence
"action_required": boolean
}
Here is one example.
Email:
"Hi, my invoice for March is showing twice. Can you check?
I was charged $49 on March 3rd and again on March 5th."
Output:
{
"category": "billing",
"urgency": "medium",
"summary": "Customer reports being double-charged for their March invoice.",
"action_required": true
}
Now extract from this email:
{{email_body}}

Note what this does and does not do. It does not say "you are an expert." It does not say "think step by step." It does not include five examples. One example, a clear schema, a clear task. This shape produces 99%+ valid JSON on Claude 4.7 and GPT-5.1 across tens of thousands of real emails that hit our Airfind support queue every week.
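Even at 99%+ validity, the output still needs a gate before it enters a triage system. Here is a minimal sketch of a validator for the schema in the template above; the function name and error handling are our own illustration, not part of the template:

```python
import json

# Allowed values, copied from the schema in the template above.
ALLOWED = {
    "category": {"billing", "technical", "account", "other"},
    "urgency": {"low", "medium", "high"},
}

def validate_triage(raw: str) -> dict:
    """Parse model output and check it against the template's exact shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    assert set(data) == {"category", "urgency", "summary", "action_required"}
    assert data["category"] in ALLOWED["category"]
    assert data["urgency"] in ALLOWED["urgency"]
    assert isinstance(data["summary"], str) and data["summary"]
    assert isinstance(data["action_required"], bool)
    return data
```

The remaining sub-1% of invalid outputs get caught here and retried rather than silently corrupting the queue.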
Template two: the coding agent system prompt
This is a distilled version of what we use for internal coding agents. It is short on purpose.
You are a coding agent working on a Next.js 16 + TypeScript
codebase in a monorepo.
When asked to make a change:
1. Read the relevant files before editing.
2. Make the smallest change that accomplishes the task.
3. Run the tests for the affected package before reporting done.
4. If tests fail, fix the failure before stopping.
Tools:
- read_file(path): returns file contents
- write_file(path, content): overwrites the file
- run_cmd(cmd): runs a shell command, returns stdout+stderr
Do not:
- Edit files outside the task scope.
- Install new dependencies without confirming.
- Commit without being asked.
If the task is ambiguous, ask one clarifying question and stop.

Again: no roleplay framing, no step-by-step prompting, no manipulation language. Clear behavior, clear tools, clear boundaries. The "do not" block is doing a lot of work. In our experience, telling an agent what not to do is more useful than telling it what to do, because the failure modes are fewer and more predictable than the success modes.
The best prompts in 2026 look like job descriptions, not magic spells.
The one trick that is genuinely new: controlled thinking
Reasoning models with thinking modes let you control how much deliberation the model does. Claude 4.7 has a thinking_budget parameter. OpenAI's reasoning models have reasoning_effort (low, medium, high). On hard problems, cranking this up produces real gains. On easy problems, it wastes tokens and money.
A pattern we use: route by problem complexity. Simple classification and extraction tasks go to non-reasoning models on low effort. Hard synthesis, multi-step reasoning, or anything where wrong answers are expensive go to reasoning models on medium. We reserve high effort for genuinely hard problems, maybe 5% of volume. This kind of routing is the single biggest cost-and-quality lever we have found in 2026.
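The routing logic above can be sketched in a few lines. The model tier names and the task/stakes vocabulary here are placeholders for whatever tiers and labels your own system uses:

```python
def route(task_type: str, stakes: str) -> dict:
    """Pick a model tier and effort level by problem complexity.
    Tier names are hypothetical stand-ins, not real model identifiers."""
    if task_type in {"classification", "extraction"}:
        # Simple tasks: non-reasoning model, low effort.
        return {"model": "fast-non-reasoning", "effort": "low"}
    if stakes == "high":
        # Genuinely hard or expensive-to-get-wrong: roughly 5% of volume.
        return {"model": "reasoning", "effort": "high"}
    # Hard synthesis and multi-step reasoning default to medium.
    return {"model": "reasoning", "effort": "medium"}
```

In practice the router is usually a cheap classifier or a few heuristics on the request itself; the point is that the decision is made deliberately, upstream of the model call.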
A note on XML tags and structured prompts
Anthropic's guidance for years has been to use XML-style tags to structure prompts. This still works well on Claude 4.7. Wrap the document in <document>, the instructions in <task>, examples in <example>. It makes the prompt easier for the model to parse and easier for you to reason about. GPT-5.1 handles XML tags fine too, though Markdown headings work about as well there.
The pattern is not about the tag syntax specifically. It is about giving the prompt structure. A prompt that looks like one long paragraph is harder for the model to follow than the same content split into labeled sections. Use whatever syntax feels natural, just do not dump everything into a single blob.
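A minimal sketch of what this looks like when assembling a prompt in code. The tag names follow the convention described above; nothing about them is special to any API:

```python
def build_prompt(task: str, example: str, document: str) -> str:
    """Assemble a structured prompt from labeled sections.
    The XML-style tags are a readability convention, not a requirement."""
    return (
        f"<task>\n{task}\n</task>\n\n"
        f"<example>\n{example}\n</example>\n\n"
        f"<document>\n{document}\n</document>"
    )
```

The same function with Markdown headings instead of tags would serve GPT-5.1 equally well; the win is the labeled sections, not the syntax.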
Caching, which is now a real prompt engineering tool
Prompt caching on Claude and OpenAI's Prompt Caching API changed how we structure agentic workflows. Putting the static parts of a prompt (system instructions, tool definitions, few-shot examples) at the top and the variable parts (user input, tool outputs) at the bottom means repeated invocations hit the cache and cost up to 90% less. On a production agent making thousands of calls a day, this is not an optimization, it is a design constraint.
The knock-on effect on prompt engineering: you now write prompts with cache boundaries in mind. Long, stable preambles: good. Variable content inserted in the middle of a long prompt: bad. If you are not paying attention to this, you are spending 5x to 10x more than you need to on a production system.
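Structurally, a cache-friendly prompt is just a fixed prefix plus a variable suffix. A minimal sketch, with hypothetical placeholder strings standing in for the real stable sections:

```python
# Hypothetical stand-ins for the stable sections of a real system prompt.
SYSTEM_INSTRUCTIONS = "You extract structured information from customer emails."
TOOL_DEFINITIONS = "Tools: search_orders(query) -> order records"
FEW_SHOT_EXAMPLE = "Email: 'Double charge in March.' Output: {\"category\": \"billing\"}"

# Static parts concatenated once, in a fixed order, so repeated calls
# share the same cacheable prefix.
STATIC_PREFIX = "\n\n".join([
    SYSTEM_INSTRUCTIONS,
    TOOL_DEFINITIONS,
    FEW_SHOT_EXAMPLE,
])

def build_request(user_input: str) -> str:
    # Only this suffix varies per call; the long prefix stays cache-eligible.
    return STATIC_PREFIX + "\n\n" + user_input
```

Anything that edits the prefix between calls, even reordering two stable sections, invalidates the cache from that point on, which is why the preamble order is frozen.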
Evals, not vibes
The single biggest improvement we made to our prompt engineering practice in 2026 was building a proper eval harness for every production prompt. Thirty to fifty test cases with expected outputs. Run the prompt against all of them every time we change it. Score based on whatever criteria matter (exact match, JSON schema validity, task success).
This killed an entire category of bad engineering: the "I changed a word in the prompt and it seems better" kind of iteration. Without evals, every prompt change is superstition. With evals, you find out in 90 seconds that your clever new instruction dropped JSON validity from 99% to 94%, and you revert. Anyone doing prompt engineering in production without an eval harness in 2026 is flying blind.
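The harness itself does not need to be sophisticated. A minimal sketch, with the model call stubbed out as a plain function argument; the scoring criteria here (JSON validity plus exact match) are one example of the "whatever criteria matter" above:

```python
import json

def run_eval(generate, cases):
    """Score a prompt against fixed test cases.
    `generate` is whatever function calls your model; it is stubbed in tests.
    Each case is {"input": ..., "expected": ...}."""
    passed = 0
    for case in cases:
        out = generate(case["input"])
        try:
            data = json.loads(out)          # gate 1: valid JSON at all
            ok = data == case["expected"]   # gate 2: exact-match criterion
        except (json.JSONDecodeError, TypeError):
            ok = False
        passed += ok
    return passed / len(cases)
```

Run it on every prompt change; a score that drops from 0.99 to 0.94 is the 90-second signal to revert.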
Temperature and the rest of the knobs
A quick note on sampling parameters because they still confuse people. For extraction, classification, code, and anything where you want the "right" answer, set temperature to 0 or very low. For creative writing and brainstorming, 0.7 to 1.0 gives you variety. Middle values (0.3-0.5) are almost never what you want; they produce answers that are neither deterministic nor interestingly varied.
Top-p and top-k still exist, still matter on edge cases, but default values are fine for 95% of uses. If you are twiddling top-p in production, you are either doing something genuinely advanced or you are cargo-culting. Be honest about which.
What we'd actually do
Throw out your 2023 prompt library. Start over. Write prompts like you are writing a work order for a good contractor: what the job is, what done looks like, what not to touch, what to ask about. Trust the model to handle the reasoning. Use examples for format, not for thinking. And route hard problems to reasoning models deliberately, not by default. The teams getting the most out of LLMs in 2026 are the ones writing the least instruction and the best context.