TokenMix Research Lab · 2026-04-07

Chain of Thought Prompting: Complete Guide to CoT, Zero-Shot, Few-Shot, and Tree-of-Thought Techniques (2026)

Chain of thought prompting makes AI models show their reasoning step by step before giving a final answer. This simple technique can improve accuracy on math, logic, and multi-step reasoning tasks by 20-70% depending on the model and task. The cost trade-off is real: CoT generates more output tokens, increasing API costs 2-5x per request. This guide covers every CoT variant -- zero-shot, few-shot, self-consistency, and tree-of-thought -- with actual prompts you can copy and test. We include when CoT helps, when it hurts, and how to manage the cost implications. All performance data and pricing tracked by TokenMix.ai as of April 2026.

Quick Comparison: CoT Prompting Techniques

| Technique | Complexity | Accuracy Boost | Token Cost | Best For |
|---|---|---|---|---|
| Standard (no CoT) | Lowest | Baseline | 1x | Simple factual questions |
| Zero-Shot CoT | Low | +15-40% | 2-3x | Quick reasoning improvement |
| Few-Shot CoT | Medium | +30-60% | 3-5x | Domain-specific reasoning |
| Self-Consistency | High | +40-70% | 5-15x (multiple calls) | High-stakes decisions |
| Tree-of-Thought | Highest | +50-80% | 10-30x (many calls) | Complex multi-step problems |

What Is Chain of Thought Prompting?

Chain of thought (CoT) prompting is a technique where you instruct an AI model to reason through a problem step by step before producing a final answer. Instead of asking for a direct answer (which the model might get wrong on complex tasks), you ask the model to show its work.

The few-shot version of the technique was formalized in Google's 2022 research paper by Wei et al., which demonstrated that showing models worked reasoning examples significantly improved performance on arithmetic, commonsense, and symbolic reasoning tasks. Kojima et al. (2022) then showed that simply adding "Let's think step by step" to a prompt captures much of the same benefit with no examples at all. Since then, CoT has become one of the most impactful prompting techniques in production AI systems.

The core principle: When a model generates intermediate reasoning steps, each step conditions the next, making it less likely to skip logic or make arithmetic errors. The model effectively builds a scaffold of reasoning that supports its final answer.

There are four main variants of CoT prompting, each with different complexity, accuracy, and cost profiles. We cover all four in this guide.


Why Step by Step Prompting Works

Step by step prompting works because of how transformer-based language models generate text. Models predict one token at a time, with each new token conditioned on all previous tokens. When you ask for a direct answer, the model must make a single "jump" from question to answer. When you ask for step-by-step reasoning, the model gets to generate intermediate tokens that serve as working memory.

Three reasons CoT improves accuracy:

1. Decomposition. Complex problems become a sequence of simpler sub-problems. A model that struggles with multi-digit multiplication can reliably compute partial products and sum them as separate steps.

2. Error correction. When reasoning is visible, the model can detect inconsistencies in its own chain. If step 3 contradicts step 1, the model is more likely to self-correct than if it jumped directly to an answer.

3. Knowledge activation. Each reasoning step activates relevant knowledge in the model's parameters. A step like "First, I need to identify what type of problem this is" primes the model to retrieve relevant patterns for that problem type.
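The decomposition idea in point 1 can be made concrete. The sketch below (illustrative code, not from any paper) breaks a multi-digit multiplication into the same partial products a CoT chain would write out as separate steps:

```python
# Illustrative sketch: decomposing 47 x 36 into partial products,
# mirroring how a CoT chain turns one hard step into several easy ones.

def partial_products(a: int, b: int) -> list[int]:
    """Split b into place-value parts and multiply each part by a."""
    products = []
    place = 1
    while b > 0:
        digit = b % 10
        products.append(a * digit * place)  # e.g. 47 x 6, then 47 x 30
        b //= 10
        place *= 10
    return products

steps = partial_products(47, 36)  # [282, 1410]
total = sum(steps)                # 1692, same as 47 * 36
```

Each partial product is easy in isolation; the final answer is just their sum, which is exactly why spelling the steps out reduces arithmetic slips.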

When does CoT improve performance the most?

Research across multiple studies (Wei et al. 2022, Kojima et al. 2022, Wang et al. 2023) shows consistent patterns:

| Task Type | CoT Improvement | Explanation |
|---|---|---|
| Arithmetic/Math | +30-70% | Decomposition prevents calculation errors |
| Multi-step logic | +25-60% | Each step builds on previous reasoning |
| Commonsense reasoning | +15-35% | Activates relevant world knowledge |
| Code debugging | +20-40% | Forces systematic error analysis |
| Simple factual recall | 0-5% | No reasoning needed; CoT adds noise |
| Creative writing | -5 to +5% | Reasoning can inhibit fluency |

The pattern is clear: CoT helps most when the task requires multi-step reasoning. It helps least (or hurts) when the task is simple recall or creative generation.


Zero-Shot CoT: The Simplest Approach

Zero-shot CoT is the easiest CoT technique. You add a single instruction to your prompt -- no examples needed.

The trigger phrase: "Let's think step by step" or "Think through this step by step."

Example prompt:

A farmer has 3 fields. Field A produces 120 kg of wheat per hectare.
Field B produces 95 kg per hectare. Field C produces 140 kg per hectare.
Field A is 5 hectares, Field B is 8 hectares, Field C is 3 hectares.
What is the total wheat production?

Think through this step by step.

Model response (with zero-shot CoT):

Step 1: Calculate production for each field.
- Field A: 120 kg/ha x 5 ha = 600 kg
- Field B: 95 kg/ha x 8 ha = 760 kg
- Field C: 140 kg/ha x 3 ha = 420 kg

Step 2: Sum total production.
600 + 760 + 420 = 1,780 kg

The total wheat production is 1,780 kg.

Variations that work:

| Trigger Phrase | When to Use |
|---|---|
| "Let's think step by step" | General reasoning |
| "Break this down step by step" | Complex multi-part problems |
| "Show your reasoning" | When you need auditable logic |
| "Think carefully before answering" | When accuracy matters more than speed |
| "First, identify the key information. Then solve." | When the problem has irrelevant details |

Accuracy impact: Zero-shot CoT typically improves accuracy by 15-40% on reasoning tasks compared to standard prompting. Kojima et al. (2022) found that simply adding "Let's think step by step" improved accuracy from 17.7% to 78.7% on the MultiArith benchmark using text-davinci-002.

Token cost impact: Expect 2-3x more output tokens per response. The reasoning steps add 100-500 tokens depending on task complexity. At GPT-4.1 mini pricing ($1.60/M output tokens), this adds roughly $0.0002-0.0008 per request.
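In code, zero-shot CoT is nothing more than appending a trigger phrase to the question. A minimal sketch (the helper name and trigger list are illustrative, not a library API):

```python
# Minimal sketch: wrapping any question with a zero-shot CoT trigger.
# The function name and trigger phrases are illustrative placeholders.

TRIGGERS = {
    "general": "Let's think step by step.",
    "audit": "Show your reasoning.",
    "careful": "Think carefully before answering.",
}

def zero_shot_cot(question: str, style: str = "general") -> str:
    """Append a CoT trigger phrase to a plain question."""
    return f"{question.strip()}\n\n{TRIGGERS[style]}"

prompt = zero_shot_cot("What is 17% of 240?")
# prompt now ends with "Let's think step by step."
```

The resulting string is sent as the user message to any chat model; no examples or special parameters are needed.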


Few-Shot CoT: Teaching by Example

Few-shot CoT takes zero-shot CoT further by providing 2-5 worked examples before the actual question. The model learns the reasoning pattern from your examples and applies it to new problems.

Example prompt (2-shot CoT):

Solve each problem by showing your reasoning step by step.

Q: A bookstore sold 45 books on Monday and 62 books on Tuesday.
Each book costs $12. What is the total revenue?
A: Step 1: Total books sold = 45 + 62 = 107 books.
Step 2: Revenue = 107 x $12 = $1,284.
The total revenue is $1,284.

Q: A factory produces 230 units per day for 5 days, then 180 units
per day for 3 days. What is the total production?
A: Step 1: Production in first period = 230 x 5 = 1,150 units.
Step 2: Production in second period = 180 x 3 = 540 units.
Step 3: Total = 1,150 + 540 = 1,690 units.
The total production is 1,690 units.

Q: A delivery driver makes 8 stops. At each stop, she delivers
3 large packages and 5 small packages. Large packages weigh 12 kg
and small packages weigh 3 kg. What is the total weight delivered?
A:
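For reference, the answer a model should reach on the third (unanswered) question can be checked with quick arithmetic:

```python
# Checking the expected answer to the delivery question above:
# 8 stops, each with 3 large (12 kg) and 5 small (3 kg) packages.

stops = 8
weight_per_stop = 3 * 12 + 5 * 3   # 36 + 15 = 51 kg per stop
total_weight = stops * weight_per_stop
print(total_weight)  # 408 kg
```

A correct CoT response should land on 408 kg via the same two steps.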

Why few-shot CoT outperforms zero-shot:

  1. Format control. Your examples define exactly how the model structures its reasoning. Want numbered steps? Show numbered steps. Want a final summary statement? Include one in your examples.
  2. Domain adaptation. Examples from your specific domain (medical, legal, financial) prime the model to use domain-appropriate reasoning patterns.
  3. Consistency. Few-shot examples produce more consistent output formats across multiple runs, making it easier to parse programmatically.

Accuracy impact: Few-shot CoT typically improves accuracy by 30-60% over standard prompting. Wei et al. (2022) demonstrated that few-shot CoT improved PaLM 540B from 17.9% to 58.1% on the GSM8K math benchmark.

Cost considerations: Few-shot examples add to your input tokens. Each example might be 50-150 tokens. With 3-5 examples, you add 150-750 input tokens to every request. At GPT-4.1 mini pricing ($0.40/M input), that is negligible. But the longer output (reasoning steps) adds 3-5x to output costs.

Best practice: Use 2-3 examples for simple tasks, 4-5 for complex ones. More than 5 examples rarely improves performance and significantly increases input token costs.
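Assembling a few-shot CoT prompt is mechanical enough to automate. A sketch, with the Q/A format and function name as illustrative assumptions you would adapt to your own parser:

```python
# Sketch: building a few-shot CoT prompt from (question, worked answer)
# pairs. The Q/A layout is illustrative; match it to your output parser.

def build_few_shot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    header = "Solve each problem by showing your reasoning step by step.\n"
    shots = "\n".join(f"Q: {q}\nA: {a}\n" for q, a in examples)
    return f"{header}\n{shots}\nQ: {question}\nA:"

examples = [
    ("A store sold 45 books and then 62 books at $12 each. Total revenue?",
     "Step 1: 45 + 62 = 107 books.\nStep 2: 107 x $12 = $1,284.\n"
     "The total revenue is $1,284."),
]
prompt = build_few_shot_prompt(examples, "What is 230 x 5 + 180 x 3 units?")
```

Ending the prompt with a bare "A:" cues the model to continue in the same worked-example format.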


Self-Consistency: Multiple Reasoning Paths

Self-consistency, introduced by Wang et al. (2023), takes CoT further by generating multiple reasoning chains for the same problem and selecting the most common answer through majority voting.

How it works:

  1. Send the same CoT prompt to the model N times (typically 5-20 times) with temperature > 0
  2. Each run produces a different reasoning chain and potentially different final answers
  3. Extract the final answer from each chain
  4. The answer that appears most frequently is selected as the final answer

Why it works: Different reasoning paths may make different intermediate errors, but the correct final answer tends to be reached more often than any specific wrong answer. Majority voting filters out random errors.

Accuracy impact: Self-consistency adds 5-15% accuracy on top of few-shot CoT. Wang et al. showed improvement from 58.1% to 74.4% on GSM8K with 40 reasoning paths. Even 5 paths provides meaningful improvement.

Cost impact: This is where costs escalate significantly. N=5 means 5x the API calls. N=20 means 20x. At GPT-4.1 mini pricing, a single self-consistent query with 5 paths and 500 output tokens each costs roughly 5x a standard CoT query. For high-stakes applications (medical diagnosis, financial analysis), the accuracy gain justifies the cost. For high-volume tasks, it usually does not.

When to use: Reserve self-consistency for high-stakes, low-volume decisions where accuracy is worth the cost premium. Not appropriate for high-throughput production pipelines.


Tree-of-Thought: Structured Exploration

Tree-of-thought (ToT) prompting, introduced by Yao et al. (2023), extends CoT from a single linear chain into a branching tree of reasoning possibilities. The model explores multiple reasoning branches, evaluates each, and prunes unpromising paths before continuing.

How it differs from self-consistency: Self-consistency samples several complete, independent reasoning chains and votes only on their final answers. ToT instead branches within a single solve attempt, evaluating partial reasoning states as it goes and pruning weak branches before spending tokens to complete them.

Conceptual structure:

Problem
  |
  +-- Step 1a (promising) --> Step 2a --> Solution A (best)
  |                            |
  |                            +--> Step 2b (dead end, pruned)
  |
  +-- Step 1b (promising) --> Step 2c --> Solution B (viable)
  |
  +-- Step 1c (unpromising, pruned early)

How to implement ToT in practice:

You can approximate ToT with a single prompt by instructing the model to generate multiple approaches, evaluate them, and pursue the best one. A more rigorous implementation uses multiple API calls -- generate candidates, evaluate in a separate call, continue the best ones.
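The branch-evaluate-prune loop is essentially a beam search. In this sketch the expander and scorer are toy stand-ins (a numeric puzzle with a distance-to-target heuristic); in a real ToT system both would be model calls:

```python
# Sketch of the ToT loop: expand candidate steps, score them, keep only
# the best few (the beam), repeat. Expander and scorer are toy stand-ins;
# a real implementation would back both with model calls.

def tree_of_thought(root, expand, score, beam_width=3, depth=5):
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # prune unpromising branches
    return max(frontier, key=score)

# Toy task: reach 10 starting from 1 using "+1" and "*2" moves.
TARGET = 10

def expand(state):
    value, path = state
    return [(value + 1, path + ["+1"]), (value * 2, path + ["*2"])]

def score(state):
    return -abs(state[0] - TARGET)  # closer to target = better

best_value, best_path = tree_of_thought((1, []), expand, score)
print(best_value)  # 10
```

The pruning is what separates ToT from self-consistency: weak branches are discarded mid-search instead of being run to completion and outvoted at the end.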

Accuracy impact: ToT shows the strongest improvements on planning and search problems -- tasks where exploring multiple paths is inherently valuable. Yao et al. reported solving the Game of 24 at 74% accuracy vs 4% with standard CoT.

Cost impact: ToT is the most expensive technique. A single problem may require 10-30 API calls for the branching, evaluation, and continuation steps. TokenMix.ai data shows that teams using ToT typically spend 15-30x more per query than with standard prompting. This limits practical use to high-value, low-volume scenarios.


Head-to-Head: CoT Techniques Compared

| Dimension | Zero-Shot CoT | Few-Shot CoT | Self-Consistency | Tree-of-Thought |
|---|---|---|---|---|
| Setup effort | Add one phrase | Write 2-5 examples | Write examples + voting code | Custom orchestration code |
| Math accuracy (GSM8K) | ~55-65% | ~65-75% | ~75-85% | ~80-90% |
| Logic accuracy | +15-30% | +25-50% | +35-60% | +45-70% |
| Output tokens | 2-3x base | 3-5x base | (3-5x) x N paths | 10-30x base |
| API calls | 1 | 1 | N (5-20) | 10-30+ |
| Latency | 2-3x base | 2-3x base | Nx base (parallel OK) | 10-30x base |
| Implementation | Trivial | Simple | Moderate | Complex |
| Production readiness | High | High | Medium | Low |

When Chain of Thought Prompting Helps vs Hurts

CoT is not universally beneficial. Understanding when to use it and when to skip it is crucial for both accuracy and cost optimization.

When CoT Helps

Multi-step math and arithmetic. Any problem requiring 2+ calculation steps benefits significantly from CoT. The accuracy gap between CoT and direct prompting widens with problem complexity.

Logical reasoning and deduction. Syllogisms, constraint satisfaction, and "if-then" chains. CoT forces the model to verify each logical step.

Word problems with irrelevant information. CoT helps models identify relevant details and ignore distractors.

Code debugging and analysis. Asking a model to trace through code step by step catches bugs that direct prompting misses.

Decision-making with multiple criteria. When weighing pros and cons across several dimensions, CoT ensures each dimension is explicitly considered.

When CoT Hurts or Does Not Help

Simple factual retrieval. "What is the capital of France?" Adding CoT wastes tokens and can introduce unnecessary hedging.

Creative writing. CoT can make creative output feel mechanical. The reasoning steps interfere with the flow of creative generation.

Classification tasks. For sentiment analysis, topic classification, or yes/no questions, CoT often does not improve accuracy but doubles token cost.

Very easy problems. If the model already achieves 95%+ accuracy without CoT, adding it wastes tokens with minimal benefit.

High-throughput, low-stakes tasks. When you process 100,000 customer support tickets for basic categorization, the 2-5x token cost of CoT rarely justifies the marginal accuracy improvement.


Cost Implications: More Reasoning, More Tokens

CoT prompting directly increases your API bill because the model generates more output tokens. Here is what that looks like in practice.

Cost comparison per 1,000 requests (GPT-4.1 mini pricing):

| Technique | Avg Output Tokens | Output Cost/1K Requests | Total Cost (incl. input) |
|---|---|---|---|
| Standard | 150 tokens | $0.24 | ~$0.30 |
| Zero-Shot CoT | 400 tokens | $0.64 | ~$0.70 |
| Few-Shot CoT | 500 tokens | $0.80 | ~$1.00 |
| Self-Consistency (5x) | 2,500 tokens (5 x 500) | $4.00 | ~$4.50 |
| Tree-of-Thought | 5,000+ tokens | $8.00+ | ~$9.00+ |

Assumes 100 input tokens per request at $0.40/M. Output at $1.60/M.
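The output-cost column is simple arithmetic, reproduced here with the article's assumed GPT-4.1 mini rates:

```python
# Reproducing the output-cost column above. Rates are the article's
# assumed GPT-4.1 mini prices: $0.40/M input, $1.60/M output.

OUTPUT_PER_M = 1.60  # dollars per million output tokens

def output_cost_per_1k(avg_output_tokens: int) -> float:
    """Output-token cost for 1,000 requests, in dollars."""
    return avg_output_tokens * 1_000 * OUTPUT_PER_M / 1_000_000

print(output_cost_per_1k(150))    # 0.24  (standard)
print(output_cost_per_1k(400))    # 0.64  (zero-shot CoT)
print(output_cost_per_1k(2500))   # 4.0   (self-consistency, 5 x 500)
```

Swapping in your own model's rate makes it easy to estimate the CoT premium before committing to a technique.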

Cost-saving strategies when using CoT:

  1. Route selectively. Only apply CoT to problems that need it. Use a classifier to determine complexity, then route simple questions to standard prompting and complex ones to CoT. TokenMix.ai's routing capabilities can automate this model selection.

  2. Use cheaper models with CoT. CoT on GPT-4.1 mini often outperforms standard prompting on GPT-5.4 -- at 6x lower cost. Test this on your specific task before defaulting to expensive models.

  3. Truncate reasoning in production. If you only need the final answer, instruct the model to put its reasoning in a specific tag and extract only the answer. You still pay for the reasoning tokens, but downstream processing is simpler.

  4. Cache CoT results. If you see repeated or similar questions, cache the full CoT response. TokenMix.ai's built-in caching handles this automatically for identical prompts.
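Strategy 1 (selective routing) can start as a cheap heuristic. The signals and thresholds below are illustrative placeholders; a production router would use a trained classifier or a cheap model call:

```python
# Heuristic sketch of selective CoT routing. Signals and thresholds are
# illustrative placeholders, not a production-grade classifier.

import re

MULTI_STEP_WORDS = {"then", "each", "total", "per", "if"}

def needs_cot(question: str) -> bool:
    """Crude complexity signal: several numbers or multi-step keywords."""
    numbers = len(re.findall(r"\d+", question))
    keywords = len(set(re.findall(r"[a-z]+", question.lower())) & MULTI_STEP_WORDS)
    return numbers >= 3 or keywords >= 2

def build_prompt(question: str) -> str:
    """Add the CoT trigger only when the question looks multi-step."""
    if needs_cot(question):
        return question + "\n\nThink through this step by step."
    return question

print(needs_cot("What is the capital of France?"))  # False
```

Simple questions pass through untouched and keep their 1x token cost; only questions that look multi-step pay the CoT premium.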


Which Models Handle CoT Best?

Not all models benefit equally from chain of thought prompting. Larger models and dedicated reasoning models show the strongest improvements.

| Model | CoT Effectiveness | Notes |
|---|---|---|
| GPT-5.4 | Excellent | Strong baseline; CoT adds 15-25% on hard tasks |
| Claude Opus 4.6 | Excellent | Extended thinking mode is built-in CoT |
| o4-mini / o3 | Built-in | Reasoning models use internal CoT automatically |
| GPT-4.1 mini | Good | CoT on mini often matches non-CoT on larger models |
| Gemini 3.1 Pro | Good | Strong math/code improvement with CoT |
| DeepSeek R1 | Built-in | Dedicated reasoning model with internal CoT |
| DeepSeek V4 | Good | Significant improvement on math/logic |
| Llama 4 Maverick | Moderate | MoE architecture shows variable CoT benefit |
| Small models (<7B) | Limited | CoT can actually hurt very small models |

Key insight: Dedicated reasoning models (o4-mini, DeepSeek R1, Claude with extended thinking) have CoT built into their inference process. You pay for the reasoning tokens automatically. For these models, explicit CoT prompting is redundant.

For standard chat models, CoT prompting is most valuable on mid-tier models (GPT-4.1 mini, DeepSeek V4) where it can close the performance gap with more expensive flagship models at a fraction of the cost.

You can test CoT effectiveness across all these models through TokenMix.ai, which provides unified access to 155+ models with consistent pricing and response tracking.


How to Choose: CoT Decision Guide

| Your Situation | Recommended Technique | Why |
|---|---|---|
| Simple Q&A or classification | No CoT | Adds cost without improving accuracy |
| General reasoning improvement, low effort | Zero-Shot CoT | One phrase, immediate improvement |
| Domain-specific tasks (medical, legal, financial) | Few-Shot CoT | Domain examples improve accuracy significantly |
| High-stakes decisions, accuracy critical | Self-Consistency | Majority voting catches reasoning errors |
| Complex planning or search problems | Tree-of-Thought | Explores multiple paths, best for hard problems |
| Using o4-mini, DeepSeek R1, or Claude extended thinking | No explicit CoT needed | Reasoning is built into the model |
| High-volume production pipeline | Zero-Shot CoT or None | Cost scales linearly; self-consistency too expensive |
| Budget-constrained | Zero-Shot CoT on cheap model | CoT on GPT-4.1 mini beats standard on GPT-5.4 for many tasks |

Conclusion

Chain of thought prompting is the single most impactful prompt engineering technique for reasoning tasks. Zero-shot CoT -- adding "Think step by step" to your prompt -- takes zero effort and improves accuracy by 15-40%. Few-shot CoT, where you provide worked examples, pushes improvements to 30-60%.

The advanced techniques -- self-consistency and tree-of-thought -- deliver the highest accuracy but at 5-30x the cost. Use them for high-stakes, low-volume decisions. For production pipelines, zero-shot or few-shot CoT on a cost-efficient model (GPT-4.1 mini or DeepSeek V4 via TokenMix.ai) delivers the best accuracy-per-dollar ratio.

The key insight most teams miss: CoT on a cheap model often outperforms standard prompting on an expensive model. Before upgrading to GPT-5.4 or Claude Opus, try adding CoT to GPT-4.1 mini. You might get better results at 6x lower cost.

Track your CoT experiments across models with TokenMix.ai -- one API key, 155+ models, and built-in cost tracking to measure exactly what each technique costs per accuracy point.


FAQ

What is chain of thought prompting in simple terms?

Chain of thought prompting is telling an AI model to show its reasoning step by step before giving a final answer. Instead of asking for a direct answer, you add a phrase like "Think step by step." This helps the model avoid errors on math, logic, and complex reasoning tasks by breaking them into smaller, manageable steps.

Does chain of thought prompting work with all AI models?

CoT works best with larger models (30B+ parameters). Mid-tier models like GPT-4.1 mini and DeepSeek V4 show strong improvements. Very small models (under 7B parameters) may not benefit and can actually perform worse with CoT. Dedicated reasoning models like o4-mini and DeepSeek R1 have CoT built in, so explicit prompting is unnecessary.

How much does CoT prompting cost compared to standard prompting?

Zero-shot CoT typically costs 2-3x more per request due to additional output tokens. Few-shot CoT costs 3-5x more. Self-consistency (5 paths) costs 5x more. Tree-of-thought can cost 10-30x more. The cost increase comes from the model generating more output tokens for reasoning steps.

What is the difference between zero-shot and few-shot chain of thought?

Zero-shot CoT uses a simple trigger phrase like "Think step by step" with no examples. Few-shot CoT provides 2-5 worked examples showing the reasoning pattern you want. Few-shot is more accurate (+30-60% vs +15-40%) but requires more effort to set up and adds input token costs for the examples.

When should I NOT use chain of thought prompting?

Skip CoT for simple factual questions, classification tasks, creative writing, and high-throughput low-stakes processing. CoT adds cost without meaningful accuracy improvement on tasks that do not require multi-step reasoning. If your model already achieves 95%+ accuracy on a task, CoT wastes tokens.

What is tree-of-thought prompting and how is it different from chain of thought?

Tree-of-thought (ToT) extends chain of thought from a single linear reasoning chain into a branching tree. The model explores multiple reasoning paths, evaluates each, prunes unpromising ones, and continues only the best branches. ToT achieves the highest accuracy but costs 10-30x more than standard CoT due to multiple API calls for branching and evaluation.


Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: TokenMix.ai, Wei et al. 2022 - Chain-of-Thought Prompting, Yao et al. 2023 - Tree of Thoughts