TokenMix Research Lab · 2026-04-22

GPT-5.4 Thinking Beats Human on OSWorld: 75% Desktop Agent 2026


GPT-5.4 Thinking, OpenAI's test-time compute variant released alongside GPT-5.4 in March 2026, officially surpassed human-level performance on OSWorld-Verified at 75.0% in April benchmarks. Human baseline: ~72%. OSWorld-Verified measures real desktop computer use — clicking buttons, filling forms, navigating apps, installing software — in a live VM environment. GPT-5.4 Thinking's win matters because it's the first frontier model to cross the human baseline on this benchmark, which means production-grade computer-use agents become practical for the first time. This article explains what OSWorld measures, how test-time compute enables 75%, and what developers can actually build today. TokenMix.ai exposes GPT-5.4 Thinking via standard API and tracks per-request test-time compute costs for enterprise deployments.

Confirmed vs Speculation: GPT-5.4 Thinking Facts

| Claim | Status | Source |
|---|---|---|
| GPT-5.4 Thinking scored 75.0% on OSWorld-Verified | Confirmed | OpenAI benchmark page |
| Human baseline ~72% | Confirmed | OSWorld methodology docs |
| Uses test-time compute | Confirmed | OpenAI technical report |
| Integrated into GPT-5.4 Thinking API | Confirmed | API docs |
| Premium pricing vs standard GPT-5.4 | Confirmed (higher output rate) | Pricing page |
| First model to surpass human on OSWorld | Confirmed | Independent reproductions |
| Production-ready for autonomous desktop work | Partial — works but expensive | — |

Bottom line: genuine benchmark milestone. Practical deployment is cost-gated more than capability-gated.

What OSWorld-Verified Actually Tests

OSWorld is a benchmark suite evaluating computer-use agents on real desktop tasks:

| Task category | Example |
|---|---|
| File operations | "Find all PDFs modified this week, copy to Archive folder" |
| Email tasks | "Reply to the invoice email from Acme with approval" |
| Web navigation | "Book a flight from NYC to Tokyo for next Tuesday, aisle seat" |
| Document editing | "Open report.docx, update date to today, convert to PDF, email to boss" |
| Software installation | "Install PostgreSQL, create a database named 'analytics'" |
| Data extraction | "Scrape top 10 reviews from this page into a spreadsheet" |

The "Verified" subset filters out ambiguous tasks where human judges disagreed on what "correct" means. Remaining tasks have unambiguous success criteria.

Methodology: agent runs in a live VM, performs the task by emitting mouse/keyboard actions (or via screen-reading + command API), success measured by automated checks.
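OSWorld's real verifiers aren't reproduced here, but a success check for the PDF-archiving task in the table above might look like the following sketch (`check_pdf_archive_task` and its logic are assumptions, not the benchmark's actual code):

```python
import time
from pathlib import Path

def check_pdf_archive_task(home: Path, archive: Path, days: int = 7) -> bool:
    """Automated success check for 'copy all PDFs modified this week to
    Archive': every recently modified PDF under home must exist in archive."""
    cutoff = time.time() - days * 86_400
    recent = [p for p in home.rglob("*.pdf")
              if p.stat().st_mtime >= cutoff and archive not in p.parents]
    return all((archive / p.name).is_file() for p in recent)
```

Checks like this are what makes the "Verified" subset scoreable without human judges in the loop.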

Why 72% human baseline? Even humans fail 28% of OSWorld tasks on first attempt — these tasks are genuinely hard (dialogs close unexpectedly, forms have weird validation, apps crash).

How Test-Time Compute Hits 75%

Standard GPT-5.4 scores ~55% on OSWorld-Verified. GPT-5.4 Thinking hits 75%. The 20 percentage point gap comes from test-time compute — the model spends more inference cycles planning, critiquing, and revising before acting.

Mechanism:

  1. User query arrives
  2. Model enters "thinking" phase — generates 5-50× more internal tokens than standard mode
  3. Multiple candidate plans evaluated internally
  4. Best plan selected, actions emitted
  5. After each action, observe result, re-evaluate

This mirrors human deliberation for complex tasks. Cost: significantly more tokens used per user-facing response.
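The numbered steps above amount to best-of-n plan selection. This toy stand-in (the actual mechanism is hidden inside the model) shows the shape:

```python
def think_then_act(candidates, score_fn):
    """Test-time compute in miniature: rather than emitting the first plan,
    the model generates several candidates internally ("thinking" tokens),
    self-scores each, and only the winning plan produces visible actions."""
    scored = [(score_fn(plan), plan) for plan in candidates]  # hidden deliberation
    return max(scored)[1]                                      # best plan emitted

# Toy scoring rule: prefer the plan with the fewest UI actions.
candidate_plans = [["open app", "click", "type", "save"],
                   ["open app", "type", "save"]]
best = think_then_act(candidate_plans, score_fn=lambda p: -len(p))
```

Every discarded candidate still gets billed, which is exactly where the 5-50× token overhead comes from.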

Token breakdown example (same OSWorld task):

| Mode | Input tokens | "Thinking" tokens (hidden) | Output tokens (user-visible) | Total billed |
|---|---|---|---|---|
| GPT-5.4 Standard | 2,500 | 0 | 800 | 3,300 |
| GPT-5.4 Thinking | 2,500 | 8,000-40,000 | 1,200 | 11,700-43,700 |
Thinking tokens are billed as output tokens at $5/M. A single complex OSWorld task can cost $0.12-0.65 in Thinking mode vs $0.02 on standard.
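The per-task range follows from the token table once you account for a task spanning several observe-act model calls. In this sketch, the input rate and the three-calls-per-task figure are assumptions; only the $5/M output rate and the token counts come from the text above:

```python
OUTPUT_RATE = 5 / 1_000_000    # $5/M, applied to thinking + visible output tokens
INPUT_RATE = 1.25 / 1_000_000  # hypothetical input rate, not from the source

def task_cost(input_tok, thinking_tok, output_tok, calls):
    """Cost of one OSWorld task spanning `calls` observe-act model calls."""
    per_call = input_tok * INPUT_RATE + (thinking_tok + output_tok) * OUTPUT_RATE
    return per_call * calls

low = task_cost(2_500, 8_000, 1_200, calls=3)    # ≈ $0.15
high = task_cost(2_500, 40_000, 1_200, calls=3)  # ≈ $0.63
```

Under those assumptions the computed range ($0.15-0.63) lands inside the quoted $0.12-0.65.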

GPT-5.4 Thinking vs Claude Opus 4.7 Computer Use

Both models support computer use for desktop automation. Head-to-head:

| Dimension | GPT-5.4 Thinking | Claude Opus 4.7 (Computer Use) |
|---|---|---|
| OSWorld-Verified | 75.0% | ~68% (est) |
| Terminal-Bench 2.0 | ~60% | 69.4% |
| Latency per action | 5-15s | 3-8s |
| Cost per task | $0.12-0.65 | $0.08-0.40 |
| Vision resolution | Standard GPT-5.4 | 3.75MP |
| Ecosystem maturity | Early | More mature (released 2024) |
| Platform | API + Playwright | API + Claude Desktop |

Each model has its own specialization, and many production agents route between them based on task type.
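Such a router can be as simple as a lookup keyed on task type. The categories below are illustrative, grounded only in the head-to-head table (GPT-5.4 Thinking leads on OSWorld GUI tasks; Opus 4.7 leads on Terminal-Bench):

```python
def pick_model(task_type: str) -> str:
    """Toy router: send desktop/GUI work to GPT-5.4 Thinking and
    terminal-heavy work to Claude Opus 4.7. The task categories are
    illustrative, not an official scheme."""
    gui_tasks = {"desktop", "web_navigation", "document_editing",
                 "software_installation"}
    if task_type in gui_tasks:
        return "gpt-5.4-thinking"
    return "claude-opus-4.7"   # terminal, scripting, and everything else
```

A real router would also weigh latency and per-task cost, where Opus 4.7 wins on both.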

The Cost: Test-Time Compute Is Expensive

Cost modeling for a production desktop agent:

Scenario: customer support agent that automates account tasks

Daily cost: $2,400
Monthly cost: $72,000

Compared with a human agent handling the same workload, GPT-5.4 Thinking costs roughly 50% more at this profile.
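The headline figures are consistent with, for example, this volume profile. The task volume and average per-task cost are assumptions; only the $2,400/day, $72,000/month, and 50%-over-human numbers come from the text:

```python
TASKS_PER_DAY = 6_000       # assumed volume, not stated in the source
AVG_COST_PER_TASK = 0.40    # rough midpoint of the $0.12-0.65 per-task range

daily = TASKS_PER_DAY * AVG_COST_PER_TASK   # $2,400/day
monthly = daily * 30                         # $72,000/month
human_monthly = monthly / 1.5                # implied human baseline: $48,000
```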

Where it pays off: tasks where the alternative is expensive skilled labor or around-the-clock throughput (see the production use cases below).

Cost optimization: reserve the higher reasoning-effort settings for genuinely complex tasks and send simple ones to standard GPT-5.4 or a lower effort level.

Real Production Use Cases

Use cases where Thinking's cost is justified:

1. Enterprise data migration automation

Moving data between legacy apps with no API. Thinking handles the weird edge cases (SAP dialogs, Salesforce custom fields). Manual labor alternative costs $80-150/hour consultants.

2. QA test automation for desktop apps

Thinking drives the app through test scenarios, reporting failures with screenshots. Replaces manual QA passes for regression testing. Cost per test run: $5-20 vs human tester at $40-80.

3. Customer onboarding automation

Guiding customers through complex setup (VPN config, software install, account provisioning) via screen-sharing agent. Higher conversion than chat-only onboarding.

4. Accessibility assistance

Helping users with disabilities perform complex computer tasks. Privacy considerations significant but life-changing capability for some users.

5. Compliance documentation

Automating creation of audit trails, screenshots, and compliance reports by navigating enterprise software and capturing evidence.

FAQ

Is 75% on OSWorld actually "better than human"?

On average, yes. Human baseline is ~72% for these specific tasks. But humans handle the 28% they fail differently — we ask for help, we adapt creatively, we know when to stop. GPT-5.4 Thinking at 75% doesn't always know when it's failing; silent wrong answers are a real production risk.

Can I use GPT-5.4 Thinking for coding instead of OSWorld?

Yes, it's accessible via the standard API with the thinking-mode parameter. For coding it scores similarly to standard GPT-5.4 (~58% SWE-bench Verified); the benchmark win is specifically on computer use, not code. For coding, Claude Opus 4.7 or GLM-5.1 are better fits.

What's the API parameter to enable Thinking mode?

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = [{"role": "user", "content": "Install PostgreSQL and create a database named 'analytics'."}]

response = client.chat.completions.create(
    model="gpt-5.4-thinking",  # or pass reasoning_effort="high"
    messages=messages,
)

Reasoning effort levels: "low", "medium", "high", "max". Higher = more thinking tokens = higher cost = better quality on complex tasks.
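One practical pattern is choosing the effort level from estimated task complexity. The thresholds below are a rough heuristic, not an official recommendation:

```python
def effort_for(expected_steps: int) -> str:
    """Map expected UI-action count to a reasoning_effort level.
    Cutoffs are illustrative assumptions, not documented guidance."""
    if expected_steps <= 2:
        return "low"       # simple clicks and lookups
    if expected_steps <= 8:
        return "medium"
    if expected_steps <= 20:
        return "high"
    return "max"           # long multi-app workflows
```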

Will Thinking mode be cheaper in GPT-5.5?

Unlikely to be dramatically cheaper. Thinking compute scales with task complexity, not model version. GPT-5.5 "Spud" pricing is likely flat or modest cuts. Expect Thinking cost structure similar to GPT-5.4 Thinking in the next release.

Can I self-host a computer-use agent?

Not at frontier quality. OSWorld-Verified 75% requires GPT-5.4 Thinking or Claude Opus 4.7. Open-weight alternatives (LLaVA-Next, Gemma 4) top out around 40-50%. For non-production research, open-source agents work; for production, API access is mandatory.

How does this compare to Anthropic's Computer Use?

Claude Opus 4.7's Computer Use scores ~68% on OSWorld — lower than GPT-5.4 Thinking's 75%. Advantages: lower latency, better terminal integration, more mature Claude Desktop UI. Disadvantages: lower raw OSWorld score, limited to Pro/Max subscribers for desktop version.

Is this production-ready for building autonomous agents?

Yes for specific bounded tasks. Use cases with clear success criteria (did the file upload? did the email send?), limited blast radius (can't delete production database), and human-in-loop for approvals work well today. Fully autonomous "just figure it out" agents remain risky — the 25% failure rate compounds over long-running sessions.
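The compounding risk is easy to quantify: if the 75% success rate applies independently per task, a session's zero-failure probability decays exponentially with session length:

```python
def session_success_rate(per_task_rate: float, n_tasks: int) -> float:
    """Probability that an unattended session of n independent tasks
    finishes with zero failures, assuming the benchmark rate per task."""
    return per_task_rate ** n_tasks

ten_task_session = session_success_rate(0.75, 10)  # ≈ 0.056, i.e. about 5.6%
```

Independence is an assumption, but the takeaway holds: bounded tasks with checkpoints, not open-ended sessions.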


Sources

By TokenMix Research Lab · Updated 2026-04-22