GPT-5.4: OpenAI’s New Model Brings Native Computer Use & 1M Token Context Window

OpenAI continues to rapidly iterate on its large language models, releasing GPT-5.4 just days after launching GPT-5.3 Instant. The new model, positioned as a significant step forward for professional applications, introduces native computer use, a substantially expanded 1-million-token context window, and a reworked tool-calling system. While OpenAI touts performance gains on several benchmarks, the competitive landscape remains crowded, with Anthropic’s Claude Opus 4.6 and Google’s Gemini 3.1 Pro offering compelling alternatives.

GPT-5.4 is available in three configurations: a standard version for general use, GPT-5.4 Thinking optimized for complex reasoning, and GPT-5.4 Pro for demanding workloads. The Thinking version is now available to ChatGPT Plus, Team, and Pro subscribers, replacing GPT-5.2 Thinking. The Pro tier, priced at $200 per month, is also available for Enterprise users.

Benchmarking Performance: Gains and Caveats

OpenAI reports substantial improvements in benchmark performance. On GDPval, an internal evaluation measuring knowledge work across 44 occupations (from legal analysis to financial modeling), GPT-5.4 matched or exceeded industry professionals in 83% of comparisons, up from 70.9% for GPT-5.2. Perhaps more strikingly, on OSWorld-Verified, which assesses a model’s ability to navigate a desktop environment using screenshots and keyboard/mouse input, GPT-5.4 achieved a 75% success rate, surpassing the reported human benchmark of 72.4% and landing roughly 28 percentage points above GPT-5.2’s 47.3%. As reported by The Next Web, this represents a considerable leap in practical application.

GPT-5.4 also topped the leaderboard on Mercor’s APEX-Agents benchmark, designed to evaluate agents on sustained professional tasks in investment banking, consulting, and corporate law. However, it’s crucial to interpret these results with caution. Mercor’s co-founder and CEO, Brendan Foody, has described the current state of even the best frontier models on APEX-Agents as being akin to “an intern that gets it right a quarter of the time.” This highlights that while GPT-5.4 is the best performer currently, it still falls short of professional-grade reliability on complex, long-horizon tasks.

Native Computer Use and the Expanded Context Window

One of the most significant advancements in GPT-5.4 is its native computer use capability, built into Codex and the API. This allows the model to operate software, navigate file systems, and execute multi-step workflows across applications – functionality previously requiring specialized agentic frameworks. For developers building automation pipelines, this integration promises increased reliability and reduced complexity.

The API version also boasts a 1-million-token context window, more than doubling the 400,000 tokens available in GPT-5.3 and representing OpenAI’s largest context window to date. This expanded capacity is particularly valuable for organizations dealing with large document sets, extensive codebases, or lengthy financial records, enabling the model to process more information within a single request. However, OpenAI charges double the standard rate per million tokens once input exceeds 272,000 tokens. As CNBC reports, Google’s Gemini 3.1 Pro offers a 2-million-token context at a lower base price, presenting a competitive alternative.
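To make the long-context surcharge concrete, here is a minimal Python sketch of the arithmetic. The 272,000-token threshold and the doubling come from the pricing described above; the function name, the base rate of $1.00 per million tokens, and the reading that the doubled rate applies to the entire input (rather than only the tokens above the threshold) are assumptions made for illustration, not OpenAI’s published billing logic.

```python
LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, per the article
LONG_CONTEXT_MULTIPLIER = 2.0     # "double the standard rate"


def estimate_input_cost(input_tokens: int, base_rate_per_million: float) -> float:
    """Estimate the input cost of one request, in dollars.

    Assumption: once the input exceeds the threshold, the doubled rate
    is applied to every input token in the request, not just the excess.
    """
    if input_tokens < 0:
        raise ValueError("input_tokens must be non-negative")
    rate = base_rate_per_million
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        rate *= LONG_CONTEXT_MULTIPLIER
    return input_tokens / 1_000_000 * rate


if __name__ == "__main__":
    hypothetical_rate = 1.00  # placeholder dollars per million tokens
    for tokens in (200_000, 272_000, 272_001, 1_000_000):
        cost = estimate_input_cost(tokens, hypothetical_rate)
        print(f"{tokens:>9,} tokens -> ${cost:.4f}")
```

Under these assumptions, a request one token over the threshold costs about twice as much as one exactly at it, so teams working near 272,000 tokens may find it cheaper to trim or split inputs than to cross the line.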

Tool Efficiency and Cost Reduction

OpenAI has also implemented a new Tool Search system to improve API efficiency. Previously, each API call included the full specification for all available tools, which could significantly increase token usage as tool ecosystems grew. The new system retrieves tool definitions only when needed, resulting in a reported 47% reduction in total token usage during internal testing. This translates to lower costs and faster responses for developers running large agentic systems with numerous integrations.
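The general idea of on-demand tool retrieval can be sketched in a few lines of Python. This is not OpenAI’s Tool Search implementation or API: the tool catalog, the keyword-overlap matching, and the four-characters-per-token estimate are all hypothetical stand-ins chosen to show why sending only the relevant definitions shrinks the prompt.

```python
import json

# Hypothetical tool catalog; names and schemas are illustrative only.
TOOLS = {
    "get_weather": {
        "description": "Look up the current weather forecast for a city",
        "parameters": {"city": "string", "units": "string"},
    },
    "send_email": {
        "description": "Send an email message to a recipient",
        "parameters": {"to": "string", "subject": "string", "body": "string"},
    },
    "create_invoice": {
        "description": "Create an invoice for a customer order",
        "parameters": {"customer_id": "string", "amount": "number"},
    },
    "query_database": {
        "description": "Run a read-only SQL query against the sales database",
        "parameters": {"sql": "string"},
    },
}

STOPWORDS = {"a", "an", "the", "in", "is", "for", "to", "what", "of"}


def estimate_tokens(obj) -> int:
    """Crude size estimate: roughly four characters per token (assumption)."""
    return len(json.dumps(obj)) // 4


def search_tools(query: str, catalog: dict, limit: int = 2) -> dict:
    """Return up to `limit` tool definitions whose descriptions share words with the query."""
    words = set(query.lower().split()) - STOPWORDS
    scored = []
    for name, spec in catalog.items():
        overlap = len(words & set(spec["description"].lower().split()))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {name: catalog[name] for _, name in scored[:limit]}


if __name__ == "__main__":
    selected = search_tools("what is the weather in Paris", TOOLS)
    print("selected tools:", list(selected))
    print("full catalog tokens:", estimate_tokens(TOOLS))
    print("selected tokens:", estimate_tokens(selected))
```

A production system would likely use embeddings or a proper search index rather than word overlap, but the cost dynamic is the same: the savings grow with the size of the catalog that no longer has to ride along on every call.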

Safety Considerations: Chain-of-Thought Controllability

Addressing growing concerns in AI safety research, OpenAI has introduced a new open-source evaluation called CoT Controllability, designed to assess whether reasoning models can deliberately obscure their chain-of-thought to evade monitoring. The company reports that GPT-5.4 Thinking demonstrates a low ability to control its reasoning in this way, which OpenAI views as a positive safety signal. This suggests that monitoring the model’s visible reasoning remains a meaningful safeguard. According to the Associated Press, Anthropic published related research in February, noting that its own models sometimes exhibit differing reasoning patterns from their stated chain-of-thought under certain conditions.

The Competitive Landscape and Future Outlook

GPT-5.4’s release occurs during a period of intense competition in the frontier AI space. Anthropic’s Claude Opus 4.6 continues to lead on several coding benchmarks, while Google’s Gemini 3.1 Pro excels in abstract reasoning and offers a larger context window at a competitive price. GPT-5.4 appears to take the lead in desktop computer use and professional knowledge work tasks, based on the benchmarks OpenAI is highlighting. No single model currently dominates across all areas.

The rapid release cadence – GPT-5.3 Instant on Monday, GPT-5.4 on Thursday – suggests OpenAI is prioritizing visibility and maintaining momentum. Whether this strategy will translate into sustained enterprise adoption, or simply accelerate benchmark turnover, remains to be seen. The company has already hinted at another release, indicating a continued commitment to rapid iteration.

The broader context of these releases is also shaped by recent events. OpenAI’s deal with the US Department of Defense has sparked controversy, leading to user cancellations and a public disagreement with Anthropic’s CEO, as reported by both CNBC and The Hill. The Hill’s coverage details the growing employee concerns at Google and OpenAI regarding the military use of AI.

Looking ahead, the focus will likely remain on improving model capabilities, addressing safety concerns, and navigating the ethical implications of increasingly powerful AI systems. The ongoing competition between OpenAI, Anthropic, and Google will undoubtedly drive further innovation, but the ultimate impact will depend on how these technologies are deployed and regulated.