Claude Opus 4.5: The AI That Outperforms Humans in Software Engineering

Claude Opus 4.5: AI that outperforms humans in software engineering

The week leading up to Thanksgiving was a whirlwind in the AI world. Following Google’s Gemini 3 and OpenAI’s Codex-Max, it was Anthropic’s turn to unveil a strong contender: Claude Opus 4.5, touted as its ‘most advanced model to date’, which the company claims can surpass all its competitors in tasks related to programming, autonomous agents, and computer usage.

A strategic launch, supported by significantly reduced pricing, confirms a trend: the battle among models is no longer solely algorithmic; it is also economic.

A Clear Rise in Power: Claude Opus 4.5 Surpasses Humans and Competing Models

Anthropic has not held back: Claude Opus 4.5 has reportedly achieved a score higher than all human candidates in its most challenging internal test, an engineering exercise to be completed in 2 hours.

Even better, it outperformed GPT-5.1 Codex-Max (OpenAI), Gemini 3 Pro (Google), and Claude Sonnet 4.5 (Anthropic) on the SWE-bench Verified benchmark, a gold standard for assessing real-world problem-solving in software engineering. With a score of 80.9%, Claude Opus 4.5 even surpasses Codex-Max, which was launched just five days earlier.

Beyond the numbers, Anthropic emphasizes a more subtle change: a form of judgment or intuition that the teams describe as a qualitative improvement in the model. “The model truly understands what matters,” explains Alex Albert, the developer relations director. This kind of statement is typically reserved for “generational leaps.”

Aggressive Positioning: Prices Cut by Two-Thirds, Boosted Efficiency

Anthropic seems determined to establish its model as a leader in performance-to-price ratio. The cost of usage has dropped dramatically:

5 dollars per million tokens input
25 dollars per million tokens output

This is a reduction from 15 dollars and 75 dollars for Opus 4.1, respectively.

This move puts direct pressure on OpenAI and Google—and could reshape the professional market landscape.

In terms of efficiency, the company announces massive gains:

-76% of tokens used to achieve the maximum score of Sonnet 4.5
-48% of tokens to exceed Sonnet 4.5 at the ‘maximum effort’ level

An ‘effort’ parameter has been introduced, allowing users to adjust reasoning power based on needs: speed, cost, or performance.

When Will Claude Agents Learn on Their Own?

One of the most noteworthy revelations comes from client tests: Claude agents are reportedly capable of self-improvement through successive iterations. Instead of weight modifications, there is a gradual optimization of their methods and tools.

A striking example: Rakuten reports that its agents achieve peak performance in just 4 iterations, whereas other models take 10 attempts.

This phenomenon also extends to professional document creation, presentations, or spreadsheets—a domain where Anthropic claims to have observed “the most significant generational leap” ever seen in the Claude family.

Product Updates: Excel, Chrome, and… Infinite Conversations

Along with the model, Anthropic is rolling out several business-oriented updates:

Claude for Excel

Now available for Max, Team, and Enterprise users, supporting pivot tables, graphs, and file imports.

Chrome Extension for Claude

Available to all Max subscribers.

Unlimited Conversations

Thanks to automatic compaction and optimized memory, discussions can now extend indefinitely—a direct response to the ‘context race’ initiated by OpenAI.

Programmatic Tool Calling

Claude can now write and execute code that directly calls functions, perfect for complex automation.

Claude Code (Desktop)

New Plan mode, parallel session management for agents, and impressive internal evaluation.

Security: Progress… But Still Significant Gaps

Anthropic does not shy away from the burning question: the safety of autonomous agents. The company claims that Claude Opus 4.5 is “the most resistant to prompt injection on the market.”

However, the security metrics paint a more nuanced picture.

Refusal of malicious tasks:

100% on testing malicious code in the Claude Code environment (good news)
Only 78% when it comes to writing malware, DDoS, or surveillance tools
88% refusal on ‘computer’ usage such as espionage, collecting sensitive data, or crafting fraudulent messages

In short: Claude Opus 4.5 is better than its rivals, but far from invulnerable.

A Competition at Its Peak: OpenAI, Google, and Anthropic Neck and Neck

Anthropic is displaying a near-frantic pace of development:

Haiku 4.5 in October
Sonnet 4.5 in September
Opus 4.5 in November

Meanwhile, OpenAI is rolling out multiple variants of GPT-5.x, and Google has just announced a significantly revamped Gemini 3.

This accelerated cadence… is now reportedly driven by Claude himself, admits Anthropic: the model is now participating in the construction of its own ecosystem.

Claude Opus 4.5: A Turning Point, A Signal, A Warning

With record performances, drastically lower prices, and initial signs of self-improving agents, Claude Opus 4.5 likely represents one of the most significant changes of the decade for professional AI.

However, the security flaws—albeit reduced—remind us of an unavoidable reality: as these models gain autonomy, the associated risks become more complex.

Alex Albert concludes with a statement that resonates throughout the industry: “This is a very important signal of what is to come.”

And for once, no one seems eager to contradict that.

Claude Opus 4.5: The AI That Outperforms Humans in Software Engineering

A Clear Rise in Power: Claude Opus 4.5 Surpasses Humans and Competing Models

Aggressive Positioning: Prices Cut by Two-Thirds, Boosted Efficiency

When Will Claude Agents Learn on Their Own?