OpenAI EVMbench Results: How Claude, GPT-5 and Gemini Ranked on Crypto Security
TLDR
OpenAI released EVMbench, a benchmark that tests AI models on finding and fixing smart contract security flaws
Built with Paradigm and OtterSec, it draws on 120 real vulnerabilities from 40 audits
Anthropic’s Claude Opus 4.6 ranked first with an average detect award of $37,824
OpenAI’s GPT-5.2 placed second at $31,623, and Google’s Gemini 3 Pro placed third at $25,112
Crypto hackers stole $3.4 billion in 2025, making AI security tools more pressing
OpenAI has launched a new benchmark called EVMbench, built to test how well AI models can detect, exploit, and fix vulnerabilities in smart contracts.
Introducing EVMbench—a new benchmark that measures how well AI agents can detect, exploit, and patch high-severity smart contract vulnerabilities. https://t.co/op5zufgAGH
— OpenAI (@OpenAI) February 18, 2026
The tool was created alongside crypto investment firm Paradigm and security firm OtterSec. Results were published in a research paper on Wednesday, February 18.
Smart contracts are permanent pieces of code that run on blockchains like Ethereum. They control billions of dollars across lending platforms and decentralized exchanges. Once deployed, they cannot easily be changed, so a single flaw can lead to major losses.
EVMbench used 120 real vulnerabilities pulled from 40 smart contract audits, most sourced from open-source security competitions.
Each AI model was scored using a “detect award,” which estimates the dollar value an AI could theoretically recover by correctly identifying a flaw in a contract.
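As a rough illustration of how a score like this could be tallied, the sketch below assumes each known vulnerability carries a dollar value and that a model is credited with the value of every flaw it correctly identifies. The function name, the vulnerability IDs, and the dollar figures are all hypothetical. They are not taken from the EVMbench paper, and the benchmark's actual scoring method may differ.

```python
def detect_award(vuln_values, detected_ids):
    """Sum the dollar value of every known vulnerability the model flagged.

    Assumption: findings that do not match a known flaw earn nothing.
    """
    return sum(
        value for vuln_id, value in vuln_values.items() if vuln_id in detected_ids
    )


# Hypothetical audit with three known flaws and made-up dollar values.
audit_vulns = {"H-01": 12_000.0, "H-02": 8_500.0, "H-03": 4_000.0}

# Hypothetical model output; "X-99" is not a known flaw, so it adds nothing.
model_findings = {"H-01", "H-03", "X-99"}

award = detect_award(audit_vulns, model_findings)
print(f"Detect award for this audit: ${award:,.0f}")
```

Averaging this per-audit figure across many audits would give a single number per model, which is one plausible reading of the "average detect award" reported in the results.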
How Each AI Model Ranked
Anthropic’s Claude Opus 4.6 took the top spot with an average detect award of $37,824.
OpenAI’s own OC-GPT-5.2 came in second at $31,623. Google’s Gemini 3 Pro placed third at $25,112.
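To put the gaps between the three models in perspective, the short sketch below ranks the reported scores and works out each model's lead over the next one down. The scores are the figures quoted in this article; the variable names are illustrative only.

```python
# Average detect awards in US dollars, as reported in this article.
scores = {
    "Gemini 3 Pro": 25_112,
    "Claude Opus 4.6": 37_824,
    "OC-GPT-5.2": 31_623,
}

# Rank models from highest to lowest average detect award.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Compute each model's dollar lead over the model ranked directly below it.
leads = []
for (name, score), (next_name, next_score) in zip(ranked, ranked[1:]):
    lead = score - next_score
    leads.append(lead)
    pct = 100 * lead / next_score
    print(f"{name} leads {next_name} by ${lead:,} ({pct:.1f}%)")
```

The dollar gaps between first and second place and between second and third are of similar size, though the percentage lead depends on which score is used as the baseline.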
The benchmark tested three core skills: finding security bugs, exploiting those bugs in a controlled setting, and patching the broken code without disrupting the contract.
Why OpenAI Built This Tool
Crypto attackers stole $3.4 billion in 2025, a slight increase from the year before. OpenAI said testing AI performance in “economically meaningful environments” is becoming more important as AI adoption grows.
“Smart contracts routinely secure $100B+ in open-source crypto assets,” OpenAI wrote. “It becomes increasingly important to measure AI capabilities in economically meaningful environments.”
OpenAI also noted it expects AI agents to play a growing role in stablecoin payments. Circle CEO Jeremy Allaire predicted in January that billions of AI agents will be transacting with stablecoins within five years.
What Comes Next
Dragonfly managing partner Haseeb Qureshi posted on X that smart contracts were never designed for human intuition. He said signing large transactions still feels “terrifying” due to threats like drainer wallets, unlike a standard bank transfer.
Qureshi believes AI-managed wallets will eventually handle these risks for everyday users. He compared the pairing to GPS meeting the smartphone.
OpenAI said it hopes EVMbench becomes a long-term standard for tracking AI progress in blockchain security.
For now, Claude Opus 4.6’s top average detect award is the headline result of the published study.
The post OpenAI EVMbench Results: How Claude, GPT-5 and Gemini Ranked on Crypto Security appeared first on Blockonomi.
Filed under: Bitcoin - @ February 19, 2026 12:30 pm