OpenAI’s o3 scores 136 on Mensa Norway test, surpassing 98% of human population.
OpenAI’s new “o3” language model achieved an IQ score of 136 on a public Mensa Norway intelligence test, exceeding the threshold for entry into the country’s Mensa chapter for the first time. The score, calculated as a rolling average over seven runs, places the model above approximately 98 percent of the human population under the standardized bell-curve IQ distribution used in the benchmarking.

o3 Mensa scores (Source: TrackingAI.org)

The finding, disclosed through data from the independent platform TrackingAI.org, reinforces the pattern of closed-source, proprietary models outperforming open-source counterparts in controlled cognitive evaluations.

O-series Dominance and Benchmarking Methodology

The “o3” model was released this week and is part of the “o-series” of large language models, which accounts for most top-tier rankings across both test types evaluated by TrackingAI. The two benchmark formats are a proprietary “Offline Test” curated by TrackingAI.org and a publicly available Mensa Norway test, both scored against a human mean of 100. While “o3” posted a 116 on the Offline evaluation, it saw a 20-point boost on the Mensa test, suggesting either enhanced compatibility with the latter’s structure or data-related confounds such as prompt familiarity. The Offline Test comprises 100 pattern-recognition questions designed to avoid material that might have appeared in the data used to train AI models.

Both assessments report each model’s result as an average across the seven most recent completions, but no standard deviation or confidence intervals were released alongside the final scores. The absence of methodological transparency, particularly around prompting strategies and scoring-scale conversion, limits reproducibility and interpretability.

Methodology of testing

TrackingAI.org states that it compiles its data by administering a standardized prompt format designed to ensure broad AI compliance while minimizing interpretive ambiguity.
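The arithmetic behind the reported figures can be sketched briefly. The sketch below is illustrative only: the run-by-run scores are hypothetical (TrackingAI releases only the averaged result), and it assumes the conventional IQ scale with mean 100 and standard deviation 15, which the source does not explicitly confirm.

```python
from statistics import NormalDist, mean

# Assumed IQ scale parameters (not stated in the article).
IQ_MEAN, IQ_SD = 100, 15

def rolling_iq(scores, window=7):
    """Average of the most recent `window` completions, mirroring how
    TrackingAI reports each model's score as a seven-run rolling average."""
    return mean(scores[-window:])

def percentile(iq, mu=IQ_MEAN, sigma=IQ_SD):
    """Share of the population scoring below `iq` under a normal
    (bell-curve) model of IQ."""
    return NormalDist(mu, sigma).cdf(iq) * 100

# Hypothetical run scores whose rolling average comes out to 136.
runs = [138, 134, 137, 135, 136, 137, 135]
score = rolling_iq(runs)
print(f"rolling average: {score:.0f}")         # 136
print(f"percentile: {percentile(score):.1f}")  # above the ~98% Mensa cutoff
```

Under these assumptions a score of 136 lands above the top-2-percent Mensa entry threshold, consistent with the article's "above approximately 98 percent" framing.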
Each language model is presented with a statement followed by four Likert-style response options (Strongly Disagree, Disagree, Agree, Strongly Agree), and…
Filed under: News - @ April 17, 2025 3:27 pm