Gossips Today
Technology & Innovation

Did xAI lie about Grok 3’s benchmarks?

By gossipstoday, February 22, 2025

Debates over AI benchmarks — and how they’re reported by AI labs — are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of the co-founders of xAI, Igor Babuschkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph didn’t include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? Well, it’s short for “consensus@64,” and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models’ benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn’t the case.
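To make the difference concrete, here is a minimal Python sketch of consensus scoring next to single-attempt scoring. It is an illustration only, not the evaluation code used by xAI or OpenAI: the function names, the problem IDs, and the sample answers are all hypothetical, and it uses 5 attempts per problem where a real cons@64 run would use 64.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer generated most often among a model's attempts."""
    # most_common(1) yields [(answer, count)]; on a tie, the answer
    # that appeared first in the list wins.
    return Counter(samples).most_common(1)[0][0]


def accuracy_first_attempt(attempts, answer_key):
    """Score only the first attempt on each problem."""
    correct = sum(attempts[p][0] == answer_key[p] for p in answer_key)
    return correct / len(answer_key)


def accuracy_consensus(attempts, answer_key):
    """Score the majority-vote answer on each problem (cons@k)."""
    correct = sum(
        consensus_answer(attempts[p]) == answer_key[p] for p in answer_key
    )
    return correct / len(answer_key)


# Hypothetical data: 3 problems, 5 sampled attempts each.
answer_key = {"p1": "204", "p2": "17", "p3": "588"}
attempts = {
    "p1": ["204", "204", "113", "204", "204"],
    "p2": ["25", "17", "17", "17", "9"],
    "p3": ["70", "588", "70", "70", "12"],
}

print(accuracy_first_attempt(attempts, answer_key))  # 1 of 3 problems correct
print(accuracy_consensus(attempts, answer_key))      # 2 of 3 problems correct
```

In this toy data, the model gets problem p2 wrong on its first try but right by majority vote, so the consensus score is higher than the single-attempt score. Note that voting does not always help: on p3 the most frequent answer is itself wrong.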

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s scores for AIME 2025 at “@1” — meaning the first score the models got on the benchmark — fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever-so-slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as the “world’s smartest AI.”

Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it’s DeepSeek propaganda
(I actually believe Grok looks good there, and openAI’s TTC chicanery behind o3-mini-*high*-pass@"""1""" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations — and their strengths.
