
These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

By gossipstoday | February 17, 2025

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.

In a recent study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, like that reasoning models — OpenAI’s o1, among others — sometimes “give up” and provide answers they know aren’t correct.

“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks — even benchmarks released relatively recently — are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased so that models can’t draw on “rote memory” to solve them, explained Guha.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it — that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”

No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions — typically seconds to minutes longer.
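
To picture the mechanics, a benchmark of this kind boils down to a loop that poses each riddle to a model and checks its answer. The sketch below is a minimal illustration under assumed conventions: a hypothetical JSONL file of question/answer pairs and a generic ask_model callable. It is not the researchers’ actual evaluation harness.

```python
# Minimal sketch of a puzzle-benchmark evaluation loop.
# The dataset path, field names, and ask_model() helper are hypothetical
# stand-ins, not the study's real code.
import json
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so answers compare loosely."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def evaluate(puzzles_path: str, ask_model) -> float:
    """Return a model's accuracy over a JSONL file of riddles."""
    correct = total = 0
    with open(puzzles_path) as f:
        for line in f:
            item = json.loads(line)  # expects {"question": ..., "answer": ...}
            prediction = ask_model(item["question"])
            correct += normalize(prediction) == normalize(item["answer"])
            total += 1
    return correct / total if total else 0.0
```

A real harness would likely need looser answer matching than exact string comparison, since puzzle answers can be phrased in several valid ways.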

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random — behavior this human can certainly relate to.

The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempting to tease out a better one, and failing again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.
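
One way to surface this kind of failure mode would be to scan reasoning traces for explicit give-up language. The snippet below is purely illustrative; the marker phrases and transcript format are assumptions, not part of the study.

```python
# Illustrative helper for flagging "give up" behavior in reasoning transcripts.
# Marker phrases and transcript format are assumptions, not the study's method.
GIVE_UP_MARKERS = ("i give up", "i'm stuck", "frustrat")

def flag_give_ups(transcripts: list[str]) -> list[int]:
    """Return indices of transcripts that contain a give-up marker."""
    return [
        i for i, trace in enumerate(transcripts)
        if any(marker in trace.lower() for marker in GIVE_UP_MARKERS)
    ]
```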

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

Image: R1 getting “frustrated” on a question in the Sunday Puzzle challenge set. (Image credits: Guha et al.)

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be enhanced.

Image: The scores of the models the team tested on the NPR benchmark. (Image credits: Guha et al.)

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are — and aren’t — capable of.”
