
On Sundays, NPR puzzle master Will Shortz, also the crossword puzzle editor of The New York Times, challenges thousands of listeners with a long-running segment called the Sunday Puzzle. Although the puzzles are designed to be solvable without extensive prior knowledge, they often stump even skilled participants.

This characteristic makes them an appealing way to assess the problem-solving capabilities of artificial intelligence (AI) systems, according to some experts.

In a recent study, researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor built an AI benchmark from riddles featured in Sunday Puzzle episodes. They uncovered some surprising insights, including that reasoning models such as OpenAI’s o1 sometimes “give up” and provide answers they know to be incorrect.

Arjun Guha, a computer science faculty member at Northeastern and co-author of the study, explained to TechCrunch, “Our goal was to develop a benchmark with problems that can be understood by humans using only general knowledge.”

The AI industry currently faces a benchmarking problem. Many tests used to evaluate AI models probe skills that are irrelevant to the average user, such as proficiency on PhD-level math and science questions. Moreover, even recently released benchmarks are quickly becoming outdated.

The advantage of using a public radio quiz game like the Sunday Puzzle as a benchmark, Guha explained, is that it doesn’t require esoteric knowledge, and the challenges are phrased so that models can’t rely on “rote memory” to solve them.

Guha noted, “I believe what makes these problems difficult is that it’s really challenging to make meaningful progress until you solve them, at which point everything falls into place. This requires a combination of insight and a process of elimination.”

No benchmark is perfect, and the Sunday Puzzle has its limitations: it is U.S.-centric and English-only. And because the quizzes are publicly available, models trained on them could in principle “cheat,” although Guha says he has not seen evidence of this.

Guha added, “New questions are released every week, and we can expect the latest questions to be truly unseen. We intend to keep the benchmark fresh and track how model performance changes over time.”

The researchers’ benchmark, consisting of around 600 Sunday Puzzle riddles, shows that reasoning models like o1 and DeepSeek’s R1 significantly outperform other models. Reasoning models thoroughly fact-check themselves before providing results, which helps them avoid common pitfalls that usually trip up AI models. However, this comes at the cost of taking slightly longer to arrive at solutions, typically seconds to minutes longer.
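For readers curious about the mechanics, scoring a benchmark like this largely comes down to posing each riddle to a model and comparing its final answer against the known solution. The sketch below is a minimal illustration of that loop, not the researchers’ actual harness; the dataset file name, the answer-normalization rules, and the ask_model callable are all assumptions made for the example.

# Illustrative sketch only: a minimal accuracy-scoring loop for a riddle
# benchmark. The dataset path, model wrapper, and normalization rules are
# hypothetical and not taken from the study.
import json
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Stone!' matches 'stone'."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def score_benchmark(riddles, ask_model):
    """riddles: list of {'question': str, 'answer': str} dicts.
    ask_model: callable that returns the model's final answer as a string."""
    correct = 0
    for item in riddles:
        prediction = ask_model(item["question"])
        if normalize(prediction) == normalize(item["answer"]):
            correct += 1
    return correct / len(riddles)

if __name__ == "__main__":
    # Hypothetical file of question/answer pairs.
    with open("sunday_puzzle_riddles.json") as f:
        riddles = json.load(f)

    # Plug in any model client here; this stub always answers "unknown".
    accuracy = score_benchmark(riddles, ask_model=lambda q: "unknown")
    print(f"Accuracy: {accuracy:.1%}")

Because Sunday Puzzle answers are typically short words or phrases, exact matching after light normalization is a plausible scoring rule; a more open-ended benchmark would need a more forgiving grader.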

At least one model, DeepSeek’s R1, provides solutions it knows to be incorrect for some Sunday Puzzle questions. R1 will state “I give up” followed by an incorrect answer, behavior that humans can certainly relate to.

The models exhibit other odd behaviors as well, such as giving an incorrect answer, retracting it, and attempting to find a better one, only to fail again. They also get stuck “thinking” indefinitely, give nonsensical explanations for their answers, or arrive at a correct answer right away and then, for no apparent reason, go on to weigh alternative answers.

Guha said, “On hard problems, R1 literally says it’s getting ‘frustrated.’ It’s interesting to see how a model mimics human-like behavior. It remains to be seen how ‘frustration’ in reasoning can impact the quality of model results.”

R1 getting “frustrated” on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The current top-performing model on the benchmark is o1, achieving a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” with a score of 47%. R1 scored 35%. The researchers plan to expand their testing to additional reasoning models, which they hope will help identify areas where these models can be improved.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.

Guha emphasized, “You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge. A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may lead to better solutions in the future. As state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to understand what these models are — and aren’t — capable of.”
