Every Sunday, Will Shortz, The New York Times’ crossword editor and NPR’s longtime puzzlemaster, presents a long-running segment called the Sunday Puzzle, in which he quizzes thousands of listeners with challenging brainteasers. Although the puzzles are designed to be solvable without much specialized knowledge, they are usually difficult even for skilled contestants.

Because of that difficulty, some experts believe the Sunday Puzzle is an effective way to test the limits of AI’s problem-solving abilities.

In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, and the startup Cursor created an AI benchmark from riddles featured in Sunday Puzzle episodes. Their tests surfaced surprising behavior, such as that some reasoning models, including OpenAI’s o1, sometimes “give up” and provide answers they know aren’t correct.
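The article doesn’t describe the team’s evaluation setup in detail, but the basic shape of such a benchmark is straightforward: pose each riddle to a model and compare its final answer against the known solution. The sketch below illustrates that loop in Python under stated assumptions; `query_model`, the answer-normalization rules, and the sample riddle are all hypothetical stand-ins, not the researchers’ actual code or data.

```python
# Minimal sketch of an evaluation loop for a riddle-style benchmark.
# Everything here is illustrative: `query_model` stands in for whatever
# model API you call, and the sample riddle is a generic brainteaser,
# not an item from the researchers' actual Sunday Puzzle benchmark.

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so that 'Short!' matches 'short'."""
    return "".join(
        ch for ch in answer.lower() if ch.isalnum() or ch.isspace()
    ).strip()

def evaluate(riddles, query_model):
    """Return the fraction of riddles answered correctly.

    `riddles` is a list of (question, expected_answer) pairs;
    `query_model` is any callable mapping a prompt string to a text answer.
    """
    correct = 0
    for question, expected in riddles:
        prediction = query_model(f"{question}\nReply with the answer only.")
        if normalize(prediction) == normalize(expected):
            correct += 1
    return correct / len(riddles)

if __name__ == "__main__":
    # A classic riddle, used here purely to demonstrate the interface.
    sample = [
        ("What five-letter word becomes shorter when you add two letters to it?",
         "short"),
    ]
    # Stand-in "model" that always answers correctly, to show the plumbing.
    print(evaluate(sample, lambda prompt: "Short"))  # prints 1.0
```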

“We aimed to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science professor at Northeastern and one of the co-authors of the study, told TechCrunch.

The AI industry currently faces a benchmarking problem. As state-of-the-art models are deployed in more and more settings, it is essential to develop benchmarks whose questions, and results, are accessible to a broader range of researchers.

As Guha put it, “You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge.” Broader accessibility could lead to better models down the line, since more researchers would be able to interpret and analyze the results. And as AI models become increasingly woven into everyday life, it is important that everyone can understand what they can and cannot do.

The scores of the models tested on the benchmark are shown in the accompanying image.

Beyond the specific results, the study underscores the value of benchmarks that a wider range of researchers can examine, enabling a more comprehensive understanding of AI’s problem-solving abilities.

