The first winner of a new AI coding challenge has been announced, and the results set a humbling new bar for AI-powered software engineers.
The Laude Institute, a non-profit organization, announced the winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski, on Wednesday at 5 pm PST. Eduardo Rocha de Andrade, a Brazilian prompt engineer, took the top spot and will receive $50,000. More surprising than the win itself: he claimed the prize by correctly answering only 7.5% of the questions on the test.
Andy Konwinski expressed his satisfaction with the challenge, stating, “We’re glad we built a benchmark that is actually hard. Benchmarks should be challenging if they’re going to be meaningful.” He added, “The scores would be different if the big labs had participated with their largest models. However, that’s the point of the K Prize – it runs offline with limited compute, which favors smaller and open models. I appreciate that aspect, as it levels the playing field.”
Konwinski has pledged to donate $1 million to the first open-source model that can achieve a score higher than 90% on the test.
The K Prize is similar to the well-known SWE-Bench system, as it tests models against flagged issues from GitHub to assess their ability to handle real-world programming problems. However, unlike SWE-Bench, which is based on a fixed set of problems that models can train on, the K Prize is designed to be a “contamination-free version of SWE-Bench.” It uses a timed entry system to prevent any benchmark-specific training. For the first round, models were due by March 12th, and the K Prize organizers built the test using only GitHub issues flagged after that date.
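To make the timed-entry idea concrete, here is a minimal sketch, not the K Prize's actual tooling, of how a contamination-free test set can be assembled: only GitHub issues opened after the model-submission cutoff are eligible, so no submitted model could have trained on them. The `build_test_set` helper and the sample issue data are hypothetical; the March 12 cutoff comes from the article, with the year assumed.

```python
from datetime import datetime, timezone

# Assumed cutoff for first-round model submissions (March 12, per the article;
# year assumed for illustration).
SUBMISSION_DEADLINE = datetime(2025, 3, 12, tzinfo=timezone.utc)

def build_test_set(issues):
    """Keep only GitHub issues opened after the submission deadline.

    `issues` is assumed to be a list of dicts shaped roughly like GitHub API
    issue objects, e.g. {"number": 101, "created_at": "2025-03-20T14:05:00Z"}.
    """
    eligible = []
    for issue in issues:
        # GitHub timestamps end in "Z"; rewrite it as an explicit UTC offset
        # so datetime.fromisoformat can parse it on older Python versions.
        opened = datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))
        if opened > SUBMISSION_DEADLINE:
            eligible.append(issue)
    return eligible

# Hypothetical example: only issue 102 was opened after the cutoff, so only it
# is eligible as a test problem.
issues = [
    {"number": 101, "created_at": "2025-02-28T09:00:00Z"},
    {"number": 102, "created_at": "2025-03-20T14:05:00Z"},
]
print([i["number"] for i in build_test_set(issues)])  # [102]
```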
The top score of 7.5% is significantly lower than the scores achieved on SWE-Bench, which currently shows a 75% top score on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski is unsure whether the disparity is due to contamination on SWE-Bench or the challenge of collecting new issues from GitHub, but he expects the K Prize project to provide answers soon.
“As we conduct more runs of the challenge, we’ll have a better understanding,” he told TechCrunch. “We expect participants to adapt to the dynamics of competing on this every few months.”
It may seem unusual that the top score was so low, given the wide range of AI coding tools already available. However, with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step towards addressing AI’s growing evaluation problem.
Princeton researcher Sayash Kapoor, who recently proposed a similar idea in a paper, stated, “I’m quite optimistic about developing new tests for existing benchmarks. Without such experiments, we can’t determine if the issue is contamination or just targeting the SWE-Bench leaderboard with human intervention.”
For Konwinski, the K Prize is not just about creating a better benchmark, but also an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even achieve more than 10% on a contamination-free SWE-Bench, that’s the reality check for me.”