Introduction to TechCrunch’s AI Newsletter: We’re taking a break, but you can find all our AI coverage, including columns, analysis, and breaking news, on TechCrunch. To get these stories and more in your inbox daily, sign up for our newsletters here.
This week, Elon Musk’s AI startup, xAI, launched its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on approximately 200,000 GPUs, the model outperforms other leading models, including those from OpenAI, on benchmarks for mathematics, programming, and more.
However, what do these benchmarks really tell us?
At TechCrunch, we often report benchmark figures because they’re one of the few standardized ways to measure model improvements in the AI industry. Nevertheless, popular AI benchmarks tend to test for esoteric knowledge and provide aggregate scores that poorly correlate to proficiency on tasks that most people care about.
As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling, there’s an “urgent need for better batteries of tests and independent testing authorities.” AI companies often self-report benchmark results, which makes those results even harder to accept at face value.
“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”
Although there are independent tests and organizations proposing new benchmarks for AI, their relative merit is far from settled within the industry. Some commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.
This debate may continue indefinitely. Perhaps we should instead, as X user Roon suggests, pay less attention to new models and benchmarks unless there are significant AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it leads to some level of AI FOMO.
As mentioned earlier, This Week in AI is going on hiatus. Thank you, readers, for sticking with us through this journey. Until next time.
News

OpenAI attempts to “uncensor” ChatGPT: Max wrote about how OpenAI is revising its AI development approach to explicitly support “intellectual freedom,” no matter how challenging or controversial the topic.
Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, aims to develop tools that make AI work for people’s unique needs and goals.
Grok 3 is released: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and introduced new capabilities for the Grok apps for iOS and the web.
Meta announces LlamaCon: Meta will host LlamaCon, its first developer conference dedicated to generative AI, on April 29. The conference is named after Meta’s Llama family of generative AI models.
AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between 20 organizations to build foundation models for transparent AI in Europe that preserves linguistic and cultural diversity.
Research Paper of the Week
OpenAI researchers created a new AI benchmark, SWE-Lancer, to evaluate the coding capabilities of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks, ranging from bug fixes to technical implementation proposals.
According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, indicating that AI still has a long way to go. Notably, the researchers didn’t benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.
Model of the Week
Chinese AI company Stepfun released an “open” AI model, Step-Audio, that can understand and generate speech in several languages, including Chinese, English, and Japanese. The model lets users adjust the emotion and even the dialect of the synthetic audio it creates, including singing.
Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly closed a funding round worth several hundred million dollars from a host of investors, including Chinese state-owned private equity firms.
Grab Bag
Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and language model capabilities.
The model, DeepHermes-3 Preview, can toggle on and off long “chains of thought” for improved accuracy at the cost of some computational resources. In “reasoning” mode, DeepHermes-3 Preview “thinks” longer for harder problems and shows its thought process to arrive at the answer.
Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.