OpenAI recently introduced its state-of-the-art o3 and o4-mini AI models, which boast impressive capabilities but still grapple with hallucination, the tendency to generate false information. In fact, the new models hallucinate even more than some of OpenAI’s older models.
The problem of hallucinations has long been a significant and stubborn challenge in the field of AI, affecting even the most advanced systems currently available. While previous models have shown marginal improvements in reducing hallucinations, the o3 and o4-mini models seem to be taking a step backward in this regard.
According to OpenAI’s internal testing, the o3 and o4-mini models, which are designed for reasoning, exhibit a higher rate of hallucination compared to the company’s earlier reasoning models, including o1, o1-mini, and o3-mini, as well as its traditional non-reasoning models like GPT-4o.
What’s more concerning is that OpenAI has yet to determine the root cause of this increased hallucination rate.
In the technical report for o3 and o4-mini, OpenAI acknowledges that “more research is needed” to understand why hallucinations are becoming more prevalent as the company scales up its reasoning models. Per the report, o3 and o4-mini perform better in certain areas, such as coding and math-related tasks, but because they make more claims overall, they produce both more accurate claims and more inaccurate, hallucinated ones.
OpenAI’s testing revealed that o3 hallucinated in response to approximately 33% of questions on PersonQA, the company’s internal benchmark for assessing a model’s knowledge about people. This rate is roughly double that of OpenAI’s previous reasoning models, o1 and o3-mini, which had hallucination rates of 16% and 14.8%, respectively. Meanwhile, o4-mini performed even worse on PersonQA, with a hallucination rate of 48%.
Third-party testing conducted by Transluce, a nonprofit AI research lab, also found evidence that o3 tends to fabricate the actions it took in the process of arriving at answers. For instance, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT” and then copied the numbers into its answer, something the model is not capable of doing.
According to Neil Chowdhury, a researcher at Transluce and former OpenAI employee, “Our hypothesis is that the type of reinforcement learning used for o-series models may exacerbate issues that are typically mitigated, but not entirely eliminated, by standard post-training pipelines.”
Sarah Schwettmann, co-founder of Transluce, noted that o3’s hallucination rate may limit its usefulness in certain applications.
Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in its coding workflows and has found it to be superior to competing models. However, Katanforoosh said that o3 tends to hallucinate broken website links, supplying URLs that do not work when clicked.
While hallucinations can facilitate creative thinking and idea generation in models, they also pose significant challenges for businesses that require high accuracy, such as law firms that cannot afford to have models inserting factual errors into client contracts.
One promising approach to improving model accuracy is giving models web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, another of the company’s accuracy benchmarks. Search could plausibly improve the hallucination rates of reasoning models as well, at least in cases where users are willing to expose their prompts to a third-party search provider.
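For readers who want to experiment with this approach, here is a minimal, hypothetical sketch of pairing a model with OpenAI’s hosted web search tool through the Responses API; the model choice, prompt, and tool configuration are illustrative assumptions, not details drawn from OpenAI’s report.

```python
# Minimal sketch (illustrative assumptions): grounding answers with
# OpenAI's hosted web search tool via the Responses API, rather than
# relying solely on what the model recalls from training.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.responses.create(
    model="gpt-4o",                          # example model choice
    tools=[{"type": "web_search_preview"}],  # enable hosted web search
    input="Summarize recent public reporting on hallucination benchmarks.",
)

# output_text collects the model's answer; search-backed responses
# typically include citations to the pages consulted.
print(response.output_text)
```

Note that enabling the tool sends the prompt to the search backend, which is exactly the privacy trade-off mentioned above.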
If the trend of worsening hallucinations with scaled-up reasoning models continues, it will become increasingly urgent to find a solution.
“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” stated OpenAI spokesperson Niko Felix in an email to TechCrunch.
Over the past year, the broader AI industry has pivoted to reasoning models after techniques for improving traditional AI models began showing diminishing returns. Reasoning improves performance on a variety of tasks without requiring massive amounts of computing power and data during training. Yet it also appears that reasoning may lead to more hallucination, presenting a significant challenge that must be addressed.