OpenAI’s Next-Gen Model Hits Performance Wall: Report

OpenAI’s upcoming artificial intelligence model is delivering smaller performance gains than its predecessors, sources familiar with the matter told The Information.

Employee testing reveals Orion achieved GPT-4 level performance after completing only 20% of its training, The Information reports.

The quality increase from GPT-4 to the current version of GPT-5 appears smaller than the jump from GPT-3 to GPT-4.

“Some researchers at the company believe Orion isn’t reliably better than its predecessor in handling certain tasks, according to the (OpenAI) employees,” The Information reported. “Orion performs better at language tasks but may not outperform previous models at tasks such as coding, according to an OpenAI employee.”

While Orion approaching GPT-4-level performance at just 20% of its training might sound impressive, the early stages of AI training typically deliver the most dramatic improvements, with subsequent phases yielding smaller gains.

So, the remaining 80% of training time isn’t likely to produce the same magnitude of advancement seen in previous generational leaps, sources said.

Image: V7 Labs

The limitations emerge at a critical juncture for OpenAI following its recent $6.6 billion funding round.

The company now faces heightened expectations from investors while grappling with technical constraints that challenge traditional scaling approaches in AI development. If these early versions don’t meet expectations, the company’s upcoming fundraising efforts may not generate the same hype as before, a real risk for the for-profit entity Sam Altman reportedly wants OpenAI to become.

The underwhelming results point to a fundamental challenge facing the entire AI industry: the diminishing supply of high-quality training data, at a moment when staying relevant in a field as competitive as generative AI demands constant improvement.

Research published in June predicted that AI companies will exhaust available public human-generated text data between 2026 and 2032, marking a critical inflection point for traditional development approaches.

“Our findings indicate that current LLM development trends cannot be sustained through conventional data scaling alone,” the research paper states, highlighting the need for alternative approaches to model improvement, including synthetic data generation, transfer learning from data-rich domains, and the use of non-public data.

The historical strategy of training language models on publicly available text from websites, books, and other sources has reached a point of diminishing returns, with developers having “largely squeezed as much out of that type of data as they can,” according to The Information.

How OpenAI Is Tackling This Issue: Reasoning vs. Language Models

To tackle these challenges, OpenAI is fundamentally restructuring its approach to AI development.

“In response to the recent challenge to training-based scaling laws posed by slowing GPT improvements, the industry appears to be shifting its effort to improving models after their initial training, potentially yielding a different type of scaling law,” The Information reports.

To achieve this state of continuous improvement, OpenAI is separating model development into two distinct tracks:

The O-Series (which appears to carry the codename Strawberry), focused on reasoning capabilities, represents a new direction in model architecture. These models operate with significantly higher computational intensity and are explicitly designed for complex problem-solving tasks.

The computational demands are substantial, with early estimates suggesting operational costs at six times those of current models. However, the enhanced reasoning capabilities could justify the increased expense for specific applications requiring advanced analytical processing.

This model, if it is indeed Strawberry, is also tasked with generating enough synthetic data to constantly increase the quality of OpenAI’s LLMs.

In parallel, the Orion Models or the GPT Series (considering OpenAI trademarked the name GPT-5) continue to evolve, focusing on general language processing and communication tasks. These models maintain more efficient computational requirements while leveraging their broader knowledge base for writing and argumentation tasks.

OpenAI’s CPO Kevin Weil confirmed this approach during an AMA, saying he expects the two development tracks to converge at some point in the future.

“It’s not either or, it’s both,” he replied when asked whether OpenAI would focus on scaling LLMs with more data or on a different approach using smaller but faster models: “better base models plus more strawberry scaling/inference time compute.”

A Workaround or the Ultimate Solution?

OpenAI’s approach to addressing data scarcity through synthetic data generation presents complex challenges for the industry. The company’s researchers are developing sophisticated models designed to generate training data, yet this solution introduces new complications in maintaining model quality and reliability.

As previously reported by Decrypt, researchers have found that training models on synthetic data is a double-edged sword: while it offers a potential solution to data scarcity, studies have demonstrated measurable model degradation after several training iterations.

In other words, as models train on AI-generated content, they may begin to amplify subtle imperfections in their outputs. These feedback loops can perpetuate and magnify existing biases, creating a compounding effect that becomes increasingly difficult to detect and correct.
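This feedback loop is often called “model collapse.” The toy simulation below (an illustrative sketch, not OpenAI’s actual pipeline) makes the intuition concrete: each “generation” of a model is reduced to a Gaussian fit over samples drawn from the previous generation. When every round trains purely on the prior round’s synthetic output, the distribution’s diversity, measured here as its standard deviation, tends to erode.

```python
import random
import statistics

def train_on_synthetic(generations=300, sample_size=10, seed=0):
    """Toy model-collapse demo: repeatedly refit a Gaussian to its own samples.

    Generation 0 stands in for 'human' data (mean 0, stdev 1). Each later
    generation trains only on samples drawn from the previous generation's
    fitted distribution, i.e. on purely synthetic data.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "human" distribution
    stdevs = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        mu = statistics.mean(samples)     # "retrain" on synthetic samples
        sigma = statistics.stdev(samples)
        stdevs.append(sigma)
    return stdevs

stdevs = train_on_synthetic()
print(f"initial stdev: {stdevs[0]:.3f}, final stdev: {stdevs[-1]:.3f}")
```

Each individual fit looks locally reasonable, yet across generations the spread drifts downward and rare values vanish, which is the statistical analogue of the bias amplification and homogenization described above.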

OpenAI’s Foundations team is developing new filtering mechanisms to maintain data quality, implementing different validation techniques to distinguish between high-quality and potentially problematic synthetic content. The team is also exploring hybrid training approaches that strategically combine human and AI-generated content to maximize the benefits of both sources while minimizing their respective drawbacks.
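The filter-then-mix idea can be sketched in miniature. Everything in the snippet below is a hypothetical illustration: the function name, the quality-scoring callback, the threshold, and the mixing ratio are assumptions for demonstration, not OpenAI’s actual mechanisms.

```python
import random

def build_training_mix(human, synthetic, quality_fn,
                       threshold=0.8, synthetic_share=0.3, seed=0):
    """Assemble a hybrid dataset: filtered synthetic data mixed with human data.

    quality_fn is a stand-in for whatever validation model scores a sample;
    synthetic_share caps the fraction of the final mix that is synthetic.
    """
    rng = random.Random(seed)
    # Keep only synthetic samples that pass the quality gate.
    filtered = [s for s in synthetic if quality_fn(s) >= threshold]
    # Target count so synthetic data makes up at most synthetic_share of the mix.
    n_synth = int(len(human) * synthetic_share / (1 - synthetic_share))
    chosen = rng.sample(filtered, min(n_synth, len(filtered)))
    mix = human + chosen
    rng.shuffle(mix)
    return mix
```

Capping the synthetic share (rather than using all filtered samples) is one simple way to hedge against the degradation risk described above, since human-written data keeps anchoring the distribution.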

Post-training optimization has also gained relevance. Researchers are developing new methods to enhance model performance after the initial training phase, potentially offering a way to improve capabilities without relying solely on expanding the training dataset.

That said, GPT-5 remains in an embryonic state, with significant development work ahead. Sam Altman, OpenAI’s CEO, has indicated that it won’t be ready for deployment this year or next. This extended timeline could prove advantageous, allowing researchers to address current limitations and potentially discover new methods for model enhancement, considerably improving GPT-5 before its eventual release.

Edited by Josh Quittner and Sebastian Sinclair
