On Monday, OpenAI unveiled a new lineup of models, dubbed GPT-4.1, which may add to the existing confusion surrounding the company’s naming conventions.
The GPT-4.1 family, comprising GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, boasts exceptional coding and instruction-following capabilities, according to OpenAI. Although these multimodal models are accessible through OpenAI’s API, they are not available in ChatGPT. With a 1-million-token context window, these models can process approximately 750,000 words at once, surpassing the length of “War and Peace.”
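For developers, that means access goes through the standard API rather than the chatbot. Below is a minimal sketch of a chat-completion request using the official OpenAI Python SDK; it assumes the API model identifiers mirror the published names (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano), so check OpenAI’s documentation for the exact identifiers and parameters.

```python
# Minimal sketch: calling GPT-4.1 through the OpenAI API (the models are not in ChatGPT).
# Assumes the model identifier is "gpt-4.1" and the official `openai` Python SDK is installed;
# verify the exact model names and parameters against OpenAI's API docs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",  # or "gpt-4.1-mini" / "gpt-4.1-nano" for the faster, cheaper variants
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a Python function that parses an ISO 8601 date."},
    ],
)

print(response.choices[0].message.content)
```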
The introduction of GPT-4.1 comes as OpenAI’s competitors, such as Google and Anthropic, intensify their efforts to develop sophisticated programming models. Google’s recently launched Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet, as well as Chinese AI startup DeepSeek’s upgraded V3, have all achieved high rankings on popular coding benchmarks.
The ultimate goal of many tech giants, including OpenAI, is to train AI coding models that can perform complex software engineering tasks. OpenAI’s ambitious objective is to create an “agentic software engineer,” as stated by CFO Sarah Friar during a recent tech summit in London. The company envisions its future models being capable of programming entire apps from start to finish, handling tasks such as quality assurance, bug testing, and documentation writing.
GPT-4.1 represents a significant step towards achieving this goal.
According to an OpenAI spokesperson, “We have optimized GPT-4.1 for real-world applications based on direct feedback, focusing on areas that developers care most about, such as frontend coding, minimizing unnecessary edits, reliably following formats, adhering to response structure and ordering, consistent tool usage, and more. These enhancements enable developers to build agents that excel at real-world software engineering tasks.”
OpenAI claims that the full GPT-4.1 model outperforms its predecessor models, GPT-4o and GPT-4o mini, on coding benchmarks, including SWE-bench. The GPT-4.1 mini and nano models are said to be more efficient and faster, albeit at the cost of some accuracy, with GPT-4.1 nano being the speediest and most affordable model to date.
The pricing for GPT-4.1 is set at $2 per million input tokens and $8 per million output tokens. In contrast, GPT-4.1 mini costs $0.40 per million input tokens and $1.60 per million output tokens, while GPT-4.1 nano is priced at $0.10 per million input tokens and $0.40 per million output tokens.
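As a rough illustration of what those rates mean in practice, here is a back-of-envelope cost calculation built directly from the published per-million-token prices; the request sizes in the example are made-up values for illustration, not OpenAI figures.

```python
# Back-of-envelope API cost estimate using the published per-million-token rates.
# The token counts in the example below are illustrative assumptions, not OpenAI numbers.
PRICES = {                      # (input $/1M tokens, output $/1M tokens)
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of a single request."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# Example: a 50,000-token prompt that produces a 2,000-token reply.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 50_000, 2_000):.4f}")
```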
OpenAI’s internal testing reveals that GPT-4.1, which can generate more tokens at once than GPT-4o (32,768 versus 16,384), scored between 52% and 54.6% on SWE-bench Verified, a human-validated subset of SWE-bench. Those figures are slightly lower than the scores Google and Anthropic report for Gemini 2.5 Pro (63.8%) and Claude 3.7 Sonnet (62.3%) on the same benchmark.
In a separate evaluation, OpenAI assessed GPT-4.1 using Video-MME, a test designed to measure a model’s ability to comprehend video content. GPT-4.1 achieved a notable 72% accuracy on the “long, no subtitles” video category, according to OpenAI.
While GPT-4.1 performs respectably on benchmarks and has a more recent “knowledge cutoff” (June 2024), giving it a better frame of reference for current events, it is worth noting that even today’s most advanced models struggle with tasks that would not pose a challenge to human experts. For instance, numerous studies have shown that code-generating models often fail to fix, and sometimes even introduce, security vulnerabilities and bugs.
OpenAI also acknowledges that GPT-4.1 becomes less reliable (i.e., more prone to mistakes) as the number of input tokens increases. On one of the company’s own tests, OpenAI-MRCR, the model’s accuracy decreased from approximately 84% with 8,000 tokens to 50% with 1 million tokens. Furthermore, GPT-4.1 tends to be more “literal” than GPT-4o, sometimes requiring more specific and explicit prompts.