OpenAI's spring press conference was packed with buzzworthy topics. The market did not get GPT-5, but GPT-4o alone was enough to set off a round of industry speculation: what does ChatGPT entering its "Her" moment mean for domestic large-model companies?
OpenAI demonstrated several short, focused scenarios that let users experience GPT-4o's multimodal understanding, its near-instant responses, and its human-like empathy and expressiveness firsthand: in effect, a real-world replica of the disembodied female-voiced AI assistant in the science-fiction film "Her".


ChatGPT has become "Her", and the technical excitement stems mainly from the dramatic improvement in GPT-4o's response speed in multimodal real-time interaction. For example, its response time to audio input is as short as 232 milliseconds, with an average of 320 milliseconds, close to human response times in conversation. By contrast, the average latency of Voice Mode built on GPT-3.5 and GPT-4 was 2.8 seconds and 5.4 seconds, respectively.
On its official website, OpenAI explained the reasons behind this change. Previously, Voice Mode was a pipeline of three independent models: one simple model transcribed audio into text, GPT-3.5 or GPT-4 took in that text and output text, and a third simple model converted the text back into audio.
In this process, the main source of intelligence, GPT-4 (or GPT-3.5), loses a great deal of information: it cannot directly observe tone, multiple speakers, or background sounds, nor can it output laughter or singing, or express emotion.
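The three-stage pipeline described above can be sketched as follows. The stage functions are hypothetical placeholders, not real OpenAI APIs; the sketch only illustrates where latency accumulates and where non-text information is dropped.

```python
# A minimal sketch of the old Voice Mode pipeline (hypothetical stand-ins,
# not OpenAI's actual components).

def transcribe(audio: bytes) -> str:
    """Stage 1: simple speech-to-text. Tone, speaker identity,
    and background sounds are discarded at this step."""
    return "transcribed text"  # placeholder output

def chat(text: str) -> str:
    """Stage 2: GPT-3.5/GPT-4 sees only plain text."""
    return f"reply to: {text}"  # placeholder output

def synthesize(text: str) -> bytes:
    """Stage 3: text-to-speech. The reply cannot carry
    laughter, singing, or emotional inflection."""
    return text.encode()  # placeholder output

def voice_mode(audio: bytes) -> bytes:
    # Each hop adds latency, and non-text information
    # is lost between stages.
    return synthesize(chat(transcribe(audio)))
```

GPT-4o collapses these three hops into a single network, which is why both the latency and the information loss disappear.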
With GPT-4o, OpenAI trained a new end-to-end model spanning text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. This is likely the core reason the model has improved so markedly in multimodal understanding, generation, and response speed.
In fact, a natively multimodal large model, trained end to end by a single neural network without stitching modalities together, is precisely the direction domestic large-model companies are trying to break through. It can deliver all the advantages GPT-4o currently shows, such as low cost and high performance, which are not only the foundation of product polish but also a precondition for large-scale commercialization.
However, one investor's observation is that domestic large-model companies have not yet reached this stage, even for end-to-end training of audio models alone.
Beyond the natively multimodal architecture, another important factor behind GPT-4o's rapid responses is model size. OpenAI has not disclosed parameter details for GPT-4o or GPT-4 Turbo; the industry has only speculated about the sizes of these new models from API call prices and rumors (for example, that GPT-3.5 Turbo may be a 20B model, versus 175B for GPT-3.5), reasoning by proportion.
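The proportional reasoning mentioned above can be illustrated with a quick calculation. The parameter counts are the rumored figures cited in the article, and the assumption that API price scales linearly with model size is a simplification for illustration, not something OpenAI has confirmed.

```python
# Illustrative only: rumored sizes and a price-proportional-to-size
# assumption, both industry speculation rather than disclosed facts.

gpt35_params_b = 175       # GPT-3.5, rumored, in billions of parameters
gpt35_turbo_params_b = 20  # GPT-3.5 Turbo, rumored, in billions

# If API price scaled linearly with parameter count, Turbo would cost
# roughly this fraction of the original GPT-3.5 price:
price_ratio = gpt35_turbo_params_b / gpt35_params_b
print(f"{price_ratio:.3f}")  # ≈ 0.114, i.e. about 1/9 the price
```

Running the inference in reverse, a steep price cut is read as evidence of a much smaller model, which is how the industry arrives at guesses about GPT-4o's size.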
On that basis, GPT-4o is likely a model well below the rumored 1.8T parameters of GPT-4. This engineering capability, making a model smaller and faster while keeping it just as strong, or even stronger in some dimensions, is precisely the "magic" that makes OpenAI hard for competitors to surpass.
This is also one of the directions the domestic large-model industry is pushing toward. To bring down inference costs, the industry needs to shrink its models, and from the perspective of Scaling Law, this can only be achieved by "scaling up first, then scaling down", so as to balance speed and capability.
As for how to "get stronger while getting smaller" the way GPT-4o has, that is exactly where the technical competition lies.
According to Jiemian News (Interface News) reporters, domestic companies working on foundation models are generally focused on this technical direction, but their judgments differ. Some, for example, believe it is more important to be the first to build a trillion-parameter model and reach GPT-4's level, while others believe that controlling cost and efficiency during development matters just as much. Without exception, though, the industry currently lacks good solutions.
At the product level, the offerings of domestic large-model unicorns, such as Kimi (Kimi+), Wanzhi, and Yueda, already show a degree of multimodal understanding and generation, but they still focus on lightweight applications such as AI search and AI assistants. In real-time voice interaction, they have yet to reach ChatGPT's level.
Meanwhile, MiniMax has added a "Little Conch" role to its newly released Conch AI, offering real-time voice dialogue and directly benchmarking ChatGPT's voice-interaction capability. But judging from the company's demo videos, "Little Conch" still shows a clear gap in response speed, manner of expression, and human-like qualities such as tone and emotion.
The technical challenge GPT-4o poses to the industry, then, is this: can anyone match its speed at the same parameter scale and capability level? And if the speed catches up, can its multimodal real-time interaction across audio, vision, and text be matched as well?
In fact, the cost reductions GPT-4o brings can ultimately only be realized in products and commercialization, which is a precondition for expanding the user base of AI applications.
OpenAI's decision to cut ChatGPT's fees has been widely regarded in the industry as the "right path". The aforementioned investor said, "Expanding the user base, and letting multimodal large models become faster and easier for users first, is definitely the right idea."
After the press conference, Fu Sheng, chairman and CEO of Cheetah Mobile, commented, "OpenAI making its application free shows precisely that applications are what count in artificial intelligence. Every practitioner should strive to make good use of AI."
Looking back, however, OpenAI's release of GPT-4o rather than GPT-5 (or GPT-4.5) still leaves room for the view that large-model technology is cooling.
Fu Sheng said, "If we keep stacking parameters regardless of cost to improve so-called large-model capability, this path is bound to run into trouble. GPT-5.0 is expected to remain out of reach for some time."
Zhu Xiaohu, managing partner of GSR Ventures (Jinshajiang Venture Capital), offered three views on this: first, the technology iteration curve of large models has slowed markedly; second, charging users from the start indicates that both GPT's user numbers and revenue growth have hit a bottleneck, and model companies not deeply bound to major players are basically out of the game; third, AI applications will explode rapidly, and bringing inference costs down by another order of magnitude will put AI in the hands of ordinary users.
Setting aside the survival of large-model companies, the debut of GPT-4o sends two conflicting signals. The good news is that OpenAI may have hit a bottleneck on the road to GPT-5, and a window may have opened for domestic foundation models to accelerate their pursuit. The bad news is that on the application side, the user experience OpenAI has already polished may also take domestic practitioners considerable time to catch up with.
In addition, OpenAI has left the industry no small challenge. The team notes that GPT-4o is its first model trained end to end in this way, so it is still exploring the model's capabilities and limitations, which means the potential of a GPT-4o that has not yet been iterated may extend far beyond what has been shown.
