“In the text-to-image field, the gap between open source and closed source is gradually widening. We hope that our open-source release can narrow this gap.”
On the afternoon of May 14, Tencent announced the latest upgrade of its Hunyuan text-to-image model and released it as open source. The product manager, Lu Qinglin, made the remark above to Jiemian News and other media at a small briefing.
According to Tencent, the model is the industry's first open-source, Chinese-native text-to-image model built on the DiT (Diffusion with Transformer) architecture. It supports bilingual input and understanding in Chinese and English and has 1.5 billion (1.5B) parameters. The core idea of DiT is to apply the Transformer architecture to diffusion models in order to improve the quality and efficiency of image generation.

The DiT architecture adopted by this model is the same one used by OpenAI's breakthrough product Sora. It not only supports text-to-image generation but can also serve as a foundation for multimodal visual generation such as video. The model has now been published on Hugging Face, a platform dedicated to training natural language processing (NLP) models, and on GitHub, the code hosting platform for software developers. The complete release, including model weights, inference code, and model algorithms, is free for commercial use by enterprises and individual developers.
According to evaluation data provided by Tencent, the model outperforms the open-source Stable Diffusion models and is, overall, at the forefront of the field.
“We will also try models with larger parameter counts, which will take more compute and more time,” said Lu Qinglin. “They are not ready yet, but work on them is already under way.”
Why choose the DiT architecture?
The core of this upgrade to the Hunyuan text-to-image model is the adoption of a new DiT architecture. In this respect Hunyuan is in line with Sora and Stable Diffusion 3, both of which are diffusion models based on the Transformer architecture. The advantage of this design is that it scales better to large parameter counts.
Photo taken by Cui Peng
In the past, visual diffusion models were mainly based on the U-Net architecture, which first compresses (downsamples) the image and then expands (upsamples) it again, a process that often loses information. As the number of parameters grows, training this architecture becomes increasingly unstable, and fewer and fewer models use it.
Lu Qinglin said that a diffusion model built on the Transformer architecture avoids compressing the information before processing it, and can significantly improve the model's quality and efficiency.
The new architecture offers stronger semantic expressiveness, retains more information, and can scale to larger parameter counts. “When we later scale up to 5B or even 10B, we are confident that we can train the large model properly,” Lu Qinglin stressed.
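The contrast described above can be illustrated with a toy sketch. It is not Tencent's implementation, and all function names (`patchify`, `unpatchify`, `avg_pool_down`, `nearest_up`) are hypothetical. It assumes a DiT-style model turns a grid of values into a sequence of patch tokens by pure reshaping, while a U-Net-style path pools the grid to a lower resolution and later enlarges it. Real U-Nets add skip connections and learned layers, and real DiT models work on latent feature maps, so this only shows why a reshaping step is reversible and a pooling step is not.

```python
def patchify(grid, p):
    """Split an H x W grid into flattened p x p patches (one token per patch)."""
    h, w = len(grid), len(grid[0])
    assert h % p == 0 and w % p == 0, "grid size must be divisible by patch size"
    tokens = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            tokens.append([grid[i + di][j + dj] for di in range(p) for dj in range(p)])
    return tokens


def unpatchify(tokens, h, w, p):
    """Inverse of patchify: place every token back at its patch position."""
    grid = [[0.0] * w for _ in range(h)]
    patches_per_row = w // p
    for idx, token in enumerate(tokens):
        i, j = (idx // patches_per_row) * p, (idx % patches_per_row) * p
        for k, value in enumerate(token):
            grid[i + k // p][j + k % p] = value
    return grid


def avg_pool_down(grid, p):
    """Toy downsampling step: average each p x p block into a single value."""
    h, w = len(grid), len(grid[0])
    return [
        [sum(grid[i + di][j + dj] for di in range(p) for dj in range(p)) / (p * p)
         for j in range(0, w, p)]
        for i in range(0, h, p)
    ]


def nearest_up(grid, p):
    """Toy upsampling step: repeat each value p times along both axes."""
    return [[grid[i // p][j // p] for j in range(len(grid[0]) * p)]
            for i in range(len(grid) * p)]


if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    tokens = patchify(image, 2)
    print("tokens:", len(tokens), "values per token:", len(tokens[0]))
    print("patch round trip is lossless:", unpatchify(tokens, 4, 4, 2) == image)
    print("pool/upsample round trip is lossless:",
          nearest_up(avg_pool_down(image, 2), 2) == image)
```

In this toy setting the patch tokens carry every original value, whereas the pooled grid keeps only block averages, which is one way to read the information-loss argument.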
According to Jiemian News, as early as around July 2023 the Tencent team had already settled on the DiT architecture as the direction for its model and begun developing a new generation of models. At that time, few products in China were attempting this direction.
At the beginning of this year, the Hunyuan text-to-image model was upgraded to the DiT architecture, and in the process Tencent also made a number of improvements to the original model. For example, at the algorithm level, the model's long-text understanding has been optimized, and it now supports content input of up to 256 characters.
Previously, the core datasets of models such as Stable Diffusion were mainly in English. Tencent stresses that Hunyuan text-to-image is a Chinese-native DiT model with bilingual understanding and generation capabilities in both Chinese and English.
Lu Qinglin told Jiemian News that in the past, many large models passed foreign-language data through a translation layer before training in Chinese, which caused considerable information loss and ambiguity. The Hunyuan text-to-image model understands Chinese natively and removes the translation step: it is trained directly on Chinese data, so the model understands Chinese better.
According to evaluation results presented by Tencent, the overall visual generation quality of the new text-to-image model is 20% higher than that of the previous generation, with significant improvements in semantic understanding, image quality, and realism.
Hunyuan has also gained multi-turn image generation and dialogue capability, allowing users to refine an initially generated image through natural language descriptions.
In Lu Qinglin's view, multi-turn dialogue lets the model interact continuously with users in the manner of an AI chatbot. Users do not need to provide a complete prompt in one go. They can draw and revise as they go: if they are unsatisfied with a picture, they simply change it again, which greatly lowers the barrier to use.
Open source can make large models move faster
According to Jiemian News, the Hunyuan text-to-image model mainly worked with Tencent Advertising last year, building AI-driven tools for advertising scenarios. This year it plans to expand cooperation with QQ, WeCom (Enterprise WeChat), and the gaming business, and to deploy at scale in more business scenarios.
Lu Qinglin stressed that the open-source version of the Hunyuan text-to-image model is the same version currently used inside Tencent. The company is not releasing a version that lags several generations behind the one it uses itself.
In fact, the model is already used in many scenarios within Tencent, such as creative material production, product compositing, and game art generation. For example, at the beginning of this year Tencent Advertising launched a one-stop AI advertising creative platform based on the Hunyuan model, which offers advertisers tools such as text-to-image, image-to-image, and product background compositing.
In the past, the path taken by the Hunyuan text-to-image model was relatively closed. After the model was iterated internally, it was made available only through an API and was not open-sourced. Now the team has realized that building an open-source community can encourage more developers to participate, and that co-building can help the model advance faster.
“Last July we began the transition (to the DiT architecture) and ran into many pitfalls. It was not until January of this year that we gradually began to see results.” Lu Qinglin believes that now is a suitable time to open-source. With the open-source model, companies do not need to train it again. Instead, they can use it directly for inference and save a great deal of manpower and computing power.
After OpenAI released Sora during this year's Spring Festival, Lu Qinglin's team “did not get a proper New Year holiday” and saw the strength of the DiT architecture confirmed. “We hope to share the DiT for images, so that the industry can quickly follow up and use it to catch up in video.”
Before making the decision to open-source, Tencent also ran a side-by-side comparison internally. The conclusion was that on dimensions such as consistency, aesthetics, and clarity, the gap between Hunyuan and mainstream closed-source models is not large, with Hunyuan ranking just behind DALL-E 3 and SD 3 (Stable Diffusion 3).
In addition, today's open-source text-to-image communities are mainly English-language ones, such as the Stable Diffusion community. By open-sourcing its model, Tencent can enrich the Chinese-centered open-source text-to-image ecosystem, foster more diverse native plugins, and promote the development and application of Chinese text-to-image technology.
Open-sourcing the Hunyuan text-to-image model is also part of Tencent's broader open-source strategy. According to publicly available figures, Tencent has open-sourced more than 170 projects to date, all of them drawn from real business scenarios and covering core businesses such as WeChat, Tencent Cloud, and Tencent Games.

Author: admin

