Revolutionizing AI-Generated Audio with Hunyuan Video-Foley
A team from Tencent’s Hunyuan Lab has developed a new artificial intelligence system, ‘Hunyuan Video-Foley,’ that marks a breakthrough in AI-generated audio for video. The system adds realistic, well-matched sound effects to generated videos, enhancing the viewing experience with a much-needed layer of realism.
Challenges in AI-Generated Audio
AI-generated videos can be visually stunning, but the accompanying audio often falls flat, leaving a gap in the overall experience. In the film industry, the sound effects that fill this silence are known as Foley art, a craft that demands great skill and precision from its practitioners.
For years, the hardest challenge for automated systems has been producing believable sound for video: the model must capture fine acoustic detail and synchronize audio precisely with on-screen motion, two abilities that earlier systems lacked.
Tencent’s Solutions to Audio Problems in Generated Videos
Tencent identified one of the main reasons video-to-audio (V2A) models fail: what the researchers called ‘media imbalance.’ These AI systems lean heavily on the accompanying text prompt rather than on the actual video content.
For example, if a model is given a video of a crowded beach with people and birds, but the accompanying text mentions only ‘the sound of waves,’ the model will generate waves and ignore other sounds such as footsteps or bird calls. On top of that, the quality of the generated audio was often subpar, and there was too little high-quality video-with-audio data to train the models effectively.
Tencent’s Strategy in Deep Learning and Advanced Technology
The Hunyuan team addressed these issues from three angles. First, they built a massive library of 100,000 hours of video, audio, and text descriptions to train the AI. They also developed an automated pipeline that filters out low-quality content scraped from the internet, ensuring the model learns only from the best available material.
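A filtering pass like the one described above can be sketched in a few lines. The criteria below (minimum length, silence ratio, sample rate) are hypothetical placeholders; Tencent has not published the exact heuristics its pipeline uses.

```python
# Minimal sketch of an automated data-quality filter for audio-video clips.
# The thresholds and fields are illustrative assumptions, not Tencent's
# actual pipeline.
from dataclasses import dataclass

@dataclass
class Clip:
    duration_s: float      # clip length in seconds
    silence_ratio: float   # fraction of frames below an energy threshold
    sample_rate: int       # audio sample rate in Hz

def passes_quality_filter(clip: Clip) -> bool:
    """Keep clips that are long enough, mostly non-silent, and high-fidelity."""
    return (
        clip.duration_s >= 2.0
        and clip.silence_ratio <= 0.5
        and clip.sample_rate >= 16_000
    )

clips = [
    Clip(duration_s=8.0, silence_ratio=0.1, sample_rate=44_100),  # good
    Clip(duration_s=1.0, silence_ratio=0.1, sample_rate=44_100),  # too short
    Clip(duration_s=8.0, silence_ratio=0.9, sample_rate=44_100),  # mostly silent
]
kept = [c for c in clips if passes_quality_filter(c)]
print(len(kept))  # 1
```

The point is simply that filtering happens before training, so the model never sees the noisy or silent clips at all.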
Second, they designed a smarter AI architecture that lets the model learn multiple tasks in the right order. The system begins by focusing on the link between sound and image so it can nail the timing, such as matching the sound of footsteps to the moment a foot touches the ground. Once timing is established, it incorporates the accompanying text to understand the overall context and mood of the scene.
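The two-stage ordering above can be illustrated with a toy attention sketch: the audio stream first attends to per-frame visual features (for timing), and only afterwards to text features (for global context). The shapes, dimensions, and function names are illustrative assumptions, not the actual Hunyuan Video-Foley architecture.

```python
# Toy illustration of "visual timing first, text semantics second"
# conditioning. Not the real model; shapes and names are invented.
import numpy as np

def cross_attention(queries, keys_values):
    """Scaled dot-product attention; keys double as values for brevity."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

rng = np.random.default_rng(0)
audio_tokens = rng.normal(size=(16, 32))   # 16 audio frames, feature dim 32
visual_tokens = rng.normal(size=(16, 32))  # per-frame visual features
text_tokens = rng.normal(size=(4, 32))     # a short text prompt

# Stage 1: align audio with video frames (fine-grained timing).
audio_tokens = audio_tokens + cross_attention(audio_tokens, visual_tokens)
# Stage 2: inject text for overall context and mood.
audio_tokens = audio_tokens + cross_attention(audio_tokens, text_tokens)
print(audio_tokens.shape)  # (16, 32)
```

The design choice being illustrated is the ordering: because timing is resolved against the video before the text is consulted, the text prompt can no longer pull the model away from what is actually on screen.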
Representation Alignment Strategy for Improved Audio Quality
To ensure the quality of the AI-generated audio, the team used a training strategy known as Representation Alignment (REPA). This strategy is akin to having an expert sound engineer oversee the AI during training, guiding it towards producing purer, richer, and more stable audio.
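A common way to implement this kind of alignment is to nudge the generator's intermediate features toward those of a frozen, pretrained audio encoder via a cosine-similarity loss. The sketch below shows that objective in isolation; the feature shapes and the teacher encoder are placeholders, not Tencent's published setup.

```python
# Minimal sketch of a representation-alignment (REPA-style) objective:
# penalize the distance between the generator's features and a frozen
# teacher encoder's features. Shapes are illustrative assumptions.
import numpy as np

def repa_loss(model_features, teacher_features):
    """Mean (1 - cosine similarity) between matched feature vectors."""
    m = model_features / np.linalg.norm(model_features, axis=-1, keepdims=True)
    t = teacher_features / np.linalg.norm(teacher_features, axis=-1, keepdims=True)
    cosine = np.sum(m * t, axis=-1)
    return float(np.mean(1.0 - cosine))

feats = np.random.default_rng(1).normal(size=(8, 64))
print(repa_loss(feats, feats))   # ~0.0: identical features, no penalty
print(repa_loss(feats, -feats))  # ~2.0: opposite features, maximal penalty
```

During training this loss would be added to the usual generation objective, so the "expert sound engineer" in the analogy is the frozen teacher whose features the generator is steered toward.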
When Hunyuan Video-Foley was tested against other leading AI models, the results were clear. Not only were the objective metrics better, but human listeners also rated its output as higher quality and better aligned with the video in both content and timing.
Conclusion
Tencent’s efforts in developing Hunyuan Video-Foley represent a significant step towards bridging the gap between silent AI-generated videos and immersive viewing experiences with high-quality audio. It brings the magic of Foley art to the world of automated content creation, offering powerful possibilities for filmmakers, animators, and creators everywhere.