A team of researchers, supported by Baidu Inc. and various universities, today introduced AudCast, a groundbreaking AI system that transforms simple voice recordings into fully realistic, full‑body talking videos.

Unlike previous AI tools that generate only talking heads, AudCast produces coherent head, body, and hand movements synchronized precisely with speech.

Comparison of reference frames and videos generated by AudCast, showing voice recordings transformed into full-body animations, from an animated character in a nun outfit to human presenters. (Image Source: AudCast Team)

AudCast employs a novel two-stage cascade of diffusion transformers: the first stage generates holistic human motion from a single reference image and an audio clip, and the second refines facial expressions and finger gestures using detailed 3D structural priors.
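The cascade described above can be sketched as a simple two-stage pipeline. Everything below is illustrative: the function names, feature dimensions, and placeholder arithmetic are assumptions standing in for the actual diffusion transformers and 3D priors (e.g. face and hand parameters) used by AudCast.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_audio(waveform, frames, dim=64):
    """Toy stand-in for an audio encoder: one feature vector per video frame."""
    chunks = np.array_split(waveform, frames)
    return np.stack([np.resize(c, dim) for c in chunks])

def holistic_stage(audio_feats, ref_image_feat):
    """Stage 1 (sketch): coarse full-body motion conditioned on audio
    and a single reference image; a placeholder for a diffusion transformer."""
    return audio_feats + ref_image_feat

def refinement_stage(coarse_motion, structural_prior):
    """Stage 2 (sketch): refine face/hand regions guided by a 3D structural
    prior; again a placeholder for the second diffusion transformer."""
    return coarse_motion * (1.0 + 0.1 * structural_prior)

frames, dim = 16, 64
waveform = rng.standard_normal(frames * 640)   # dummy audio samples
ref_image_feat = rng.standard_normal(dim)      # encoded reference image
prior = rng.standard_normal((frames, dim))     # e.g. 3D face/hand parameters

audio_feats = encode_audio(waveform, frames, dim)
coarse = holistic_stage(audio_feats, ref_image_feat)
refined = refinement_stage(coarse, prior)
print(refined.shape)  # (16, 64): one refined motion vector per frame
```

The key design point is the split: the first stage only has to get the overall body motion plausible and audio-synchronized, while the second stage concentrates capacity on the small, detail-heavy regions (face and hands) where artifacts are most visible.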

Diagram of the AudCast architecture, covering feature encoding and video generation with components such as VAE encoders, transformer layers, and adapters for motion, appearance, and identity. (Image Source: AudCast Team)

In head‑to‑head evaluations, AudCast outperformed all existing methods in both video quality and motion realism, achieving a Structural Similarity Index (SSIM) of 0.8275 and cutting perceptual error nearly in half compared to the best baseline.
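For context, SSIM scores image similarity on a 0-to-1 scale by comparing luminance, contrast, and structure, with 1.0 meaning identical images. Below is a minimal single-window sketch of the metric in NumPy; published evaluations typically use the standard sliding-window variant, so this is for intuition only.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Global (single-window) SSIM between two grayscale images."""
    c1 = (0.01 * data_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizer for the contrast/structure term
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    )

rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64))
noisy = np.clip(img + rng.integers(-40, 41, size=(64, 64)), 0, 255)

print(round(global_ssim(img, img), 4))  # 1.0 for identical images
print(global_ssim(img, noisy) < 1.0)    # True: noise lowers the score
```

Against this scale, AudCast's reported 0.8275 indicates strong structural agreement between generated frames and ground-truth video.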

Comparison of videos synthesized by AudCast ("Ours") and baseline models (Vlogger, Talk-Mimic, Prob-CNxt, S2G) against reference frames, highlighting differences in hand gestures and motion realism. (Image Source: AudCast Team)

Independent reviewers praised its lifelike gestures and identity consistency, noting its ability to maintain a subject’s appearance throughout speech.

Side-by-side comparison of facial and hand movements generated by AudCast: a speaker with articulated hand gestures alongside refined 3D representations of the face and hands. (Image Source: AudCast Team)
A speaker gesturing while talking, overlaid with motion visualizations and a heatmap of movement dynamics over time. (Image Source: AudCast Team)

AudCast was accepted for presentation at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025, underscoring its significance in advancing audio‑driven video synthesis.

Potential applications range from virtual hosts and digital educators to personalized video messages, enabling anyone’s voice to be brought to life onscreen without traditional filming.

Source: arXiv:2503.19824 [cs.GR], AudCast PDF