A team of researchers from Baidu Inc. and several universities today introduced AudCast, an AI system that transforms simple voice recordings into realistic, full‑body talking videos.
Unlike previous AI tools that generate only talking heads, AudCast produces coherent head, body, and hand movements synchronized precisely with speech.

It employs a novel two‑stage cascade of diffusion transformers: first generating holistic human motion from a single reference image and audio clip, then refining facial expressions and finger gestures using detailed 3D structural priors.
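The cascade described above can be sketched as a two-stage pipeline: a coarse stage maps audio and a reference image to holistic motion, and a fine stage refines face and hand regions conditioned on that coarse output plus 3D priors. The sketch below is purely illustrative: the function names, data shapes, and toy arithmetic are invented stand-ins, not the paper's actual diffusion-transformer denoisers.

```python
# Hypothetical sketch of a two-stage cascade (all names and shapes are
# invented for illustration; each stage stands in for a diffusion
# transformer that the paper describes).

def stage1_holistic(reference_image, audio_features):
    """Coarse stage: audio + one reference frame -> whole-body motion."""
    # Toy stand-in: one pose value per audio frame.
    return [{"frame": i, "body_pose": a * 0.5}
            for i, a in enumerate(audio_features)]

def stage2_refine(coarse_motion, priors_3d):
    """Fine stage: refine face/hand regions using 3D structural priors."""
    refined = []
    for pose, prior in zip(coarse_motion, priors_3d):
        refined.append({**pose,
                        "face": pose["body_pose"] + prior["face"],
                        "hands": pose["body_pose"] + prior["hands"]})
    return refined

audio = [0.2, 0.8, 0.4]                       # toy per-frame audio features
priors = [{"face": 0.1, "hands": 0.05}] * len(audio)
video = stage2_refine(stage1_holistic("ref.png", audio), priors)
```

The key design point the sketch captures is that the second stage never sees the raw audio alone: it conditions on the first stage's holistic output, so face and hand detail stays consistent with the overall body motion.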

In head‑to‑head evaluations, AudCast outperformed existing methods in both video quality and motion realism, achieving a Structural Similarity Index (SSIM) of 0.8275 and cutting perceptual error nearly in half relative to the best baseline.
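For readers unfamiliar with SSIM: it compares two images via their luminance, contrast, and structure statistics, yielding 1.0 for identical images. The snippet below is a simplified single-window (global) version of the formula for illustration; published evaluations such as this one typically use a sliding Gaussian window (e.g. scikit-image's `structural_similarity`), not this whole-image variant.

```python
def ssim_global(x, y, data_range=1.0):
    """Simplified global SSIM between two equal-length pixel lists.

    Illustrative only: real SSIM scores are averaged over local
    windows, so this value will differ from reported benchmarks.
    """
    assert len(x) == len(y)
    n = len(x)
    c1 = (0.01 * data_range) ** 2   # stabilizing constants from the
    c2 = (0.03 * data_range) ** 2   # standard SSIM definition
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Comparing an image with itself returns exactly 1.0, and scores fall toward 0 as structure diverges, which is why a value of 0.8275 on generated video frames indicates close structural agreement with ground truth.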

Independent reviewers praised its lifelike gestures and identity consistency, noting its ability to maintain a subject’s appearance throughout speech.


AudCast was accepted for presentation at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025, underscoring its significance in advancing audio‑driven video synthesis.
Potential applications range from virtual hosts and digital educators to personalized video messages, enabling anyone’s voice to be brought to life onscreen without traditional filming.
Source: arXiv:2503.19824 [cs.GR]
