In a world increasingly shaped by generative AI, editing video and audio with a simple text prompt is no longer science fiction — but achieving seamless synchronization between the two remains a stubborn technical barrier. Now, a team of Microsoft interns might have just cracked the code.
A new research paper introduces AVED, a novel framework designed to edit both audio and video in perfect harmony — all without any additional model training.
Developed during an internship at Microsoft, AVED represents a leap forward in zero-shot editing, tackling one of AI’s most persistent multimodal challenges: keeping sound and visuals in sync.
From Dogs Barking to Lions Roaring — All in a Single Prompt
At the core of AVED is the idea that generative models should be able to transform both visual and auditory elements of a scene in tandem. Imagine a video of a dog barking that, when prompted, transforms into a lion roaring.

Existing models can edit either the video or the audio, but rarely both, at least not without introducing jarring misalignments, audio artifacts, or incoherent motion.
AVED addresses this head-on with a technique the authors call cross-modal delta denoising.
It leverages a diffusion-based framework that edits audio and video simultaneously, guided by text prompts, and it does this in zero-shot mode, meaning no fine-tuning or retraining is required.
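The article doesn't spell out the mechanism, but the "delta denoising" idea can be illustrated with a toy sketch: the edit direction is the difference between the diffusion model's noise predictions under the target prompt and under the source prompt, applied to both modalities on a shared timestep schedule. Everything below (`toy_denoiser`, the latents, the prompt embeddings) is a placeholder, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(latent, prompt_vec, t):
    # Placeholder for a pretrained diffusion model's noise prediction,
    # conditioned on a text-prompt embedding. Purely illustrative.
    return 0.1 * (latent - prompt_vec) * (t / 50.0)

def delta_denoise_step(video_lat, audio_lat, src_prompt, tgt_prompt, t, lr=0.5):
    # The edit direction is the *difference* between the noise predicted
    # under the target prompt and under the source prompt, so content the
    # two predictions agree on is left untouched.
    for lat in (video_lat, audio_lat):
        delta = toy_denoiser(lat, tgt_prompt, t) - toy_denoiser(lat, src_prompt, t)
        lat -= lr * delta  # in-place update of the latent
    return video_lat, audio_lat

video = rng.normal(size=(4, 8))  # stand-in video latent (frames x dims)
audio = rng.normal(size=(2, 8))  # stand-in audio latent
video0, audio0 = video.copy(), audio.copy()
src = np.zeros(8)  # e.g. "a dog barking" text embedding (placeholder)
tgt = np.ones(8)   # e.g. "a lion roaring" text embedding (placeholder)

# A shared timestep schedule nudges the two modalities in lockstep.
for t in range(50, 0, -10):
    video, audio = delta_denoise_step(video, audio, src, tgt, t)
```

Because the two noise predictions cancel wherever source and target agree, only the regions the prompt actually changes get pushed, and driving both latents with the same schedule is what keeps the modalities synchronized.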

A New Benchmark: AVED-Bench
To evaluate the performance of their model, the team created AVED-Bench, a new benchmark dataset comprising 110 ten-second videos sourced from the VGGSound dataset.
Each clip is annotated with both a source and target prompt, covering categories ranging from animals and vehicles to environmental sounds.
Compared with the benchmarks used by earlier video-editing methods such as DreamMotion or TokenFlow, AVED-Bench ups the ante by requiring coherent edits across both modalities.
It’s a test of true multimodal understanding — not just style transfer or surface-level generation.
Video Demo
Outperforming the State-of-the-Art
AVED doesn’t just sound good on paper. In rigorous quantitative evaluations, it outperforms leading zero-shot video models like TokenFlow and RAVE, as well as audio models like ZEUS.

It achieves higher scores across key metrics, including CLIP-F (frame consistency), CLAP (audio-text alignment), and AV-Align (audio-visual synchronization).
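As a rough illustration of the first metric, a CLIP-F-style frame-consistency score is simply the average cosine similarity between CLIP embeddings of consecutive frames; the sketch below uses random vectors as stand-ins for real CLIP features:

```python
import numpy as np

def frame_consistency(frame_embs):
    # CLIP-F-style score: mean cosine similarity between embeddings of
    # consecutive frames (each row is one frame's feature vector).
    normed = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    return float(np.sum(normed[:-1] * normed[1:], axis=1).mean())

# Identical frames are perfectly consistent; unrelated frames are not.
steady = np.tile([[1.0, 2.0, 3.0]], (5, 1))
jittery = np.random.default_rng(1).normal(size=(5, 3))
```

A temporally stable edit keeps consecutive frames close in embedding space (score near 1), while flickering or incoherent motion drags the score down.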
In human evaluations conducted via Amazon Mechanical Turk, participants overwhelmingly preferred AVED-edited videos over those produced by competing methods.
Around 75% of raters chose AVED over ControlVideo, and over 60% preferred it to both TokenFlow and RAVE.
No Training Required — Really
What makes AVED especially impressive is that it achieves this synchronization without retraining any diffusion models.
Instead, it adapts off-the-shelf models like Stable Diffusion and AudioLDM2, using score-based guidance together with a patch-based contrastive loss to steer the edits in both space and time.
In practical terms, that means creators — from indie filmmakers to TikTok editors — could one day use AVED-like systems to generate fully aligned, prompt-based edits on the fly, without waiting for massive training cycles or requiring technical expertise.
The Bigger Picture: AI That Understands Context
AVED isn’t just a cool editing trick. It represents a broader shift toward context-aware generative models that understand how audio and visuals relate to each other.
In a landscape increasingly dominated by short-form video, immersive media, and AI-generated content, tools like AVED could be foundational for the next generation of multimedia storytelling.
The researchers hint that AVED could be extended even further — potentially supporting longer video segments, multi-scene transitions, or even interactive editing tools that adjust in real time.
A Glimpse Into the Future
For now, AVED is a research prototype. But its implications are far-reaching.
By solving one of the hardest problems in multimodal AI — synchronizing edits across sound and image — this framework sets a new bar for what’s possible in zero-shot editing.
And perhaps most impressively, it was born out of an internship.
If Microsoft’s interns are building tech like this, the future of creative AI might be arriving even faster than expected.
Source: arXiv:2503.20782, Research Paper PDF, GitHub Project Page.
