In a world increasingly shaped by generative AI, editing video and audio with a simple text prompt is no longer science fiction — but achieving seamless synchronization between the two remains a stubborn technical barrier. Now, a team of Microsoft interns might have just cracked the code.

A new research paper introduces AVED, a novel framework designed to edit both audio and video in perfect harmony — all without any additional model training.

Developed during an internship at Microsoft, AVED represents a leap forward in zero-shot editing, tackling one of AI’s most persistent multimodal challenges: keeping sound and visuals in sync.

From Dogs Barking to Lions Roaring — All in a Single Prompt

At the core of AVED is the idea that generative models should be able to transform both visual and auditory elements of a scene in tandem. Imagine a video of a dog barking that, when prompted, transforms into a lion roaring.

An illustrative diagram showing a video transitioning from a dog barking to a lion roaring, with annotations contrasting editing audio or video alone against editing both jointly.
Source: AVED Research Team

Existing models could edit either the video or the audio, but usually not both — at least not without introducing jarring misalignments, audio artifacts, or incoherent motion.

AVED addresses this head-on with a technique the authors call cross-modal delta denoising.


The technique uses a diffusion-based framework that edits audio and video simultaneously, guided by text prompts, and it works entirely zero-shot, meaning no fine-tuning or retraining is required.

A diagram illustrating the cross-modal delta denoising scheme used in the AVED framework for synchronizing audio and video edits based on text prompts, featuring source and optimized video frames alongside source and optimized audio visualizations.
Source: AVED Research Team
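
To make the mechanism concrete, here is a minimal sketch of the delta denoising idea in Python. Everything in it is a stand-in: the ToyDenoiser modules replace the frozen, pretrained video and audio diffusion models, scalar values replace real text-prompt embeddings, and the only cross-modal coupling shown (a shared timestep and prompt pair) merely gestures at the paper’s richer machinery.

```python
# A minimal, illustrative sketch of the cross-modal delta denoising idea
# (not the authors' code). Toy linear "denoisers" stand in for frozen,
# pretrained video and audio diffusion models; all shapes and names are
# assumptions made for exposition.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a frozen, pretrained diffusion denoiser (e.g., a U-Net)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Linear(dim + 2, dim)  # latent + timestep + prompt scalar

    def forward(self, x, t, prompt):
        b = x.shape[0]
        t_col = t.view(1, 1).expand(b, 1)
        p_col = prompt.view(1, 1).expand(b, 1)
        return self.net(torch.cat([x, t_col, p_col], dim=-1))

@torch.no_grad()
def delta_score(denoiser, latent, src_prompt, tgt_prompt, t, noise):
    """Delta denoising: the difference between the noise predictions under
    the target and source prompts isolates the edit direction while
    cancelling the reconstruction bias the two prompts share."""
    noisy = latent + t * noise  # toy forward-diffusion step
    return denoiser(noisy, t, tgt_prompt) - denoiser(noisy, t, src_prompt)

torch.manual_seed(0)
video_denoiser, audio_denoiser = ToyDenoiser(16), ToyDenoiser(8)

video = torch.randn(4, 16)  # source video latents (4 frames)
audio = torch.randn(4, 8)   # source audio latents (4 chunks)
video_edit = video.clone().requires_grad_(True)
audio_edit = audio.clone().requires_grad_(True)
opt = torch.optim.SGD([video_edit, audio_edit], lr=0.05)

# Scalar stand-ins for text embeddings: "dog barking" -> "lion roaring".
src, tgt = torch.tensor(0.0), torch.tensor(1.0)

for step in range(200):
    t = torch.rand(())  # diffusion timestep, shared across modalities
    dv = delta_score(video_denoiser, video_edit, src, tgt, t,
                     torch.randn_like(video_edit))
    da = delta_score(audio_denoiser, audio_edit, src, tgt, t,
                     torch.randn_like(audio_edit))
    # Each delta is applied as a gradient on its own latents; sampling the
    # same timestep and prompt pair for both modalities keeps the two edits
    # moving together instead of drifting apart.
    video_edit.grad = dv
    audio_edit.grad = da
    opt.step()

print("video latents moved by", (video_edit - video).norm().item())
print("audio latents moved by", (audio_edit - audio).norm().item())
```

The key property of the delta is that the source-prompt prediction subtracts away whatever both prompts agree on, so only the edit direction (dog to lion, bark to roar) pushes on the latents.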

A New Benchmark: AVED-Bench

To evaluate the performance of their model, the team created AVED-Bench, a new benchmark dataset comprising 110 ten-second videos sourced from the VGGSound dataset.

Each clip is annotated with both a source and target prompt, covering categories ranging from animals and vehicles to environmental sounds.
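
A single benchmark entry presumably bundles a clip with its prompt pair and category; the schema below is an illustrative assumption, not the project’s actual file format.

```python
# Hypothetical shape of an AVED-Bench entry; the field names are assumptions,
# not the project's actual schema (see the project page for the real format).
from dataclasses import dataclass

@dataclass
class BenchEntry:
    video_path: str      # ten-second clip drawn from VGGSound
    source_prompt: str   # describes the original audio-visual content
    target_prompt: str   # describes the desired edit
    category: str        # e.g. animals, vehicles, environmental sounds

example = BenchEntry(
    video_path="clips/dog_bark_0001.mp4",
    source_prompt="a dog barking in a backyard",
    target_prompt="a lion roaring in a backyard",
    category="animals",
)
print(example)
```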

Compared to the video-only evaluation sets used by earlier editing methods such as DreamMotion and TokenFlow, AVED-Bench ups the ante by requiring coherent edits across both modalities.

It’s a test of true multimodal understanding — not just style transfer or surface-level generation.


Outperforming the State-of-the-Art

AVED doesn’t just sound good on paper. In quantitative evaluations, it outperforms leading zero-shot video editing methods such as TokenFlow and RAVE, as well as the zero-shot audio editing method ZEUS.

A comparison image showing a video editing framework where a source video of a cat is transformed into a target video of a dog. The top section displays frames from the transformation process, while the bottom section presents spectrograms indicating audio synchronization for different models: ControlVideo + ZEUS, TokenFlow + ZEUS, and RAVE + ZEUS.
Source: AVED Research Team

It achieves higher scores across key metrics, including CLIP-F (frame consistency), CLAP (audio-text alignment), and AV-Align (audio-visual synchronization).
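
Metric definitions vary slightly from paper to paper; one common reading of CLIP-F is the average cosine similarity between CLIP embeddings of consecutive edited frames. The toy sketch below assumes that formulation, with random vectors standing in for real CLIP image features.

```python
# Toy sketch of a CLIP-F-style frame-consistency score (assumed formulation;
# the paper's exact evaluation protocol may differ). Random vectors stand in
# for embeddings from a real CLIP image encoder.
import torch
import torch.nn.functional as F

def clip_f(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """frame_embeddings: (num_frames, dim) embeddings of the edited frames."""
    e = F.normalize(frame_embeddings, dim=-1)
    # Cosine similarity between each frame and its successor, averaged:
    # temporally coherent edits keep neighbouring frames close in CLIP space.
    return (e[:-1] * e[1:]).sum(dim=-1).mean()

torch.manual_seed(0)
frames = torch.randn(8, 512)                  # stand-in embeddings, 8 frames
print(f"CLIP-F (toy): {clip_f(frames):.3f}")  # near zero for random vectors
```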

In human evaluations conducted via Amazon Mechanical Turk, participants consistently preferred AVED-edited videos over those produced by competing methods.

Around 75% of raters chose AVED over ControlVideo, and over 60% preferred it to both TokenFlow and RAVE.

No Training Required — Really

What makes AVED especially impressive is that it achieves this synchronization without retraining any diffusion models.

Instead, it adapts off-the-shelf models like Stable Diffusion and AudioLDM2, using score-based guidance together with a patch-based contrastive loss to steer the edits across both space and time.
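
The paper’s exact objective isn’t reproduced here, but a patch-based contrastive loss typically looks like the generic patch-wise InfoNCE sketched below (in the spirit of methods like CUT): matching patches of the source and edited features act as positive pairs, every other patch as a negative, which preserves spatial layout while the described content changes.

```python
# Generic patch-wise contrastive (InfoNCE) loss, sketched to illustrate the
# idea; the authors' exact formulation may differ. Matching patches of the
# source and edited features are positives, all other patches are negatives.
import torch
import torch.nn.functional as F

def patch_contrastive_loss(src_patches, edit_patches, temperature=0.07):
    """src_patches, edit_patches: (num_patches, dim) feature vectors."""
    src = F.normalize(src_patches, dim=-1)
    edit = F.normalize(edit_patches, dim=-1)
    logits = edit @ src.t() / temperature    # (P, P) patch-similarity matrix
    labels = torch.arange(src.shape[0])      # patch i should match patch i
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
src = torch.randn(64, 128)                   # 64 patches, 128-dim features
edit = src + 0.1 * torch.randn_like(src)     # an edit that preserves structure
print(patch_contrastive_loss(src, edit))     # low loss: layout is retained
```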

In practical terms, that means creators — from indie filmmakers to TikTok editors — could one day use AVED-like systems to generate fully aligned, prompt-based edits on the fly, without waiting for massive training cycles or requiring technical expertise.


The Bigger Picture: AI That Understands Context

AVED isn’t just a cool editing trick. It represents a broader shift toward context-aware generative models that understand how audio and visuals relate to each other.

In a landscape increasingly dominated by short-form video, immersive media, and AI-generated content, tools like AVED could be foundational for the next generation of multimedia storytelling.

The researchers hint that AVED could be extended even further — potentially supporting longer video segments, multi-scene transitions, or even interactive editing tools that adjust in real time.

A Glimpse Into the Future

For now, AVED is a research prototype. But its implications are far-reaching.

By solving one of the hardest problems in multimodal AI — synchronizing edits across sound and image — this framework sets a new bar for what’s possible in zero-shot editing.

And perhaps most impressively, it was born out of an internship.

If Microsoft’s interns are building tech like this, the future of creative AI might be arriving even faster than expected.

Source: arXiv:2503.20782, Research Paper PDF, GitHub Project Page.
