Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling

Yinuo Wang1,※, Yanbo Fan2,*,※, Xuan Wang2, Yu Guo1,*, Fei Wang1
1 Xi'an Jiaotong University, 2 Ant Group
CVPR 2025 (Highlight)

※ Equal Contribution

*Corresponding Authors
arXiv · Supplementary · Code

Abstract

Listening head generation aims to synthesize non-verbal, responsive listening head videos that naturally react to a given speaker, for which realistic head movements, expressive facial expressions, and high visual quality are all expected. Previous approaches typically follow a two-stage pipeline that first generates intermediate 3D motion signals, such as 3DMM coefficients, and then synthesizes videos by deterministic rendering, suffering from limited motion expressiveness and low visual quality (e.g., 256×256 resolution). In this work, we propose a novel listening head generation method that harnesses the generative capability of diffusion models for both motion generation and high-quality rendering. Crucially, we propose an effective hybrid motion modeling module that addresses the training difficulties caused by the scarcity of listening head data while preserving the intricate details that may be lost in explicit motion representations. We further develop a tailored control guidance for head pose and facial expression that integrates their intrinsic motion characteristics. Our method generates high-fidelity videos at 512×512 resolution and delivers vivid listener motion feedback. Comprehensive experiments demonstrate superior performance over existing methods in terms of both visual quality and motion expressiveness.
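To make the hybrid motion modeling idea more concrete, below is a minimal, illustrative PyTorch sketch of fusing an explicit motion code (e.g., a 3DMM-style coefficient sequence) with an implicit motion latent into a single conditioning signal. The class name HybridMotionModel, the dimensions, and the fusion design are assumptions made for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class HybridMotionModel(nn.Module):
    """Toy sketch (assumed interface): fuse an explicit motion code
    (e.g., 3DMM-style coefficients from a motion generator) with an
    implicit motion latent into one conditioning signal."""

    def __init__(self, explicit_dim=64, implicit_dim=256, fused_dim=256):
        super().__init__()
        # Project the low-dimensional explicit coefficients up.
        self.explicit_proj = nn.Linear(explicit_dim, fused_dim)
        # Project the implicit motion latent to the same width.
        self.implicit_proj = nn.Linear(implicit_dim, fused_dim)
        # Fuse both motion streams into a single conditioning feature.
        self.fuse = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.SiLU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, explicit_motion, implicit_motion):
        # explicit_motion: (B, T, explicit_dim) coefficient sequence
        # implicit_motion: (B, T, implicit_dim) latent motion feature
        e = self.explicit_proj(explicit_motion)
        i = self.implicit_proj(implicit_motion)
        return self.fuse(torch.cat([e, i], dim=-1))


if __name__ == "__main__":
    model = HybridMotionModel()
    explicit = torch.randn(2, 25, 64)   # e.g., 25 frames of explicit coefficients
    implicit = torch.randn(2, 25, 256)  # matching implicit motion latents
    cond = model(explicit, implicit)
    print(cond.shape)  # torch.Size([2, 25, 256])
```

In this reading, the explicit stream keeps training tractable despite the scarcity of listening head data, while the implicit stream carries the fine-grained details that explicit coefficients alone may drop.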

Overview


Overview of the proposed framework. Given a listener portrait image together with the speaker's audio and head motion as input, we generate a realistic and vivid listening head video. (a) A lightweight diffusion transformer that generates the explicit motion representation; (b) a reference net that extracts listener identity features; (c) our generation backbone; (d) the implicit motion refinement module.
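For readers who prefer code, the data flow across panels (a)-(d) can be summarized by the hypothetical sketch below. Every module name (motion_transformer, reference_net, backbone, refiner) and tensor shape is a placeholder chosen for illustration, not the actual interface of our implementation.

```python
import torch


def generate_listening_head(listener_image, speaker_audio, speaker_motion,
                            motion_transformer, reference_net, backbone, refiner):
    """Hypothetical end-to-end flow mirroring panels (a)-(d) of the overview figure."""
    # (a) Lightweight diffusion transformer: sample an explicit motion
    #     representation for the listener, conditioned on the speaker's
    #     audio and head motion.
    explicit_motion = motion_transformer(speaker_audio, speaker_motion)

    # (b) Reference net: extract identity features from the listener portrait.
    identity_feat = reference_net(listener_image)

    # (c) Generation backbone: produce video latents conditioned on the
    #     listener identity and the explicit motion representation.
    video_latents = backbone(identity_feat, explicit_motion)

    # (d) Implicit motion refinement: recover fine-grained motion details that
    #     explicit coefficients alone may miss, then decode to frames.
    frames = refiner(video_latents, explicit_motion)
    return frames


if __name__ == "__main__":
    # Stand-in callables and random tensors so the sketch runs end to end;
    # all shapes are illustrative only.
    B, T = 1, 8
    out = generate_listening_head(
        listener_image=torch.randn(B, 3, 512, 512),
        speaker_audio=torch.randn(B, T, 80),
        speaker_motion=torch.randn(B, T, 6),
        motion_transformer=lambda audio, motion: torch.randn(B, T, 64),
        reference_net=lambda image: torch.randn(B, 256),
        backbone=lambda identity, motion: torch.randn(B, T, 4, 64, 64),
        refiner=lambda latents, motion: torch.randn(B, T, 3, 512, 512),
    )
    print(out.shape)  # torch.Size([1, 8, 3, 512, 512])
```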

Generated Videos

Video Presentation

BibTeX (Coming Soon)
