LipSody Demo

LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency

Jaejun Lee, Yoori Oh, and Kyogu Lee
Music and Audio Research Group (MARG), Seoul National University

Submitted to ICASSP 2026


Abstract

Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, which is valuable in situations where audio signals are unavailable or degraded. While recent diffusion-based models such as LipVoicer have demonstrated impressive performance in reconstructing linguistic content, they often lack prosodic consistency. In this work, we propose LipSody, a lip-to-speech framework enhanced for prosody consistency. LipSody introduces a prosody-guiding strategy that leverages three complementary cues: speaker identity extracted from facial images, linguistic content derived from lip movements, and emotional context inferred from facial video. Experimental results demonstrate that LipSody substantially improves prosody-related metrics, including global and local pitch deviations, energy alignment, and speaker similarity, compared to prior approaches.
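
As a rough illustration of what these prosody-related metrics capture, the sketch below compares the pitch and energy contours of a generated utterance against its ground-truth counterpart using librosa. The function name (prosody_metrics), the F0 search range, the sample rate, and the exact metric formulas are assumptions for illustration; the paper's actual evaluation setup is not reproduced here.

```python
# Hypothetical illustration of the prosody metrics named in the abstract; the
# paper's exact definitions and settings may differ. Assumes roughly
# time-aligned reference and generated waveforms at the same sample rate.
import numpy as np
import librosa

def prosody_metrics(ref_wav, gen_wav, sr=16000):
    """Compare pitch and energy contours of generated speech against a reference."""
    # Frame-level F0 via probabilistic YIN; unvoiced frames are returned as NaN.
    f0_ref, _, _ = librosa.pyin(ref_wav, fmin=65, fmax=400, sr=sr)
    f0_gen, _, _ = librosa.pyin(gen_wav, fmin=65, fmax=400, sr=sr)
    n = min(len(f0_ref), len(f0_gen))
    f0_ref, f0_gen = f0_ref[:n], f0_gen[:n]
    voiced = ~np.isnan(f0_ref) & ~np.isnan(f0_gen)

    # Global pitch deviation: difference of utterance-level mean log-F0.
    global_dev = abs(np.mean(np.log(f0_gen[voiced])) - np.mean(np.log(f0_ref[voiced])))

    # Local pitch deviation: mean frame-wise absolute log-F0 error on voiced frames.
    local_dev = np.mean(np.abs(np.log(f0_gen[voiced]) - np.log(f0_ref[voiced])))

    # Energy alignment: correlation between frame-level RMS energy contours.
    e_ref = librosa.feature.rms(y=ref_wav)[0]
    e_gen = librosa.feature.rms(y=gen_wav)[0]
    m = min(len(e_ref), len(e_gen))
    energy_corr = np.corrcoef(e_ref[:m], e_gen[:m])[0, 1]

    return {"global_pitch_dev": float(global_dev),
            "local_pitch_dev": float(local_dev),
            "energy_corr": float(energy_corr)}
```

Under this reading, the global deviation reacts to an overall shift in the speaker's register, while the local deviation and energy correlation track frame-level contour mismatches.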


Figure: Overview of LipSody, the proposed diffusion-based lip-to-speech framework with enhanced prosody consistency. During training, ground-truth pitch and energy values provide prosody-related supervision to the lip-to-speech network. For inference, a separately trained network predicts these prosody features from lip-movement-based linguistic content, face-image-based speaker identity, and face-video-based emotional expression.
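
To make the caption concrete, below is a minimal, hypothetical sketch of how such a prosody predictor could be wired: frame-level linguistic features from the lips are fused with utterance-level speaker and emotion embeddings, and two heads regress per-frame pitch and energy. The module name, embedding dimensions, fusion by summation, and the Transformer backbone are all assumptions for illustration, not LipSody's actual architecture.

```python
# Hypothetical prosody-predictor sketch (not LipSody's implementation).
# Maps the three cues from the figure (lip-based content, face-image speaker
# identity, face-video emotion) to frame-level pitch and energy.
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    def __init__(self, content_dim=256, spk_dim=256, emo_dim=128, hidden=256):
        super().__init__()
        # Project each cue into a shared hidden space.
        self.content_proj = nn.Linear(content_dim, hidden)
        self.spk_proj = nn.Linear(spk_dim, hidden)
        self.emo_proj = nn.Linear(emo_dim, hidden)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, dim_feedforward=4 * hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Two heads: per-frame log-F0 and per-frame energy.
        self.pitch_head = nn.Linear(hidden, 1)
        self.energy_head = nn.Linear(hidden, 1)

    def forward(self, content, speaker, emotion):
        # content: (B, T, content_dim) frame-level linguistic features from lips
        # speaker: (B, spk_dim) utterance-level identity embedding from a face image
        # emotion: (B, emo_dim) utterance-level emotion embedding from face video
        T = content.size(1)
        h = (self.content_proj(content)
             + self.spk_proj(speaker).unsqueeze(1).expand(-1, T, -1)
             + self.emo_proj(emotion).unsqueeze(1).expand(-1, T, -1))
        h = self.encoder(h)
        pitch = self.pitch_head(h).squeeze(-1)    # (B, T) predicted log-F0
        energy = self.energy_head(h).squeeze(-1)  # (B, T) predicted energy
        return pitch, energy
```

In a setup like this, the predicted contours would be supervised with ground-truth pitch and energy extracted from the paired audio, matching the training-time supervision described in the caption.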

Samples

Below are 15 sample results. Each row shows Ground Truth (GT), LipVoicer (Yemini et al., 2023), and LipSody (ours), in that order.

Rather than focusing on intelligibility (content reproduction), please pay attention to how closely the prosody matches the GT, covering global pitch, local pitch deviations, and voice similarity.

[Audio comparison table: Samples 1–15, each row with GT, LipVoicer, and LipSody audio clips.]