Conversion samples

1. Conversion samples with benchmark

Each conversion uses speech audio from the source speaker and a facial image from the target speaker.
(HYFace: the proposed method. FVMVC: the benchmark, Sheng et al., 2023.)

Sample 1 (male-to-male)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

Sample 2 (male-to-male)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

Sample 3 (female-to-female)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

Sample 4 (female-to-female)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

Sample 5 (male-to-female)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

Sample 6 (male-to-female)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

Sample 7 (female-to-male)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

Sample 8 (female-to-male)

Source Speaker

Target Speaker

Converted

FVMVC

HYFace

2. Conversion samples with different target images

Each conversion uses speech audio from the source speaker and a facial image from the target speaker.
All results are from our proposed model, HYFace.

Sample 9

Source audio
Conversion with different speakers

Sample 10

Source audio
Conversion with different speakers

3. Conversion samples with different fundamental frequency (F0)

Each conversion uses speech audio from the source speaker and a facial image from the target speaker.
All results are from our proposed model, HYFace.
Which do you think sounds more natural: the ground-truth (GT) F0 or our predicted F0?

Sample 11

Source speaker	Target image
Conversion with different F0
GT F0 (177.5 Hz)	Predicted F0 (131.1 Hz)	F0 control (71.1 Hz)	F0 control (191.1 Hz)

Sample 12

Source speaker	Target image
Conversion with different F0
GT F0 (207.7 Hz)	Predicted F0 (246.3 Hz)	F0 control (186.3 Hz)	F0 control (276.3 Hz)
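
To make the F0 control values in Samples 11 and 12 easier to compare, the short sketch below computes how far each controlled value sits from the predicted F0, both as a plain difference and in semitones. It assumes the numbers in parentheses are F0 values in Hz, which the page does not state explicitly. The helper names `offset_hz` and `offset_semitones` are hypothetical and are not part of HYFace. The sketch only describes the listed numbers; it does not show how the model applies F0 control.

```python
import math

# F0 values copied from the Sample 11 and Sample 12 rows
# (assumed here to be in Hz; the page does not give a unit).
samples = {
    "Sample 11": {"predicted": 131.1, "controls": [71.1, 191.1]},
    "Sample 12": {"predicted": 246.3, "controls": [186.3, 276.3]},
}


def offset_hz(f0, reference):
    """Plain difference between an F0 value and a reference F0."""
    return f0 - reference


def offset_semitones(f0, reference):
    """Musical interval between two F0 values; 12 semitones is one octave."""
    return 12.0 * math.log2(f0 / reference)


for name, values in samples.items():
    predicted = values["predicted"]
    for control in values["controls"]:
        print(
            f"{name}: control {control} vs predicted {predicted}: "
            f"{offset_hz(control, predicted):+.1f} Hz, "
            f"{offset_semitones(control, predicted):+.2f} semitones"
        )
```

Because pitch perception is roughly logarithmic, the same difference in Hz corresponds to a larger interval for a low predicted F0 than for a high one, which is worth keeping in mind when listening to the controlled samples.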

Hear Your Face: Face-based voice conversion with F0 estimation
