Conversion samples

1. Conversion samples with the benchmark

Each sample uses speech audio from the source speaker and a facial image of the target speaker.
(HYFace: the proposed method. FVMVC: the benchmark; Sheng et al., 2023.)

Sample 1 (male-to-male)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

Sample 2 (male-to-male)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

Sample 3 (female-to-female)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

Sample 4 (female-to-female)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

Sample 5 (male-to-female)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

Sample 6 (male-to-female)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

Sample 7 (female-to-male)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

Sample 8 (female-to-male)

Source Speaker (audio)
Target Speaker (image)
Converted: FVMVC (audio), HYFace (audio)

2. Conversion samples with different target images

Each sample uses speech audio from the source speaker and facial images of several different target speakers.
All results are from our proposed model, HYFace.

Sample 9

Source audio (audio)
Conversions with different target speakers (audio)

Sample 10

Source audio (audio)
Conversions with different target speakers (audio)

3. Conversion samples with different fundamental frequencies (F0)

Each sample uses speech audio from the source speaker and a facial image of the target speaker.
All results are from our proposed model, HYFace.
Which do you find more natural: the ground-truth F0 or our predicted F0?
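
The values in parentheses below are utterance-level average F0 values in Hz. As a rough, hypothetical illustration of how such an average could be measured (this is not the HYFace codebase; the file name and pitch bounds are assumptions), one might run a pYIN pitch tracker over the utterance and average the voiced frames:

```python
import numpy as np
import librosa

# Load a source utterance (file name is a placeholder, not from the HYFace release).
y, sr = librosa.load("source_utterance.wav", sr=16000)

# pYIN pitch tracking; the frequency bounds are typical speech defaults, assumed here.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz
    fmax=librosa.note_to_hz("C7"),  # ~2093 Hz
    sr=sr,
)

# Unvoiced frames come back as NaN, so nanmean averages only the voiced frames.
mean_f0 = float(np.nanmean(f0))
print(f"Utterance-level average F0: {mean_f0:.1f} Hz")
```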

Sample 11

Source speaker (audio)
Target image (image)
Conversions with different F0 (audio, values in Hz):
GT F0 (177.5), Predicted F0 (131.1), F0 control (71.1), F0 control (191.1)

Sample 12

Source speaker (audio)
Target image (image)
Conversions with different F0 (audio, values in Hz):
GT F0 (207.7), Predicted F0 (246.3), F0 control (186.3), F0 control (276.3)