Speaking Without Sound:
Multi-speaker Silent Speech Voicing
with Facial Inputs Only

Jaejun Lee, Yoori Oh, and Kyogu Lee
Music and Audio Research Group (MARG), Seoul National University


Abstract

In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to align with the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that enhances the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can successfully generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.
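As a rough illustration of what pitch disentanglement can mean in practice, the sketch below flattens an utterance's F0 contour to its voiced-frame mean using the WORLD vocoder (pyworld) before the signal would be passed to a content encoder. This is only an assumption for illustration; the paper's pitch-flattening module may be implemented differently.

```python
# Hedged illustration: flatten an utterance's F0 contour to its voiced-frame
# mean with the WORLD vocoder (pyworld). This removes local pitch variation
# from a reference signal; it is a stand-in for the idea of pitch-disentangled
# content, not the authors' actual module.
import numpy as np
import pyworld as pw
import soundfile as sf

def flatten_pitch(wav_path, out_path):
    x, fs = sf.read(wav_path)
    if x.ndim > 1:                                   # downmix to mono if needed
        x = x.mean(axis=1)
    x = np.ascontiguousarray(x, dtype=np.float64)    # WORLD expects float64
    f0, t = pw.harvest(x, fs)                        # frame-wise F0 estimate
    sp = pw.cheaptrick(x, f0, t, fs)                 # spectral envelope
    ap = pw.d4c(x, f0, t, fs)                        # aperiodicity
    voiced = f0 > 0
    mean_f0 = f0[voiced].mean() if voiced.any() else 0.0
    f0_flat = np.where(voiced, mean_f0, 0.0)         # constant pitch on voiced frames
    y = pw.synthesize(f0_flat, sp, ap, fs)           # resynthesize with flat pitch
    sf.write(out_path, y, fs)

# flatten_pitch("reference.wav", "reference_flat.wav")
```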


Figure: Overview of the proposed multi-speaker EMG-to-Speech generation framework. In the (a) Inference phase, a content embedding is estimated from EMG signals, while the speaker embedding and global pitch information are derived from a facial image. During the (b) Training phase, the (i) face-based voice conversion network is trained using a predefined speaker-wise global pitch to estimate frame-wise F0 values. However, since global pitch values for unseen target speakers are not available during inference, a (iii) face-based global pitch estimation network is trained independently using only face images. Additionally, as the speech-based content encoder is not available during inference, the (ii) EMG-based content estimation network is trained. Each network is trained independently during the training phase.
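For readers who prefer code to diagrams, the following minimal PyTorch sketch mirrors the inference path described above: a content embedding estimated from silent EMG, a speaker embedding and global pitch estimated from a face image, and a decoder that combines them. All module names, layer choices, and the mel-spectrogram output are illustrative assumptions, not the authors' implementation.

```python
# Structural sketch of the inference path only; shapes and layers are dummies.
import torch
import torch.nn as nn

class EMGContentEncoder(nn.Module):          # (ii) EMG-based content estimation
    def __init__(self, emg_ch=8, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(emg_ch, dim, 5, padding=2), nn.ReLU(),
                                 nn.Conv1d(dim, dim, 5, padding=2))
    def forward(self, emg):                  # emg: (B, emg_ch, T)
        return self.net(emg)                 # content embedding: (B, dim, T)

class FaceEncoder(nn.Module):                # (iii) face -> speaker emb + global pitch
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.spk_head = nn.Linear(32, dim)
        self.pitch_head = nn.Linear(32, 1)   # speaker-wise global pitch (e.g., mean log-F0)
    def forward(self, face):                 # face: (B, 3, H, W)
        h = self.backbone(face)
        return self.spk_head(h), self.pitch_head(h)

class ConversionDecoder(nn.Module):          # (i) face-based voice conversion network
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.net = nn.Conv1d(dim + dim + 1, n_mels, 5, padding=2)
    def forward(self, content, spk, global_pitch):
        T = content.size(-1)
        cond = torch.cat([spk, global_pitch], dim=-1).unsqueeze(-1).expand(-1, -1, T)
        return self.net(torch.cat([content, cond], dim=1))  # mel: (B, n_mels, T)

# Inference: content from silent EMG, identity and global pitch from a face image.
emg = torch.randn(1, 8, 200)
face = torch.randn(1, 3, 128, 128)
content = EMGContentEncoder()(emg)
spk, gpitch = FaceEncoder()(face)
mel = ConversionDecoder()(content, spk, gpitch)   # a neural vocoder would then produce audio
```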

Conversion samples

Note: All source inputs here are silent EMG signals, meaning no audible sound was produced during recording. The audio in the source part is vocalized speech paired with voiced EMG, recorded independently but containing the same content as the silent EMG; the script for both EMG recordings was identical.

1. Multi-speaker conversion samples

These samples use the content information from the input silent EMG and a facial image of the target speaker.
The conversion model used here is the proposed model with the pitch-flattening module.
We recommend playing the source speech first, then listening to the converted speech while looking at the target face image.
This is because the goal of this research is not to match the target voice exactly, but to align with the target face.
We suggest playing the target voice last.

Sample 1-1 (male)

Source EMG

"He read and re-read the paper, fearing the worst had happened to me."
Target Face

Converted


Sample 1-2 (male)

Source EMG

"He heard footsteps running to and fro in the rooms, and up and down stairs behind him."
Target Face

Converted


Sample 1-3 (male)

Source EMG

"Even then he scarcely understood what this indicated, until he heard a muffled grating sound and saw the black mark jerk forward an inch or so."
Target Face

Converted


Sample 1-4 (male)

Source EMG

"People were fighting savagely for standing-room in the carriages even at two o'clock."
Target Face

Converted


Sample 1-5 (female)

Source EMG

"That was it!"
Target Face

Converted


Sample 1-6 (female)

Source EMG

"These hill-like forms grew lower and broader even as we stared."
Target Face

Converted


Sample 1-7 (female)

Source EMG

"I did not know it, but that was the last civilised dinner I was to eat for very many strange and terrible days."
Target Face

Converted


Sample 1-8 (female)

Source EMG

"The place was impassable."
Target Face

Converted






2. Conversion samples with the same source content and different target faces

These samples use the content information from the input silent EMG and a facial image of the target speaker.
The conversion model used here is the proposed model with the pitch-flattening module.

Sample 2-1

Source EMG

"I know I did."
Conversion




Sample 2-2

Source EMG

"I, too, on my side began to move
towards the pit."
Conversion








3. Local pitch variation

One of the main contributions highlighted in this paper is the pitch-flattening module. Not only does it improve intelligibility, but it also enables the model to better predict content-related local pitch. The graphs below illustrate that the model with the pitch-flattening module (Flatten) follows the ground-truth pitch variance contour more closely than the base model without the module (Base).
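For reference, the sketch below shows one way to extract frame-wise F0 with librosa's pYIN and compare the local pitch variation of converted samples against the ground truth, in the spirit of the plots above. The windowed log-F0 variance measure and the file names are illustrative assumptions, not the paper's exact evaluation.

```python
# Hedged sketch of a pitch-variance-contour comparison; metric and paths are
# placeholders, not the paper's protocol.
import numpy as np
import librosa

def f0_contour(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C7"), sr=sr)
    return np.where(voiced, f0, np.nan)              # NaN marks unvoiced frames

def local_variance(f0, win=20):
    # Sliding-window variance of log-F0 as a rough "pitch variance contour".
    logf0 = np.log(f0)
    return np.array([np.nanvar(logf0[i:i + win]) for i in range(len(logf0) - win)])

gt   = local_variance(f0_contour("ground_truth.wav"))
base = local_variance(f0_contour("converted_base.wav"))
flat = local_variance(f0_contour("converted_flatten.wav"))

n = min(len(gt), len(base), len(flat))               # crude length alignment
for name, v in [("base", base), ("flatten", flat)]:
    mask = ~np.isnan(gt[:n]) & ~np.isnan(v[:n])
    err = np.mean(np.abs(gt[:n][mask] - v[:n][mask]))
    print(f"{name}: mean |variance difference| vs. ground truth = {err:.4f}")
```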

Sample 3-1

Source
"He became alarmed at the news in this, and went again to Waterloo station."
Target
Gaddy and Klein (2020)
Source audio
Converted - Base model
Converted - Flatten model


Sample 3-2

Source
"Crossed the river, and two of them, black against the western sky."
Target
Gaddy and Klein (2020)
Source audio
Converted - Base model
Converted - Flatten model