EfficientSpeech: An On-Device Text to Speech Model

Abstract

State-of-the-art (SOTA) neural text-to-speech (TTS) models can generate natural-sounding synthetic voices. These models are characterized by large memory footprints and a substantial number of operations, due to the long-standing focus on speech quality with cloud inference in mind. Neural TTS models are generally not designed to perform standalone speech synthesis on resource-constrained edge devices without Internet access. This work presents EfficientSpeech, an efficient neural TTS model that synthesizes speech on an ARM CPU in real time. EfficientSpeech uses a shallow, non-autoregressive, pyramid-structure transformer forming a U-Network. EfficientSpeech has 266k parameters and consumes only 90 MFLOPS, about 1% of the size and computation of modern compact models such as Mixer-TTS. EfficientSpeech achieves an average mel generation real-time factor (RTF) of 104.3 on a Raspberry Pi 4 (RPi4). Human evaluation shows only a slight degradation in audio quality compared to FastSpeech2.
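To make the abstract's figures concrete, the following is a minimal sketch of how a real-time factor and a relative model size could be computed. It assumes RTF is defined as speech duration divided by generation time (so larger is faster), which is consistent with the reported value of 104.3 but is not spelled out on this page; some works use the inverse convention. The function names are hypothetical, and the parameter counts are the ones listed next to the audio samples on this page.

```python
# Illustrative sketch only; function names are hypothetical.

def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Assumed definition: speech duration / generation time (> 1 is faster than real time)."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def synthesis_time_for(audio_seconds: float, rtf: float) -> float:
    """Generation time implied by a given RTF under the same assumed definition."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# Parameter counts as listed with the audio samples on this page.
PARAMS = {
    "FastSpeech2": 30.8e6,
    "PortaSpeech": 21.8e6,
    "LightSpeech": 1.8e6,
    "EfficientSpeech (Tiny)": 266e3,
}


def relative_size(model: str, baseline: str) -> float:
    """Fraction of the baseline's parameter count used by the model."""
    return PARAMS[model] / PARAMS[baseline]


if __name__ == "__main__":
    # At an RTF of 104.3, 10 s of speech would need roughly 0.1 s of mel generation.
    print(f"{synthesis_time_for(10.0, 104.3):.3f} s")
    # The Tiny model relative to FastSpeech2, as a percentage of parameters.
    print(f"{100 * relative_size('EfficientSpeech (Tiny)', 'FastSpeech2'):.2f} %")
```

Under this assumed definition, the RTF covers mel spectrogram generation only; vocoder time would have to be measured separately.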

Audio Samples

Lifting a print involves the use of adhesive material to remove the fingerprint powder which adheres to the original print.

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

PortaSpeech (21.8M)


I cannot find that Calcraft was sworn in when appointed, or any exact information when the old forbidding ceremony ceased to be practiced.

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

PortaSpeech (21.8M)


Here's the identical rope at sixpence an inch.

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

PortaSpeech (21.8M)


She sent Brewer into the theatre to find the man and check the exits, told him about the assassination, and said, quote,

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

PortaSpeech (21.8M)

Note:

LightSpeech is not open source. LightSpeech samples below are from its project page.


From the time when books first took their present shape till the end of the sixteenth century, or indeed later.

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

LightSpeech (1.8M)


The modern printer, in the teeth of the evidence given by his own eyes, considers the single page as the unit, and prints the page in the middle of his paper.

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

LightSpeech (1.8M)


No definite rules, however, except the avoidance of "rivers" and excess of white, can be given for the spacing.

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

LightSpeech (1.8M)


Only nominally so, however, in many cases, since when he uses a headline he counts that in.

FastSpeech2 (30.8M)

EfficientSpeech (Tiny 266k)

EfficientSpeech (Small 952k)

EfficientSpeech (Base 4M)

LightSpeech (1.8M)