The Achievement Award
Pioneering Work of High-Quality Speech Synthesis with Small Memory Usage
Masami Akamine@E@Takehiko Kagoshima
Speech synthesis is a technology for producing artificial human speech from given text. Text-to-speech (TTS) synthesis systems are used, for example, to read text aloud for visually impaired people and to give drivers voice guidance to their destinations in car navigation systems.
Speech synthesis methods can be categorized into two types: unit-editing-based and corpus-based synthesis. In unit-editing-based speech synthesis, synthetic speech is produced by concatenating pieces of the speech signal called speech units. To add intonation to the synthetic speech, the prosody (pitch and duration) of these speech units is converted, prior to concatenation, to values predicted from the input text. This prosody conversion introduces heavy distortion into the synthesized speech. Before Dr. Akamine and Dr. Kagoshima started their research on speech synthesis, engineers created speech units by trial and error to reduce this distortion. The resulting speech was therefore of poor quality, with muffled voices and unnatural intonation. Moreover, creating speech units required too much time and labor for a TTS system to be developed within a short time. The other type, corpus-based speech synthesis, produces synthetic speech by selecting speech units from a huge speech database (called a corpus) and concatenating them without modifying their pitch or duration. The corpus contains speech units with all kinds of pitch and duration. This approach can produce high-quality voices. However, it requires too much memory (several gigabytes) to store the speech units, and too much computation to search the corpus for them, to be used in embedded systems such as car navigation systems.
Dr. Akamine and Dr. Kagoshima invented a new method of creating speech units for unit-editing-based speech synthesis. This method, called closed-loop training (CLT), automatically generates the speech units that minimize the distortion between synthesized speech and recorded reference speech. In CLT, they expressed the speech synthesis process as a mathematical equation and, for the first time in the world, formulated the distortion as the squared error between recorded speech and synthesized speech. The squared error is calculated over all variations of the prosody in the recorded reference speech. This formulation leads to a linear equation whose solution minimizes the distortion. CLT can create the minimum number of speech units required to produce all sounds while minimizing the distortion that is unavoidable in unit-editing-based speech synthesis. Because CLT generates speech units automatically rather than by trial and error, it made it possible for the first time to realize high-quality speech synthesis for embedded systems within a short development time.
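As a rough illustration of why a squared-error distortion leads to a linear equation, the sketch below assumes that each prosody conversion can be modeled as a linear map A_i applied to a speech unit u, so that the synthesized speech is A_i u and the distortion is the sum over i of ||r_i - A_i u||^2 for reference segments r_i. Setting the gradient to zero gives (sum of A_i^T A_i) u = (sum of A_i^T r_i). The linear-map model, the function names, and the toy data are assumptions made for this example; they are not taken from the awardees' publications.

```python
# Toy sketch of the closed-loop idea: choose the unit that minimizes the
# squared error between synthesized and reference speech.
# Assumption: prosody conversion i is a linear map A_i (list of rows).

def matvec(a, x):
    """Multiply matrix a (list of rows) by vector x."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def solve(m, b):
    """Solve m x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(m, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def train_unit(mods, refs):
    """Return u minimizing sum_i ||r_i - A_i u||^2 via the normal equations."""
    n = len(mods[0][0])
    lhs = [[0.0] * n for _ in range(n)]   # accumulates sum_i A_i^T A_i
    rhs = [0.0] * n                       # accumulates sum_i A_i^T r_i
    for a, r in zip(mods, refs):
        for row, ri in zip(a, r):
            for j in range(n):
                rhs[j] += row[j] * ri
                for k in range(n):
                    lhs[j][k] += row[j] * row[k]
    return solve(lhs, rhs)

def distortion(unit, mods, refs):
    """Total squared error between references and synthesized segments."""
    return sum((ri - si) ** 2
               for a, r in zip(mods, refs)
               for ri, si in zip(r, matvec(a, unit)))

# Two hypothetical prosody conversions of a 2-sample unit:
# identity (no change) and a stretch to 3 samples by interpolation.
mods = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]]
refs = [[1.0, 3.0], [1.2, 2.0, 2.8]]      # made-up reference segments

trained = train_unit(mods, refs)
hand_picked = refs[0]                     # unit cut directly from one recording
print(trained, distortion(trained, mods, refs))
print(hand_picked, distortion(hand_picked, mods, refs))
```

In this toy case the trained unit comes out at about [1.1, 2.9] with a total squared error of about 0.04, compared with about 0.08 for the unit cut directly from the first recording. A unit fitted against all prosody variations at once can therefore beat one that matches a single recording exactly.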
The pioneering work by Dr. Akamine and Dr. Kagoshima is highly regarded in both academia and industry. They received the Prime Minister's Award, National Commendation for Invention, from the Japan Institute of Invention and Innovation in 2008, the Society Paper Award from the Institute of Electronics, Information and Communication Engineers (IEICE) in 2003, and the Ichimura Industrial Award from the New Technology Foundation Japan in 2003, among others. Their pioneering work on high-quality speech synthesis deserves the IEICE Achievement Award.
 
 
 
