An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis



This grant ran from 1 October 1997 to 31 March 2000. The award holders are Sarah Hawkins at Cambridge, John Local and Richard Ogden at the University of York, and Jill House and Mark Huckvale at University College London. The project is funded by EPSRC grants GR/L53069 (Cambridge), GR/L51829 (York) and GR/L52109 (UCL).


Objectives
The project explores the viability of a phonological model that rectifies some of the phonetic weaknesses of current concatenative and formant-based text-to-speech systems. The new model integrates timing, intonation and systematic segmental variation. For the selected linguistic structures modelled, the result should be high-quality, natural-sounding synthetic speech that is robust in noise. Our objectives are:

  1. Demonstration of selected parts of a text-to-speech system constructed on linguistically-motivated, declarative computational principles.
  2. A system-independent description of the linguistic structures developed.
  3. Perceptual test results using criteria of naturalness and robustness.


Summary
Current text-to-speech systems, both concatenative and formant-based, share some shortcomings. The speech often sounds unnatural because the rhythm, intonation and fine phonetic detail reflecting coarticulatory patterns are poor. As a result, although intelligibility rates may be good, listeners experience increased cognitive load and poorer perception in noise. These shortcomings restrict the applications for which synthetic speech is useful. This collaborative project aims to integrate and extend existing knowledge to produce the core of a new model of computational phonology and phonetic interpretation which will deliver high-quality speech synthesis. The complete model will comprise a unified, language- and accent-independent linguistic representation. The current project is developing a partial model, using representative linguistic structures which will test the viability of our approach, applied initially to Southern British English. The three focal areas of research are intonation, morphological structure, and systematic segmental variation. The common factor is a temporal model that systematically structures information from all three areas and governs the output of synthesizer parameters. The signal generation component will be based initially on time-domain modification of natural speech signals, at times supplemented by hand-crafted formant-based synthesis, but is being adapted to concatenative and formant-based methods. Evaluation includes perceptual tests for naturalness, intelligibility and communicative success under conditions of high cognitive load.

Related research issues
York's contribution to the project is predominantly in the field of timing. We have developed a model of timing which uses syllable structure as one of its determining factors, and generates natural-sounding rhythms for British English. We have a demonstration of an earlier system. We are also interested in how the "segmental" and "timing" issues intersect. Our earlier work also suggests that there may be different timing and rhythmical properties for Germanic (level 2) and Latinate (level 1) lexical items in spoken English.