This grant ran from 1 October 1997 to 31 March 2000. The award holders
are Sarah Hawkins at Cambridge, John Local and Richard Ogden at the
University of York, and Jill House and Mark Huckvale at University
College London. The project is funded by EPSRC grants GR/L53069
(Cambridge), GR/L51829 (York) and GR/L52109 (UCL).
Objectives
The project explores the viability of a phonological model that
rectifies some of the phonetic weaknesses of current concatenative and
formant-based text-to-speech systems. The new model integrates timing,
intonation and systematic segmental variation. For the selected
linguistic structures modelled, the result should be high-quality,
natural-sounding synthetic speech that is robust in noise. Our
objectives are:
Summary
Current text-to-speech systems, both concatenative and formant-based,
share some shortcomings. The speech often sounds unnatural because the
rhythm, intonation and fine phonetic detail reflecting coarticulatory
patterns are poor. As a result, although intelligibility rates may be
good, listeners experience increased cognitive load and poorer
perception in noise. These shortcomings restrict the applications for
which synthetic speech is useful. This collaborative project aims to
integrate and extend existing knowledge to produce the core of a new
model of computational phonology and phonetic interpretation which will
deliver high-quality speech synthesis. The complete model will
comprise a unified, language- and accent-independent linguistic
representation. The current project is developing a partial model,
using representative linguistic structures which will test the
viability of our approach, applied initially to Southern British
English. The three focal areas of research are intonation,
morphological structure, and systematic segmental variation. The
common factor is a temporal model that systematically structures
information from all three areas and governs the output of synthesizer
parameters. The signal generation component is based initially on
time-domain modification of natural speech signals, at times
supplemented by formant-based synthesis prepared by hand, and is being
adapted to concatenative and formant-based methods. Evaluation includes
perceptual tests for naturalness, intelligibility and communicative
success under conditions of high cognitive load.
Related research issues
York's contribution to the project is predominantly in the field of timing. We
have developed a model of timing which uses syllable structure as one
of its determining factors, and generates natural-sounding rhythms for
British English. We have a demonstration of an earlier system.
We are also interested in seeing how the "segmental" and "timing"
issues intersect. Our earlier work also suggests that
there may be different timing and rhythmical properties for Germanic
(level 2) and Latinate (level 1) lexical items in spoken English.