ProSynth

An Integrated Prosodic Approach to Device-Independent Natural-Sounding Speech Synthesis

Phonological Structure and Temporal Modelling

This report provides a description of some of the work conducted at York as part of the ProSynth project.

York's primary role has been to specify the phonological grammar and to develop and implement a temporal model for synthesis.

1 ProSynth: a linguistic model

ProSynth uses a phonological model which encodes phonological information in a hierarchical fashion using structures based on attribute-value pairs. Each phonological unit occurs in a complete metrical context. This context is a prosodic hierarchy with phonological contrasts available at all levels, as described in Section 1.1. The complex interacting levels of rules present in traditional layered systems are replaced in ProSynth by a one-step phonetic interpretation function operating on the entire context, which makes rule-ordering unnecessary. Whereas conventional synthesis systems use a relatively poor structure and complex, interacting rules, ProSynth instead uses a rich structure and applies simple rules of phonetic interpretation which are highly structure-bound. Systematic phonetic variation is thus constrained by position in structure. The basis of phonetic interpretation is not the segment, but phonological features at places in structure. We thus extend the principles successfully demonstrated in Local (1993) and Local & Ogden (1997) to a wider variety of phonological domains and phonetic details. The details of the units of structure and their attributes are set out in Section 1.2.

1.1 The Prosodic Hierarchy

The phonological structure is organized as a prosodic hierarchy, with phonological information distributed across the structure. The knowledge is formally represented as a kind of tree structure. Trees are commonly used for phonological representation.

The hierarchy has units at the following levels: syllable constituents (onset, rhyme, nucleus, coda); syllable; foot; accent group (AG); intonational phrase (IP). The prosodic hierarchy, building on House & Hawkins (1995) and Local & Ogden (1997), is a head-driven (Pollard & Sag, 1994) and strictly layered structure. Each unit is dominated by a unit at the next higher level (the Strict Layer Hypothesis; Selkirk, 1984). This produces a linguistically well-motivated and computationally tractable hierarchy which accords with the representational requirements of its implementation in XML. Constituents at each level have a set of possible attributes, and relationships between units at the same level are determined by the principle of headedness. Structure-sharing is explicitly recognised through ambisyllabicity.

Fig. 1 shows a partial phonological structure for the phrase Come with a bloom. Note that phonological information is spread around the structure. For example, the feature [voice] is treated as a property of the rhyme as a whole, and not of just one of the terminal nodes headed by the rhyme. Timing information is also included: in Fig. 1, the [start] of the IP is the same as the [start] of the onset of the first syllable of the utterance, and the [end] of the IP is the same as the [end] of the coda of the last syllable, as indicated by the tags (1) and (2). The value for [ambisyllabic] is shown for two consonants: note that for the [ambisyllabic:+] consonant /D/, the terminal node is immediately dominated by two other nodes.

[IP[start(1)][end(2)]
[AG[Foot[Syl[O[start(1)][k]][Rh[Nu[V]][Co[m[ambisyllabic:-]]]]]
[Syl[O[w]][Rh[Nu[I]][Co[D[ambisyllabic:+]]]]]
[Syl[O[D[ambisyllabic:+]]][Rh[Nu[@]]]]]]
[AG[Foot
[Syl[strength:strong][weight:heavy][O[b][l]][Rh[Nu[u:]][Co[end(2)][m]]]]]]]
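The re-entrant node for the ambisyllabic /D/ can be pictured with a small data-structure sketch. This is only an illustration in Python, not the project's XML representation: the class `Node` and the helpers `terminals` and `parents_of` are hypothetical names, and only the two syllables of "with a" from Fig. 1 are built.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One unit in the prosodic hierarchy (hypothetical class, not ProXML)."""
    label: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

# A single terminal object stands for the ambisyllabic consonant, so that
# both the coda of "with" and the onset of "a" can dominate the same node.
shared_D = Node("D", {"ambisyllabic": "+"})

syl_with = Node("Syl", children=[
    Node("O", children=[Node("w")]),
    Node("Rh", children=[
        Node("Nu", children=[Node("I")]),
        Node("Co", children=[shared_D]),
    ]),
])
syl_a = Node("Syl", children=[
    Node("O", children=[shared_D]),
    Node("Rh", children=[Node("Nu", children=[Node("@")])]),
])

def terminals(node):
    """Yield the terminal nodes under `node`, left to right."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from terminals(child)

def parents_of(target, roots):
    """Labels of all nodes that immediately dominate `target` (identity test)."""
    found = []
    def walk(node):
        for child in node.children:
            if child is target:
                found.append(node.label)
            walk(child)
    for root in roots:
        walk(root)
    return found

print([t.label for t in terminals(syl_with)])
print(parents_of(shared_D, [syl_with, syl_a]))
```

Because the same object appears under both the coda and the onset, any attribute set on it is visible from either syllable, which is the structure-sharing the figure describes.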

There is no separate level of phonological word within the hierarchy. Such a unit does not sit happily in a strictly layered structure, because the boundaries of prosodic constituents like AG and Foot can occur in the middle of a lexical item. Conversely, word boundaries can occur in the middle of a Foot or AG. For example, in the phrase maths department there are two feet: [maths de-], and [-partment]. The second begins in the middle of a word, and the first contains a word boundary.

The computational representation of the prosodic structure allows us to get round this problem: word-level and syntactic-level information is hyper-linked into the prosodic hierarchy. Phonetic interpretation may be sensitive to information at any level, so that it is possible to distinguish, for instance, a plosive in the onset of a weak word-final syllable from an onset plosive in a weak word-medial syllable. In this way lexical boundaries and the grammatical categories of words can be used to inform phonetic interpretation.

1.2 Units of Structure and their Attributes

Input text is parsed into both a syntactic and a phonological structure. The phonological parse allots material to places in the prosodic hierarchy and is supplemented with links to the syntactic parse. The lexicon itself is in the form of a partially parsed representation. This section describes in more detail the units of structure–in particular supra-syllabic constituents–and their attributes.

Phonological features: Features are represented as <attribute, value> pairs. Where the value of a pair is non-Boolean, as in [weight: heavy/light], we abbreviate the pair to e.g. [light] where it is clearer to do so in the text. To the set of conventional features are added the features [rhotic:±], to allow us to mimic the long-domain resonance effects of /r/ (Kelly & Local, 1989; Tunley, 1999), and [ambisyllabic:±] for ambisyllabic constituents (see later in this section). Phonological <attribute, value> pairs are distributed around the entire prosodic hierarchy rather than being located at, or associated with, just the terminal nodes, as in many phonological theories. [voice:±], for instance, is a property of the rhyme as a whole in order to model durational and resonance effects. Attributes at any level in the hierarchy may be accessed for use in phonetic interpretation.

Headedness: When a unit branches into sub-constituents, one of these constituents is its head. If the leftmost sub-constituent is the head, the unit is said to be left-headed. If the rightmost sub-constituent is the head, the unit is right-headed. AGs and feet are left-headed. Properties of a head are shared by the nodes it dominates (Broe, 1991; Ogden, 1999). Therefore a [heavy] syllable has a [heavy] rhyme; the syllable-level resonance features [grave:±] and [round:±] can also be shared by nodes they dominate: this is how some aspects of coarticulation are modelled.

The feature [head:±] is used to mark headedness. A constituent with the feature [head:+] is the head of the superordinate constituent it belongs to. In Fig. 2, headedness is indicated by vertical lines, as opposed to slanting ones. Phonetic interpretation proceeds head-first and is therefore determined in a structurally principled fashion without resort to extrinsic ordering.

Intonational Phrase (IP): The IP, the domain of a well-formed, coherent intonation contour, contains one or more AGs; minimally it must include a strong AG. The head of the IP is the rightmost AG–traditionally the intonational nucleus. The IP is the largest prosodic domain recognized in the current implementation of the ProSynth model. The attributes of IP are (1) position in discourse, (2) speech act function, (3) focus. (1) and (2) together determine f0 scaling and boundary tones; and (2) determines Pitch Accent type, whereas (3) determines intonational nucleus placement, using information from the syntax or the lexicon as a default when other discourse information is unknown.

Accent Groups (AG): AGs are units of intonation. They immediately dominate one or more feet. The head of the AG is the leftmost [heavy] foot, and is associated with an intonational pitch accent. AG attributes include [weight: heavy/light], number of component feet, position within the IP and pitch accent specifications. Only [heavy] AGs can have pitch accents assigned to them. When an IP begins with one or more unaccented syllables, we maintain the strictly layered structure by analysing them as constituting a [light] or "degenerate" AG, which in turn contains a [light] foot. Degenerate AGs have no head, cannot carry pitch accents, and can only occur as the first AG in an IP.

Feet: All syllables are organized into feet, which are units of rhythm. Types of feet are differentiated using attributes of [weight: heavy/light], [strength: strong/weak], [head:±] and number of component syllables. Feet with the attribute [head:+] act as domains for the realisation of pitch accents (see above). The attribute [weight] distinguishes between fully-formed ([heavy]) and degenerate ([light]) feet. A degenerate foot cannot act as a site for rhythmic stress because it is also [weak]. Only [strong] feet are associated with a rhythmically stressed position. The leftmost syllable within a foot acts as its head, so the syllable at the head of a [strong] foot, itself [strong], is stressed. However, [strong] syllables may occur inside [weak] feet; for example, the fourth syllable known in the phrase in the well-known maths department is [strong], but is dominated by a rhythmically [weak] foot.

Syllables: The syllable contains the constituents onset and rhyme. The rhyme branches into nucleus and coda. Nuclei, onsets and codas can all branch. Onsets and codas contain consonants, while nuclei contain vowels. Both onsets and codas contain vocalic features which are inherited from the nucleus, which is the head of the syllable. This allows for the accurate modelling of coarticulation (Coleman, 1992; Local, 1992; Ogden, 1992).

Syllables are right-headed, rhymes left-headed. Attributes of the syllable include [weight: heavy/light], and [strength: strong/weak]: these are necessary for the correct assignment of temporal compression (Section 2). Foot-initial syllables are strong.

Weight is defined with regard to the subconstituents of the rhyme. A syllable is [heavy] if its nucleus attribute [length] has the value [long] (in segmental terms, if it contains a long vowel or a diphthong). A syllable is also [heavy] if its coda has more than one constituent, as in /rEnt/, /tVsk/, /taks/. Other syllables are [light]. In polymorphemic syllables such as cat+s, the weight of the syllable is determined according to the stem, and the suffix is treated as a syllable appendix.
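The weight definition above can be written as a short function. This is a minimal sketch: the function name `syllable_weight` and its argument layout are assumptions made for illustration, and the caller is assumed to have already set aside any appendix (such as the suffix in cat+s) before passing the coda.

```python
def syllable_weight(nucleus_long: bool, coda: list) -> str:
    """Return 'heavy' or 'light' from the rhyme's subconstituents.

    nucleus_long: True if the nucleus attribute [length] is [long]
                  (a long vowel or a diphthong in segmental terms).
    coda:         the coda constituents of the stem, appendix excluded.
    """
    if nucleus_long or len(coda) > 1:
        return "heavy"
    return "light"

# Examples taken from the text (SAMPA-style symbols):
print(syllable_weight(False, ["n", "t"]))  # /rEnt/: branching coda
print(syllable_weight(False, ["v"]))       # /lVv/: short nucleus, single coda
print(syllable_weight(True, ["m"]))        # /blu:m/: long nucleus
```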

There is not a direct relationship between syllable strength and syllable weight. Strong syllables need not be heavy. In loving, /lVv/ has a [short] nucleus, and the coda has only one constituent (corresponding to /v/), yet it is the strong syllable in the foot. Similarly, weak syllables need not be light. In department, the final syllable has a branching coda (i.e. more than one constituent) and therefore is [heavy] but [weak]. ProSynth does not use extrametricality: all phonological material must be dominated by an appropriate node in structure.

Fig. 2 illustrates the partial metrical structure for the syllable, foot, AG and IP nodes for the phrase in the well-known maths department, along with low-level syntactic tags.

Fig. 2 (nodes at each level are listed left to right):

IP

  AG   [POS: 0]   [feet: 1]   [light]   [head:-]
  AG   [POS: 1]   [feet: 2]   [heavy]   [head:-]
  AG   [POS: 2]   [feet: 2]   [heavy]   [head:+]

  F    [head:-]   [weak]      [light]
  F    [head:+]   [strong]    [heavy]
  F    [head:-]   [weak]      [heavy]
  F    [head:+]   [strong]    [heavy]
  F    [head:-]   [strong]    [heavy]

  S    [head:-]   [weak]      [light]   in
  S    [head:-]   [weak]      [light]   the
  S    [head:+]   [strong]    [light]   well-
  S    [head:+]   [strong]    [heavy]   known
  S    [head:+]   [strong]    [light]   maths
  S    [head:-]   [weak]      [light]   de-
  S    [head:+]   [strong]    [heavy]   -part-
  S    [head:-]   [weak]      [heavy]   -ment

  Words:            in      the     well-known   maths   department
  Syntactic tags:   Prep.   Det.    Adj.         N(N     N)


Ambisyllabicity: Ambisyllabicity means that a consonant can simultaneously belong to two adjacent syllables. Formally, ambisyllabicity is represented as re-entrant nodes at the terminal level: i.e. a consonant may simultaneously be ultimately dominated by two syllable nodes by being in the coda of one syllable and in the onset of the next. Constituents which are shared between syllables are marked [ambisyllabic:+]. Ambisyllabicity makes it easier to model coarticulation and is essential for computing the correct temporal relations between adjacent syllables. It is also used to predict spectral properties such as plosive aspiration in intervocalic clusters.

Constituents are [ambisyllabic:+] wherever this does not result in a breach of syllable structure constraints. Loving comprises the two syllables /lVv/ and /vIN/, since /v/ is both a legitimate coda for the first syllable and a legitimate onset for the second. Loveless has no ambisyllabicity, since /vl/ is neither a legitimate onset nor a legitimate coda. Clusters may be entirely ambisyllabic, as in risky (/rIsk/+/ski/), where /sk/ is a legitimate coda and onset cluster; partially ambisyllabic (i.e. one consonant is [ambisyllabic:+], and one is [ambisyllabic:-]), as in selfish (/sElf/+/fIS/); or not ambisyllabic, as in risk them (/rIsk/+/D@m/).
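One way to read this rule is as a search for the longest stretch of an intervocalic cluster that can both close the first syllable and open the second. The sketch below follows that reading, which is an assumption and not the project's parser. The inventories `LEGAL_ONSETS` and `LEGAL_CODAS` are toy sets covering only the five examples in the text, and `shared_span` is a hypothetical name.

```python
# Toy inventories: assumptions for illustration, far from a full grammar.
LEGAL_ONSETS = {("v",), ("l",), ("f",), ("k",), ("s",), ("D",), ("s", "k")}
LEGAL_CODAS = {("v",), ("l",), ("s",), ("k",), ("s", "k"), ("l", "f")}

def shared_span(cluster):
    """Return (start, end) of the longest part of `cluster` that can be
    [ambisyllabic:+], or None if no consonant can be shared."""
    n = len(cluster)
    for length in range(n, 0, -1):            # prefer the longest shared part
        for start in range(n - length + 1):
            end = start + length
            coda = cluster[:end]               # first syllable's coda
            onset = cluster[start:]            # second syllable's onset
            if coda in LEGAL_CODAS and onset in LEGAL_ONSETS:
                return (start, end)
    return None

examples = {
    "loving": ("v",),
    "loveless": ("v", "l"),
    "risky": ("s", "k"),
    "selfish": ("l", "f"),
    "risk them": ("s", "k", "D"),
}
for word, cluster in examples.items():
    print(word, shared_span(cluster))
```

With these toy inventories the function shares /v/ in loving, the whole of /sk/ in risky, only /f/ in selfish, and nothing in loveless or risk them, matching the classification given above.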

 

2 Temporal modelling

2.1 Modelling

One of the goals of temporal modelling is to model English rhythms accurately. The ProSynth timing model is foot-based (Ogden, Local & Carter, 1999), and for any given syllable takes into account (1) its strength (2) its weight (3) its place in the foot (4) the strength and weight of adjacent syllables. Information about word boundaries is also available, allowing e.g. word-finality to influence the temporal interpretation of any syllable.

Abercrombie (1964) describes two rhythms which are important for disyllabic words in the variety of English being modelled: (1) short-long: happy, funny, city, (2) equal-equal: hamper, funding, seedy. The words with short-long rhythm have a light first syllable, while the words with equal-equal rhythm have a heavy first syllable. The vowels in the second syllables in the two sets are durationally different. For all items in the database that have utterance-final, disyllabic feet and short vowels in the first syllable, the duration of both the first and the second syllable is sensitive to the weight of the first syllable (Table 1). The duration of a second syllable after a heavy first syllable is 23% greater than after a light first syllable. The duration of the first syllable includes ambisyllabic consonants; so in a word like city, the /t/ is counted as belonging to both the first and the second syllables. This explains why the first syllable of words with short-long rhythm is nevertheless longer in duration than the second syllable.

Weight of 1st syll    Duration of 1st syll (ms)    Duration of 2nd syll (ms)
heavy                 381                          330
light                 276                          268

 

Table 1: Syllable durations in relation to weight of the first syllable in disyllabic, utterance-final feet.
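The 23% figure quoted in the text can be checked directly from the values in Table 1. The short sketch below does only that arithmetic; the helper name `percent_longer` is an assumption.

```python
# Durations in ms from Table 1: (first syllable, second syllable).
table1 = {"heavy": (381, 330), "light": (276, 268)}

def percent_longer(a_ms, b_ms):
    """How much longer a_ms is than b_ms, as a percentage."""
    return (a_ms / b_ms - 1) * 100

second = percent_longer(table1["heavy"][1], table1["light"][1])
first = percent_longer(table1["heavy"][0], table1["light"][0])
print(f"2nd syllable after heavy vs light 1st syllable: {second:.1f}% longer")
print(f"heavy vs light 1st syllable: {first:.1f}% longer")
```

The same calculation applied to the first-syllable column shows that the heavy first syllables are themselves roughly 38% longer than the light ones.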

As well as durational differences, there are also qualitative differences in the second-syllable vowels. The words with short-long rhythm have diphthongized vowels, while the words with equal-equal rhythm have monophthongal vowels. The implication of these results is that when the second syllable of words like these is phonetically interpreted, it is necessary to have information available about the strength and weight of the preceding syllable. Similar, but more complex, statements must also be made for longer feet.

As well as rhythmic properties, there are ‘segmental’ durational effects which relate to smaller stretches of speech but which (perhaps paradoxically) reflect higher levels of linguistic organisation. For example, Fougéron & Keating (1997), and Keating, Cho, Fougéron & Hsu (to appear) have shown that the duration of various segment types is sensitive to at least three levels of structure in the prosodic hierarchy. Such observations provide further evidence that the accurate modelling of durations depends on having a rich phonological structure that phonetic interpretation accesses. In other words, temporal phonetic interpretation is reliant on the informational richness which is encoded in the phonological structure.

The temporal interpretation model is based on a CART (Classification and Regression Tree) analysis of the database, taking into account the phonological features in the prosodic hierarchy. CART analysis is succinctly described by van Santen (1994:107):

[…] CART-based methods […] construct a tree by making binary splits on factors so as to minimize the variance of the durations in the two corresponding subsets. […] When a CART tree encounters a feature vector not observed in the training database, it can still find a path in the tree that, up to some point, matches the new feature bundle.

This means that if nothing in the database matches the required pattern exactly then a near approximation will be found.
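The splitting step in the quotation can be sketched as follows. This is a simplified illustration in Python, not the S-plus procedure used in the project: it only tries splits of the form "one value against all the others", and the four data rows are invented for the example. The names `deviance` and `best_binary_split` are assumptions.

```python
from statistics import mean

def deviance(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_binary_split(rows, factors, target="log_dur"):
    """Try each 'factor == value' vs 'rest' split and return the tuple
    (total deviance, factor, value) with the lowest total deviance."""
    best = None
    for factor in factors:
        for value in sorted({r[factor] for r in rows}):
            left = [r[target] for r in rows if r[factor] == value]
            right = [r[target] for r in rows if r[factor] != value]
            if not left or not right:
                continue  # not a real split
            total = deviance(left) + deviance(right)
            if best is None or total < best[0]:
                best = (total, factor, value)
    return best

# Hypothetical log10 durations, invented for illustration only.
rows = [
    {"strength": "weak", "finality": "nonfinal", "log_dur": 2.10},
    {"strength": "weak", "finality": "uttfinal", "log_dur": 2.20},
    {"strength": "strong", "finality": "nonfinal", "log_dur": 2.50},
    {"strength": "strong", "finality": "uttfinal", "log_dur": 2.70},
]
print(best_binary_split(rows, ["strength", "finality"]))
```

On this toy data the split on strength leaves less variance within the two groups than the split on finality, so it is chosen first; a full tree would then repeat the search inside each group.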

The labelled waveforms of the database and their XML-parsed description files are searched according to relevant feature information (e.g. syllable weight and strength), and a CART model is used to generalize across this data and generate duration statistics for feature bundles at given places in the phonological structure. The resulting duration model can be used to drive MBROLA diphone synthesis, since it predicts the durations of acoustic segments.

The analysis model works top-down–that is, it factors out first the effects of IP, then of AG, and so on, down the tree to the features at the terminal level. This reflects the assumption that the IP, AG, foot and syllable are all levels of timing, and that details of lower-level differences (such as segment type) can be overlaid on details of higher-level differences (such as syllable weight and strength; the strength and weight of an adjacent syllable; etc.). The top-down model also has the effect of constraining search spaces. For instance, nuclei in [light] syllables do not split by [long:±], since no light syllable can be [long:+]; therefore in a [light] syllable, the model does not attempt to sub-divide the data by [long:±]. The resulting timing model is such that each node in the hierarchy has a multiplicative compression factor associated with it. The fact that it is a multiplicative model means that the order in which the statements of temporal interpretation are applied is irrelevant. It also makes the model compositional.
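The claim that a multiplicative model is order-independent can be illustrated with a few lines of Python. The base duration and the compression factors below are invented values, not figures from the ProSynth model, and `interpret_duration` is a hypothetical name.

```python
import math
from itertools import permutations

def interpret_duration(base_ms, factors):
    """Scale a base duration by the compression factor of every
    dominating node; multiplication makes the order irrelevant."""
    return base_ms * math.prod(factors)

# Hypothetical compression factors for the nodes dominating one segment.
path_factors = {"IP": 0.95, "AG": 1.10, "Foot": 0.90, "Syl": 0.85}
base_ms = 100.0

reference = interpret_duration(base_ms, path_factors.values())
same_in_any_order = all(
    math.isclose(interpret_duration(base_ms, order), reference)
    for order in permutations(path_factors.values())
)
print(round(reference, 2), same_in_any_order)
```

Compositionality follows in the same way: the factor for a sub-tree can be computed once and reused wherever that sub-tree occurs.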

As an example, consider the interpretation of /p/ in happy. In order to interpret the /p/ accurately, the model refers to (at least) the following pieces of information:

• /p/ is located in a rhyme whose nucleus contains a short open vowel

• /p/ is [ambisyllabic:+] and is in the coda of a [strong], [light] syllable and in the onset of a [weak] syllable

Each of these facts–along with other, higher-level ones–affects the temporal interpretation of the /p/ in happy.

This method of timing assumes that segment durations, as measured from the database, are in fact what a duration model must replicate. However, another way to look at the speech signal is to consider segments as an artefact of the temporal overlaying of phonetic parameters. This view of timing has been explored in earlier work, such as Coleman (1992), Local (1992), Ogden (1992) and Local & Ogden (1997). According to this model, higher-level constituents in the hierarchy are compressed, and their daughter nodes are compressed in the same way. The temporal interpretation of ambisyllabicity is the degree of overlap that exists between syllables, so an intervocalic consonant (typically ambisyllabic) has duration properties inherited from both the syllables it is in.

The temporal consequences of ambisyllabicity can be modelled by overlaying Syllable n on to Syllable n-1, thus setting the start point of Syllable n to be before the end of Syllable n-1. By overlaying syllables to varying degrees and making reference to ambisyllabicity, it is possible to lengthen or shorten intervocalic consonants systematically. There are morphologically related differences which can be modelled in this way, provided that the phonological structure is sensitive to them; the spectral and temporal differences around the end of the first/beginning of the second syllable in the words mistakes and mistimes are examples. As another example, the Latinate prefix in- is fully overlaid with the stem to which it attaches and is [ambisyllabic:+], giving a short nasal in innocuous, while the roughly synonymous Germanic prefix un- is not overlaid to the same degree and is [ambisyllabic:-], giving a long nasal in unknowing. Current work focuses on integrating the segment-based and the more syllable-based approaches in the model.
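The overlaying of syllables can be sketched as simple interval arithmetic. In the sketch below the function name `overlay` and the 60 ms overlap are assumptions; the two syllable durations are borrowed from the light row of Table 1 purely as plausible inputs.

```python
def overlay(durations_ms, overlaps_ms):
    """Return a (start, end) pair for each syllable.

    Syllable n starts overlaps_ms[n-1] before the end of syllable n-1,
    so a larger overlap shortens the shared intervocalic stretch overall.
    """
    spans = []
    start = 0.0
    for i, dur in enumerate(durations_ms):
        if i > 0:
            start = spans[-1][1] - overlaps_ms[i - 1]
        spans.append((start, start + dur))
    return spans

# Two syllables sharing an ambisyllabic consonant (overlap is hypothetical).
spans = overlay([276, 268], [60])
print(spans)
print("total duration:", spans[-1][1])
```

Setting the overlap to zero gives plain concatenation, so varying that one number per syllable boundary is enough to lengthen or shorten the shared consonant.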

 

2.2 Implementation

ProSynth is implemented in a modular fashion, using the ProXML language written and developed as part of the project. Temporal information is coded in an independent ProXML script, allowing us to test a variety of models simply by exchanging scripts.

The regression tree approach is pursued using the S-plus statistics package. ProXML scripts are used to extract structural information from our database. The top-down approach outlined above necessitates that previously modelled information be filtered out of each successive pass through the database so that, for example, rhyme information takes account of the analysis performed at syllable level. Programs have been developed to enable us to carry out the modelling in this fashion.

Each successive pass is then used to generate regression trees which impose binary splits on the data. If any split produces a statistically insignificant difference, or a difference of less than the average glottal period of our speaker (approximately 5 ms), it is discarded and the data are retained as one group without being split. Since values are returned for splits at each level (not just at terminal nodes), discarding a split is simple: the value already returned for the parent node is used.
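The 5 ms criterion can be sketched as a small check on the two halves of a split. This covers only the glottal-period threshold, not the test of statistical significance, and it assumes that the difference is measured in milliseconds after converting the logarithmic values back; the function name `keep_split` is hypothetical.

```python
def keep_split(log_a, log_b, min_diff_ms=5.0):
    """True if two group means, given as log10 of a duration in ms,
    differ by at least min_diff_ms once converted back to ms."""
    return abs(10 ** log_a - 10 ** log_b) >= min_diff_ms

# 2.094 and 2.135 are sibling values in the example tree in this section.
print(keep_split(2.094, 2.135))
# A hypothetical pair that is too close to be worth splitting.
print(keep_split(2.094, 2.100))
```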

The resulting amended regression trees are then recoded into ProXML to produce a script which is combined with other information in our final synthesis.

An example of part of a regression tree from our current model is given here, first in graphical form, then in text:

 
[Figure: regression tree in graphical form; the text version follows.]


node), split, n, deviance, yval
      * denotes terminal node

1) root 1715 141.600 2.327  
  2) Strength:WEAK 1087  55.940 2.174  
    4) Finality:footfinal,nonfinal 887  33.840 2.108  
      8) Finality:footfinal 572  25.500 2.094 *
      9) Finality:nonfinal 315   8.004 2.135 *
    5) Finality:uttfinal 200   1.660 2.462 *
  3) Strength:STRONG 628  15.460 2.594  
    6) Finality:nonfinal 356   5.693 2.493 *
    7) Finality:uttfinal 272   1.385 2.726 *


The data represented here are split first into two groups (nodes 2 and 3 in the text version of the regression tree, where indentations represent successive nested levels of splits) depending on their value for the syllable attribute STRENGTH. The more important the split, the more vertical separation there is between nodes in the graphical version of the tree.

An additional variable has been added here as a shorthand to take account of syllable position in larger pieces of structure: "Finality" has the values "uttfinal" (meaning final position in the utterance), "footfinal" (meaning final position in a non-utterance-final foot) and "nonfinal" (meaning all other positions). The durations are given as base 10 logarithms so, for example, footfinal weak syllables are given the value 2.094 (that is, a duration of approximately 124.2 ms).
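The text version of the tree can be turned into a small lookup, which also shows the conversion from base 10 logarithms to milliseconds. The sketch below is an illustration only: the class `TreeNode` and the functions `predict_log10` and `to_ms` are hypothetical names, and falling back to the parent's value for a combination the tree does not list follows the behaviour described in the van Santen quotation, not documented ProSynth behaviour.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    yval: float                       # mean log10 duration (ms) at this node
    factor: Optional[str] = None      # attribute tested below this node
    children: dict = field(default_factory=dict)  # frozenset(values) -> TreeNode

TREE = TreeNode(2.327, "Strength", {
    frozenset({"WEAK"}): TreeNode(2.174, "Finality", {
        frozenset({"footfinal", "nonfinal"}): TreeNode(2.108, "Finality", {
            frozenset({"footfinal"}): TreeNode(2.094),
            frozenset({"nonfinal"}): TreeNode(2.135),
        }),
        frozenset({"uttfinal"}): TreeNode(2.462),
    }),
    frozenset({"STRONG"}): TreeNode(2.594, "Finality", {
        frozenset({"nonfinal"}): TreeNode(2.493),
        frozenset({"uttfinal"}): TreeNode(2.726),
    }),
})

def predict_log10(features, node=TREE):
    """Follow the tree as far as the features match; stop at the last
    matching node if the next split has no branch for the value."""
    while node.factor is not None:
        value = features.get(node.factor)
        match = next((child for values, child in node.children.items()
                      if value in values), None)
        if match is None:
            break
        node = match
    return node.yval

def to_ms(log10_ms):
    """Convert a base 10 logarithm back to a duration in ms."""
    return 10 ** log10_ms

weak_footfinal = predict_log10({"Strength": "WEAK", "Finality": "footfinal"})
print(weak_footfinal, round(to_ms(weak_footfinal), 1))
```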

The present model employs knowledge about foot and word position and is fully implemented for syllable, rhyme, nucleus, onset and coda nodes.

Our database is metrically designed and gives a good coverage of a variety of metrical structures. However, it is not (and was never designed to be) well-balanced with regard to what might be termed segmental information (in our terms, information at the CNS and VOC nodes).

For this reason, we are implementing models of CNS and VOC based on information from a second, segmentally-designed database. These models need to take into account more closely the maximum and minimum durations found in the database and so we are using a Z-score model at this level.

In our Z-score models, consonants are grouped into nine phonetic categories: plosives (voiced and voiceless), affricates (voiced and voiceless), fricatives (voiced and voiceless), nasals, liquids and glides. Based on observations in the database and past experience with YorkTalk synthesis, vowels are grouped into three categories: schwa, open vowels and non-open vowels (vowel length is handled by the attribute LONG at the NUC node).
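The general idea of a Z-score duration model can be sketched as follows. The sample durations are invented, the function names are hypothetical, and clipping to the observed minimum and maximum is one possible way of respecting the range found in a database; none of this is taken from the ProSynth implementation.

```python
import math
from statistics import mean, stdev

def fit_category(durations_ms):
    """Summarise one phonetic category by mean, sd, min and max."""
    return {
        "mean": mean(durations_ms),
        "sd": stdev(durations_ms),
        "min": min(durations_ms),
        "max": max(durations_ms),
    }

def to_z(duration_ms, stats):
    """Express a duration as standard deviations from the category mean."""
    return (duration_ms - stats["mean"]) / stats["sd"]

def from_z(z, stats):
    """Turn a Z-score back into ms, clipped to the observed range."""
    raw = stats["mean"] + z * stats["sd"]
    return min(stats["max"], max(stats["min"], raw))

# Hypothetical nasal durations in ms, invented for illustration.
nasals = fit_category([50, 60, 70, 80])
print(from_z(0.0, nasals))   # the category mean
print(from_z(5.0, nasals))   # clipped to the longest observed nasal
print(round(to_z(80, nasals), 2))
```

Because every category is described on the same standardised scale, one Z-score can stretch or compress segments of different types by comparable amounts.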

Future work will explore the possibilities of extending the Z-score approach to other nodes in our structure, and also integrating the Z-score approach with regression tree modelling.

 

3 Perceptual Testing

Perceptual tests are due to be carried out in March-April. We will conduct a factorial experiment to evaluate the overall naturalness of synthetic speech which has the ProSynth model of duration, intonation and resonance. If these details are accurately modelled, we predict that the synthetic speech should sound more natural and should therefore be more easily interpretable under a high cognitive load, such as a noisy background.

The stimuli will be generated in PROCSY, using speech from the speaker on our database, and the resulting signal will be manipulated so as to contain the durations predicted by the ProSynth model. This will be compared against the durations as predicted by a leading temporal model. The speech will be played to subjects over headphones, with cafeteria noise in the background. Subjects will be asked to write down what they hear. The transcripts will then be read and marked according to the number of errors they contain.

York's tests are designed to test whether our temporal model of syllables in non-final feet as described above improves the naturalness of synthetic speech. Complete sentences will be used which have the following structure:

(defective foot) | Foot (sw/sww) | Foot (sw/sww) | Foot.

This gives four structures.
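The count of four follows from two free choices (sw or sww) in each of the two non-final feet. A short enumeration makes this explicit; the string layout is only for display.

```python
from itertools import product

foot_types = ["sw", "sww"]  # strong-weak and strong-weak-weak feet

structures = [
    f"(defective foot) Foot ({first}) Foot ({second}) Foot."
    for first, second in product(foot_types, repeat=2)
]
for s in structures:
    print(s)
print(len(structures))
```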

Examples:

(defective foot) Foot (sw) Foot (sw) Foot.
It's a foreign work of art.
You'll need to use the lift.
It was found in South A- -merica.

 

(defective foot) Foot (sww) Foot (sw) Foot.
The fabric has got a mark.
It's ancient and fragile parchment.
The ce- -ramics are round the back.

 

(defective foot) Foot (sw) Foot (sww) Foot.
The entrance needs to be widened.
The ship was buried in mud.
It's thirteenth century ivory.

 

(defective foot) Foot (sww) Foot (sww) Foot.
It's our latest col- -lection of armour.
It's an elegant amethyst necklace.
It dates from Vic- -toria's reign.
 

 

 

References

  1. Abercrombie, D. (1964). `Syllable quantity and enclitics in English.' In In Honour of Daniel Jones (D. Abercrombie, Fry D.B., MacCarthy P.A.D., Scott, N.C. & Trim, J.L. eds.) Longman Green, London, pp. 216-222.
  2. Broe, M. (1991). `A unification-based approach to Prosodic Analysis.' Edinburgh Working Papers in Cognitive Science 7, pp. 27-44.
  3. Coleman, J. S. (1992). `The phonetic interpretation of headed phonological structures containing overlapping constituents.' Phonology 9, pp. 1-44.
  4. Fougéron, C. & Keating, P. (1997). `Articulatory strengthening at edges of prosodic domains.' Journal of the Acoustical Society of America 101 (6), pp. 3728-3740.
  5. House, J. & Hawkins, S. (1995). `An integrated phonological-phonetic model for text-to-speech synthesis.' In Proceedings of the XIIIth International Congress of Phonetic Sciences (Elenius, K., and Branderud, P. eds.) 2, pp. 326-329. KTH and Stockholm University, Sweden.
  6. Keating, P., Cho, T., Fougéron, C. & Hsu, C-S. (to appear). `Domain-initial strengthening in four languages.' To appear in Papers in Laboratory Phonology VI. (J.K. Local, R.A. Ogden and R.A.M. Temple eds.)
  7. Kelly, J. & Local, J.K. (1989). Doing phonology. Manchester, Manchester University Press.
  8. Local, J.K. (1992). `Modelling assimilation in a non-segmental rule-free phonology.' Papers in Laboratory Phonology II: Gesture, Segment, Prosody. Docherty, G. J. & Ladd, D. R. Cambridge, Cambridge University Press, pp. 190-223.
  9. Local, J.K. (1993). `"Segmental" Intelligibility of the YorkTalk non-segmental speech synthesis system.' York Research Papers in Linguistics.
  10. Local, J.K. & Ogden R. (1997). `A model of timing for nonsegmental phonological structure.' In Progress in Speech Synthesis (van Santen, J.P.H., Sproat R.W., Olive J.P. & Hirschberg J. eds.) Springer, New York. pp. 109-122.
  11. Ogden, Richard (1992). `Parametric interpretation in YorkTalk.' York Papers in Linguistics 16, pp. 81-99.
  12. Ogden, Richard (1999). `A syllable level feature in Finnish.' In The syllable: views and facts (van der Hulst, H. & Ritter, N., eds.). pp. 651-672. Berlin, Mouton de Gruyter.
  13. Ogden, R., Local, J. & Carter, P. (1999). `Temporal interpretation in ProSynth, a prosodic speech synthesis system.' In Proceedings of the XIVth International Congress of Phonetic Sciences (Ohala, J.J., Hasegawa, Y., Ohala, M., Granville, D., and Bailey, A.C. eds.), 2, pp. 1059-1062. University of California, Berkeley, CA.
  14. Pollard, C. & Sag, I.A. (1994). Head-Driven Phrase Structure Grammar. Chicago: The University of Chicago Press.
  15. Selkirk, E. (1984). Phonology and syntax: the relation between sound and structure. Cambridge, MA: MIT Press.
  16. Tunley, A. (1999). Coarticulatory influences of liquids on vowels in English. Unpublished PhD dissertation, University of Cambridge, UK.
  17. van Santen, J. (1994). `Assignment of segmental duration in text-to-speech synthesis.' Computer Speech and Language 8, pp. 95-128.

 

 

