THE CONTRIBUTION OF PARSING TO PROSODIC
PHRASING IN AN EXPERIMENTAL
TEXT-TO-SPEECH SYSTEM
Joan Bachenko, Eileen Fitzpatrick, C. E. Wright
AT&T Bell Laboratories
Murray Hill, New Jersey 07974
ABSTRACT
While various aspects of syntactic structure have been shown to bear on the determination of phrase-level prosody, the text-to-speech field has lacked a robust working system to test the possible relations between syntax and prosody. We describe an implemented system which uses the deterministic parser Fidditch to create the input for a set of prosody rules. The prosody rules generate a prosody tree that specifies the location and relative strength of prosodic phrase boundaries. These specifications are converted to annotations for the Bell Labs text-to-speech system that dictate modulations in pitch and duration for the input sentence.
We discuss the results of an experiment to determine the performance of our system. We are encouraged by an initial 5 percent error rate, and we see the design of the parser and the modularity of the system allowing changes that will improve this rate.
INTRODUCTION
We describe an experimental text-to-speech system that uses a deterministic parser and prosody rules to generate phrase-level pitch and duration information for English input. This information is used to annotate the input sentence, which is then processed by the text-to-speech programs currently under development at Bell Labs. In constructing the system, our goal has been to test the hypotheses (i) that information available in the syntax tree, in particular grammatical functions such as subject-predicate and head-complement, is by itself useful in determining prosodic phrasing for synthetic speech, and (ii) that it is possible to use a syntactic parser that specifies grammatical functions to determine prosodic phrasing for synthetic speech.
Although certain connections between syntax and prosody are well-known (e.g. the influence of part of speech on stress in words like progress, or the setting off of parenthetical expressions), very little practical knowledge is available on which aspects of syntax might be connected to prosodic phrasing. In many studies, investigators have sought connections between constituent structure and prosody (e.g. Cooper and Paccia-Cooper 1980; Umeda 1982; Gee and Grosjean 1983) but, with the exception of Selkirk (1984), they tend to neglect the representation of grammatical functions in the syntax tree. Moreover, previous work has not been specific enough to provide the basis for a full system implementation. Based on our study of prosodic phrasing in recorded human speech, we decided to emphasize three aspects of structure that relate to phrasing: syntactic constituency, grammatical function, and constituent length. These findings, which we will discuss in detail, have been implemented as a collection of prosody rules in an experimental text-to-speech system.
Two important features characterize our system. First, the input to our prosody system is a parse tree generated by a version of the deterministic parser Fidditch (Hindle 1983). The left-corner search strategy of this parser and, in particular, its determinism, give Fidditch the speed that makes online text-to-speech production feasible.¹ In building a parse tree, Fidditch identifies the core subject-verb-object relations but makes no attempt to represent adjunct or modifier relations. Thus relative clauses, adverbials, and other non-argument constituents have no specified position in the tree and no specified semantic role. Second, the rules in the prosody system build a prosody tree by referring both to the syntactic structure and to earlier stages of prosodic structure. The result is a hierarchical representation that supports the view, also proposed in Selkirk (1984), that grammatical function information is related to prosodic phrasing, but indirectly, through different levels of processing.
Informal tests of the system show that it is capable of producing a significant improvement in the prosodic quality of the resulting synthesized speech. Our investigations of the system's problems, which we describe, have not revealed any serious counterexample to our basic approach. In many cases it appears that problems with the current version can be resolved by taking our approach a step further, and including lexical information required by the parser as another factor in the determination of prosodic phrasing.
TEXT-TO-SPEECH
Most text-to-speech systems comprise two components: pronunciation rules and a speech synthesizer. Pronunciation rules convert the input text into a phonetic transcription; this information may also be supplemented by a dictionary that provides information about the part of speech, stress pattern, and phonetic makeup of particular words. The speech synthesizer then converts this phonetic transcription into a series of speech parameters which are subsequently processed to produce digitized speech.
While these systems tend to perform quite well on word pronunciation, they fall short when it comes to providing good prosody for complete sentences. Current text-to-speech systems have no access to the syntactic and semantic properties of a sentence that influence phrase-level prosody. Hence rules for sentence prosody, when they are provided at all, typically depend on superficial aspects of text (e.g. punctuation) and on heuristics that vary widely in sophistication. Although such techniques often add a more natural quality to the resulting synthetic speech, they can fail in important ways, for example, by ignoring the prosodic event between a lengthy subject and a predicate, so that there is no clear prosodic boundary between right and mark in The characters on the right mark the salient features.²
1 With a grammar of about 600 rules and a lexicon of about 2400 words, Fidditch parses the 25 sample sentences of Robinson (1982), averaging 7 words per sentence and chosen for their structural diversity, at an average rate of 405 seconds per sentence on a Symbolics 3670. The rate is approximately proportional to the number of words in a sentence.
Several authors (e.g. Allen 1976; Elovitz et al. 1976; Luce et al. 1983) have suggested that prosodic differences between synthetic and natural speech are the primary, unaddressed factor leading to difficulties in the comprehension of fluent synthetic speech. The relation between phrase-level prosody and its sources, however, is so poorly understood that we have no good sense of the degree to which different levels of explanation (syntactic, semantic, or pragmatic) are applicable. We currently have reasonable tools for automatic syntactic analysis of a text, but there is nothing equivalently well-developed for semantic or pragmatic textual analysis. Thus an obvious goal is to explore the extent to which phrase-level prosody can be explained by the syntax tree and to develop a detailed description of that relation. A further goal is to convert the resulting insights about this relation into a system that can work with a speech synthesizer. This allows us to test our description more adequately and perhaps also produce something that will further text-to-speech technology.
SYNTACTIC STRUCTURE AND
PROSODIC PHRASING
Certain relations between syntax and prosody, especially at the word level, are well-known. For example, the syntactic category of a word may affect its phonetic realization, as in the verb/adjective distinction of separate, approximate, and the verb/noun distinction of house, wind, lives. Likewise, syntactic category affects word stress, so that verbs such as
whereas the corresponding nouns receive penultimate stress.
Beyond the word level, however, there has been little investigation of systematic connections between syntactic structure and prosodic phrasing. The psycholinguistic and acoustic investigations of Cooper and Paccia-Cooper (1980), Umeda (1982), and Gee and Grosjean (1983), and the prosodic theory of Selkirk (1984), are among the more notable studies and represent the two main approaches to syntax/prosody relations. In Cooper and Paccia-Cooper (1980) and Umeda (1982), the connection from syntax to prosodic phrasing is unmediated by any filtering process, i.e. they propose that the details of prosodic phrasing can be determined directly from syntactic structure by associating particular syntactic nodes (or constituent boundaries) with a phonetic value, either pausing, segmental lengthening, or the blocking of the cross-word conditioning of phonological rules. By contrast, Gee and Grosjean (1983) and Selkirk (1984) believe that the syntax-prosody relation is indirect: prosodic phrasing is derived by rules that refer to left-to-right ordering, length (or branching patterns), and, in the case of Selkirk, grammatical function, as well as constituent membership, in order to infer a hierarchical prosodic structure. But while their respective positions are quite clear, none of these studies is conclusive. All lack a syntactic framework sufficiently detailed and formalized to allow extensive testing, and most consider only a small number of sentences and sentence types.³
2 Note that without a syntactic analysis that correctly identifies grammatical functions, it is impossible to determine whether the word mark is a noun ending the subject phrase or the verb of the predicate phrase. Simple "surface" parsers, such as that described in Umeda and Teranishi (1974), will still fail to identify the prosodic boundary correctly.
To develop our analysis, we first examined prosodic phrasing in the speech of one of us reading prose from various texts, including four instruction manuals. These texts were later augmented by a professional reading of a prose story. The boundaries between prosodic phrases were identified and then classed according to their syntactic context and semantic function.
Our results, which are outlined below, indicate an organization of the prosodic phrases that supports the 'indirect relationship' approach of Gee and Grosjean (1983) and Selkirk (1984). We found that, in our corpus, prosodic phrasing depends on three aspects of structure: the breakdown into syntactic constituents, the grammatical function of a constituent, and constituent length. Let us review each of these factors.
Syntactic Constituency
The possible constituents recognized by our parser are Noun Phrase (NP), Verb Phrase (VP), Adjective Phrase (AdjP), Adverb Phrase (AdvP), and Prepositional Phrase (PP). In general, we found that syntactic constituency is particularly important for predicting points at which a prosodic phrase boundary is not produced, i.e., the words within a syntactic constituent cohere. For example, the italicized phrases in (1)-(5) had no perceptible boundaries at the locations indicated by #:
(1) Left-hand # power unit is connected
(2) This procedure shows # you
(3) An extremely # narrow opening
(4) To spread powerload more # evenly
(5) next # to any powered di-group
The single exception to word cohesion within syntactic constituents involved boundaries between the verb and its first or second object when the object in question was lengthy. We discuss this exception below.
3 Gee and Grosjean (1983) use a corpus of 14 sentences. Umeda (1982) considers a large corpus but, like Gee and Grosjean, does not distinguish among grammatical functions. Although Selkirk cites many examples in her discussions of phrasal stress and word-level prosody, her description of prosodic phrasing focuses on only a single example.
Grammatical Functions
Our sample indicated that phrase boundaries are also determined by the grammatical relations among the syntactic constituents, i.e. the argument structure of the sentence. Four grammatical relations concern us:
(a) subject-predicate, as in The 48-channel module has two di-groups.
(b) head-complement, where the head can be a noun, verb, or adjective and may have one complement, e.g. has two di-groups, or two complements, e.g. shows you how to fly your kite.
(c) sentence-adjunct, as in Insert unit into correct shelf location per detail instructions.
(d) head-modifier, where the head can be a noun, verb, adverb, or adjective and the modifier can be one of several things, depending on the head (e.g., for nouns, the modifier can be a relative clause; for verbs, it can be a prepositional phrase; for adjectives and adverbs, the modifier can be a comparative).
We observed a hierarchy among these relations with respect to the strength, or perceptibility, of a prosodic boundary, with the boundary between sentence and adjunct receiving the highest potential boundary strength, followed by the subject-predicate boundary, then the head-complement and head-modifier boundaries. Thus in (6), there is a strong boundary between subject and predicate, whereas in (7), due to the strong boundary between adjunct and core sentence, the subject-predicate boundary diminishes. (Dashes indicate the location of the boundary being discussed.)
(6) The name of the character is not pronounced
(7) When this switch is off the name of the character is not pronounced
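The hierarchy among grammatical relations can be sketched as a simple ranking. This is an illustration only, not the system's implementation: the names `INTRINSIC_RANK` and `strongest_boundary` and the numeric ranks are assumptions introduced here, and only their ordering is taken from the observations above. Head-complement and head-modifier share a rank because the text does not order them relative to each other.

```python
# Hypothetical ranks: a higher number means a stronger potential boundary.
# Only the ordering comes from the text; the numbers are illustrative.
INTRINSIC_RANK = {
    "sentence-adjunct": 3,
    "subject-predicate": 2,
    "head-complement": 1,
    "head-modifier": 1,
}


def strongest_boundary(boundaries):
    """Return the (word_index, relation) pair with the highest rank.

    `boundaries` is a list of (word_index, relation) pairs, where
    word_index is the number of words preceding the boundary.
    Ties keep the leftmost boundary, because max() returns the first
    maximal item it encounters.
    """
    return max(boundaries, key=lambda b: INTRINSIC_RANK[b[1]])


# Sentence (7): the adjunct ends after word 5 ("off") and the subject
# ends after word 10 ("character").
example_7 = [(5, "sentence-adjunct"), (10, "subject-predicate")]
print(strongest_boundary(example_7))  # (5, 'sentence-adjunct')
```

The sketch captures only intrinsic strength; it does not model how the subject-predicate boundary diminishes in the presence of a stronger neighbor.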
Constituent Length
While we may view each boundary as having an intrinsic strength based on constituency and grammatical function, the determination of actual strengths appears to depend on the interaction of the intrinsic strength of a boundary with the strengths of other boundaries in the sentence, as well as the distance between these boundaries. The most salient of the interactions we observed was between the placement of a boundary at the subject-predicate junction and the placement of a boundary following the verb-complement junction. The mediating factor in this interaction was the relative length of the subject with respect to the length of the verb's complements. Thus a sentence such as (8), with both a short subject and a single short object, generally is produced without a boundary in either position.
(8) You have completed the task
But if, as in (9), the subject is long relative to the object, then a break occurs between the subject and predicate. Conversely, if the subject is short relative to the object, then a break will occur between the verb and the object, as in (10). Or, if there are two objects and the first is simple, the break will occur between them, as in (11).
(9) The materials required are one kite kit
(10) How shall we judge the goodness of an algorithm?
(11) This procedure shows you how to fly your kite
AN EXPERIMENTAL PROSODY SYSTEM
Our findings confirmed that syntactic structure plays a major role in determining prosodic structure, but the relationship is indirect: the exact influence of syntactic constituency varies according to the length and grammatical function of each constituent. To refine and test this idea, we implemented an experimental text-to-speech system in which rules apply to a parse tree to infer prosodic structure and then annotate the input string with phrasing information derived from the prosodic structure; this annotated input string is submitted to the Bell Labs text-to-speech programs, which convert it into a speech file. Our system comprises three components: a parser that builds syntactic structure, rules that derive prosody information from the syntactic structure, and the Bell Labs text-to-speech programs. The parser and speech programs are independent components. The prosody rules act as a filter between them, converting the syntactic information generated by the parser into prosodic information that can be supplied to the text-to-speech programs.
Parsing
Our parser is a version of Fidditch (Hindle 1983), a moderate coverage parser based on the deterministic model described in Marcus (1980). To build syntactic structure, Fidditch uses a grammar that requires the representations produced by lexical and syntactic rules to be consistent with the (semantic) predicate-argument structure. The surface syntactic structures generated by the parser represent the argument structure of a phrase or sentence, i.e. the "core" constituents of a sentence (its subject (NP), modality (AUX), and predicate (VP)) and the complements of phrasal heads. The structure is determined, for the most part, by rules that refer to argument information that is specified in the lexicon for the content words (nouns, verbs, adjectives, adverbs), and by rules that insert null terminals such as the "trace" of wh-movement. In general, the grammar is consistent with the government and binding framework of Chomsky (1981), as adapted to the needs of a parser.
The input to the parser is a phrase or sentence (punctuation is optional). Its output is a surface structure tree in which the status of a constituent with respect to the predicate-argument structure of the sentence is indicated by the constituent's attachment to higher nodes in the tree. Thus only constituents that belong to the core are attached to the S node, and only complements of a phrasal head can become righthand sisters of the head. Adjuncts and modifiers, whose role depends on semantic and pragmatic information about the discourse domain, have no
represented as "orphan" nodes in the tree.
For example, Figure 1 shows the parse tree for Left-hand power unit on each shelf in 48-channel module can power only the echo cancelers that are in that shelf.⁴ The structure in Figure 1 contains a single core
each shelf and in 48-channel module, and the adverb only, which are unattached constituents. This is the significance of the unlabeled node dominating each of these constituents. The PPs are not attached because unit is not lexically marked to take a PP headed by on
marked to take a PP complement headed by in. Nor is
argument
that shelf. In the relative clause, T is a null terminal that stands for the trace of the relativized subject NP; the * in tense stands for a null Aux element. Because nouns do not select relative clauses as arguments (any noun can be relativized), the parser does not identify the relations of the modifier constituent to the elements of the core sentence. Hence the relative clause is not attached to any other syntactic node in the tree.
Text-to-speech Synthesis
The programs that make up the speech component are described in Liberman and Buchsbaum (personal communication). These programs take English text as
annotating the input text to this system, many aspects of its operation can be overridden or modified: e.g. the location of major and minor phrase boundaries, the stress given to words, the transcription of words and the boundaries between them, the timing of segments, and details of the pitch contour. As we will show, with our prosody system we are able to produce strings in which four boundary levels are identified and perceptually distinguished, using the current text-to-speech system annotations.
Prosodic Phrasing
constituent structure, grammatical role, and length to map a surface structure such as that in Figure 1 onto a prosody tree such as that in Figure 2. The prosody tree identifies the location of phrase boundaries (signified by the φ nodes) and the relative strength of each boundary (signified by a number in the φ node). It is this information that is used to annotate the input text with escape sequences that provide the text-to-
phrasing
In formulating our rules for building the prosodic
implementing the model of Gee and Grosjean (1983)
This model, initially proposed to predict a form of
prosodic boundaries from a syntactic tree, but assumes rather than explicitly presents a syntactic component.
We were initially attracted to the Gee and Grosjean model because of its emphasis on relative boundary weighting, i.e., on the determination of the strength of a given boundary with respect to the other boundaries in the sentence. We found that in the data we had collected, this weighting played an important role. In fact, we incorporated directly into our system one method of doing this weighting, namely Gee and Grosjean's rule to determine the strengths of the
relative length (as measured by terminal node count)
As we extended Gee and Grosjean's model to create an algorithm adequate for use in a general purpose system, our algorithm diverged from its starting point, reflecting our attempts to correct weaknesses and lacunae that we encountered in the Gee and Grosjean model. That we encountered these
between our goals and those of Gee and Grosjean. The most important difference between the Gee
involves the factors determining boundary weight. Gee and Grosjean assume that this weighting is dependent only on the number of syntactic nodes, their left-to-right ordering and, in the case of the verb phrase, on constituent length. In contrast, our data, in agreement with Selkirk's (1984) theoretical analysis, indicated that boundary strength is dependent on the grammatical functions that the constituents in a given sentence play. In particular, we observed a hierarchy among these functions with respect to boundary strength, as discussed below.⁵
In addition to incorporating grammatical function information into our system, we fleshed out the model of Gee and Grosjean to deal with syntactic structures that they do not explicitly consider. In particular, Gee and Grosjean's strictly left-to-right building of the prosodic tree left certain questions open. For example, their model does not deal with sentences embedded in the middle of a main sentence (as in The notion [that he would refrain from such an act] was incorrect.) We incorporate embedded sentences into the prosodic tree in a cyclic manner to ensure that the material in the embedded sentence is processed before that in the main sentence.⁶ In addition, Gee and Grosjean leave open the treatment of the multiple rightward embedding of non-sentential constituents, e.g., the NP embedding in The destruction of the good name of his father. Our approach is to handle these cases recursively, from the most deeply embedded phrase up, in order to preserve the prosodic cohesion of the entire NP.
5 As an example of the effect that grammatical functions have
young man left. We view this sentence as consisting of two grammatical relations: subject-predicate and adjunct-sentence.
In our hierarchy of grammatical relations, the boundary between the adjunct and the sentence is more salient than the boundary between the subject and the predicate. The system
If we exclude any effects of grammatical functions and assume a simple left-to-right attachment of the three
prosody tree, we would assign a stronger boundary following
man than following Finally. It is not clear that Gee and Grosjean make this left-to-right assumption in such examples
complementizer node in the syntax tree and it is difficult to determine whether they integrate the material in the complementizer with the material in the core sentence as they are analyzing the material in the core sentence or after that analysis is completed. If they integrate the complementizer
with the sentence in a left-to-right manner and predict, incorrectly, that the stronger boundary occurs after man. If they complete the prosodic analysis of the core sentence before bundling the sentence with the complementizer, then they incorrectly predict that there is a strong boundary after
problems did you expect the most perceptible boundary would
Furthermore, assuming that an adjunct in sentence-initial position is dominated by the complementizer node and in sentence-final position by S-bar creates an inconsistent description, which hampers the value of the model as an experimental tool.
Our adjunction rules are derived for the most part from Selkirk's account. We have also made use of the idea, which Gee and Grosjean (1983) take largely from the work of Selkirk, that certain syntactic heads mark off phonological phrase boundaries, and provide the basic prosodic constituents for higher level analysis.
Our prosody rules run in four independent stages. Each stage builds on the previous stage, so that the rules can refer to both syntactic and prosodic structure as they build successively higher levels of prosodic structure.
(i) Adjunction Rules combine orthographically distinct words into phonological constituents with no internal word boundary. They join a word to its left or right neighbor depending on (a) the category of the word, and (b) its structural relation to other words. In general, adjoinable words are the function words: articles, complementizers, auxiliary verbs, conjunctions, prepositions and pronouns (except for the "strong" possessives, mine, hers, theirs, yours, ours, which are treated as regular NP's).
Adjunction occurs six times for the sentence in Figure 2 to create six multiple word groups, all right-adjoining: on each, in 48-channel, can power, the echo,
appear as terminals in the prosody tree in Figure 2. In subsequent processing the boundaries between the words in these groups are marked so that the text-to-speech system does not produce the prosodic indications of a word boundary. In addition, these groups are treated as single words in further analyses.
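The adjunction stage can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not part of the system described here: words arrive already tagged with a category, every adjoinable word right-adjoins to the word that follows it (as in the Figure 2 groups), and the structural conditions on adjunction are ignored. The tag names and the function `adjoin` are hypothetical.

```python
# Categories treated as adjoinable, following the list in the text.
# The tag names themselves are hypothetical.
FUNCTION_CATEGORIES = {
    "article", "complementizer", "auxiliary",
    "conjunction", "preposition", "pronoun",
}
STRONG_POSSESSIVES = {"mine", "hers", "theirs", "yours", "ours"}


def adjoin(tagged_words):
    """Right-adjoin each function word to the word that follows it.

    `tagged_words` is a list of (word, category) pairs. Each returned
    pair is (group_text, category_of_host_word). Left-adjunction and
    structural conditions are not modeled in this sketch.
    """
    groups = []
    i = 0
    while i < len(tagged_words):
        word, category = tagged_words[i]
        adjoinable = (category in FUNCTION_CATEGORIES
                      and word.lower() not in STRONG_POSSESSIVES)
        if adjoinable and i + 1 < len(tagged_words):
            host, host_category = tagged_words[i + 1]
            groups.append((f"{word} {host}", host_category))
            i += 2  # the host has been absorbed into the group
        else:
            groups.append((word, category))
            i += 1
    return groups


fragment = [("can", "auxiliary"), ("power", "verb"), ("only", "adverb"),
            ("the", "article"), ("echo", "noun"), ("cancelers", "noun")]
print(adjoin(fragment))
# [('can power', 'verb'), ('only', 'adverb'), ('the echo', 'noun'), ('cancelers', 'noun')]
```

Giving each group the category of its host word is a design choice of the sketch; it lets a later stage treat the group as a single word of that category.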
(ii) φ-phrasing Rules construct phonological (or φ) phrases, which are the building blocks of the prosody tree. These rules identify groups of words that cohere strongly in speech and thus should not be separated by phrase boundaries. In the present implementation, each φ phrase is constructed by a left-to-right process that collects the words formed by adjunction until it reaches a noun or verb. At this point, a φ phrase is created that consists of the collected words plus the noun or verb, which acts as head of the phrase. For example, in that shelf, in Figure 2, is a single φ phrase consisting of two words.
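The left-to-right collection process can be sketched as follows. The input format (a list of (text, category) pairs, one per word or adjoined group), the category names, and the function name `phi_phrases` are assumptions made for illustration. How the system treats trailing words that are not followed by a noun or verb is not stated; the sketch simply keeps them as a final headless group.

```python
HEAD_CATEGORIES = {"noun", "verb"}  # only nouns and verbs may head a phrase


def phi_phrases(groups):
    """Collect words left to right, closing a phrase at each noun or verb.

    `groups` is a list of (text, category) pairs. Returns a list of
    phrases, each a list of texts whose last element is the head.
    """
    phrases = []
    current = []
    for text, category in groups:
        current.append(text)
        if category in HEAD_CATEGORIES:
            phrases.append(current)  # the noun or verb closes the phrase
            current = []
    if current:
        # Assumption: leftover words with no following head are kept
        # together; the text does not describe this case.
        phrases.append(current)
    return phrases


# "in that" is an adjoined group, so "in that shelf" is two words long.
print(phi_phrases([("in that", "determiner"), ("shelf", "noun")]))
# [['in that', 'shelf']]
```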
In Figure 2, the φ nodes marked with a syntactic category are the minimal phonological constituents with respect to later rules that build the prosodic phrases; these φ phrases have an internal structure, but the structure plays no role in further processing. Note that neither adjectives nor adverbs are allowed to be the head of a φ phrase, so that three additional open slots is a single φ phrase consisting of four words. Examples such as Someone tall walked into the room, however, suggest that our treatment of these categories is not detailed enough and that, in future versions of the system, some adjectives and adverbs should act as φ heads.
6 Having taken this strong approach, we now understand the limited exceptions to this mechanism, which we discuss below.
(iii) Prosody-phrasing rules use information about φ phrases and syntactic structure to create a new organization of the sentence and to assign strength values to the boundaries between successive φ phrases. The process of building the prosody tree starts with the sentence node (S or Sbar) that is most deeply embedded in the utterance, transforming it into a prosody subtree. This process continues through successively higher levels of sentence nodes until all top-level sentences have been transformed into prosody subtrees. All the processing of each successive sentence is done before the relation of the sentences to each other is considered.⁷
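The processing order for embedded sentences can be sketched as a traversal that returns sentence nodes from the most deeply embedded outward. The `Node` class, the label set, and the function name are hypothetical; the sketch assumes a syntax tree in which sentence nodes are labeled S or Sbar, and it shows only the ordering, not the transformation into prosody subtrees.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A bare-bones syntax tree node (hypothetical representation)."""
    label: str
    children: list = field(default_factory=list)


SENTENCE_LABELS = {"S", "Sbar"}


def sentences_deepest_first(root):
    """Return sentence nodes ordered from most to least deeply embedded.

    Depth counts only enclosing sentence nodes. sorted() is stable, so
    sentences at the same depth keep their left-to-right order.
    """
    found = []

    def visit(node, depth):
        is_sentence = node.label in SENTENCE_LABELS
        if is_sentence:
            found.append((depth, node))
        for child in node.children:
            visit(child, depth + 1 if is_sentence else depth)

    visit(root, 0)
    return [node for _, node in sorted(found, key=lambda pair: -pair[0])]


# The notion [that he would refrain from such an act] was incorrect
tree = Node("S", [Node("NP", [Node("Sbar")]), Node("VP")])
print([node.label for node in sentences_deepest_first(tree)])
# ['Sbar', 'S']
```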
Within a sentence, the φ phrases are processed from left to right. This stage of the analysis uses a window that allows access to three adjacent nodes. Pattern-action rules, which are described below, apply to the nodes in the window and build prosody subtrees that replace the syntax nodes. These subtrees are headed by a φ node containing a number that represents node count; the number is determined by counting the number of nodes contained in the prosody subtree, plus 1 for the φ node that heads the subtree.⁸ In general, the prosody phrase rules do three things:
(a) Balance prosodic phrases by referring to constituent length. This rule only applies for building the prosody subtree that contains the verb. If the node count for subject plus verb is less than the node count of the verb's complement, then subject and verb are grouped together in a prosodic subtree; this gives the phrasing in The characters on the right mark the salient features. Otherwise, the verb is grouped with its complement in a prosodic subtree; an example of this grouping is the subtree for can power only the echo cancelers in Figure 2.
(b) Combine the φ phrase daughters of the major constituents, excluding VP, into a prosodic subtree. At present, this rule only applies to NP and PP since adjectives and adverbs are currently not treated as φ heads. For example, the name of the character, which forms two φ phrases under NP (the name and of the
replaces the NP
(c) Bundle together prosodic constituents (φ phrases) from left to right if no other rules apply. This rule integrates the constituents left unattached by the parser into the prosodic structure. It accounts for the prosodic structure of left-hand power unit on each shelf in 48-channel module in Figure 2, which is formed by first bundling left-hand power unit with on each
48-channel module into φ-5. The final application of bundling replaces the Sigma node with the top level prosody node, which is φ-13 in Figure 2.
7 We have found at least one class of phrases for which this order of processing appears inappropriate. In these, the head of the top-level phrase is epistemic, e.g., believe, know, belief, knowledge, and its complement is a sentence. In most cases, the current processing order for embedded sentences will produce a break between a head and a following embedded sentence. For this class of sentences, however, the break does not seem to be appropriate. While it would be straightforward to handle this as an exception, we are currently examining whether there is a more principled way to describe what must be done in these cases.
8 Only the top-level φ nodes, those which contain the head of the syntactic phrase, are counted in computing the node count. For example, in Figure 2, the sub-phrasal branching of Left-hand and power unit does not contribute to the node count.
(iv) Prosody conversion rules map the boundary strength indices onto three phonological mechanisms. Boundary indices in the low range, e.g. the φ-3 nodes in Figure 2, are realized as a phrase accent (Pierrehumbert 1980). Mid-range indices such as φ-5 and φ-9 in Figure 2 are realized as changes in pitch range. High indices are realized with modulations in both pitch range and duration. Thus the hierarchical organization of a structure such as that in Figure 2 can be reflected directly in the synthesized speech.
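The phrasing rules (a) and (c) and the conversion step (iv) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the function names, the tuple representation of prosodic subtrees, and in particular the thresholds `low_max` and `mid_max` are hypothetical. The text establishes only that index 3 falls in the low range and that indices 5 and 9 fall in the mid range; the exact cut-offs are not given.

```python
from typing import Sequence, Tuple


def group_verb(subject_verb_nodes: int, complement_nodes: int) -> str:
    """Rule (a): balance prosodic phrases around the verb by node count."""
    # Subject plus verb is shorter than the complement: group them together.
    if subject_verb_nodes < complement_nodes:
        return "subject+verb"
    # Otherwise the verb is grouped with its complement.
    return "verb+complement"


def bundle_left_to_right(constituents: Sequence[object]) -> object:
    """Rule (c): fold prosodic constituents into a left-branching tree."""
    if not constituents:
        raise ValueError("nothing to bundle")
    tree = constituents[0]
    for nxt in constituents[1:]:
        # Each step attaches the next constituent to the right of the tree so far.
        tree = (tree, nxt)
    return tree


def realize_boundary(index: int, low_max: int = 4, mid_max: int = 9) -> Tuple[str, ...]:
    """Step (iv): map a boundary strength index to phonological mechanisms.

    low_max and mid_max are assumed cut-offs, chosen only to agree with the
    examples in the text (3 is low; 5 and 9 are mid-range).
    """
    if index <= low_max:
        return ("phrase accent",)
    if index <= mid_max:
        return ("pitch range",)
    return ("pitch range", "duration")
```

Under these assumed cut-offs, any index above `mid_max` is realized with both pitch range and duration, so a stronger boundary always receives at least as much marking as a weaker one.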
PHENOMENA NOT TREATED
Several phenomena have been omitted from this preliminary version of the system. Some of these omissions arise from the fact that we concentrated on sentence analysis rather than discourse analysis. Others involve phenomena that characterize spoken English, and thus did not occur in our original corpus of technical repair manuals.
Contrastive stress is an example of prosodic phrasing based on discourse analysis. In our system's analysis, the phrase from India does not receive contrastive stress in (12).
(12) Passengers from several countries entered the terminal.
Finally a man from India walked in.
In designing the current system, we have concentrated on the level of sentence analysis. Handling the contrasts involved in data like (12) necessitates an additional level of discourse analysis.
In addition, the system never explicitly manipulates segment durations or overall speech rate. For example, we have yet to explore whether lengthening of the segment before a mid-range boundary value is appropriate, or whether increasing the duration of constituents of the core sentence might enhance the natural sound of the system.
RESULTS AND FUTURE RESEARCH
To date our system has been tested systematically on a set of 39 sentences, and its performance has been observed less formally on a set of approximately 300 sentences.9 The test corpus covers a repair manual for telephone switching systems and an introductory description of the Prose 2000 text-to-speech system. We added sentences cited in Umeda (1982) and sentences that we composed in order to extend the range of syntactic constructions represented in the test. In general, we have observed a significant improvement of prosodic quality in those test sentences where the parser and the prosodic component have returned acceptable results.
9. The 39 sentences are listed in the appendix to this paper.
We have observed problems, however, especially in the formal test corpus, much of which we chose for its potential difficulty. Of the 39 test sentences, 38 parsed correctly. Of these, the prosodic component returned 26 sentences with a complete set of acceptable prosody markings. In terms of actual markings, the system marked 393 prosodic events, of which 21 markings were unacceptable. We can attribute errors in those sentences with unacceptable prosodic markings to three distinct problems discussed below.
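The counts reported above can be restated as rates. The short calculation below is only arithmetic on those counts; the helper name `percent` is a hypothetical convenience. It shows that 21 unacceptable markings out of 393 correspond to the error rate of roughly 5 percent cited for the system.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


parse_rate = percent(38, 39)        # sentences parsed correctly
fully_acceptable = percent(26, 38)  # parsed sentences with no bad markings
event_error_rate = percent(21, 393) # unacceptable prosodic markings

print(parse_rate, fully_acceptable, event_error_rate)  # 97.4 68.4 5.3
```

Measuring errors per prosodic event rather than per sentence explains why the headline rate is low even though about a third of the parsed sentences contain at least one unacceptable marking.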
Complement Sentences
Five of the errors that arose from the prosody system's treatment of the test corpus result from the fact that the system sets off all subordinate sentences, including complement sentences, from the main sentence. Informal testing of the productions of four informants on the relevant data indicated that this approach works correctly for complement sentences such as (13)-(16) (complement sentences are italicized):
(13) Health services cautioned Western residents that they should ask where their watermelons come from before buying.
(14) We have to satisfy people that the crisis is past.
(15) The vendors explained that this is the result of illness among 281 people who ate pesticide-tainted watermelons.
(16) Watermelon growers wonder whether this will continue throughout the rest of the season.
However, the informant test consistently indicated that the complement sentences in (17)-(19) are not set off by a comparable boundary:
(17) They believe California sales are still off 75 percent.
(18) They think the Southeast is shipping half its normal load.
(19) Growers and retailers claimed the incident hurt sales across the USA.
Cases like (17)-(19), in which no break is perceived between the verb and its complement sentence, form a syntactically distinct class in Fidditch. This class is characterized by the fact that the verbal head in each case is one that does not require that its complement sentence begin with a complementizer (either that, for, or a wh- word). The class includes epistemic verbs, like those in (17)-(19), as well as a wide range of verbs that take either tensed sentences or various types of non-tensed sentences as complements.10 The examples (20)-(26) demonstrate the range of this class (complement sentences are italicized):
10. Fidditch, in following the outlines of Chomsky's (1981) Government and Binding theory, assumes that propositions, i.e., those elements that contain both a predicate and a perhaps null subject, are syntactically represented as sentences, regardless of tensing.
(20) We had the ship's forces make temporary repairs.
investigation impractical
advance
repairs
Sentence-Final Constituents
Fifteen of the errors that arose from the system's treatment of the test corpus result from a high boundary value that sets final constituents off from the main sentence. The high value is due to the system's purely left-to-right attachment of syntactically unattached constituents (see rule iii.d above). The high boundary value is acceptable in sentences like (27)-(29) (final constituents in the examples are italicized).
(27) In these instances it may be desirable to use phoneme characters instead of text characters in the input text.
(28) Phonemic characters can also be used to handle syntactic data such as boundaries which can improve speech quality.
to equipment failure
However, the high boundary value sets the final constituent off unnaturally from the main sentence in data such as (30)-(32).
(30) The method by which you convert a word into phonemes is provided in Chapter 7.
(31) The experimenters instructed the informant
implemented
In many cases it appears that the grammatical relation of the final constituent to the rest of the sentence determines the boundary value that sets off
which bear no relation to any single item in a sentence, are set off by a minor phrase boundary whereas final constituents that modify a particular
distinction between the final constituents in (27)-(29), which are adjuncts, and those in (30)-(32), which are modifiers. However, while the distinction between the
(complement and subject) and those of the periphery (adjunct and modifier) is fairly straightforward, and handled directly by the mechanisms of the Fidditch
elements of adjunct and modifier are complex and require the addition of costly mechanisms.
The cost of adding adjunct/modifier distinctions is illustrated by the ambiguity that arises when both
more clearly, consider the rearrangement of this
they instructed the informants to speak.) The context of speech analysis prefers the former reading. However, the net benefit of adding sophisticated contextual analysis to our system, if attainable, is, at best, unclear. The same may be said of adding selectional restrictions, or detailed information on logical form.
In contrast, a finer treatment of local syntactic constituents is within reach. From the data we have examined, it appears that the character of the prosodic event before the final constituent can be locally determined to a great extent. For the most part this determination depends on the category type of the final constituent and on the contents of the leading edge of the constituent. For example, interjections (however, moreover, therefore, alas, thus, of course, etc.)
etc.) are uniformly set off by a high boundary value and should remain so. In contrast, the boundary value of final prepositional phrases, particularly those with
11
are currently engaged in categorizing the constituent
constituents with respect to the prosodic event that precedes them.
Alternatively, we are considering the play-it-safe approach of reducing the high boundary values that set off final constituents to mid-boundary values
useful in conjunction with our local determination approach for those constituents whose status is either undecidable or ambiguous under the latter approach.12
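One way the local determination and the play-it-safe fallback might be combined is sketched below. This is an illustration only, with its assumptions labeled: the function name, the string labels "high" and "mid", and the exact-match test on the constituent are hypothetical, and the word list contains only the interjections named above.

```python
# Interjections named in the text; the full inventory is left open there ("etc.").
INTERJECTIONS = frozenset(
    {"however", "moreover", "therefore", "alas", "thus", "of course"}
)


def final_constituent_boundary(constituent: str) -> str:
    """Pick a boundary strength before an unattached sentence-final constituent."""
    # Normalize case and internal whitespace before the lookup.
    normalized = " ".join(constituent.lower().split())
    # Interjections are uniformly set off by a high boundary and stay that way.
    if normalized in INTERJECTIONS:
        return "high"
    # Play-it-safe fallback (assumed here to cover every other case):
    # reduce the high boundary value to a mid-range value.
    return "mid"
```

A fuller version would also consult the category type and the leading edge of the constituent, as the text proposes; this sketch deliberately leaves those cases to the fallback.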
particular, in consideration of, etc. must be treated like interjections.
12. Reducing the final boundary value leaves ambiguities unresolved. For sentences such as (i) and (ii), below, we believe this lack of resolution is appropriate:
(i) John saw a girl in the park with a telescope.
[The telescope is with John or the girl or it's in the park.]
(ii) I need a woman to fix the sink.
[I need a woman so that I can fix the sink.
I need a woman who can fix the sink.]
spoken English, such ambiguities are not processed unless the speaker or listener is directly questioned regarding the
disambiguate are inappropriate unless such questioning occurs. Other cases are less clear. For example, it is difficult to imagine that, in (28), the difference between the reading of the
which clause as a sentence adjunct and as a noun phrase
such cases some local distinction, such as the presence or absence of the comma in (28), obtains.
Sentence-Initial Constituents
When a sentence contains both sentence-initial and sentence-final adjuncts, the sentence-initial adjuncts will be less prominently set off than the sentence-final adjuncts due to the left-to-right attachment of adjuncts to the prosodic tree (see rule iii.b above). In data like (33), however, a more appropriate rendering would have the boundary after the adjunct On a clear day be strong relative to the boundary before the adjunct as it rises over the mountains.
(33) On a clear day you can see the sun as it rises over the mountains.
While it would be trivial to increase the value of the pertinent boundary, we are as yet unsure what the critical features are which require a more perceptible boundary. For example, while a higher boundary value after the prepositional phrase in (34) might be acceptable, it is not clear that it is necessary:
(34) In the morning John left.
Given the stylistically distinct nature of this data, we have not yet considered this question in detail.
Summary
While we have systematically tested our system so far on a small set of examples, the number of prosodic events involved in those examples, 393, is high, due to the length of the sentences tested. We find the 5 percent error rate, representing 21 prosodic events, encouraging at this stage in the development of the system. In addition, we have delimited the problem areas of an approach that relies solely on information available in the syntax tree. Our initial investigation of these problems indicates that at least part of the necessary information about phrase-level prosody is conveyed in the lexicon per se. Additionally, due to the left-corner orientation of the Fidditch parser, which exists independently to optimize search strategies, the necessary lexical information is made easily available.
CONCLUSIONS
We have described an on-line experimental system that uses prosody rules to infer prosodic phrasing from constituent structure, grammatical functions, and length considerations. The system contains three modules: a deterministic parser, a set of prosodic phrasing rules, and an algorithm to convert the output of the prosodic phrasing rules into signals for the Bell Labs text-to-speech system.
In developing the experiment, our intention was to build a working system that would allow us to test various hypotheses about the connections between syntax and prosodic phrasing in human speech and to upgrade the prosody of existing synthetic speech. The modularity of our system enables us to alter each module independently in order to test different hypotheses. For example, the parser can be altered to reflect the difference between verbs that require a complementizer before a sentential complement and those that do not.13 This alteration is independent of the workings of the prosody system or the prosody conversion rules.
13. Fidditch represents this as a difference in the level of the complement sentence. Verbs that require a complementizer take an S-bar complement, while verbs that do not require a complementizer take an S complement with an optional that preceding.
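As an illustration of how such a parser distinction could feed the prosody rules, the sketch below keys the boundary decision to the level of the complement sentence described in footnote 13. It assumes, going beyond what the text states, that the prosody rules would suppress the break before an S complement and keep it before an S-bar complement, in line with the informant judgments on (13)-(19). The function name and the string labels are hypothetical.

```python
def break_before_complement(complement_level: str) -> bool:
    """Return True if the complement sentence should be set off by a boundary."""
    if complement_level == "S-bar":
        # Verb requires a complementizer, as in (13)-(16): keep the break.
        return True
    if complement_level == "S":
        # Verb does not require a complementizer, as in (17)-(19): no break.
        return False
    raise ValueError(f"unknown complement level: {complement_level!r}")
```

Because the decision reads only a label supplied by the parser, the phrasing rules and the conversion rules need no change, which is the kind of independence the modular design is meant to provide.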
The existence of this prosody system makes the problem areas in the syntax-prosody relation more tractable by allowing on-line testing of a large body of data. For example, the prosodically different character of the two classes of complement sentences discussed above became apparent after several examples from each class were run through the system. We therefore feel we have built a tool that will aid in designing better approximations of sentence prosody as it relates to syntactic structure.
REFERENCES
Allen, J. 1976. Synthesis of speech from unrestricted text. Proceedings of the IEEE, 4, 433-442.
Chomsky, N. 1981. Lectures on government and binding. Dordrecht: Foris Publications.
Cooper, W. and J. Paccia-Cooper. 1980. Syntax and speech. Cambridge, MA: Harvard University Press.
Elovitz, H., R. Johnson, A. McHugh, and J. E. Shore. 1976. Letter-to-sound rules for automatic translation of English text to phonetics. IEEE Transactions on Acoustics, Speech, and Signal Processing, 6, 446-459.
Gee, J. P. and F. Grosjean. 1983. Performance structures: a psycholinguistic and linguistic appraisal. Cognitive Psychology, 15, 411-458.
Hindle, D. 1983. User manual for Fidditch, a deterministic parser. NRL Technical Memorandum #7590-142.
Luce, P. A., T. C. Feustel, and D. B. Pisoni. 1983. Capacity demands in short-term memory for synthetic and natural speech. Human Factors, 25, 17-32.
Marcus, M. 1980. A theory of syntactic recognition for natural language. Cambridge, MA: MIT Press.
Pierrehumbert, J. B. 1980. The phonetics and phonology of English intonation. Ph.D. Dissertation, MIT.
Selkirk, E. O. 1984. Phonology and syntax: the relation between sound and structure. Cambridge, MA: MIT Press.
Umeda, N. 1982. Boundary: perceptual and acoustic properties and syntactic and statistical determinants. Speech and Language, 7, 333-371.
Umeda, N. and R. Teranishi. The parsing program for automatic text-to-speech synthesis developed at the Electrotechnical Laboratory in 1968. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23, 183-188.
APPENDIX: TEST SENTENCES
1. THE NAME OF THE CHARACTER IS NOT PRONOUNCED.
2. LEFT-HAND POWER UNIT ON EACH SHELF IN FORTY-EIGHT CHANNEL MODULE POWERS ONLY ECHO CANCELLERS IN THAT SHELF.
3. THE CONNECTION MUST BE DETERMINED FOR THE LEFT-HAND POWER UNITS ON EACH SHELF.
4. THE CONNECTION MUST BE DETERMINED FOR THE LEFT-HAND POWER UNITS WHICH ARE ON EACH SHELF.
5. THE METHOD BY WHICH ONE CONVERTS A WORD INTO PHONEMES IS PROVIDED IN CHAPTER 7.14
6. WE DISCUSSED THE TECHNIQUES WE HAD IMPLEMENTED.
7. THE TECHNIQUES WE HAD IMPLEMENTED WERE TESTED ON A LARGER MACHINE.
8. THE MAN WHOM WE SAW YESTERDAY LIVES FAR AWAY FROM HERE.
9. THEY TOLD HIM TO WALK SLOWLY.
10. THE DESTRUCTION OF THE GOOD NAME OF HIS FATHER BOTHERED HIM.
11. LATELY HE HAD HAS CONTROL OVER THE SITUATION.
12. I NEED A WOMAN TO FIX THE SINK.
13. JOHN MET A WOMAN HE THOUGHT HE LIKED.
14. THE WOMAN I SAW CAME FROM HERE.
15. IN THESE INSTANCES IT MAY BE DESIRABLE TO USE PHONEME CHARACTERS INSTEAD OF TEXT CHARACTERS TO REPRESENT A WORD EACH TIME IT APPEARS ON THE INPUT TEXT.
16. PHONEME CHARACTERS GIVE MORE CONTROL OVER THE PARTICULAR SOUNDS THAT ARE GENERATED.
17. THE MATERIALS REQUIRED ARE ONE KITE KIT.
18. PHONEMIC CHARACTERS CAN ALSO BE USED TO HANDLE SYNTACTIC DATA SUCH AS THE BOUNDARIES WHICH CAN IMPROVE SPEECH QUALITY.
19. IT MAY BE DESIRABLE TO GIVE JOHN A HAND.
20. AFTER THESE QUESTIONS, A DETAILED DESCRIPTION OF THE USE OF PHONEMES WILL BE PROVIDED IN CHAPTER 7.
21. THE ENGLISH THAT IS SPOKEN IN AMERICA AT THE PRESENT DAY HAS RETAINED A GOOD MANY CHARACTERISTICS OF EARLIER BRITISH ENGLISH THAT DO NOT SURVIVE IN BRITISH ENGLISH TODAY.
22. PHONEMIC CHARACTERS CAN ALSO BE USED TO HANDLE SYNTACTIC DATA SUCH AS THE LOCATION OF THE ENDS OF PHRASES WHICH CAN IMPROVE SPEECH QUALITY.
23. THE STUDENTS CONSIDERED THE ASSUMPTION THAT A BREAK MIGHT OCCUR.
24. FINALLY YOU MUST ASSUME THAT YOUR CIGARETTES WILL BOTHER THE PASSENGERS.
25. TRY TO GIVE THE NAMES OF THE CHARACTERS TO JOHN.
26. I PREFER FOR HIM TO GIVE THE NAMES OF THE CHARACTERS TO JOHN.
27. I BELIEVE THOSE PEOPLE TO BE INTELLIGENT.
28. I PROMISED HIM THAT HE COULD COME.
29. THEY GAVE THE BOY A BOOK.
30. THEY GAVE HIM A BOOK.
31. THE 48-CHANNEL MODULE CAN HAVE ONLY TWO DI-GROUPS BUT CAN HAVE UP TO FOUR POWER UNITS IF BOTH DI-GROUPS ARE EQUIPPED WITH ECHO CANCELERS.
32. I TOLD HIM YESTERDAY TO CLEAN HIS ROOM.
33. MOVE THE POWER OPTION JUMPER PLUG SO THAT IT IS ADJACENT TO DI-GROUP ONE ON PRINTED WIRING BOARD.
34. I WANT A LOT MORE COOKIES.
35. THE MINUS-SIGN PRONUNCIATION SWITCH IS IN THE MIDDLE.
36. HE ASKED THE CHILDREN TO FINISH THE JOB.
37. HE ARGUED THAT IT WAS IMPOSSIBLE.
38. IS A MAN AT THE DOOR
39. A DETAILED DESCRIPTION OF THE USE OF PHONEMES IS PROVIDED IN CHAPTER 7.
14. Fidditch failed here on the relative clause with a PP left edge.