The role of second formant transitions in the stop-semivowel distinction

Perception & Psychophysics
1981, 29 (2), 121-128

EILEEN C. SCHWAB, JAMES R. SAWUSCH, and HOWARD C. NUSBAUM
State University of New York, Buffalo, New York 14226

An experiment was conducted which assessed the relative contributions of three acoustic cues to the distinction between stop consonant and semivowel in syllable-initial position. Subjects identified three series of syllables which varied perceptually from [ba] to [wa]. The stimuli differed only in the extent, duration, and rate of the second formant transition. In each series, one of the variables remained constant while the other two changed. Obtained identification ratings were plotted as a function of each variable. The results indicated that second formant transition duration and extent contribute significantly to perception. Short second formant transition extents and durations signal stops, while long second formant transition extents and durations signal semivowels. It was found that second formant transition rate did not contribute significantly to this distinction. Any particular rate could signal either a stop or semivowel. These results are interpreted as arguing against models that incorporate transition rate as a cue to phonetic distinctions. In addition, these results are related to a previous selective adaptation experiment. It is shown that the "phonetic" interpretation of the obtained adaptation results was not justified.

This work was supported by NIMH Grant MH31468-01 and NSF Grant BNS7817068 to SUNY/Buffalo and NINCDS Grant NS-12179 to Indiana University (which supported development of the speech synthesizer). The authors would like to thank James Pomerantz for his comments on an earlier draft of this manuscript. Requests for reprints should be sent to any author at the Department of Psychology, 4230 Ridge Lea Road, Buffalo, New York 14226.

Copyright 1981 Psychonomic Society, Inc. 0031-5117/81/020121-08$01.05/0

A fundamental claim of bottom-up theories of speech perception is that phonetic labeling is the direct result of analyzing a number of acoustic features of the speech waveform (e.g., see Fant, 1967). Regardless of the specific mechanism employed for this acoustic analysis, these data-driven theories must specify a set of basic acoustic properties which are coded during speech perception. One problem with this approach, pointed out by Studdert-Kennedy (1977), is that the choice of these features by theorists has been entirely post hoc. At present there are no unifying auditory principles guiding this theoretical feature selection process. In other words, bottom-up theories tend to choose those acoustic properties which can be successfully employed to perform phonetic labeling (see Stevens, 1980). Thus, feature processing theories typically invoke sufficiency criteria without regard for whether or not the acoustic properties employed are perceptually significant for humans (see Norman, 1980, for a related discussion).

It is very tempting to assume that the human perceptual system uses all the acoustic information available in making phonetic decisions. However, demonstrating the sufficiency of a set of acoustic features for phonetic labeling (e.g., Stevens & Blumstein, 1978) does not guarantee that the human speech perceiver takes advantage of this information.

It becomes very important, then, to determine exactly what acoustic information is utilized by humans in the course of phonetic perception. This inventory of acoustic cues will provide a basis for evaluating the psychological validity of theories of speech perception. In addition, assessing the entire repertoire of perceptually significant cues may constrain the types of perceptual mechanisms used in cue extraction (e.g., spectral templates vs. formant trackers). Finally, the full specification of these acoustic-phonetic cues might allow us to determine if there exist any general auditory principles of speech perception (cf. Studdert-Kennedy, 1977). Currently, such principles (if they exist) may be obscured by an incomplete picture of human acoustic information processing during speech perception. Before any auditory principles can be defined, it will be necessary to systematically investigate the separate and conjoint effects of the full spectrum of acoustic information available in speech.

The phonetic distinctions of voicing (e.g., Lisker & Abramson, 1964; Summerfield & Haggard, 1977) and place of articulation (e.g., Dorman, Studdert-Kennedy, & Raphael, 1977; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967) in stop consonants are examples of two distinctions that have been studied intensively and extensively. Yet, even for these phonetic contrasts, all possible cue specifications and interactions have not been fully determined. For other phonetic distinctions, such as manner of articulation, the research to date has not explored cue structure in sufficient depth, especially when some of the cues are intrinsically related. In many instances, manipulation of one of these cues necessitates a change in at least one other cue. For example, a change in second formant transition extent (the frequency excursion from onset to steady state) may change the overall F2-F3 transition pattern (e.g., rising vs. diverging), the spectrum at syllable onset, perceptual summation of F2 and F3 onsets, and transition rate. Any one, or all, of these features, which are intrinsically interrelated, could be perceptually relevant. This problem is exemplified by considering the phonetic distinction between stop consonants (e.g., [b]) and semivowels (e.g., [w]).

Several earlier studies have examined acoustic cues that serve to distinguish stops and semivowels (Hillenbrand, Minifie, & Edwards, 1979; Liberman, Delattre, Gerstman, & Cooper, 1956; Miller & Liberman, 1979; O'Connor, Gerstman, Liberman, Delattre, & Cooper, 1957; Suzuki, Note 1). The first published study of stop-semivowel cues used two-formant stimuli to examine the effect of transition tempo (Liberman et al., 1956). Tempo was varied by increasing the duration of the transitions (and decreasing the rate of the transitions by an appropriate amount) while holding transition frequency extent constant. Subjects identified synthetic stimuli which ranged perceptually from [bɛ] to [wɛ] and [gɛ] to [jɛ] (as in "yet"). Adult subjects were able to utilize the tempo of the F1 and F2 transitions as a cue to distinguish stop consonant from semivowel. These results, indicating the usefulness of the tempo cue, have been extended to infants. Hillenbrand et al. (1979) examined the ability of infants to discriminate between [bɛ] and [wɛ], which were cued by changes in transition tempo. The first experiment used synthetic stimuli similar to those of Liberman et al. (1956). The second experiment used computer-modified tokens of natural speech. In both experiments, infants were able to discriminate stop from semivowel on the basis of the tempo cue. In another experiment, Liberman et al. (1956) examined the effect of transition tempo before a variety of vowels. For all stimuli, each transition began at the same frequency (120 Hz for F1 and 600 Hz for F2). So, transition extent was constant within a series (vowel) but varied between series (vowels). Liberman et al. (1956) varied transition extent across vowels in order to determine whether transition rate or duration contributed more to the perception of stops and semivowels. Since there was less variance in the location of the category boundaries when each series was plotted as a function of transition duration (as opposed to formant transition rate), it was concluded that transition duration was the controlling cue. However, it should be noted that for some of their vowels, as F1 transition extent (and hence, rate) increased, F2 transition extent (and hence, rate) decreased. It is possible that these extent cues could interact and thus increase the variance of the boundary locations when identification functions are plotted against transition rate. Since the change in transition rate for the two formants was not consistent across vowels, the relative contributions of duration and rate cues could not be determined unequivocally.

The previous studies manipulated the tempo of all formants in a stimulus and observed the effect on perception. The next two studies manipulated the transition rate and extent of only one formant and observed the effect on perception. Suzuki (Note 1) examined the effect of F1 transition rate and extent on the perception of intervocalic stops and semivowels. It was reported that, in general, large F1 frequency extents were perceived as stops. Suzuki found that an increase in transition rate reduced the frequency extent required to perceive a stop. However, an examination of the data indicates that an increase in transition rate was accompanied by a decrease in transition duration. Thus, the results could also be indicating that a decrease in F1 transition duration reduces the frequency extent required to perceive a stop. Another study examined the acoustic cues that serve to distinguish semivowels and liquids (O'Connor et al., 1957). In part of this study, subjects identified stimuli that varied in F2 frequency extent before a variety of vowels. O'Connor et al. found a relationship between frequency extent and the perception of semivowels. When the F2 transition was in the appropriate direction (rising for [w] and falling for [j]), they found that a decrease in the extent of the F2 transition resulted in a decrease in semivowel responses. Since transition duration was held constant, a decrease in transition extent resulted in a concurrent decrease in transition rate.

These previous studies indicate that the extent, duration, and rate of consonant transitions are major cues to manner of articulation. Unfortunately, we cannot evaluate the relative contribution of each cue, since these cues have been confounded in previous studies. Since rate is defined as frequency extent divided by transition duration, we cannot vary each cue separately. A change in one of the three cues automatically results in a change in at least one of the other two cues. Consequently, manipulations must involve at least two of these three cues, or all three, simultaneously. Previous studies varied only one of these three possible pairs. In order to determine the relative importance of each cue in perception, we must vary each of the three possible pairs of cues separately while holding the third cue constant. In the present study, all three pairwise comparisons were made for the F2 transition. Thus, the present experiment will be able to assess the relative contribution of each cue to the distinction between stop consonant and semivowel.

METHOD

Subjects
The subjects were 14 undergraduates at the State University of New York at Buffalo, who participated to fulfill a course requirement. All subjects were right-handed, native speakers of English with no reported histories of either speech or hearing disorders.

Stimuli
The experimental stimuli consisted of three sets of seven synthetic consonant-vowel syllables which varied perceptually from [ba] to [wa]. All stimuli were generated using a software cascade synthesizer (Klatt, 1980a, or see Kewley-Port, Note 2) in the Speech Perception Laboratory at the State University of New York at Buffalo. The three series (and all stimuli within a series) were the same in all respects except one, the F2 transition. All stimuli were 245 msec in duration and contained five formants. The fundamental frequency contour was the same for all stimuli, with F0 starting at 105 Hz and rising to 120 Hz over the first 120 msec and then falling to 100 Hz at syllable offset. Each F1 began at 245 Hz and rose for 45 msec to a steady-state value of 700 Hz. Transition rate was 13.75 Hz/msec for the first 20 msec and 7.2 Hz/msec for the next 25 msec. Each F3 began at 2,115 Hz
and rose for 70 msec to a steady-state value of 2,600 Hz. The F3 transition rate was 16.9 Hz/msec for the first 20 msec and 2.94 Hz/msec for the next 50 msec. The F2 steady-state value was 1,220 Hz. The fourth and fifth formants were constant at 3,300 and 3,850 Hz, respectively. The F1 bandwidth began at 60 Hz, remained constant for 15 msec, and then increased for 30 msec to a final value of 80 Hz. The F2 bandwidth began at 75 Hz, remained constant for 20 msec, and then increased for 40 msec to a final value of 80 Hz. The F3 bandwidth began at 90 Hz and increased for 70 msec to a final value of 140 Hz. The fourth and fifth formant bandwidths remained constant at 250 and 200 Hz, respectively. Amplitude of voicing began at 55 dB and increased during the course of the F2 transition to a value of 60 dB. Amplitude of voicing decreased during the last 50 msec of the vowel to dB.

The duration, frequency extent, and rate of frequency change of the F2 transition were varied. In each series, one cue was held constant and the other two varied to produce a seven-element stop-semivowel series. All F2 transitions were linear.

In the rate constant series, F2 transition rate was held constant at 10.43 Hz/msec. F2 transition duration and extent ranged from 30 msec and 313 Hz at one end of the series to 60 msec and 626 Hz at the other end, in 5-msec and 52-Hz steps. Schematic representations of the initial 145 msec of the first three formants of these two endpoints are shown in the corresponding figure.

In the extent constant series, F2 transition extent was held constant at 470 Hz. F2 transition duration and rate ranged from 15 msec and 31.33 Hz/msec at one end of the series to 75 msec and 6.26 Hz/msec at the other end, in 10-msec duration (and log slope) steps. Schematic representations of the initial 145 msec of the first three formants of these two endpoints are shown in the corresponding figure.

In the duration constant series, F2 transition duration was constant at 60 msec. F2 transition extent and rate ranged from 260 Hz and 4.33 Hz/msec at one end of the series to 680 Hz and 11.33 Hz/msec at the other end, in 70-Hz and 1.17-Hz/msec steps. Schematic representations of the initial 145 msec of the first three formants of these two endpoints are shown in the corresponding figure.

Figure. Three-formant schematic representation of the two endpoints for the rate constant series. F2 transition rate is indicated by the double-headed arrows between the dashed lines. (Panel title: F2 transition rate constant; axis label: time.)

Figure. Three-formant schematic representation of the two endpoints for the duration constant series. F2 transition duration is indicated by the double-headed arrows between the dashed lines. (Panel title: F2 duration constant; axis label: time.)

Figure. Three-formant schematic representation of the two endpoints for the extent constant series. F2 transition extent is indicated by the double-headed arrows between the dashed lines. (Panel title: F2 extent constant; axis label: time.)

In addition to the three experimental sets, there was a training set of stimuli. This set consisted of the two endpoints from each of the three experimental sets.

Procedure
Small groups of two to four subjects each were run at a time. Each subject participated for h. The stimuli were converted to analogue form and presented to subjects in real time under computer control. The stimuli were presented binaurally to subjects through Telephonics TDH-39 matched and calibrated headphones. The intensity of all stimuli was set to 72 dB SPL for a [ba] rate constant stimulus.

All subjects participated in a short training condition at the beginning of the session. The subjects were informed that they would be listening to synthetic syllables that would sound like [ba] and [wa]. During the training set, subjects were asked to listen to the stimuli without responding. These stimuli were presented in an alternating order ([ba], then [wa]) with an interstimulus interval of sec. After each stimulus, feedback was
provided indicating the stimulus that had been presented. Subjects were presented with 10 occurrences of each endpoint.

After the training set, the subjects listened to the experimental sets. They were asked to identify each stimulus by pushing one of six buttons on a computer-controlled response box. Pushing button "1" indicated a good example of a [ba], and pushing button "6" indicated a good [wa]. Variations in quality between these phonetic exemplars were indicated with the buttons "2" through "5." The experimental trials were subject-paced with a maximum 5-sec interstimulus interval. Each stimulus series was presented in a block of 10 repetitions of each of the seven stimuli in random order (70 trials). Subjects listened to two blocks of trials for each series. The order of presentation of the experimental sets was counterbalanced across subjects. By the end of the experimental session, each subject had provided 20 identification responses to each stimulus in each series.

RESULTS

The data from four subjects were eliminated from subsequent analysis, since the identification of the endpoints of one or more of the three series was inconsistent and near chance. Average identification rating functions were calculated for the three series for the remaining 10 subjects. In each series, the average ratings range from a good [ba] identification at one end of the series to a good [wa] identification at the other end.

The identification results for the two series that varied F2 transition duration are shown in the corresponding figure. Data are plotted as a function of the duration of the F2 transition for these series. The data from the extent constant series (the solid squares) replicate the results of Liberman et al. (1956). When F2 transition extent was held constant, the proportion of [w] responses increased as transition duration increased. For the rate constant series, the same result was found. As transition duration increased, the proportion of [w] responses increased. Each of the 10 subjects showed this same pattern of results.

The results for the two series that varied F2 transition extent are shown in the corresponding figure. Data are plotted as a function of F2 frequency extent. The pattern of subject responses is similar to that found when F2 transition duration varied. Increasing F2 frequency extent decreased the proportion of [b] responses for both the duration constant and rate constant series. As the F2 transition extent increased, the proportion of [w] responses increased. Again, each of the 10 subjects showed this same pattern of results.

Figure. Average identification functions for the two series that vary F2 transition extent. (Axis: extent of F2 transition, Hz; series: duration constant, rate constant.)

Figure. Average identification functions for the two series that vary F2 transition rate. (Axis: rate of F2 transition, Hz/msec; series: duration constant, extent constant.)

The identification results for the two series that varied
F2 transition rate are shown in the corresponding figure. Data are plotted as a function of F2 rate for these two series. The pattern of subject responses here is very different from the data plotted as a function of F2 extent or duration. When frequency extent was held constant (the solid squares), the proportion of [w] responses decreased as the rate of F2 transitions increased. In contrast, when the F2 transition duration was held constant (the solid triangles), the proportion of [w] responses increased as the rate of the F2 transitions increased. As with the previous data sets, all 10 subjects show the same pattern of results that was found for the group data.

Figure. Average identification functions for the two series that vary F2 transition duration.

DISCUSSION

The results indicate that F2 transition rate is not a sufficient cue for distinguishing stops from semivowels. The use of rate as a cue seems to be totally dependent on the extent of the F2 transition and on the F2 transition duration. No matter what rate was chosen, an appropriate choice of extent or duration could cancel the effect of rate and cause the stimulus to be identified as either [b] or [w].¹ Thus, it appears that the significant cues for the stop-semivowel distinction are the duration and extent of the F2 transition. Short transition durations cue a stop, while long durations cue a semivowel. Small F2 transition extents signal a stop, while large F2 transition extents signal a semivowel.

It should be noted that there was, necessarily, some covariation of acoustic cues in our series. Since the overall duration of the syllables was constant, a change in the duration of the F2 transition resulted in a change in the duration of the F2 steady state. So the two series which increased the F2 transition duration decreased the F2 steady state. Thus, it could be argued that the duration of the F2 steady state contributed to the perception of manner. However, while it has been found that vowel duration affects the perception of manner (Miller & Liberman, 1979), the present stimuli did not vary vowel duration. The F3 steady-state frequency was reached after the F2 had reached its steady state for 20 of the 21 stimuli. In addition, the F1 steady-state duration was constant for all stimuli. Consequently, if F2 steady-state duration was a contributing cue to the manner distinction, it would probably not have been through an effect on vowel duration. The frequency at onset of F2 also covaried with frequency extent. Separate variation of these two acoustic aspects of the stimulus would require using different vowel series, as was done by Liberman et al. (1956). Consequently, the critical variable could be either the extent of the F2 transition or the spectrum at onset.

If we assume that duration and extent are the perceptually relevant cues, the phonetic labeling behavior of our subjects can be described using a simple decision rule. The product of the F2 transition extent (in hertz) and F2 transition duration (in milliseconds) can be used to differentiate stop consonants from semivowels. The expression E · D represents this relationship, where E is the value of F2 extent and D is the value of F2 transition duration. The product of these values can be compared with a criterion to perform phonetic feature assignments. If the product exceeds the criterion, subjects should label the test stimulus as a semivowel. If the product is less than the criterion, subjects should respond using a stop label. For our group data, a criterion of 23,000 (Hz · msec) would be sufficient to distinguish bilabial stops and semivowels. In fact, this criterion, which is based on group data, is sufficient to appropriately label 196 of 210 judgments.² However, we assume that the actual criterion value for individual subjects may vary depending on individual differences. It is also possible that different subjects might rely on the extent and duration cues to differing degrees. This would cause these cues to have different (exponential) weights in the decision rule. In addition, this product rule might be extended to encompass the contributions of other acoustic cues, such as F1 transition duration and extent, and the amplitude profile of the syllable at onset.

Despite the extreme simplicity of this description of phonetic labeling, it is interesting to note the form of the decision rule. For this rule, the assignment of phonetic features is based on the multiplication of the values of two acoustic cues. In this respect, the general form of our stop-semivowel decision rule is in agreement with other work on mathematical descriptions of phonetic decision making. For example, Massaro and Cohen (1977) have shown that fricative voicing judgments can be described by a product rule. Oden and Massaro (1978) have used a similar approach to modeling the classification of stop consonants on the dimensions of voicing and place of articulation. However, the use of a product rule does not provide a process description of speech perception. Consequently, we now turn to considering the implications of present data for a number of bottom-up process models of speech perception (Klatt, 1980b; Searle, Jacobson, & Kimberley, 1980; Sawusch, Note 3).

Given that F2 transition rate is not perceptually relevant to human classification of stops and semivowels, the human speech processor must operate under one of two possible constraints. The first possibility is that transition rate is never explicitly extracted during speech perception. If the cue is not available, it simply cannot be used. If human speech perception operates under this constraint, transition rate could not be used as a cue to any phonetic distinction. The alternative is that transition rate is explicitly extracted, but is not generally available for all phonetic feature decisions. This alternative would be supported if transition rate were shown to be perceptually relevant for other phonetic distinctions, such as place of articulation. In this case, it would be expected that transition rate would be extracted by a "sealed channel" mechanism (cf. Pomerantz, 1978), specific to a particular phonetic contrast (e.g., place). Through the operation of a sealed channel device, transition rate would appear to be interpreted holistically with other cues. This might be demonstrated in perceptual research by showing that rate was extracted in a phonetic feature dependent fashion.

Since human speech perception must operate under one of these two constraints, it seems reasonable to apply these constraints to theories of speech perception. This provides one criterion that can be used to evaluate bottom-up theories of speech perception. A second test is determining whether these theories utilize (or could implement) extent and duration as cues to the stop-semivowel distinction. Thus, we have two criteria for evaluating the adequacy of data-driven theories for explaining stop-semivowel perception.

Recently, Searle et al. (1980) have proposed a feature detector model of speech perception which has been instantiated as a computer program. This model of speech perception operates in two distinct modes. The first is a learning mode in which phonetic prototypes are constructed. An acoustic feature analysis is performed on the waveforms of "known" utterances. The results of this feature extraction process are then submitted to a discriminant analysis to classify the known utterances into categories.

In the second mode, novel utterances are also analyzed by the feature detectors. The discriminant analysis is then used to locate these feature-analyzed utterances in the multidimensional prototype space. The proximity of the novel utterances to known categories in this space is used as the basis for phonetic labeling. Two of the acoustic features used by this model are transition slope (i.e., rate) and the duration of acoustic events (e.g., voice onset time). There is, however, no explicit representation of transition extent. If, in the learning stage, this model was given a set of known natural speech stops (e.g., [b]) and semivowels (e.g., [w]), the program should learn to classify [b]s as having short transition durations and rapid transition rates. The model should also learn to identify [w]s as utterances with long transition durations and slow transition rates. If a [b]-[w] series of stimuli were constructed such that transition rate and extent were the only features varying in the series (i.e., duration was constant), the program should classify these sounds on the basis of rate alone. This model would identify any stimulus with a rapid transition rate as [b] and any stimulus with a slow transition rate as [w]. Clearly, this classification scheme is radically different from the procedure used, under similar circumstances, by our subjects, who classified this type of stimulus series according to transition extent. Our subjects classified stimuli with rapid transition rates and large frequency extents as [w] (see Figure 6), while the Searle et al. model would classify these stimuli as [b]. In contrast, stimuli with slow transition rates and small extents were labeled [b] by our subjects but would be labeled as [w] by the model. Thus, the Searle et al. (1980) model clearly violates the constraints we placed on human speech processing.

An alternative feature detector model has been proposed by Sawusch (Note 3). This computer simulation was designed to model both the psychological processes of speech perception and the perceptual effects of selective adaptation for the place of articulation feature in stops. According to this theory, speech perception is divided into a sequence of information transformation stages. At the earliest level of feature extraction, termed "peripheral auditory analysis," auditory cues are extracted in a frequency-specific, ear-specific fashion. Four classes of feature detectors were implemented in this stage to signal transition rise, transition fall, steady state, and low-frequency energy onset-offset (voicing). In this model, transition rate is not explicitly coded by feature detectors. Rather, for each frequency region, there are two rising transition detectors and two falling transition detectors. One each of the rise and fall detectors responds to extreme frequency changes (extents), while the remaining two respond to gradual changes (short extents). By only implementing two sets of rise and fall detectors, rate distinctions are too grossly coded for any possible use in stop-semivowel judgments. However, this distinction is sufficient for making place of articulation decisions.

At the second level of feature analysis, called "integrative auditory analysis," frequency-specific features from peripheral auditory analysis are combined to form frequency-independent auditory patterns. This second level of processing is implemented as a set of integrative decision rules that take into account both auditory features and the auditory context in which those features occur. Within this model, decision rules only analyze feature outputs that are directly relevant. This means that, even if rate were coded at the peripheral level, decision rules at the integrative level could selectively ignore or employ this feature as required. Since this simulation does not explicitly code rate in sufficient detail for distinguishing stops and semivowels, this model exists within the constraints dictated by our results. In order to differentially label stops and semivowels, the simulation would need to utilize the extent information encoded at the first level of feature extraction. The outputs of these detectors could be accumulated over the transition duration (onset to steady state). This analysis represents a product of transition extent and duration and therefore would be consistent with our data.

One model which does predict our results has been proposed by Klatt (1980b). Klatt has described a bottom-up approach to speech perception which uses static spectral templates
as fundamental auditory fea- STOP-SEMIVOWEL CUES tures These spectral templates are nodes in a discrimination-recognition network Sample short-term spectra, taken from an input waveform, are compared with these nodes and are scored for closeness of fit The highest-scoring sequence through the network indicates the recognition path In this model, transition extent would be indicated by the amount of change in the F2 spectral peak across templates from transition onset to steady-state vowel Duration is analyzed as a cue by counting the number of times a spectral template is iteratively matched by looping through the same node With both duration and extent cues being interpreted by the recognition network, this model should emulate human labeling of stops and semivowels Even more important than the extraction of extent and duration cues by this model is the lack of any means for computing rate in this theory Indeed, Klatt (1980b) has explicitly stated that evidence demonstrating the perceptual significance of transition rate for the [b]-[w] distinction would be a strong disconfirmation of his model Since our research demonstrates that F2 transition rate is not utilized by humans making this distinction, the present study supports Klatt’s (1980b) proposal Our results also have important consequences for the "phonetic" interpretation of a previous selective adaptation study (Cooper, Ebert, & Cole, 1976) In their experiment, subjects identified a speech series under two conditions In the control condition, the subjects identified stimuli from a [ba]-[wa] series In the adaptation condition, the subjects listened to repeated occurrences of an adapting stimulus and then identified the test series A comparison was made between identification of the speech series before and after adaptation When the adaptor was an endpoint of the test series, the postadaptation identification function shifted towards the adaptor end of the series For example, using the [ba] endpoint as the 
adaptor, fewer of the [ba]-[wa] test stimuli were identified as [b] after adaptation.

Two loci for this adaptation effect have been proposed. One locus is an auditory level of speech processing (Ades, 1976; Bailey, 1975; Diehl, 1976). If adaptation occurs at this level, then spectral similarity between the adaptor and test series would predict the direction and magnitude of any adaptation effect. Alternatively, adaptation could occur at a phonetic level of processing where phonetic similarity would predict the direction of the effect (Cooper et al., 1976).

Cooper et al. tried to determine the locus of the selective adaptation effect by examining the effect of a velar stop adaptor [ga] on their bilabial stop-semivowel series ([ba]-[wa]). In order to differentiate between the auditory and phonetic explanations of selective adaptation, Cooper et al. tried to create a [ga] adaptor that had an acoustic structure more similar to [wa] than to [ba]. Their [ga] was similar to the [ba]-[wa] stimuli, except that the initial F2 transition was falling rather than rising. All [ga] transitions were 35 msec in duration. From their [ba]-[wa] data, this 35-msec duration falls within the semivowel category. Cooper et al. found that a normal [ga] adaptor had second and third formant starting frequencies sufficiently close to one another to simulate a burst, which is an acoustic cue for a stop consonant. In order to remove this burst-like effect, the transition frequency extent was reduced for both second and third formants. The F2 transition extent was reduced by approximately 250 Hz. Cooper et al. (1976) hoped to determine the locus of the selective adaptation effect, since the [ga] adaptor had transition durations similar to the [wa] end of their test series while phonetically it was similar to the [ba] end of the series. They hypothesized that an auditory locus would predict a [wa]-like adapting effect while a phonetic locus would predict a [ba]-like adapting effect. The effect of the [ga]
adaptor was in the same direction as a [ba] adaptor. This led Cooper et al. (1976) to the conclusion that selective adaptation has an effect at a phonetic level of processing.

Given the results of the present experiment, the results of Cooper et al. can be explained without recourse to a phonetic locus for adaptation. By removing one acoustic cue for a stop, namely a burst, they substituted another stop cue, short frequency extent. Their [ga] adaptor had transition durations only somewhat appropriate for a semivowel, since the [ba]-[wa] stimulus with 35-msec transitions was still labeled as [ba] 20% of the time. However, their [ga] adaptor had a relatively small F2 transition extent, which is a strong stop cue (see Figure 5). Consequently, an auditory level explanation, based on the adaptation of both duration and extent detectors, would seem to be adequate to account for the Cooper et al. data.

In summary, the present experiment explored three acoustic cues to the stop-semivowel distinction. Both F2 transition duration and frequency extent were found to lead to a reliable stop-semivowel distinction. Short transition durations and short frequency extents lead to more stop responses, while long transition durations and large frequency extents lead to more semivowel responses. By comparison, F2 transition rate was found to be an insufficient cue to the stop-semivowel distinction. These results place certain constraints on theories of speech perception. Any theory purporting to explain human speech perception must either extract transition rate in a phonetic feature dependent (sealed channel) fashion or ignore it entirely. This provides us with a test for the psychological validity of data-driven theories of speech perception. Further, given the present results, it does not appear to be necessary to invoke a phonetic level of adaptation to explain the adaptation results found for the stop-semivowel manner distinction. Rather, multiple auditory detectors or channels, tuned to the
various cues, are sufficient to explain the existing data.

REFERENCE NOTES

Suzuki, H. Mutually complementary effect of rate and amount of formant transition in distinguishing vowel, semivowel, and stop consonant (Quarterly Progress Report of the MIT Research Laboratory of Electronics, No. 96). Boston: MIT, 1970.
Kewley-Port, D. KLTEXC: Executive program to implement the KLATT software speech synthesizer (Research on Speech Perception, Progress Report 4). Bloomington: Indiana University, 1978.
Sawusch, J. R. The structure and flow of information in speech perception (Research on Speech Perception, Tech. Rep. 2). Bloomington: Indiana University, 1976.

REFERENCES

ADES, A. E. Adapting the property detectors for speech perception. In R. J. Wales & E. Walker (Eds.), New approaches to language mechanisms. Amsterdam: North-Holland, 1976.
BAILEY, P. J. Perceptual adaptation in speech: Some properties of detectors for acoustical cues to phonetic distinctions. Unpublished doctoral dissertation, University of Cambridge, Cambridge, England, 1975.
COOPER, W. E., EBERT, R. R., & COLE, R. A. Perceptual analysis of stop consonants and glides. Journal of Experimental Psychology: Human Perception and Performance, 1976, 2, 92-104.
DIEHL, R. Feature analyzers for the phonetic dimension stop vs. continuant. Perception & Psychophysics, 1976, 19, 267-272.
DORMAN, M. F., STUDDERT-KENNEDY, M., & RAPHAEL, L. J. Stop consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues. Perception & Psychophysics, 1977, 22, 109-122.
FANT, G. Auditory patterns of speech. In W. Wathen-Dunn (Ed.), Models for the perception of speech and visual form. Cambridge, Mass: M.I.T. Press, 1967.
HILLENBRAND, J., MINIFIE, F. D., & EDWARDS, T. J. Tempo of spectrum change as a cue in speech-sound discrimination by infants. Journal of Speech and Hearing Research, 1979, 22, 147-165.
KLATT, D. H. Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 1980, 67, 971-995. (a)
KLATT, D. H. Speech perception: A model of acoustic-phonetic analysis and lexical access. In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980. (b)
LIBERMAN, A. M., COOPER, F. S., SHANKWEILER, D. P., & STUDDERT-KENNEDY, M. Perception of the speech code. Psychological Review, 1967, 74, 431-461.
LIBERMAN, A. M., DELATTRE, P. C., GERSTMAN, L. J., & COOPER, F. S. Tempo of frequency change as a cue for distinguishing classes of speech sounds. Journal of Experimental Psychology, 1956, 52, 127-137.
LISKER, L., & ABRAMSON, A. S. A cross-language study of voicing in initial stops: Acoustical measurements. Word, 1964, 20, 384-422.
MASSARO, D. W., & COHEN, M. M. Voice onset time and fundamental frequency as cues to the /zi/-/si/ distinction. Perception & Psychophysics, 1977, 22, 373-382.
MILLER, J. L., & LIBERMAN, A. M. Some effects of later occurring information on the perception of stop consonant and semivowel. Perception & Psychophysics, 1979, 25, 457-465.
NORMAN, D. A. Copycat science or does the mind really work by table look-up? In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980.
O’CONNOR, J. D., GERSTMAN, L. J., LIBERMAN, A. M., DELATTRE, P. C., & COOPER, F. S. Acoustic cues for the perception of initial /w,j,r,l/ in English. Word, 1957, 13, 24-43.
ODEN, G. C., & MASSARO, D. W. Integration of featural information in speech perception. Psychological Review, 1978, 85, 172-191.
POMERANTZ, J. R. Are complex visual features derived from simple ones? In E. L. J. Leeuwenberg & H. F. J. M. Buffart (Eds.), Formal theories of visual perception. New York: Wiley, 1978.
SEARLE, C. L., JACOBSON, J. Z., & KIMBERLEY, B. P. Speech as patterns in the 3-space of time and frequency. In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980.
STEVENS, K. N. Property-detecting mechanisms and eclectic processors. In R. A. Cole (Ed.), Perception and production of fluent speech. Hillsdale, N.J: Erlbaum, 1980.
STEVENS, K. N., & BLUMSTEIN, S. E. Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America, 1978, 64, 1358-1368.
STUDDERT-KENNEDY, M. Universals in phonetic structure and their role in linguistic communication. In T. H. Bullock (Ed.), Recognition of complex acoustic signals. Berlin: Dahlem Konferenzen, 1977.
SUMMERFIELD, Q., & HAGGARD, M. On the dissociation of spectral and temporal cues to the voicing distinction in initial stop consonants. Journal of the Acoustical Society of America, 1977, 62, 435-448.

NOTES

In the present experiment, the fastest F2 transition rate for a [w] stimulus was less than 12 Hz/msec. The conclusion that any F2 transition rate can signal either stop or semivowel gains further support from six subjects’ identification of an additional stop-semivowel series ([bI]-[wI]). In this series, all transition durations were constant at 40 msec. F2 transition extent and rate ranged from 300 Hz and 7.5 Hz/msec for the [b] end of the series to 1,200 Hz and 30 Hz/msec for the [w] end of the series in 150-Hz and 3.75-Hz/msec steps. In the group data, the stimuli with extents of 900, 1,050, and 1,200 Hz (with respective rates of 22.5, 26.25, and 30.0 Hz/msec) were all identified as [w] on better than 90% of the trials. Each of the six subjects identified these three stimuli as [w].

This rule also fits the [bI]-[wI] data for 37 out of 42 points.

(Received for publication July 17, 1980; revision accepted October 17, 1980.)
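
The stimulus values reported in the notes for the additional [bI]-[wI] series can be checked with a short calculation. The following Python sketch is illustrative only: it assumes that transition rate is simply extent divided by duration, and the function names and the 12-Hz/msec cutoff used for the rate-only rule are hypothetical choices made for this example, not part of the original study.

```python
# Illustrative sketch only. The extents and the 40-msec duration are the values
# reported for the additional [bI]-[wI] series; all names are assumptions.

DURATION_MSEC = 40  # every transition in this series lasted 40 msec

# Hypothetical cutoff for a rate-only rule (assumption for illustration):
# rates at or above it are treated as "rapid" and therefore labeled as stops.
HYPOTHETICAL_RATE_CUTOFF = 12.0  # Hz/msec


def series_extents(start_hz=300, stop_hz=1200, step_hz=150):
    """Return the F2 transition extent (Hz) of each stimulus in the series."""
    return list(range(start_hz, stop_hz + 1, step_hz))


def transition_rate(extent_hz, duration_msec):
    """Rate (Hz/msec), assumed here to be extent divided by duration."""
    return extent_hz / duration_msec


def rate_only_label(rate, cutoff=HYPOTHETICAL_RATE_CUTOFF):
    """Label a stimulus from rate alone: rapid -> 'b', slow -> 'w'."""
    return "b" if rate >= cutoff else "w"


extents = series_extents()
rates = [transition_rate(e, DURATION_MSEC) for e in extents]

for extent, rate in zip(extents, rates):
    print(f"extent {extent:>5} Hz  rate {rate:>5.2f} Hz/msec  "
          f"rate-only label [{rate_only_label(rate)}]")
```

Under these assumptions the series contains seven stimuli, and the three with the largest extents have rates of 22.5, 26.25, and 30.0 Hz/msec. A rate-only rule of the kind sketched here would label those three stimuli as [b], whereas the listeners labeled them as [w].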
