Perception of Language: Speech Recognition

Uploaded by: 1***  IP location: Hubei  Uploaded: 2022-02-24  Format: PPT  Pages: 40  Size: ?21.50KB


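Content
2.1 Three problems of speech discrimination
2.2 Methods of speech discrimination
2.3 The structure of speech
2.4 Perception of isolated speech segments
2.5 Perception of continuous speech
2.6 Perception of written language

2.1 Three problems of speech discrimination

Discussion: Speech perception is not a complicated process. What we hear, in temporal order, is a series of words, and these words are in turn made up of a series of sounds. The sounds correspond to phonemic segments that occur in a fixed order. In perceiving speech we therefore process phonemes one by one in their order of occurrence. In perceiving cat, for example, we first hear the consonant k, then the vowel, and finally the consonant t. So once speech has been segmented into phonemes, we start from the smallest unit and identify, level by level, morphemes, words, phrases, clauses, sentences, and discourse.

When we speak, however, our articulation is fluent and continuous, and the listener is unlikely to cut it into discrete sound units. There seem to be pauses between words and between sentences, and boundaries between segments, but in practice these are not clearly marked. Suppose we wrote speech down the way it sounds:
spokenwordsarenotseparatedbyspaceslikewordsareinprint.
Spoken words are not separated by spaces like words are in print.

For researchers of speech perception, how people isolate (segment) individual sounds from the complex speech signal and then identify them has always been a question worth studying.

2.1.1 Identifying segments
If every sound of a language could be tied to one standard form, building a model of speech perception would be fairly simple. For several reasons, however, speech sounds have no invariant or standard form.
(1) The production of one and the same segment differs with the context in which it occurs (context-conditioned variation). The b in bill, ball, bull, able, and rob is pronounced slightly differently under the influence of the neighboring vowel. We therefore cannot treat phonemes as beads on a string, with the sound of one phoneme following the sound of another. There is no way to cut the speech signal into discrete segments that each carry all the features of one phoneme and none of the features of any other.
(2) Sex, age, and setting also cause differences in pronunciation. Men and women, adults and children produce clearly different vowels because their vocal cords differ in size and configuration. The speech we ourselves produce is not exactly the same from one occasion to the next either. Speech perception research must explain why people handle such variable speech signals so effortlessly.
(3) Further variation comes from rapid spoken language. In fluent speech the acoustic features of segments are weakened and vary greatly. Speech perception research must explain how listeners process these "messy" language samples.

2.1.2 The "lack of invariance" problem
The acoustic features of spoken conversation vary greatly. Sometimes speakers underarticulate, that is, they miss their articulatory targets, so that words lose most of their information. Yet listeners usually have no difficulty understanding what is said. A model of speech perception needs to explain how knowledge at different levels of language processing (lexical, grammatical, contextual) helps speech comprehension.

2.1.3 Speech perception under non-ideal conditions

2.2 Methods of studying speech perception
Willis (1829) and Helmholtz (1859) were already studying the physical properties of sound in the 19th century, but research on how people perceive language did not get under way until the period before the Second World War.
The technology needed to build instruments for studying speech perception became available only in the mid-20th century.
Vocoder: analyzes and records speech as a simpler signal that carries less information.
Sound spectrograph: analyzes the speech signal according to the distribution of sound, with frequency on the y-axis, time on the x-axis, and the darkness of the marks representing amplitude. The sound spectrograph was a milestone in speech research, because for the first time phoneticians could obtain a wide range of objective, quantified phonetic information by simple and inexpensive means.
1950s and 1960s: research centered on acoustic phonetics.
1970s: research interest shifted to articulatory phonetics.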

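Electromyography: records the tiny voltage changes produced when muscles contract. Connecting the recorder to a computer makes it possible to process the large amount of data the instrument records.
Electrokymography: records changes in oral and nasal airflow during speech.
Electropalatography: provides information on how the tongue makes contact with the palate inside the mouth.

2.3 The Structure of Speech
Prosodic factors
Articulatory phonetics
Acoustic phonetics

Acoustic Phonetics
Acoustic phonetics examines the acoustic properties of speech sounds. One of the most common ways of describing the acoustic energy of speech sounds is the sound spectrogram:
vertical axis: the frequency of the speech sounds
horizontal axis: time
darkness: intensity
Each spectrogram contains a series of dark bands, called formants.

Formant
A formant is a concept used to describe acoustic resonance. In speech science and phonetics it describes resonance in the human vocal tract. The usual way to measure formants is to look for peaks in the spectrum, in a spectral analysis or a spectrogram. Formants are the regions of a sound's spectrum where energy is relatively concentrated. They not only determine sound quality but also reflect the physical characteristics of the vocal tract (the resonating cavity). As sound passes through the resonating cavity, the cavity filters it and redistributes energy across frequencies: some frequencies are reinforced by resonance and others are attenuated. The reinforced frequencies appear as heavy dark bands on the time-frequency spectrogram. Because the energy is unevenly distributed and the strong parts stand out like mountain peaks, the Chinese term for formant is literally "resonance peak". In speech acoustics, formants determine vowel quality; in computer music, they are important parameters of timbre and sound quality.

Formant Transition
Two aspects of formants have been found to be important in speech perception. Formant transitions are the large rises or drops in formant frequency that occur over short durations of time. In card, the first formant is rising and the second one falling in frequency near the end of the word. These transitions nearly always occur either at the beginning or the end of a syllable. In between is the formant's steady state, during which formant frequency is relatively stable.

[Figure: spectrograms of the American English vowels [i], [u], and [ɑ], showing formants F1 and F2.]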
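The spectrogram layout described above (frequency on one axis, time on the other, darkness for intensity) can be sketched with a small short-time Fourier transform. This is only a minimal illustration under stated assumptions: the test signal is a synthetic mix of two sinusoids at 500 Hz and 1500 Hz standing in for two formant-like bands, and the sample rate, frame length, hop size, and peak threshold are arbitrary choices, not values from the slides.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (assumed)
FRAME_LEN = 64       # samples per analysis frame (assumed)
HOP = 32             # samples between frame starts (assumed)


def spectrogram(samples, frame_len=FRAME_LEN, hop=HOP):
    """Return one list of magnitudes per time frame (rows = time, columns = frequency bins)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Hann window to reduce leakage between neighboring frequency bins.
        windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, s in enumerate(frame)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Naive discrete Fourier transform of the windowed frame at bin k.
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(windowed))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


def peak_frequencies(magnitudes, bin_hz, rel_threshold=0.1):
    """Frequencies of local spectral maxima: the 'dark bands' of a single frame."""
    limit = rel_threshold * max(magnitudes)
    return [k * bin_hz for k in range(1, len(magnitudes) - 1)
            if magnitudes[k] > limit
            and magnitudes[k] > magnitudes[k - 1]
            and magnitudes[k] > magnitudes[k + 1]]


# Synthetic stand-in for a steady-state vowel: two sinusoids (assumed frequencies).
signal = [math.sin(2 * math.pi * 500 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 1500 * t / SAMPLE_RATE)
          for t in range(800)]

spec = spectrogram(signal)
bin_hz = SAMPLE_RATE / FRAME_LEN
peaks = peak_frequencies(spec[0], bin_hz)
```

Each row of `spec` is one column of the printed spectrogram, and larger magnitudes correspond to darker marks. Real formant tracking is considerably more involved than this peak picking.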

The formant with the lowest frequency is called F1, the second lowest F2, and the third lowest F3. In most cases the first two formants, F1 and F2, are enough to distinguish different vowels. These two formants describe the open/close and front/back dimensions of a vowel (traditionally tied to the position of the tongue, although that is not entirely accurate).

Average vowel formants

Vowel (IPA)   F1 (Hz)   F2 (Hz)
i             240       2400
y             235       2100
e             390       2300
ø             370       1900
ɛ             610       1900
œ             585       1710
a             850       1610
ɶ             820       1530
ɑ             750       940
ɒ             700       760
ʌ             600       1170
ɔ             500       700
ɤ             460       1310
o             360       640
ɯ             300       1390
u             250       595

Perception of vowels
Which parts of the vowel signal matter most for identification? Early studies used the steady state of the vowel as the stimulus. For example, in 1952 Delattre et al. synthesized speech with the apparatus of Cooper et al. and observed how subjects responded to steady-state sounds with one or two formants.
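To illustrate the claim that F1 and F2 are usually enough to separate vowels, the sketch below labels a measured (F1, F2) pair with the nearest average vowel. It uses a subset of the rows of the table above (i, y, e, a, o, u). Measuring distance in log-frequency space is an assumption made for this illustration, and `nearest_vowel` is a hypothetical helper, not a method described in the slides.

```python
import math

# Average (F1, F2) in Hz for a subset of the vowels in the table above.
AVERAGE_FORMANTS = {
    "i": (240, 2400),
    "y": (235, 2100),
    "e": (390, 2300),
    "a": (850, 1610),
    "o": (360, 640),
    "u": (250, 595),
}


def nearest_vowel(f1, f2):
    """Return the vowel whose average (F1, F2) lies closest to the measured pair.

    Distance is Euclidean on log-frequency ratios (an assumption of this sketch),
    so a 10% deviation counts the same at low and at high frequencies.
    """
    def distance(reference):
        ref_f1, ref_f2 = reference
        return math.hypot(math.log(f1 / ref_f1), math.log(f2 / ref_f2))

    return min(AVERAGE_FORMANTS, key=lambda vowel: distance(AVERAGE_FORMANTS[vowel]))


# A low F1 with a very low F2 falls among the close back vowels.
guess_back = nearest_vowel(260, 600)
# A high F1 with a mid F2 falls near the open vowel.
guess_open = nearest_vowel(800, 1500)
```

Average values hide the speaker differences discussed in 2.1.1, so a fixed lookup like this would misclassify many real tokens.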

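Subjects could identify some vowels that had only a single formant: a low-frequency single formant was generally associated with back vowels such as u and a, and a high-frequency single formant with front vowels such as i and e. This finding shows that the frequency of a single formant can be enough for a vowel to be identified. When the stimulus had two formants (synthesized from the formant information of natural speech), subjects identified the sound very consistently. This experiment and others indicate that the steady-state portions of a vowel's first and second formants are necessary for identifying the sound and provide sufficient acoustic cues.

Limitation of this research: the vowels in ordinary words are produced in consonantal environments. Acoustically, this means that vowels carry formant transitions into and out of the neighboring consonants, and their steady-state segments are either very short or inaudible.

In 1983, Jenkins et al. compared the perceptually salient vowel steady state with the formant transitions. They used CVC (consonant + vowel + consonant) syllables beginning and ending with b as test stimuli, with 9 different vowels (beeb, bib, bab, bob, and so on). A male speaker produced the 9 syllables, which were then digitized and edited by computer. Each syllable was divided into three components: (a) the formant transition from the initial consonant into the vowel; (b) the central vowel portion; (c) the transition from the vowel into the final consonant. Subjects evaluated the following stimuli:
1. Control syllables: the original, unmodified syllables, made up of segments (a) + (b) + (c).
2. Silent-center syllables: segments (a) + a silent interval + (c). The silent interval is as long as segment (b) of the syllable. These stimuli preserve the formant transitions and the actual duration of the vowel but contain no steady-state information.
3. Variable-center syllables: segment (b) of each syllable only. These stimuli preserve the steady-state portion of the target syllable and its intrinsic duration.
4. Fixed-center syllables: segment (b) of each syllable, trimmed to match the duration of the shortest target vowel. These stimuli preserve the steady-state portion of the vowel but carry no duration information.
5. Abutted syllables: segments (a) + (c). These stimuli contain no information about segment (b); that is, they have no steady-state portion at all.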
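The five stimulus types listed above are simple recombinations of the three segments (a), (b), and (c), which the following sketch makes explicit. It assumes each syllable is already split into three lists of audio samples; the function name, the dictionary keys, and the choice to trim the fixed center from its start are assumptions of this illustration, since the slides do not say how the trimming was done.

```python
def build_stimuli(syllables):
    """Build the five stimulus types for every syllable.

    syllables maps a syllable name to a tuple (a, b, c) of sample lists:
    a = initial transition, b = vowel center, c = final transition.
    """
    # Duration of the shortest vowel center across all syllables.
    shortest_center = min(len(b) for _, b, _ in syllables.values())
    stimuli = {}
    for name, (a, b, c) in syllables.items():
        stimuli[name] = {
            # 1. Unmodified syllable.
            "control": a + b + c,
            # 2. Transitions kept, center replaced by silence of the same length.
            "silent_center": a + [0.0] * len(b) + c,
            # 3. Center only, with its own duration.
            "variable_center": list(b),
            # 4. Center only, cut to the shortest center (trimmed from the start: an assumption).
            "fixed_center": b[:shortest_center],
            # 5. Transitions joined directly, no center at all.
            "abutted": a + c,
        }
    return stimuli


# Toy sample values standing in for digitized audio.
toy = {
    "beeb": ([1.0, 2.0], [3.0, 4.0, 5.0, 6.0], [7.0]),
    "bib": ([1.0], [2.0, 3.0], [4.0, 5.0]),
}
result = build_stimuli(toy)
```

Written this way, it is easy to see which cues each condition keeps: only the control and silent-center stimuli preserve both transitions and the original vowel duration.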

The experimenters asked groups of subjects to identify the vowel in each type of stimulus and in the unmodified control syllables. Silent-center stimuli were identified as accurately as the original control syllables, but errors rose markedly for variable-center stimuli (steady state plus duration information) and for abutted syllables. Fixed-center stimuli (steady state only) had the lowest accuracy. The experimenters concluded that, in vowel identification, formant transitions and vowel duration are more important cues than the steady-state information of a fixed sample.

2.4 Perception of Isolated Speech Segments
We may roughly divide the process of speech perception into three levels:
auditory level
phonetic level
phonological level

Auditory level
At the auditory level, the signal is represented in terms of its frequency, intensity, and temporal attributes (as shown, for example, on a spectrogram), as with any auditory stimulus. At this stage the listener encounters the speech stream and analyzes the sound signal into acoustic cues. These cues supply partial information about a phoneme, such as "voiceless", "nasal", or "bilabial". The information is held in auditory memory for use at the second, phonetic stage.

Phonetic level
At the phonetic level, we identify individual phones by a combination of acoustic cues, such as formant transitions. At this stage the cues are pooled to identify individual phonemes, which are then placed in phonetic memory. Phonetic memory no longer retains the acoustic cues.

Phonological level
At the phonological level, the phonetic segment is converted into a phoneme and phonological rules are applied to the sound sequence. At this stage the listener adjusts the phonetic-level identification according to the language's constraints on sequences of segments. In English, for example, we may hear fpin, but no such sequence exists in English, so it is changed to spin.

Speech as a Modular System
A controversial issue in the study of speech perception is whether and to what extent general principles of auditory perception can explain what we have learned about speech sounds.
Viewpoints:
1. The language-processing system is a unique set of cognitive abilities that cannot be reduced to general principles of cognition.
2. Linguistic subsystems, such as semantics and syntax, operate independently rather than interactively.
The question of modularity is related to the question of the organization of the brain for language. If speech is a modular system, then we might expect it to have a specialized neurological representation. This representation would not be based on general cognitive functioning (that is, working memory, episodic memory, and so on) but would be specific to language (or, possibly, specific to phonetic processing). This module might be the basis for the perception of language in young infants and, if damaged, the reason that certain individuals suffer quite specific breakdowns in language functioning.

Categorical Perception
Categorical perception is a reflection of the phonetic level of processing, in which a phonetic identity is imposed. To comprehend speech, we must impose an absolute or categorical identification on the incoming speech signal rather than simply a relative determination of the various physical characteristics of the signal. That is, our job is to identify whether a sound is a p or a b, not whether the frequency or the intensity is relatively high or low.

VOT (voice onset time)
Voice onset time: the time between when the sound is released at the lips and when the vocal cords begin vibrating. It is an important cue in the perception of the voicing feature.
ba: with voiced sounds, vibration occurs immediately (0 ms).
pa: with voiceless sounds, vibration occurs after a short delay (40 ms).

Motor theory
Liberman (Haskins Laboratories)
Viewpoint: listeners use implicit articulatory knowledge (knowledge about how sounds are produced) as an aid in perception.
E.g. du, di
Criticisms:
1. Infants are perceptually sensitive to certain phonetic contrasts, including those not in their native language.
2. Articulatory motions for a given phoneme will vary with preceding and following vowels.
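The idea of categorical perception along the voice onset time dimension, described earlier in this section, can be made concrete with a toy identification function. The 0 ms and 40 ms values come from the ba/pa examples above. Placing the category boundary at their midpoint (20 ms) and treating discrimination as strictly label-based are simplifying assumptions of this sketch, not findings reported in the slides.

```python
BA_VOT_MS = 0    # voiced example from the text
PA_VOT_MS = 40   # voiceless example from the text
# Assumed category boundary: the midpoint of the two examples.
BOUNDARY_MS = (BA_VOT_MS + PA_VOT_MS) / 2


def identify(vot_ms):
    """Impose a categorical label on a continuous voice onset time."""
    return "b" if vot_ms < BOUNDARY_MS else "p"


def discriminable(vot_a_ms, vot_b_ms):
    """Strictly categorical listener: two sounds differ only if their labels differ."""
    return identify(vot_a_ms) != identify(vot_b_ms)


# A continuum of stimuli in equal 10 ms steps.
continuum = list(range(0, 61, 10))
labels = [identify(v) for v in continuum]

# Same 20 ms physical difference, different perceptual outcome.
across_boundary = discriminable(10, 30)
within_category = discriminable(30, 50)
```

The labels change abruptly at the boundary even though the physical continuum changes in equal steps, which is the sense in which the listener makes an absolute rather than a relative judgment.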

2.5 Perception of Continuous Speech
Under normal listening conditions, speech sounds are embedded in a context of fluent speech. The acoustic structure of a speech sound varies with its immediate phonetic context. This context consists of two main factors:
1. Prosodic factors in speech recognition (stress, rate)
2. Semantic and syntactic factors in speech perception

Semantic and Syntactic Factors in Speech Perception
Listeners are able to use the syntactic and semantic constraints of continuous speech to limit the number of possibilities to consider. Miller & Isard (1963) isolated the influence of syntactic and semantic information in their study. Three different types of sentences were presented in continuous speech:
(1) grammatical strings: Accidents kill motorists on the highways.
(2) anomalous strings that preserved grammatical word order: Accidents carry honey between the house.
(3) ungrammatical strings: Around accidents country honey the shoot.
People were most accurate with grammatical strings, less accurate with anomalous strings, and even less able to recognize ungrammatical strings.

Three Models of Processing
Top-down processing: begins with a concept and matches it to the incoming data. You have contextual and linguistic information about what the word is going to be before you hear it.
Bottom-up processing: starts with the raw sensory data and works toward finding the concept or idea that the data represent to the perceiver.
Interactive model: combines the above two models.
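One way to picture the interactive model is as a combination of a bottom-up score (how well each candidate word fits the sound) with a top-down score (how well it fits the context). The sketch below assumes a listener who hears a word ending in "eel" whose first sound is masked, followed by a context word. All probabilities are invented for illustration, and the multiplicative combination rule is an assumption of this sketch rather than a model specified in the slides.

```python
CANDIDATES = ["wheel", "heel", "peel", "meal"]

# Hypothetical top-down expectations: how plausible each word is given the context word.
CONTEXT_FIT = {
    "shoe":   {"wheel": 0.05, "heel": 0.85, "peel": 0.05, "meal": 0.05},
    "orange": {"wheel": 0.05, "heel": 0.05, "peel": 0.85, "meal": 0.05},
    "axle":   {"wheel": 0.85, "heel": 0.05, "peel": 0.05, "meal": 0.05},
    "table":  {"wheel": 0.05, "heel": 0.05, "peel": 0.05, "meal": 0.85},
}


def masked_acoustics(candidates):
    """Bottom-up evidence when the first sound is masked: every candidate fits equally."""
    return {word: 1.0 / len(candidates) for word in candidates}


def combine(acoustic_fit, context_fit):
    """Interactive step: multiply the two sources of evidence and renormalize."""
    scores = {word: acoustic_fit[word] * context_fit[word] for word in acoustic_fit}
    total = sum(scores.values())
    return {word: score / total for word, score in scores.items()}


def recognize(acoustic_fit, context_word):
    """Return the candidate with the highest combined score."""
    combined = combine(acoustic_fit, CONTEXT_FIT[context_word])
    return max(combined, key=combined.get)


# With masked acoustics, context alone decides the outcome.
heard_with_shoe = recognize(masked_acoustics(CANDIDATES), "shoe")

# With clear acoustics (hypothetical values), the sound can override the context.
clear_peel = {"wheel": 0.01, "heel": 0.01, "peel": 0.97, "meal": 0.01}
heard_clear = recognize(clear_peel, "shoe")
```

A purely bottom-up listener would ignore `CONTEXT_FIT` and a purely top-down listener would ignore the acoustic scores. The combined rule lets whichever source is more informative dominate.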

Experiment: phonemic restoration
The state government met with their respective legi*latures convening in the capital city.
*eel was heard as wheel, heel, peel, or meal, depending on the sentence:
It was found that the *eel was on the shoe.
It was found that the *eel was on the orange.
It was found that the *eel was on the axle.
It was found that the *eel was on the table.

Experiment: mispronunciation detection
It has been zuggested that students be required to preregister.
Cole (1973): the likelihood of detection depends on the place in a word or sentence. Detection performance was better for mispronunciations at the beginning of a word than for those later in a word, and better earlier in a sentence than later on.

Experiment: combining the mispronunciation detection task with a shadowing task
Shadowing task: repeat immediately what you hear.
Wilson and Welsh examined the conditions under which listeners would repeat a mispronounced word exactly or restore the "intended" pronunciation (restoration).

2.6 Perception of Written Language
An orthography is a method of mapping the sounds of a language onto a set of written symbols.
Three main types of orthography:
Logography: takes the word or morpheme as the linguistic unit and pairs the unit with some pictorial symbol (Chinese).
Syllabary: takes the syllable as the linguistic unit and associates it with some visual representation (Japanese).
Alphabet: a system in which each letter is supposed to represent a phoneme (English).

Levels of Written Language Processing
Saccades: the movements of the eyes during reading are called saccades. They take about 10-20 ms. Our eyes are moving too quickly for us to pick up any visual information from the printed page during these saccades; rather, we just perceive a blur. These movements traverse 10 letters on average.
Regressions: about 10%-15% of the eye movements of mature readers are regressions. They are an indication that a reader has misperceived or misunderstood some portion of a text and has gone back to reanalyze it.
Fixations: the time we spend at a given location between eye movements. Fixation duration is one index of the difficulty of information processing during reading.

Perception of Letters in Isolation
A tachistoscope is a device that permits the rapid visual presentation of a stimulus. In a typical study, a stimulus might be presented for 50 milliseconds or less, with subjects asked to report what they see. The stimuli are presented briefly and in isolation.
R? P? K?
Studies of tachistoscopic perception have shown that:
1. The constituent features of letters are a significant determinant of performance.
2. Under conditions of brief presentation without word context, we can extract some but not all of the features associated with a letter.

ODUGQRQCDUGOCQOGRDURDGQOGRUQDODUZGROUCGRODDQRCGUQDOCGUCGUROQOCDURQUOCGQDRGQCOUGRUDQOGODUCQQCURDODUCOQGCGRDQUUDRCOQGQCORUGOQUCDGDQUOCURDCGOGODRQC

IVMXEWEWVMXEXWMVIIXEMWVVXWEMIMXVEWIXVWMEIMWXVIEVIMEXWEXVWIMVWMIEXVMWIEXXVWMEIWXVEMIXMEWIVMXIVEWVEWMIXEMVXWIIVWMEXIEVMWXWVZMXEXEMIVWWXIMEVEMWIVX

Perception of Letters in Word Context
The word-superiority effect: superior performance for words over nonword letter strings.
Reicher (1969): individuals were tachistoscopically presented with a word (word), a nonword (owrd), or a letter (d or k).
Finding: accuracy was greater when a word was presented than when a nonword or a single letter was presented. This suggests that we process letters more efficiently within words; word processing aids letter identification, rather than the other way around.

Two Models of Reading
Dual-route model: proposed by Coltheart, Curtis, Atkins, and Haller. It proposes that we have two different ways of converting print to speech: one is the lexical route and the other is the nonlexica
