Landmark-Based Speech Recognition
The Marriage of High-Dimensional Machine Learning Techniques with Modern Linguistic Representations

What are Landmarks?
- Time-frequency regions of high mutual information between phone and signal (maxima of I(phone label; acoustics(t,f)))
- Acoustic events with similar importance in all languages, and across all speaking styles
- Acoustic events that can be detected even in extremely noisy environments

Landmark types:
- Syllable Onset: consonant release
- Syllable Nucleus: vowel center
- Syllable Coda: consonant closure

Where do these things happen? [Figure: I(phone; acoustics) experiment; Hasegawa-Johnson, 2000]

[Figure: lattice rescoring example. Lattice hypothesis "backed up", syllable structure ONSET-NUCLEUS-CODA / ONSET-NUCLEUS-CODA, with pronunciation variants "backed up", "backt up", "back up", "backt ihp", "wackt ihp"; scores are attached to words and times.]
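The mutual-information criterion in the first bullet can be made concrete with a short sketch. The following is a minimal illustration, not the original experiment's code: it estimates I(phone label; acoustics(t,f)) for one quantized time-frequency measurement from co-occurrence counts; a landmark detector looks for maxima of this quantity over (t, f).

```python
import numpy as np

def mutual_information(phones, feature, n_bins=16):
    """Estimate I(phone; quantized feature) in bits from paired samples.

    phones  : integer phone label per frame
    feature : real-valued acoustic measurement at one (t, f) offset per frame
    """
    phones = np.asarray(phones)
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(feature, edges)            # values in 0 .. n_bins-1
    joint = np.zeros((phones.max() + 1, n_bins))
    for p, b in zip(phones, bins):
        joint[p, b] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)         # marginal over phones
    py = joint.sum(axis=0, keepdims=True)         # marginal over feature bins
    nz = joint > 0
    # I(X;Y) = sum p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())
```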
Talk Outline
- Overview
- Acoustic modeling
  - Speech data and acoustic features
  - Landmark detection
  - Estimation of real-valued "distinctive features" using support vector machines (SVMs)
- Pronunciation modeling
  - A dynamic Bayesian network (DBN) implementation of Articulatory Phonology
  - A discriminative pronunciation model implemented using maximum entropy (MaxEnt)
- Technological evaluation
  - Rescoring of word-lattice output from an HMM-based recognizer
  - Errors that we fixed: channel noise, laughter, etcetera
  - New errors that we caused: pronunciation models trained on 3 hours can't compete with triphone models trained on 3000 hours
- Future plans

Overview

History: the research described in this talk was performed between June 30 and August 17, 2004, at the Johns Hopkins summer workshop WS04.

Scientific goal: to use high-dimensional machine learning technologies (SVM, DBN) to create representations capable of learning, from data, the types of speech knowledge that humans exhibit in psychophysical speech perception experiments.
Technological goal:
- Long-term: to create a better speech recognizer
- Short-term: lattice rescoring, applied to word lattices produced by SRI's NN/HMM hybrid

Overview of Systems to be Described
[System diagram:]
- Acoustic model (SVMs): MFCCs (5ms & 1ms frame period), formants, and phonetic & auditory model parameters, concatenated over 4-15 frames, yield p(landmark | acoustics) estimates.
- Pronunciation model (DBN or MaxEnt): converts SVM outputs into p(SVM outputs | word).
- First-pass ASR word lattice: word labels, start & end times, p(MFCC, PLP | word), and p(word | words).
- Rescoring: log-linear score combination of baseline and landmark-based scores (see the sketch below).
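As a rough illustration of that log-linear combination step (a sketch with assumed score names and weights, not the workshop implementation): each lattice edge's baseline acoustic and language-model scores are combined with the new landmark-based score in the log domain.

```python
# Hypothetical per-edge log scores; the weights are illustrative,
# not the values tuned at WS04.
def rescore_edge(logp_acoustic, logp_lm, logp_landmark,
                 w_acoustic=1.0, w_lm=1.0, w_landmark=0.3):
    """Log-linear combination of baseline and landmark-based scores."""
    return (w_acoustic * logp_acoustic
            + w_lm * logp_lm
            + w_landmark * logp_landmark)

# Pick the hypothesis maximizing the rescored edge score (toy numbers).
edges = [("backed up", -12.3, -4.1, -2.0), ("backt up", -11.9, -5.0, -3.2)]
best = max(edges, key=lambda e: rescore_edge(*e[1:]))
print(best[0])
```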
I. Acoustic Modeling

Goal: learn precise and generalizable models of the acoustic boundary associated with each distinctive feature.

Methods:
- Large input vector space (many acoustic feature types)
- Regularized binary classifiers (SVMs)
- SVM outputs "smoothed" using dynamic programming
- SVM outputs converted to posterior probability estimates, once per 5ms, using a histogram

Speech Databases

Corpus             Size     Phonetic Transcr.   Word Lattices
NTIMIT             14 hrs   manual              -
WS96&97            3.5 hrs  manual              -
SWB1 WS04 subset   12 hrs   auto                SRI
BBN Eval01         10 hrs   -                   BBN & SRI
RT03 Dev           6 hrs    -                   SRI
RT03 Eval          6 hrs    -                   SRI

Acoustic and Auditory Features
- MFCCs, 25ms window (standard ASR features)
- Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond
- Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths (Zheng & Hasegawa-Johnson, ICSLP 2004)
- Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures; Bitar & Espy-Wilson, 1996)
- Rate-place model of neural response fields in the cat auditory cortex (Carlyon & Shamma, JASA 2003)
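To make the SVM front end concrete, here is a minimal sketch of building the concatenated observation vectors, assuming librosa and an arbitrary 11-frame context window (the system concatenated anywhere from 4 to 15 frames, and used several feature streams besides MFCCs):

```python
import numpy as np
import librosa

def svm_observations(wav_path, context=11, hop_ms=5):
    """MFCC frames stacked over a +/- (context // 2) frame window."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * hop_ms / 1000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(sr * 0.025),  # 25ms window
                                hop_length=hop).T       # (frames, 13)
    half = context // 2
    padded = np.pad(mfcc, ((half, half), (0, 0)), mode="edge")
    # Each row: one frame concatenated with its neighbors.
    return np.stack([padded[i:i + context].ravel()
                     for i in range(len(mfcc))])
```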
What are Distinctive Features? What are Landmarks?
- Distinctive feature = a binary partition of the phonemes (Jakobson, 1952) that compactly describes pronunciation variability (Halle) and correlates with distinct acoustic cues (Stevens)
- Landmark = a change in the value of a manner feature: +sonorant to -sonorant, or -sonorant to +sonorant
- Five manner features: sonorant, consonantal, continuant, syllabic, silence
- Place and voicing features (SVMs are trained only at landmarks):
  - Primary articulator: lips, tongue blade, or tongue body
  - Features of the primary articulator: anterior, strident
  - Features of the secondary articulator: nasal, voiced
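These definitions translate directly into code. The sketch below uses a made-up three-feature manner bundle per phone (not the full WS04 inventory) and marks a landmark wherever any manner feature changes value between adjacent segments:

```python
# Hypothetical manner bundles, +1 / -1 values for
# (sonorant, continuant, syllabic). Not the WS04 feature set.
MANNER = {
    "b":  (-1, -1, -1),   # stop
    "ae": (+1, +1, +1),   # vowel
    "k":  (-1, -1, -1),
    "t":  (-1, -1, -1),
    "ah": (+1, +1, +1),
    "p":  (-1, -1, -1),
}

def landmarks(phone_seq):
    """Landmark = change in any manner feature between adjacent phones."""
    marks = []
    for i in range(1, len(phone_seq)):
        prev, cur = MANNER[phone_seq[i - 1]], MANNER[phone_seq[i]]
        changed = [j for j in range(3) if prev[j] != cur[j]]
        if changed:
            marks.append((i, changed))
    return marks

print(landmarks(["b", "ae", "k", "t", "ah", "p"]))  # "backed up"
```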
Landmark Detection using Support Vector Machines (SVMs)

[Figure: false-acceptance vs. false-rejection errors, TIMIT, per 10ms frame. The SVM stop-release detector has half the error of an HMM (Niyogi & Burges, 1999, 2002):]
(1) Delta-energy ("Deriv"): equal error rate = 0.2%
(2) HMM: false rejection error = 0.3%
(3) Linear SVM: equal error rate = 0.15%
(4) Radial basis function SVM: equal error rate = 0.13%

Dynamic Programming Smooths SVMs
- Maximize Π_i p(features(t_i) | X(t_i)) · p(t_{i+1} - t_i | features(t_i)) (see the sketch below)
- Soft-decision "smoothing" mode: p(acoustics | landmarks) is computed and fed to the pronunciation model
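A minimal sketch of that dynamic program, with hypothetical transition costs standing in for the duration term p(t_{i+1} - t_i | features(t_i)); the actual WS04 smoother differed in detail:

```python
import numpy as np

def smooth(frame_logp, switch_logp=-3.0, stay_logp=-0.05):
    """Viterbi smoothing of per-frame SVM posteriors.

    frame_logp : (T, K) array of log p(manner class k | acoustics at t).
    Staying in a class is cheap, switching (a landmark) is expensive,
    which acts as a crude duration model.
    """
    T, K = frame_logp.shape
    trans = np.full((K, K), switch_logp)
    np.fill_diagonal(trans, stay_logp)
    score = frame_logp[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans          # cand[i, j]: from i to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + frame_logp[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]  # landmarks sit where the label changes

logp = np.log(np.array([[0.9, 0.1], [0.6, 0.4], [0.4, 0.6],
                        [0.2, 0.8], [0.7, 0.3]]))
print(smooth(logp))  # short excursions are suppressed by the switch penalty
```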
SVM Extracts a Discriminant Dimension
- Cues for place of articulation: MFCCs + formants + rate-scale parameters, within 150ms of the landmark
- Kernel: transform to an infinite-dimensional Hilbert space
- SVM discriminant dimension = argmin(error(margin) + 1/width(margin))

Soft-Decision Landmark Probabilities
- Niyogi & Burges, 2002: p(class | acoustics) from a sigmoid model in the discriminant dimension, OR
- Juneja & Espy-Wilson, 2003: p(class | acoustics) from a histogram in the discriminant dimension
- Pipeline: 2000-dimensional acoustic feature vector -> SVM -> discriminant y_i(t) -> histogram -> posterior probability of the distinctive feature, p(d_i(t) = 1 | y_i(t))
- Soft decisions once per 5ms:
  - p(manner feature d(t) | Y(t))
  - p(place feature d(t) | Y(t), t is a landmark)
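Both discriminant-to-posterior mappings named above fit in a few lines; the bin count and sigmoid parameters below are illustrative, not the workshop's:

```python
import numpy as np

def histogram_posterior(y_train, d_train, n_bins=20):
    """Histogram map from SVM discriminant y to p(d=1 | y)
    (Juneja & Espy-Wilson style)."""
    edges = np.linspace(y_train.min(), y_train.max(), n_bins + 1)
    idx = np.clip(np.digitize(y_train, edges) - 1, 0, n_bins - 1)
    pos = np.bincount(idx[d_train == 1], minlength=n_bins)
    tot = np.bincount(idx, minlength=n_bins)
    post = (pos + 1) / (tot + 2)  # Laplace smoothing avoids empty bins
    return lambda y: post[np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)]

def sigmoid_posterior(a=-2.0, b=0.0):
    """Sigmoid map p(d=1 | y) = 1 / (1 + exp(a*y + b))
    (Niyogi & Burges style; a and b would be fit on held-out data)."""
    return lambda y: 1.0 / (1.0 + np.exp(a * y + b))

rng = np.random.default_rng(0)
y_tr = np.concatenate([rng.normal(-1, 1, 500), rng.normal(1, 1, 500)])
d_tr = np.repeat([0, 1], 500)
print(histogram_posterior(y_tr, d_tr)(0.8), sigmoid_posterior()(0.8))
```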
II. Pronunciation Modeling

Goal: represent a large number of pronunciation variants, in a controlled fashion, using distinctive features; pick out the distinctive features that are most important for each word recognition task.

Methods:
- Distinctive-feature-based lexicon + dynamic programming alignment
- Dynamic Bayesian network model of Articulatory Phonology (an articulation-based pronunciation variability model)
- MaxEnt search for lexically discriminative features (a perceptually based "pronunciation model")

1. Distinctive-Feature Based Lexicon
- Merger of the English Switchboard and Callhome dictionaries
- Converted to landmarks using Hasegawa-Johnson's perl transcription tools (landmarks in blue, place and voicing features in green on the original slide):

AGO (0.441765): +syllabic +reduced +back AX / +continuant +sonorant +velar +voiced G closure / +continuant +sonorant +velar +voiced G release / +syllabic -low -high +back +round +tense OW
AGO (0.294118): +syllabic +reduced -back IX / +continuant +sonorant +velar +voiced G closure / +continuant +sonorant +velar +voiced G release / +syllabic -low -high +back +round +tense OW

Dynamic Programming Lexical Search
- Choose the word that maximizes Π_i p(features(t_i) | X(t_i)) · p(t_{i+1} - t_i | features(t_i)) · p(features(t_i) | word) (a toy version follows)
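A toy version of that search, with a hypothetical lexicon layout and stand-in log-probabilities (the real system used dynamic-programming alignment over landmark times):

```python
import math

# Hypothetical lexicon: word -> [(variant probability, landmark sequence)].
LEXICON = {
    "ago": [(0.441765, ["AX", "G closure", "G release", "OW"]),
            (0.294118, ["IX", "G closure", "G release", "OW"])],
}

def score_word(word, landmark_logp):
    """Score a word as its best variant: log p(variant) plus the summed
    log-probabilities of its landmarks under the SVM outputs.

    landmark_logp: dict from landmark label to its aligned log p, a
    stand-in for the real system's DP alignment."""
    best = -math.inf
    for prob, marks in LEXICON[word]:
        s = math.log(prob) + sum(landmark_logp.get(m, -10.0) for m in marks)
        best = max(best, s)
    return best

print(score_word("ago", {"AX": -0.2, "G closure": -0.5,
                         "G release": -0.4, "OW": -0.1}))
```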
2. Articulatory Phonology
- Phone-level edits describe such variants awkwardly:
  - warmth -> w ao r m p th (phone insertion?)
  - I don't know -> ah dx uh_n ow_n (phone deletion?)
  - several -> s eh r v ax l (exchange of two phones?)
- Many pronunciation phenomena can be parsimoniously described as resulting from asynchrony and reduction of sub-phonetic features:
  - instruments -> ih_n s ch em ih_n n s
  - everybody -> eh r uw ay
- One set of features, based on articulatory phonology (Browman & Goldstein, 1990): LIP-OP, TT-OPEN, TT-LOC, TB-LOC, TB-OPEN, VELUM, VOICING

Dynamic Bayesian Network Model (Livescu and Glass, 2004)
- The model is implemented as a dynamic Bayesian network (DBN): a representation, via a directed graph, of a distribution over a set of variables that evolve through time
- [Figure: example DBN with three features. Asynchrony between the per-feature indices is governed by a probability table of the form Pr(async_{1,2} = a) = Pr(|ind_1 - ind_2| = a); surface feature values are given by baseform pronunciations.]
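A minimal sketch of the asynchrony idea (the probability values below are made up, not those on the slide): two articulatory feature streams advance through their baseform indices, and the model scores how far apart the indices are allowed to drift.

```python
import numpy as np

# Illustrative Pr(async = |ind1 - ind2|); values are made up.
P_ASYNC = {0: 0.7, 1: 0.2, 2: 0.1}

def asynchrony_logp(ind1, ind2):
    """Log-probability of the desynchronization between two feature streams.

    ind1, ind2 : per-frame indices into each stream's baseform sequence.
    """
    logp = 0.0
    for i1, i2 in zip(ind1, ind2):
        logp += np.log(P_ASYNC.get(abs(i1 - i2), 1e-6))
    return logp

# Lips reach their target one frame before the tongue: mild penalty.
print(asynchrony_logp([0, 1, 1, 2], [0, 0, 1, 2]))
```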
The DBN-SVM Hybrid Developed at WS04
[Figure: for the word LIKE, the DBN maps the canonical form (tongue-front glide, front vowel, tongue mid / tongue open) onto surface forms (tongue closed / semi-closed / open, palatal place), while SVM outputs over a multi-frame observation x (including spectrum, formants, & auditory model) supply manner and place evidence such as p(g_GR(x) | glide release) and p(g_PGR(x) | palatal glide release).]

3. Discriminative Pronunciation Model
- Rationale: the baseline HMM-based system already provides high-quality hypotheses
  - 1-best error rate from N-best lists: 24.4% (RT-03 dev set)
  - oracle error rate: 16.2%
- Method: use landmark detection only where necessary, to correct errors made by the baseline recognition system
- Example (fsh_60386_1_0105420_0108380):
  - Ref: that cannot be that hard to sneak onto an airplane
  - Hyp: they can be a that hard to speak on an airplane

Identifying Confusable Hypotheses
- Use existing alignment algorithms for converting lattices into confusion networks (Mangu, Brill & Stolcke, 2000)
- Hypotheses ranked by posterior probability
- Generated from n-best lists without 4-gram or pronunciation model scores (hence higher WER compared to lattices)
- Multi-words ("I_dont_know") were split prior to generating confusion networks

[Figure: confusion network for the example, with slots such as they/that, can/can't, be/*DEL*, a/that, hard, to, sneak/speak, onto/on/an, airplane.]

Identifying Confusable Hypotheses (continued)
- How much can be gained from fixing confusions?
- Baseline error rate: 25.8%
- Oracle error rates when selecting the correct word from the confusion set:

# hypotheses to select from   Including homophones   Not including homophones
2                             23.9%                  23.9%
3                             23.0%                  23.0%
4                             22.4%                  22.5%
5                             22.0%                  22.1%

Selecting Relevant Landmarks
- Not all landmarks are equally relevant for distinguishing between competing word hypotheses (e.g., vowel features are irrelevant for sneak vs. speak)
- Using all available landmarks might hurt performance when irrelevant landmarks have weak scores (but: redundancy might be useful)
- Automatic selection algorithm; it should:
  - optimally distinguish the set of confusable words (discriminative)
  - rank landmark features according to their relevance for distinguishing words (i.e., output should be interpretable in phonetic terms)
  - be extendable to features beyond landmarks
Maximum-Entropy Landmark Selection
- Convert each word in a confusion set into a fixed-length, landmark-based representation, using an idea from information retrieval:
  - a vector space consisting of binary relations between two landmarks
  - manner landmarks: precedence, e.g. V < Son. Cons.
  - manner & place features: overlap, e.g. V o +high
  - preserves basic temporal information
- Words are represented as frequency entries in the feature vector
- Not all possible relations are used (phonotactic constraints; place features are detected dependent on manner landmarks)
- Dimensionality of the feature space: 40-60
- Word entries are derived from the phone representation plus pronunciation rules (a sketch of the conversion follows)
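The conversion can be sketched as follows, with hypothetical landmark sequences and a toy relation inventory; compare the table in the next slide:

```python
# Hypothetical landmark sequences: (manner landmark, {place features}).
WORDS = {
    "speak": [("Start", set()), ("Fric", {"strident"}), ("Stop", set()),
              ("Vowel", {"high", "front"})],
    "sneak": [("Start", set()), ("Fric", {"strident"}), ("Son", set()),
              ("Vowel", {"high", "front"})],
}

def relations(seq):
    """Binary relations between landmarks: adjacency precedence and
    manner/place overlap, as in the vector space described above."""
    feats = set()
    for (a, _), (b, _) in zip(seq, seq[1:]):
        feats.add(f"{a} {b}")            # precedence, e.g. "Fric Stop"
    for mark, places in seq:
        for p in places:
            feats.add(f"{mark} o +{p}")  # overlap, e.g. "Vowel o +high"
    return feats

for word, seq in WORDS.items():
    print(word, sorted(relations(seq)))
```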
Vector-Space Word Representation

Word    Start Fric  Fric Stop  Fric Son  Fric Vowel  Stop Vowel  Vowel o high  Vowel o front  Fric o strident
speak   1           1          0         0           1           1             1              1
sneak   1           0          1         0           0           1             1              1
seek    1           0          0         1           0           1             1              1
he      1           0          0         1           0           1             1              0
she     1           0          0         1           0           1             1              1
steak   1           1          0         0           1           0             1              1
...

Maximum-Entropy Discrimination
- Use a maxent classifier, p(y | x) = (1/Z(x)) exp(Σ_i λ_i f_i(x, y)); here y = words, x = acoustics, f = landmark relationships
- Why a maxent classifier?
  - discriminative classifier
  - handles a possibly large set of confusable words
  - allows later addition of non-binary features
- Training: ideally on real landmark detection output; here, on entries from the lexicon (includes pronunciation variants)
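A runnable sketch of this classifier, using scikit-learn's logistic regression as the maxent model and toy vectors from the table above (the workshop's trainer and feature set differed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

RELATIONS = ["Start Fric", "Fric Stop", "Fric Son", "Fric Vowel",
             "Stop Vowel", "Vowel o high", "Vowel o front", "Fric o strident"]

# Feature vectors for the confusion set, copied from the table above.
X = np.array([[1, 1, 0, 0, 1, 1, 1, 1],   # speak
              [1, 0, 1, 0, 0, 1, 1, 1]])  # sneak
y = ["speak", "sneak"]

# Logistic regression is a maximum-entropy classifier; with classes_
# sorted as ['sneak', 'speak'], positive weights vote for "speak".
clf = LogisticRegression(C=10.0).fit(X, y)
print(dict(zip(RELATIONS, clf.coef_[0].round(2))))
# Expect a positive weight on "Fric Stop" (speak) and a negative weight
# on "Fric Son" (sneak), echoing the signed weights on the next slide.
```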
Maximum-Entropy Discrimination
- Example: sneak vs. speak
- A different model is trained for each confusion set, so landmarks can have different weights in different contexts:

speak: SC +blade -2.47 / FR SC -2.47 / FR SIL 2.11 / SIL ST 1.75 / ...
sneak: SC +blade 2.47 / FR SC 2.47 / FR SIL -2.11 / SIL ST -1.75 / ...

Landmark Queries
- Select the N landmarks with the highest weights
- Ask the landmark detection module to produce scores for the selected landmarks within the word boundaries given by the baseline system
- Example (confusion networks -> landmark detectors -> back):
  - query:  sneak 1.70 1.99 SC +blade ?
  - answer: sneak 1.70 1.99 SC +blade 0.75 0.56
III. Evaluation

Acoustic Feature Selection

1. Accuracy per frame (%), stop releases only, NTIMIT:

             MFCCs+Shape         MFCCs+Formants
Kernel       Linear    RBF       Linear    RBF
+/- lips     78.3      90.7      92.7      95.0
+/- blade    73.4      87.1      79.6      85.1
+/- body     73.0      85.2      85.7      87.2

2. Word error rate, lattice rescoring, RT03-devel, one talker (WARNING: this talker is atypical):
- Baseline: 15.0% (113/755)
- Rescoring, place based on MFCCs + formant-based params: 14.6% (110/755)
- Rescoring, place based on rate-scale + formant-based params: 14.3% (108/755)

SVM Training: Mixed vs. Targeted Data

Train:              NTIMIT           NTIMIT&SWB       NTIMIT           Switchboard
Test:               NTIMIT           NTIMIT&SWB       Switchboard      Switchboard
Kernel:             Linear   RBF     Linear   RBF     Linear   RBF     Linear   RBF
speech onset        95.1     96.2    86.9     89.9    71.4     62.2    81.6     81.6
speech offset       79.6     88.5    76.3     86.4    65.3     78.6    68.4     83.7
consonant onset     94.5     95.5    91.4     93.5    70.3     72.7    95.8     97.7
consonant offset    91.7     93.7    94.3     96.8    80.3     86.2    92.8     96.8
continuant onset    89.4     94.1    87.3     95.0    69.1     81.9    86.2     92.0
continuant offset   90.8     94.9    90.4     94.6    69.3     68.8    89.6     94.3
sonorant onset      95.6     97.2    97.8     96.7    85.2     86.5    96.3     96.3
sonorant offset     95.3     96.4    94.0     97.4    75.6     75.2    95.2     96.4
syllabic onset      90.7     95.2    91.4     95.5    69.5     78.9    87.9     92.6
syllabic offset     90.1     88.9    87.1     92.9    54.4     60.8    88.2     89.7
DBN-SVM: Models Nonstandard Phones
- Example, "I don't know": /d/ becomes a flap; /n/ becomes a creaky nasal glide

DBN-SVM Design Decisions
- What kind of SVM outputs should be used in the DBN?
  - Method 1 (EBS/DBN): generate a landmark segmentation with EBS using the manner SVMs, then apply place SVMs at appropriate points in the segmentation; either
    - force the DBN to use the EBS segmentation, or
    - allow the DBN to stray from the EBS segmentation, using place/voicing SVM outputs whenever available
  - Method 2 (SVM/DBN): apply all SVMs in all frames, and allow the DBN to consider all possible segmentations; either
    - in a single pass, or
    - in two passes: (1) manner-based segmentation; (2) place+manner scoring
- How should we take into account the distinctive feature hierarchy?
- How do we avoid "over-counting" evidence?
- How do we train the DBN (feature transcriptions vs. SVM outputs)?
DBN-SVM Rescoring Experiments
- For each lattice edge:
  - SVM probabilities are computed over the edge duration and used as soft evidence in the DBN
  - the DBN computes a score S ∝ P(word | evidence)
  - the final edge score is a weighted interpolation of the baseline scores and the EBS/DBN or SVM/DBN score

Date      Experimental setup                                               3-speaker WER (# errors)   RT03 dev WER
-         Baseline                                                         27.7 (550)                 26.8
Jul31_0   EBS/DBN, "hierarchically-normalized" SVM output probabilities,   27.6 (549)                 26.8
          DBN trained on subset of ICSI transcriptions
Aug1_19   + improved silence modeling                                      27.6 (549)
Aug2_19   EBS/DBN, unnormalized SVM probs + fricative lip feature          27.3 (543)                 26.8
Aug4_2    + DBN trained using SVM outputs                                  27.3 (543)
Aug6_20   + full feature hierarchy in DBN                                  27.4 (545)
Aug7_3    + reduction probabilities depend on word frequency               27.4 (544)
Aug8_19   + retrained SVMs + nasal classifier + DBN bug fixes              27.4 (544)
Aug11_19  SVM/DBN, 1 pass                                                  miserable failure!
Aug14_0   SVM/DBN, 2 pass                                                  27.3 (542)
Aug14_20  SVM/DBN, 2 pass, using only high-accuracy SVMs                   27.2 (541)
Discriminative Pronunciation Model

            WER     Insertions   Deletions     Substitutions
Baseline    25.8%   2.6% (982)   9.2% (3526)   14.1% (5417)
Rescored    25.8%   2.6% (984)   9.2% (3524)   14.1% (5408)

RT-03 dev set: 35497 words, 2930 segments, 36 speakers (Switchboard and Fisher data)
- Rescored: product combination of the old and new probability distributions, with weights 0.8 (old) and 0.2 (new); see the sketch below
- The correct/incorrect decision changed in about 8% of all cases
- Slightly more errors were fixed than newly introduced

Analysis
- When does it work? Detectors give high probability ...
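The product-combination rule above can be written in a few lines; the confusion-set probabilities below are toy values, while the 0.8/0.2 weights come from the slide:

```python
import numpy as np

def product_combination(p_old, p_new, w_old=0.8, w_new=0.2):
    """Weighted product (log-linear) combination of two word posteriors,
    renormalized over the confusion set. Weights from the slide: 0.8 / 0.2."""
    combined = np.power(p_old, w_old) * np.power(p_new, w_new)
    return combined / combined.sum()

# Toy confusion set {sneak, speak}: the baseline prefers "speak",
# the landmark evidence prefers "sneak".
print(product_combination(np.array([0.4, 0.6]), np.array([0.8, 0.2])))
```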