語音信號識別及處理中英文翻譯文獻(xiàn)綜述

上傳人：共*** IP屬地：四川上傳時(shí)間：2023-04-15 格式：DOC 頁數(shù)：17 大?。?.37MB 積分：20 舉報(bào) 版權(quán)申訴

已閱讀5頁，還剩12頁未讀，繼續(xù)免費(fèi)閱讀

版權(quán)說明：本文檔由用戶提供并上傳，收益歸屬內(nèi)容提供方，若內(nèi)容存在侵權(quán)，請進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

語音識別在計(jì)算機(jī)技術(shù)中，語音識別是指為了達(dá)到說話者發(fā)音而由計(jì)算機(jī)生成的功能，利用計(jì)算機(jī)識別人類語音的技術(shù)。（例如，抄錄講話的文本，數(shù)據(jù)項(xiàng);經(jīng)營電子和機(jī)械設(shè)備;電話的自動(dòng)化處理），是通過所謂的自然語言處理的計(jì)算機(jī)語音技術(shù)的一個(gè)重要元素。通過計(jì)算機(jī)語音處理技術(shù)，來自語音發(fā)音系統(tǒng)的由人類創(chuàng)造的聲音，包括肺，聲帶和舌頭，通過接觸，語音模式的變化在嬰兒期、兒童學(xué)習(xí)認(rèn)識有不同的模式，盡管由不同人的發(fā)音，例如，在音調(diào)，語氣，強(qiáng)調(diào)，語調(diào)模式不同的發(fā)音相同的詞或短語，大腦的認(rèn)知能力，可以使人類實(shí)現(xiàn)這一非凡的能力。在撰寫本文時(shí)（2008年），我們可以重現(xiàn)，語音識別技術(shù)不只表現(xiàn)在有限程度的電腦能力上，在其他許多方面也是有用的。語音識別技術(shù)的挑戰(zhàn)古老的書寫系統(tǒng),要回溯到蘇美爾人的六千年前。他們可以將模擬錄音通過留聲機(jī)進(jìn)行語音播放，直到1877年。然而，由于與語音識別各種各樣的問題，語音識別不得不等待著計(jì)算機(jī)的發(fā)展。首先,演講不是簡單的口語文本——同樣的道理,戴維斯很難捕捉到一個(gè)note-for-note曲作為樂譜。人類所理解的詞、短語或句子離散與清晰的邊界實(shí)際上是將信號連續(xù)的流,而不是聽起來:Iwenttothestoreyesterday昨天我去商店。單詞也可以混合,用Whaddayawa嗎?這代表著你想要做什么。第二,沒有一對一的聲音和字母之間的相關(guān)性。在英語,有略多于5個(gè)元音字母——a,e,i,o,u,有時(shí)y和w。有超過二十多個(gè)不同的元音,雖然,精確統(tǒng)計(jì)可以取決于演講者的口音而定。但相反的問題也會(huì)發(fā)生,在那里一個(gè)以上的信號能再現(xiàn)某一特定的聲音。字母C可以有相同的字母K的聲音，如蛋糕，或作為字母S，如柑橘。此外,說同一語言的人使用不相同的聲音,即語言不同,他們的聲音語音或模式的組織，有不同的口音。例如“水”這個(gè)詞,wadder可以顯著watter，woaderwattah等等。每個(gè)人都有獨(dú)特的音量——男人說話的時(shí)候,一般開的最低音，婦女和兒童具有更高的音高(雖然每個(gè)人都有廣泛的變異和重疊)。發(fā)音可以被鄰近的聲音、說話者的速度和說話者的健康狀況所影響，當(dāng)一個(gè)人感冒的時(shí)候，就要考慮發(fā)音的變化。最后,考慮到不是所有的語音都是有意義的聲音組成。通常語音自身是沒有任何意義的，但有些用作分手話語以傳達(dá)說話人的微妙感情或動(dòng)機(jī)的信息:哦,就像,你知道,好的。也有一些聽起來都不認(rèn)為是字，這是一項(xiàng)詞性的：呃,嗯,嗯。嗽、打噴嚏、談笑風(fēng)生、嗚咽,甚至打嗝的可以成為上述的內(nèi)容之一。在噪雜的地方與環(huán)境自身的噪聲中，即使語音識別也是困難的?！拔易蛱烊チ松痰辍钡牟ㄐ螆D“我昨天去了商店”的光譜圖語音識別的發(fā)展史盡管困難重重，語音識別技術(shù)卻隨著數(shù)字計(jì)算機(jī)的誕生一直被努力著。早在1952年，研究人員在貝爾實(shí)驗(yàn)室就已開發(fā)出了一種自動(dòng)數(shù)字識別器，取名“奧黛麗”。如果說話的人是男性，并且發(fā)音者在詞與詞之間停頓350毫秒并把把詞匯限制在1—9之間的數(shù)字，再加上“哦”，另外如果這臺機(jī)器能夠調(diào)整到適應(yīng)說話者的語音習(xí)慣，奧黛麗的精確度將達(dá)到97℅—99℅，如果識別器不能夠調(diào)整自己，那么精確度將低至60℅.奧黛麗通過識別音素或者兩個(gè)截然不同的聲音工作。這些因素與識別器經(jīng)訓(xùn)練產(chǎn)生的參考音素是有關(guān)聯(lián)的。在接下來的20年里研究人員花了大量的時(shí)間和金錢來改善這個(gè)概念，但是少有成功。計(jì)算機(jī)硬件突飛猛進(jìn)、語音合成技術(shù)穩(wěn)步提高，喬姆斯基的生成語法理論認(rèn)為語言可以被程序性地分析。然而，這些似乎并沒有提高語音識別技術(shù)。喬姆斯基和哈里的語法生成工作也導(dǎo)致主流語言學(xué)放棄音素概念，轉(zhuǎn)而選擇將語言的聲音模式分解成更小、更易離散的特征。1969年皮爾斯坦率地寫了一封信給美國聲學(xué)學(xué)會(huì)的會(huì)刊，大部分關(guān)于語音識別的研究成果都發(fā)表在上面。皮爾斯是衛(wèi)星通信的先驅(qū)之一，并且是貝爾實(shí)驗(yàn)室的執(zhí)行副主任，貝爾實(shí)驗(yàn)室在語音識別研究中處于領(lǐng)先地位。皮爾斯說所有參與研究的人都是在浪費(fèi)時(shí)間和金錢。如果你認(rèn)為一個(gè)人之所以從事語音識別方面的研究是因?yàn)樗艿玫浇疱X，那就太草率了。這種吸引力也許類似于把水變成汽油、從海水中提取黃金、治愈癌癥或者登月的誘惑。一個(gè)人不可能用削減肥皂成本10℅的方法簡單地得到錢。如果想騙到人，他要用欺詐和誘惑。皮爾斯1969年的信標(biāo)志著在貝爾實(shí)驗(yàn)室持續(xù)了十年的研究結(jié)束了。然而，國防研究機(jī)構(gòu)ARPA選擇了堅(jiān)持下去。1971年他們資助了一項(xiàng)開發(fā)一種語音識別器的研究計(jì)劃，這種語音識別器要能夠處理至少1000個(gè)詞并且能夠理解相互連接的語音，即在語音中沒有詞語之間的明顯停頓。這種語音識別器能夠假設(shè)一種存在輕微噪音背景的環(huán)境，并且它不需要在真正的時(shí)間中工作。到1976年，三個(gè)承包公司已經(jīng)開發(fā)出六種系統(tǒng)。最成功的是由卡耐基麥隆大學(xué)開發(fā)的叫做“Harpy”的系統(tǒng)?！癏arpy”比較慢，四秒鐘的句子要花費(fèi)五分多鐘的時(shí)間來處理。并且它還要求發(fā)音者通過說句子來建立一種參考模型。然而，它確實(shí)識別出了1000個(gè)詞匯，并且支持連音的識別。研究通過各種途徑繼續(xù)著，但是“Harpy”已經(jīng)成為未來成功的模型。它應(yīng)用隱馬爾科夫模型和統(tǒng)計(jì)模型來提取語音的意義。本質(zhì)上，語音被分解成了相互重疊的聲音片段和被認(rèn)為最可能的詞或詞的部分所組成的幾率模型。整個(gè)程序計(jì)算復(fù)雜，但它是最成功的。在1970s到1980s之間，關(guān)于語音識別的研究繼續(xù)進(jìn)行著。到1980s，大部分研究者都在使用隱馬爾科夫模型，這種模型支持著現(xiàn)代所有的語音識別器。在1980s后期和1990s，DARPA資助了一些研究。第一項(xiàng)研究類似于以前遇到的挑戰(zhàn)，即1000個(gè)詞匯量，但是這次要求更加精確。這個(gè)項(xiàng)目使系統(tǒng)詞匯出錯(cuò)率從10℅下降了一些。其余的研究項(xiàng)目都把精力集中在改進(jìn)算法和提高計(jì)算效率上。2001年微軟發(fā)布了一個(gè)能夠與0fficeXP同時(shí)工作的語音識別系統(tǒng)。它把50年來這項(xiàng)技術(shù)的發(fā)展和缺點(diǎn)都包含在內(nèi)了。這個(gè)系統(tǒng)必須用大作家的作品來訓(xùn)練為適應(yīng)某種指定的聲音，比如埃德加愛倫坡的厄舍古屋的倒塌和比爾蓋茨的前進(jìn)的道路。即使在訓(xùn)練之后，該系統(tǒng)仍然是脆弱的，以至于還提供了一個(gè)警告：“如果你改變使用微軟語音識別系統(tǒng)的地點(diǎn)導(dǎo)致準(zhǔn)確率將降低，請重新啟動(dòng)麥克風(fēng)”。從另一方面來說，該系統(tǒng)確實(shí)能夠在真實(shí)的時(shí)間中工作，并且它確實(shí)能識別連音。語音識別的今天技術(shù)當(dāng)今的語音識別技術(shù)著力于通過共振和光譜分析來對我們的聲音產(chǎn)生的聲波進(jìn)行數(shù)學(xué)分析。計(jì)算機(jī)系統(tǒng)第一次通過數(shù)字模擬轉(zhuǎn)換器記錄了經(jīng)過麥克風(fēng)傳來的聲波。那種當(dāng)我們說一個(gè)詞的時(shí)候所產(chǎn)生的模擬的或者持續(xù)的聲波被分割成了一些時(shí)間碎片，然后這些碎片按照它們的振幅水平被度量，振幅是指從一個(gè)說話者口中產(chǎn)生的空氣壓力。為了測量振幅水平并且將聲波轉(zhuǎn)換成為數(shù)字格式，現(xiàn)在的語音識別研究普遍采用了奈奎斯特—香農(nóng)定理。奈奎斯特—香農(nóng)定理奈奎斯特—香農(nóng)定理是在1928年研究發(fā)現(xiàn)的，該定理表明一個(gè)給定的模擬頻率能夠由一個(gè)是原始模擬頻率兩倍的數(shù)字頻率重建出來。奈奎斯特證明了該規(guī)律的真實(shí)性，因?yàn)橐粋€(gè)聲波頻率必須由于壓縮和疏散各取樣一次。例如，一個(gè)20kHz的音頻信號能準(zhǔn)確地被表示為一個(gè)44.1kHz的數(shù)字信號樣本。工作原理語音識別系統(tǒng)通常使用統(tǒng)計(jì)模型來解釋方言，口音，背景噪音和發(fā)音的不同。這些模型已經(jīng)發(fā)展到這種程度，在一個(gè)安靜的環(huán)境中準(zhǔn)確率可以達(dá)到90℅以上。然而每一個(gè)公司都有它們自己關(guān)于輸入處理的專項(xiàng)技術(shù)，存在著4種關(guān)于語音如何被識別的共同主題。1.基于模板：這種模型應(yīng)用了內(nèi)置于程序中的語言數(shù)據(jù)庫。當(dāng)把語音輸入到系統(tǒng)中后，識別器利用其與數(shù)據(jù)庫的匹配進(jìn)行工作。為了做到這一點(diǎn)，該程序使用了動(dòng)態(tài)規(guī)劃算法。這種語音識別技術(shù)的衰落是因?yàn)檫@個(gè)識別模型不足以完成對不在數(shù)據(jù)庫中的語音類型的理解。2.基于知識：基于知識的語音識別技術(shù)分析語音的聲譜圖以收集數(shù)據(jù)和制定規(guī)則，這些數(shù)據(jù)和規(guī)則回饋與操作者的命令和語句等值的信息。這種識別技術(shù)不適用關(guān)于語音的語言和語音知識。3.隨機(jī)：隨機(jī)語音識別技術(shù)在今天最為常見。隨機(jī)語音分析方法利用隨機(jī)概率模型來模擬語音輸入的不確定性。最流行的隨機(jī)概率模型是HMM（隱馬爾科夫模型）。如下所示：Yt是觀察到的聲學(xué)數(shù)據(jù)，p（W）是一個(gè)特定詞串的先天隨機(jī)概率，p（Yt∣W）是在給定的聲學(xué)模型中被觀察到的聲學(xué)數(shù)據(jù)的概率，W是假設(shè)的詞匯串。在分析語音輸入的時(shí)候，HMM被證明是成功的，因?yàn)樵撍惴紤]到了語言模型，人類說話的聲音模型和已知的所有詞匯。1.聯(lián)結(jié)：在聯(lián)結(jié)主義語音識別技術(shù)當(dāng)中，關(guān)于語音輸入的知識是這樣獲得的，即分析輸入的信號并從簡單的多層感知器中用多種方式將其儲存在延時(shí)神經(jīng)網(wǎng)絡(luò)中。如前所述，利用隨機(jī)模型來分析語言的程序是今天最流行的，并且證明是最成功的。識別指令當(dāng)今語音識別軟件最重要的目標(biāo)是識別指令。這增強(qiáng)了語音軟件的功能。例如微軟Sync被裝進(jìn)了許多新型汽車?yán)锩?，?jù)說這可以讓使用者進(jìn)入汽車的所有電子配件和免提。這個(gè)軟件是成功的。它詢問使用者一系列問題并利用常用詞匯的發(fā)音來得出語音恒量。這些常量變成了語音識別技術(shù)算法中的一環(huán)，這樣以后就能夠提供更好的語音識別。當(dāng)今的技術(shù)評論家認(rèn)為這項(xiàng)技術(shù)自20世紀(jì)90年代開始已經(jīng)有了很大進(jìn)步，但是在短時(shí)間內(nèi)不會(huì)取代手控裝置。聽寫關(guān)于指令識別的第二點(diǎn)是聽寫。就像接下來討論的那樣，今天的市場看重聽寫軟件在轉(zhuǎn)述醫(yī)療記錄、學(xué)生試卷和作為一種更實(shí)用的將思想轉(zhuǎn)化成文字方面的價(jià)值。另外，許多公司看重聽寫在翻譯過程中的價(jià)值，在這個(gè)過程中，使用者可以把他們的語言翻譯成為信件，這樣使用者就可以說給他們母語中另一部分人聽。在今天的市場上，關(guān)于該軟件的生產(chǎn)制造已經(jīng)存在。語句翻譯中存在的錯(cuò)誤當(dāng)語音識別技術(shù)處理你的語句的時(shí)候，它們的準(zhǔn)確率取決于它們減少錯(cuò)誤的能力。它們在這一點(diǎn)上的評價(jià)標(biāo)準(zhǔn)被稱為單個(gè)詞匯錯(cuò)誤率（SWER）和指令成功率（CSR）。當(dāng)一個(gè)句子中一個(gè)單詞被弄錯(cuò)，那就叫做單個(gè)詞匯出錯(cuò)。因?yàn)镾WERs在指令識別系統(tǒng)中存在，它們在聽寫軟件中最為常見。指令成功率是由對指令的精確翻譯決定的。一個(gè)指令陳述可能不會(huì)被完全準(zhǔn)確的翻譯，但識別系統(tǒng)能夠利用數(shù)學(xué)模型來推斷使用者想要發(fā)出的指令。商業(yè)主要的語音技術(shù)公司隨著語音技術(shù)產(chǎn)業(yè)的發(fā)展，更多的公司帶著他們新的產(chǎn)品和理念進(jìn)入這一領(lǐng)域。下面是一些語音識別技術(shù)領(lǐng)域領(lǐng)軍公司名單（并非全部）NICESystems（NASDAQ:NICEandTelAviv：Nice），該公司成立于1986年，總部設(shè)在以色列，它專長于數(shù)字記錄和歸檔技術(shù)。他們在2007年收入5.23億美元。欲了解更多信息，請?jiān)L問Verint系統(tǒng)公司（OTC：VRNT），總部設(shè)在紐約的梅爾維爾，創(chuàng)立于1994年把自己定位為“勞動(dòng)力優(yōu)化智能解決方案，IP視頻，通訊截取和公共安全設(shè)備的領(lǐng)先供應(yīng)商。詳細(xì)信息，請?jiān)L問Nuance公司（納斯達(dá)克股票代碼：NUAN）總部設(shè)在伯靈頓，開發(fā)商業(yè)和客戶服務(wù)使用語音和圖像技術(shù)。欲了解更多信息，請?jiān)L問Vlingo，總部設(shè)在劍橋，開發(fā)與無線/移動(dòng)技術(shù)對接的語音識別技術(shù)。Vlingo最近與雅虎聯(lián)手合作，為雅虎的移動(dòng)搜索服務(wù)—一鍵通功能提供語音識別技術(shù)。欲了解更多信息，請?jiān)L問在語音技術(shù)領(lǐng)域的其他主要公司包括：Unisys，ChaCha，SpeechCycle，Sensory，微軟的Tellme公司，克勞斯納技術(shù)等等。

專利侵權(quán)訴訟考慮到這兩項(xiàng)業(yè)務(wù)和技術(shù)的高度競爭性，各公司之間有過無數(shù)次的專利侵權(quán)訴訟并不奇怪。在開發(fā)語音識別設(shè)備所涉及的每個(gè)元素都可以作為一個(gè)單獨(dú)的技術(shù)申請專利。使用已經(jīng)被另一家公司或個(gè)人申請專利的技術(shù)，即使這項(xiàng)技術(shù)是你自己獨(dú)立研發(fā)的，你也可能被要求賠償，并并可能不公正地禁止你以后使用該項(xiàng)技術(shù)。語音產(chǎn)業(yè)中的政治和商業(yè)緊緊地與語音技術(shù)的發(fā)展聯(lián)系在一起，因此，必須認(rèn)識到可能阻礙該行業(yè)的進(jìn)一步發(fā)展的政治和法律障礙。下面是對一些專利侵權(quán)訴訟的敘述。應(yīng)當(dāng)指出，目前有許多這樣的訴訟立案，許多訴訟案被推上法庭。語音識別未來的發(fā)展今后的發(fā)展趨勢和應(yīng)用醫(yī)療行業(yè)醫(yī)療行業(yè)有多年來一直在宣傳電子病歷(EMR)。不幸的是,產(chǎn)業(yè)遲遲不能夠滿足EMRs,一些公司斷定原因是由于數(shù)據(jù)的輸入。沒有足夠的人員將大量的病人信息輸入成為電子格式,因此,紙質(zhì)記錄依然盛行。一家叫Nuance(也出現(xiàn)在其他領(lǐng)域,軟件開發(fā)者稱為龍指令)相信他們可以找到一市場將他們的語音識別軟件出售那些更喜歡聲音而非手寫輸入病人信息的醫(yī)生。軍事國防工業(yè)研究語音識別軟件試圖將其應(yīng)用復(fù)雜化而非更有效率和親切。為了使駕駛員更快速、方便地進(jìn)入需要的數(shù)據(jù)庫，語音識別技術(shù)是目前正在飛機(jī)駕駛員座位下面的顯示器上進(jìn)行試驗(yàn)。軍方指揮中心同樣正在嘗試?yán)谜Z音識別技術(shù)在危急關(guān)頭用快速和簡易的方式進(jìn)入他們掌握的大量資料庫。另外，軍方也為了照顧病員涉足EMR。軍方宣布，正在努力利用語音識別軟件把數(shù)據(jù)轉(zhuǎn)換成為病人的記錄。附：英文原文SpeechRecognitionIncomputertechnology,SpeechRecognitionreferstotherecognitionof\o"Spokenlanguage"humanspeechbycomputersfortheperformanceofspeaker-initiatedcomputer-generatedfunctions(e.g.,transcribingspeechtotext;dataentry;operatingelectronicandmechanicaldevices;automatedprocessingoftelephonecalls)—amainelementofso-called\o"Naturallanguage"naturallanguageprocessingthroughcomputerspeechtechnology.Speechderivesfromsoundscreatedbythe\o"Human"human\o"Articulatoryphonetics(notyetwritten)"articulatorysystem,includingthe\o"Lung"lungs,\o"Vocalcord"vocalcords,and\o"Tongue"tongue.Throughexposuretovariationsinspeechpatternsduringinfancy,achildlearnstorecognizethesamewordsorphrasesdespitedifferentmodesofpronunciationbydifferentpeople—e.g.,pronunciationdifferinginpitch,tone,emphasis,intonationpattern.Thecognitiveabilityofthe\o"Brain"brainenableshumanstoachievethatremarkablecapability.Asofthiswriting(2008),wecanreproducethatcapabilityincomputersonlytoalimiteddegree,butinmanywaysstilluseful.TheChallengeofSpeechRecognition\o"Writingsystem"Writingsystemsareancient,goingbackasfarasthe\o"Sumerians(notyetwritten)"Sumeriansof6,000yearsago.The\o"Phonograph(notyetwritten)"phonograph,whichallowedtheanalogrecordingandplaybackofspeech,datesto1877.Speechrecognitionhadtoawaitthedevelopmentofcomputer,however,duetomultifariousproblemswiththerecognitionofspeech.First,speechisnotsimplyspokentext--inthesamewaythatMilesDavisplayingSoWhatcanhardlybecapturedbyanote-for-noterenditionassheet\o"Music"music.Whathumansunderstandasdiscrete\o"Word"words,\o"Phrase(notyetwritten)"phrasesor\o"Sentence(linguistics)(notyetwritten)"sentenceswithclearboundariesareactuallydeliveredasacontinuousstreamofsounds:Iwenttothestoreyesterday,ratherthanIwenttothestoreyesterday.Wordscanalsoblend,withWhaddayawa?representingWhatdoyouwant?Second,thereisnoone-to-onecorrelationbetweenthesoundsand\o"Letter(alphabet)"letters.In\o"Englishlanguage"English,thereareslightlymorethanfive\o"Vowel"vowelletters--a,e,i,o,u,andsometimesyandw.Therearemorethantwentydifferentvowelsounds,though,andtheexactcountcanvarydependingontheaccentofthespeaker.Thereverseproblemalsooccurs,wheremorethanonelettercanrepresentagivensound.Theletterccanhavethesamesoundastheletterk,asincake,orastheletters,asincitrus.Inaddition,peoplewhospeakthesamelanguagedonotusethesamesounds,i.e.languagesvaryintheir\o"Phonology"phonology,orpatternsofsoundorganization.Therearedifferent\o"Accent(linguistics(notyetwritten)"accents--theword'water'couldbepronouncedwatter,wadder,woader,wattah,andsoon.Eachpersonhasadistinctive\o"Pitch(linguistics)(notyetwritten)"pitchwhentheyspeak--mentypicallyhavingthelowestpitch,womenandchildrenhaveahigherpitch(thoughthereiswidevariationandoverlapwithineachgroup.)\o"Pronunciation(notyetwritten)"Pronunciationisalsocoloredbyadjacentsounds,thespeedatwhichtheuseristalking,andevenbytheuser'shealth.Considerhowpronunciationchangeswhenapersonhasacold.Lastly,considerthatnotallsoundsconsistofmeaningfulspeech.Regularspeechisfilledwithinterjectionsthatdonothavemeaninginthemselves,butservetobreakup\o"Discourse(notyetwritten)"discourseandconveysubtleinformationaboutthespeaker'sfeelingsorintentions:Oh,like,youknow,well.Therearealsosoundsthatareapartofspeechthatarenotconsideredwords:er,um,uh.Coughing,sneezing,laughing,sobbing,andevenhiccuppingcanbeapartofwhatisspoken.Andtheenvironmentaddsitsownnoises;speechrecognitionisdifficultevenforhumansinnoisyplaces.HistoryofSpeechRecognitionDespitethemanifolddifficulties,speechrecognitionhasbeenattemptedforalmostaslongastherehavebeendigitalcomputers.Asearlyas1952,researchersatBellLabshaddevelopedanAutomaticDigitRecognizer,or"Audrey".Audreyattainedanaccuracyof97to99percentifthespeakerwasmale,andifthespeakerpaused350millisecondsbetweenwords,andifthespeakerlimitedhisvocabularytothedigitsfromonetonine,plus"oh",andifthemachinecouldbeadjustedtothespeaker'sspeechprofile.Resultsdippedaslowas60percentiftherecognizerwasnotadjusted.Audreyworkedbyrecognizing\o"Phoneme"phonemes,orindividualsoundsthatwereconsidereddistinctfromeachother.Thephonemeswerecorrelatedtoreferencemodelsofphonemesthatweregeneratedbytrainingtherecognizer.Overthenexttwodecades,researchersspentlargeamountsoftimeandmoneytryingtoimproveuponthisconcept,withlittlesuccess.Computerhardwareimprovedbyleapsandbounds,speechsynthesisimprovedsteadily,and\o"NoamChomsky"NoamChomsky'sideaof\o"Generativegrammar"generativegrammarsuggestedthatlanguagecouldbeanalyzedprogrammatically.Noneofthis,however,seemedtoimprovethestateoftheartinspeechrecognition.ChomskyandHalle'sgenerativeworkinphonologyalsoledmainstreamlinguisticstoabandontheconceptofthe"phoneme"altogether,infavourofbreakingdownthesoundpatternsoflanguageintosmaller,morediscrete"features".\o""In1969,\o"JohnR.Pierce(notyetwritten)"JohnR.PiercewroteaforthrightlettertotheJournaloftheAcousticalSocietyofAmerica,wheremuchoftheresearchonspeechrecognitionwaspublished.Piercewasoneofthepioneersinsatellitecommunications,andanexecutivevicepresidentatBellLabs,whichwasaleaderinspeechrecognitionresearch.Piercesaideveryoneinvolvedwaswastingtimeandmoney.Itwouldbetoosimpletosaythatworkinspeechrecognitioniscarriedoutsimplybecauseonecangetmoneyforit....Theattractionisperhapssimilartotheattractionofschemesforturningwaterintogasoline,extractinggoldfromthesea,curingcancer,orgoingtothemoon.Onedoesn'tattractthoughtlesslygivendollarsbymeansofschemesforcuttingthecostofsoapby10%.Tosellsuckers,oneusesdeceitandoffersglamor.Pierce's1969lettermarkedtheendofofficialresearchatBellLabsfornearlyadecade.ThedefenseresearchagencyARPA,however,chosetopersevere.In1971theysponsoredaresearchinitiativetodevelopaspeechrecognizerthatcouldhandleatleast1,000wordsandunderstandconnectedspeech,i.e.,speechwithoutclearpausesbetweeneachword.Therecognizercouldassumealow-background-noiseenvironment,anditdidnotneedtoworkinrealtime.By1976,threecontractorshaddevelopedsixsystems.Themostsuccessfulsystem,developedbyCarnegieMellonUniversity,wascalledHarpy.Harpywasslow—afour-secondsentencewouldhavetakenmorethanfiveminutestoprocess.Italsostillrequiredspeakersto'train'itbyspeakingsentencestobuildupareferencemodel.Nonetheless,itdidrecognizeathousand-wordvocabulary,anditdidsupportconnectedspeech.Researchcontinuedonseveralpaths,butHarpywasthemodelforfuturesuccess.ItusedhiddenMarkovmodelsandstatisticalmodelingtoextractmeaningfromspeech.Inessence,speechwasbrokenupintooverlappingsmallchunksofsound,andprobabilisticmodelsinferredthemostlikelywordsorpartsofwordsineachchunk,andthenthesamemodelwasappliedagaintotheaggregateoftheoverlappingchunks.Theprocedureiscomputationallyintensive,butithasproventobethemostsuccessful.Throughoutthe1970sand1980sresearchcontinued.Bythe1980s,mostresearcherswereusinghiddenMarkovmodels,whicharebehindallcontemporaryspeechrecognizers.Inthelatterpartofthe1980sandinthe1990s,DARPA(therenamedARPA)fundedseveralinitiatives.Thefirstinitiativewassimilartothepreviouschallenge:therequirementwasstillaone-thousandwordvocabulary,butthistimearigorousperformancestandardwasdevised.Thisinitiativeproducedsystemsthatloweredtheworderrorratefromtenpercenttoafewpercent.Additionalinitiativeshavefocusedonimprovingalgorithmsandimprovingcomputationalefficiency.In2001,\o"Microsoft"MicrosoftreleasedaspeechrecognitionsystemthatworkedwithOfficeXP.Itneatlyencapsulatedhowfarthetechnologyhadcomeinfiftyyears,andwhatthelimitationsstillwere.Thesystemhadtobetrainedtoaspecificuser'svoice,usingtheworksofgreatauthorsthatwereprovided,suchasEdgarAllenPoe'sFalloftheHouseofUsher,andBillGates'TheWayForward.Evenaftertraining,thesystemwasfragileenoughthatawarningwasprovided,"IfyouchangetheroominwhichyouuseMicrosoftSpeechRecognitionandyouraccuracydrops,runtheMicrophoneWizardagain."Ontheplusside,thesystemdidworkinrealtime,anditdidrecognizeconnectedspeech.SpeechRecognitionTodayTechnologyCurrentvoicerecognitiontechnologiesworkontheabilitytomathematicallyanalyzethesoundwavesformedbyourvoicesthroughresonanceandspectrumanalysis.Computersystemsfirstrecordthesoundwavesspokenintoamicrophonethroughadigitaltoanalogconverter.Theanalogorcontinuoussoundwavethatweproducewhenwesayawordisslicedupintosmalltimefragments.Thesefragmentsarethenmeasuredbasedontheiramplitudelevels,thelevelofcompressionofairreleasedfromaperson’smouth.TomeasuretheamplitudesandconvertasoundwavetodigitalformattheindustryhascommonlyusedtheNyquist-ShannonTheorem.Nyquist-ShannonTheorem

TheNyquist–Shannontheoremwasdevelopedin1928toshowthatagivenanalogfrequencyismostaccuratelyrecreatedbyadigitalfrequencythatistwicetheoriginalanalogfrequency.Nyquistprovedthiswastruebecauseanaudiblefrequencymustbesampledonceforcompressionandonceforrarefaction.Forexample,a20kHzaudiosignalcanbeaccuratelyrepresentedasadigitalsampleat44.1kHz.HowitWorks

Commonlyspeechrecognitionprogramsusestatisticalmodelstoaccountforvariationsindialect,accent,backgroundnoise,andpronunciation.Thesemodelshaveprogressedtosuchanextentthatinaquietenvironmentaccuracyofover90%canbeachieved.Whileeverycompanyhastheirownproprietarytechnologyforthewayaspokeninputisprocessedthereexists4commonthemesabouthowspeechisrecognized.1.Template-Based:Thismodelusesadatabaseofspeechpatternsbuiltintotheprogram.Afterreceivingvoiceinputintothesystemrecognitionoccursbymatchingtheinputtothedatabase.TodothistheprogramusesDynamicProgrammingalgorithms.Thedownfallofthistypeofspeechrecognitionistheinabilityfortherecognitionmodeltobeflexibleenoughtounderstandvoicepatternsunlikethoseinthedatabase.2.Knowledge-Based:Knowledge-basedspeechrecognitionanalyzesthespectrogramsofthespeechtogatherdataandcreaterulesthatreturnvaluesequalingwhatcommandsorwordstheusersaid.Knowledge-Basedrecognitiondoesnotmakeuseoflinguisticorphoneticknowledgeaboutspeech.3.Stochastic:Stochasticspeechrecognitionisthemostcommontoday.Stochasticmethodsofvoiceanalysismakeuseofprobabilitymodelstomodeltheuncertaintyofthespokeninput.ThemostpopularprobabilitymodelisuseofHMM(HiddenMarkovModel)isshownbelow.Ytistheobservedacousticdata,p(W)isthea-prioriprobabilityofaparticularwordstring,p(Yt|W)istheprobabilityoftheobservedacousticdatagiventheacousticmodels,andWisthehypothesisedwordstring.WhenanalyzingthespokeninputtheHMMhasproventobesuccessfulbecausethealgorithmtakesintoaccountalanguagemodel,anacousticmodelofhowhumansspeak,andalexiconofknownwords.4.Connectionist:WithConnectionistspeechrecognitionknowledgeaboutaspokeninputisgainedbyanalyzingtheinputandstoringitinavarietyofwaysfromsimplemulti-layerperceptronstotimedelayneuralnetstorecurrentneuralnets.Asstatedabove,programsthatutilizestochasticmodelstoanalyzespokenlanguagearemostcommontodayandhaveproventobethemostsuccessful.RecognizingCommands

Themostimportantgoalofcurrentspeechrecognitionsoftwareistorecognizecommands.Thisincreasesthefunctionalityofspeechsoftware.SoftwaresuchasMicrosostSyncisbuiltintomanynewvehicles,supposedlyallowinguserstoaccessallofthecar’selectronicaccessories,hands-free.Thissoftwareisadaptive.Itaskstheuseraseriesofquestionsandutilizesthepronunciationofcommonlyusedwordstoderivespeechconstants.Theseconstantsarethenfactoredintothespeechrecognitionalgorithms,allowingtheapplicationtoprovidebetterrecognitioninthefuture.Currenttechreviewershavesaidthetechnologyismuchimprovedfromtheearly1990’sbutwillnotbereplacinghandcontrolsanytimesoon.Dictation

Secondtocommandrecognitionisdictation.Today'smarketseesvalueindictationsoftwareasdiscussedbelowintranscriptionofmedicalrecords,orpapersforstudents,andasamoreproductivewaytogetone'sthoughtsdownawrittenword.Inadditionmanycompaniesseevalueindictationfortheprocessoftranslation,inthatuserscouldhavetheirwordstranslatedforwrittenletters,ortranslatedsotheusercouldthensaythewordbacktoanotherpartyintheirnativelanguage.Productsofthesetypesalreadyexistinthemarkettoday.ErrorsinInterpretingtheSpokenWord

Asspeechrecognitionprogramsprocessyourspokenwordstheirsuccessrateisbasedontheirabilitytominimizeerrors.ThescaleonwhichtheycandothisiscalledSingleWordErrorRate(SWER)andCommandSuccessRate(CSR).ASingleWordErrorissimplyput,amisunderstandingofonewordinaspokensentence.WhileSWERscanbefoundinCommandRecognitionPrograms,theyaremostcommonlyfoundindictationsoftware.CommandSuccessRateisdefinedbyanaccurateinterpretationofthespokencommand.Allwordsinacommandstatementmaynotbecorrectlyinterpreted,buttherecognitionprogramisabletousemathematicalmodelstodeducethecommandtheuserwantstoexecute.BusinessMajorSpeechTechnologyCompaniesAsthespeechtechnologyindustrygrows,morecompaniesemergeintothisfieldbringwiththemnewproductsandideas.Someoftheleadersinvoicerecognitiontechnologies(butbynomeansallofthem)arelistedbelow.NICESystems(NASDAQ:NICEandTelAviv:Nice),headquarteredin\o"Israel"Israelandfoundedin1986,specializeindigitalrecordingandarchivingtechnologies.In2007theymade$523millioninrevenuein2007.Formoreinformationvisit\o"".VerintSystemsInc.(OTC:VRNT),headquarteredinMelville,NewYorkandfoundedin1994self-definethemselvesas“Aleadingproviderofactionableintelligencesolutionsforworkforceoptimization,IPvideo,communicationsinterception,andpublicsafety.”\o""[9]Formoreinformationvisit\o"".Nuance(NASDAQ:NUAN)headquarteredinBurlington,developsspeechandimagetechnologiesforbusinessandcustomerserviceuses.Formoreinformationvisit\o"/"/.Vlingo,headquarteredinCambridge,MA,developsspeechrecognitiontechnologythatinterfaceswithwireless/mobiletechnologies.VlingohasrecentlyteamedupwithYahoo!providingthespeechrecognitiontechnologyforYahoo!’smobilesearchservice,\o"OneSearch(notyetwritten)"oneSearch.Formoreinformationvisit\o""OthermajorcompaniesinvolvedinSpeechTechnologiesinclude:\o""Unisys,\o"/"ChaCha,\o"/"SpeechCycle,\o""Sensory,Microsoft

人人文庫> 全部分類> 行業(yè)資料 > 機(jī)電工程

溫馨提示

1. 本站所有資源如無特殊說明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽，若沒有圖紙預(yù)覽就沒有圖紙。
4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
5. 人人文庫網(wǎng)僅提供信息存儲空間，僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理，對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯，并不能對任何下載內(nèi)容負(fù)責(zé)。
6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容，請與我們聯(lián)系，我們立即糾正。
7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

語音信號識別及處理中英文翻譯文獻(xiàn)綜述

文檔簡介

溫馨提示

最新文檔

評論

相關(guān)文檔