The Open Language Archives Community (OLAC) and Metadata Standards for Language Archives
黃居仁 (Chu-Ren Huang), 張如瑩 (Ru-Yng Chang)

Outline
  • Introduction to OLAC
  • Dublin Core & OAI
  • OLAC Standards
  • OLAC Metadata Set
  • OLAC and Asian Languages
  • Examples
  • Some Related Web Sites
  • OLAC Launch

The Open Language Archives Community

OLAC Aims
OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:
  • developing consensus on best current practice for the digital archiving of language resources;
  • developing a network of interoperating repositories and services for housing and accessing such resources.

OLAC Organization
  • Coordinators: Steven Bird & Gary Simons
  • Advisory Board: Helen Aristar Dry, Susan Hockey, Chu-Ren Huang, Mark Liberman, Brian MacWhinney, Michael Nelson, Nicholas Ostler, Henry Thompson, Hans Uszkoreit, Antonio Zampolli
  • Participating Archives & Services: LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht
  • Prospective Participants: ASEDA, Academia Sinica, AISRI, INALF, LCAAJ, Linguist, MPI, NAA, OTA, Rosetta, Tibetan Digital Library (UVA)
  • Individual Members: 120

Introduction to OLAC
  • Many communities need language resources: linguists, engineers, teachers, and speakers.
  • Many communities provide pieces of the infrastructure: archivists, software developers, and publishers.
  • An unprecedented opportunity:
      • Extensible Markup Language (XML) and Unicode offer flexible, structured representation and long-term storage of data.
      • Digital publication, online or offline, is an effective and practical way to share language resources.
      • The Dublin Core metadata set (a standard module for classifying resources), together with the exchange method provided by the Open Archives Initiative, makes it possible to build a framework that spans many repositories and archives.

The Vision for an Open Language Archives Community
  • Users search and view OLAC metadata fields through the web site of an OLAC service provider.

The Vision for an Open Language Archives Community #2
In principle, users can obtain any resource they need:
  • DATA: any information that describes a language. Survey result: 25% is digitized, but without a shared set of metadata fields.
  • TOOLS: computational resources that help to create, view, query, or otherwise use language data.
  • ADVICE: Which resources are reliable? Which tools suit a given situation? How should new data be created?

The Vision for an Open Language Archives Community #3
In practice, users cannot get the resources they want:
  • Names differ from one web site to another, which causes low recall.
  • The same name carries a different meaning in other domains, which causes low precision.
  • Is the appropriate software being used, and how should the value of ADVICE be judged?
  • Many language resources are not text-based.
  • Language resources are scattered across different web sites.

The Vision for an Open Language Archives Community: Bridging the gap through community infrastructure
  • Gateway: a single portal through which users reach data, tools, and advice.
  • Metadata: a uniform description of data, tools, and advice, including a link to every item and an explanation of how to access it.
  • Review: browsable evaluations of data, tools, and advice.
  • Standards: the foundation for the processes and protocols above, for example the metadata schema and the harvesting protocol.
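The recall and precision problems noted in Vision #3 can be made concrete with a small sketch. It is illustrative only: the four catalogue records, their ids, the `code` field, and the value `x-yemba` are invented for this example (the Yemba/Dschang name pair is borrowed from the metadata examples later in the slides). It compares a search on free-text names with a search on a shared identifier of the kind that uniform metadata would provide.

```python
# Hypothetical catalogue entries; ids, the "code" values, and the unrelated
# fourth record are invented for illustration only.
RECORDS = [
    {"id": "r1", "name": "Yemba",   "code": "x-yemba"},
    {"id": "r2", "name": "Dschang", "code": "x-yemba"},  # same language, other name
    {"id": "r3", "name": "Dschang", "code": "x-yemba"},
    {"id": "r4", "name": "Yemba",   "code": None},       # same string, unrelated item
]

# The records a user looking for this language actually wants.
RELEVANT = {"r1", "r2", "r3"}


def search_by_name(name):
    """Return ids of records whose free-text name matches exactly."""
    return {r["id"] for r in RECORDS if r["name"] == name}


def search_by_code(code):
    """Return ids of records that carry the given shared identifier."""
    return {r["id"] for r in RECORDS if r["code"] == code}


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


for label, result in [
    ("name search", search_by_name("Yemba")),
    ("code search", search_by_code("x-yemba")),
]:
    p, r = precision_recall(result, RELEVANT)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
# name search: precision=0.50 recall=0.33
# code search: precision=1.00 recall=1.00
```

The name search misses the two records filed under the alternate name (lower recall) and picks up the unrelated record that happens to share the string (lower precision); the shared identifier avoids both in this toy data.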

The Vision for an Open Language Archives Community: Summary: Seven Layers Complete the Bridge
[Figure: seven-layer diagram. Surviving labels: CONVERT, CREATE, CREATE, EXPORT, DELIVER, FORMAT; OAI; CONTENT, METADATA, OLAC REPOSITORIES, OLAC SERVICES, USER SERVICES; OLAC PROC, OLAC MHP, OAI, MS, DC; Software, Recommendations, Initiatives, Standards]

Dublin Core Metadata Initiative
  • Grew out of a 1995 meeting on resource discovery on the web.
  • Dublin Core metadata elements: a broad, cross-disciplinary core set of elements that supports resource discovery and applies to the description of any resource, whether digital or in traditional form.
  • Comprises fifteen elements, all optional and repeatable: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, and Rights.
  • 2002/01/07: expressed in RDF/XML: http:/dublin/documents/2001/11/28/dcmes-xml/

The Open Archives Initiative #1
  • Founded in October 1999 as a general framework for archives of electronic preprints, applicable to digital repositories of any kind of scholarly material.
  • The OAI infrastructure requires two standards:
      • OAI Shared Metadata Set (Dublin Core): makes interoperation across repositories easy.
      • OAI Metadata Harvesting Protocol: lets software query repositories over HTTP.

The Open Archives Initiative #2
The Relationship Between an OAI Repository and an Archive

Applying the OAI to Language Resources
OAI features:
  • Searches every data provider through a single interface, on the basis of metadata.
  • Combines the distributed, bottom-up character of the web with the structured nature of a centralized database, which suits users who need access to rapidly growing resources and to large volumes of user-oriented resource descriptions.
  • Supports metadata that extends Dublin Core.
  • Gathers meta-archives in one place so that users can search many archives at once.
[Figure labels: OAI Archive; OAI Service Provider]

The Open Language Archives Community
  • Founded in December 2000 at the Workshop on Web-Based Language Documentation and Description by linguists and software developers from North America, South America, Europe, Africa, the Middle East, Asia, and Australia.
  • OLAC gateway:/

Foundation: OLAC & OAI
Recall: OAI data providers must support:
  • Dublin Core Metadata
  • OAI Metadata Harvesting Protocol
BUT: OAI data providers can support:
  • a more specialized metadata format
  • a more specialized harvesting protocol
What OLAC does:
  • specialized metadata for language resources
  • specialized harvesting (extra validation)

OLAC Standards
Aside:
  • standards = the protocols and interfaces that allow the community to function
  • recommendations = standards for representing linguistic content
OLAC has three primary standards:
  • OLACMS: the OLAC Metadata Set (Qualified DC)
  • OLAC MHP: refinements to the OAI protocol
  • OLAC Process: a procedure for identifying Best Common Practice Recommendations

OLAC Metadata Set #1
  • Based on the 15 Dublin Core elements, which are further organized and defined.
  • Element qualification follows DC-Q, as illustrated by DCQ-HTML.
  • Records can be encoded and validated with an XML DTD or Schema.
  • Latest OLAC XML Schema: /OLAC/0.4/olac.xsd
  • Example: /OLAC/0.4/olac.xml

The OLAC Metadata Set #2
The three categories of metadata:
  • Work language: describes information entities and their intellectual attributes, e.g. names of works and their creators
  • Document language: describes and provides access to the physical manifestation of information, e.g. format, publisher, date, rights
  • Subject language: describes what a document is about, e.g. subject, description

OLAC Metadata Set #3
  • refine: a narrower or more specific sense of the element.
  • code: an encoding scheme that precisely controls the metadata value.
  • scheme: the standardized name of the convention that governs the text of the element content.
  • lang: the language in which the element content is written.
  • langs: an attribute belonging to the element that specifies the languages in which the metadata is to be read.
[Figure: an element with the attributes refine, code, scheme, and lang, each labeled "Controlled Vocabulary"; legend: element, attributes, controlled vocabulary; sample element content: Smith]

OLAC Metadata Set #3
Each element is documented under the following headings:
  • Name: the formal name of the tag.
  • Definition: a one-line statement of how the element is used.
  • Comments: a detailed description of how to use the element, including how DCMS and OLAC use it.
  • Attributes: the attributes of the element in XML.
  • Examples: examples of use.
Every element is repeatable.

OLAC Metadata Set: Language #1
  • Name: Audience Language
  • Definition: the language in which the content of the resource is written.
  • Comments: the language the creator uses so that the audience can understand the work. Compare Subject.language. Examples: the language of a literary work or monolingual document, a special language used to assist a speaker, the language of a sound recording, the language in which a syntactic description is written, and the language of annotations or of the definitions in a bilingual dictionary. The annotated text and the words being defined in a bilingual dictionary are marked with Subject.language instead.
  • Attributes: code: for the controlled vocabulary, see OLAC-Language. When the controlled vocabulary is insufficient, or the preferred term differs from it, describe the language in the element content.

OLAC Metadata Set: Language #2
Examples
  • A resource in English about the Sikaiana language:
  • A Yemba-French dictionary, where the alternate name Dschang is preferred: Dschang
  • The American Heritage Dictionary, which is both in and about American English:
  • A resource about a language for which the controlled vocabulary does not yet provide a code: Ancient Sumerian

OLAC and Asian Languages
Two Issues
  • Language Identification: Is the current OLAC/Ethnologue vocabulary rich enough to describe all Asian languages?
  • Multilingual Resources: Are the current OLACMS and Process comprehensive enough to describe multilingual resources?

Language Identification
  • The DC two-letter code (e.g. en for English) is not enough to describe all the languages in the world.
  • Ethnologue () is currently the most comprehensive description of the world's languages.
  • Potential problems of using Ethnologue (or any existing language list):
      • over-splitting
      • over-chunking
      • omission

Solution LI Problems #1
Use controlled vocabulary for elaboration:
  • Northern/Takituduh
  • Northern/Takibakha
  • Central/Takbanuaz
  • Central/Takivatan
  • Southern/Isbukun

Solution LI Problems #2
Registering language groups with an OLAC registration service:
  • An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes).
  • AS:Amis = ALV, AIS

Multilingual Resources #1
  • Directionality is crucial in multilingual resources.
  • However, OLAC metadata is flat and unordered.
  • In MT systems: information is lost, but what remains is sufficient for resource harvesting.
  • Bi-directional MT

Multilingual Resources #2
  • One-to-many MT:
  • Many-to-one MT:

Multilingual Resources #3
  • Text: language
  • Bitext (bilingual aligned corpus)
  • There is always a directionality
  • Original-language Tra
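The point made under Multilingual Resources #1, that flat, unordered metadata loses translation direction yet still suffices for harvesting, can be sketched as follows. This is a minimal illustration and not OLAC's data model: the classes `DirectedRecord` and `FlatRecord`, the `harvest` helper, the resource title, and the use of the two-letter codes `en` and `zh` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlatRecord:
    """Flat description: an unordered set of languages (hypothetical model)."""
    title: str
    languages: frozenset


@dataclass(frozen=True)
class DirectedRecord:
    """Description that keeps the translation direction (hypothetical model)."""
    title: str
    source: str
    target: str

    def flatten(self):
        # Collapsing source and target into one set discards the direction.
        return FlatRecord(self.title, frozenset({self.source, self.target}))


def harvest(records, language):
    """Return the flat records that mention the given language at all."""
    return [r for r in records if language in r.languages]


en_to_zh = DirectedRecord("MT system", source="en", target="zh")
zh_to_en = DirectedRecord("MT system", source="zh", target="en")

# With direction kept, the two systems are distinguishable.
print(en_to_zh == zh_to_en)                      # False

# Once flattened, they become identical: the direction is lost.
print(en_to_zh.flatten() == zh_to_en.flatten())  # True

# Harvesting by language still finds both resources.
flat = [en_to_zh.flatten(), zh_to_en.flatten()]
print(len(harvest(flat, "zh")))                  # 2
```

A service that only needs to locate resources involving a language can work from the flat form; a user who needs a system translating in one particular direction cannot tell the two records apart without further information.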
