Lecture Notes: Hive and the Data Warehouse (Data Warehouse, abbreviated DW or DWH)
Data warehouse: the English name is Data Warehouse, which can be abbreviated DW or DWH. The purpose of a data warehouse is to build an analysis-oriented, integrated data environment that provides decision support for the enterprise. A data warehouse can be understood as a subject-oriented (Subject-Oriented), integrated (Integrated), non-volatile (Non-Volatile), and time-variant (Time-Variant) collection of data used to support management decisions.

The difference between a database and a data warehouse is really the difference between OLTP (On-Line Transaction Processing), also called operational processing, which handles day-to-day transactions, and OLAP (On-Line Analytical Processing), analytical processing, which generally analyzes the historical data of certain subjects to support management decisions. A database is designed to avoid redundancy as far as possible and is usually designed for a single business application; for example, a simple User table recording the user name, password, and other simple fields suits the business application, but it does not suit analysis. A data warehouse deliberately introduces redundancy in its design, and is designed according to the analysis requirements, the analysis dimensions, and the analysis metrics.

Following the flow of data in and out, the data warehouse architecture can be divided into three layers: the source data layer, the data warehouse layer, and the data application layer.

Source data layer: data in this layer is taken over from the peripheral systems without modification; it is a temporary staging area that prepares the data for the next processing step.

Data warehouse layer

Also called the detail layer. The data in the DW layer should be consistent, accurate, and clean data, that is, source system data after cleansing (with the impurities removed).
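As a minimal sketch, the three layers can be modeled in Hive simply as separate databases (the names ods, dw, and app are illustrative; the original notes do not prescribe them):

    -- Source data layer: raw data landed as-is
    create database if not exists ods;
    -- Data warehouse layer: cleansed, integrated detail data
    create database if not exists dw;
    -- Data application layer: aggregates consumed by reports
    create database if not exists app;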

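And a sketch of the cleansing step that feeds the DW layer (the table names ods.web_log and dw.web_log, their columns, and the filter conditions are hypothetical, purely for illustration):

    -- Promote source rows into the DW layer, dropping obviously bad records
    insert overwrite table dw.web_log
    select distinct ip, url, session_id          -- de-duplicate rows
    from ods.web_log
    where ip is not null and url is not null;    -- discard incomplete rows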
ETL is the process of extracting (Extract), transforming (Transform), and loading (Load) data; ETL is the pipeline of the data warehouse. Metadata records the state of the data in the warehouse and the run state of the ETL tasks, and is usually managed centrally through a metadata repository (Metadata Repository). Metadata can be divided into technical metadata and business metadata; technical metadata is used by the IT staff who develop and manage the data warehouse.

Hive user interfaces: Hive offers three access interfaces: CLI, JDBC/ODBC, and WebGUI. The CLI (command line interface) is the shell command line; JDBC/ODBC is Hive's JAVA implementation, similar to a traditional database's JDBC; the WebGUI accesses Hive through a browser.

Metadata storage: the metastore is usually kept in a relational database such as mysql or derby. Hive stores its metadata in that database. Hive metadata includes the table names, each table's columns and partitions and their attributes, the table attributes (whether it is an external table, and so on), and the directory where each table's data lives.

Interpreter, compiler, optimizer, and executor: together they take an HQL query through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and subsequently executed by MapReduce calls.

Relationship between Hive and Hadoop: Hive stores its data in HDFS and executes its queries as MapReduce jobs on Hadoop.

Hive installation

Here we use hive version 2.1.1 (apache-hive-2.1.1). After downloading, upload the installation package to /export/softwares on the third machine and unpack it:

    cd /export/softwares
    tar -zxvf apache-hive-2.1.1-bin.tar.gz -C /export/servers/

Step 2: install mysql:

    yum install mysql mysql-server mysql-devel
    /etc/init.d/mysqld start
    mysql> grant all privileges on *.* to 'root'@'%' identified by '123456';
    mysql> flush privileges;

Modify hive-env.sh:

    cd /export/servers/apache-hive-2.1.1/conf
    cp hive-env.sh.template hive-env.sh

and set:

    HADOOP_HOME=/export/servers/hadoop
    export HIVE_CONF_DIR=/export/servers/apache-hive-2.1.1/conf

Modify hive-site.xml:

    cd /export/servers/apache-hive-2.1.1/conf
    vim hive-site.xml

The file begins with the XML prolog:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

and its <configuration> element carries the metastore connection settings for mysql (the javax.jdo.option.ConnectionURL, ConnectionDriverName, ConnectionUserName, and ConnectionPassword properties).

Copy the prepared mysql-connector-java-5.1.38.jar into Hive's lib directory, so Hive can reach the mysql metastore.

Configure hive's environment variables:

    sudo vim /etc/profile
    export HIVE_HOME=/export/servers/apache-hive-2.1.1
    export PATH=$HIVE_HOME/bin:$PATH

Using Hive

Start the Hive client and create a database:

    cd /export/servers/apache-hive-2.1.1
    bin/hive
    create database if not exists mytest;

Run HQL directly with hive -e:

    cd /export/servers/apache-hive-2.1.1
    bin/hive -e "create database if not exists mytest;"

Run an HQL script with hive -f. Write the statements into a file, for example hive.sql:

    create database if not exists myhive;
    use myhive;
    create table stu(id int, name string);

then execute the script through hive -f:

    bin/hive -f hive.sql

Hive: creating databases

Create a database:

    create database if not exists myhive;

Create a database and specify its location in HDFS:

    create database myhive2 location '/myhive2';

Create a database with dbproperties, and modify them later (the key/value pairs here are placeholders):

    create database foo with dbproperties ('key1'='value1');
    describe database extended foo;
    alter database foo set dbproperties ('key1'='value2');

View database details:

    desc database extended myhive;

Drop a database:

    drop database myhive;

Create table syntax:

    create [external] table [if not exists] table_name (
      col_name data_type [comment 'column description'],
      col_name data_type [comment 'column description'],
      ...
    )
    [comment 'table description']
    [partitioned by (col_name data_type, ...)]
    [clustered by (col_name, ...)
      [sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
    [row format row_format]
    [stored as file_format]
    [location 'path of the table']

Explanation of the clauses:

create table: creates a table with the given name. If a table of the same name already exists, an exception is thrown; the IF NOT EXISTS option can be used to ignore the exception.

external: lets you create an external table. While creating the table you point it at the path of the actual data (LOCATION). When Hive creates an internal table, it moves the data into the path the data warehouse points to; when it creates an external table, it only records the path where the data lives and does not move the data.

comment: a comment for the table or a column; by default Chinese cannot be used.

partitioned by: creates a partitioned table; each partition is a subdirectory under the table directory.

clustered by: buckets. Hive can further organize tables or partitions into buckets, which are a finer-grained division of the data range. Hive buckets by a particular column: it hashes the column value and takes the remainder modulo the number of buckets to decide where each record goes.

sorted by: sorts the data inside each bucket.

stored as: specifies the file format of the table; common formats are SEQUENCEFILE, TEXTFILE, and RCFILE. If the file data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.

If the external keyword is not used when creating a table, the table is an internal table (managed table).

Hive data types:

    Type       Description
    TINYINT    1-byte signed integer
    SMALLINT   2-byte signed integer
    INT        4-byte signed integer
    BIGINT     8-byte signed integer
    FLOAT      4-byte single-precision floating point
    DOUBLE     8-byte double-precision floating point
    STRING     character string
    DATE       date, e.g. '2016-03-29'
    MAP        key-value pairs; the key must be a primitive type, the value can be any type
    STRUCT     a collection of fields whose types may differ
    ARRAY      an ordered collection of fields, all of the same type

Getting started with tables:

    use myhive;
    create table stu(id int, name string);
    insert into stu values (1, "zhangsan");
    select * from stu;

Create a table and specify the field separator:

    create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t';

Create a table and specify its location in HDFS:

    create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' location '/user/stu2';

Create a table from query results (the data is copied along):

    create table stu3 as select * from stu2;

Create a table from an existing table structure (structure only, no data):

    create table stu4 like stu2;

Query a table's structure:

    desc formatted stu2;
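The MAP, STRUCT, and ARRAY types from the type table above usually need extra delimiters in the row format. A minimal sketch, assuming tab-separated fields (the table name, columns, and delimiters are illustrative, not from the original notes):

    create table if not exists person (
      name    string,
      friends array<string>,                      -- e.g. lisi,wangwu
      scores  map<string, int>,                   -- e.g. math:90,english:80
      address struct<city:string, zip:string>
    )
    row format delimited fields terminated by '\t'
    collection items terminated by ','
    map keys terminated by ':';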
Drop a table:

    drop table stu4;

External tables in practice: the collected logs regularly flow into HDFS text files every day, and extensive statistical analysis is done on top of the external table (the raw log table).

Create external tables:

    create external table teacher (t_id string, t_name string) row format delimited fields terminated by '\t';
    create external table student (s_id string, s_name string, s_birth string, s_sex string) row format delimited fields terminated by '\t';

Load data from the local file system into a table:

    load data local inpath '/export/servers/hivedatas/student.csv' into table student;

Load data and overwrite the existing data:

    load data local inpath '/export/servers/hivedatas/student.csv' overwrite into table student;

Load data from HDFS (first upload the file to HDFS):

    cd /export/servers/hivedatas
    hdfs dfs -mkdir -p /hivedatas
    hdfs dfs -put techer.csv /hivedatas/
    load data inpath '/hivedatas/techer.csv' into table teacher;

Partitioned tables: a large file can be split along some rule into several small files, so that each operation only touches a small file. Hive supports the same idea through partitioned tables; both single-partition and multi-partition tables are supported, and partitions can be queried together with union all.

Create a partitioned table with one partition column:

    create table score (s_id string, c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a partitioned table with multiple partition columns:

    create table score2 (s_id string, c_id string, s_score int) partitioned by (year string, month string, day string) row format delimited fields terminated by '\t';

Load data into a partitioned table:

    load data local inpath '/export/servers/hivedatas/score.csv' into table score partition (month='201806');

Load data into a multi-partition table:

    load data local inpath '/export/servers/hivedatas/score.csv' into table score2 partition (year='2018', month='06', day='01');

Query partitions together with union all:

    select * from score where month = '201806' union all select * from score where month = '201806';

View a table's partitions:

    show partitions score;

Add a partition:

    alter table score add partition (month='201805');

Drop a partition:

    alter table score drop partition (month='201806');

Exercise: use an external partitioned table and establish the mapping between the table and its data files.

Prepare the data in HDFS:

    hdfs dfs -mkdir -p /scoredatas/month=201806
    hdfs dfs -put score.csv /scoredatas/month=201806/

Create the external partitioned table and specify its location:

    create external table score4 (s_id string, c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t' location '/scoredatas';

Establish the mapping between the table and the data files (repair the partition metadata):

    msck repair table score4;

Bucketed table operations

Bucketing divides the data into multiple files by a specified field; bucketing is the Hive counterpart of partitioning in MapReduce.

Enable Hive bucketing:

    set hive.enforce.bucketing=true;

Set the number of reduce tasks:

    set mapreduce.job.reduces=3;

Create a bucketed table:

    create table course (c_id string, c_name string, t_id string) clustered by (c_id) into 3 buckets row format delimited fields terminated by '\t';

Loading data into a bucketed table: neither hdfs dfs -put nor load data works for a bucketed table; the only way is insert overwrite. So create an ordinary table first, and load the bucketed table from it through a query:

    create table course_common (c_id string, c_name string, t_id string) row format delimited fields terminated by '\t';
    load data local inpath '/export/servers/hivedatas/course.csv' into table course_common;

Load the bucketed table through insert overwrite:

    insert overwrite table course select * from course_common cluster by (c_id);

Modifying table structure

Rename a table:

    alter table old_table_name rename to new_table_name;
    alter table score4 rename to score5;

Query the table structure:

    desc score5;

Add columns:

    alter table score5 add columns (mycol string, mysco string);

Change a column's definition:

    alter table score5 change column mysco mysconew int;

Drop the table:

    drop table score5;

1.8 hive: loading data through queries

Create a table by copying an existing structure, then insert data directly:

    create table score3 like score;
    insert into table score3 partition (month='201807') values ('001', '002', '100');

Load through load data (overwrite):

    load data local inpath '/export/servers/hivedatas/score.csv' overwrite into table score partition (month='201806');

Load through a query:

    create table score4 like score;
    insert overwrite table score4 partition (month='201806') select s_id, c_id, s_score from score;
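Hive can also feed several partitions (or tables) from a single scan of the source table with a multi-insert, which is cheaper than running one insert per partition. A sketch building on the tables above (the partition values are illustrative):

    -- One pass over score populates two partitions of score4
    from score
    insert overwrite table score4 partition (month='201806')
      select s_id, c_id, s_score where month = '201806'
    insert overwrite table score4 partition (month='201807')
      select s_id, c_id, s_score where month = '201807';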
Hive query syntax

    SELECT [ALL | DISTINCT] select_expr, select_expr, ...
    FROM table_reference
    [WHERE where_condition]
    [GROUP BY col_list [HAVING condition]]
    [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]]
    [LIMIT number]

order by performs a global sort over all the data, so there is only one reducer. sort by is not a global sort: it finishes sorting before the data enters each reducer, so if sort by is used and mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is ordered, not that the overall output is ordered. distribute by (field) distributes the data to different reducers according to the specified field, and the distribution algorithm is a hash. cluster by (field) has the function of distribute by and additionally sorts by that field.

cluster by = distribute by + sort by (on the same field).

Full-table query:

    select * from score;

Query specific columns:

    select s_id, c_id from score;

Column aliases: rename a column, which is convenient for calculations:

    select s_id as myid, c_id from score;

Total row count:

    select count(1) from score;

Maximum score:

    select max(s_score) from score;

Minimum score:

    select min(s_score) from score;

Sum of the scores:

    select sum(s_score) from score;

Average score:

    select avg(s_score) from score;

LIMIT clause, to restrict the number of rows returned:

    select * from score limit 3;

WHERE statements: the WHERE clause comes right after FROM. Query the rows whose score is greater than 60:

    select * from score where s_score > 60;

Comparison operators:

A [NOT] BETWEEN B AND C: if A, B, or C is NULL, the result is NULL. If A is greater than or equal to B and less than or equal to C, the result is TRUE, otherwise FALSE. With the NOT keyword the effect is reversed.

A IS NULL / A IS NOT NULL: tests whether the value A is NULL.

A LIKE B: B is a simple SQL wildcard pattern; if A matches it, the result is TRUE, otherwise FALSE. The pattern works as follows: 'x%' means A must start with the letter 'x'; '%x' means A must end with the letter 'x'; '%x%' means A contains the letter 'x'. With the NOT keyword the effect is reversed.

A RLIKE B: B is a Java regular expression; if A matches it, the result is TRUE.

Query the rows with score equal to 80:

    select * from score where s_score = 80;

Query the rows with score between 80 and 100:

    select * from score where s_score between 80 and 100;

Query the rows whose score is null:

    select * from score where s_score is null;

Query the rows whose score is 80 or 90:

    select * from score where s_score in (80, 90);

LIKE and RLIKE: the pattern can contain characters or numbers; % stands for zero or more characters (any number of characters), and _ stands for exactly one character.

Find all scores starting with 8:

    select * from score where s_score like '8%';

Find scores whose second digit is 9:

    select * from score where s_score like '_9%';

Find rows whose s_id contains 1:

    select * from score where s_id rlike '[1]';   -- same as like '%1%'

Logical operators (AND / OR / NOT). Query the rows with score greater than 80 and s_id equal to '01':

    select * from score where s_score > 80 and s_id = '01';

Score greater than 80, or s_id equal to '01':

    select * from score where s_score > 80 or s_id = '01';

s_id is neither '01' nor '02':

    select * from score where s_id not in ('01', '02');

GROUP BY statements

GROUP BY is usually used together with aggregate functions: it groups the result by one or more columns, then performs an aggregation on each group. Examples: compute each student's average score:

    select s_id, avg(s_score) from score group by s_id;

Compute each student's highest score:

    select s_id, max(s_score) from score group by s_id;

HAVING statements

having is only used with group by. Examples: compute each student's average score:

    select s_id, avg(s_score) from score group by s_id;

Keep only the students whose average score is greater than 85:

    select s_id, avg(s_score) avgscore from score group by s_id having avgscore > 85;

JOIN statements

Hive supports the usual SQL JOIN statements, but only equi-joins; non-equi joins are not supported. Example: query each score together with the student it belongs to:

    select s.s_id, s.s_score, stu.s_name, stu.s_birth from score s join student stu on s.s_id = stu.s_id;

Table aliases:

    select * from teacher t join course c on t.t_id = c.t_id;

Inner join: only the rows that match in both joined tables are kept:

    select * from teacher t inner join course c on t.t_id = c.t_id;

Left outer join: all rows of the table on the left of the JOIN operator that satisfy the WHERE clause are returned:

    select * from teacher t left join course c on t.t_id = c.t_id;

Right outer join: all rows of the table on the right of the JOIN operator that satisfy the WHERE clause are returned:

    select * from teacher t right join course c on t.t_id = c.t_id;

Multi-table joins:

    select * from teacher t
    left join course c on t.t_id = c.t_id
    left join score s on s.c_id = c.c_id
    left join student stu on s.s_id = stu.s_id;

In most cases Hive starts one MapReduce job for each pair of join objects: here it first starts a MapReduce job to join table teacher and table course, and then starts another MapReduce job to join the output of the first job with table score.

Global sorting (ORDER BY)

Order By: global ordering, with one reducer. ASC (ascend) is ascending, the default; DESC (descend) is descending. The ORDER BY clause goes at the end of the SELECT statement. Query student scores in descending order:

    select * from student s left join score sco on s.s_id = sco.s_id order by sco.s_score desc;

In ascending order:

    select * from student s left join score sco on s.s_id = sco.s_id order by sco.s_score asc;

Sort by an alias (each student's average score):

    select s_id, avg(s_score) avg from score group by s_id order by avg;

Sort by multiple columns:

    select s_id, avg(s_score) avg from score group by s_id order by s_id, avg;
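Because ORDER BY pushes every row through that single reducer, a common practice (an addition here, not from the original notes) is to pair it with LIMIT so the reducer only has to emit a handful of rows:

    -- Top 10 scores; the single reducer produces only 10 rows
    select * from score order by s_score desc limit 10;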
Per-reducer sorting (SORT BY): local ordering inside each reducer.

Set the number of reduce tasks:

    set mapreduce.job.reduces=3;

View the configured number of reduce tasks:

    set mapreduce.job.reduces;

Sort scores within each reducer:

    select * from score sort by s_score;

Write the query result into files (sorted by score):

    insert overwrite local directory '/export/servers/hivedatas/sort' select * from score sort by s_score;

Partitioned sorting (DISTRIBUTE BY)

Distribute By: similar to partition in MR, it partitions the data and is used together with sort by. Note that Hive requires the DISTRIBUTE BY clause to be written before the SORT BY clause. When testing distribute by, be sure to allocate several reducers for processing, otherwise the effect of distribute by cannot be seen:

    set mapreduce.job.reduces=7;

Partition by s_id and sort by score within each partition:

    insert overwrite local directory '/export/servers/hivedatas/sort' select * from score distribute by s_id sort by s_score;

CLUSTER BY

When the distribute by and sort by fields are the same, cluster by can be used instead. Besides the function of distribute by, cluster by also sorts by that field; the sort is ascending only, and no ASC or DESC rule can be specified. The following two statements are equivalent:

    select * from score cluster by s_id;
    select * from score distribute by s_id sort by s_id;

Hive Shell parameters

Hive command-line syntax:

    bin/hive [-hiveconf x=y]* [<-i filename>]* [<-f filename> | <-e query-string>] [-S]

    -i              initialize HQL from a file
    -f              execute an HQL script file
    -e              execute an HQL string
    -v              verbose: echo the executed HQL to the console
    -p              connect to Hive Server on the given port
    -hiveconf x=y   set hive/hadoop configuration variables for this hive run

Hive parameter configuration: the user-defined configuration file is HIVE_CONF_DIR/hive-site.xml; the default configuration file is HIVE_CONF_DIR/hive-default.xml. When starting Hive (in client or Server mode), parameters can also be set on the command line with -hiveconf:

    bin/hive -hiveconf hive.root.logger=INFO,console

Parameter precedence: parameters declared with set > command-line parameters > configuration-file parameters.

Hive functions: there is a lot of material here; see the Hive function documentation.

View the built-in functions:

    hive> show functions;

Show the description of a built-in function:

    hive> desc function upper;

Show detailed information about a built-in function:

    hive> desc function extended upper;

Commonly used functions:

    -- string concatenation: concat
    -- string concatenation with a separator: concat_ws
    -- type cast:
    select cast(1.5 as int);
    -- get_json_object: json parsing function, used to process json; the input must be in json format

Custom functions (UDF)

Overview: Hive comes with some built-in functions, such as max/min. When the built-in functions Hive provides cannot satisfy your business processing needs, you can implement a user-defined function (UDF). There are three kinds of custom functions:

UDF (User-Defined Function): one row in, one row out.
UDAF (User-Defined Aggregation Function): similar to count/max/min, many rows in, one row out.
UDTF (User-Defined Table-Generating Functions): one row in, many rows out.

Development essentials: extend the UDF class and implement an evaluate method.

Step 1: create a Maven project and add dependencies along these lines (from /artifact/org.apache.hive/hive-exec; versions chosen to match the environment above), plus the maven-compiler-plugin with UTF-8 encoding:

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>2.1.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.5</version>
    </dependency>

Step 2: develop the Java class, extending UDF (the class name here is illustrative):

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class MyUpperCase extends UDF {
        public Text evaluate(final Text str) {
            if (str != null && !str.toString().equals("")) {
                String tmp_str = str.toString();
                // Upper-case the first character, keep the rest unchanged
                String str_ret = tmp_str.substring(0, 1).toUpperCase() + tmp_str.substring(1);
                return new Text(str_ret);
            }
            return new Text("");
        }
    }

Step 3: package the project and upload the jar to hive's lib directory.

Step 4: rename the jar, then add it in the hive client:

    cd /export/servers/apache-hive-2.1.1/lib
    mv original-day_10_hive_udf-1.0-SNAPSHOT.jar udf.jar

In the hive client, add our jar:

    add jar /export/servers/apache-hive-2.1.1/lib/udf.jar;

Step 5: create a temporary function bound to the class (the package path is illustrative):

    create temporary function my_upper as 'com.example.udf.MyUpperCase';

Step 6: use the function:

    select my_upper('abc');

Data compression in Hive

In practical work, the data processed in hive generally needs to be compressed, as we saw earlier when learning hadoop. Compression formats supported by MapReduce:

    Compression format | Tool  | Algorithm | File extension | Splittable
    DEFLATE            | none  | DEFLATE   | .deflate       | no
    Gzip               | gzip  | DEFLATE   | .gz            | no
    bzip2              | bzip2 | bzip2     | .bz2           | yes
    LZO                | lzop  | LZO       | .lzo           | no
    Snappy             | none  | Snappy    | .snappy        | no

The corresponding encoders/decoders:

    Compression format | Codec class
    DEFLATE            | org.apache.hadoop.io.compress.DefaultCodec
    gzip               | org.apache.hadoop.io.compress.GzipCodec
    bzip2              | org.apache.hadoop.io.compress.BZip2Codec
    LZO                | com.hadoop.compression.lzo.LzopCodec
    Snappy             | org.apache.hadoop.io.compress.SnappyCodec

Compression performance: on a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

Hadoop compression configuration parameters:

    io.compression.codecs (configured in core-site.xml): the codecs available at the input stage; Hadoop uses the file extension to judge whether a codec is supported.
    mapreduce.map.output.compress (default false): set to true to compress map output.
    mapreduce.map.output.compress.codec (default DefaultCodec): the codec for map output, e.g. Snappy.
    mapreduce.output.fileoutputformat.compress (default false): set to true to compress job output.
    mapreduce.output.fileoutputformat.compress.codec (default DefaultCodec): the codec for job output, e.g. gzip or Snappy.
    mapreduce.output.fileoutputformat.compress.type (default RECORD): for SequenceFile output, compress by RECORD or BLOCK.

Enabling map output compression

Enabling map output compression reduces the volume of data transferred between the map and Reduce tasks of a job. The configuration is as follows.

Enable hive intermediate data compression:

    set hive.exec.compress.intermediate=true;

Enable map output compression:

    set mapreduce.map.output.compress=true;

Set the map output compression codec:

    set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Run a query to observe the effect:

    select count(1) from score;

Enabling reduce output compression

When Hive writes output into a table, the output content can likewise be compressed. The property hive.exec.compress.output controls this function. The user may want to keep the default value false in the default settings file, so that the default output is an uncompressed plain-text file; the user can set this value to true in a query statement or execution environment to enable output compression. Example:

Enable hive final output data compression:

    set hive.exec.compress.output=true;

Enable mapreduce final output compression:

    set mapreduce.output.fileoutputformat.compress=true;

Set the mapreduce final output codec:

    set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Set block compression for the mapreduce final output:

    set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Test whether the output is a compressed file:

    insert overwrite local directory '/export/servers/snappy' select * from score distribute by s_id sort by s_id desc;

Hive data storage formats

The storage formats Hive supports are mainly: TEXTFILE (row format), SEQUENCEFILE (row format), ORC (columnar format), and PARQUET (columnar format).

Characteristics of row storage: when querying a whole row that satisfies some condition, column storage needs to go to each aggregated column to find the value of every field, while row storage only needs to find one of the values; the rest of the values are in adjacent locations, so row-storage queries are faster in this case.

Characteristics of column storage: because the data of each field is stored aggregated together, when a query needs only a few fields, column storage greatly reduces the amount of data read.

Common storage formats in detail:

ORC (Optimized Row Columnar) is a new storage format introduced in hive 0.11. Each ORC file consists of one or more stripes, and each stripe has three parts: Index Data, Row Data, and Stripe Footer. rowData holds the actual data.

PARQUET is a binary, self-describing storage format. Usually, when storing Parquet data, the row-group size is set according to the HDFS Block size; since in general the smallest unit of data each Mapper task processes is one Block, each row group can then be handled by one Mapper task, increasing parallelism.

Comparison experiment: storage size of the mainstream file formats

Create a table stored as TEXTFILE:

    create table log_text (
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

Load data into the table and check its size:

    load data local inpath '/export/servers/hivedatas/log.data' into table log_text;
    dfs -du -h /user/hive/warehouse/myhive.db/log_text;

Create a table stored as ORC, load it through a query, and check its size:

    create table log_orc (
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS orc;

    insert into table log_orc select * from log_text;
    dfs -du -h /user/hive/warehouse/myhive.db/log_orc;

Create a table stored as PARQUET (same columns), load it, and check its size:

    create table log_parquet (
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS PARQUET;

    insert into table log_parquet select * from log_text;
    dfs -du -h /user/hive/warehouse/myhive.db/log_parquet;

Compression-ratio summary: ORC > Parquet > TextFile (ORC yields the smallest files).

Query-speed test of the stored files:

1) TextFile:

    hive (default)> select count(*) from log_text;
    Time taken: 21.54 seconds, Fetched: 1 row(s)

2) ORC:

    hive (default)> select count(*) from log_orc;
    Time taken: 20.867 seconds, Fetched: 1 row(s)

3) Parquet:

    hive (default)> select count(*) from log_parquet;
    Time taken: 22.922 seconds, Fetched: 1 row(s)

Query-speed summary: ORC > TextFile > Parquet.

Combining storage and compression: ORC tblproperties:

    Key                      | Default    | Notes
    orc.compress             | ZLIB       | high level compression (one of NONE, ZLIB, SNAPPY)
    orc.compress.size        | 262,144    | number of bytes in each compression chunk
    orc.stripe.size          | 67,108,864 | number of bytes in each stripe
    orc.row.index.stride     | 10,000     | number of rows between index entries (must be >= 1000)
    orc.create.index         | true       | whether to create row indexes
    orc.bloom.filter.columns | ""         | comma separated list of column names for which bloom filter should be created
    orc.bloom.filter.fpp     | 0.05       | false positive probability for bloom filter (must be > 0.0 and < 1.0)

Create an uncompressed ORC table, load it, and check its size:

    create table log_orc_none (
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS orc tblproperties ("orc.compress"="NONE");

    insert into table log_orc_none select * from log_text;
    dfs -du -h /user/hive/warehouse/myhive.db/log_orc_none;

Create a SNAPPY-compressed ORC table, load it, and check its size:

    create table log_orc_snappy (
      track_time string,
      url string,
      session_id string,
      referer string,
      ip string,
      end_user_id string,
      city_id string
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS orc tblproperties ("orc.compress"="SNAPPY");

    insert into table log_orc_snappy select * from log_text;
    dfs -du -h /user/hive/warehouse/myhive.db/log_orc_snappy;

Fetch task optimization

For some queries Hive does not have to use MapReduce at all. For example: SELECT * FROM score; in this case Hive can simply read the files in score's storage directory and print the query result to the console. The property hive.fetch.task.conversion controls this: set to none, every query goes through MapReduce; set to more, queries such as select *, selecting single columns, and limit do not run MapReduce. Case: with conversion disabled, all of the following run MapReduce:

    set hive.fetch.task.conversion=none;
    select * from score;
    select s_score from score;
    select s_score from score limit 3;

With conversion set to more, none of the following runs MapReduce:

    set hive.fetch.task.conversion=more;
    select * from score;
    select s_score from score;
    select s_score from score limit 3;

Local mode

Most Hadoop Jobs need the full scalability Hadoop provides to process large data sets. Sometimes, however, the input data volume is very small, and the time spent starting tasks can far exceed the actual execution time; in these cases Hive can handle all the tasks on a single machine through local mode, which noticeably shortens execution time. Case:

Enable local mode and run a query:

    set hive.exec.mode.local.auto=true;
    select * from score cluster by s_id;

Disable local mode and run the same query:

    set hive.exec.mode.local.auto=false;
    select * from score cluster by s_id;

MapJoin

If MapJoin is not specified, or the conditions for MapJoin are not met, the Hive resolver performs the join at the Reduce stage, which easily leads to data skew. A MapJoin can load the small table entirely into memory and perform the join on the map side, avoiding reducer processing.

Enable automatic MapJoin selection:

    set hive.auto.convert.join=true;

Set the small/large table threshold (by default a table below about 25M is considered small):

    set hive.mapjoin.smalltable.filesize=25000000;

Group By optimization

By default, data with the same Key at the Map stage is sent to one reducer, so the load skews when one key's data is too large. Not all aggregation operations have to be completed on the Reduce side: many aggregations can first partially aggregate on the Map side and produce the final result on the Reduce side.

Enable map-side aggregation (default true):

    set hive.map.aggr=true;

Number of entries over which map-side aggregation is performed:

    set hive.groupby.mapaggr.checkinterval=100000;

Perform load balancing when the data is skewed (default false):

    set hive.groupby.skewindata=true;
When hive.groupby.skewindata is set to true, the generated query plan has two MR Jobs. In the first MR Job, the Map output is randomly distributed across the Reduces, and each Reduce performs a partial aggregation and emits its result; the effect is that rows with the same Group By Key may be sent to different Reduces, which balances the load. The second MR Job then distributes the preprocessed results to the Reduces according to the Group By Key (this step guarantees that identical Group By Keys are sent to the same Reduce) and completes the final aggregation.

Count (distinct) optimization

When the data volume is small this does not matter; when the data volume is large, a COUNT DISTINCT operation needs a single Reduce task to complete, and the sheer amount of data that this one Reduce has to process makes the whole Job hard to finish, so COUNT DISTINCT is generally replaced by a GROUP BY followed by a COUNT.
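A minimal sketch of that rewrite, using the score table from earlier (the results are equivalent, but the de-duplication work is spread over many reducers before the final count):

    -- Skew-prone: a single reducer must see every s_id
    select count(distinct s_id) from score;

    -- Rewritten: group by spreads s_id across reducers, then a cheap count
    select count(*) from (select s_id from score group by s_id) t;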
