




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
1、Pyramid:the distributed infrastructure陽(yáng)振坤2009-06-17 Why distributed system?Massive data sizeTrillions of web pages indexed by Baidu/GoogleThousands of terabytes of logs.Challenges from machine failurePractically no fault-free machinesInflexibility of manual partitioning data among machinesWhy NOT Ha
2、doop?Baidu goes toe to toe with GoogleHuge gaps between Hadoop and Google on availability (fault tolerance), performance, scalability, featuresSingle point failureReal time fault tolerance, data/task migrationConcurrent appendingAutomatic rebalanceMaintenanceCopy-on-write + instant snapshot/checkpoi
3、ntB-tree + pressed file namespaceWhy will Pyramid beat Google?Standing on the shoulders of Google etcNo legacy burdensBetter infrastructureBetter design to fit modern machine metadataHigher localization rateWhat can Pyramid do for you?Scaling up to thousands of machines30GB/s and 3GB/s read/write pe
4、rformance in a 100-machine clusterAutomatic data partition among machinesTransparent machine failure handlingFailed machine is automatically detected and removedData and task are migrated to other machines automatically and quicklyMachine failure is transparent to appsAutomatic parallelizationNo par
5、allel or distributed system background are requiredDynamic load balance among machinesLow operation and maintenance costPyramids infrastructureDCS: Distributed Computing SystemDTS: Distributed Table SystemDFS: Distributed File SystemTo simplify design and implementationFor stage resultsDFSDTSDCSDFSs
6、 goalCapable of storing thousands of TB of dataHigh and sustainable aggregate IO bandwidthHundreds of GB/s read performanceTens of GB/s write performanceUninterruptible serviceBuilt-in fault tolerance and high availabilityAutomatic machine managementPlug and playDFSs infrastructureSingle master + ma
7、ny workersFiles: divided into fixed-size chunks (256MB) stored by workersChunks: replicated on multiple workersDFSDTSDCSDM333444555666777888000222111Infrastructure of DTSSingle DTS master + many workersSorted and partitioned by row keyEach partition is about 256MBPartitions can be split or merged du
8、e to insertion or deletionB+ tree hierarchyDTSs goalUp to trillions of rows and billions of columnsSorted and partitioned by row keyDistributed among hundreds or thousands of machinesEach machine serves a few to a few thousand partitions Queries are done by corresponding machinesDFSDTSDCSDCSs goalA
9、restricted programming model (Map/Reduce)Widely adaptableAutomatically parallel and transparently fault-tolerantA distributed execution frameworkPartitioning job to tasksScheduling tasks among machinesDealing with machine failureBenefits to engineersAutomatic parallelization Transparent fault-tolera
10、nceBoth native C+ and scriptNo experience on parallel or distributed system requiredDFSDTSDCSMap & reduce prototypeMap: (k1,v1) = list (k2, v2)k1, v1 and k2, v2 are all binary stringsUsually (k2,v2) are drawn from a different domain than the inputWritten by the userReduce: (k2, v2) = list(v)Both k2
11、and v2 are stringsUsually v is draw from the same domain as the inputWritten by the userHow MapReduce worksMapReduce transforms and rearranges dataMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RInput data for map #1Input data for map #2Input data for map #3Input data for map #MMapReduce dia
12、gramRead dataMap: extract something from each recordShuffle and SortReduce: aggregate, summarize, filter, or transformWrite the resultsScheduling & load balancingJob splittingM map tasksR Reduce tasksTask assignmentThe master obtains a machine pool from the underlying systemA task is assigned to a w
13、orker whenever it es idleData shuffling begins before all map tasks are doneReduce tasks will not start until all map tasks are doneDealing with stragglersA few stragglers are commonBad disksResource consumed by other jobsBackup taskStarting pointResult arbitrationCommutative and associative propert
14、yTask control policyEnable/disable backup tasksLimit map tasks number per machine during reduce timeCountering against failed workersDone/undergoing map tasks on failed worker Done/undergoing reduce tasks on failed workerMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RMap worker of word coun
15、tingRetrieve input data from DFS fileInterpret input data (text)Call map() to retrieve each word and emit(word, “1”)Deliver the emitted pair to corresponding reduce buckets: md5(key) mod R, R: number of reduce tasksSort each bucket and combine duplicated to and emitKeep final results to local disk a
16、s temporary fileReduce worker of word countingRetrieve corresponding bucket (set of ) from each map worker through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall reduce() to sum up under the same “word” and emit Process the emitted content by output_format_tDeli
17、ver final results to output_target_t (DFS file)Map worker of distributed sortRetrieve input data through input_source_t: e.g., DFS fileInterpret input data according to input_format_t: e.g., textCall user defined map() functionMap() retrieve every and emit themCollect pairs emitted by map()Deliver t
18、hese pairs to corresponding reduce bucket according to the partitioner_te.g., R-1 separator key strings, R: the number of reduce tasksKeep final results to local disk as temporary filestr1str2str(R-1)123RReduce worker of distributed sortRetrieve corresponding bucket (set of pairs) from each map work
19、er through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall user defined reduce() function: emit(key, value)Collect results emitted by reduce() Process the above results according to output_format_tDeliver final results to output_target_tstr1str2str(R-1)123RHow ma
20、p worksMap collects its output by bucketMulti-way mergingKept as local temporary files123R213R31232R123R123R512MB512MB512MBHow reduce worksReduce shuffles data from each map workerEach shuffle implies a disk read on a map machineReduce sorts its input dataMulti-way mergingOften assign reduce worker
21、more memoryMap 1:3Map 2:3Map 3:3333Map M:3A3,F3,B3,U3C3,K3,A3,A3,V3U3,C3,B3,L3V3,K3,L3,A31GBA3,F3,B3,U3,C3,K31GBA3,A3,V3,U3,C3,B31GBL3,1GBBatch lookup from a tableAssumptionsA large table (e.g., tens of GBs or more)A relatively large volume of source keys (a few GBs or more)TargetFor every key in th
22、e source, retrieve its value in the tableCan MapReduce offer a solution to it?One solutionInput sources: the table and the source keysThe map() functionFor the table input, emit(key+0, value)For the source keys input, emit(key+1, null)The reduce() functionFor , store key and value to member variable
23、s of the classFor , compare current key with stored key, emit if equal, emit otherwiseHow to make these two items in the same bucketPartition(key+x) = md5(key) mod RA few more MapReduce examplesSobar log analysis at 2008.11First DCS appRetrieve source data by another DCS app firstLinkuniq at 2008.12
24、1TB+ single reduce task input before applying combinerWebinfodb backup: retrieve and sortMap worker: retrieve webpage from webinfodbReduce worker: gunzip and press webpage by lsrPartitioner: a list of urlsASP log: retrieve and split by products PLSA30M docs, 700K words and 300/1000 topicsChallenges
25、during developmentComplexity500,000+ C+ code linesAsynchronous RPC callsFault toleranceFailed/slow machinesLarge clock skewTransient failed DNSMomentary network breakUndetected network transferring errorLost、disordered、delayed network packetsRoadmap to PyramidPreparationDFS designDFS development & t
26、estDCS designDCS development & testDTS designDTS development & test200720082009Q3Q4Q1Q2Q3Q4Q1Q2Q3Q4TodayDFS trial appDCS trial appDTS trial appReferencesReference:Pyramid Wiki: DCS SDK: Q & ADFS WorkerData are divided into chunks (256MB)Chunks are distributed among workersChunks are replicated on mu
27、ltiple workers for reliability as well as performanceClients communicates with workers for data directlyDFS MasterKeeps all metadata in memoryFile namespaceChunk namespaceLogs all metadata mutationsSends/receives heartbeat/response to/from all workersDetects machine failure and migrates chunks autom
28、aticallyBalances load between workers automaticallyShadow masters to counter against master failureClients communicate with the master for metadata onlyDFS Master failurePrimary master + shadow master(s)A shadow master keeps itself synchronized by replaying commit log produced by the primary masterA
29、 shadow master lags the primary master, typically fractions of a secondA shadow master es the primary after the latter failsDFS: Performance considerationsLarge chunk sizeChunk replicas distribution policyConcurrent appendingNearest network data transferringMasters batch commit logMasters in-memory
30、metadata policyMasters copy-on-write metadata structure10:1 read/write performance gapDFS: 3 replicasReliability and PerformanceDCS: The partition functionDefault partitionermd5(key) mod RMetamorphosis: Hash(f(key)Hash(url) = f(site(url)Hash(url) = f(domain(url)Hash(key+x) = f(key)Other partitioners
31、SeparatorsHow a worker performs a mutationA worker always writes a mutation to a memory patchReads and writes can be proceeded in parallel by adopting copy-on-write techniqueThe in-memory portion is written to disk and reset to empty when its size reaches a thresholdThe original tablet and its disk patches are immutable to accelerate reading, writing and splittingThe tablet is rew
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 穩(wěn)定運(yùn)營(yíng)財(cái)務(wù)成本管理試題及答案
- 專業(yè)保潔員職業(yè)發(fā)展培訓(xùn)計(jì)劃
- 小學(xué)三年級(jí)校外教學(xué)活動(dòng)計(jì)劃
- 城市排水管道成品保護(hù)措施
- 童趣動(dòng)畫角色授權(quán)衍生品設(shè)計(jì)、生產(chǎn)及全球分銷合同
- 父母子女共同保障的少兒未來教育金保險(xiǎn)費(fèi)分擔(dān)協(xié)議
- 抖音網(wǎng)紅IP授權(quán)終止及品牌形象維護(hù)合同
- 微信視頻號(hào)電商直播運(yùn)營(yíng)合作協(xié)議
- 獨(dú)家代理影視音樂版權(quán)授權(quán)與市場(chǎng)推廣合作協(xié)議
- 抖音直播火花零售行業(yè)品牌戰(zhàn)略合作協(xié)議
- 《電力安全工器具預(yù)防性試驗(yàn)規(guī)程》
- 貴陽(yáng)市普通住宅小區(qū)物業(yè)管理服務(wù)收費(fèi)參考標(biāo)準(zhǔn)
- MOOC 地學(xué)景觀探秘·審美·文化-重慶大學(xué) 中國(guó)大學(xué)慕課答案
- 安全生產(chǎn)事故報(bào)告處理制度范本
- (高清版)WST 311-2023 醫(yī)院隔離技術(shù)標(biāo)準(zhǔn)
- 2024年電梯安裝與維修工理論考試題庫(kù)及答案(通用版)
- 天耀中華合唱簡(jiǎn)譜大劇院版
- 【《我國(guó)互聯(lián)網(wǎng)企業(yè)價(jià)值評(píng)估現(xiàn)狀與問題探析11000字》(論文)】
- 智慧農(nóng)業(yè)的無人機(jī)技術(shù)應(yīng)用
- 建筑裝飾裝修工程消耗量定額
- 北京市2023年中考備考語(yǔ)文專題復(fù)習(xí) 名著閱讀題(解析)
評(píng)論
0/150
提交評(píng)論