




版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
1、Pyramid:the distributed infrastructure陽振坤2009-06-17 Why distributed system?Massive data sizeTrillions of web pages indexed by Baidu/GoogleThousands of terabytes of logs.Challenges from machine failurePractically no fault-free machinesInflexibility of manual partitioning data among machinesWhy NOT Ha
2、doop?Baidu goes toe to toe with GoogleHuge gaps between Hadoop and Google on availability (fault tolerance), performance, scalability, featuresSingle point failureReal time fault tolerance, data/task migrationConcurrent appendingAutomatic rebalanceMaintenanceCopy-on-write + instant snapshot/checkpoi
3、ntB-tree + pressed file namespaceWhy will Pyramid beat Google?Standing on the shoulders of Google etcNo legacy burdensBetter infrastructureBetter design to fit modern machine metadataHigher localization rateWhat can Pyramid do for you?Scaling up to thousands of machines30GB/s and 3GB/s read/write pe
4、rformance in a 100-machine clusterAutomatic data partition among machinesTransparent machine failure handlingFailed machine is automatically detected and removedData and task are migrated to other machines automatically and quicklyMachine failure is transparent to appsAutomatic parallelizationNo par
5、allel or distributed system background are requiredDynamic load balance among machinesLow operation and maintenance costPyramids infrastructureDCS: Distributed Computing SystemDTS: Distributed Table SystemDFS: Distributed File SystemTo simplify design and implementationFor stage resultsDFSDTSDCSDFSs
6、 goalCapable of storing thousands of TB of dataHigh and sustainable aggregate IO bandwidthHundreds of GB/s read performanceTens of GB/s write performanceUninterruptible serviceBuilt-in fault tolerance and high availabilityAutomatic machine managementPlug and playDFSs infrastructureSingle master + ma
7、ny workersFiles: divided into fixed-size chunks (256MB) stored by workersChunks: replicated on multiple workersDFSDTSDCSDM333444555666777888000222111Infrastructure of DTSSingle DTS master + many workersSorted and partitioned by row keyEach partition is about 256MBPartitions can be split or merged du
8、e to insertion or deletionB+ tree hierarchyDTSs goalUp to trillions of rows and billions of columnsSorted and partitioned by row keyDistributed among hundreds or thousands of machinesEach machine serves a few to a few thousand partitions Queries are done by corresponding machinesDFSDTSDCSDCSs goalA
9、restricted programming model (Map/Reduce)Widely adaptableAutomatically parallel and transparently fault-tolerantA distributed execution frameworkPartitioning job to tasksScheduling tasks among machinesDealing with machine failureBenefits to engineersAutomatic parallelization Transparent fault-tolera
10、nceBoth native C+ and scriptNo experience on parallel or distributed system requiredDFSDTSDCSMap & reduce prototypeMap: (k1,v1) = list (k2, v2)k1, v1 and k2, v2 are all binary stringsUsually (k2,v2) are drawn from a different domain than the inputWritten by the userReduce: (k2, v2) = list(v)Both k2
11、and v2 are stringsUsually v is draw from the same domain as the inputWritten by the userHow MapReduce worksMapReduce transforms and rearranges dataMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RInput data for map #1Input data for map #2Input data for map #3Input data for map #MMapReduce dia
12、gramRead dataMap: extract something from each recordShuffle and SortReduce: aggregate, summarize, filter, or transformWrite the resultsScheduling & load balancingJob splittingM map tasksR Reduce tasksTask assignmentThe master obtains a machine pool from the underlying systemA task is assigned to a w
13、orker whenever it es idleData shuffling begins before all map tasks are doneReduce tasks will not start until all map tasks are doneDealing with stragglersA few stragglers are commonBad disksResource consumed by other jobsBackup taskStarting pointResult arbitrationCommutative and associative propert
14、yTask control policyEnable/disable backup tasksLimit map tasks number per machine during reduce timeCountering against failed workersDone/undergoing map tasks on failed worker Done/undergoing reduce tasks on failed workerMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RMap worker of word coun
15、tingRetrieve input data from DFS fileInterpret input data (text)Call map() to retrieve each word and emit(word, “1”)Deliver the emitted pair to corresponding reduce buckets: md5(key) mod R, R: number of reduce tasksSort each bucket and combine duplicated to and emitKeep final results to local disk a
16、s temporary fileReduce worker of word countingRetrieve corresponding bucket (set of ) from each map worker through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall reduce() to sum up under the same “word” and emit Process the emitted content by output_format_tDeli
17、ver final results to output_target_t (DFS file)Map worker of distributed sortRetrieve input data through input_source_t: e.g., DFS fileInterpret input data according to input_format_t: e.g., textCall user defined map() functionMap() retrieve every and emit themCollect pairs emitted by map()Deliver t
18、hese pairs to corresponding reduce bucket according to the partitioner_te.g., R-1 separator key strings, R: the number of reduce tasksKeep final results to local disk as temporary filestr1str2str(R-1)123RReduce worker of distributed sortRetrieve corresponding bucket (set of pairs) from each map work
19、er through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall user defined reduce() function: emit(key, value)Collect results emitted by reduce() Process the above results according to output_format_tDeliver final results to output_target_tstr1str2str(R-1)123RHow ma
20、p worksMap collects its output by bucketMulti-way mergingKept as local temporary files123R213R31232R123R123R512MB512MB512MBHow reduce worksReduce shuffles data from each map workerEach shuffle implies a disk read on a map machineReduce sorts its input dataMulti-way mergingOften assign reduce worker
21、more memoryMap 1:3Map 2:3Map 3:3333Map M:3A3,F3,B3,U3C3,K3,A3,A3,V3U3,C3,B3,L3V3,K3,L3,A31GBA3,F3,B3,U3,C3,K31GBA3,A3,V3,U3,C3,B31GBL3,1GBBatch lookup from a tableAssumptionsA large table (e.g., tens of GBs or more)A relatively large volume of source keys (a few GBs or more)TargetFor every key in th
22、e source, retrieve its value in the tableCan MapReduce offer a solution to it?One solutionInput sources: the table and the source keysThe map() functionFor the table input, emit(key+0, value)For the source keys input, emit(key+1, null)The reduce() functionFor , store key and value to member variable
23、s of the classFor , compare current key with stored key, emit if equal, emit otherwiseHow to make these two items in the same bucketPartition(key+x) = md5(key) mod RA few more MapReduce examplesSobar log analysis at 2008.11First DCS appRetrieve source data by another DCS app firstLinkuniq at 2008.12
24、1TB+ single reduce task input before applying combinerWebinfodb backup: retrieve and sortMap worker: retrieve webpage from webinfodbReduce worker: gunzip and press webpage by lsrPartitioner: a list of urlsASP log: retrieve and split by products PLSA30M docs, 700K words and 300/1000 topicsChallenges
25、during developmentComplexity500,000+ C+ code linesAsynchronous RPC callsFault toleranceFailed/slow machinesLarge clock skewTransient failed DNSMomentary network breakUndetected network transferring errorLost、disordered、delayed network packetsRoadmap to PyramidPreparationDFS designDFS development & t
26、estDCS designDCS development & testDTS designDTS development & test200720082009Q3Q4Q1Q2Q3Q4Q1Q2Q3Q4TodayDFS trial appDCS trial appDTS trial appReferencesReference:Pyramid Wiki: DCS SDK: Q & ADFS WorkerData are divided into chunks (256MB)Chunks are distributed among workersChunks are replicated on mu
27、ltiple workers for reliability as well as performanceClients communicates with workers for data directlyDFS MasterKeeps all metadata in memoryFile namespaceChunk namespaceLogs all metadata mutationsSends/receives heartbeat/response to/from all workersDetects machine failure and migrates chunks autom
28、aticallyBalances load between workers automaticallyShadow masters to counter against master failureClients communicate with the master for metadata onlyDFS Master failurePrimary master + shadow master(s)A shadow master keeps itself synchronized by replaying commit log produced by the primary masterA
29、 shadow master lags the primary master, typically fractions of a secondA shadow master es the primary after the latter failsDFS: Performance considerationsLarge chunk sizeChunk replicas distribution policyConcurrent appendingNearest network data transferringMasters batch commit logMasters in-memory
30、metadata policyMasters copy-on-write metadata structure10:1 read/write performance gapDFS: 3 replicasReliability and PerformanceDCS: The partition functionDefault partitionermd5(key) mod RMetamorphosis: Hash(f(key)Hash(url) = f(site(url)Hash(url) = f(domain(url)Hash(key+x) = f(key)Other partitioners
31、SeparatorsHow a worker performs a mutationA worker always writes a mutation to a memory patchReads and writes can be proceeded in parallel by adopting copy-on-write techniqueThe in-memory portion is written to disk and reset to empty when its size reaches a thresholdThe original tablet and its disk patches are immutable to accelerate reading, writing and splittingThe tablet is rew
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
- 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 產(chǎn)品差異化與供應(yīng)鏈金融創(chuàng)新考核試卷
- 體育會展項目融資工具創(chuàng)新考核試卷
- 電氣系統(tǒng)維護考核試卷
- 人工智能在罕見內(nèi)分泌疾病診斷中的多模態(tài)數(shù)據(jù)應(yīng)用考核試卷
- 供應(yīng)鏈金融創(chuàng)新服務(wù)考核試卷
- 傳動部件的動態(tài)性能仿真分析考核試卷
- 2025年中國PVC便箋盒數(shù)據(jù)監(jiān)測研究報告
- 2025年中國FR挾口杯數(shù)據(jù)監(jiān)測研究報告
- 2025至2030年中國面罩市場分析及競爭策略研究報告
- 2025至2030年中國鋁研磨面板材市場分析及競爭策略研究報告
- 2025至2030中國柔性直流輸電行業(yè)運營規(guī)劃及發(fā)展前景深度分析報告
- 安全產(chǎn)風(fēng)險管理制度
- 深化國有企業(yè)改革調(diào)研提綱
- 小學(xué)騎車安全課件
- 公司個人獨資章程范本
- 《中國酒類企業(yè)ESG披露指南》
- 2025至2030年中國玉米淀粉行業(yè)市場現(xiàn)狀分析及前景戰(zhàn)略研判報告
- 安徽省2025年普通高校招生志愿預(yù)填表(普通類)
- 2025高考全國一卷語文真題
- 2025年微電子科學(xué)與工程專業(yè)就業(yè)前景調(diào)查報告
- 《生物活性物質(zhì)》課件
評論
0/150
提交評論