下一代hadoop上的執(zhí)行引擎_第1頁(yè)
下一代hadoop上的執(zhí)行引擎_第2頁(yè)
下一代hadoop上的執(zhí)行引擎_第3頁(yè)
下一代hadoop上的執(zhí)行引擎_第4頁(yè)
下一代hadoop上的執(zhí)行引擎_第5頁(yè)
已閱讀5頁(yè),還剩25頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、 Hortonworks Inc. 2014Page 1Apache Tez : Next Generation Execution Engine upon HadoopJeff Zhang Hortonworks Inc. 2014OutlineTez IntroductionTez APITez InternalTez Project StatusQ & A Hortonworks Inc. 2014Tez IntroductionPage 3 Distributed execution framework targeted towards data-processing appl

2、ications. Based on expressing a computation as a dataflow graph. Highly customizable to meet a broad spectrum of use cases. Built on top of YARN the resource management framework for Hadoop. Open source Apache project and Apache licensed.Hadoop 1 - Hadoop 2HADOOP 1.0HDFS(redundant, reliable storage)

3、MapReduce(cluster resource management & data processing)Pig(data flow)Hive(sql)Others(cascading)HDFS2(redundant, reliable storage)YARN(cluster resource management)Tez(execution engine)HADOOP 2.0Data FlowPigSQLHiveOthers(Cascading)BatchMapReduceReal Time Stream ProcessingStormOnline Data Processi

4、ngHBase,AccumuloMonolithicResource ManagementExecution EngineUser APILayeredResource Management YARNExecution Engine TezUser API Hive, Pig, Cascading, Your App! Hortonworks Inc. 2014Pig/Hive-MR versus Pig/Hive-TezPage 5Pig/Hive - MRPig/Hive - TezI/O Synchronization BarrierI/O Synchronization Barrier

5、Job 1Job 2Job 3Single Job Hortonworks Inc. 2014OutlineTez IntroductionTez APITez InternalTez Project InfoQ & A Hortonworks Inc. 2014Tez Expressing the computationPage 7Tez provides the following APIs to define the processing DAG API ( Vertex, Edge ) Defines the structure of the data processing a

6、nd the relationship between producers and consumers Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime This is how all the tasks in the job get specified Runtime API ( Task ) Defines the interfaces using which the framework and

7、 app code interact with each other App code transforms data and moves it between tasks This is how we specify what actually executes in each task on the cluster nodes Hortonworks Inc. 2014Tez DAG API / Define DAG DAG dag = new DAG(); / Define Vertex Vertex map1 = new Vertex(MapProcessor.class); / De

8、fine Edge Edge edge1 = Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class); / Connect them dag.addVertex(map1) .addEdge(edge) Page 8Simple DAG definition API Hortonworks Inc. 2014Tez DAG APIPage 9 Data movement Defines routing of data between tasksOne-To-One : Data

9、from the ith producer task routes to the ith consumer task.Broadcast : Data from a producer task routes to all consumer tasks.Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task. Scheduling

10、 Defines when a consumer task is scheduledSequential : Consumer task may be scheduled after a producer task completes.Concurrent : Consumer task must be co-scheduled with a producer task. Data source Defines the lifetime/reliability of a task outputPersisted : Output will be available after the task

11、 exits. Output may be lost later on.Persisted-Reliable : Output is reliably stored and will always be availableEphemeral : Output is available only while the producer task is runningEdge properties define the connection between producer and consumer tasks in the DAG Hortonworks Inc. 2014Tez Runtime

12、API (IPO)Flexible Inputs-Processor-Outputs Model Thin API layer to wrap around arbitrary application code Compose inputs, processor and outputs to execute arbitrary processing Event routing based control plane architecture Applications decide logical data format and data transfer technology Customiz

13、e for performance Built-in implementations for Hadoop 2.0 data services HDFS and YARN ShuffleService. Built on the same API. Your impls are as first class as ours!Page 10InputProcessorOutput Hortonworks Inc. 2014Tez Runtime API ( VertexManager) VertexManagerControl on the flow execution engine in ve

14、rtex level Hortonworks Inc. 2014Tez Logical DAG expansion at RuntimePage 12Reduce1Map2Reduce2JoinMap1 Hortonworks Inc. 2014Tez Performance Benefits of expressing the data processing as a DAG Reducing overheads and queuing effects Gives system the global picture for better planning Efficient use of r

15、esources Re-use resources to maximize utilization Pre-launch, pre-warm and cache Locality & resource aware scheduling Support for application defined DAG modifications at runtime for optimized execution Change task concurrency Change task scheduling Change DAG edges Change DAG verticesPage 13 Ho

16、rtonworks Inc. 2014Tez Benefits of DAG executionFaster Execution and Higher Predictability Eliminate replicated write barrier between successive computations. Eliminate job launch overhead of workflow jobs. Eliminate extra stage of map reads in every workflow job. Eliminate queue and resource conten

17、tion suffered by workflow jobs that are started after a predecessor job completes. Better locality because the engine has the global picturePage 14Pig/Hive - MRPig/Hive - Tez Hortonworks Inc. 2014OutlineTez IntroductionTez APITez InternalTez Project InfoQ & A Hortonworks Inc. 2014Tez System Diag

18、ram Hortonworks Inc. 2014Tez Container Re-Use Reuse YARN containers/JVMs to launch new tasks Reduce scheduling and launching delays Shared in-memory data across tasks JVM JIT friendly executionPage 17YARN Container / JVMTezChildTezTask1TezTask2Shared ObjectsYARN ContainerTez Application MasterStart

19、TaskTask DoneStart Task Hortonworks Inc. 2014Tez Customizable Core EnginePage 18Vertex-2Vertex-1StartvertexVertex ManagerStarttasksDAGSchedulerGet PriorityGet PriorityStartvertexTaskSchedulerGet containerGet containerVertex ManagerDetermines task parallelismDetermines when tasks in a vertex can star

20、t.DAG SchedulerDetermines priority of taskTask SchedulerAllocates containers from YARN and assigns them to tasks Hortonworks Inc. 2014Tez SessionsPage 19Application MasterClientStart SessionSubmit DAGTask SchedulerContainer PoolShared Object RegistryPre WarmedJVMSessions Standard concepts of pre-lau

21、nch and pre-warm applied Key for interactive queries Represents a connection between the user and the cluster Multiple DAGs executed in the same session Containers re-used across queries Takes care of data locality and releasing resources when idle Hortonworks Inc. 2014Use case of Tez Split Group by

22、 + Join Orderby Automatic Reduce Parallelism Reduce Slow Start/Pre-launch Hortonworks Inc. 2014Split Group by + Joinf = load foo as (x,y,z)f1 = Group f By x;f2 = Group f By y;j = Join g1 by group, g2 by group Hortonworks Inc. 2014Orderbyf = Load foo as (x, y);o = Order f by x; Hortonworks Inc. 2014T

23、ez Automatic Reduce ParallelismPage 23 Map VertexReduce VertexApp MasterVertex ManagerData Size StatisticsVertex StateMachineSet ParallelismCancel TaskRe-RouteEvent ModelMap tasks send data statistics events to the Reduce Vertex Manager.Vertex ManagerPluggable application logic that understands the

24、data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism Hortonworks Inc. 2014Tez Reduce Slow Start/Pre-launchPage 24 Map VertexReduce VertexApp MasterVertex ManagerTask CompletedVertex StateMachineStart TasksStartEvent ModelMap completion events sent to th

25、e Reduce Vertex Manager.Vertex ManagerPluggable application logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed so that shuffle can start. Hortonworks Inc. 2014OutlineTez IntroductionTez APITez InternalTez Project StatusQ & A

26、Hortonworks Inc. 2014Tez Current status Apache Top Level ProjectRapid development. Over 1500 jiras opened. Over 1100 resolvedGrowing community of contributors and usersLatest release is 0.5 Focus on stabilityTesting and quality are highest priorityCode ready and deployed on multi-node environments a

27、t scale Support for a vast topology of DAGs Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changesApache Hive 0.13 release supports Tez as an execution engine (HIVE-4660)Apache Pig port to Tez is also done(PIG-3446)Cascading 3.0 support

28、TezPage 26 Hortonworks Inc. 2014Tez Adoption Apache Hive Hadoop standard for declarative access via SQL-like interface Apache Pig Hadoop standard for procedural scripting and pipeline processing Cascading Developer friendly Java API and SDK Scalding (Scala API on Cascading) Commercial Vendors ETL : Use Tez instead of MR or custom pipelines Analytics Vendors : Use Tez as a target platform for scaling parallel analytical tools to large

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論