Building and Innovating an Enterprise Data Warehouse on Hadoop

Liu Wanggen, Director of Big Data Platform R&D
Transwarp Technology (Shanghai) Co., Ltd.
wayne.liu@transwarp.io
2016/6/23, www.transwarp.io

Enterprise Data Warehouse Architecture

[Diagram: data sources → ETL (extract, transform, cleanse, load) → data warehouse → data marts → data services (OLAP, Analytics, Reporting)]

A Typical Case

[Diagram: SQL Server data warehouse (DW) loaded with SSIS + T-SQL; SQL Server SSAS for OLAP and analytics; SSRS for reporting]

Problems:
1. On a single-machine database, a table with hundreds of millions of records is already the upper limit for storage, query, and analysis.
2. Multidimensional cube data grows quickly; a single machine cannot store it, and centralized storage is expensive.
3. Query performance degrades, which in turn degrades stability.
4. The system cannot adapt to new business requirements, such as mobile applications or timeliness requirements.

If big data needs to be integrated, how should the data warehouse be built?

Enterprise Data Warehouse Architecture at a Major Internet Company

[Diagram: source data (structured data, unstructured data, streaming sources); ETL clusters; Teradata master cluster and Teradata standby cluster; HBase, Hive, and Kylin; daily sync-up to a MySQL reporting zone; capacity labels "< 5 PB" and "> 50 PB"; a portal serving tactical reporting, business intelligence, analytics, and forecasting]

Teradata

Strengths
• Mature SQL support and transaction support, with OLAP analysis capability
• A stable MPP execution engine that can process more than 100 TB of data
• Fairly complete management tools

Weaknesses
• A single node costs more than US$1 million, and scalability is insufficient at big-data scale
• Cannot store unstructured data or real-time data
• Cannot effectively support data mining workloads

MPP + Hadoop Hybrid Architecture

• Hybrid architecture
  – Structured data goes into MPP; unstructured and real-time data go into Hadoop
  – Core BI reports are generated by MPP; low-priority jobs run on Hadoop
  – OLAP services are provided by MPP; data on MPP is synchronized to Hadoop every day
  – To guarantee the SLA, strict rules are needed to control resource usage on MPP
• Strengths
  – Hadoop acts as a complement that meets the needs of new kinds of business while reusing the traditional MPP warehouse already built
  – Covers both offline batch processing and online OLAP analysis; the two systems back each other up, which improves data safety
• Weaknesses
  – System cost is high; Teradata is very expensive
  – Application developers must manage the storage and computation of every table in detail, and one business has to be adapted to two models
  – There is no unified engine; an application that uses data across systems must synchronize the data first
  – Real-time development has a high barrier to entry, data loss is common, and there is no HA
  – Open-source Hadoop does not support transactions, so full-table data synchronization is very cumbersome
  – Hadoop's SQL support is weak, and work usually mixes MapReduce, Spark, and scripting languages
  – Hadoop offers weak support for data management and development
  – A fairly large operations team is needed to provide architecture support and data development

Key Problems Hadoop Must Solve

• Warehouse capability and build cost: stability, reliability, and compute performance are critical. Hadoop currently has problems with both compute capability and stability, and development and operations costs are high.
• Warehouse batch processing and OLAP: batch workloads such as ETL are the largest resource consumers in a warehouse, and OLAP is the key to BI performance. Hadoop falls short in both.
• Data mining capability: Spark MLlib has a high technical barrier and can be used only by data scientists; ordinary business analysts cannot use it effectively.
• Data consistency and data synchronization: synchronization and consistency across systems cannot be guaranteed effectively, yet ETL consistency is critical to the business.
• Real-time and unstructured data processing: the warehouse platform must be able to process real-time and unstructured data, and developing and operating real-time applications must be simple and reliable.

How can the key problems of using Hadoop as a unified data warehouse be solved?

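Achievements of the Transwarp Team

1. Most complete SQL support, and the only PL/SQL support
   • Supports 99% of SQL 2003 syntax
   • The only Oracle PL/SQL support in the Hadoop industry
   • The only DB2 SQL/PL support in the Hadoop industry
   • Helps users migrate legacy applications at zero cost
   • Low development cost for new applications
2. Outstanding performance
   • First to enter the 100 TB era of complex data analysis
3. Distributed transaction processing
   • Guarantees ACID for transaction processing
   • Transactions use BEGIN TRANSACTION / COMMIT / ROLLBACK syntax
   • The two-phase locking protocol guarantees full serializability of transactions
   • Multi-version (snapshot) isolation gives read-only transactions high concurrency
4. The industry's only distributed streaming SQL
   • Lowers the barrier to developing stream applications and raises development efficiency
   • Targeted optimizations let StreamSQL outperform hand-coded stream applications
   • Helps users turn traditional business logic into stream applications at zero cost
5. Rich data mining and machine learning algorithms
6. Worry-free operations
   • Easy to use
   • 24×7 uninterrupted operation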

Inceptor PL/SQL Compiler Architecture

[Diagram: SQL statements, SQL Parser, Abstract Syntax Tree, SQL Normalizer, Constant Folding, optimizers, Physical Plan, RDD DAG, DAG Scheduler, Spark tasks]

Components named in the diagram:
• Logical Optimizer: CSE, bytecode generation, column pruner, operator pruner, partition pruner, predicate pushdown
• PL/SQL AST optimizer
• SQL 2003: join optimizations, CBO Optimizer, table statistics
• PL/SQL Control Flow Graph Analyzer
• CFG Optimizer: function inlining, dead code elimination, redundant elimination, CSE, loop invariants hoisting
• Parallel Optimizer: cursor parallelization
• DAG Optimizer: shuffle reducer

First PL/SQL Compiler on Hadoop; 98% Oracle PL/SQL Compatibility.
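Constant folding, one of the passes named in the compiler diagram, can be illustrated with a minimal sketch. This is not Inceptor's implementation: it assumes Python's standard `ast` module (Python 3.9 or later, for `ast.unparse`) as a stand-in parser, and the names `ConstantFolder` and `fold` are hypothetical.

```python
import ast
import operator

# Arithmetic operators this sketch knows how to evaluate at compile time.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


class ConstantFolder(ast.NodeTransformer):
    """Replace binary operations on two numeric literals with their value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the operands first (bottom-up)
        op = _OPS.get(type(node.op))
        left, right = node.left, node.right
        if (
            op is not None
            and isinstance(left, ast.Constant)
            and isinstance(right, ast.Constant)
            and type(left.value) in (int, float)
            and type(right.value) in (int, float)
        ):
            folded = ast.Constant(value=op(left.value, right.value))
            return ast.copy_location(folded, node)
        return node


def fold(expr: str) -> str:
    """Parse an expression, fold constant subexpressions, return source text."""
    tree = ast.parse(expr, mode="eval")
    tree = ast.fix_missing_locations(ConstantFolder().visit(tree))
    return ast.unparse(tree)


print(fold("price * (60 * 60) + 1"))
```

The pass works bottom-up, so a subexpression such as `60 * 60` is folded even when the enclosing expression still depends on a column or variable.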

Scalable Distributed Computing Technology

[Chart: Transwarp Inceptor's Performance, TPC-DS execution time for 99 queries (in minutes) at 1 TB, 10 TB, 30 TB, and 100 TB; y-axis from 0 to 3000]

Test environment:
• 29 worker nodes
• 2 CPUs, 12 cores, E5-2620 v2
• 100 GB memory
• Network: 2 × 1 Gbps
• Disks: 12 × 3 TB

[Diagram: Transwarp Inceptor's Physical Deployment Diagram, showing the Inceptor master with metastore and Transaction Manager, four Executors, four DataNodes holding ORC files, and a ZK cluster]

Transwarp's SQL on Hadoop can already handle complex analysis of 100 TB of data efficiently.

[Diagram: single machine; parallel computing (MPI message-passing model); distributed computing with both data and computation distributed (Map/Reduce computing model)]

Distributed Transaction Support

Serializable Snapshot Isolation is implemented with a multi-version two-phase locking protocol.

Transaction 1
    begin transaction
    select max(price) from orders where age < 20
    -- read value into local variable maxorder
    update orders set price = maxorder - 1,
    commit

Transaction 2
    begin transaction
    update orders set price = 200 where id = "007"
    commit

Advantages:
1. The two-phase locking protocol guarantees full serializability of transactions.
2. Multi-version (snapshot) isolation gives read-only transactions high concurrency.
3. Distributed computing delivers very high throughput.
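To see why multi-version (snapshot) isolation lets read-only transactions run without blocking writers, consider the following toy sketch. It is an illustration under stated assumptions, not Inceptor's transaction manager: `VersionedStore` and `Snapshot` are hypothetical names, commits are applied one at a time, and the two-phase locking that orders concurrent writers is not modeled.

```python
class VersionedStore:
    """Toy multi-version store: every commit appends new versions."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), ascending
        self._clock = 0      # timestamp of the latest commit

    def commit(self, writes):
        """Apply a dict of writes atomically under one new commit timestamp."""
        self._clock += 1
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        """Return a read-only view pinned to the latest commit timestamp."""
        return Snapshot(self, self._clock)

    def versions(self, key):
        return self._versions.get(key, [])


class Snapshot:
    """Read-only transaction that only sees versions committed before it began."""

    def __init__(self, store, ts):
        self._store = store
        self._ts = ts

    def get(self, key):
        # Newest version whose commit timestamp is not after the snapshot.
        for commit_ts, value in reversed(self._store.versions(key)):
            if commit_ts <= self._ts:
                return value
        return None


store = VersionedStore()
store.commit({"007": 150, "008": 120})  # initial prices (made-up values)
reader = store.snapshot()               # read-only transaction begins
store.commit({"007": 200})              # a writer updates order "007"
print(reader.get("007"), store.snapshot().get("007"))
```

The reader keeps seeing the price that was current when it began, while a snapshot taken after the update sees the new price; neither side waits for the other.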

[Chart: CRUD Performance (MERGE INTO) for 1 column, 8 columns, and 18 columns; y-axis from 0 to 2,000,000]

Test environment:
• 4 worker nodes
• 2 CPUs, 12 cores, E5-2620 v2
• 256 GB memory
• Disks: 6 × 1 TB

Interactive Analysis with Distributed Memory

Holodesk: a columnar store, in memory or on an SSD cache layer

[Diagram: the Inceptor Server drives four Executors. Each Executor uses a Columnar Store API over an off-heap store that holds Cube (D1, D2, D3); columns D1, D2, D3, and M1, each with an INDEX; and Cubes (D1, D2), (D2, D3), (D1, D3). The storage tiers are a memory tier and an SSD tier (SSD as cache) with a columnar store, secondary index, and table format. Underneath, a file system API reaches the HDFS storage layer with Text, ORC, or Parquet files. A ZK cluster is shown alongside. Sample table cells: 1 W A, 2 X B, 3 Y C, 4 Z D, 5 O E, 6 P F, 7 Q G, 8 R H.]

Cube is a commonly used technique in OLAP analysis:
– Slice
– Dice
– Rollup
– Drill Up
– Drill Down
– Pivot
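The cube layout in the Holodesk diagram, with one cube over (D1, D2, D3) and further cubes over the dimension pairs, can be sketched as pre-aggregating a measure for each combination of dimensions. This is a minimal illustration, not Holodesk's storage format: `build_cubes` is a hypothetical helper, the rows are made-up sample data, and SUM is assumed as the aggregate.

```python
from collections import defaultdict
from itertools import combinations


def build_cubes(rows, dims, measure):
    """Pre-aggregate SUM(measure) for every non-empty subset of dims."""
    cubes = {}
    for size in range(len(dims), 0, -1):
        for group in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cubes[group] = dict(totals)
    return cubes


# Made-up sample rows with three dimensions and one measure.
rows = [
    {"D1": "east", "D2": "web", "D3": "2016", "M1": 10},
    {"D1": "east", "D2": "app", "D3": "2016", "M1": 5},
    {"D1": "west", "D2": "web", "D3": "2016", "M1": 7},
]
cubes = build_cubes(rows, ("D1", "D2", "D3"), "M1")

# Rollup: totals per D1 come straight from the coarser cube.
print(cubes[("D1",)])

# Slice: fix D1 = "east" inside the (D1, D2) cube.
east = {k: v for k, v in cubes[("D1", "D2")].items() if k[0] == "east"}
print(east)
```

With the aggregates materialized ahead of time, rollup and slice become dictionary lookups and filters rather than scans over the base rows, which is the trade of storage for query latency that cube techniques make.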

Stream Processing with SQL Support

Counting website visits by URL with StreamSQL:

    CREATE STREAM accesslog(ip STRING, url STRING, time TIMESTAMP)
      TBLPROPERTIES("topic"="accesslog", "kafka.zookeeper"="2:2181");
    CREATE TABLE result(url STRING, count INT);

Counting visits in 10 s intervals by message time:

    INSERT INTO result
      SELECT url, count(1) FROM accesslog GROUP BY time
      STREAMWINDOW sw AS (SEPERATED BY time LENGTH '10' SLIDE '10' SECOND
                          FORMAT 'yyyy-MM-dd hh-mm-ss.SSS');

Counting visits in 10 s intervals by system time:

    INSERT INTO result
      SELECT url, count(1) FROM accesslog GROUP BY time
      STREAMWINDOW sw AS (LENGTH '10' SLIDE '10' SECOND);
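The message-time window above, with a length of 10 seconds and a slide of 10 seconds, behaves like a tumbling window. The following sketch shows that grouping in plain Python. It is an assumption-laden illustration rather than how StreamSQL executes: `count_by_window` is a hypothetical helper, events are (timestamp in seconds, URL) pairs, and late or out-of-order events are simply assigned to the window their timestamp falls in.

```python
from collections import Counter, defaultdict


def count_by_window(events, length_s=10):
    """Count visits per URL in tumbling windows keyed by window start time."""
    windows = defaultdict(Counter)
    for ts, url in events:
        window_start = int(ts // length_s) * length_s
        windows[window_start][url] += 1
    return dict(windows)


# Made-up access log entries: (message time in seconds, url).
events = [(0.5, "/a"), (3, "/b"), (9.9, "/a"), (10, "/a"), (25, "/b")]
for start, counts in sorted(count_by_window(events).items()):
    print(start, dict(counts))
```

Switching from message time to system time would mean stamping each event with the arrival time instead of the time carried in the message; the windowing arithmetic stays the same.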

Stream Architecture

[Diagram: access through ODBC, JDBC, Shell, and an R API module; Stream SQL and the SQL Compiler; an Algorithm API for data mining; a computing layer with Source Manager, Application Manager, Distributed Execution Engine, Sinker, and Storage Manager; storage in Transwarp Hyperbase and Transwarp Holodesk]

• PL/SQL and ANSI SQL on streaming; can run complex OLAP analysis on streams
• Kafka, socket, and file sources are supported
• Provides a writer pool that improves write throughput into Hyperbase and Holodesk
• Parallel task scheduling for better performance
• Optimized execution for window data analysis

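Customer Behavior Analysis in a Development Style Familiar to Analysts

Training data: a sample of China Minsheng Bank transaction records for the six months from April to September 2012, about 200 million records in total, covering 5.06 million unique cardholders, with a data size of about 80 GB.

Parallel 360-degree user profiling: profiles for the 5.06 million unique cardholders are completed within 2 minutes.

Profile tags: spending frequency, spending level, food lover, travel lover, sports lover, electronics lover, IT lover, young and energetic, male, female, business person, driver, heavy phone user, frequent business traveler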

Data Mining Product: Discover

[Diagram: data sources (HDFS, Inceptor, local file, RDBMS); unified data interface layer exposing DataFrame (Table); analysis layer with a data interface and an algorithm interface, used from R and Python; algorithm library (machine learning library, deep learning library, feature library, feature extraction library, industry solutions); unified model access interface layer; storage]

TDH Data Warehouse Architecture

[Diagram: data sources (Oracle, DB2, Sybase, Informix, SQL Server, message queue); ETL; data warehouse; data marts; data services over JDBC/ODBC (Portal, Tactical Reporting, Business Intelligence, Analytics, Forecasting); TDH components: Inceptor, Hyperbase, Holodesk, Stream, Discover, Transwarp Manager, Guardian, R Studio, Data Dictionary, Transwarp DataAlive]

ETL Validation at an Internet Company

[Chart: Complex ETL task performance comparison, TDH versus Teradata, with Perf Speedup/core; axis ticks 0.10, 1.00, 10.00 and 0 to 600]

TDH
• 40 worker nodes
• 2 CPUs, 10 cores, E5-2620 v2
• 128G
• Network: 10 Gbps
• Disks: 4 × 1 TB

Teradata
• 6800H, 3700 AMP
• 2 CPUs, 14 cores, E5-2697
• Network: BYNET

ETL
• 7 tables: 3 fact and 4 dimension
• 9.8 TB in total
• 16 complex SQL statements covering join, group by, merge into, and insert

Conclusions
• TDH is fully capable of running the complex ETL tasks that run on Teradata, and its per-CPU efficiency is twice that of Teradata.
• A data warehouse built on TDH alone can fully replace the hybrid deployment.
