Parallel Computer Architecture: Course Slides

Format: PPT; 144 pages; uploaded 2022-12-30.


Parallel Computer Architecture

Lecture 7

The Introduction of Multicore Processor
April 13, 2021

2022/12/30  The Introduction of Multicore Processor

Outline
- Driving forces behind the development of multicore processors
- Key problems that multicore processors must solve
- Current state of multicore processors
- Emerging technologies in multicore processors

Today's Processor
- Voltage level: a flashlight (~1 volt)
- Current level: an oven (~250 amps)
- Power level: a light bulb (~100 watts)
- Area: a postage stamp (~1 square inch)
- Performance: GFLOPS

What is the future need?
- The need for performance is never-ending
- Complaints from end users today
- Tomorrow's killer application
- Next step: how can we get to 1 TFLOPS?

Tomorrow's Killer Application (RMS)

Driving forces of multicore: wire delay
Consider the 1 Tflop/s sequential machine:
- Data must travel some distance, r, to get from memory to the CPU.
- To get 1 data element per cycle, data must make this trip 10^12 times per second at the speed of light, c = 3 x 10^8 m/s. Thus r < c / 10^12 = 0.3 mm.
- Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area: each word occupies about 3 square Angstroms, or the size of a small atom.
- No choice but parallelism.
[Figure: a 1 Tflop/s, 1 Tbyte sequential machine with r = 0.3 mm]

Driving forces of multicore: heat
[Figure: power density in W/cm^2 (log scale, 1 to 1000) rising with increasing frequency across the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4 (Willamette), and Pentium 4 (Prescott); reference levels: hot plate, nuclear reactor, rocket nozzle]
Dissipated power ~ C V^2 f

Managing the Heat Load
- Liquid cooling system in Apple G5s
- Heat sinks in 6XX series Pentium 4s

Driving forces of multicore: leakage current
From Minor Nuisance to Chip Killer
[Figure: dynamic power and leakage power in W (0 to 300) versus process technology in nm (250, 180, 130, 90, 70)]
Dissipated power ~ C V^2 f

Driving forces of multicore: manufacturing cost
- Moore's 2nd law (Rock's law)
- Demo of 0.06 micron CMOS

Technology Trends: Microprocessor Capacity
- 2X transistors/chip every 1.5 years, called "Moore's Law"

- Microprocessors have become smaller, denser, and more powerful.
- Not just processors: bandwidth, storage, etc.
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Moore's Law Still Holds
No Exponential is Forever, But perhaps we can Delay it Forever

Means of Increasing Performance
- Increasing clock frequency
  - From 60 MHz to 3,800 MHz in 12 years
  - Has resulted in the expected performance increase
- Execution optimization
  - Its core is instruction-level parallelism

A brief history of micro-architecture evolution
Two axes:
- Exploring parallelism; much of the performance comes from parallelism
  - Bit-level parallelism
  - Instruction-level parallelism (ILP)
  - Thread-level parallelism (TLP)
- Hiding the memory latency
[Figure: evolution steps: 4-bit data; 32-bit data; pipeline, in-order; pipeline, out-of-order; superscalar; super-pipeline; VLIW, speculation, predication; SMT; 64-bit data; dual core; multicore; manycore; with a "where we are" marker]

What is Pipelining?
Dave Patterson's laundry example: 4 people doing laundry; wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency.
In this example:
- Sequential execution takes 4 * 90 min = 6 hours
- Pipelined execution takes 30 + 4 * 40 + 20 min = 3.5 hours
- Bandwidth = loads/hour
- BW = 4/6 loads/hour without pipelining
- BW = 4/3.5 loads/hour with pipelining
- Pipelining helps bandwidth but not latency (90 min)
- Bandwidth is limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
[Figure: timeline from 6 PM to 9 PM for tasks A, B, C, D in task order, with segment times of 30, 40, 40, 40, 40, 20 minutes]

VLIW

Means of Increasing Performance
- Execution optimization
  - More powerful instructions
  - Execution optimization (pipelining, branch prediction, execution of multiple instructions, reordering the instruction stream, etc.)
- The gain from exploiting ILP is diminishing
- The inherent barriers ILP needs to tackle: control dependence, data dependence, ...

Means of Increasing Performance
- What is next?
  - Need to feed the processor with TLP
  - Here the problem is essentially the same as parallel programming
- Technologies for TLP
  - Simultaneous multithreading (SMT); example: Intel Hyper-Threading
  - Chip multiprocessing (CMP): the multi-core processor

Micro-architecture Trends
Adapted from Johan De Gelas, Quest for More Processing Power, AnandTech, Feb. 8, 2005.

Understanding SMT and CMP
Make clear: concurrency vs. parallelism
- Concurrency: two or more threads are in progress at the same time

- Parallelism: two or more threads are executing at the same time; multiple cores are needed
[Figure: timelines of Thread 1 and Thread 2 for each case]

Simultaneous Multithreading (SMT)
- Minimal resource replication
- Provides instructions to overlap memory latency
- Separate threads exploit idle resources
[Figure: Context 1 and Context 2 sharing the functional units, L1 cache, L2 cache, ..., main memory]

SMT: simultaneous multithreading
[Figure: issue slots for superscalar, multithreaded, and SMT execution]

Go to the era of Multicore
- Concurrency in the form of hardware multithreading has been around for a while.
  - Useful for hiding memory latencies.
  - Only about 30% performance improvement for special applications.
- How can we continue to utilize the ever-higher transistor densities predicted by Moore's Law?
- Current view: performance improvements can continue by packing multiple processing cores onto a single chip, i.e., multicore.
- Multi-core == Chip Multiprocessing == Tera-scale Computing

Chip Multiprocessing
- Much larger degree of resource replication
- Two complete processing cores on each chip
- Outer levels of cache and the external interface are shared
- Greatly reduced resource contention compared to SMT
[Figure: Context 1 and Context 2, each with its own functional units and L1 cache, sharing the L2 cache, ..., main memory]

What do we benefit from Multi-Core?
- New target for micro-architecture: high performance per unit of power

Multi-Core Processors
- Improved cost/performance ratio
  - Minimal increases in architectural complexity provide significant increases in performance
  - Minimizes performance stalls, with a dramatic increase in overall effective system performance
- Greater EEP (energy-efficient performance) and scalability
- Cores enable thread-level parallelism
  - Multi-core architecture enables a divide-and-conquer strategy to perform more work in a given clock cycle.

Multi-Core Processors (cont.)
What is special about many-cores?
- Explicit multithreading is required to speed up single-application performance
- Core-to-core communication
  - Latency is reduced
  - Bandwidth is increased
- Cache size per core will also be reduced

Multi-Core Processors (cont.)

Latency test on Intel Clovertown

What is the problem? Where is the innovation?
- How about the core?
  - Equal to the original one or not?
  - A simple core may be a good choice
- How about on-chip power control?
  - Fine-granularity power control
- How about the interconnection between cores and other units?
  - X cores means X times the memory references
  - Requires higher throughput between cores and caches, within the cache hierarchy, and between the last-level cache and memory
  - Requires lower latencies in those places
  - Four basic kinds of interconnects
    - Buses, crossbars, tiny networks, and rings
    - Each has its own tradeoffs in throughput, latency, resource occupation, and ease of implementation
    - May be suitable at different levels
[Figure: processors (P) attached to MEM/$ hierarchies]

What is the problem? Where is the innovation?
- How about the cache? (NUCA: non-uniform cache architecture)
A NUCA Substrate for Flexible CMP Cache Sharing, Proc. the 19th Annual International Conference on Supercomputing, June 2005, pp. 31-40

Problems of multicore processors
- A multicore processor is in effect an on-chip parallel system
  - Hierarchical
  - Distributed
- Speeding up a single application requires explicit multithreading
- For software technology, the core problem of multicore systems is the development of parallel programs, including programming and debugging them: the software challenge of multicore processors

What is the problem? Where is the innovation?
- Where are the threads? Perhaps the largest challenge
  - Make programmers write threaded programs
    - The world may be confused.
  - Automatic parallelism
    - Mission impossible, but it can improve in some sense.
  - Provide modules that already use threading
- How to control the high-level behavior of our programs?
  - Try to ease the burden on the programmer
  - Looks good, but how?

How to meet the software challenge on multicore
- Let programmers do parallel programming
  - Inherit and optimize OpenMP, MPI, etc.
  - New programming languages such as X10
  - Transactional memory
- Automatic parallelization
  - Very difficult; after 20 years of development it is still not general enough
  - Speculative multi-threading
- Implement parallel libraries
  - Intel MKL, ScaLAPACK
- How to control the high-level behavior of programs?
- Other valuable work
  - Functional languages, dataflow, domain-specific languages

All of the above are still open issues

Break!

Multicore Products Nowadays
- Lots of dual-core products now:
  - Intel: Pentium D and Pentium Extreme Edition, Core Duo (2), Woodcrest, Montecito
  - IBM PowerPC
  - AMD Opteron / Athlon 64
  - Sun UltraSPARC IV
- Systems with more than two cores are here, with more coming:
  - IBM Cell (asymmetric): dual-core PowerPC plus eight "synergistic processing elements"
  - Sun Niagara: eight cores, four hyper-threaded threads per core
  - General-Purpose Computation on Graphics Processors (GPGPU)
  - Intel expects to produce 16- or even 32-core chips within a decade.

Architecture of Dual-Core Chips
Intel Core Duo
- Two physical cores in a package
- Each with its own execution resources
- Each with its own L1 cache: 32K instruction and 32K data
- Both cores share the L2 cache
  - 2 MB, 8-way set associative; 64-byte line size
  - 10 clock cycles latency; write-back update policy
[Figure: two execution cores, each with an FP unit and an L1 cache, sharing the L2 cache and the system bus (667 MHz, 5333 MB/s)]
AMD Opteron
- Separate 1 Mbyte L2 caches
- Improvement for memory affinity and thread affinity

Intel Multi-core Plan

Cell from IBM and Sony

Niagara from SUN

The technologies underway
- Rethink the concurrency and parallelism for multi-core
- New programming models and programming languages
- Hardware support (and software) for multithreading
  - Control-driven speculation
    - Speculative multithreading
  - Data-driven speculation
    - Program demultiplexing
  - Architectural thread enhancement
    - Support for hardware threads
    - Lightweight synchronization (monitor/mwait)

Rethink the C and P for multi-core
What we have seen for multi-core:
- More parallelism needs to be exploited
  - Scaling may be more important
- More heterogeneity needs to be exploited
  - Task mapping may be revisited
- Low latency and high bandwidth between cores on chip
  - Fine-granularity parallelism may be rethought

Rethink the C and P for multi-core
- Make full use of multi-core resources
  - More parallelism
  - Hide memory access stalls: the well-known Memory Wall

Index computation test on Clovertown
- Index computation is an application that is both compute-intensive and I/O-intensive
- 32 GB of web page data; the generated index is 4.5 GB

  Stage                                            Main resource used
  Read document data                               Disk
  Word segmentation (Chinese, Japanese, Korean)    CPU
  Parse documents                                  CPU
  Build in-memory index                            CPU, memory
  Write on-disk index                              Disk

Index computation test on Clovertown (cont.)
- Some indexing stages are dominated by computation, others by I/O
- Consider dividing the indexing process into multiple pipeline stages and implementing a pipelined indexing algorithm, to make full use of the system's computing resources
- Principles for dividing the pipeline stages
  - Resource independence: each pipeline stage uses its own independent resources
  - Similar duration: the pipeline stages take roughly the same time
- Fine-grained pipelined algorithm: overlap the execution of pipeline stages to achieve parallelism

Intel Clovertown test environment
  CPU        Genuine(R) Intel(R) 2.66 GHz quad-core, dual processors
  Disk       Ultra320 SCSI disk
  Memory     6 GB
  OS         Red Hat Linux (2.6.5-1.358smp)
  Compiler   gcc version 3.3.3

Index computation test on Clovertown (cont.)
- Performance improvement on a single core
  - The pipeline hides part of the document-read I/O time
- Performance improvement with multiple cores
  - Computation is parallelized
- Chart labels: performance improved by 8.2%; performance improved by 53.4%
- During the test, 1.5 GB of memory was used, and the data under test and the index were on the same disk

Rethink the C and P for multi-core
- Processor affinity benefits task mapping
  - Parallel FFT computation in NPB gets a 14% performance increase with MPICH

Rethink the C and P for multi-core
- Exploit dynamic and adaptive out-of-order execution patterns on multi-core and heterogeneous systems

The technologies underway
- Rethink the concurrency and parallelism for multi-core
- New programming models and programming languages
- Hardware support (and software) for multithreading
  - Control-driven speculation
    - Speculative multithreading
  - Data-driven speculation
    - Program demultiplexing
  - Architectural thread enhancement
    - Support for hardware threads
    - Lightweight synchronization (monitor/mwait)

Programming Models and PLs
- Bridge the application software to the system software and hardware, to better express the parallelism for such heterogeneous systems
  - Transactional memory
  - IBM X10
  - Sun Fortress
- Other meaningful explorations
  - Functional languages
  - Dataflow
  - Domain-specific languages

Transactional memory: a way to ease thread programming
- Thread programming is a boring thing

Transactional memory: a way to ease thread programming
- A transaction is a sequence of memory loads and stores that either commits or aborts
- If a transaction commits, all the loads and stores appear to have executed atomically
- If a transaction aborts, none of its stores take effect
- Transaction operations aren't visible until they commit or abort
- Simplified version of traditional ACID database transactions (no durability, for example)

Transactional memory example

Problems in Transactional Memory

Solutions for Transactional Memory

X10
- Unified support for multicore systems and cluster systems
- High-productivity language design
- Emphasis on portability and safety
- Performance
  - Extends the Java virtual machine
  - Provides means for manual performance tuning
- Developed on the basis of the Java language
  - Inherits the core values of Java: high productivity, portability, maturity, safety
  - Aimed at mainstream Java/C/C++ programmers

X10 Vision: Portable Productive Parallel Programming
- The X10 language defines the mapping from X10 objects and activities to X10 places
- An X10 deployment defines the mapping from virtual X10 places to physical processing elements
[Figure: X10 data structures, X10 places, physical PEs]

Dynamic parallelism with a Partitioned Global Address Space
- Places encapsulate the binding of activities and globally addressable data
- async (P) S: run statement S asynchronously at place P
- finish S: execute statement S, and wait for descendant asyncs to terminate
- atomic S: execute statement S atomically
  - No place-remote accesses are permitted in an atomic section
- Storage classes: activity-local, place-local, partitioned global, immutable
- Deadlock safety: any X10 program written with async, atomic, and finish can never deadlock

X10 program example

    // X10 pseudocode
    main() {  // implicit finish
        Activity A0 (Part 1);
        async { A1; async A2; }
        try {
            finish {
                Activity A0 (Part 2);
                async A3;
                async A4;
            }
        } catch (…) { … }
        Activity A0 (Part 3);
    }

[Figure: activity graph showing A0 (Parts 1 to 3), A1 to A4, the async and finish edges, and an IndexOutOfBounds exception]

The technologies underway
- Rethink the concurrency and parallelism for multi-core
- New programming models and programming languages
- Hardware support (and software) for multithreading
  - Control-driven speculation
    - Speculative multithreading
  - Data-driven speculation
    - Program demultiplexing
  - Architectural thread enhancement
    - Support for hardware threads
    - Lightweight synchronization (monitor/mwait)

Speculative multithreading
[Figure: original program execution (A, B, C in sequence over time) compared with speculative parallel thread (SPT) execution, in which the main thread spawns a speculative thread, the speculative thread executes speculatively, and its speculative results are then committed]
- Speculative threading for memory dependences
- Speculative threading for values with a pre-computation slice

Problems in Speculative multithreading
- Locate the sections of the program that can efficiently be executed in parallel
  - Pre-computation slice has low computational overhead
  - Workload balance
  - Low overhead for the pre-computation slice
- Buffering and multi-versioning in the memory hierarchy
  - Buffering keeps the speculative state until the thread is verified and can be committed
  - Multi-versioning allows each variable to have a different value for each of the threads running in parallel
- Check data dependence mis-speculations quickly

Summary for current trends
- To manycore
- Hardware support for multithreading
- Transactional memory
  - Hard to write fast threaded programs
  - Locks create fundamental problems
  - Transactional memory shields programmers
  - Hardware speeds up transactional memory
- Energy-efficient design

Add more axes to the micro-architecture evolution
- Reliability
  - Hardware failures will become more and more frequent as feature sizes continue to shrink. For example, 10 out of 1000 cores might be non-functional, and another 10 might produce incorrect results...
  - Use multi-core to do redundant computation
- And others?

Today's Conclusion
- It's an age of cores and threads
- There are many challenges
- But we will have more opportunities for innovation than we have ever had before
- Multi-core processor implementation (inherent parallelism) has a significant impact on software applications
  - The full potential is harnessed by programs that migrate to a threaded software model
  - Efficient use of threads (kernel or system/user threads) is KEY to dramatically increasing effective system performance

Further Reading Materials
- Go to Google
- General reading
  - The Landscape of Parallel Computing Research: A View from Berkeley
  - Intel Developer Forum, spring 2006, Beijing
  - ://intel/multi-core/docs.htm
- Special interest
  - ISCA 05 panel on multi-core
  - Clock Rate versus IPC: The End of the Road for Conventional Micro-architectures, ISCA 2000 (27th)
  - The impact of multi-core on math software
- NUCA
  - C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.
  - A NUCA Substrate for Flexible CMP Cache Sharing, Proc. the 19th Annual International Conference on Supercomputing, June 2005
- ://www-128.ibm/developerworks/power/cell

About our laboratory
- Research direction
  - Research related to multicore architecture, focusing on multicore processor simulation, memory consistency, and virtualization technology
- Staff
  - 1 associate professor, 2 PhD students, 5 master's students
- Current projects
  - Research on parallel simulation of chip multiprocessors on multicore platforms (National 863 Program)
    - Significance: addresses the problem that, in the multicore era, simulation time grows super-linearly as the number of on-chip processor cores increases
    - Research content (two years): the structure of a parallel CMP simulator, its synchronization mechanism, its shared-memory simulation scheme, and its performance evaluation scheme
  - Hypervisor code deconstruction (Huawei; sub-project of the National 863 "minicomputer" major special project)

Thank you
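
Worked example for the wire-delay slide. As a rough check of that slide's arithmetic, the Python sketch below recomputes the distance bound r < c / 10^12 and the area available per byte when 1 Tbyte is packed into an r x r square. The function and variable names are illustrative and do not come from the slides. It assumes, as the slide does, one memory access per cycle at 10^12 cycles per second and signals that travel at the vacuum speed of light.

```python
# Hypothetical helper names; only the constants come from the slide.
SPEED_OF_LIGHT_M_PER_S = 3e8   # c, rounded as on the slide
ANGSTROM_M = 1e-10             # 1 Angstrom in metres


def max_distance_m(accesses_per_second: float) -> float:
    """Farthest distance a signal can cover in one access period at light speed."""
    return SPEED_OF_LIGHT_M_PER_S / accesses_per_second


def area_per_item_sq_angstrom(side_m: float, items: float) -> float:
    """Area of a side_m x side_m square shared evenly by `items`, in square Angstroms."""
    return (side_m / ANGSTROM_M) ** 2 / items


r = max_distance_m(1e12)                    # 1 Tflop/s, one operand per cycle
area = area_per_item_sq_angstrom(r, 1e12)   # 1 Tbyte stored in an r x r square

print(f"r = {r * 1e3:.1f} mm")
print(f"area per byte = {area:.1f} square Angstroms")
```

Under these assumptions the sketch gives r = 0.3 mm and roughly 9 square Angstroms per byte, that is, a cell about 3 Angstroms on a side, which is the atomic scale the slide refers to.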
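
Worked example for the "What is Pipelining?" slide. The laundry numbers can be reproduced with a small Python sketch. The function names are hypothetical. The model assumes that each stage (wash, dry, fold) handles one load at a time and that a finished load can wait between stages without blocking the stage before it.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """A load enters a stage once it has left the previous stage and the stage is free."""
    stage_free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        load_ready_at = 0
        for stage, minutes in enumerate(stage_minutes):
            start = max(load_ready_at, stage_free_at[stage])
            stage_free_at[stage] = start + minutes
            load_ready_at = stage_free_at[stage]
    return stage_free_at[-1]


stages = [30, 40, 20]   # wash, dry, fold, in minutes (from the slide)
loads = 4               # four people, one load each

seq = sequential_minutes(stages, loads)
pipe = pipelined_minutes(stages, loads)
print(f"sequential: {seq / 60:.1f} h, {loads / (seq / 60):.2f} loads/h")
print(f"pipelined:  {pipe / 60:.1f} h, {loads / (pipe / 60):.2f} loads/h")
```

With the slide's stage times this yields 6.0 hours sequentially and 3.5 hours pipelined. The pipelined total is set by the 40-minute dryer, the slowest stage, while each individual load still takes 90 minutes or more from start to finish.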
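
Worked example for the transactional memory slides. The commit and abort rules listed there can be illustrated with a toy software model in Python. This is only a sketch under stated assumptions: it uses optimistic versioning with a single commit lock, every class and method name is hypothetical, and it represents neither a hardware transactional memory design nor any system named in the slides.

```python
import threading


class TransactionAborted(Exception):
    """Raised by commit() when validation fails; none of the stores took effect."""


class SharedMemory:
    """Toy shared memory mapping each name to a (value, version) pair."""

    def __init__(self, initial):
        self._cells = {name: (value, 0) for name, value in initial.items()}
        self._commit_lock = threading.Lock()

    def begin(self):
        return Transaction(self)

    def read_committed(self, name):
        return self._cells[name][0]


class Transaction:
    def __init__(self, memory):
        self._memory = memory
        self._read_versions = {}   # name -> version observed at first load
        self._write_buffer = {}    # name -> value, invisible until commit

    def load(self, name):
        if name in self._write_buffer:           # see our own buffered store
            return self._write_buffer[name]
        value, version = self._memory._cells[name]
        self._read_versions.setdefault(name, version)
        return value

    def store(self, name, value):
        self._write_buffer[name] = value         # buffered, not yet visible

    def commit(self):
        with self._memory._commit_lock:
            # Validate: abort if anything we read has been overwritten since.
            for name, seen in self._read_versions.items():
                if self._memory._cells[name][1] != seen:
                    self._write_buffer.clear()
                    raise TransactionAborted(name)
            # Publish every buffered store while holding the commit lock.
            for name, value in self._write_buffer.items():
                _, version = self._memory._cells.get(name, (None, -1))
                self._memory._cells[name] = (value, version + 1)


memory = SharedMemory({"a": 100, "b": 0})
t1, t2 = memory.begin(), memory.begin()

t1.store("a", t1.load("a") - 30)   # move 30 from a to b
t1.store("b", t1.load("b") + 30)
t2.store("a", t2.load("a") - 50)   # conflicting update based on the old value of a

t1.commit()
try:
    t2.commit()
    t2_aborted = False
except TransactionAborted:
    t2_aborted = True

print(memory.read_committed("a"), memory.read_committed("b"), t2_aborted)
```

In this run the first transaction commits both of its stores together, and the second aborts because the value of `a` it read is stale, so its store never becomes visible. A caller would normally retry an aborted transaction from the beginning.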

