CUDA Programming (GPU Programming)

Instructor: Zhang Weizhe (張偉哲)
Computer Network and Information Security Technique Research Center, School of Computer Science and Technology, Harbin Institute of Technology

Outline
- Motivation
- GPU Architecture
- Three ways to accelerate applications
- CUDA Programming Model
- CUDA Programming Basics

Motivation

A simple example: three integer arrays A[N], B[N], and C[N]. We want to calculate

    C[i] = A[i] + B[i]
Traditionally, on the CPU (serial):

    for (i = 0; i < N; ++i)
        C[i] = A[i] + B[i];

    T(N) = O(N)

Traditionally, on the CPU (parallel): create N threads, each computing one element:

    C[threadid.i] = A[threadid.i] + B[threadid.i];

    T(N) = O(1)

GPU computing: but there is a problem. Applications like this need thousands of threads to execute, far more than a CPU can run efficiently.

GPU Architecture

A simple comparison between CPU and GPU (figure not reproduced): a CPU devotes most of its transistors to control logic and cache for a few powerful cores, while a GPU devotes them to arithmetic units, supporting many simple cores and far more concurrent threads.
A detailed description: a GPU is organized into Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), and Streaming Multiprocessors (SMs). For details, see the Pascal architecture whitepaper.

Three ways to accelerate applications

Three methods: CUDA-optimized libraries, OpenACC, and programming languages.

1. CUDA-optimized libraries: these libraries are written in CUDA. Simply replace your standard library functions with the corresponding CUDA library calls. Many math libraries are supported, but not all; a supported list can be found at /gpu-accelerated-libraries.
2. OpenACC: a directive-based programming model. You insert directives into your code and use an OpenACC compiler to compile it.
3. Programming languages: write the parallel code yourself, e.g. in CUDA C, which is the subject of the rest of this lecture.

CUDA Programming Model
- Prerequisites for CUDA programming
- CUDA execution flow
- CUDA threads
- CUDA memory model

Prerequisites for CUDA programming

Hardware. An Nvidia graphics card. It can be a specialized computing card, like a Tesla Pascal GP100 (too expensive), or a normal gaming graphics card, like a GT or GTX. To check whether your GPU supports CUDA, see the list at /cuda-gpus.
Software. CUDA Toolkit: it is supported on Windows, Mac, and most standard Linux distributions. Download it from https:///cuda-toolkit. Visual Studio (if on Windows): as far as I know, VS is the only IDE that can work with CUDA on Windows. If you don't want to install VS, you can use the CUDA compiler nvcc directly from the command line.

CUDA execution flow

A CUDA application alternates between serial code, which runs on the host (CPU), and parallel code, which runs on the device (GPU):
1. Copy data from CPU memory to GPU memory.
2. Instruct the GPU to start computing.
3. Copy the results back to CPU memory.

CUDA threads

The parallel portion of an application:

    float x = in[threadIdx.x];
    float y = func(x);
    out[threadIdx.x] = y;
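Putting the three execution-flow steps together with such a kernel, a complete vector-add program might look like the following. This is a sketch, not code from the slides; it assumes a CUDA-capable GPU and the CUDA runtime, and it omits error checking for brevity:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

// Kernel: each thread adds one pair of elements.
__global__ void add(const int *a, const int *b, int *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

int main(void) {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    size_t size = N * sizeof(int);

    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // 1. Copy data from CPU memory to GPU memory.
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. Instruct the GPU to start computing: 4 blocks of 256 threads.
    add<<<N / 256, 256>>>(d_a, d_b, d_c);

    // 3. Copy the results back to CPU memory.
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    printf("c[100] = %d\n", c[100]);  // expect 300

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Compile with `nvcc`; the block and grid sizes are illustrative choices, not requirements.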
A kernel is a function executed on the GPU as an array of threads in parallel, and it can be called from the CPU. All threads execute the same code but can take different paths. Each thread has an ID.

Threads are grouped into warps of 32 threads. A warp is the minimum unit of data processed in SIMD fashion by a CUDA multiprocessor. Thread IDs within a warp are consecutive and increasing, and the warp is the unit of thread scheduling in SMs.

One or more warps are grouped into blocks. A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and by synchronizing their execution. A block can contain at most 1024 threads because of hardware resource limits. Thread IDs are unique within a block and start from zero.

A kernel is executed as a grid of blocks. The blocks of a grid are distributed over the available SMs: on a device with 2 SMs, eight blocks are scheduled four per SM, while on a device with 4 SMs the same eight blocks are scheduled two per SM (e.g. SM0 runs blocks 0 and 4). The same grid therefore scales automatically across devices with different numbers of SMs.

All threads within a warp must execute the same instruction at any given time, which creates a problem: branch divergence. Example with divergence:

    if (threadIdx.x > 2) {...}

This creates two different control paths for threads in a block; threads of the same warp that take different paths must be executed serially, one path after the other. The same happens with:

    if (threadIdx.x % 2 == 0) {...} else {...}
To avoid divergence, make the branch granularity a whole multiple of the warp size:

    if (threadIdx.x / WARP_SIZE == 0) {...} else {...}

This also creates two different control paths for threads in a block, but all threads in any given warp follow the same path, so no warp diverges. Let the whole warp do the same work.

CUDA Memory Model

Global memory and syntax

Global memory is the "main" memory of the GPU. It has global scope and the lifetime of the allocating program (or until cudaFree is called). Global memory is similar to the heap in a C program.

Global memory syntax:
- Allocate with cudaMalloc(void **devPtr, size_t size)
- Free with cudaFree(void *devPtr)

    int blk_sz = 64;
    float *Md;
    int size = blk_sz * blk_sz * sizeof(float);
    cudaMalloc((void **)&Md, size);
    ...
    cudaFree(Md);

Host-device data transfer

cudaMemcpy() performs memory data transfer. It requires four parameters:
- pointer to destination
- pointer to source
- number of bytes copied
- type/direction of transfer: host to host, host to device, device to host, or device to device

Transfer to the device is asynchronous: cudaMemcpy() may return before the copy to the device has completed.

    cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);

The two sides differ greatly in bandwidth: PCI-E between CPU memory and the GPU runs at about 8 GB/s, while the GPU's GDDR5 memory runs at about 190 GB/s.

Constant memory and syntax

Constant memory is a form of virtual addressing of global memory. Special properties:
- cached and read-only
- supports broadcasting a single value to all the threads within a warp
- restricted to 64 KB (kernel arguments are passed through constant memory)

Constant memory syntax: in global scope (outside of any kernel, at the top level of the program):

    __constant__ int foo[1024];

In host code:

    cudaMemcpyToSymbol(foo, h_src, sizeof(int) * 1024);

Texture memory

Texture memory is complicated and only marginally useful for general-purpose computation. Useful characteristics:
- 2D or 3D data locality for caching purposes through "CUDA arrays"; accesses go into a special texture cache
- fast interpolation on 1D, 2D, or 3D arrays
- converting integers to "unitized" floating-point numbers

It is a complex topic; you can learn everything you want to know about it from the CUDA Handbook.

Shared memory and syntax

Shared memory is used to exchange data between CUDA threads within a block. It is very fast memory located in the SM: on-chip, low-latency, a user-controlled L1 cache.

Shared memory syntax:
- Static allocation (declaration in the kernel, nothing in host code):

    __shared__ float data[1024];

- Dynamic allocation. Host:
    kernel<<<grid, block, numBytesShMem>>>(arg);

  Device (in kernel):

    extern __shared__ float s[];

Remember: SM = streaming multiprocessor; SM ≠ shared memory.

Computational intensity

Computational intensity = FLOPS / IO.
- Matrix multiplication: n^3 / n^2 = n
- N-body simulation: n^2 / n = n

If the computational intensity is greater than 1, then the same data is used in more than one computation. Do as few global loads and as many shared loads as possible.

Registers and local memory

Registers:
- the fastest "memory" possible, about 10x faster than shared memory
- most stack variables declared in kernels are stored in registers (example: float x)
- statically indexed arrays stored on the stack are sometimes put in registers

Local memory:
- local memory is everything on the stack that can't fit in registers
- the scope of local memory is just the thread
- local memory is stored in global memory (much slower than registers!)

Both are non-programmable: they are managed by the compiler, not the programmer.

CUDA memory model summary

    Memory    | Managed by | Location      | GPU scope / access      | CPU access  | Lifetime
    Registers | Compiler   | On-chip       | Per thread              | Not visible | Lifetime of a thread
    Local     | Compiler   | Device memory | Per thread              | Not visible | Lifetime of a thread
    Shared    | Programmer | On-chip       | Per block               | Not visible | Block lifetime
    Global    | Programmer | Device memory | All threads, read/write | Read/write  | Application, or until explicitly freed
    Constant  | Programmer | Device memory | All threads, read-only  | Read/write  | Application
    Texture   | Programmer | Device memory | All threads, read-only  | Read/write  | Application

CUDA Programming Basics
- Parallel programming in CUDA C
- Shared memory and synchronization
- Constant memory and events
- Texture memory

Parallel programming in CUDA C

The first example:
Summing vectors. Imagine having two lists of numbers, where we want to sum corresponding elements of each list and store the result in a third list. (On a multicore CPU, core 1 and core 2 could each sum half of the elements.)

    __host__ __device__ cudaError_t cudaMalloc(void **devPtr, size_t size)

__host__ and __device__ are type qualifiers, which mean this function can be called on the device or on the host. All CUDA functions return an error code as their return value. Unlike C's malloc function, this function takes a pointer to a pointer as a parameter.

Execution configuration
    <<<Dg, Db, Ns, S>>>

- Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched.
- Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block.
- Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block for this call, in addition to the statically allocated memory. It is an optional argument which defaults to 0.
- S is of type cudaStream_t and specifies the associated stream. It is an optional argument which defaults to 0.

    __host__ cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)

Copies data between host and device. cudaMemcpyKind specifies the direction of the copy; it is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice.

__global__ functions must have a void return type. Any call to a __global__ function must specify its execution configuration. A call to a __global__ function is asynchronous, meaning it returns before the device has completed its execution.

blockIdx.x contains the block index within the grid; threadIdx.x contains the thread index within the block.

Shared memory and synchronization

Threads within a block can communicate with each other through shared memory. __shared__ is used to make a variable resident in shared memory. If two threads want to write the same shared variable, there must be a synchronization between the two threads: either A writes after B, or B writes after A. Let's take a look at an example that uses these features.

Dot product. Each thread multiplies a pair of corresponding entries, and then every thread moves on to its next pair. Because the result needs to be the sum of all these pairwise products, each thread keeps a running sum of the pairs it has added. Suppose we have N threads and the array size is N * M; each thread then accumulates M products, and the per-thread sums are combined through shared memory. (The kernel code shown on the original slides is not reproduced here.)

Constant memory and events

Constant memory is used for data that will not change over the course of a kernel execution. Nvidia hardware provides 64 KB of constant memory that it treats differently than standard global memory.

Problem description: produce an image of a three-dimensional scene. A simple idea: imagine light rays going through the objects and casting a shadow on a plate. Since light can come from any place at any point in our scene, it turns out to be easier to work backward: each pixel in the image shoots a ray into the scene. We figure out what color is seen by each pixel by tracing a ray from the pixel in question through the scene until it hits one of our objects. We then say that the pixel "sees" this object and can assign its color based on the color of the object it sees. Most of the computation required by ray tracing is the computation of these intersections of the ray with the objects in the scene.

What will the ray tracer do?
- It fires a ray from each pixel.
- It computes which rays hit which spheres, and the depth of each of these hits. In the case where a ray passes through multiple spheres, only the sphere closest to the camera can be seen.
- We model our spheres with a data structure that stores the sphere's center coordinates (x, y, z), its radius, and its color (r, g, b).

Constant memory is always declared at file scope (as a global variable).

Texture memory

Like constant memory, texture memory is another variety of read-only memory. It is cached on chip, so it can provide higher effective bandwidth. It is designed for graphics applications where memory access patterns exhibit a great deal of spatial locality.

Spatial locality: in a computing application, this roughly implies that a thread is likely to read from an address near the addresses that nearby threads read, as shown in the original figure. For a CPU caching scheme, the four addresses shown are not consecutive and would not be cached together.