CUDA Programming (GPU Programming)

Instructor: Zhang Weizhe (張偉哲)
Computer Network and Information Security Technique Research Center, School of Computer Science and Technology, Harbin Institute of Technology

Outline
- Motivation
- GPU Architecture
- Three ways to accelerate applications
- CUDA Programming Model
- CUDA Programming Basics

Motivation

A simple example: three integer arrays A[N], B[N], and C[N]. We want to calculate

    C[i] = A[i] + B[i]
Traditionally, on the CPU (serial):

    for (i = 0; i < N; ++i)
        C[i] = A[i] + B[i];

    T(N) = O(N)

Traditionally, on the CPU (parallel): create N threads, each computing one element:

    C[threadid.i] = A[threadid.i] + B[threadid.i];

    T(N) = O(1)

GPU computing: but there is a problem. Applications like this need thousands of threads to execute, far more than a CPU can run efficiently.

GPU Architecture

A simple comparison between CPU and GPU (figure not reproduced): a CPU devotes most of its transistors to control logic and cache for a few powerful cores, while a GPU devotes them to arithmetic units, supporting many simple cores and far more concurrent threads.
A detailed description: a GPU is organized into Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), and Streaming Multiprocessors (SMs). For details, see the Pascal architecture whitepaper.

Three ways to accelerate applications

Three methods: CUDA-optimized libraries, OpenACC, and programming languages.

1. CUDA-optimized libraries: these libraries are written in CUDA. Simply replace your standard library functions with the corresponding CUDA library calls. Many math libraries are supported, but not all; a supported list can be found at /gpu-accelerated-libraries.
2. OpenACC: a directive-based programming model. You insert directives into your code and use an OpenACC compiler to compile it.
3. Programming languages: write the parallel code yourself, e.g. in CUDA C, which is the subject of the rest of this lecture.

CUDA Programming Model
- Prerequisites for CUDA programming
- CUDA execution flow
- CUDA threads
- CUDA memory model

Prerequisites for CUDA programming

Hardware. An Nvidia graphics card. It can be a specialized computing card, like a Tesla Pascal GP100 (too expensive), or a normal gaming graphics card, like a GT or GTX. To check whether your GPU supports CUDA, see the list at /cuda-gpus.
Software. CUDA Toolkit: it is supported on Windows, Mac, and most standard Linux distributions. Download it from https:///cuda-toolkit. Visual Studio (if on Windows): as far as I know, VS is the only IDE that can work with CUDA on Windows. If you don't want to install VS, you can use the CUDA compiler nvcc directly from the command line.

CUDA execution flow

A CUDA application alternates between serial code, which runs on the host (CPU), and parallel code, which runs on the device (GPU):
1. Copy data from CPU memory to GPU memory.
2. Instruct the GPU to start computing.
3. Copy the results back to CPU memory.

CUDA threads

The parallel portion of an application:

    float x = in[threadIdx.x];
    float y = func(x);
    out[threadIdx.x] = y;
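Putting the three execution-flow steps together with such a kernel, a complete vector-add program might look like the following. This is a sketch, not code from the slides; it assumes a CUDA-capable GPU and the CUDA runtime, and it omits error checking for brevity:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

// Kernel: each thread adds one pair of elements.
__global__ void add(const int *a, const int *b, int *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        c[i] = a[i] + b[i];
}

int main(void) {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    size_t size = N * sizeof(int);

    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // 1. Copy data from CPU memory to GPU memory.
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // 2. Instruct the GPU to start computing: 4 blocks of 256 threads.
    add<<<N / 256, 256>>>(d_a, d_b, d_c);

    // 3. Copy the results back to CPU memory.
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    printf("c[100] = %d\n", c[100]);  // expect 300

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Compile with `nvcc`; the block and grid sizes are illustrative choices, not requirements.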
A kernel is a function executed on the GPU as an array of threads in parallel, and it can be called from the CPU. All threads execute the same code but can take different paths. Each thread has an ID.

Threads are grouped into warps of 32 threads. A warp is the minimum unit of data processed in SIMD fashion by a CUDA multiprocessor. Thread IDs within a warp are consecutive and increasing, and the warp is the unit of thread scheduling in SMs.

One or more warps are grouped into blocks. A thread block is a batch of threads that can cooperate with each other by sharing data through shared memory and by synchronizing their execution. A block can contain at most 1024 threads because of hardware resource limits. Thread IDs are unique within a block and start from zero.

A kernel is executed as a grid of blocks. The blocks of a grid are distributed over the available SMs: on a device with 2 SMs, eight blocks are scheduled four per SM, while on a device with 4 SMs the same eight blocks are scheduled two per SM (e.g. SM0 runs blocks 0 and 4). The same grid therefore scales automatically across devices with different numbers of SMs.

All threads within a warp must execute the same instruction at any given time, which creates a problem: branch divergence. Example with divergence:

    if (threadIdx.x > 2) {...}

This creates two different control paths for threads in a block; threads of the same warp that take different paths must be executed serially, one path after the other. The same happens with:

    if (threadIdx.x % 2 == 0) {...} else {...}
To avoid divergence, make the branch granularity a whole multiple of the warp size:

    if (threadIdx.x / WARP_SIZE == 0) {...} else {...}

This also creates two different control paths for threads in a block, but all threads in any given warp follow the same path, so no warp diverges. Let the whole warp do the same work.

CUDA Memory Model

Global memory and syntax

Global memory is the "main" memory of the GPU. It has global scope and the lifetime of the allocating program (or until cudaFree is called). Global memory is similar to the heap in a C program.

Global memory syntax:
- Allocate with cudaMalloc(void **devPtr, size_t size)
- Free with cudaFree(void *devPtr)

    int blk_sz = 64;
    float *Md;
    int size = blk_sz * blk_sz * sizeof(float);
    cudaMalloc((void **)&Md, size);
    ...
    cudaFree(Md);

Host-device data transfer

cudaMemcpy() performs memory data transfer. It requires four parameters:
- pointer to destination
- pointer to source
- number of bytes copied
- type/direction of transfer: host to host, host to device, device to host, or device to device

Transfer to the device is asynchronous: cudaMemcpy() may return before the copy to the device has completed.

    cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);

The two sides differ greatly in bandwidth: PCI-E between CPU memory and the GPU runs at about 8 GB/s, while the GPU's GDDR5 memory runs at about 190 GB/s.

Constant memory and syntax

Constant memory is a form of virtual addressing of global memory. Special properties:
- cached and read-only
- supports broadcasting a single value to all the threads within a warp
- restricted to 64 KB (kernel arguments are passed through constant memory)

Constant memory syntax: in global scope (outside of any kernel, at the top level of the program):

    __constant__ int foo[1024];

In host code:

    cudaMemcpyToSymbol(foo, h_src, sizeof(int) * 1024);

Texture memory

Texture memory is complicated and only marginally useful for general-purpose computation. Useful characteristics:
- 2D or 3D data locality for caching purposes through "CUDA arrays"; accesses go into a special texture cache
- fast interpolation on 1D, 2D, or 3D arrays
- converting integers to "unitized" floating-point numbers

It is a complex topic; you can learn everything you want to know about it from the CUDA Handbook.

Shared memory and syntax

Shared memory is used to exchange data between CUDA threads within a block. It is very fast memory located in the SM: on-chip, low-latency, a user-controlled L1 cache.

Shared memory syntax:
- Static allocation (declaration in the kernel, nothing in host code):

    __shared__ float data[1024];

- Dynamic allocation. Host:
    kernel<<<grid, block, numBytesShMem>>>(arg);

  Device (in kernel):

    extern __shared__ float s[];

Remember: SM = streaming multiprocessor; SM ≠ shared memory.

Computational intensity

Computational intensity = FLOPS / IO.
- Matrix multiplication: n^3 / n^2 = n
- N-body simulation: n^2 / n = n

If the computational intensity is greater than 1, then the same data is used in more than one computation. Do as few global loads and as many shared loads as possible.

Registers and local memory

Registers:
- the fastest "memory" possible, about 10x faster than shared memory
- most stack variables declared in kernels are stored in registers (example: float x)
- statically indexed arrays stored on the stack are sometimes put in registers

Local memory:
- local memory is everything on the stack that can't fit in registers
- the scope of local memory is just the thread
- local memory is stored in global memory (much slower than registers!)

Both are non-programmable: they are managed by the compiler, not the programmer.

CUDA memory model summary

    Memory    | Managed by | Location      | GPU scope / access      | CPU access  | Lifetime
    Registers | Compiler   | On-chip       | Per thread              | Not visible | Lifetime of a thread
    Local     | Compiler   | Device memory | Per thread              | Not visible | Lifetime of a thread
    Shared    | Programmer | On-chip       | Per block               | Not visible | Block lifetime
    Global    | Programmer | Device memory | All threads, read/write | Read/write  | Application, or until explicitly freed
    Constant  | Programmer | Device memory | All threads, read-only  | Read/write  | Application
    Texture   | Programmer | Device memory | All threads, read-only  | Read/write  | Application

CUDA Programming Basics
- Parallel programming in CUDA C
- Shared memory and synchronization
- Constant memory and events
- Texture memory

Parallel programming in CUDA C

The first example:
Summing vectors. Imagine having two lists of numbers, where we want to sum corresponding elements of each list and store the result in a third list. (On a multicore CPU, core 1 and core 2 could each sum half of the elements.)

    __host__ __device__ cudaError_t cudaMalloc(void **devPtr, size_t size)

__host__ and __device__ are type qualifiers, which mean this function can be called on the device or on the host. All CUDA functions return an error code as their return value. Unlike C's malloc function, this function takes a pointer to a pointer as a parameter.

Execution configuration
    <<<Dg, Db, Ns, S>>>

- Dg is of type dim3 and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched.
- Db is of type dim3 and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block.
- Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per block for this call, in addition to the statically allocated memory. It is an optional argument which defaults to 0.
- S is of type cudaStream_t and specifies the associated stream. It is an optional argument which defaults to 0.

    __host__ cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind)

Copies data between host and device. cudaMemcpyKind specifies the direction of the copy; it is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, and cudaMemcpyDeviceToDevice.

__global__ functions must have a void return type. Any call to a __global__ function must specify its execution configuration. A call to a __global__ function is asynchronous, meaning it returns before the device has completed its execution.

blockIdx.x contains the block index within the grid; threadIdx.x contains the thread index within the block.

Shared memory and synchronization

Threads within a block can communicate with each other through shared memory. __shared__ is used to make a variable resident in shared memory. If two threads want to write the same shared variable, there must be a synchronization between the two threads: either A writes after B, or B writes after A. Let's take a look at an example that uses these features.

Dot product. Each thread multiplies a pair of corresponding entries, and then every thread moves on to its next pair. Because the result needs to be the sum of all these pairwise products, each thread keeps a running sum of the pairs it has added. Suppose we have N threads and the array size is N * M; each thread then accumulates M products, and the per-thread sums are combined through shared memory. (The kernel code shown on the original slides is not reproduced here.)

Constant memory and events

Constant memory is used for data that will not change over the course of a kernel execution. Nvidia hardware provides 64 KB of constant memory that it treats differently than standard global memory.

Problem description: produce an image of a three-dimensional scene. A simple idea: imagine light rays going through the objects and casting a shadow on a plate. Since light can come from any place at any point in our scene, it turns out to be easier to work backward: each pixel in the image shoots a ray into the scene. We figure out what color is seen by each pixel by tracing a ray from the pixel in question through the scene until it hits one of our objects. We then say that the pixel "sees" this object and can assign its color based on the color of the object it sees. Most of the computation required by ray tracing is the computation of these intersections of the ray with the objects in the scene.

What will the ray tracer do?
- It fires a ray from each pixel.
- It computes which rays hit which spheres, and the depth of each of these hits. In the case where a ray passes through multiple spheres, only the sphere closest to the camera can be seen.
- We model our spheres with a data structure that stores the sphere's center coordinates (x, y, z), its radius, and its color (r, g, b).

Constant memory is always declared at file scope (as a global variable).

Texture memory

Like constant memory, texture memory is another variety of read-only memory. It is cached on chip, so it can provide higher effective bandwidth. It is designed for graphics applications where memory access patterns exhibit a great deal of spatial locality.

Spatial locality: in a computing application, this roughly implies that a thread is likely to read from an address near the addresses that nearby threads read, as shown in the original figure. For a CPU caching scheme, the four addresses shown are not consecutive and would not be cached together.