版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報或認(rèn)領(lǐng)
文檔簡介
1、 David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,1,2008 Taiwan CUDA CourseProgramming Massively Parallel Processors:the CUDA experience Lecture 8: Application Case Study - Quantitative MRI Reconstruction, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,2,Ackno
2、wledgements,Sam S. Stone, Haoran Yi, Justin P. Haldar, Wen-mei W. Hwu, Bradley P. Sutton, Zhi-Pei Liang, Keith Thulburin*,Center for Reliable and High-Performance Computing, Beckman Institute for Advanced Science and Technology,Department of Electrical and Computer Engineering University of Illinois
3、 at Urbana-Champaign * University of Illinois, Chicago Medical Center, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,3,Overview,Magnetic resonance imaging Least-squares (LS) reconstruction algorithm Optimizing the LS reconstruction on the G80 Overcoming bottlenecks Performance
4、tuning Summary,1, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,4,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding,FFT,LS,2,Cartesian scan data + FFT: Slow scan, fast reconstruction, images may be poor, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30
5、- July 2, 2008,5,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding1,FFT,LS,Spiral scan data + Gridding + FFT: Fast scan, fast reconstruction, better images,2,1 Based on Fig 1 of Lustig et al, Fast Spiral Fourier Transform for Iterative MR Image Reconstruction, IEEE Intl Symp. on
6、 Biomedical Imaging, 2004, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,6,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding,FFT,Least-Squares (LS),Spiral scan data + LS Superior images at expense of significantly more computation,2, David Kirk/NVIDIA and W
7、en-mei W. Hwu Taiwan, June 30 - July 2, 2008,7,An Exciting Revolution - Sodium Map of the Brain,Images of sodium in the brain Requires powerful scanner (9.4 Tesla) Very large number of samples for increased SNR Requires high-quality reconstruction Enables study of brain-cell viability before anatomi
8、c changes occur in stroke and cancer treatment within days!,Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,8,Least-Squares Reconstruction,Compute Q = FHF,Acquire Data,Compute
9、FHd,Find ,Q depends only on scanner configuration FHd depends on scan data found using linear solver Accelerate Q and FHd on G80 Q: 1-2 days on CPU FHd: 6-7 hours on CPU : 1.5 minutes on CPU,5, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,9,Algorithms to Accelerate,for (m = 0;
10、 m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,Compute Q,FHd is nearly identical Scan data M = # scan points kx, ky, kz = 3D scan data Pixel data N = # pixels x, y, z = input 3D pixel data Q = output pi
11、xel data Complexity is O(MN) Inner loop 10 FP MUL or ADD ops 2 FP trig ops 10 loads,6, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,10,From C to CUDA: Step 1What unit of work is assigned to each thread?,7,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n
12、+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) , David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,11,8,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += p
13、him*sin(exp) ,How does loop interchange help?,for (n = 0; n N; n+) for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?, David Kirk/NVIDIA and Wen-mei
14、 W. Hwu Taiwan, June 30 - July 2, 2008,12,9,for (n = 0; n N; n+) for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,How does loop fission help?,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n
15、+) for (m = 0; m M; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,13,10,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhi
16、m for (n = 0; n N; n+) for (m = 0; m M; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?,phi kernel Each thread computes phi at one scan point (each thread corresponds to one loop iteration),Q ker
17、nel Each thread computes Q at one pixel (each thread corresponds to one outer loop iteration), David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,14,Tiling of Scan Data,LS recon uses multiple grids Each grid operates on all pixels Each grid operates on a distinct subset of scan data
18、 Each thread in the same grid operates on a distinct pixel,for (m = 0; m M/32; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,12,Thread n operates on pixel n:, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,15,LS recon uses multiple grids Ea
19、ch grid operates on all pixels Each grid operates on a distinct subset of scan data Each thread in the same grid operates on a distinct pixel,Tiling of Scan Data,12,for (m = 31M/32; m 32M/32; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,Thread n operates on pi
20、xel n:, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,16,13,Q(float* x,y,z,rQ,iQ,kx,ky,kz,phi, int startM,endM) n = blockIdx.x*TPB + threadIdx.x for (m = startM; m endM; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim * cos(exp) iQn += phim * sin(exp) ,From C to CUDA: Ste
21、p 2Where are the potential bottlenecks?,Bottlenecks Memory BW Trig ops Overheads (branches, addr calcs), David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,17,Step 3: Overcoming bottlenecks,14,LS recon on CPU (SP) Q: 45 hours, 0.5 GFLOPS FHd: 7 hours, 0.7 GFLOPS Counting each trig o
22、p as 1 FLOP, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,18,Step 3: Overcoming Bottlenecks (Mem BW),15,Register allocate pixel data Inputs (x, y, z); Outputs (rQ, iQ) Exploit temporal and spatial locality in access to scan data Constant memory + constant caches Shared memory,
23、 David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,19,Step 3: Overcoming Bottlenecks (Mem BW),Register allocation of pixel data Inputs (x, y, z); Outputs (rQ, iQ) FP arithmetic to off-chip loads: 2 to 1 Performance 5.1 GFLOPS (Q), 5.4 GFLOPS (FHd) Still bottlenecked on memory BW,16
24、, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,20,Step 3: Overcoming Bottlenecks (Mem BW),Old bottleneck: off-chip BW Solution: constant memory FP arithmetic to off-chip loads: 284 to 1 Performance 18.6 GFLOPS (Q), 22.8 GFLOPS (FHd) New bottleneck: trig operations,17, David Ki
25、rk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,21,Sidebar: Estimating Off-Chip Loads with Const Cache,How can we approximate the number of off-chip loads when using the constant caches? Given: 128 tpb, 4 blocks per SM, 256 scan points per grid Assume no evictions due to cache conflicts
26、7 accesses to global memory per thread (x, y, z, rQ x 2, iQ x 2) 4 blocks/SM * 128 threads/block * 7 accesses/thread = 3,584 global mem accesses 4 accesses to constant memory per scan point (kx, ky, kz, phi) 256 scan points * 4 loads/point = 1,024 constant mem accesses Total off-chip memory accesses
27、 = 3,584 + 1,024 = 4,608 Total FP arithmetic ops = 4 blocks/SM * 128 threads/block * 256 iters/thread * 10 ops/iter = 1,310,720 FP arithmetic to off-chip loads: 284 to 1,18, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,22,Step 3: Overcoming Bottlenecks (Trig),Old bottleneck: t
28、rig operations Solution: SFUs Performance 98.2 GFLOPS (Q), 92.2 GFLOPS (FHd) New bottleneck: overhead of branches and address calculations,19, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,23,Sidebar: Effects of Approximations,Avoid temptation to measure only absolute error (I0
29、 I) Can be deceptively large or small Metrics PSNR: Peak signal-to-noise ratio SNR: Signal-to-noise ratio Avoid temptation to consider only the error in the computed value Some apps are resistant to approximations; others are very sensitive,20,A.N. Netravali and B.G. Haskell, Digital Pictures: Repre
30、sentation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995)., David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,24,Step 3: Overcoming Bottlenecks (Overheads),Old bottleneck: Overhead of branches and address calculations Solution: Loop unrolling and experimenta
31、l tuning Performance 179 GFLOPS (Q), 145 GFLOPS (FHd),21, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,25,Experimental Tuning: Tradeoffs,In the Q kernel, three parameters are natural candidates for experimental tuning Loop unrolling factor (1, 2, 4, 8, 16) Number of threads pe
32、r block (32, 64, 128, 256, 512) Number of scan points per grid (32, 64, 128, 256, 512, 1024, 2048) Cant optimize these parameters independently Resource sharing among threads (register file, shared memory) Optimizations that increase a threads performance often increase the threads resource consumpt
33、ion, reducing the total number of threads that execute in parallel Optimization space is not linear Threads are assigned to SMs in large thread blocks Causes discontinuity and non-linearity in the optimization space,22, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,26,Experimen
34、tal Tuning: Example,Increase in per-thread performance, but fewer threads: Lower overall performance,23, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,27,Experimental Tuning: Scan Points Per Grid,24, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,28,Sidebar
35、: Cache-Conscious Data Layout,kx, ky, kz, and phi components of same scan point have spatial and temporal locality Prefetching Caching Old layout does not fully leverage that locality New layout does fully leverage that locality,25, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008
36、,29,Experimental Tuning: Scan Points Per Grid (Improved Data Layout),26, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,30,Experimental Tuning: Loop Unrolling Factor,27, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,31,Sidebar: Optimizing the CPU Implementa
37、tion,Optimizing the CPU implementation of your application is very important Often, the transformations that increase performance on CPU also increase performance on GPU (and vice-versa) The research community wont take your results seriously if your baseline is crippled Useful optimizations Data ti
38、ling SIMD vectorization (SSE) Fast math libraries (AMD, Intel) Classical optimizations (loop unrolling, etc) Intel compiler (icc, icpc),28, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,32,Summary of Results,29,8X, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,33,Summ
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 2025年中國消費行業(yè)趨勢預(yù)測與發(fā)展前景
- 創(chuàng)意經(jīng)濟(jì)下的工藝美術(shù)品設(shè)計與生產(chǎn)
- 插秧苗研學(xué)課程設(shè)計
- 企業(yè)項目路演技巧及匯報制作要點
- 2024年度金融行業(yè)委托代繳員工社保及福利協(xié)議書3篇
- MZ82-生命科學(xué)試劑-MCE
- Lucimycin-FI-1163-生命科學(xué)試劑-MCE
- L-Hydroxyproline-7-amido-4-methylcoumarin-hydrochloride-生命科學(xué)試劑-MCE
- 小班小手生成性課程設(shè)計
- 2024年租賃合同:酒店設(shè)備租賃協(xié)議
- 護(hù)理質(zhì)控輸液查對制度
- 2024三方物流園區(qū)租賃與運營管理合同3篇
- 【MOOC】例解宏觀經(jīng)濟(jì)統(tǒng)計學(xué)-江西財經(jīng)大學(xué) 中國大學(xué)慕課MOOC答案
- 《中國的土地政策》課件
- 【MOOC】電工學(xué)-西北工業(yè)大學(xué) 中國大學(xué)慕課MOOC答案
- 專題12 簡·愛-2024年中考語文復(fù)習(xí)文學(xué)名著必考篇目分層訓(xùn)練(原卷版)
- 【高考語文】2024年全國高考新課標(biāo)I卷-語文試題評講
- 客戶滿意度論文開題報告
- 2024-2025學(xué)年八年級上冊歷史期末復(fù)習(xí)選擇題(解題指導(dǎo)+專項練習(xí))原卷版
- 課桌椅人體工程學(xué)
- 中石油系統(tǒng)員工安全培訓(xùn)
評論
0/150
提交評論