Slide 1: 2008 Taiwan CUDA Course
Programming Massively Parallel Processors: the CUDA Experience
Lecture 8: Application Case Study - Quantitative MRI Reconstruction
David Kirk/NVIDIA and Wen-mei W. Hwu, Taiwan, June 30 - July 2, 2008

Slide 2: Acknowledgements
Sam S. Stone, Haoran Yi, Justin P. Haldar, Wen-mei W. Hwu, Bradley P. Sutton, Zhi-Pei Liang, Keith Thulborn*
Center for Reliable and High-Performance Computing, Beckman Institute for Advanced Science and Technology, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
* University of Illinois at Chicago Medical Center

Slide 3: Overview
- Magnetic resonance imaging
- Least-squares (LS) reconstruction algorithm
- Optimizing the LS reconstruction on the G80
- Overcoming bottlenecks
- Performance tuning
- Summary

Slide 4: Reconstructing MR Images
Diagram: Cartesian scan data -> FFT; spiral scan data -> gridding -> FFT or LS.
Cartesian scan data + FFT: slow scan, fast reconstruction, images may be poor.

Slide 5: Reconstructing MR Images
Spiral scan data + gridding(1) + FFT: fast scan, fast reconstruction, better images.
(1) Based on Fig. 1 of Lustig et al., "Fast Spiral Fourier Transform for Iterative MR Image Reconstruction," IEEE Intl. Symp. on Biomedical Imaging, 2004.

Slide 6: Reconstructing MR Images
Spiral scan data + least-squares (LS): superior images, at the expense of significantly more computation.

Slide 7: An Exciting Revolution - Sodium Map of the Brain
- Images of sodium in the brain
- Requires a powerful scanner (9.4 Tesla)
- Very large number of samples for increased SNR
- Requires high-quality reconstruction
- Enables study of brain-cell viability before anatomic changes occur in stroke and cancer treatment - within days!
Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago.

Slide 8: Least-Squares Reconstruction
Pipeline: compute Q = F^H F; acquire data; compute F^H d; find ρ.
- Q depends only on the scanner configuration
- F^H d depends on the scan data
- ρ is found using a linear solver
- Accelerate Q and F^H d on the G80:
  Q: 1-2 days on CPU
  F^H d: 6-7 hours on CPU
  ρ: 1.5 minutes on CPU

Slide 9: Algorithms to Accelerate
Compute Q:

    for (m = 0; m < M; m++) {
        phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
        for (n = 0; n < N; n++) {
            exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
            rQ[n] += phi[m]*cos(exp);
            iQ[n] += phi[m]*sin(exp);
        }
    }

- F^H d is nearly identical
- Scan data: M = # scan points; kx, ky, kz = 3D scan data
- Pixel data: N = # pixels; x, y, z = input 3D pixel data; Q = output pixel data
- Complexity is O(MN)
- Inner loop: 10 FP MUL or ADD ops, 2 FP trig ops, 10 loads

Slide 10: From C to CUDA, Step 1 - What unit of work is assigned to each thread?
(The loop nest from slide 9, unchanged, is the starting point.)

Slide 11: How does loop interchange help?

    for (n = 0; n < N; n++) {
        for (m = 0; m < M; m++) {
            phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];
            exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
            rQ[n] += phi[m]*cos(exp);
            iQ[n] += phi[m]*sin(exp);
        }
    }

Slide 12: How does loop fission help?

    for (m = 0; m < M; m++)
        phi[m] = rPhi[m]*rPhi[m] + iPhi[m]*iPhi[m];

    for (n = 0; n < N; n++) {
        for (m = 0; m < M; m++) {
            exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
            rQ[n] += phi[m]*cos(exp);
            iQ[n] += phi[m]*sin(exp);
        }
    }

Slide 13: From C to CUDA, Step 1 - What unit of work is assigned to each thread?
- phi kernel: each thread computes phi at one scan point (each thread corresponds to one iteration of the first loop)
- Q kernel: each thread computes Q at one pixel (each thread corresponds to one outer-loop iteration)

Slide 14: Tiling of Scan Data
- LS recon uses multiple grids
- Each grid operates on all pixels
- Each grid operates on a distinct subset of the scan data

- Each thread in the same grid operates on a distinct pixel

Thread n of the first grid operates on pixel n:

    for (m = 0; m < M/32; m++) {
        exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        rQ[n] += phi[m]*cos(exp);
        iQ[n] += phi[m]*sin(exp);
    }

Slide 15: Tiling of Scan Data
- Same scheme: each grid operates on all pixels and on a distinct subset of the scan data; each thread in the same grid operates on a distinct pixel

Thread n of the last grid operates on pixel n:

    for (m = 31*M/32; m < 32*M/32; m++) {
        exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
        rQ[n] += phi[m]*cos(exp);
        iQ[n] += phi[m]*sin(exp);
    }

Slide 16: From C to CUDA, Step 2 - Where are the potential bottlenecks?

    __global__ void Q(float* x, float* y, float* z, float* rQ, float* iQ,
                      float* kx, float* ky, float* kz, float* phi,
                      int startM, int endM) {
        n = blockIdx.x*TPB + threadIdx.x;
        for (m = startM; m < endM; m++) {
            exp = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
            rQ[n] += phi[m]*cos(exp);
            iQ[n] += phi[m]*sin(exp);
        }
    }

Bottlenecks: memory BW; trig ops; overheads (branches, address calculations).

Slide 17: Step 3 - Overcoming Bottlenecks
LS recon on CPU (single precision):
- Q: 45 hours, 0.5 GFLOPS
- F^H d: 7 hours, 0.7 GFLOPS
(Counting each trig op as 1 FLOP.)

Slide 18: Step 3 - Overcoming Bottlenecks (Mem BW)
- Register allocate pixel data: inputs (x, y, z); outputs (rQ, iQ)
- Exploit temporal and spatial locality in access to scan data: constant memory + constant caches; shared memory

Slide 19: Step 3 - Overcoming Bottlenecks (Mem BW)
- Register allocation of pixel data: inputs (x, y, z); outputs (rQ, iQ)
- FP arithmetic to off-chip loads: 2 to 1
- Performance: 5.1 GFLOPS (Q), 5.4 GFLOPS (F^H d)
- Still bottlenecked on memory BW

Slide 20: Step 3 - Overcoming Bottlenecks (Mem BW)
- Old bottleneck: off-chip BW. Solution: constant memory
- FP arithmetic to off-chip loads: 284 to 1
- Performance: 18.6 GFLOPS (Q), 22.8 GFLOPS (F^H d)
- New bottleneck: trig operations

Slide 21: Sidebar - Estimating Off-Chip Loads with the Constant Cache
How can we approximate the number of off-chip loads when using the constant caches?
- Given: 128 threads per block, 4 blocks per SM, 256 scan points per grid
- Assume no evictions due to cache conflicts

- 7 accesses to global memory per thread (x, y, z, rQ ×2, iQ ×2)
- 4 blocks/SM * 128 threads/block * 7 accesses/thread = 3,584 global memory accesses
- 4 accesses to constant memory per scan point (kx, ky, kz, phi)
- 256 scan points * 4 loads/point = 1,024 constant memory accesses
- Total off-chip memory accesses = 3,584 + 1,024 = 4,608
- Total FP arithmetic ops = 4 blocks/SM * 128 threads/block * 256 iters/thread * 10 ops/iter = 1,310,720
- FP arithmetic to off-chip loads: 284 to 1

Slide 22: Step 3 - Overcoming Bottlenecks (Trig)
- Old bottleneck: trig operations. Solution: SFUs
- Performance: 98.2 GFLOPS (Q), 92.2 GFLOPS (F^H d)
- New bottleneck: overhead of branches and address calculations

Slide 23: Sidebar - Effects of Approximations
- Avoid the temptation to measure only absolute error (I0 - I): it can be deceptively large or small
- Metrics: PSNR (peak signal-to-noise ratio) and SNR (signal-to-noise ratio)
- Avoid the temptation to consider only the error in the computed value: some applications are resistant to approximations; others are very sensitive
Reference: A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards (2nd ed.), Plenum Press, New York, NY, 1995.

Slide 24: Step 3 - Overcoming Bottlenecks (Overheads)
- Old bottleneck: overhead of branches and address calculations. Solution: loop unrolling and experimental tuning
- Performance: 179 GFLOPS (Q), 145 GFLOPS (F^H d)

Slide 25: Experimental Tuning - Tradeoffs
- In the Q kernel, three parameters are natural candidates for experimental tuning: loop unrolling factor (1, 2, 4, 8, 16); number of threads per block (32, 64, 128, 256, 512); number of scan points per grid (32, 64, 128, 256, 512, 1024, 2048)
- These parameters can't be optimized independently: threads share resources (register file, shared memory), and optimizations that increase a thread's performance often increase its resource consumption, reducing the total number of threads that execute in parallel
- The optimization space is not linear: threads are assigned to SMs in whole thread blocks, which causes discontinuity and non-linearity in the optimization space

Slide 26: Experimental Tuning - Example
Increase in per-thread performance, but fewer threads: lower overall performance. [Chart not recoverable from the extracted text.]

Slide 27: Experimental Tuning - Scan Points Per Grid
[Chart not recoverable from the extracted text.]

Slide 28: Sidebar - Cache-Conscious Data Layout
- The kx, ky, kz, and phi components of the same scan point have spatial and temporal locality (prefetching, caching)
- The old layout does not fully leverage that locality; the new layout does

Slide 29: Experimental Tuning - Scan Points Per Grid (Improved Data Layout)
[Chart not recoverable from the extracted text.]

Slide 30: Experimental Tuning - Loop Unrolling Factor
[Chart not recoverable from the extracted text.]

Slide 31: Sidebar - Optimizing the CPU Implementation
- Optimizing the CPU implementation of your application is very important: often, the transformations that increase performance on the CPU also increase performance on the GPU (and vice versa), and the research community won't take your results seriously if your baseline is crippled
- Useful optimizations: data tiling; SIMD vectorization (SSE); fast math libraries (AMD, Intel); classical optimizations (loop unrolling, etc.); the Intel compiler (icc, icpc)

Slide 32: Summary of Results
[Results table not recoverable from the extracted text; only the figure "8X" survives. The source is truncated partway through slide 33.]
