




版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領
文檔簡介
1、Parallel Programmingin C with MPI and OpenMP,Michael J. Quinn,Chapter 11,Matrix Multiplication,Outline,Sequential algorithms Iterative, row-oriented Recursive, block-oriented Parallel algorithms Rowwise block striped decomposition Cannons algorithm,Iterative, Row-oriented Algorithm,Series of inner p
2、roduct (dot product) operations,Performance as n Increases,Reason:Matrix B Gets Too Big for Cache,Computing a row of C requires accessing every element of B,Block Matrix Multiplication,Replace scalar multiplication with matrix multiplication Replace scalar addition with matrix addition,Recurse Until
3、 B Small Enough,Comparing Sequential Performance,First Parallel Algorithm,Partitioning Divide matrices into rows Each primitive task has corresponding rows of three matrices Communication Each task must eventually see every row of B Organize tasks into a ring,First Parallel Algorithm (cont.),Agglome
4、ration and mapping Fixed number of tasks, each requiring same amount of computation Regular communication among tasks Strategy: Assign each process a contiguous group of rows,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,
5、B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Complexity Analysis,Algorithm has p iterations During each iteration a process multiplies(n / p) (n / p) block of A by (n / p) n block of B: (n3 / p2) Total computation time: (n3 / p) Each process ends up passing(p-1)n2/p
6、 = (n2) elements of B,Isoefficiency Analysis,Sequential algorithm: (n3) Parallel overhead: (pn2)Isoefficiency relation: n3 Cpn2 n Cp This system does not have good scalability,Weakness of Algorithm 1,Blocks of B being manipulated have p times more columns than rows Each process must access every ele
7、ment of matrix B Ratio of computations per communication is poor: only 2n / p,Parallel Algorithm 2(Cannons Algorithm),Associate a primitive task with each matrix element Agglomerate tasks responsible for a square (or nearly square) block of C Computation-to-communication ratio rises to n / p,Element
8、s of A and B Needed to Compute a Processs Portion of C,Algorithm 1,Cannons Algorithm,Blocks Must Be Aligned,Before,After,Blocks Need to Be Aligned,A00,B00,A01,B01,A02,B02,A03,B03,A10,B10,A11,B11,A12,B12,A13,B13,A20,B20,A21,B21,A22,B22,A23,B23,A30,B30,A31,B31,A32,B32,A33,B33,Each triangle represents
9、a matrix block Only same-color triangles should be multiplied,Rearrange Blocks,A00,B00,A01,B01,A02,B02,A03,B03,A10,B10,A11,B11,A12,B12,A13,B13,A20,B20,A21,B21,A22,B22,A23,B23,A30,B30,A31,B31,A32,B32,A33,B33,Block Aij cycles left i positions Block Bij cycles up j positions,Consider Process P1,2,B02,A
10、10,A11,A12,B12,A13,B22,B32,Step 1,Consider Process P1,2,B12,A11,A12,A13,B22,A10,B32,B02,Step 2,Consider Process P1,2,B22,A12,A13,A10,B32,A11,B02,B12,Step 3,Consider Process P1,2,B32,A13,A10,A11,B02,A12,B12,B22,Step 4,Complexity Analysis,Algorithm has p iterations During each iteration process multip
11、lies two (n / p ) (n / p ) matrices: (n3 / p 3/2) Computational complexity: (n3 / p) During each iteration process sends and receives two blocks of size (n / p ) (n / p ) Communication complexity: (n2/ p),Isoefficiency Analysis,Sequential algorithm: (n3) Parallel overhead: (pn2) Isoefficiency relati
12、on: n3 C pn2 n C p This system is highly scalable,Summary,Considered two sequential algorithms Iterative, row-oriented algorithm Recursive, block-oriented algorithm Second has better cache hit rate as n increases Developed two parallel algorithms First based on rowwise block striped decomposition Se
13、cond based on checkerboard block decomposition Second algorithm is scalable, while first is not,Parallel Programmingin C with MPI and OpenMP,Michael J. Quinn,Chapter 12,Solving Linear Systems,Outline,Terminology Back substitution Gaussian elimination Jacobi method Conjugate gradient method,Terminolo
14、gy,System of linear equations Solve Ax = b for x Special matrices Symmetrically banded Upper triangular Lower triangular Diagonally dominant Symmetric,Symmetrically Banded,4,2,-1,0,0,0,3,-4,5,6,0,0,1,6,3,2,4,0,0,2,-2,0,9,2,0,0,7,3,8,7,0,0,0,4,0,2,Semibandwidth 2,Upper Triangular,4,2,-1,5,9,2,0,-4,5,
15、6,0,-4,0,0,3,2,4,6,0,0,0,0,9,2,0,0,0,0,8,7,0,0,0,0,0,2,Lower Triangular,4,0,0,0,0,0,0,0,0,0,0,0,5,4,3,0,0,0,2,6,2,3,0,0,8,-2,0,1,8,0,-3,5,7,9,5,2,Diagonally Dominant,19,0,2,2,0,6,0,-15,2,0,-3,0,5,4,22,-1,0,4,2,3,2,13,0,-5,5,-2,0,1,16,0,-3,5,5,3,5,-32,Symmetric,3,0,2,2,0,6,0,7,4,3,-3,5,5,4,0,-1,0,4,2
16、,3,-1,9,0,-5,0,-3,0,0,5,5,6,5,4,-5,5,-3,Back Substitution,Used to solve upper triangular systemTx = b for x Methodology: one element of x can be immediately computed Use this value to simplify system, revealing another element that can be immediately computed Repeat,Back Substitution,1x0,+1x1,1x2,+4
17、x3,8,=, 2x1,3x2,+1x3,5,=,2x2, 3x3,0,=,2x3,4,=,Back Substitution,1x0,+1x1,1x2,+4x3,8,=, 2x1,3x2,+1x3,5,=,2x2, 3x3,0,=,2x3,4,=,x3 = 2,Back Substitution,1x0,+1x1,1x2,0,=, 2x1,3x2,3,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,+1x1,1x2,0,=, 2x1,3x2,3,=,2x2,6,=,2x3,4,=,x2 = 3,Back Substitution,1x0,+1x1,3,=, 2
18、x1,12,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,+1x1,3,=, 2x1,12,=,2x2,6,=,2x3,4,=,x1 = 6,Back Substitution,1x0,9,=, 2x1,12,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,9,=, 2x1,12,=,2x2,6,=,2x3,4,=,x0 = 9,Pseudocode,for i n 1 down to 1 do x i b i / a i, i for j 0 to i 1 do b j b j x i a j, i endfor endfor
19、,Time complexity: (n2),Data Dependence Diagram,We cannot execute the outer loop in parallel. We can execute the inner loop in parallel.,Row-oriented Algorithm,Associate primitive task with each row of A and corresponding elements of x and b During iteration i task associated with row j computes new
20、value of bj Task i must compute xi and broadcast its value Agglomerate using rowwise interleaved striped decomposition,Interleaved Decompositions,Rowwise interleaved striped decomposition,Columnwise interleaved striped decomposition,Complexity Analysis,Each process performs about n / (2p) iterations
21、 of loop j in all A total of n -1 iterations in all Computational complexity: (n2/p) One broadcast per iteration Communication complexity: (n log p),Column-oriented Algorithm,Associate one primitive task per column of A and associated element of x Last task starts with vector b During iteration i ta
22、sk i computes xi, updates b, and sends b to task i -1 In other words, no computational concurrency Agglomerate tasks in interleaved fashion,Complexity Analysis,Since b always updated by a single process, computational complexity same as sequential algorithm: (n2) Since elements of b passed from one process to another each iteration, communication complexity is (n2),Comparison,Message- passing time dominates,Computation time dominates,Gaussian Elimination,Used to solve Ax = b when A is dense Reduces Ax = b to upper triangular system Tx = c Back su
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經權益所有人同意不得將文件中的內容挪作商業(yè)或盈利用途。
- 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
- 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 患者健康教育方法及技巧
- 老年健康主題演講大綱
- 社戲教學課件
- 線上教學反饋課件
- 音樂教學原創(chuàng)課件圖譜
- 排澇站可行性研究報告
- 教學課件素材溫柔文案
- 學校中層干部述職會上校長講話:把職責扛穩(wěn)把問題剖深把實干抓實
- 旅游公司開業(yè)儀式策劃方案
- 新春朗讀活動方案
- 獸醫(yī)檢驗題庫與答案
- 新編旅游職業(yè)道德 課件 譚為躍 第3-5章 旅行社從業(yè)人員道德素養(yǎng)、酒店從業(yè)者道德素養(yǎng)、景區(qū)點從業(yè)人員道德素養(yǎng)
- 《客艙安全與應急處置》-課件:援助者的選擇
- 高度近視眼底疾病知識講座
- 《陸上風電場工程概算定額》(NB-T 31010-2019)
- 市政管道施工培訓課件
- 探索玻璃瓶身藝術噴漆裝飾的美學價值
- 南電一水庫防洪搶險應急預案
- 2023年中國收藏卡市場研究報告-2023
- 檔案掃描保密管理制度
- 2024版網絡安全攻防演練與實踐分享培訓課件
評論
0/150
提交評論