c - Matrix Multiplication CUDA
I have been reading through several websites and used NVIDIA's programming guide, but I am still getting the wrong answer. The program asks the user for a size, displays matrix A and matrix B, and then displays the resulting matrix C. I ran it with a 2x2 matrix for both A and B. Sample output:
Matrix A
0.000000 8.000000
2.000000 2.000000

Matrix B
3.000000 1.000000
5.000000 7.000000

Matrix C (results)
0.000000 9.000000
7.000000 4.000000
But that is incorrect. Checking the first element by hand, C[0][0] = 0*3 + 8*5 = 40, so the result should be:

40.000 56.000
16.000 16.000
I changed the decimals to whole numbers to make checking easier, and that is how I found the result is incorrect. I do not understand why it is wrong, since I took the code straight from a working sample.
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16

// CUDA kernel
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Declaration of the shared memory array As
        // used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Declaration of the shared memory array Bs
        // used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is
        // done before loading two new sub-matrices of A and B in the
        // next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_
Host code:

// Perform the calculation
// Setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(C.colSize / threads.x, C.rowSize / threads.y);

// Execute the kernel
matrixMul<<< grid, threads >>>(deviceMatrixC, deviceMatrixA,
                               deviceMatrixB, A.colSize, B.colSize);
Thanks for the help, Dan
The code you are using implicitly requires the size of the matrices to be round multiples of the block size (16x16 in this case). The inner product calculation processes a tile width at a time without checking for out-of-bounds memory access. For this reason, 2x2 matrices will not work.
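If you wanted the kernel itself to handle arbitrary sizes, one option is to guard every load and the final store. The following is only a sketch of that idea, not the original sample: it assumes an extra hA parameter (the height of A, i.e. the number of rows of C), which the original signature does not pass, and a grid rounded up to cover the whole output.

// Sketch: bounds-checked variant of the tiled kernel above.
// Launch with:
//   dim3 grid((wB + BLOCK_SIZE - 1) / BLOCK_SIZE,
//             (hA + BLOCK_SIZE - 1) / BLOCK_SIZE);
__global__ void matrixMulBounded(float* C, float* A, float* B,
                                 int hA, int wA, int wB)
{
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    float Csub = 0.0f;

    // Walk the tiles of the shared dimension, padding the shared
    // arrays with zeros wherever a thread falls off the real matrix.
    for (int t = 0; t < (wA + BLOCK_SIZE - 1) / BLOCK_SIZE; ++t)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int aCol = t * BLOCK_SIZE + threadIdx.x;
        int bRow = t * BLOCK_SIZE + threadIdx.y;

        As[threadIdx.y][threadIdx.x] =
            (row < hA && aCol < wA) ? A[row * wA + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < wA && col < wB) ? B[bRow * wB + col] : 0.0f;

        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        __syncthreads();
    }

    // Only threads that map to a real element of C write a result.
    if (row < hA && col < wB)
        C[row * wB + col] = Csub;
}

Note that the guards live inside the loads rather than around the loop body, so every thread still reaches both __syncthreads() calls.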
If you try running the kernel with a 16x16 input (for example, zero-padding your 2x2 case out to 16x16), you should be able to confirm that the result is correct.
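As a quick experiment, something along these lines does the padding on the host side. This is a sketch only: the names n, hostA, hostB and devicePadded* are placeholders for your own variables, and the device allocation and copy-back are assumed to follow the same pattern as your existing code.

// Round the user-entered size n up to a multiple of BLOCK_SIZE.
// With n = 2 and BLOCK_SIZE = 16 this gives the 16x16 case above.
int paddedN = ((n + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;

float* hostPaddedA = (float*)calloc(paddedN * paddedN, sizeof(float));
float* hostPaddedB = (float*)calloc(paddedN * paddedN, sizeof(float));

// Copy the real data into the top-left corner; the rest stays zero.
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
        hostPaddedA[i * paddedN + j] = hostA[i * n + j];
        hostPaddedB[i * paddedN + j] = hostB[i * n + j];
    }

// The grid must be computed from the padded size; with your current
// host code the integer division 2 / 16 gives a zero-sized grid, so
// the launch fails and C is never written.
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(paddedN / threads.x, paddedN / threads.y);
matrixMul<<< grid, threads >>>(devicePaddedC, devicePaddedA,
                               devicePaddedB, paddedN, paddedN);

Zero padding does not change the product, so the top-left 2x2 block of the padded C is your answer.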
Tags: c, cuda, matrix-multiplication