c - Matrix Multiplication CUDA
I have been reading through several websites and used NVIDIA's programming guide, but I am still getting the wrong answer. The program asks the user for a size, displays matrix A and matrix B, and then displays the resulting matrix C. I ran it with a 2x2 matrix for both A and B. Sample output:
Matrix A
0.000000 8.000000
2.000000 2.000000

Matrix B
3.000000 1.000000
5.000000 7.000000

Matrix C (results)
0.000000 9.000000
7.000000 4.000000
But that is incorrect. Checking the first element by hand, C[0][0] = 0*3 + 8*5 = 40, so the result should be:

40.000 56.000
16.000 16.000
I changed the decimals to whole numbers to make checking easier, and that is how I found the result is incorrect. I do not understand why it is wrong, since I took the code straight from a working sample.
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16

// CUDA kernel
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to
    // compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Declaration of the shared memory array As
        // used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Declaration of the shared memory array Bs
        // used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is
        // done before loading two new sub-matrices of A and B in the
        // next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_
Host code:

// Perform the calculation
// Setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(C.colSize / threads.x, C.rowSize / threads.y);

// Execute the kernel
matrixMul<<< grid, threads >>>(deviceMatrixC, deviceMatrixA,
                               deviceMatrixB, A.colSize, B.colSize);
Thanks for the help, Dan
The code you are using implicitly requires the size of the matrices to be round multiples of the block size (16x16 in this case). The inner product calculation processes a tile width at a time without checking for out-of-bounds memory access. For this reason, 2x2 matrices will not work.
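If you wanted the kernel itself to handle arbitrary sizes, one option is to guard every load and the final store. The following is only a sketch of that idea, not the original sample: it assumes an extra hA parameter (the height of A, i.e. the number of rows of C), which the original signature does not pass, and a grid rounded up to cover the whole output.

// Sketch: bounds-checked variant of the tiled kernel above.
// Launch with:
//   dim3 grid((wB + BLOCK_SIZE - 1) / BLOCK_SIZE,
//             (hA + BLOCK_SIZE - 1) / BLOCK_SIZE);
__global__ void matrixMulBounded(float* C, float* A, float* B,
                                 int hA, int wA, int wB)
{
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    float Csub = 0.0f;

    // Walk the tiles of the shared dimension, padding the shared
    // arrays with zeros wherever a thread falls off the real matrix.
    for (int t = 0; t < (wA + BLOCK_SIZE - 1) / BLOCK_SIZE; ++t)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int aCol = t * BLOCK_SIZE + threadIdx.x;
        int bRow = t * BLOCK_SIZE + threadIdx.y;

        As[threadIdx.y][threadIdx.x] =
            (row < hA && aCol < wA) ? A[row * wA + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < wA && col < wB) ? B[bRow * wB + col] : 0.0f;

        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        __syncthreads();
    }

    // Only threads that map to a real element of C write a result.
    if (row < hA && col < wB)
        C[row * wB + col] = Csub;
}

Note that the guards live inside the loads rather than around the loop body, so every thread still reaches both __syncthreads() calls.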
If you try running the kernel with a 16x16 input (for example, zero-padding your 2x2 case out to 16x16), you should be able to confirm that the result is correct.
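As a quick experiment, something along these lines does the padding on the host side. This is a sketch only: the names n, hostA, hostB and devicePadded* are placeholders for your own variables, and the device allocation and copy-back are assumed to follow the same pattern as your existing code.

// Round the user-entered size n up to a multiple of BLOCK_SIZE.
// With n = 2 and BLOCK_SIZE = 16 this gives the 16x16 case above.
int paddedN = ((n + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;

float* hostPaddedA = (float*)calloc(paddedN * paddedN, sizeof(float));
float* hostPaddedB = (float*)calloc(paddedN * paddedN, sizeof(float));

// Copy the real data into the top-left corner; the rest stays zero.
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
        hostPaddedA[i * paddedN + j] = hostA[i * n + j];
        hostPaddedB[i * paddedN + j] = hostB[i * n + j];
    }

// The grid must be computed from the padded size; with your current
// host code the integer division 2 / 16 gives a zero-sized grid, so
// the launch fails and C is never written.
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(paddedN / threads.x, paddedN / threads.y);
matrixMul<<< grid, threads >>>(devicePaddedC, devicePaddedA,
                               devicePaddedB, paddedN, paddedN);

Zero padding does not change the product, so the top-left 2x2 block of the padded C is your answer.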
Tags: c, cuda, matrix-multiplication