I'm trying to implement block matrix multiplication recursively. It works fine for 2x2 matrices, but when I increase the size to something like 4x4 the answers differ vastly.
Result of the three nested for loops:
1.53 0.89 0.53 1.33
1.75 1.09 0.72 1.17
1.78 1.43 0.57 1.69
1.73 1.04 0.62 1.51
Result of the recursion:
1.34 1.49 0.30 1.45
2.02 1.93 0.79 1.30
2.70 2.75 0.87 2.21
1.81 1.84 0.59 1.47
If the number of blocks within the matrix is greater than 4, I divide the matrix into four larger blocks and halve the dimension to get the new block size, as shown below, and then make the 8 recursive calls.
void myRecMat(float** MatrixA, float** MatrixB, float** MatrixC,
              int srA, int scA, int srB, int scB, int srC, int scC,
              int blocks, int dim) {
    if (blocks > 4) {
        blocks = blocks / 4;
        int newDim = dim / 2;
        myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA,        srB,        scB,        srC,        scC,        blocks, newDim);
        myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA+newDim, srB+newDim, scB,        srC,        scC,        blocks, newDim);
        myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA,        srB,        scB+newDim, srC,        scC+newDim, blocks, newDim);
        myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA+newDim, srB+newDim, scB,        srC+newDim, scC,        blocks, newDim);
        myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA,        srB,        scB,        srC+newDim, scC,        blocks, newDim);
        myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA+newDim, srB+newDim, scB,        srC+newDim, scC,        blocks, newDim);
        myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA+newDim, srB,        scB+newDim, srC+newDim, scC+newDim, blocks, newDim);
        myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA+newDim, srB+newDim, scB+newDim, srC+newDim, scC+newDim, blocks, newDim);
    }
    else {
        int i, j, k, endR, endC;
        endR = srC + dim;
        endC = scC + dim;
        for (i = srC; i < endR; i++)
            for (j = scC; j < endC; j++)
                for (k = 0; k < newDim; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }
}
The sr and sc parameters are the starting row and column. The spacing should be right, so I'm honestly out of leads here. Thanks in advance.
I've compiled and carefully debugged your code. If you only intend to use this function on 2^k x 2^k matrices, these two modifications will help.
First:
for (i = srC; i < endR; i++) {
    for (j = scC; j < endC; j++) {
        for (k = 0; k < newDim; k++)
            /* c[i][j] += a[i][k] * b[k][j]; */
            c[i][j] += a[i][scA + k] * b[srB + k][j];
    }
}
Second:
myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA,        srB,        scB,        srC,        scC,        blocks, newDim);
myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA+newDim, srB+newDim, scB,        srC,        scC,        blocks, newDim);
myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA,        srB,        scB+newDim, srC,        scC+newDim, blocks, newDim);
/* myRecMat(MatrixA, MatrixB, MatrixC, srA, scA+newDim, srB+newDim, scB, srC+newDim, scC, blocks, newDim); */
myRecMat(MatrixA, MatrixB, MatrixC, srA,        scA+newDim, srB+newDim, scB+newDim, srC,        scC+newDim, blocks, newDim);
myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA,        srB,        scB,        srC+newDim, scC,        blocks, newDim);
myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA+newDim, srB+newDim, scB,        srC+newDim, scC,        blocks, newDim);
/* myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA+newDim, srB, scB+newDim, srC+newDim, scC+newDim, blocks, newDim); */
myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA,        srB,        scB+newDim, srC+newDim, scC+newDim, blocks, newDim);
myRecMat(MatrixA, MatrixB, MatrixC, srA+newDim, scA+newDim, srB+newDim, scB+newDim, srC+newDim, scC+newDim, blocks, newDim);
I believe your problem here is not so much the implementation of your method as the loss of precision in floating-point operations. One may think this imprecision is negligible, but when we perform intense operations on a floating-point variable, like your triple nested loop, these imprecisions become significant.
One way to work around this is to scale your floating-point numbers so they "lose" their decimal part. For example, if you know your matrix won't have numbers with more than two decimal digits, multiply them all by 100 and take their integer representation. Then perform the arithmetic on integers (which is exact), and at the end convert the result back to floating point and divide it by 100.
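To make the scaling idea concrete, here is a minimal sketch (mine, not the original poster's code); it assumes non-negative values with at most two decimal digits and accumulates the products as exact integers before converting back:
#include <stdio.h>

/* Illustrative only: dot product accumulated in integers after scaling by 100. */
float scaled_dot(const float *a, const float *b, int n) {
    long long acc = 0;                                     /* exact integer accumulator */
    for (int i = 0; i < n; i++) {
        long long ai = (long long)(a[i] * 100.0f + 0.5f);  /* e.g. 1.53 -> 153 (non-negative inputs) */
        long long bi = (long long)(b[i] * 100.0f + 0.5f);
        acc += ai * bi;                                    /* each product carries a factor of 100*100 */
    }
    return (float)acc / 10000.0f;                          /* undo the 100*100 scaling */
}

int main(void) {
    float a[2] = { 0.53f, 0.89f }, b[2] = { 1.33f, 1.17f };
    printf("%f\n", scaled_dot(a, b, 2));                   /* 0.53*1.33 + 0.89*1.17 */
    return 0;
}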
Hope this helps.
Related
Transposing a global 2D square matrix/array of 1 GB with a tiling (cache-aware) approach shows no performance gain in single-threaded execution over the normal transpose method. I am not discussing the transpose speed-up using AVX/SSE (SIMD) or any other cache-oblivious transpose algorithm (http://supertech.csail.mit.edu/papers/FrigoLePr12.pdf).
#include <stdio.h>
#include <sys/time.h>

#define SIZE 16384

float a[SIZE][SIZE], b[SIZE][SIZE];

void testNormalTranspose() {
    int i, j;
    b[0][9999] = 1.0;
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            a[i][j] = b[j][i];
}

void testTiledTranspose() {
    int i, j;
    b[0][9999] = 1.0;
    int blocksize = 16;
    for (i = 0; i < SIZE; i += blocksize) {
        for (j = 0; j < SIZE; j += blocksize) {
            for (int ii = i; ii < i + blocksize; ++ii) {
                for (int jj = j; jj < j + blocksize; ++jj) {
                    a[ii][jj] = b[jj][ii];
                }
            }
        }
    }
}

int main()
{
    struct timeval t1, t2;
    /*
    gettimeofday(&t1, NULL);
    testNormalTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Normal transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    */
    gettimeofday(&t1, NULL);
    testTiledTranspose();
    gettimeofday(&t2, NULL);
    printf("Time for the Tiled transpose is %ld milliseconds\n",
           (t2.tv_sec - t1.tv_sec) * 1000 +
           (t2.tv_usec - t1.tv_usec) / 1000);
    printf("%f\n", a[9999][0]);
}
Loop tiling helps when the data is reused. If you use an element SIZE times, you'd better use it SIZE times and only then proceed to the next element.
Unfortunately, when transposing a 2D matrix you are not reusing any elements of either matrix a or b. Even more, since the loop mixes row and column access (i.e. a[i][j] = b[j][i]), you will never get unit-stride memory access on both a and b at the same time, but only on one of them.
So, in this case tiling is not that efficient, but still you might have some performance improvements even with "random" memory access if:
the element you are accessing now is on the same cache line with an element you were accessing previously AND
that cache line is still available.
So, to see any improvement, the memory footprint of these "random" accesses must fit into the cache of your system. Basically, this means you have to choose blocksize carefully: the 16 you have chosen in the example might work better on one system and worse on another. A block-size-parameterized transpose, handy for sweeping values, is sketched below.
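For reference, here is a minimal sketch (my own, not part of the posted benchmark) of the same tiled transpose with the block size as a runtime parameter, so different values can be timed on a given machine:
/* Tiled transpose with a runtime block size.
   Assumes SIZE is a multiple of blocksize, as in the code above,
   and reuses the global arrays a and b from the question. */
void testTiledTransposeParam(int blocksize) {
    for (int i = 0; i < SIZE; i += blocksize)
        for (int j = 0; j < SIZE; j += blocksize)
            for (int ii = i; ii < i + blocksize; ++ii)
                for (int jj = j; jj < j + blocksize; ++jj)
                    a[ii][jj] = b[jj][ii];
}

/* Example sweep: for (int bs = 2; bs <= 128; bs *= 2) { time testTiledTransposeParam(bs); } */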
Here are the results from my computer for different power of 2 block sizes and SIZE 4096:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 32052765 ns 32051761 ns 21
tiled_transpose_2d/2 22246701 ns 22245867 ns 31
tiled_transpose_2d/4 16912984 ns 16912487 ns 41
tiled_transpose_2d/8 16284471 ns 16283974 ns 43
tiled_transpose_2d/16 16604652 ns 16604149 ns 42
tiled_transpose_2d/32 23661431 ns 23660226 ns 29
tiled_transpose_2d/64 32260575 ns 32259564 ns 22
tiled_transpose_2d/128 32107778 ns 32106793 ns 22
fixed_tile_transpose_2d 16735583 ns 16729876 ns 41
As you can see, the version with blocksize 8 works best for me and almost doubles the performance.
Here are the results for SIZE 4131 and power-of-3 block sizes:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
transpose_2d 29875351 ns 29874381 ns 23
tiled_transpose_2d/3 30077471 ns 30076517 ns 23
tiled_transpose_2d/9 20420423 ns 20419499 ns 35
tiled_transpose_2d/27 13470242 ns 13468992 ns 51
tiled_transpose_2d/81 11318953 ns 11318646 ns 61
tiled_transpose_2d/243 10229250 ns 10228884 ns 65
fixed_tile_transpose_2d 10217339 ns 10217066 ns 67
Regarding the 16384 size issue: I cannot reproduce it, i.e. I still see the same gain for the big matrix. Just note that 16384 * 16384 * sizeof(float) is 1 GB per array (2 GB for the two arrays together), which might expose some system issues...
I have a decent understanding of x86 assembly and I know that when a function is called all the arguments are pushed onto the stack.
I have a function which basically loops through an 8 by 8 array and calls some functions based on the values in the array. Each of these function calls involves 6-10 arguments being passed. This program takes a very long time to run; it is a chess AI, and this function takes 20% of the running time.
So I guess my question is: what can I do to give my functions access to the variables they need in a faster way?
int row, col, i;
determineCheckValidations(eval_check, b, turn);
int *eval_check_p = &(eval_check[0][0]);

for (row = 0; row < 8; row++) {
    for (col = 0; col < 8; col++, eval_check_p++) {
        if (b->colors[row][col] == turn) {
            int type = b->types[row][col];
            if (type == PAWN)
                findPawnMoves(b, moves_found, turn, row, col, last_move, *eval_check_p);
            else if (type == KNIGHT)
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_knight, 8, *eval_check_p);
            else if (type == BISHOP)
                findMappedIters(b, moves_found, turn, row, col, *move_map_bishop, 4, *eval_check_p);
            else if (type == ROOK)
                findMappedIters(b, moves_found, turn, row, col, *move_map_rook, 4, *eval_check_p);
            else if (type == QUEEN)
                findMappedIters(b, moves_found, turn, row, col, *move_map_queen, 8, *eval_check_p);
            else if (type == KING) {
                findMappedNoIters(b, moves_found, turn, row, col, *move_map_king, 8, *eval_check_p);
                findCastles(b, moves_found, turn, row, col);
            }
        }
    }
}
All the code can be found at https://github.com/AndyGrant/JChess/tree/master/_Core/_Scripts
A sample of the profile:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
20.00      1.55     1.55   2071328     0.00     0.00  findAllValidMoves
14.84      2.70     1.15  10418354     0.00     0.00  checkMove
10.06      3.48     0.78   1669701     0.00     0.00  encodeBoard
 7.23      4.04     0.56  10132526     0.00     0.00  findMappedIters
 6.84      4.57     0.53   1669701     0.00     0.00  getElement
 6.71      5.09     0.52  68112169     0.00     0.00  createNormalMove
You have done good work on the profiling. Now take the function with the worst cost and profile it in more detail.
You may want to try different compiler optimization settings when you profile.
Try some common optimization techniques, such as loop unrolling and factoring invariants out of loops (see the sketch below).
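As an illustration of factoring invariants out of the inner loop, here is a sketch based on the loop you posted (the element type of b->colors and b->types is an assumption on my part; adjust it to your board struct, and measure, since the compiler may already do this for you):
for (row = 0; row < 8; row++) {
    const int *row_colors = b->colors[row];   /* loop invariants of the col loop, hoisted */
    const int *row_types  = b->types[row];
    for (col = 0; col < 8; col++, eval_check_p++) {
        if (row_colors[col] != turn)
            continue;                          /* early out keeps the nesting flat */
        int type = row_types[col];
        /* ... dispatch on type exactly as before ... */
    }
}
A switch on type instead of the if/else chain may also let the compiler emit a jump table, though the effect is usually small next to the work done inside the move generators.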
You may get some improvements by designing your functions with the processor's data cache in mind. Search the web for "optimizing data cache".
If the function works correctly, I recommend posting it to CodeReview.StackExchange.com.
Don't assume anything.
Windows 7, NVidia GeForce 425M.
I wrote a simple CUDA code which calculates the row sums of a matrix.
The matrix has a one-dimensional representation (a pointer to float).
The serial version of code is below (it has 2 loops, as expected):
void serial_rowSum(float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0; i < nrow; i++) {
        sum = 0;
        for (int j = 0; j < ncol; j++)
            sum += m[i*ncol + j];
        output[i] = sum;
    }
}
Inside the CUDA code, I call the kernel function that sweeps the matrix by rows. Below is the kernel call snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock); // has to be multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow/(float) nThreadsPerBlock));
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
and the kernel function which performs the parallel sum of the rows (still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0;
        for (int k = 0; k < ncol; k++)
            sum += m[rowIdx*ncol + k];
        s[rowIdx] = sum;
    }
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice as long as the serial one to compute, even if I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (the maximum number of threads per block allowed on my card).
IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different nThreadsPerBlock. Time reported in msec over an average of 100 samples:
Matrix: nrow = 90000 x ncol = 1000
Serial: Average Time Elapsed per Sample in msec (100 samples): 289.18.
CUDA (32 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 497.11.
CUDA (1024 ThreadsPerBlock): Average Time Elapsed per Sample in msec (100 samples): 699.66.
To be clear, the version with 32 threads per block is the fastest and the one with 1024 is the slowest.
I understand that there is a kind of overhead when copying from Host to Device and the other way around, but maybe the slowness is because I am not implementing the fastest code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code?
Can I get rid of the loop in the kernel function?
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am actually interested in AND/OR operations over rows whose values are in {0,1}, like rowAND/rowOR. That said, this doesn't allow me to exploit the cuBLAS trick of multiplying by a column vector of 1's, as suggested by some commentators.
EDIT 2
As suggested by other users and endorsed here:
FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS, use Thrust library instead and the magic comes.
Since you mentioned you need a general reduction algorithm, not only sum, I will try to give 3 approaches here. The kernel approach may have the highest performance. The Thrust approach is easiest to implement. The cuBLAS approach works only with sum but has good performance.
Kernel Approach
Here's a very good doc introducing how to optimize a standard parallel reduction. A standard reduction can be divided into 2 stages:
Multiple thread blocks each reduce one part of the data;
One thread block reduces the results of stage 1 to the final single element.
For your multi-reduction problem (reducing the rows of a matrix), stage 1 alone is enough. The idea is to reduce 1 row per thread block. For further considerations like multiple rows per thread block or 1 row per multiple thread blocks, you can refer to the paper provided by @Novak. This may improve performance further, especially for matrices with a bad shape. A minimal sketch of the one-row-per-block idea follows.
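To make the stage-1 idea concrete, here is a minimal sketch of a one-row-per-thread-block sum (my own, not taken from the linked doc); for AND/OR you would replace the += with the desired operation:
// Block b reduces row b.  Assumes blockDim.x is a power of two (e.g. 256)
// and the kernel is launched with gridDim.x == nrow.  m is row-major.
__global__ void rowReduceKernel(const float *m, float *out, int nrow, int ncol) {
    extern __shared__ float sdata[];              // blockDim.x floats
    int row = blockIdx.x;
    if (row >= nrow) return;

    // Each thread accumulates a strided part of the row (coalesced reads).
    float acc = 0.0f;
    for (int k = threadIdx.x; k < ncol; k += blockDim.x)
        acc += m[row * ncol + k];                 // replace += with AND/OR as needed
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[row] = sdata[0];
}

// Launch sketch: rowReduceKernel<<<nrow, 256, 256 * sizeof(float)>>>(d_m, d_out, nrow, ncol);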
Thrust Approach
A general multi-reduction can be done with thrust::reduce_by_key in a few minutes. You can find some discussion here: Determining the least element and its position in each matrix column with CUDA Thrust.
However, thrust::reduce_by_key does not assume each row has the same length, so you will pay a performance penalty. Another post, How to normalize matrix columns in CUDA with max performance?, gives a profiling comparison between thrust::reduce_by_key and the cuBLAS approach for the sum of rows. It may give you a basic understanding of the performance.
cuBLAS Approach
The sum of the rows/columns of a matrix A can be seen as a matrix-vector multiplication in which the elements of the vector are all ones. It can be represented by the following MATLAB code:
y = A * ones(size(A,2),1);
where y is the sum of the rows of A.
The cuBLAS library provides a high-performance matrix-vector multiplication routine, cublas<t>gemv(), for this operation.
Timing results show that this routine is only 10~50% slower than simply reading all the elements of A once, which can be seen as the theoretical upper limit of the performance of this operation. A minimal call sketch is below.
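For a row-major nrow x ncol matrix such as the d_m in the question, a sketch of that call could look as follows (cuBLAS is column-major, so the row-major matrix is handed over as its transpose and CUBLAS_OP_T is used; d_ones and d_row_sums are assumed to be device buffers of length ncol and nrow):
// y = A * ones, i.e. the vector of row sums.
cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;
cublasSgemv(handle, CUBLAS_OP_T,
            ncol, nrow,            // dimensions of the column-major view of d_m
            &alpha, d_m, ncol,     // leading dimension = ncol
            d_ones, 1,
            &beta, d_row_sums, 1);
cublasDestroy(handle);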
Reducing the rows of a matrix can be solved by using CUDA Thrust in three ways (they may not be the only ones, but addressing that point is beyond the scope here). As also recognized by the OP, using CUDA Thrust is preferable for this kind of problem. An approach using cuBLAS is also possible.
APPROACH #1 - reduce_by_key
This is the approach suggested at this Thrust example page. It includes a variant using make_discard_iterator.
APPROACH #2 - transform
This is the approach suggested by Robert Crovella at CUDA Thrust: reduce_by_key on only some values in an array, based off values in a “key” array.
APPROACH #3 - inclusive_scan_by_key
This is the approach suggested by Eric at How to normalize matrix columns in CUDA with max performance?.
APPROACH #4 - cublas<t>gemv
It uses cuBLAS gemv to multiply the relevant matrix by a column of 1's.
THE FULL CODE
Here is the code condensing the four approaches above. The Utilities.cu and Utilities.cuh files are maintained here and omitted. The TimingGPU.cu and TimingGPU.cuh files are maintained here and omitted as well.
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
// --- Required for approach #2
__device__ float *vals;
/**************************************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX - NEEDED FOR APPROACH #1 */
/**************************************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T, T> {

    T Ncols; // --- Number of columns

    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}

    __host__ __device__ T operator()(T i) { return i / Ncols; }
};
/******************************************/
/* ROW_REDUCTION - NEEDED FOR APPROACH #2 */
/******************************************/
struct row_reduction {

    const int Ncols; // --- Number of columns

    row_reduction(int _Ncols) : Ncols(_Ncols) {}

    __device__ float operator()(float &x, int &y) {
        float temp = 0.f;
        for (int i = 0; i < Ncols; i++)
            temp += vals[i + (y * Ncols)];
        return temp;
    }
};
/**************************/
/* NEEDED FOR APPROACH #3 */
/**************************/
template<typename T>
struct MulC : public thrust::unary_function<T, T> {
    T C;
    __host__ __device__ MulC(T c) : C(c) {}
    __host__ __device__ T operator()(T x) { return x * C; }
};
/********/
/* MAIN */
/********/
int main()
{
    const int Nrows = 5; // --- Number of rows
    const int Ncols = 8; // --- Number of columns

    // --- Random uniform integer distribution between 10 and 99
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist(10, 99);

    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist(rng);

    TimingGPU timerGPU;

    /***************/
    /* APPROACH #1 */
    /***************/
    timerGPU.StartCounter();
    // --- Allocate space for row sums and indices
    thrust::device_vector<float> d_row_sums(Nrows);
    thrust::device_vector<int>   d_row_indices(Nrows);

    // --- Compute row sums by summing values with equal row indices
    //thrust::reduce_by_key(thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)),
    //                      thrust::make_transform_iterator(thrust::counting_iterator<int>(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
    //                      d_matrix.begin(),
    //                      d_row_indices.begin(),
    //                      d_row_sums.begin(),
    //                      thrust::equal_to<int>(),
    //                      thrust::plus<float>());

    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
        d_matrix.begin(),
        thrust::make_discard_iterator(),
        d_row_sums.begin());

    printf("Timing for approach #1 = %f\n", timerGPU.GetCounter());

    // --- Print result
    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums[i] << "\n";
    }
    /***************/
    /* APPROACH #2 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_2(Nrows, 0);
    float *s_vals = thrust::raw_pointer_cast(&d_matrix[0]);
    gpuErrchk(cudaMemcpyToSymbol(vals, &s_vals, sizeof(float *)));
    thrust::transform(d_row_sums_2.begin(), d_row_sums_2.end(), thrust::counting_iterator<int>(0), d_row_sums_2.begin(), row_reduction(Ncols));

    printf("Timing for approach #2 = %f\n", timerGPU.GetCounter());

    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_2[i] << "\n";
    }
    /***************/
    /* APPROACH #3 */
    /***************/
    timerGPU.StartCounter();
    thrust::device_vector<float> d_row_sums_3(Nrows, 0);
    thrust::device_vector<float> d_temp(Nrows * Ncols);
    thrust::inclusive_scan_by_key(
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)),
        thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols)) + (Nrows*Ncols),
        d_matrix.begin(),
        d_temp.begin());
    thrust::copy(
        thrust::make_permutation_iterator(
            d_temp.begin() + Ncols - 1,
            thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))),
        thrust::make_permutation_iterator(
            d_temp.begin() + Ncols - 1,
            thrust::make_transform_iterator(thrust::make_counting_iterator(0), MulC<int>(Ncols))) + Nrows,
        d_row_sums_3.begin());

    printf("Timing for approach #3 = %f\n", timerGPU.GetCounter());

    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_3[i] << "\n";
    }
    /***************/
    /* APPROACH #4 */
    /***************/
    cublasHandle_t handle;

    timerGPU.StartCounter();
    cublasSafeCall(cublasCreate(&handle));

    thrust::device_vector<float> d_row_sums_4(Nrows);
    thrust::device_vector<float> d_ones(Ncols, 1.f);

    float alpha = 1.f;
    float beta  = 0.f;
    cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Ncols, Nrows, &alpha, thrust::raw_pointer_cast(d_matrix.data()), Ncols,
                               thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_row_sums_4.data()), 1));

    printf("Timing for approach #4 = %f\n", timerGPU.GetCounter());

    for (int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for (int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "] = " << d_row_sums_4[i] << "\n";
    }

    return 0;
}
TIMING RESULTS (tested on a Kepler K20c)
Matrix size #1 #1-v2 #2 #3 #4 #4 (no plan)
100 x 100 0.63 1.00 0.10 0.18 139.4 0.098
1000 x 1000 1.25 1.12 3.25 1.04 101.3 0.12
5000 x 5000 8.38 15.3 16.05 13.8 111.3 1.14
100 x 5000 1.25 1.52 2.92 1.75 101.2 0.40
5000 x 100 1.35 1.99 0.37 1.74 139.2 0.14
It seems that approaches #1 and #3 outperform approach #2, except in the case of a small number of columns. The best approach, however, is approach #4, which is significantly faster than the others, provided that the time needed to create the plan (the cuBLAS handle) can be amortized over the computation.
If this is the extent (summing the rows) of the operations you need to do with this data, I wouldn't expect a sizable benefit from the GPU. You have exactly one arithmetic operation per data element, and for that you are paying the cost of transferring that data element to the GPU. And beyond a certain problem size (whatever it takes to keep the machine busy) you get no added benefit from larger problem sizes, because the arithmetic intensity is O(n).
So this isn't a particularly exciting problem to solve on the GPU.
But as talonmies has indicated, you have a coalescing problem in the way you have crafted it, which will further slow things down. Let's take a look at a small example:
C1 C2 C3 C4
R1 11 12 13 14
R2 21 22 23 24
R3 31 32 33 34
R4 41 42 43 44
Above is a simple pictorial example of a small portion of your matrix. The machine data storage is such that elements (11), (12), (13), and (14) are stored in adjacent memory locations.
For coalesced access, we want an access pattern such that adjacent memory locations are requested from the same instruction, executed across the warp.
We need to think about execution of your code from the standpoint of a warp, that is 32 threads executing in lock-step. What is your code doing? Which elements is it retrieving (asking for) at each step/instruction? Let's take a look at this line of code:
sum+=m[rowIdx*ncol+k];
Adjacent threads in the warp have adjacent (i.e. consecutive) values of rowIdx, as you have created that variable. So when k = 0, which data element is each thread asking for when it tries to retrieve the value m[rowIdx*ncol+k]?
In block 0, thread 0 has a rowIdx of 0. Thread 1 has a rowIdx of 1, etc. So the values being asked for by each thread at this instruction are:
Thread: Memory Location: Matrix Element:
0 m[0] (11)
1 m[ncol] (21)
2 m[2*ncol] (31)
3 m[3*ncol] (41)
But this is not coalesced access! Elements (11), (21), etc. are not adjacent in memory. For coalesced access, we would like that Matrix Element row to read like this:
Thread: Memory Location: Matrix Element:
0 m[?] (11)
1 m[?] (12)
2 m[?] (13)
3 m[?] (14)
If you then work backwards to determine what the value of ? should be, you will come up with an instruction something like this:
sum+=m[k*ncol+rowIdx];
This will give coalesced access, but it will not give you the correct answer, because we are now summing matrix columns instead of matrix rows. We can fix this by re-organizing your data storage to be column-major rather than row-major (you should be able to google that for ideas). Conceptually, this is equivalent to transposing your matrix m. Whether this is convenient for you to do is outside the scope of your question as I see it, and not really a CUDA issue; it may be a simple thing to do as you create the matrix on the host or transfer it from host to device. But in summary, I don't know of a way to sum the matrix rows with 100% coalesced access if the matrix is stored in row-major order. (You could resort to a sequence of row-reductions, but that looks painful to me.) A sketch of the column-major variant of your kernel follows.
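For illustration only (it assumes you can build or transfer m in column-major order, which is not your current layout), the kernel would then read:
// m stored column-major: element (r, c) lives at m[c*nrow + r].
// Adjacent threads (consecutive rowIdx) now read adjacent addresses: coalesced.
__global__ void kernel_rowSum_colmajor(const float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0.0f;
        for (int k = 0; k < ncol; k++)
            sum += m[k * nrow + rowIdx];   // unit stride across the warp
        s[rowIdx] = sum;
    }
}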
It's not uncommon, when we are thinking about ways to accelerate code on the GPU, to consider re-organizing our data storage to facilitate the GPU. This is one example.
And, yes, what I'm outlining here still retains a loop in the kernel.
As an additional comment, I would suggest timing the data-copy portions and the kernel (compute) portion separately. I can't tell from your question whether you are timing just the kernel or the entire GPU operation, including the data copies. If you time the data copies separately, you may discover that the data-copy time alone exceeds your CPU time, and any effort put into optimizing your CUDA code will not affect that. This might be a useful data point before you spend much time on this; one way to split the timings is sketched below.
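A sketch of separating the timings with CUDA events (h_m and h_output are placeholder names for the host buffers; the device names match the question):
cudaEvent_t t0, t1, t2, t3;
cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2); cudaEventCreate(&t3);

cudaEventRecord(t0);
cudaMemcpy(d_m, h_m, nrow * ncol * sizeof(float), cudaMemcpyHostToDevice);      // H2D copy
cudaEventRecord(t1);
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);   // compute
cudaEventRecord(t2);
cudaMemcpy(h_output, d_output, nrow * sizeof(float), cudaMemcpyDeviceToHost);   // D2H copy
cudaEventRecord(t3);
cudaEventSynchronize(t3);

float msH2D, msKernel, msD2H;
cudaEventElapsedTime(&msH2D,    t0, t1);
cudaEventElapsedTime(&msKernel, t1, t2);
cudaEventElapsedTime(&msD2H,    t2, t3);
printf("H2D %.3f ms, kernel %.3f ms, D2H %.3f ms\n", msH2D, msKernel, msD2H);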
I just completed a computer graphics course, where we had to program a ray tracer. Though all the results were correct, I was confused about the use of OpenMP (which BTW was not part of the course). I have this loop (C++):
#pragma omp parallel for private(L, ray)
// for (x = x_from; x < x_till; x++) {
//     printf("Col: %5d\n", x);
//     for (y = y_from; y < y_till; y++) {
for (int xy = 0; xy < xy_range; xy++) {
    int x = x_from + (xy % x_width);
    int y = y_from + (xy / x_width);
    ray = cam->get_ray_at(x, y);
    L = trace_ray(ray, 0, cam->inter);
    #pragma omp critical
    cam->set_pixel(x, y, L);
}
// }
}
I tried many configurations, but what finally confuses me the most is that the above version, with a combined single for loop, was the least efficient of all (150 seconds vs. 120 s for separate x and y loops). The 'critical' does not noticeably change the timing.
More: though I would expect the single for loop to parallelize each separate iteration, it doesn't. Using this method, 25 loops are executed as groups of 8 - 8 - 8 - 1 (8 cores). In fact the separate y loops (commented out in the listing) seem to distribute the load more efficiently. Removing the 'for' in 'parallel for' does improve things slightly (148 vs. 150 s ;).
Also, I tried local vs. global definitions (with the necessary private pragmas), and I tried declaring L and ray inside the loops. All to no avail...
I'd appreciate suggestions or pointers...
Here are some more precise data:
Single loop Yes No No Yes
'Critical" No No Yes Yes
---------------------- ---------------------- ---------------------- ----------------------
User CPU Mean User CPU Mean User CPU Mean User CPU Mean
Scene 5 37.9 158.9 3.66 26.5 185.5 7.00 27.0 187.7 6.95 38.7 161.8 4.18
Scene 6 18.8 110 5.85 17.7 112 6.32 18.1 113.8 5.29 19.4 112.2 5.78
Scene 7 149 658.8 4.42 114 679.9 5.96 114 653.8 5.73 149 659.8 4.43
Plane 112.0 497.3 4.44 105 520.5 4.95 103.8 525 5.06 113.5 504.8 4.45
5-balls 126 760.2 6.03 162.3 697.5 4.36 170.3 725.3 4.23 127.3 766.5 6.02
'Mean' is CPU/User, which is the mean core occupation. Note that in several cases, mean is only 4.xx.
Solution, and results:
Single loop Yes No
---------------------- ----------------------
User CPU Mean User CPU Mean
Scene 5 23.9 190.1 7.95 24.4 190.7 7.82
Scene 6 14.3 114.2 7.98 14.5 114.9 7.92
Scene 7 85.5 675.9 7.91 106.9 698.8 6.54
Plane 72.7 579.1 7.97 72.6 578.4 7.97
5-balls 104.8 823.3 7.86 103.9 825.1 7.94
This excellent result is obtained by adding schedule(dynamic, 1) to the #pragma omp parallel for line, like this:
#pragma omp parallel for schedule(dynamic, 1)
which results in run-time load distribution across the cores (as opposed to compile-time distribution).
Just one more note: the ', 1' parameter limits the size of the chunks. It can be left out, in which case OpenMP uses a default value. Maybe adding the 1 made the load distribution too fine-grained, but I cannot find any performance difference either way in this case. I guess each ray-tracing chunk is slow enough to hide any administrative overhead.
I have written a Whitted-style ray tracer that operates on the full ray tree (reflection and refraction) in OpenCL. I have not done it with OpenMP yet, but that's my next goal. If you want to learn OpenMP I would start with some simpler tasks first. But let me make a few comments.
How are you doing your timing? You wrote "Removing the 'for' in 'parallel for' does improve slightly". That makes no sense. Removing the for will run the same code on every thread rather than distribute the iterations across the threads (do some hello-world tests to see this). It should be slower, not faster. That makes me wonder how you do the timing. I added some code to show how to do the timing.
You should not have to use critical. If each iteration writes to a different pixel then it should not be necessary. Depending on the complexity of your scene, critical could make it much slower.
Lastly, to get the best performance you're going to want to use SSE/AVX as well and operate on multiple pixels at once. This can be done through what's called packet-based ray tracing. See the following link for a good discussion of this: http://graphics.stanford.edu/~boulos/papers/cook_gi07.pdf
Edit: Since each pixel can take a different amount of time, you want to use schedule(dynamic) rather than schedule(static), which is normally (but not necessarily) the default. See the code.
Ingo Wald's PhD thesis:
http://www.sci.utah.edu/~wald/PhD/
double dtime = omp_get_wtime();
#pragma omp parallel
{
    Ray ray;
    Color L;
    #pragma omp for schedule(dynamic)
    for (int xy = 0; xy < xy_range; xy++) {
        int x = x_from + (xy % x_width);
        int y = y_from + (xy / x_width);
        ray = cam->get_ray_at(x, y);
        L = trace_ray(ray, 0, cam->inter);
        cam->set_pixel(x, y, L);
    }
}
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
I'm trying to implement a gaussian distributed random number generator in the interval [0,1].
float rand_gauss(void) {
    float v1, v2, s;

    do {
        v1 = 2.0 * ((float) rand() / RAND_MAX) - 1;
        v2 = 2.0 * ((float) rand() / RAND_MAX) - 1;
        s = v1*v1 + v2*v2;
    } while (s >= 1.0);

    if (s == 0.0)
        return 0.0;
    else
        return (v1 * sqrt(-2.0 * log(s) / s));
}
It's pretty much a straightforward implementation of the algorithm in Knuth's TAOCP, volume 2, 3rd edition, page 122.
The problem is that rand_gauss() sometimes returns values outside the interval [0,1].
Knuth describes the polar method on p. 122 of volume 2 of TAOCP. That algorithm generates a normal distribution with mean = 0 and standard deviation = 1, but you can adjust that by multiplying by the desired standard deviation and adding the desired mean, as in the sketch below.
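A minimal sketch of that adjustment, reusing the rand_gauss() from the question (mu and sigma are whatever mean and standard deviation you want):
/* Standard normal rescaled to mean mu and standard deviation sigma. */
float rand_gauss_scaled(float mu, float sigma) {
    return mu + sigma * rand_gauss();
}
Note that a true Gaussian has unbounded support, so no choice of mu and sigma confines it to [0,1]; occasional values outside that interval are expected unless you clamp, truncate, or rescale explicitly.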
You might find it fun to compare your code to another implementation of the polar method in the C-FAQ.
Change your loop condition to (s >= 1.0 || s == 0.0). Better yet, use a break, as seen in the following example of a SIMD Gaussian random number generator returning a complex pair (u, v). This uses the Mersenne Twister random number generator dsfmt(). If you only want a single, real random number, return only u and save v for the next pass.
inline static void randn(double *u, double *v)
{
    double s, x, y;   // SIMD Marsaglia polar version for complex u and v
    while (1) {
        x = 2.0 * dsfmt_genrand_close_open(&dsfmt) - 1.0;   // uniform in [-1, 1)
        y = 2.0 * dsfmt_genrand_close_open(&dsfmt) - 1.0;
        s = x*x + y*y;
        if (s > 0.0 && s < 1.0) break;                      // reject s == 0 and s >= 1
    }
    s = sqrt(-2.0 * log(s) / s);
    *u = x * s;
    *v = y * s;
    return;
}
This algorithm is surprisingly fast. Execution times for computing two random numbers (u, v) with four different Gaussian random number generators are:
Times for delivering two Gaussian numbers (u + iv)
i7-2600K @ 4 GHz, gcc -Wall -Ofast -msse2 ..
gsl_ziggurat = 20.3 (ns)
Box-Muller = 78.8 (ns)
Box-Muller with fast_sin fast_cos = 28.1 (ns)
SIMD Marsaglia polar = 35.0 (ns)
The fast_sin and fast_cos polynomial routines of Charles K. Garrett speed up the Box-Muller computation by a factor of 2.9 using nested polynomial implementations of cos() and sin(). The SIMD Box-Muller and polar algorithms are certainly competitive, and they can be parallelized easily. Using gcc -Ofast -S, the assembly dump shows that the square root uses the SIMD SSE2 instruction: sqrt --> sqrtsd %xmm0, %xmm0. A plain Box-Muller sketch, showing where those sin/cos calls sit, follows.
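For reference, here is a minimal sketch of the standard Box-Muller transform (mine, included only to show where the sin/cos calls that fast_sin/fast_cos replace come from); rand() is used for brevity, whereas the timed version above uses dSFMT:
#include <math.h>
#include <stdlib.h>

/* Standard Box-Muller: two independent N(0,1) values per call. */
static void randn_box_muller(double *u, double *v)
{
    const double two_pi = 6.283185307179586;
    double r1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* in (0,1), avoids log(0) */
    double r2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double radius = sqrt(-2.0 * log(r1));
    double angle  = two_pi * r2;
    *u = radius * cos(angle);   /* the cos/sin pair dominates the cost ... */
    *v = radius * sin(angle);   /* ... and is what fast_cos/fast_sin speed up */
}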
Comment: it is really hard and frustrating to get accurate timings with gcc 5, but I think these are OK (as of 2/3/2016, DLW).
[1] Related link: c malloc array pointer return in cython
[2] A comparison of algorithms, but not necessarily for SIMD versions: http://www.doc.ic.ac.uk/~wl/papers/07/csur07dt.pdf
[3] Charles K. Garrett: http://krisgarrett.net/papers/l2approx.pdf