I am trying to implement a rather simple averaging during transformation of an image. I already implemented the transformation successfully, but now I have to process the resulting image by summing up the pixels of every 5x5-pixel rectangle. My idea was to increment a counter for each such 5x5 block whenever a pixel in that block is set. However, these block counters are incremented far too rarely. So for debugging I checked how often any pixel of one such block is hit at all:
int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;
if ((x < 5) && (y < 5))
{
    resultArray[0]++;
}
The kernel is called like this:
dim3 threadsPerBlock(8, 8);
dim3 grid(targetAreaRect_px._uiWidth / threadsPerBlock.x, targetAreaRect_px._uiHeight / threadsPerBlock.y);
CudaTransformAndAverageImage<<<grid, threadsPerBlock>>>(pcPreRasteredImage_dyn, resultArray);
I would expect resultArray[0] to contain 25 after kernel execution, but it only contains 1. Is this due to some optimization by the CUDA compiler?
This:
if ((x < 5) && (y < 5))
{
    resultArray[0]++;
}
is a read after write hazard.
All of the threads which satisfy (x<5)&&(y<5) can potentially attempt simultaneous reads and writes from resultArray[0]. The CUDA execution model does not guarantee anything about the order of simultaneous memory transactions.
You could make this work by using atomic memory transactions, for example:
if ((x < 5) && (y < 5)) {
    atomicAdd(&resultArray[0], 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction type calculation and then sum the block local sums atomically or on the host, or in a second kernel.
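For illustration, here is a minimal sketch of the block-local approach, using a shared-memory counter per block rather than a full tree reduction (assumptions: the transformed image is available as an 8-bit mask in global memory, and the kernel name and parameters are illustrative, not your actual transform):

__global__ void CudaCountSetPixels(const unsigned char *img, int width, int height,
                                   unsigned int *resultArray)
{
    __shared__ unsigned int blockSum;   // one counter per thread block

    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;

    if (threadIdx.x == 0 && threadIdx.y == 0)
        blockSum = 0;
    __syncthreads();

    // shared-memory atomics are much cheaper than global ones
    if (x < width && y < height && img[y * width + x] != 0)
        atomicAdd(&blockSum, 1u);
    __syncthreads();

    // a single global atomic per block instead of one per thread
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(&resultArray[blockIdx.y * gridDim.x + blockIdx.x], blockSum);
}

Here resultArray holds one sum per block; the final total can then be accumulated on the host or in a second, tiny kernel.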
I would like to apply a pretty straightforward calculation to an n-by-d array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my program? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
double * calcDistance(double * X, int n, int d)
{
    //calculate and return an array of the n-1 distances
    //from the last point
    double *distances = calloc(n, sizeof(double));
    for (int i = 0; i < n-1; i++)
    {
        //distances[i] = 0; -- not needed, calloc already zeroes the array
        for (int j = 0; j < d; j++)
        {
            distances[i] += pow(X[(j+1)*n-1] - X[j*n+i], 2);
        }
        distances[i] = sqrt(distances[i]);
    }
    return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   /* time() for srand() */
#include <math.h>   /* pow(), sqrt() used by calcDistance() */

#define N 10 //00000
#define D 2

int main()
{
    srand(time(NULL));
    //allocate the proper space for X
    double *X = malloc(D*N*sizeof(double));
    //fill X with numbers in space (0,1)
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < D; j++)
        {
            X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
        }
    }
    double *distances = calcDistance(X, N, D);
    free(distances);
    free(X);
    return 0;
}
I have already tried using pthreads asynchronously, with a global_index protected by a mutex and a local_index per thread. Inside a while() loop, a local_index is assigned to each thread on each iteration; the assignment depends on the value of global_index at that time (both inside a mutual-exclusion block). The thread then performs the computation on the distances[local_index] element.
Unfortunately this implementation has led to a much slower program, with 10x or 20x the execution time of the sequential version cited above.
Another idea is to predetermine how to split the array (say into four equal parts) and assign the computation of each segment to a given pthread. I don't know whether that is a common, efficient procedure though.
Your inner loop jumps all over array X with a mixture of strides that varies with the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc..
How could I significantly reduce the execution time of my program?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
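Combining that with the static split described above, a minimal pthreads sketch might look like this (assumptions: T, struct seg, and worker are illustrative names not taken from the question; outer-loop iterations are assigned statically, and squares are computed by plain multiplication instead of pow):

#include <math.h>
#include <pthread.h>
#include <stdlib.h>

#define T 4  /* number of threads (illustrative) */

struct seg { double *X, *distances; int n, d, begin, end; };

/* each thread handles outer-loop iterations [begin, end) -- no mutex needed */
static void *worker(void *arg)
{
    struct seg *s = arg;
    for (int i = s->begin; i < s->end; i++)
    {
        double sum = 0.0;
        for (int j = 0; j < s->d; j++)
        {
            double diff = s->X[(j+1) * s->n - 1] - s->X[j * s->n + i];
            sum += diff * diff;   /* multiplication instead of pow(..., 2) */
        }
        s->distances[i] = sqrt(sum);
    }
    return NULL;
}

double * calcDistanceParallel(double *X, int n, int d)
{
    double *distances = calloc(n, sizeof(double));
    pthread_t tid[T];
    struct seg segs[T];
    for (int t = 0; t < T; t++)
    {
        segs[t] = (struct seg){ X, distances, n, d,
                                t * (n-1) / T, (t+1) * (n-1) / T };
        pthread_create(&tid[t], NULL, worker, &segs[t]);
    }
    for (int t = 0; t < T; t++)
        pthread_join(tid[t], NULL);
    return distances;
}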
I have already tried using pthreads asynchronously, with a global_index protected by a mutex and a local_index per thread. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
Another idea is to predetermine how to split the array (say into four equal parts) and assign the computation of each segment to a given pthread. I don't know whether that is a common, efficient procedure though.
This sounds like what I described. It is one of the standard scheduling models provided by OpenMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you split the work across five threads on a four-core machine, then one thread has to wait to run until one of the others has finished -- a best theoretical run-time reduction of 60%. Splitting the same computation across only four threads uses the compute resources more efficiently, for a best theoretical run-time reduction of about 75%.
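In OpenMP terms, that static split is a one-line schedule clause; a sketch, assuming the loop body of calcDistance and compilation with -fopenmp:

/* each thread gets one contiguous chunk of iterations -- no mutex needed */
#pragma omp parallel for schedule(static)
for (int i = 0; i < n-1; i++)
{
    double sum = 0.0;
    for (int j = 0; j < d; j++)
    {
        double diff = X[(j+1)*n - 1] - X[j*n + i];
        sum += diff * diff;
    }
    distances[i] = sqrt(sum);
}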
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of outweighing any gains from parallel execution.
cvCvtColor(img,dst,CV_RGB2YCrCb);
for (int col = 0; col < dst->width; col++)
{
    for (int row = 0; row < dst->height; row++)
    {
        int idxF = row*dst->widthStep + dst->nChannels*col; // Read the image data
        CvPoint pt = {row, col};
        temp_ptr2[0] += temp_ptr1[0]*0.0722 + temp_ptr1[1]*0.7152 + temp_ptr1[2]*0.2126; // channel Y
    }
}
But the result is wrong (result image omitted). Please assist: where am I going wrong?
There is a lot to say about this code sample:
First, you are using the old C-style API (IplImage pointers, cvBlah functions, etc), which is obsolete and more difficult to maintain (in particular, memory leaks are introduced easily), so you should consider using the C++-style structures and functions (cv::Mat structure and cv::blah functions).
Your error is probably coming from the instruction cvCopy(dst, img); at the very beginning: it overwrites your input image with the still-uninitialized dst just before you start your processing, so you should remove that line.
For maximum speed, you should invert the two loops, so that you first iterate over rows and then over columns. Images in OpenCV are stored row by row in memory, so walking along a row (increasing column index in the inner loop) accesses contiguous memory and makes much better use of the cache.
The temporary variable idxF is never used, so you should probably remove the following line too:
int idxF = row*dst->widthStep + dst->nChannels*col;
When you access image data to store the pixels in temp_ptr1 and temp_ptr2, you swapped the positions of the x and y coordinates. You should access the image in the following way:
temp_ptr1 = &((uchar*)(img->imageData + (img->widthStep*pt.y)))[pt.x*3];
You never release the memory allocated for dst, hence introducing a memory leak in your application. Call cvReleaseImage(&dst); at the end of your function.
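Putting those corrections together, a minimal sketch of the fixed loop (assumptions: img is a 3-channel 8-bit BGR image and dst is a same-sized image whose first channel receives the luma; only the computation shown in the question is reproduced):

for (int row = 0; row < img->height; row++)            // rows in the outer loop
{
    uchar *src = (uchar *)(img->imageData + row * img->widthStep);
    uchar *out = (uchar *)(dst->imageData + row * dst->widthStep);
    for (int col = 0; col < img->width; col++)         // contiguous columns inside
    {
        uchar *p = src + col * img->nChannels;         // B, G, R order
        // Rec. 709 luma with the question's coefficients
        out[col * dst->nChannels] =
            (uchar)(p[0] * 0.0722 + p[1] * 0.7152 + p[2] * 0.2126);
    }
}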
I am writing a basic program to add two matrices and record the time taken with a single thread and with 2 or more threads. In my approach, I first divide the two input matrices (initialized randomly) into THREADS segments, and then each segment is handed to the addition routine, which is started by a pthread_create call. The argument to the parallel addition function is the following:
struct thread_segment
{
    matrix_t *matrix1, *matrix2, *matrix3;
    int start_row, offset;
};
Pointers to two source matrices and one destination matrix. (One source and the destination may point to the same matrix.) start_row is the row from which the particular thread should start adding, and offset tells how many rows this thread should add, starting from start_row.
The matrix_t is a simple structure defined as below:
typedef struct _matrix_t
{
    TYPE **mat;
    int r, c;
} matrix_t;
I have compiled it with 2 threads, but there is (almost) no speedup when I run it on a 10000 x 10000 matrix. I am recording the running time with time -p program.
The random initialization of the matrices is also done in parallel, as above.
I think this is because all the threads work on the same matrix address area; maybe that creates a bottleneck which prevents any speedup, even though the threads work on disjoint segments of the matrix that don't overlap.
Previously I implemented a parallel mergesort and a quicksort which showed similar behaviour; there I was able to get a speedup when I copied the data segment on which a particular thread works into newly allocated memory.
My question is: is this because of
a memory bottleneck?
a timing benchmark not done in the proper way?
a dataset that is too small?
a coding error?
something else?
If it is a memory bottleneck, does every parallel program have to use an exclusive memory area, even when multiple threads can safely access shared memory without a mutex?
EDIT
I can see a speedup when I create the matrix segments like this:
curr = 0;
jump = matrix1->r / THREADS;
for (i = 0; i < THREADS; i++)
{
    th_seg[i].matrix1 = malloc (sizeof (matrix_t));
    th_seg[i].matrix1->mat = &(matrix1->mat[curr]);
    th_seg[i].matrix1->c = matrix1->c;
    th_seg[i].matrix1->r = jump;
    curr += jump;
}
That is, before passing the structure, I assign the base address of the sub-matrix to be processed by each thread and store its number of rows, so the base address is now different for each thread. But I only see the speedup when I add small matrices, say 100 x 100, many times; before calling the parallel add in each iteration, I reassign the random values. Is the speedup noticed here real, or is it due to some other phenomenon, such as caching effects?
To optimize memory usage, you may want to take a look at loop tiling. That helps the cache stay useful: you divide your matrices into smaller chunks so that the cache can hold the values for a longer time and does not need to be refilled as frequently.
Also notice that creating too many threads just increases the overhead of switching among them.
To get a feeling for how much a proper implementation can affect the run time of a concurrent program, these are the results of programs multiplying two matrices in naive, concurrent, and tiling-concurrent versions:
seconds name
10.72 simpleMul
5.16 mulThread
3.19 tilingMulThread
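For reference, a minimal sketch of the tiling idea as applied to multiplication (assumptions: row-major, contiguously stored n x n matrices of double; tiled_mul and TILE are illustrative names, and TILE should be tuned so that three tiles fit in cache):

#define TILE 64   /* tunable block size */

/* C += A * B; the loops visit one TILE x TILE block at a time so the
   operands stay resident in cache while they are being reused */
void tiled_mul(const double *A, const double *B, double *C, int n)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++)
                    {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}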
#pragma omp parallel for default(none) shared(x) private(y, z, f) ordered
for (i = 0; i < 512; i++) {
    #pragma omp ordered
    for (y = 0; y < 512; y++) {
        for (z = 0, f = 0; z < 512; z++) {
            x[f++] = z + i + y;
        }
    }
}
The above code runs about 20% slower than non-SMP execution on a dual core. Without the "#pragma omp ordered" it is about 50% faster than non-SMP.
I assume the x[f++] sequence has to remain in ordered form, since it is reused later in the same order.
Can ordered code be faster than single threading? Is there another method to achieve it?
System is win32/mingw-w64.
It's not really ordered, since the results of one iteration do not depend on the previous one, except for your use of f.
Can you derive f from i,y and z? It looks like you can. For example:
f = z + y * 512 + i * 512 * 512 + initial_f;
Now your code is unordered, and you can get real benefits from parallelization.
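A sketch of the transformed loop (assumptions: this answer's reading of the code, i.e. f is never reset and initial_f is 0, so x must hold 512*512*512 elements; size_t comes from <stddef.h>):

#pragma omp parallel for default(none) shared(x)
for (int i = 0; i < 512; i++) {
    for (int y = 0; y < 512; y++) {
        size_t base = ((size_t)i * 512 + y) * 512;   /* f derived from i and y */
        for (int z = 0; z < 512; z++) {
            x[base + z] = z + i + y;   /* no shared counter, no ordered clause */
        }
    }
}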
Single-threaded/single-core code is often faster than multi-threaded/multi-core code due to saturation of the memory system: the memory work required by the single thread is already close to, or at, the limit of what the memory system can deliver. Add another thread/core that requires the same work, and both must share what the memory system can provide, resulting in wait states and slower execution.
After profiling and optimizing the memory work, you may reach the point where the multi-threaded code is faster. The optimization requires moving data into non-shared memory (i.e. the L1 and L2 caches) and minimizing accesses to shared memory (L3 and RAM).
The optimization solution is more or less unique to the application at hand. It is not trivial (though some third-party SW vendors will try to say that with their product it's a piece of cake). Once you've done it you'll at least have learned what constructs should be avoided and what techniques are useful.
You are obviously relying on a shared vector x in the inner loop, and your ordered clause forces every iteration that writes to it to run one after another. No wonder the "parallel" version is slower than the sequential one.
It is difficult to advise you what to change, since your code makes no sense to me at all. What do you expect the result to be? If you use ordered, the final contents of x will be the values written by the last iteration, i = 511. If you don't, it is whichever thread wins for each individual entry.
And what the h... is your f supposed to do? When evaluated, it has the same value as z, no? This just adds noise that makes the code harder to understand.
I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i in [0, Ny] and j in [0, Nx].
Notice that this is different from matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?
EDIT
Results from the solution post of Drew Hall
All the results are from a 1024x1024 double loop in which each iteration assigns two double values (a complex number). The time is averaged over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for (i = 0; i < Nx; i++)
{
    for (j = 0; j < Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 2:
for (i = 0; i < Nx; i++)
{
    for (j = 0; j < Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 3:
for (i = 0; i < Nx; i++)
{
    for (j = 0; j < Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}
There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to minimize cache misses. Find out whether your data is stored in row-major or column-major order, and arrange your loops so that the inner loop iterates over elements stored contiguously in memory while the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M and N bounds) to a single loop iterating over the underlying data (an M*N bound). You'll need a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).
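A minimal sketch of that combination, using a remainder loop instead of the switch (assumptions: mat_exp is an illustrative name; a and b are raw pointers to the underlying row-major storage of n = Nx*Ny doubles; the question's complex case would use the same structure with two writes per element):

#include <math.h>
#include <stddef.h>

/* elementwise b[i] = exp(a[i]) over the flattened matrix */
void mat_exp(const double *a, double *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* unrolled by 4 to cut loop overhead */
        b[i]     = exp(a[i]);
        b[i + 1] = exp(a[i + 1]);
        b[i + 2] = exp(a[i + 2]);
        b[i + 3] = exp(a[i + 3]);
    }
    for (; i < n; i++)             /* remainder when n % 4 != 0 */
        b[i] = exp(a[i]);
}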
No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.
If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.
Since the contents of the loop (the part that calculates c_value) haven't been shown, we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that: it needs to be able to measure memory latency, i.e. the amount of time the CPU spends idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do beyond accessing memory sequentially. The CPU and memory work best when data is fetched sequentially; random accesses hurt throughput because data is more likely to have to be fetched from RAM into the cache. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one, as is hand-coding the floating point code (C/C++ compilers aren't great at FPU code, for many reasons). If this were me, and the code in the inner loop allows for it, I'd keep two pointers into the array, one at the start and a second 4/5ths of the way through it. On each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer, so that each iteration of the loop processes five values. Then I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches, since (at least on the Pentium) the hardware prefetcher can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).