I'm working through CUDA C Programming by Cheng, and came across this piece of code:
void sumMatrixOnHost (float *A, float *B, float *C, const int nx, const int ny) {
float *ia = A;
float *ib = B;
float *ic = C;
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[ix] = ia[ix] + ib[ix];
}
ia += nx; ib += nx; ic += nx;
}
}
This is for matrix addition whereby the matrices are stored in a row-major format.
As I understand, the inner for loop is iterating over a row and performing element addition, and the outer for loop is then used to increment the pointers to the start of the next row.
Why is this approach better than using pointers over the whole matrix, i.e.
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
or dual for loops, i.e.
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
Is this something to do with how it is optimized by the compiler?
The simplest approach is always the best approach:
for (int i=0; i<ny*nx; i++) {
C[i] = A[i] + B[i];
}
This will be faster than the first solution. The problem with splitting the matrix up by row is that the vectoriser will:
Process each row in batches of 32 bytes (the size of a YMM register).
Process the remaining handful of values at the end of the row.
Then repeat for each row!
If, however, you do it with a single loop, the generated code will:
Process all data in batches of 32 bytes (the size of a YMM register).
Process the remaining handful of values at the end of the matrix that don't fill a 32-byte block.
The first version just adds pointless code to process the inner loop. None of that code is needed, it just breaks the ability to vectorise the entire matrix.
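A minimal sketch of the two loop shapes compared above, assuming contiguous row-major storage. The names sumFlat and sumByRow are made up for illustration. Both produce identical results; whether the flat version is actually faster depends on the compiler, its flags and nx, so it is worth measuring on your own target.

```c
#include <stddef.h>

/* Single loop over the whole matrix: one vectorised body and one scalar tail. */
void sumFlat(const float *A, const float *B, float *C, int nx, int ny) {
    size_t n = (size_t)nx * (size_t)ny;   /* element count, computed once */
    for (size_t i = 0; i < n; i++) {
        C[i] = A[i] + B[i];
    }
}

/* Row-by-row version: the compiler may emit a vector body plus a tail per row. */
void sumByRow(const float *A, const float *B, float *C, int nx, int ny) {
    for (int iy = 0; iy < ny; iy++) {
        for (int ix = 0; ix < nx; ix++) {
            C[ix] = A[ix] + B[ix];
        }
        A += nx; B += nx; C += nx;        /* advance to the start of the next row */
    }
}
```

Calling both on the same inputs and comparing the outputs is a cheap sanity check before timing them.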
The approach in sumMatrixOnHost is better for optimization, and it should (generally) execute faster than the two approaches you have suggested.
On the ALU, multiplication takes more time than addition.
So in sumMatrixOnHost there is no multiplication, while in
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
there is a multiplication (ny*nx in the loop condition) in each iteration of the loop, unless the compiler hoists it out of the loop.
In
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
there are 3 multiplications in each iteration of the loop.
A simpler approach can be
int n = ny*nx;
for (int i=0; i<n; i++) {
ic[i] = ia[i] + ib[i];
}
but in the last approach we lose another thing that is good in sumMatrixOnHost, and that is the ability to do the operation on matrix blocks and not the whole matrix.
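A sketch of what "operating on matrix blocks" could look like with the row-pointer style. It assumes row-major storage and introduces a hypothetical parameter ld (the row stride, in elements, of the full matrix), which is not part of the original function.

```c
/* Adds an nx-wide, ny-tall block. A, B and C point at the block's top-left
   element; ld is the width of the full matrix the block lives in. */
void sumBlock(const float *A, const float *B, float *C,
              int nx, int ny, int ld) {
    for (int iy = 0; iy < ny; iy++) {
        for (int ix = 0; ix < nx; ix++) {
            C[ix] = A[ix] + B[ix];
        }
        A += ld; B += ld; C += ld;   /* jump a full row, not just nx elements */
    }
}
```

For a 4x4 matrix, sumBlock(A + 5, B + 5, C + 5, 2, 2, 4) would add only the 2x2 block whose top-left element is at row 1, column 1. A flat loop over ny*nx elements cannot express this, because the block is not contiguous in memory.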
I've been working on a small project for my college, with C and OpenMP. To make it short: when I try to parallelize a for loop using the #pragma omp parallel for construct, it ends up being way slower than the serial version, just by adding that directive. The code is a parallel version of odd-even sort that works on an array of integers.
I found it has something to do with the threads accessing the memory of the whole array each time they compare numbers and updating their own copy of it in cache. But I don't know how to fix it so that, rather than updating the whole array, they only touch the exact locations of the integers they are comparing. I'm fairly new to OpenMP, so I don't know if there's a clause or construct for this kind of situation.
//version without parallel for
void bubbleSortParalelo(int array[], int size) {
int i,j,first;
for (i = 0; i < size; i++){
first = i % 2;
for (j = first; j < size-1 ; j+= 2){
if (array[j] > array[j+1]){
int temp = array[j+1];
array[j+1]=array[j];
array[j]= temp;
}
}
}
}
//Version with parallel for, takes longer somehow
void bubbleSortParalelo2(int array[], int size) {
int i,j,first;
for (i = 0; i < size; i++){
first = i % 2;
#pragma omp parallel for
for (j = first; j < size-1 ; j+= 2){
if (array[j] > array[j+1]){
int temp = array[j+1];
array[j+1]=array[j];
array[j]= temp;
}
}
}
}
I want to make the parallel version at least as efficient as the serial one, because right now it takes about 10 times longer, and it gets worse the more threads I use.
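One possible contributor to this kind of slowdown is that a #pragma omp parallel for placed inside the outer loop opens and closes a parallel region in every phase. A sketch under that assumption, which opens the parallel region once and only shares the inner loop among the threads; the function name is made up, and nothing here has been measured. For small arrays the serial version may still win.

```c
/* Odd-even transposition sort with the parallel region opened once.
   Every thread runs the outer loop; the inner loop is shared out by omp for,
   whose implicit barrier keeps the phases in lockstep. */
void oddEvenSortHoisted(int array[], int size) {
    #pragma omp parallel
    {
        for (int i = 0; i < size; i++) {
            int first = i % 2;              /* even phase or odd phase */
            #pragma omp for
            for (int j = first; j < size - 1; j += 2) {
                if (array[j] > array[j + 1]) {
                    int temp = array[j + 1];
                    array[j + 1] = array[j];
                    array[j] = temp;
                }
            }
        }
    }
}
```

Within one phase the pairs (j, j+1) do not overlap, so no two threads write the same element. Neighbouring pairs can still share a cache line, which is one reason the speed-up may stay modest.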
I have this code that transposes a matrix using loop tiling strategy.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
}
I want to optimize this with multi-threading using OpenMP, however I am not sure what to do when having so many nested for loops. I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free. As in: every iteration writing to a different location. If two iterations write (potentially) to the same location, you need to 1. use a reduction 2. use a critical section or other synchronization 3. decide that this loop is not worth parallelizing, or 4. rewrite your algorithm.
In your case, the write location depends on k and l. Since k < n and the write index is k + l*n, there are no two distinct pairs (k,l) and (k',l') that write to the same location. Furthermore, no two inner iterations have the same k or l value. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).
You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
You can use the collapse specifier to parallelize over two loops.
# pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
As a side-note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing is more important for performance.
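A sketch that combines collapse(2) with the swapped inner loops suggested in the side-note. The tile size BLOCK and the helper min_int are assumptions added for illustration (the original leaves blocksize unset), and the clamping of the tile bounds is an addition so that n and m need not be multiples of the tile size.

```c
#define BLOCK 32   /* hypothetical tile size; tune for the target machine */

static int min_int(int a, int b) { return a < b ? a : b; }

/* src is n rows by m columns, dst is m rows by n columns, both row-major. */
void transposeTiled(int n, int m, double *dst, const double *src) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i += BLOCK) {
        for (int j = 0; j < m; j += BLOCK) {
            int iend = min_int(i + BLOCK, n);   /* clamp partial tiles */
            int jend = min_int(j + BLOCK, m);
            /* k is innermost, so consecutive iterations write dst sequentially */
            for (int l = j; l < jend; ++l) {
                for (int k = i; k < iend; ++k) {
                    dst[k + l * n] = src[l + k * m];
                }
            }
        }
    }
}
```

Each (i, j) tile writes a disjoint part of dst, so the two outer loops can be shared among threads without synchronization. Whether sequential writes beat sequential reads here is worth benchmarking rather than assuming.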
I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
Yes. To parallelize multiple consecutive loops one can use OpenMP's collapse clause. Bear in mind, however, that:
(As pointed out by Victor Eijkhout.) Even though this does not directly apply to your code snippet, for each additional loop you parallelize you should check whether the parallelization introduces new race conditions, for example different threads writing concurrently to the same dst position.
In some cases parallelizing nested loops may result in slower execution than parallelizing a single loop, because the implementation of the collapse clause uses a more complex heuristic (than simple loop parallelization) to divide the iterations among threads, and its overhead can exceed the gains it provides.
You should benchmark with a single parallel loop, then with two, and so on, and compare the results.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
#pragma omp parallel for collapse(...)
for (int i = 0; i < n; i += blocksize)
for (int j = 0; j < m; j += blocksize)
for (int k = i; k < i + blocksize; ++k)
for (int l = j; l < j + blocksize; ++l)
dst[k + l*n] = src[l + k*m];
}
Depending on the number of threads, the number of cores and the size of the matrices, among other factors, running sequentially might actually be faster than the parallel versions. This is especially true for your code, which is not very CPU intensive (i.e., dst[k + l*n] = src[l + k*m];).
I have two vectors, a[n] and b[n], where n is a large number.
a[0] = b[0];
for (i = 1; i < size; i++) {
a[i] = a[i-1] + b[i];
}
With this code we want a[i] to contain the sum of all the numbers in b[] up to and including b[i]. I need to parallelise this loop using OpenMP.
The main problem is that a[i] depends on a[i-1], so the only direct way that comes to my mind would be to wait for each a[i-1] to be ready, which takes a lot of time and makes no sense. Is there any approach in OpenMP for solving this problem?
You're Carl Friedrich Gauss in the 18th century and your grade school teacher has decided to punish the class with a homework problem that requires a lot of mundane repeated arithmetic. In the previous week your teacher told you to add up the first 100 counting numbers and because you're clever you came up with a quick solution. Your teacher did not like this so he came up with a new problem which he thinks can't be optimized. In your own notation you rewrite the problem like this
a[0] = b[0];
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];
then you realize that
a0 = b[0]
a1 = (b[0]) + b[1];
a2 = ((b[0]) + b[1]) + b[2]
a_n = b[0] + b[1] + b[2] + ... b[n]
using your notation again you rewrite the problem as
int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;
How to optimize this? First you define
int sum(int n0, int n) {
int sum = 0;
for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
return sum;
}
Then you realize that
a_n+1 = sum(0, n) + sum(n, n+1)
a_n+2 = sum(0, n) + sum(n, n+2)
a_n+m = sum(0, n) + sum(n, n+m)
a_n+m+k = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)
So now you know what to do. Find t classmates. Have each one work on a subset of the numbers. To keep it simple you choose size = 100 and four classmates t0, t1, t2, t3, and then each one does
t0: s0 = sum(0,25)
t1: s1 = sum(25,50)
t2: s2 = sum(50,75)
t3: s3 = sum(75,100)
at the same time. Then define
void fix(int n0, int n, int offset) {
for(int i=n0; i<n; i++) a[i] += offset;
}
and then each classmate goes back over their subset at the same time again like this
t0: fix(0, 25, 0)
t1: fix(25, 50, s0)
t2: fix(50, 75, s0+s1)
t3: fix(75, 100, s0+s1+s2)
You realize that with t classmates each taking about the same K seconds to do arithmetic, you can finish the job in 2*K*size/t seconds, whereas one person would take K*size seconds. It's clear you're going to need at least two classmates just to break even. So four classmates should finish in half the time of one classmate.
Now you write up your algorithm in your own notation
int *suma; // array of partial results from each classmate
#pragma omp parallel
{
int ithread = omp_get_thread_num(); //label of classmate
int nthreads = omp_get_num_threads(); //number of classmates
#pragma omp single
suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;
//now have each classmate calculate their partial result s = sum(n0, n)
int s = 0;
#pragma omp for schedule(static) nowait
for (int i=0; i<size; i++) s += b[i], a[i] = s;
suma[ithread+1] = s;
//now wait for each classmate to finish
#pragma omp barrier
// now each classmate sums each of the previous classmates results
int offset = 0;
for(int i=0; i<(ithread+1); i++) offset += suma[i];
//now each classmates corrects their result
#pragma omp for schedule(static)
for (int i=0; i<size; i++) a[i] += offset;
}
free(suma);
You realize that you could optimize the part where each classmate has to add the result of the previous classmate but since size >> t you don't think it's worth the effort.
Your solution is not nearly as fast as your solution adding the counting numbers but nevertheless your teacher is not happy that several of his students finished much earlier than the other students. So now he decides that one student has to read the b array slowly to the class and when you report the result a it has to be done slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?
The only thing you can think of is to get multiple classmates to read and record different subsets of the numbers to the class at the same time.
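The two-pass scheme from this answer can also be written with explicit chunks instead of thread ids, which makes it easy to check serially. This is only a sketch: the function name and the chunk-count parameter t are assumptions, and the pragmas are simply ignored when OpenMP is not enabled.

```c
#include <stdlib.h>

/* Inclusive prefix sum, a[i] = b[0] + ... + b[i], computed in t chunks.
   Returns 0 on success, -1 on bad arguments or allocation failure. */
int prefixSumChunked(const int *b, int *a, int size, int t) {
    if (size <= 0 || t <= 0) return -1;
    int *suma = malloc(sizeof *suma * (size_t)(t + 1));
    if (suma == NULL) return -1;
    suma[0] = 0;

    /* Pass 1: each chunk ("classmate") computes its own running sum. */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < t; c++) {
        int n0 = (int)((long long)size * c / t);
        int n1 = (int)((long long)size * (c + 1) / t);
        int s = 0;
        for (int i = n0; i < n1; i++) { s += b[i]; a[i] = s; }
        suma[c + 1] = s;                  /* total of chunk c */
    }

    /* Serial scan over t totals: suma[c] becomes the sum of chunks 0..c-1. */
    for (int c = 1; c <= t; c++) suma[c] += suma[c - 1];

    /* Pass 2: each chunk adds the total of everything before it. */
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < t; c++) {
        int n0 = (int)((long long)size * c / t);
        int n1 = (int)((long long)size * (c + 1) / t);
        for (int i = n0; i < n1; i++) a[i] += suma[c];
    }

    free(suma);
    return 0;
}
```

With b = {1, 2, ..., 10} and t = 4 the result is the triangular numbers 1, 3, 6, ..., 55. Both passes stream over the arrays once, so the bandwidth limit described above still applies.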
I have been writing a small program in which I had to use a coordinate system on a board (x/y in a 2D array), and was wondering whether I should use indexing like array[x][y], which seems more natural to me, or array[y][x], which would better match the way the array is laid out in memory. I believe both methods will work if I am consistent and that it's just a naming issue, but what about conventions when writing larger programs?
In my field (image manipulation) the [y][x] convention is more usual. Whatever you do, be consistent and document it well.
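One way to "be consistent and document it" is to hide the index order behind two small accessors, so the rest of the program can think in (x, y) while the storage stays [y][x]. The board dimensions and function names below are made up for illustration.

```c
#define BOARD_HEIGHT 8    /* hypothetical number of rows    (y direction) */
#define BOARD_WIDTH  10   /* hypothetical number of columns (x direction) */

/* Convention: board[y][x]. Cells with consecutive x are adjacent in memory. */
int board[BOARD_HEIGHT][BOARD_WIDTH];

void setCell(int x, int y, int value) { board[y][x] = value; }
int  getCell(int x, int y)            { return board[y][x]; }
```

If the convention ever has to change, only these two functions need to be touched.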
You should also consider what you are going to do with these arrays, and whether this is time-critical.
As mentioned in the comments: The element a[r][c+1] is right next to a[r][c]. This fact may have a considerable impact on the performance when iterating over larger arrays. A proper traversal order will cause the cache lines to be fully exploited: When one array index is accessed, it is considered as being "likely" that afterwards, the next index will be accessed, and a whole block of memory will be loaded into the cache. If you are afterwards accessing a completely different memory location (namely, one in the next row), then this cache bandwidth is wasted.
If possible, you should try to use a traversal order that fits the actual memory layout.
(Of course, this is much about "conventions" and "habits": When writing an array access like a[row][col], this is usually interpreted as array access a[y][x], due to the convention of the x-axis being horizontal and the y-axis being vertical...)
Here is a small example that demonstrates the potential performance impact of a "wrong" traversal order:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
float computeSumRowMajor(float **array, int rows, int cols)
{
float sum = 0;
for (int r=0; r<rows; r++)
{
for (int c=0; c<cols; c++)
{
sum += array[r][c];
}
}
return sum;
}
float computeSumColMajor(float **array, int rows, int cols)
{
float sum = 0;
for (int c=0; c<cols; c++)
{
for (int r=0; r<rows; r++)
{
sum += array[r][c];
}
}
return sum;
}
int main()
{
int rows = 5000;
int cols = 5000;
float **array = (float**)malloc(rows*sizeof(float*));
for (int r=0; r<rows; r++)
{
array[r] = (float*)malloc(cols*sizeof(float));
for (int c=0; c<cols; c++)
{
array[r][c] = 0.01f;
}
}
clock_t start, end;
start = clock();
float sumRowMajor = 0;
for (int i=0; i<10; i++)
{
sumRowMajor += computeSumRowMajor(array, rows, cols);
}
end = clock();
double timeRowMajor = ((double) (end - start)) / CLOCKS_PER_SEC;
start = clock();
float sumColMajor = 0;
for (int i=0; i<10; i++)
{
sumColMajor += computeSumColMajor(array, rows, cols);
}
end = clock();
double timeColMajor = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("Row major %f, result %f\n", timeRowMajor, sumRowMajor);
printf("Col major %f, result %f\n", timeColMajor, sumColMajor);
return 0;
}
(apologies if I violated some best practices here, I'm usually a Java guy...)
For me, the row-major access is nearly an order of magnitude faster than the column-major access. Of course, the exact numbers will heavily depend on the target system, but the general issue should be the same on all targets.
I want to do some calculations with matrices whose size is, for example, 2048*2048. But the simulator stops working and does not run the code. I understand that the problem is related to the size and type of the variable. For example, I ran the simple code below to check whether I am right. It should print 1 after declaring variable A, but it does not.
Please note that I use Codeblocks. WFM is a function to write a float matrix in a text file and it works properly because I check that before with other matrices.
int main()
{
float A[2048][2048];
printf("1");
float *AP = &(A[0][0]);
const char *File_Name = "example.txt";
int counter = 0;
for(int i = 0; i < 2048; i++)
for(int j = 0; j < 2048; j++)
{
A[i][j] = counter;
++counter;
}
WFM(AP, 2048, 2048, File_Name , ' ');
return 0;
}
Any help or suggestions for dealing with this problem and with larger matrices would be appreciated.
Thanks
float A[2048][2048];
which requires approx. 2K * 2K * 4 = 16M of stack memory (assuming a 4-byte float). But typically the stack size of the process is far less than that. Please allocate it dynamically using the malloc family.
float A[2048][2048];
This may be too large for a local array; you should allocate the memory dynamically with a function such as malloc. For example, you could do this:
float *A = malloc(2048*2048*sizeof(float));
if (A == 0)
{
perror("malloc");
exit(1);
}
float *AP = A;
int counter = 0;
for(int i = 0; i < 2048; i++)
for(int j = 0; j < 2048; j++)
{
*(A + 2048*i + j) = counter;
++counter;
}
And when you no longer need A, you can release it with free(A);.
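If you would rather keep the A[i][j] syntax from the question, one alternative is to allocate through a pointer to a row. This is a sketch, not the only way to do it; the names DIM, Row and makeMatrix are made up for illustration.

```c
#include <stdlib.h>

#define DIM 2048

typedef float Row[DIM];   /* one matrix row */

/* Allocates a DIM x DIM matrix on the heap and fills it with 0, 1, 2, ...
   Returns NULL if the allocation fails; the caller must free() the result. */
Row *makeMatrix(void) {
    Row *A = malloc(sizeof(Row) * DIM);
    if (A == NULL) {
        return NULL;
    }
    int counter = 0;
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            A[i][j] = (float)counter;   /* same indexing as the stack array */
            ++counter;
        }
    }
    return A;
}
```

The storage is one contiguous block, so &A[0][0] can still be passed to a function that expects a float pointer, as the question does with AP.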
Helpful links about efficiency pitfalls of large arrays with power-of-2 size (offered by @LưuVĩnhPhúc):
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Why is my program slow when looping over exactly 8192 elements?
Matrix multiplication: Small difference in matrix size, large difference in timings