pthread speedup for matrix add/multiply - c

I am writing a basic code to add two matrix and note down the time taken for single thread and 2 or more threads. In the approach first i divide the given two matrix (initialized randomly) in THREADS number of segments, and then each of these segments are sent to the addition module, which is started by the pthread_create call. The argument to the parallel addition function is the following.
struct thread_segment
{
matrix_t *matrix1, *matrix2, *matrix3;
int start_row, offset;
};
Pointers to two source and one destination matrix. (Once source and the destination may point to the same matrix). The start_row is the row from which the particular thread should start adding, and the offset tells till how much this thread should add starting from start_row.
The matrix_t is a simple structure defined as below:
typedef struct _matrix_t
{
TYPE **mat;
int r, c;
} matrix_t;
I have compiled it with 2 threads, but there is (almost) no speedup when i ran with 10000 x 10000 matrix. I am recording the running time with time -p program.
The matrix random initialization is also done in parallel like above.
I think this is because all the threads work on the same matrix address area, may be because of that a bottleneck is not making any speedup. Although all the threads will work on different segments of a matrix, they don't overlap.
Previously i implemented a parallel mergesort and a quicksort which also showed similar characteristics, i was able to get speedup when i copied the data segment on which a particular thread is to work to a newly allocated memory.
My question is that is this because of:
memory bottleneck?
Time benchmark is not done in the proper way?
Dataset too small?
Coding error?
Other
In the case, if it is a memory bottleneck, then do every parallel program use exclusive memory area, even when multiple access of the threads on the shared memory can be done without mutex?
EDIT
I can see speedup when i make the matrix segments like
curr = 0;
jump = matrix1->r / THREADS;
for (i=0; i<THREADS; i++)
{
th_seg[i].matrix1 = malloc (sizeof (matrix_t));
th_seg[i].matrix1->mat = &(matrix1->mat[curr]);
th_seg[i].matrix1->c = matrix1->c;
th_seg[i].matrix1->r = jump;
curr += jump;
}
That is before passing, assign the base address of the matrix to be processed by this thread in the structure and store the number of rows. So now the base address of each matrix is different for each thread. But only if i add some small dimention matrix 100 x 100 say, many times. Before calling the parallel add in each iteration, i am re assigning the random values. Is the speedup noticed here true? Or due to some other phenomena chaching effects?

To optimize memory usage, you may want to take a look at loop tiling. That would help cache memory to be updated. In this approach you divide your matrices into smaller chunks so the cache can hold the values for longer time and does not need to update it self frequently.
Also notice, creating to many threads just increases the overhead of switching among them.
In order to get a feeling that how much a proper implementation can affect the run time of a concurrent program, these are the results of a programs to multiply two matrices in naive, cocnurrent and tiling-concurrent :
seconds name
10.72 simpleMul
5.16 mulThread
3.19 tilingMulThread

Related

Efficient Parallel algorithm for array filtering

Given a very large array I want to select only the elements that match some condition. I know a priori the number of elements that will be matched. My current pseucode is:
filter(list):
out = list of predetermined size
i = 0
for element in list
if element matches condition
out[i++] = element
return out
When trying to parallelize the previous algorithm, my naive approach was to just make the increment of i atomic (It's part of an OpenMP project so used #pragma omp atomic). However this implementation has slowed performance when compared to even the serial implementation. What more efficient algorithms are there to implement this? And why am I getting such heavy slowdown?
Information from the comments:
I'm using C with OpenMP;
Currently testing with one million entries;
Serial takes ~7 seconds, Parallel with two threads takes ~15;
The output array size it's exactly half (the problem is equivalent to finding the elements of the array smaller than the median of said array);
I'm testing it with two cores only.
However this implementation has slowed performance when compared to
even the serial implementation. What more efficient algorithms are
there to implement this?
The bottleneck is the overhead of atomic operation, therefore at first glance a more efficient algorithm would be one that would avoid the use of such operation. Although that is possible, the code would require a two step approach for instance each thread would save the elements that it has find out in a private array. After the parallel region the main thread would collect all the elements that each thread have found and merge them into a single array.
The second part of merging into a single array can be made parallel as well or using SIMD instructions.
I have created a small code to mimic your pseudocode:
#include <time.h>
#include <omp.h>
int main ()
{
int array_size = 1000000;
int *a = malloc(sizeof(int) * array_size);
int *out = malloc(sizeof(int) * array_size/2);
for(int i = 0; i < array_size; i++)
a[i] = i;
double start = omp_get_wtime();
int i = 0;
#pragma omp parallel for
for (int n=0 ; n < array_size; ++n ){
if(a[n] % 2 == 0){
int tmp;
#pragma omp atomic capture
tmp = i++;
out[tmp] = a[n];
}
}
double end = omp_get_wtime();
printf("%f\n",end-start);
free(a);
free(out);
return 0;
}
I did an ad hoc benchmark with the following results:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.017891 (s)
-> 4 Threads : 0.015254 (s)
So the parallel version is much much slower, which is expectable because the work performed in parallel is simply not enough to overcome the overhead of the atomic and of the parallelism.
I have tested also removing the atomic and leaving the race-condition just to check the time:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.001283 (s)
-> 4 Threads : 0.000720 (s)
So without the atomic the speedup was approximately 1.2 and 2.2 for 2 and 4 threads, respectively. So naturally the atomic is causing a huge overhead. Nonetheless, the speedups are not great even without any atomic. And this is the most that you can expect from parallelizing your code alone.
Depending in your real code, and on how computation demanding is your condition, you may not achieve great speedups even with a second approach that I have mentioned.
A useful note from the comments from #Paul G:
Even if it isn't going to speed up this particular example, it might
be useful to state that problems like this are generally solved in
parallel computing using a (parallel) exclusive scan algorithm/prefix
sum to determine which thread starts at which index of the output
array. Then every thread has a private output index it is incrementing
and one doesn't need atomics.
That approach might be slower than #dreamcrashs solution in this
particular case but is more general as having private arrays for each
thread can be very tricky (determining their size etc.) especially if
your output is bigger than the input (case where each input element
doesnt give 0 or 1 output element, but n output elements).

Convert sequential loop into parallel in C using pthreads

I would like to apply a pretty simple straightforward calculation on a n-by-d-dimensional array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my script? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
double * calcDistance(double * X ,int n, int d)
{
//calculate and return an array[n-1] of all the distances
//from the last point
double *distances = calloc(n,sizeof(double));
for(int i=0 ; i<n-1; i++)
{
//distances[i]=0;
for (int j=0; j< d; j++)
{
distances[i] += pow(X[(j+1)*n-1]-X[j*n+i], 2);
}
distances[i] = sqrt(distances[i]);
}
return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#define N 10 //00000
#define D 2
int main()
{
srand(time(NULL));
//allocate the proper space for X
double *X = malloc(D*N*(sizeof(double)));
//fill X with numbers in space (0,1)
for(int i = 0 ; i<N ; i++)
{
for(int j=0; j<D; j++)
{
X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
}
}
X = calcDistances(X, N, D);
return 0;
}
I have already tried utilizing pthreads asynchronously through the use of a global_index that is imposed to mutex and a local_index. Through the use of a while() loop, a local_index is assigned to each thread on each iteration. The local_index assignment depends on the global_index value at that time (both happening in a mutual exclusion block). The thread executes the computation on the distances[local_index] element.
Unfortunately this implementation has lead to a much slower program with a x10 or x20 bigger execution time compared to the sequential one that is cited above.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
Your inner loop jumps all over array X with a mixture of strides that varies with
the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc..
How could I significantly reduce the execution time of my script?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
I have already tried utilizing pthreads asynchronously through the use
of a global_index that is imposed to mutex and a local_index. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
This sounds like what I described. It is one of the standard scheduling models provided by OMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you parallelize across five cores in a four-core machine, then one will have to wait to run until one of the others has finished -- best theoretical speedup 60%. Parallelizing the same computation across only four cores uses the compute resources more efficiently, for a best theoretical speedup of about 75%.
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.

OpenMP: Will different memory locations speed up loops?

I have an OpenMP program that calculates variables in a loop.
int a[1000000];
int b[1000000];
int c[1000000];
int d[1000000];
#pragma omp parallel for private(i) shared(a,b,c,d)
for (i=0;i<1000000;++i)
{
d[i] = b[i]*a[i] + c[i]+10;
}
I used perf and perf said the bottleneck is in memory reading and writing.
First question: Is it possible to split the arrays A,B,C,D and put them into different memory banks using OpenMP?
Second question: If I split the arrays A,B,C,D into smaller arrays, will they speed up the execution of the loop?
First question: Is it possible to split the arrays A,B,C,D and put them into different memory banks using OpenMP?
By my knowledge, no, you cannot split arrays explicitly with OpenMP runtime functions. What OpenMP do is to attribute equal load of work to each thread. Let's say you have defined OMP_NUM_THREADS=10, then each thread will receive 100000 iterations to execute. Or last thread will receive less work if remainder not equals to 0. (What you can do is assign thread affinity to cores.)
Second question: If I split the arrays A,B,C,D into smaller arrays, will they speed up the execution of the loop?
No. But you can increase the number of threads to obtain better performance (until at a point that threads' overheads overtake the acceleration)

Why does copying a 2D array column by column take longer than row by row in C? [duplicate]

This question already has answers here:
Why does the order of the loops affect performance when iterating over a 2D array?
(7 answers)
Closed 7 years ago.
#include <stdio.h>
#include <time.h>
#define N 32768
char a[N][N];
char b[N][N];
int main() {
int i, j;
printf("address of a[%d][%d] = %p\n", N, N, &a[N][N]);
printf("address of b[%5d][%5d] = %p\n", 0, 0, &b[0][0]);
clock_t start = clock();
for (j = 0; j < N; j++)
for (i = 0; i < N; i++)
a[i][j] = b[i][j];
clock_t end = clock();
float seconds = (float)(end - start) / CLOCKS_PER_SEC;
printf("time taken: %f secs\n", seconds);
start = clock();
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
a[i][j] = b[i][j];
end = clock();
seconds = (float)(end - start) / CLOCKS_PER_SEC;
printf("time taken: %f secs\n", seconds);
return 0;
}
Output:
address of a[32768][32768] = 0x80609080
address of b[ 0][ 0] = 0x601080
time taken: 18.063229 secs
time taken: 3.079248 secs
Why does column by column copying take almost 6 times as long as row by row copying? I understand that 2D array is basically an nxn size array where A[i][j] = A[i*n + j], but using simple algebra, I calculated that a Turing machine head (on main memory) would have to travel a distance of in both the cases. Here nxn is the size of the array and x is the distance between last element of first array and first element of second array.
It pretty much comes down to this image (source):
When accessing data, your CPU will not only load a single value, but will also load adjacent data into the CPU's L1 cache. When iterating through your array by row, the items that have automatically been loaded into the cache are actually the ones that are processed next. However, when you are iterating by column, each time an entire "cache line" of data (the size varies per CPU) is loaded, only a single item is used and then the next line has to be loaded, effectively making the cache pointless.
The wikipedia entry and, as a high level overview, this PDF should help you understand how CPU caches work.
Edit: chqrlie in the comments is of course correct. One of the relevant factors here is that only very few of your columns fit into the L1 cache at the same time. If your rows were much smaller (say, the total size of your two dimensional array was only some kilobytes) then you might not see a performance impact from iterating per-column.
While it's normal to draw the array as a rectangle, the addressing of array elements in memory is linear: 0 to one minus the number of bytes available (on nearly all machines).
Memory hierarchies (e.g. registers < L1 cache < L2 cache < RAM < swap space on disk) are optimized for the case where memory accesses are localized: accesses that are successive in time touch addresses that are close together. They are even more highly optimized (e.g. with pre-fetch strategies) for sequential access in linear order of addresses; e.g. 100,101,102...
In C, rectangular arrays are arranged in linear order by concatenating all the rows (other languages like FORTRAN and Common Lisp concatenate columns instead). Therefore the most efficient way to read or write the array is to do all the columns of the first row, then move on to the rest, row by row.
If you go down the columns instead, successive touches are N bytes apart, where N is the number of bytes in a row: 100, 10100, 20100, 30100... for the case N=10000 bytes.Then the second column is 101,10101, 20101, etc. This is the absolute worst case for most cache schemes.
In the very worst case, you can cause a page fault on each access. These days on even on an average machine it would take an enormous array to cause that. But if it happened, each touch could cost ~10ms for a head seek. Sequential access is a few nano-seconds per. That's over a factor of a million difference. Computation effectively stops in this case. It has a name: disk thrashing.
In a more normal case where only cache faults are involved, not page faults, you might see a factor of hundred. Still worth paying attention.
There are 3 main aspects that contribute to the timing different:
The first double loop accesses both arrays for the first time. You are actually reading uninitialized memory which is bad if you expect any meaningful results (functionally as well as timing-wise), but in terms of timing what plays part here is the fact that these addresses are cold, and reside in the main memory (if you're lucky), or aren't even paged (if you're less lucky). In the latter case, you would have a page fault on each new page, and would invoke a system call to allocate a page for the first time. Note that this doesn't have anything to do with the order of traversal, but simply because the first access is much slower. To avoid that, initialize both arrays to some value.
Cache line locality (as explained in the other answers) - if you access sequential data, you miss once per line, and then enjoy the benefit of having it fetched already. You most likely won't even hit it in the cache but rather in some buffer, since the consecutive requests will be waiting for that line to get fetched. When accessing column-wise, you would fetch the line, cache it, but if the reuse distance is large enough - you would lose it and have to fetch it again.
Prefetching - modern CPUs would have HW prefetching mechanisms that can detect sequential accesses and prefetch the data ahead of time, which will eliminate even the first miss of each line. Most CPUs also have stride based prefetches which may be able to cover the column size, but these things don't work well usually with matrix structures since you have too many columns and it would be impossible for HW to track all these stride flows simultaneously.
As a side note, I would recommend that any timing measurement would be performed multiple times and amortized - that would have eliminated problem #1.

Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i in [0, Ny] and j in [0, Nx].
Notice that this is different from matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?
EDIT
Results from the solution post of Drew Hall
All the results are from a 1024x1024 for(for) loop in which in each iteration two double values (a complex number) are assigned. The time is the averaged time over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
}
}
Source code for case 2:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
}
}
Source code for case 3:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
gsl_matrix_complex_set(matrix, i, j, c_value);
}
}
There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to mimimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (MN bound). You'll need to get a raw pointer to the underlying memory block (that is, a double rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).
No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.
If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.
Since the contents of the loop haven't been shown, the bit that calculates the c_value we don't know if the performance of the code is limited by memory bandwidth or limited by CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compiler aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).

Resources