Convert sequential loop into parallel in C using pthreads

I would like to apply a pretty simple, straightforward calculation on an n-by-d array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my script? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
double * calcDistance(double *X, int n, int d)
{
    //calculate and return an array[n-1] of all the distances
    //from the last point
    double *distances = calloc(n, sizeof(double));
    for(int i=0; i<n-1; i++)
    {
        //distances[i]=0;
        for(int j=0; j<d; j++)
        {
            distances[i] += pow(X[(j+1)*n-1] - X[j*n+i], 2);
        }
        distances[i] = sqrt(distances[i]);
    }
    return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define N 10 //00000
#define D 2
int main()
{
    srand(time(NULL));
    //allocate the proper space for X
    double *X = malloc(D*N*sizeof(double));
    //fill X with numbers in space (0,1)
    for(int i=0; i<N; i++)
    {
        for(int j=0; j<D; j++)
        {
            X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
        }
    }
    double *distances = calcDistance(X, N, D);
    free(distances);
    free(X);
    return 0;
}
I have already tried using pthreads asynchronously through a global_index that is protected by a mutex and a local_index. Through a while() loop, a local_index is assigned to each thread on each iteration; the assignment depends on the value of global_index at that time (both steps happening inside a mutual-exclusion block). The thread then performs the computation on the distances[local_index] element.
Unfortunately this implementation has led to a much slower program, with a 10x to 20x longer execution time compared to the sequential one cited above.
Another idea is to predetermine and split the array (say into four equal parts) and assign the computation of each segment to a given pthread. I don't know whether that is a common, efficient procedure though.

Your inner loop jumps all over array X with a mixture of strides that varies with
the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc.
How could I significantly reduce the execution time of my script?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
I have already tried using pthreads asynchronously through a global_index
that is protected by a mutex and a local_index. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
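To make the static split concrete, here is a minimal pthreads sketch of it applied to the calcDistance() computation. The names worker, thread_arg_t and calcDistanceParallel are mine, and it also uses a plain multiplication instead of pow(), as suggested above:

#include <math.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    const double *X;   /* input array, laid out as in the question */
    double *distances; /* shared output; each thread writes a disjoint slice */
    int n, d;
    int start, end;    /* this thread handles outer-loop iterations [start, end) */
} thread_arg_t;

static void *worker(void *p)
{
    thread_arg_t *a = p;
    for (int i = a->start; i < a->end; i++) {
        double sum = 0.0;
        for (int j = 0; j < a->d; j++) {
            double diff = a->X[(j + 1) * a->n - 1] - a->X[j * a->n + i];
            sum += diff * diff;          /* multiplication instead of pow() */
        }
        a->distances[i] = sqrt(sum);
    }
    return NULL;
}

/* Same result as calcDistance(), but the n-1 outer iterations are
   statically divided among nthreads threads -- no mutex needed. */
double *calcDistanceParallel(const double *X, int n, int d, int nthreads)
{
    double *distances = calloc(n, sizeof(double));
    pthread_t tid[nthreads];
    thread_arg_t arg[nthreads];

    for (int t = 0; t < nthreads; t++) {
        arg[t] = (thread_arg_t){ X, distances, n, d,
                                 t * (n - 1) / nthreads,
                                 (t + 1) * (n - 1) / nthreads };
        pthread_create(&tid[t], NULL, worker, &arg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return distances;
}

Compile with -pthread and -lm. Whether this beats the serial version still depends on n, d and the machine, as noted below.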
Another idea is to predetermine and split the array (say into four equal parts) and assign the computation of each segment to a given pthread. I don't know whether that is a common, efficient procedure though.
This sounds like what I described. It is one of the standard scheduling models provided by OMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you parallelize across five threads on a four-core machine, then one thread will have to wait to run until one of the others has finished, for a best theoretical reduction in run time of 60%. Parallelizing the same computation across only four threads uses the compute resources more efficiently, for a best theoretical reduction of about 75%.
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.

Related

Efficient Parallel algorithm for array filtering

Given a very large array I want to select only the elements that match some condition. I know a priori the number of elements that will be matched. My current pseudocode is:
filter(list):
    out = list of predetermined size
    i = 0
    for element in list
        if element matches condition
            out[i++] = element
    return out
When trying to parallelize the previous algorithm, my naive approach was to just make the increment of i atomic (it's part of an OpenMP project, so I used #pragma omp atomic). However, this implementation has slowed performance compared to even the serial implementation. What more efficient algorithms are there to implement this? And why am I getting such a heavy slowdown?
Information from the comments:
I'm using C with OpenMP;
Currently testing with one million entries;
Serial takes ~7 seconds, Parallel with two threads takes ~15;
The output array size is exactly half of the input (the problem is equivalent to finding the elements of the array smaller than the median of said array);
I'm testing it with two cores only.
However this implementation has slowed performance when compared to
even the serial implementation. What more efficient algorithms are
there to implement this?
The bottleneck is the overhead of the atomic operation, therefore at first glance a more efficient algorithm would be one that avoids it. That is possible, but the code would require a two-step approach: for instance, each thread saves the elements that it has found in a private array, and after the parallel region the main thread collects the elements that each thread has found and merges them into a single array.
The second part of merging into a single array can be made parallel as well or using SIMD instructions.
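A minimal sketch of that two-step approach could look like the following. The name filter_parallel and the "element is even" condition are just placeholders for your real code, and the order of the output elements may differ from the serial version:

#include <omp.h>
#include <stdlib.h>
#include <string.h>

/* Returns the number of matching elements copied into out. */
int filter_parallel(const int *a, int n, int *out)
{
    int total = 0;
    #pragma omp parallel
    {
        int *priv = malloc(sizeof(int) * n);   /* worst case: every element matches */
        int count = 0;

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            if (a[i] % 2 == 0)                 /* the condition from the example */
                priv[count++] = a[i];

        /* Merge step: one lock acquisition per thread, not per element. */
        #pragma omp critical
        {
            memcpy(out + total, priv, count * sizeof(int));
            total += count;
        }
        free(priv);
    }
    return total;
}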
I have created a small code to mimic your pseudocode:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main()
{
    int array_size = 1000000;
    int *a = malloc(sizeof(int) * array_size);
    int *out = malloc(sizeof(int) * array_size/2);

    for(int i = 0; i < array_size; i++)
        a[i] = i;

    double start = omp_get_wtime();
    int i = 0;
    #pragma omp parallel for
    for(int n = 0; n < array_size; ++n){
        if(a[n] % 2 == 0){
            int tmp;
            #pragma omp atomic capture
            tmp = i++;
            out[tmp] = a[n];
        }
    }
    double end = omp_get_wtime();
    printf("%f\n", end - start);
    free(a);
    free(out);
    return 0;
}
I did an ad hoc benchmark with the following results:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.017891 (s)
-> 4 Threads : 0.015254 (s)
So the parallel version is much, much slower, which is expected because the work performed in parallel is simply not enough to overcome the overhead of the atomic and of the parallelism.
I have tested also removing the atomic and leaving the race-condition just to check the time:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.001283 (s)
-> 4 Threads : 0.000720 (s)
So without the atomic the speedup was approximately 1.2 and 2.2 for 2 and 4 threads, respectively. So naturally the atomic is causing a huge overhead. Nonetheless, the speedups are not great even without any atomic. And this is the most that you can expect from parallelizing your code alone.
Depending on your real code, and on how computationally demanding your condition is, you may not achieve great speedups even with the second approach that I have mentioned.
A useful note from the comments, by #Paul G:
Even if it isn't going to speed up this particular example, it might
be useful to state that problems like this are generally solved in
parallel computing using a (parallel) exclusive scan algorithm/prefix
sum to determine which thread starts at which index of the output
array. Then every thread has a private output index it is incrementing
and one doesn't need atomics.
That approach might be slower than #dreamcrash's solution in this
particular case, but is more general, as having private arrays for each
thread can be very tricky (determining their size etc.), especially if
your output is bigger than the input (the case where each input element
doesn't give 0 or 1 output elements, but n output elements).
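For reference, a minimal OpenMP sketch of that scan-based idea might look like this. The name filter_scan and the even-number condition are illustrative; schedule(static) on both loops guarantees that the same iterations go to the same thread in both passes:

#include <omp.h>
#include <stdlib.h>

/* Returns the number of matching elements written to out, in the
   same order the serial version would produce. */
int filter_scan(const int *a, int n, int *out)
{
    int nthreads = omp_get_max_threads();
    int *counts = calloc(nthreads + 1, sizeof(int));

    #pragma omp parallel
    {
        int t = omp_get_thread_num();

        /* Pass 1: count the matches in this thread's share of the iterations. */
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            if (a[i] % 2 == 0)
                counts[t + 1]++;

        /* Exclusive scan over the per-thread counts; the preceding for and
           this single both end with an implicit barrier, so the counts and
           offsets are complete before they are read. */
        #pragma omp single
        for (int k = 1; k <= nthreads; k++)
            counts[k] += counts[k - 1];

        /* Pass 2: each thread writes into its own slice of out -- no atomics. */
        int pos = counts[t];
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            if (a[i] % 2 == 0)
                out[pos++] = a[i];
    }

    int total = counts[nthreads];
    free(counts);
    return total;
}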

Which sequence is more effective in Assembly language?

I have 2 C sequences which both multiply two matrices.
Sequence 1:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = 0; i < M; i++)
    for (j = 0; j < P; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
Sequence 2:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = M - 1; i >= 0; i--)
    for (j = P - 1; j >= 0; j--)
        for (k = N - 1; k >= 0; k--)
            C[i][j] += A[i][k] * B[k][j];
My question is: which of them is more efficient when translated in Assembly language?
I'm pretty sure that the second one can be written using the loop instruction, while the first one can be written using inc/jl.
First, you should understand that source code does not dictate what the assembly language is. The C standard allows a compiler to transform a program in any way as long as the resulting observable behavior (defined by the standard) remains the same. (The observable behavior is largely the output to files and devices, interactive input and output, and accesses to special volatile objects.)
Compilers take advantage of this rule to optimize your program. If the results of your loop are the same in either direction, then, in the best compilers, writing the loop in one direction or another has no consequence. The compiler analyzes the source code and sees that the effect of the loop is merely to perform a set of operations whose order does not matter. It represents the loop and the operations within it abstractly and later generates the best assembly code it can.
If the arrays in your example are large, then the time it takes the compiler to execute the loop control instructions is irrelevant. In typical systems, it takes dozens of CPU cycles or more to fetch a value from memory. With large arrays, the bottleneck in your example code will be fetching data from memory. The CPU will be forced to wait for this data, and it will easily complete any loop control or array address arithmetic instructions while it is waiting for data from memory.
Typical systems deal with the slow memory problem by including some fast memory, called cache. Often, there is very fast cache built into the core of the processor itself, plus some fast cache on the chip with the processor, and there may be other levels of cache. Memory in cache is organized into lines, which are segments of consecutive data from memory. Thus, one cache line may contain eight consecutive int objects. When the processor needs data that is not already in cache, an entire cache line is fetched from memory. Because of this, you can avoid the memory delay by using eight consecutive int objects. When you read the first one (or even before -- the processor may predict your read and start fetching it ahead of time), all eight will be fetched from memory. So your program will only have to wait for the first one. When it goes to use the second through the eighth, they will already be in cache, where they are immediately available to the processor.
Unfortunately, array multiplication is notoriously bad for caches. Although your loop traverses the rows of array A (using A[i][k] where k is the fastest-varying index as your code is written), it traverses the columns of B (using B[k][j]). So consecutive iterations of your loop use consecutive elements of A but not consecutive elements of B. If the arrays are large, your program will end up waiting for elements from B to be fetched from memory. And, if you change the code to use consecutive elements from B, then it no longer uses consecutive elements from A.
With array multiplication, a typical way to deal with this problem is to split the array multiplication into smaller blocks, doing only a portion at a time, perhaps 8×8 blocks. This works because the cache can hold multiple lines at a time. If you arrange the work so that one 8×8 block from B (e.g., all the elements with a row number from 16 to 23 and a column number from 32 to 39) is used repeatedly for a while, then it can remain in cache, with all its data immediately available. This sort of rearrangement of work can speed up your program tremendously, making it many times faster. It is a much larger improvement than merely changing the direction of your loops can provide.
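As an illustration only (not taken from the question), a blocked version of the multiplication over flat, row-major n-by-n arrays could look like this, with BS as the block size and C assumed to be zero-initialized by the caller:

#define BS 8

void matmul_blocked(const int *A, const int *B, int *C, int n)
{
    /* Multiply one BS-by-BS block at a time so the active blocks of A, B
       and C stay in cache while they are being reused. */
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        int aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}

Tuned libraries pick the block size to match the actual cache sizes; 8 here is just a placeholder.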
Some compilers can see that your loops on i, j, and k can be interchanged, and they may try to reorganize them if there is some benefit. Few compilers can break up the routines into blocks as I describe above. Also, the compiler can rearrange the work in your example only because you show A, B, and C declared as separate arrays. If these were not visible to the compiler but were instead passed as pointers to a function that was performing matrix multiplication, the compiler would not be able to see that A, B, and C point to separate arrays. In this case, it cannot know that the order of the loops does not matter. If the function were passed a C that points to the same array as A, the function would be overwriting some of its input while calculating outputs, and so the loop directions would matter.
There are a variety of matrix multiplication libraries that use the blocking technique and others to perform matrix multiplication efficiently.

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment: compare two matrix multiplications, the default way and multiplication after transposing the second matrix, and point out which method is faster. I've written the code below, but time and time2 are nearly equal to each other; in one run the first method is faster and in another the second is, even with the same matrix size. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        sum = 0;
        for(int k=0; k<size; ++k) {
            sum = sum + (m1[i][k] * m2[k][j]);
        }
        score[i][j] = sum;
    }
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;

for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        int temp = m2[i][j];
        m2[i][j] = m2[j][i];
        m2[j][i] = temp;
    }
}

clock_t start2 = clock();
int sum2;
for(int i=0; i<size; ++i) {
    for(int j=0; j<size; ++j) {
        sum2 = 0;
        for(int k=0; k<size; ++k) {
            sum2 = sum2 + (m1[k][i] * m2[k][j]);
        }
        score[i][j] = sum2;
    }
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but the resolution may be coarser (i.e. the clock may advance by more than 1 nanosecond whenever it changes). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>

static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;

void timing_start(void)
{
    clock_gettime(CLOCK_REALTIME, &wall_start);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}

void timing_stop(void)
{
    struct timespec cpu_end, wall_end;
    clock_gettime(CLOCK_REALTIME, &wall_end);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
    wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
                 + (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
                + (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
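For example (an illustration only, reusing the first multiplication loop and variables from the question; printf needs <stdio.h>):

timing_start();
for (int i = 0; i < size; ++i)
    for (int j = 0; j < size; ++j) {
        int sum = 0;
        for (int k = 0; k < size; ++k)
            sum += m1[i][k] * m2[k][j];
        score[i][j] = sum;
    }
timing_stop();
printf("CPU: %.9f s, wall clock: %.9f s\n", cpu_seconds, wall_seconds);

For the median-of-repeats measurement described above, you would run such a block in a loop, store each cpu_seconds/wall_seconds pair, and report the middle value of the sorted timings.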
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. They prefer to develop their own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
        + a[r][1] * b[1][c]
        + ...
        + a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
        + a[r][1] * B[c][1]
        + ...
        + a[r][L] * B[c][L]
Note that it suffices for the elements of a and B that the summation uses consecutively to be consecutive in memory; the two arrays themselves can be separate (possibly "far" away from each other) and the processor will still utilize the cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
    For c in columns:
        temporary = b[r][c]
        b[r][c] = b[c][r]
        b[c][r] = temporary
    End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
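A corrected in-place transpose, in the variables of the question, only visits the entries above the diagonal (c > r), so each off-diagonal pair is swapped exactly once:

for (int r = 0; r < size; ++r) {
    for (int c = r + 1; c < size; ++c) {
        int temp = m2[r][c];
        m2[r][c] = m2[c][r];
        m2[c][r] = temp;
    }
}

After a real transpose, the second multiplication loop also has to read the transposed matrix row-wise, i.e. the innermost statement becomes sum2 = sum2 + (m1[i][k] * m2[j][k]); so that both operands are accessed consecutively in k.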
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing, so you haven't yet seen the reason why one would want to transpose the second matrix. Your time measurement relies on low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. And in the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them to the way they were), your innermost loop runs over the leftmost array index of both matrices, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop access items far from each other in memory, it is anti-optimized: it uses the access pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

Ordered Parallel code runs slower than single threading. Is there a solution?

#pragma omp parallel for default(none) shared(x) private(y, z, f) ordered
for (i = 0; i < 512; i++) {
    #pragma omp ordered
    for (y = 0; y < 512; y++) {
        for (z = 0, f = 0; z < 512; z++) {
            x[f++] = z + i + y;
        }
    }
}
The above code runs about 20% slower than the non-SMP execution on a dual core. Without the "#pragma omp ordered" it is about 50% faster than non-SMP.
The x[f++] sequence is assumed to have to stay in an ordered form, since it is reused in a similar way later.
Can ordered code be faster than single threading? Is there another method to achieve it?
System is win32/mingw-w64.
It's not really ordered, since the results of one iteration do not depend upon the previous, except for your use of f.
Can you derive f from i,y and z? It looks like you can. For example:
f = z + y * 512 + i * 512 * 512 + initial_f;
Now your code is unordered, and you can get real benefits from parallelization.
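A minimal sketch of that unordered version follows; it assumes initial_f is 0 and that x is large enough to hold all 512*512*512 results (neither is shown in the original code, so both are assumptions):

#pragma omp parallel for default(none) shared(x)
for (int i = 0; i < 512; i++)
    for (int y = 0; y < 512; y++)
        for (int z = 0; z < 512; z++)
            x[z + y * 512 + i * 512 * 512] = z + i + y;   /* index derived from i, y, z */

Since every iteration now writes to a unique, precomputable index, no ordered clause and no shared counter are needed.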
Single-threaded/-core code is often faster than multi-threaded/-core due to saturation of the memory system. What happens is that the memory work required by the single thread is already close to or at the limit of what the memory system can deliver. Add another thread/core that requires the same work, and both threads/cores will need to share what the memory system can provide, resulting in wait states and slower execution.
After profiling and optimizing the memory work, you may reach the point where the multi-threaded code is faster. The optimization requires moving data into non-shared memory (i.e. the L1 & L2 caches) and minimizing accesses to shared memory (L3 & RAM).
The optimization solution is more or less unique to the application at hand. It is not trivial (though some third-party SW vendors will try to say that with their product it's a piece of cake). Once you've done it you'll at least have learned what constructs should be avoided and what techniques are useful.
You are obviously relying on a shared vector x in the inner loop. So each access to that variable must be mutexed by OMP. No wonder that the "parallel" version is slower than the sequential one.
It is difficult to advise you what to change, since your code makes no sense to me at all. What do you expect the result to be? If you use ordered, the final result in x will be the version from the iteration with i set to 511. If you don't, it is whichever thread wins for each individual entry.
And what the h... is your f supposed to do? When evaluated, it has the same value as z, no? It just adds noise that makes the code harder to understand.

Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i in [0, Ny] and j in [0, Nx].
Notice that this is different from matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?
EDIT
Results from the solution post of Drew Hall
All the results are from a 1024x1024 for(for) loop in which, in each iteration, two double values (a complex number) are assigned. The time is the average over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 2:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 3:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}
There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to minimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (M*N bound). You'll need to get a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).
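A sketch along those lines follows. The name exp_inplace and the flat-layout assumption are mine; for a GSL matrix this assumes the data block is contiguous (i.e. the tda equals the number of columns), and it uses a plain remainder loop instead of the fall-through switch mentioned above:

#include <math.h>
#include <stddef.h>

/* Apply exp() in place to n doubles, 8 per iteration of the main loop. */
void exp_inplace(double *data, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {            /* unrolled main loop */
        data[i]     = exp(data[i]);
        data[i + 1] = exp(data[i + 1]);
        data[i + 2] = exp(data[i + 2]);
        data[i + 3] = exp(data[i + 3]);
        data[i + 4] = exp(data[i + 4]);
        data[i + 5] = exp(data[i + 5]);
        data[i + 6] = exp(data[i + 6]);
        data[i + 7] = exp(data[i + 7]);
    }
    for (; i < n; i++)                      /* remainder elements */
        data[i] = exp(data[i]);
}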
No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.
If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.
Since the contents of the loop, the bit that calculates c_value, haven't been shown, we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compilers aren't great at FPU code, for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer, so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).
