Numbers not randomized after runs - c

I'm trying to create an OpenMP program that randomizes double arrays and runs the values through the formula: y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
When I run the program multiple times, I've realised that the Y[] values are the same, even though they are supposed to be randomized when the arrays are initialized in the first #pragma omp for. Any ideas as to why this might be happening?
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#define ARRAY_SIZE 10
double randfrom(double min, double max);
double randfrom(double min, double max)
{
double range = (max - min);
double div = RAND_MAX / range;
return min + (rand() / div);
}
int main() {
int i;
double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
double min, max;
int imin, imax;
/*A[10] consists of random number in between 1 and 100
B[10] consists of random number in between 10 and 50
C[10] consists of random number in between 1 and 10
D[10] consists of random number in between 1 and 50
E[10] consists of random number in between 1 and 5
F[10] consists of random number in between 10 and 80*/
srand(time(NULL));
#pragma omp parallel
{
#pragma omp parallel for
for (i = 0; i < ARRAY_SIZE; i++) {
a[i] = randfrom(1, 100);
b[i] = randfrom(10, 50);
c[i] = randfrom(1, 50);
d[i] = randfrom(1, 50);
e[i] = randfrom(1, 5);
f[i] = randfrom(10, 80);
}
}
printf("This is the parallel Print\n\n\n");
#pragma omp parallel shared(a,b,c,d,e,f,y) private(i)
{
//Y=(A*B)+C+(D*E)+(F/2)
#pragma omp for schedule(dynamic) nowait
for (i = 0; i < ARRAY_SIZE; i++) {
/*printf("A[%d]%.2f",i, a[i]);
printf("\n\n");
printf("B[%d]%.2f", i, b[i]);
printf("\n\n");
printf("C[%d]%.2f", i, c[i]);
printf("\n\n");
printf("D[%d]%.2f", i, d[i]);
printf("\n\n");
printf("E[%d]%.2f", i, e[i]);
printf("\n\n");
printf("F[%d]%.2f", i, f[i]);
printf("\n\n");*/
y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
printf("Y[%d]=%.2f\n", i, y[i]);
}
}
#pragma omp parallel shared(y, min,imin,max,imax) private(i)
{
//min
#pragma omp for schedule(dynamic) nowait
for (i = 0; i < ARRAY_SIZE; i++) {
if (i == 0) {
min = y[i];
imin = i;
}
else {
if (y[i] < min) {
min = y[i];
imin = i;
}
}
}
//max
#pragma omp for schedule(dynamic) nowait
for (i = 0; i < ARRAY_SIZE; i++) {
if (i == 0) {
max = y[i];
imax = i;
}
else {
if (y[i] > max) {
max = y[i];
imax = i;
}
}
}
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
return 0;
}

First of all, I would like to emphasize that OpenMP has significant overheads, so you need a reasonable amount of work in your code, otherwise the overhead is bigger than the gain from parallelization. In your code the amount of work is far too small, so the fastest solution is to use serial code. However, you mentioned that your goal is to learn OpenMP, so I will show you how to do it.
In your previous post's comments, @paleonix linked a post (How to generate random numbers in parallel?) which answers your question about random numbers. One of the solutions is to use rand_r.
Your code has a data race when searching for the minimum and maximum values of array Y. If you only need the minimum/maximum value, it is very easy, because you can use a reduction like this:
double max=y[0];
#pragma omp parallel for default(none) shared(y) reduction(max:max)
for (int i = 1; i < ARRAY_SIZE; i++) {
if (y[i] > max) {
max = y[i];
}
}
But in your case you also need the indices of the minimum and maximum values, so it is a bit more complicated. You have to use a critical section to make sure that other threads cannot change the max, min, imax and imin values while you are updating them. It can be done the following way (e.g. for finding the minimum value):
#pragma omp parallel for
for (int i = 0; i < ARRAY_SIZE; i++) {
if (y[i] < min) {
#pragma omp critical
if (y[i] < min) {
min = y[i];
imin = i;
}
}
}
Note that the if (y[i] < min) appears twice: after the first comparison other threads may change the value of min, so inside the critical section you have to check it again before updating min and imin. The maximum value can be found in exactly the same way.
Always use your variables at their minimum required scope.
It is also recommended to use the default(none) clause in your OpenMP parallel regions, so you have to define the sharing attributes of all your variables explicitly.
You can fill the array and find its minimum/maximum values in a single loop and print their values in a different serial loop.
If you set min and max before the loop, you can get rid of the extra comparison if (i == 0) used inside the loop.
Putting it together:
double threadsafe_rand(unsigned int* seed, double min, double max)
{
double range = (max - min);
double div = RAND_MAX / range;
return min + (rand_r(seed) / div);
}
In main:
double min=DBL_MAX;
double max=-DBL_MAX;
#pragma omp parallel default(none) shared(a,b,c,d,e,f,y,imin,imax,min,max)
{
unsigned int seed=omp_get_thread_num();
#pragma omp for
for (int i = 0; i < ARRAY_SIZE; i++) {
a[i] = threadsafe_rand(&seed, 1,100);
b[i] = threadsafe_rand(&seed,10, 50);
c[i] = threadsafe_rand(&seed,1, 10);
d[i] = threadsafe_rand(&seed,1, 50);
e[i] = threadsafe_rand(&seed,1, 5);
f[i] = threadsafe_rand(&seed,10, 80);
y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
if (y[i] < min) {
#pragma omp critical
if (y[i] < min) {
min = y[i];
imin = i;
}
}
if (y[i] > max) {
#pragma omp critical
if (y[i] > max) {
max = y[i];
imax = i;
}
}
}
}
// printout
for (int i = 0; i < ARRAY_SIZE; i++) {
printf("Y[%d]=%.2f\n", i, y[i]);
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", imin, min, imax, max);
Update:
I have updated the code according to @Qubit's and @JérômeRichard's suggestions:
I used the 'Really minimal PCG32 code' / (c) 2014 M.E. O'Neill / from https://www.pcg-random.org/download.html. Note that I do not intend to properly handle the seeding of this simple random number generator. If you would like to do so, please use a complete random number generator library.
I have changed the code to use user-defined reductions. Indeed, it makes the code much more efficient, but it is not really beginner friendly. It would require a very long post to explain it, so if you are interested in the details, please read a book about OpenMP; a short illustration of the syntax follows this list.
I have reduced the number of divisions in threadsafe_rand.
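To give a rough idea of the declare reduction syntax used below, here is a minimal sketch on a plain double (shown only to explain the clauses; the real code below uses a struct so the index travels with the value, and DBL_MAX comes from <float.h>). omp_out and omp_in are the two partial results being combined, and omp_priv is each thread's initial private copy:
#pragma omp declare reduction(my_max : double : \
omp_out = omp_in > omp_out ? omp_in : omp_out) \
initializer(omp_priv = -DBL_MAX)
A loop can then use reduction(my_max:max) exactly like a built-in reduction.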
The updated code:
#include<stdio.h>
#include<stdint.h>
#include<time.h>
#include<float.h>
#include<limits.h>
#include<omp.h>
#define ARRAY_SIZE 10
// *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org
// Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)
typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;
static inline uint32_t pcg32_random_r(pcg32_random_t* rng)
{
uint64_t oldstate = rng->state;
// Advance internal state
rng->state = oldstate * 6364136223846793005ULL + (rng->inc|1);
// Calculate output function (XSH RR), uses old state for max ILP
uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
uint32_t rot = oldstate >> 59u;
return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}
static inline double threadsafe_rand(pcg32_random_t* seed, double min, double max)
{
const double tmp=1.0/UINT32_MAX;
return min + tmp*(max - min)*pcg32_random_r(seed);
}
struct v{
double value;
int i;
};
#pragma omp declare reduction(custom_min: struct v: \
omp_out = omp_in.value < omp_out.value ? omp_in : omp_out )\
initializer(omp_priv={DBL_MAX,0} )
#pragma omp declare reduction(custom_max: struct v: \
omp_out = omp_in.value > omp_out.value ? omp_in : omp_out )\
initializer(omp_priv={-DBL_MAX,0} )
int main() {
double a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE], d[ARRAY_SIZE], e[ARRAY_SIZE], f[ARRAY_SIZE], y[ARRAY_SIZE];
struct v max={-DBL_MAX,0};
struct v min={DBL_MAX,0};
#pragma omp parallel default(none) shared(a,b,c,d,e,f,y) reduction(custom_min:min) reduction(custom_max:max)
{
pcg32_random_t seed={omp_get_thread_num()*7842 + time(NULL)%2299, 1234+omp_get_thread_num()};
#pragma omp for
for (int i=0 ; i < ARRAY_SIZE; i++) {
a[i] = threadsafe_rand(&seed, 1,100);
b[i] = threadsafe_rand(&seed,10, 50);
c[i] = threadsafe_rand(&seed,1, 10);
d[i] = threadsafe_rand(&seed,1, 50);
e[i] = threadsafe_rand(&seed,1, 5);
f[i] = threadsafe_rand(&seed,10, 80);
y[i] = (a[i] * b[i]) + c[i] + (d[i] * e[i]) + (f[i] / 2);
if (y[i] < min.value) {
min.value = y[i];
min.i = i;
}
if (y[i] > max.value) {
max.value = y[i];
max.i = i;
}
}
}
// printout
for (int i = 0; i < ARRAY_SIZE; i++) {
printf("Y[%d]=%.2f\n", i, y[i]);
}
printf("min y[%d] = %.2f\nmax y[%d] = %.2f\n", min.i, min.value, max.i, max.value);
return 0;
}

Related

Clang OpenMP. Find max value in matrix N x N

I need to find the max value in a matrix using OpenMP. It is my first experience with OpenMP; previously I did this task using pthreads.
I wrote this code but it does not work:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
void MatrixFIller(int nrows, int* m) {
for (int i = 0; i < nrows; i++) {
for (int j = 0; j < nrows; j++) {
*(m + i * nrows + j) = rand() % 200;
}
}
};
#define dimension 9
#define number_of_threads 4
int main() {
srand(time(NULL));
int matrix[dimension][dimension];
int local_max=-1;
int final_max=-1;
int j = 0;
MatrixFIller(dimension, &matrix[0][0]);
for (int i = 0; i < dimension; i++) {
for (int j = 0; j < dimension; j++) {
printf("%d\t", matrix[i][j]);
}
printf("\n");
}
omp_set_num_threads(number_of_threads);
#pragma omp parallel private(local_max)
{
#pragma omp for
for (j = 0; j < dimension * dimension; j++) {
if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > local_max) {
local_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
}
}
#pragma omp critical
if (local_max > final_max) { final_max = local_max; };
};
printf("Max value of matrix with dimension %d is %d", dimension, final_max);
};
The idea is that inside the pragma for, each thread finds its local max, and after that it is compared with the global max value in the pragma critical section. Why is it not correct? Thanks!
When entering the parallel region, local_max is uninitialized: the private clause creates variables that are local to each thread, and that's it; they are not initialized to any value. If you want them to be initialized with the value local_max had before entering the parallel region, you have to use the firstprivate clause instead.
However, it would actually be better to declare (and initialize) local_max inside the parallel region.
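For illustration, a minimal sketch of that variant (same logic as your code, with local_max declared and initialized inside the region, and plain 2D indexing for readability):
#pragma omp parallel
{
    int local_max = -1;   // one private copy per thread, initialized here
    #pragma omp for
    for (int j = 0; j < dimension * dimension; j++) {
        int v = matrix[j / dimension][j % dimension];
        if (v > local_max) local_max = v;
    }
    #pragma omp critical
    if (local_max > final_max) final_max = local_max;
}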
Also, you may have a look at the reduction clause (with the max option), which will make the code even simpler:
#pragma omp parallel for reduction(max:final_max)
for (j = 0; j < dimension * dimension; j++) {
if (*(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)(j / dimension)))) > final_max) {
final_max = *(matrix + (int)((j) / dimension) * dimension + (j - dimension * ((int)((j) / dimension))));
}
}
EDIT
Following Laci's comment about possibly incorrect arithmetic: all of your index calculations look correct, but they are not easy to read. Since you have a 2D array from the beginning, it is simpler to use two loops, and possibly tell OpenMP to parallelize them both using the collapse clause. (And by the way, as far as possible, declare the loop indices within the for(): this avoids always wondering which ones should be declared private or not.)
#pragma omp parallel for reduction(max:final_max) collapse(2)
for (int i = 0; i < dimension; i++) {
for (int j = 0; j < dimension; j++) {
if (matrix[i][j] > final_max) {
final_max = matrix[i][j];
}
}
}

The problem with parallelizing the program using OpenMP

I have a program that works with arrays and outputs a single number. To parallelize the program I use OpenMP, but the problem is that after writing the directives I started getting answers that do not match those of the program without parallelization. Can anyone tell me where I made a mistake?
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <math.h>
#include <float.h>
#define LOOP_COUNT 100
#define MIN_1_ARRAY_VALUE 1
#define MAX_1_ARRAY_VALUE 10
#define MIN_2_ARRAY_VALUE 10
#define MAX_2_ARRAY_VALUE 100
#define PI 3.1415926535897932384626433
#define thread_count 4
double generate_random(unsigned int* seed, int min, int max) {
return (double)((rand_r(seed) % 10) + (1.0 / (rand_r(seed) % (max - min + 1))));
}
double map_1(double value) {
return pow(value / PI, 3);
}
double map_2(double value) {
return fabs(tan(value));
}
void dwarf_sort(int n, double mass[]) {
int i = 1;
int j = 2;
while (i < n) {
if (mass[i-1]<mass[i]) {
i = j;
j = j + 1;
}
else
{
double tmp = mass[i];
mass[i] = mass[i - 1];
mass[i - 1] = tmp;
--i;
if (i==0)
{
i = j;
j = j + 1;
}
}
}
}
int main(int argc, char* argv[]) {
printf("lab work 3 in processing...!\n");
int trial_counter;
int array_size = atoi(argv[1]);
struct timeval before, after;
long time_diff;
gettimeofday(&before, NULL);
for (trial_counter = 0; trial_counter < LOOP_COUNT; trial_counter++) {
double arr1[array_size];
double arr2[array_size / 2];
double arr2_copy[array_size / 2];
double arr2_min = DBL_MAX;
unsigned int tempValue = trial_counter;
unsigned int *currentSeed = &tempValue;
//stage 1 - init
#pragma omp parallel num_threads(thread_count)
{
#pragma omp parallel for default(none) shared(arr1, currentSeed, array_size) schedule(guided, thread_count)
for (int i = 0; i < array_size; i++) {
arr1[i] = generate_random(currentSeed, MIN_1_ARRAY_VALUE, MAX_1_ARRAY_VALUE);
// printf("arr[%d] = %f\n", i, arr1[i]);
}
#pragma omp parallel for default(none) shared(arr2, arr2_copy, array_size, currentSeed, arr2_min) schedule(guided, thread_count)
for (int i = 0; i < array_size / 2; i++) {
double value = generate_random(currentSeed, MIN_2_ARRAY_VALUE, MAX_2_ARRAY_VALUE);
arr2[i] = value;
arr2_copy[i] = value;
if (value < arr2_min) {
arr2_min = value;
}
}
#pragma omp parallel for default(none) shared(arr1, array_size) schedule(guided, thread_count)
for (int i = 0; i < array_size; i++) {
arr1[i] = map_1(arr1[i]);
}
#pragma omp parallel for default(none) shared(arr2, arr2_copy, array_size) schedule(guided, thread_count)
for (int i = 1; i < array_size / 2; i++) {
#pragma omp critical
arr2[i] = map_2(arr2_copy[i] + arr2_copy[i - 1]);
}
#pragma omp parallel for default(none) shared(arr2, arr1, array_size) schedule(guided, thread_count)
for (int i = 0; i < array_size / 2; i++) {
arr2[i] = pow(arr1[i], arr2[i]);
}
#pragma omp parallel sections
{
#pragma omp section
{
dwarf_sort((int) array_size / 2, arr2);
}
}
double final_sum = 0;
for (int i = 0; i < array_size / 2; i++) {
if (((int) arr2[i]) / 2 == 0) {
final_sum += sin(arr2[i]);
}
}
// printf("Iteration %d, value: %f\n", trial_counter, final_sum);
}
}
gettimeofday(&after, NULL);
time_diff = 1000 * (after.tv_sec - before.tv_sec) + (after.tv_usec - before.tv_usec) / 1000;
printf("\nN=%d. Milliseconds passed: %ld\n", array_size, time_diff);
return 0;
}
rand_r is thread-safe only if each thread has its own seed or if threads are guaranteed to operate on the seed in an exclusive way (e.g. using an expensive critical section). This is not the case in your code: currentSeed is shared between threads, so it causes a race condition, since multiple threads can mutate it simultaneously. You need to use a thread-private seed (with a different value per thread, so that the threads do not all produce the same sequence). The seed of each thread can be initialized from a shared array filled by the main thread (e.g. naively [0, 1, 2, 3, etc.]).
The thing is, you will still get different results using a different number of threads with this approach. One solution is to split your data set into independent chunks, each with an associated seed, and then possibly compute the chunks in parallel.
Note that using a #pragma omp parallel for inside a #pragma omp parallel region causes many threads to be created (i.e. over-subscription). This is generally very inefficient. You should use #pragma omp for instead; a minimal sketch combining this with per-thread seeds follows.
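A minimal sketch of both fixes, reusing your generate_random and assuming arr1, array_size, trial_counter and thread_count as defined in the question (it also needs #include <omp.h> for omp_get_thread_num):
#pragma omp parallel num_threads(thread_count)
{
    // one private seed per thread, offset by the thread number so the streams differ
    unsigned int my_seed = (unsigned int)trial_counter * thread_count + omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < array_size; i++) {
        arr1[i] = generate_random(&my_seed, MIN_1_ARRAY_VALUE, MAX_1_ARRAY_VALUE);
    }
}
The other loops can follow the same pattern inside the same parallel region, each with its own #pragma omp for.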

OpenMP - how to efficiently synchronize field update

I have the following code:
for (int i = 0; i < veryLargeArraySize; i++){
int value = A[i];
if (B[value] < MAX_VALUE) {
B[value]++;
}
}
I want to use an OpenMP worksharing construct here, but my issue is the synchronization on the B array: all parallel threads can access any element of array B, which is very large (which makes using locks difficult, since I'd need too many of them).
#pragma omp critical is a serious overhead here. Atomic is not possible, because of the if.
Does anyone have a good suggestion on how I might do this?
Here's what I've found out and done.
I've read on some forums that parallel histogram calculation is generally a bad idea, since it may be slower and less efficient than the sequential calculation.
However, I needed to do it (for the assignment), so what I did is the following:
Parallel processing of the A array (the image) to determine the actual range of values (for the histogram, the B array), i.e. find the MIN and MAX of A[i]:
int min_value = INT_MAX, max_value = INT_MIN; /* from <limits.h>; must be initialized before the reduction */
#pragma omp parallel for reduction(min:min_value) reduction(max:max_value)
for (i = 0; i < veryLargeArraySize; i++){
const unsigned int value = A[i];
if(max_value < value) max_value = value;
if(min_value > value) min_value = value;
}
int size_of_histo = max_value - min_value + 1;
That way, we can (potentially) reduce the actual histogram size from, e.g., 1M elements (allocated in array B) to 50K elements (allocated in sharedHisto)
Allocate a shared array, such as:
int num_threads = omp_get_max_threads(); // omp_get_num_threads() would return 1 outside a parallel region
int* sharedHisto = (int*) calloc(num_threads * size_of_histo, sizeof(int));
Each thread is assigned a part of sharedHisto and can update it without synchronization:
#pragma omp parallel default(shared) private(i)
{
int my_id = omp_get_thread_num();
#pragma omp for
for(i = 0; i < veryLargeArraySize; i++){
int value = A[i];
// my_id * size_of_histo positions to the beginning of this thread's
// part of sharedHisto.
// value - min_value positions to the bin for this value
sharedHisto[my_id * size_of_histo + value - min_value]++;
}
}
Now, perform a reduction (as stated here: Reducing on array in OpenMp)
#pragma omp parallel
{
// Every thread is in charge of part of the reduced histogram,
// which has size size_of_histo; the range is chosen by manual chunking
int my_id = omp_get_thread_num();
int num_threads = omp_get_num_threads();
int chunk = (size_of_histo + num_threads - 1) / num_threads;
int start = my_id * chunk;
int end = (start + chunk > size_of_histo) ? size_of_histo : start + chunk;
for(int i = start; i < end; i++){
for(int j = 0; j < num_threads; j++){
int value = B[i + min_value] + sharedHisto[j * size_of_histo + i];
if(value > MAX_VALUE) B[i + min_value] = MAX_VALUE;
else B[i + min_value] = value;
}
}
}
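Putting the three steps together, here is a compact, self-contained sketch of the same approach (the function name and signature are mine; B is indexed by the raw value and clamped at MAX_VALUE as in the question, and min_value/max_value are assumed to come from the reduction in step 1):
#include <stdlib.h>
#include <omp.h>

void clamped_histogram(const int *A, int *B, long n, int min_value, int max_value)
{
    int size_of_histo = max_value - min_value + 1;
    // upper bound on the team size of the parallel region below
    int num_threads = omp_get_max_threads();
    int *sharedHisto = calloc((size_t)num_threads * size_of_histo, sizeof(int));
    #pragma omp parallel
    {
        // each thread fills its own slice of sharedHisto, no synchronization needed
        int *myHisto = sharedHisto + (size_t)omp_get_thread_num() * size_of_histo;
        #pragma omp for
        for (long i = 0; i < n; i++)
            myHisto[A[i] - min_value]++;
        // implicit barrier here; then each thread reduces a disjoint range of bins into B
        #pragma omp for
        for (int k = 0; k < size_of_histo; k++) {
            int sum = B[k + min_value];
            for (int t = 0; t < num_threads; t++)
                sum += sharedHisto[(size_t)t * size_of_histo + k];
            B[k + min_value] = sum > MAX_VALUE ? MAX_VALUE : sum;
        }
    }
    free(sharedHisto);
}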

parallelizing matrix multiplication through threading and SIMD

I am trying to speed up matrix multiplication on a multicore architecture. To this end, I try to use threads and SIMD at the same time. But my results are not good. I test the speedup over sequential matrix multiplication:
void sequentialMatMul(void* params)
{
cout << "SequentialMatMul started.";
int i, j, k;
for (i = 0; i < N; i++)
{
for (k = 0; k < N; k++)
{
for (j = 0; j < N; j++)
{
X[i][j] += A[i][k] * B[k][j];
}
}
}
cout << "\nSequentialMatMul finished.";
}
I tried to add threading and SIMD to matrix multiplication as follows:
void threadedSIMDMatMul(void* params)
{
bounds *args = (bounds*)params;
int lowerBound = args->lowerBound;
int upperBound = args->upperBound;
int idx = args->idx;
int i, j, k;
for (i = lowerBound; i <upperBound; i++)
{
for (k = 0; k < N; k++)
{
for (j = 0; j < N; j+=4)
{
mmx1 = _mm_loadu_ps(&X[i][j]);
mmx2 = _mm_load_ps1(&A[i][k]);
mmx3 = _mm_loadu_ps(&B[k][j]);
mmx4 = _mm_mul_ps(mmx2, mmx3);
mmx0 = _mm_add_ps(mmx1, mmx4);
_mm_storeu_ps(&X[i][j], mmx0);
}
}
}
_endthread();
}
And the following section is used for calculating lowerbound and upperbound of each thread:
bounds arg[CORES];
for (int part = 0; part < CORES; part++)
{
arg[part].idx = part;
arg[part].lowerBound = (N / CORES)*part;
arg[part].upperBound = (N / CORES)*(part + 1);
}
And finally threaded SIMD version is called like this:
HANDLE handle[CORES];
for (int part = 0; part < CORES; part++)
{
handle[part] = (HANDLE)_beginthread(threadedSIMDMatMul, 0, (void*)&arg[part]);
}
for (int part = 0; part < CORES; part++)
{
WaitForSingleObject(handle[part], INFINITE);
}
The result is as follows:
Test 1:
// arrays are defined as follow
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=1//just one thread
Sequential time: 11129ms
Threaded SIMD matmul time: 14650ms
Speed up=0.75x
Test 2:
//defined arrays as follow
float **A = (float**)_aligned_malloc(N* sizeof(float), 16);
float **B = (float**)_aligned_malloc(N* sizeof(float), 16);
float **X = (float**)_aligned_malloc(N* sizeof(float), 16);
for (int k = 0; k < N; k++)
{
A[k] = (float*)malloc(cols * sizeof(float));
B[k] = (float*)malloc(cols * sizeof(float));
X[k] = (float*)malloc(cols * sizeof(float));
}
N=2048
Core=1//just one thread
Sequential time: 15907ms
Threaded SIMD matmul time: 18578ms
Speed up=0.85x
Test 3:
//defined arrays as follow
float A[N][N];
float B[N][N];
float X[N][N];
N=2048
Core=2
Sequential time: 10855ms
Threaded SIMD matmul time: 27967ms
Speed up=0.38x
Test 4:
//defined arrays as follow
float **A = (float**)_aligned_malloc(N* sizeof(float), 16);
float **B = (float**)_aligned_malloc(N* sizeof(float), 16);
float **X = (float**)_aligned_malloc(N* sizeof(float), 16);
for (int k = 0; k < N; k++)
{
A[k] = (float*)malloc(cols * sizeof(float));
B[k] = (float*)malloc(cols * sizeof(float));
X[k] = (float*)malloc(cols * sizeof(float));
}
N=2048
Core=2
Sequential time: 16579ms
Threaded SIMD matmul time: 30160ms
Speed up=0.51x
My question: why don't I get a speedup?
Here are the times I get building on your algorithm on my four core i7 IVB processor.
sequential: 3.42 s
4 threads: 0.97 s
4 threads + SSE: 0.86 s
Here are the times on a 2 core P9600 @2.53 GHz, which is similar to the OP's E2200 @2.2 GHz
sequential: time 6.52 s
2 threads: time 3.66 s
2 threads + SSE: 3.75 s
I used OpenMP because it makes this easy. Each thread in OpenMP runs over effectively
lowerBound = N*part/CORES;
upperBound = N*(part + 1)/CORES;
(note that this is slightly different from your definition. Your definition can give the wrong result for some values of N due to rounding, since you divide by CORES first: e.g. with N=10 and CORES=4, (N/CORES)*(part + 1) gives an upper bound of 8 for the last thread, so elements 8 and 9 are never processed.)
As to the SIMD version: it's not much faster, probably because the code is memory bandwidth bound. It's probably also not much faster because GCC already vectorizes the inner loop.
A truly optimal solution is much more complicated: you need to use loop tiling and reorder the elements within tiles to get the best performance. I don't have time to do that today (a rough sketch of plain loop tiling is shown after the code below).
Here is the code I used:
//c99 -O3 -fopenmp -Wall foo.c
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>
void gemm(float * restrict a, float * restrict b, float * restrict c, int n) {
for(int i=0; i<n; i++) {
for(int k=0; k<n; k++) {
for(int j=0; j<n; j++) {
c[i*n+j] += a[i*n+k]*b[k*n+j];
}
}
}
}
void gemm_tlp(float * restrict a, float * restrict b, float * restrict c, int n) {
#pragma omp parallel for
for(int i=0; i<n; i++) {
for(int k=0; k<n; k++) {
for(int j=0; j<n; j++) {
c[i*n+j] += a[i*n+k]*b[k*n+j];
}
}
}
}
void gemm_tlp_simd(float * restrict a, float * restrict b, float * restrict c, int n) {
#pragma omp parallel for
for(int i=0; i<n; i++) {
for(int k=0; k<n; k++) {
__m128 a4 = _mm_set1_ps(a[i*n+k]);
for(int j=0; j<n; j+=4) {
__m128 c4 = _mm_load_ps(&c[i*n+j]);
__m128 b4 = _mm_load_ps(&b[k*n+j]);
c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
_mm_store_ps(&c[i*n+j], c4);
}
}
}
}
int main(void) {
int n = 2048;
float *a = _mm_malloc(n*n * sizeof *a, 64);
float *b = _mm_malloc(n*n * sizeof *b, 64);
float *c1 = _mm_malloc(n*n * sizeof *c1, 64);
float *c2 = _mm_malloc(n*n * sizeof *c2, 64);
float *c3 = _mm_malloc(n*n * sizeof *c2, 64);
for(int i=0; i<n*n; i++) a[i] = 1.0*i;
for(int i=0; i<n*n; i++) b[i] = 1.0*i;
memset(c1, 0, n*n * sizeof *c1);
memset(c2, 0, n*n * sizeof *c2);
memset(c3, 0, n*n * sizeof *c3);
double dtime;
dtime = -omp_get_wtime();
gemm(a,b,c1,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
dtime = -omp_get_wtime();
gemm_tlp(a,b,c2,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
dtime = -omp_get_wtime();
gemm_tlp_simd(a,b,c3,n);
dtime += omp_get_wtime();
printf("time %f\n", dtime);
printf("error %d\n", memcmp(c1,c2, n*n*sizeof *c1));
printf("error %d\n", memcmp(c1,c3, n*n*sizeof *c1));
}
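For reference, here is a rough, untuned sketch of the loop tiling mentioned above (gemm_tlp_tiled is my name for it; BS is a block size you would tune, and n is assumed to be a multiple of BS, which holds for n = 2048). It only blocks the loops and does not reorder elements within tiles, and it can be timed with the same harness as gemm_tlp:
void gemm_tlp_tiled(float * restrict a, float * restrict b, float * restrict c, int n) {
    const int BS = 64;  // tile size: keeps the active tiles of b and c in cache
    #pragma omp parallel for
    for(int ii=0; ii<n; ii+=BS)
        for(int kk=0; kk<n; kk+=BS)
            for(int jj=0; jj<n; jj+=BS)
                for(int i=ii; i<ii+BS; i++)
                    for(int k=kk; k<kk+BS; k++)
                        for(int j=jj; j<jj+BS; j++)
                            c[i*n+j] += a[i*n+k]*b[k*n+j];
}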
It looks to me that the threads are sharing the __m128 mmx* variables; you probably defined them as global/static. You must be getting wrong results in your X array too. Define the __m128 mmx* variables inside the threadedSIMDMatMul function scope and it will run much faster.
void threadedSIMDMatMul(void* params)
{
__m128 mmx0, mmx1, mmx2, mmx3, mmx4;
// rest of the code here
}

OpenMP C Normal Random Generator

I was doing a C assignment for parallel computing, where I have to implement some sort of Monte Carlo simulation with an efficient thread-safe normal random generator using the Box-Muller transform. I generate 2 vectors of uniform random numbers X and Y, with the condition that X is in (0,1] and Y is in [0,1].
But I'm not sure that my way of sampling uniform random numbers from the half-open interval (0,1] is right.
Did anyone encounter something similar?
I'm using following Code:
double* StandardNormalRandom(long int N){
double *X = NULL, *Y = NULL, *U = NULL;
X = vUniformRandom_0(N / 2);
Y = vUniformRandom(N / 2);
#pragma omp parallel for
for (long int i = 0; i < N/2; i++){
U[2*i] = sqrt(-2 * log(X[i]))*sin(Y[i] * 2 * pi);
U[2*i + 1] = sqrt(-2 * log(X[i]))*cos(Y[i] * 2 * pi);
}
return U;
}
double* NormalRandom(long int N, double mu, double sigma2)
{
double *U = NULL, stdev = sqrt(sigma2);
U = StandardNormalRandom(N);
#pragma omp parallel for
for (int i = 0; i < N; i++) U[i] = mu + stdev*U[i];
return U;
}
Here is the relevant bit of my UniformRandom function, also implemented in parallel:
#pragma omp parallel for firstprivate(i)
for (long int j = 0; j < N;j++)
{
if (i == 0){
int tn = omp_get_thread_num();
I[tn] = S[tn];
i++;
}
else
{
I[j] = (a*I[j - 1] + c) % m;
}
}
}
#pragma omp parallel for
for (long int j = 0; j < N; j++)
U[j] = (double)I[j] / (m+1.0);
In the StandardNormalRandom function, I will assume that the pointer U has been allocated to the size N, in which case this function looks fine to me.
The same goes for the function NormalRandom.
However, for the function UniformRandom (which is missing some parts, so I'll have to assume some things): if the line I[j] = (a*I[j - 1] + c) % m; is the body of a loop with an omp parallel for, then you will have some issues. As you can't know the order of execution of the threads, the current thread (with a fixed value of j) can't rely on the value of I[j - 1], as this value could be modified, or not yet computed, at that point (I is shared by default).
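If it helps, here is a minimal sketch of one common alternative (not necessarily what your assignment prescribes): give each thread its own rand_r seed and map directly to (0,1], which avoids the I[j - 1] dependency altogether. The function name and arguments are mine:
#include <stdlib.h>
#include <omp.h>

void fill_uniform_open0(double *X, long int N, unsigned int base_seed)
{
    #pragma omp parallel
    {
        unsigned int seed = base_seed + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long int j = 0; j < N; j++)
            X[j] = (rand_r(&seed) + 1.0) / ((double)RAND_MAX + 1.0);   /* always in (0,1] */
    }
}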
Hope it helps!
