I have written this simple code to count the prime numbers between 2 and 5,000,000.
The algorithm works fine and outputs the correct answer; however, when I try to use OpenMP to speed up the execution, it outputs a different answer every time.
#include "time.h"
#include "stdio.h"
#include "omp.h"
int main()
{
clock_t start = clock();
int count = 1;
int x;
bool flag;
#pragma omp parallel for schedule(static,1) num_threads(2) shared(count) private(x,flag)
for (x = 3; x <= 5000000; x+=2)
{
flag = false;
if (x == 2 || x == 3)
count++;
else if (x % 2 == 0 || x % 3 == 0)
continue;
else
{
for (int i = 5; i * i <= x; i += 6)
{
if (x % i == 0 || x % (i + 2) == 0)
{
flag = true;
break;
}
}
if (!flag)
count++;
}
}
clock_t end = clock();
printf("The execution took %f ms\n", (double)end - start / CLOCKS_PER_SEC);
printf("%d\n", count);
}
The code doesn't work with any number of threads, with dynamic or static scheduling, or with different chunk sizes.
I have tried messing with private and shared variables, but it still didn't work, and declaring x and flag inside the for loop didn't work either.
I am using Visual Studio 2019 and I have OpenMP support enabled.
What's the problem with my code?
You have a race condition on your count variable: multiple threads can try to update it at the same time. The easy fix is to use an OpenMP reduction() clause, which gives each thread a private copy of the variable and adds them all up properly at the end:
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
int main(void)
{
clock_t start = clock();
int count = 1;
#pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:count)
for (int x = 3; x <= 5000000; x+=2)
{
bool flag = false;
if (x == 2 || x == 3)
count++;
else if (x % 2 == 0 || x % 3 == 0)
continue;
else
{
for (int i = 5; i * i <= x; i += 6)
{
if (x % i == 0 || x % (i + 2) == 0)
{
flag = true;
break;
}
}
if (!flag)
count++;
}
}
clock_t end = clock();
printf("The execution took %f ms\n", (double)end - start / CLOCKS_PER_SEC);
printf("%d\n", count);
}
This outputs 348513 (verified as the correct count through other software).
Also note the cleaned-up headers and the variable declarations moved into the loop to avoid the need for a private() clause.
You could also make count an atomic int, but that's slower than using reduction() in my testing.
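For reference, a minimal sketch of that alternative using OpenMP's atomic construct (a slightly condensed version of the same primality test; the point is only where the increment is protected, which is why it tends to be slower than the reduction):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    int count = 1;                       /* 2 is prime */
    #pragma omp parallel for schedule(static,1) num_threads(2)
    for (int x = 3; x <= 5000000; x += 2)
    {
        if (x % 3 == 0 && x != 3)
            continue;
        bool flag = false;
        for (int i = 5; i * i <= x; i += 6)
            if (x % i == 0 || x % (i + 2) == 0) { flag = true; break; }
        if (!flag)
        {
            /* each increment is synchronized individually, hence the slowdown */
            #pragma omp atomic
            count++;
        }
    }
    printf("%d\n", count);
}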
Just to add to the answer provided by @Shawn: besides solving the count race condition with the OpenMP reduction clause, you can also analyze whether your code has load-balancing issues. Looking at the iterations of the loop you are parallelizing, it is clear that not all iterations have the same amount of work. Since you are assigning work to threads in a static manner, you might have one thread doing much more work than the other. Experiment with the dynamic schedule to see if you notice any difference.
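For example, a minimal variant of the pragma with a dynamic schedule (the chunk size of 1000 is just an arbitrary starting point to experiment with, not a recommendation):

#pragma omp parallel for schedule(dynamic, 1000) num_threads(2) reduction(+:count)
for (int x = 3; x <= 5000000; x += 2)
{
    /* same loop body as in the reduction version above */
}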
Besides that, you can significantly simplify your sequential code by removing all those conditional branches that negatively affect the performance of your parallel version.
First, you do not need (x == 2), since the loop starts at int x = 3;. You do not need (x == 3) either: just remove it, initialize count = 2; (instead of count = 1;), and start the loop at int x = 5;, since the loop increments in steps of 2 (i.e., x += 2). With this you can remove the following:
if (x == 2 || x == 3)
count++;
Now, because the loop starts at 5 and has a step of 2, you know that it iterates over odd numbers only, so you can also remove x % 2 == 0. You are left with if (x % 3 == 0) continue; else {..}, which can be simplified into if (x % 3 != 0) {..}.
You can also rewrite the code to remove that break:
#pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:count)
for (int x = 5; x <= 5000000; x += 2) {
bool flag = false;
if (x % 3 != 0) {
for (int i = 5; !flag && i * i <= x; i += 6) {
flag = (x % i == 0 || x % (i + 2) == 0);
}
if (!flag) {
count++;
}
}
}
Because you are using C/C++, you can even remove the if (!flag) as well:
int count = 2;
#pragma omp parallel for schedule(static,1) num_threads(2) reduction(+:count)
for (int x = 5; x <= 5000000; x += 2) {
if (x % 3 != 0) {
int flag = 1;
for (int i = 5; flag && i * i <= x; i += 6) {
flag = x % i != 0 && x % (i + 2) != 0;
}
count += flag;
}
}
printf("%d\n", count);
IMO the code is more readable now; we could further improve it by giving the variable flag a better name.
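For instance, a small sketch of the same inner block with flag renamed to something self-explanatory (is_prime here is just a suggested name, not from the original answer):

if (x % 3 != 0) {
    int is_prime = 1;
    for (int i = 5; is_prime && i * i <= x; i += 6) {
        is_prime = (x % i != 0) && (x % (i + 2) != 0);
    }
    count += is_prime;  /* adds 1 only when no divisor was found */
}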
I have built an LED cube using transistors, shift registers and an Arduino Nano (and it kind of works). I know shift registers may be a poor design choice, but I have to work with what I've got, so please don't get stuck on that in your answers.
There is this piece of code:
bool input[32];
void resetLeds()
{
input[R1] = 0;
input[G1] = 0;
input[B1] = 0;
input[R2] = 0;
input[G2] = 0;
input[B2] = 0;
input[R3] = 0;
input[G3] = 0;
input[B3] = 0;
input[R4] = 0;
input[G4] = 0;
input[B4] = 0;
input[X1Z1] = 1;
input[X1Z2] = 1;
input[X1Z3] = 1;
input[X1Z4] = 1;
input[X2Z1] = 1;
input[X2Z2] = 1;
input[X2Z3] = 1;
input[X2Z4] = 1;
input[X3Z1] = 1;
input[X3Z2] = 1;
input[X3Z3] = 1;
input[X3Z4] = 1;
input[X4Z1] = 1;
input[X4Z2] = 1;
input[X4Z3] = 1;
input[X4Z4] = 1;
}
void loop()
{
T = micros();
for(int I = 0; I < 100; I++)
{
counter++;
if(counter >= 256 / DIVIDER) counter = 0;
for(int i = 0; i < 64; i += 4)
{
x = i / 16;
z = (i % 16) / 4;
resetLeds();
input[XZ[x][z]] = 0;
for(y = 0; y < 4; y++)
{
index = i + y;
if(counter < xyz[index][0]) input[Y[y][RED]] = 1;
if(counter < xyz[index][1]) input[Y[y][GREEN]] = 1;
if(counter < xyz[index][2]) input[Y[y][BLUE]] = 1;
}
PORTB = 0;
for(int j = 0; j < 32; j++)
{
bitWrite(OUT_PORT, 4, 0);
bitWrite(OUT_PORT, 3, input[j]);
PORTB = OUT_PORT;
bitWrite(OUT_PORT, 4, 1);
PORTB = OUT_PORT;
}
bitWrite(OUT_PORT, 0, 1);
PORTB = OUT_PORT;
}
}
T = micros() - T;
Serial.println(T / 100);
}
The runtime of a single iteration is reported to be 1274 microseconds, but I need it to be even lower to build a PWM function of sorts (manually turning a transistor on and off through a shift register). While optimizing I found this strange behavior I cannot explain. There is this line in the code:
bitWrite(OUT_PORT, 3, input[j]);
When I remove this line or change input[j] to 0 the runtime is halved. Apparently, an array lookup takes about 20 microseconds. But I find it very weird since I am indexing this array in more places in the code (when writing) and there it takes 23 microseconds for 28 writes.
Can somebody please explain to me what is going on and/or how to make this piece of code run faster? I guess you can do writes in a pipelined manner, but a read stalls the code completely since you cannot continue before you receive the value from cache. But then again, I highly doubt a read from cache should take 23 whole microseconds.
[EDIT 27/10/2020 16:55]
The part of the code that writes to the shift registers has a lot of bitWrites, which are time-consuming. Instead of writing the bits every time, I implemented preconfigured options for the byte to write to PORTB:
PORTB = LATCH_LOW;
for(int j = 0; j < 32; j++)
{
if(input[j] == 0)
{
PORTB = CLOCK_OFF_DATA_0;
PORTB = CLOCK_ON_DATA_0;
}
else
{
PORTB = CLOCK_OFF_DATA_1;
PORTB = CLOCK_ON_DATA_1;
}
}
PORTB = LATCH_HIGH;
This cuts my running time roughly in half, which is kind of fast enough, but I wonder if I could get it to run even faster. When I remove everything from the loop except the writing to the shift registers, and I remove the input[j] read, I get a runtime of 200 microseconds. This means that if I could remove the dependence on the input[j] read and compute its value inline, I should be able to get at least another 2x speedup. To achieve 8-bit PWM I calculated that I need the running time to be 40 microseconds or less, so I am going to stick with 16 (instead of 256) brightness levels for now to prevent flicker.
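For reference, a minimal sketch of how such preconfigured PORTB values could be defined, assuming the bit layout implied by the original bitWrite calls (bit 3 = data, bit 4 = shift clock, bit 0 = latch); the exact values depend on the actual wiring:

// Assumed layout (from the original code): PB0 = latch, PB3 = data, PB4 = clock
const uint8_t LATCH_LOW        = 0b00000000;  // latch low, clock low, data = 0
const uint8_t LATCH_HIGH       = 0b00000001;  // latch high
const uint8_t CLOCK_OFF_DATA_0 = 0b00000000;  // clock low,  data = 0
const uint8_t CLOCK_ON_DATA_0  = 0b00010000;  // clock high, data = 0
const uint8_t CLOCK_OFF_DATA_1 = 0b00001000;  // clock low,  data = 1
const uint8_t CLOCK_ON_DATA_1  = 0b00011000;  // clock high, data = 1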
[EDIT 28/10/2020 19:38]
I went into platform.txt and changed the optimization flag to -Ofast. This got my iteration time down by another 200 microseconds!
I am trying to send a maximum of 8 bytes of data. The first 4 bytes are always the same and consist of defined commands and an address. The last 4 bytes are variable.
So far I'm using this approach. Unfortunately, I was told not to use any for loops in this case.
// Construct data
local_transmit_buffer[0] = EEPROM_CMD_WREN;
local_transmit_buffer[1] = EEPROM_CMD_WRITE;
local_transmit_buffer[2] = High(MSQ_Buffer.address);
local_transmit_buffer[3] = Low(MSQ_Buffer.address);
uint_fast8_t i = 0;
for(i = 0; i < MSQ_Buffer.byte_lenght || i < 4; i++){ // assign data
local_transmit_buffer[i + 4] = MSQ_Buffer.dataPointer[i];
}
This is some test code I'm using to try to solve my problem:
#include <stdio.h>
__UINT_FAST8_TYPE__ local_transmit_buffer[8];
__UINT_FAST8_TYPE__ MSQ_Buffer_data[8];
void print_local(){
for (int i = 0; i < 8; i++)
{
printf("%d ", local_transmit_buffer[i]);
}
printf("\n");
}
void print_msg(){
for (int i = 0; i < 8; i++)
{
printf("%d ", MSQ_Buffer_data[i]);
}
printf("\n");
}
int main(){
// assign all local values to 0
for (int i = 0; i < 8; i++)
{
local_transmit_buffer[i] = 0;
} print_local();
// assign all msg values to 1
for (int i = 0; i < 8; i++)
{
MSQ_Buffer_data[i] = i + 1;
} print_msg();
*(local_transmit_buffer + 3) = (__UINT_FAST32_TYPE__)MSQ_Buffer_data;
printf("\n");
print_local();
return 0;
}
The first loop fills local_transmit_buffer with 0s and MSQ_Buffer_data with 1, 2, 3, ...
local_transmit_buffer -> 0 0 0 0 0 0 0 0
MSQ_Buffer_data -> 1 2 3 4 5 6 7 8
Now I want to assign the first 4 values of MSQ_Buffer_data to local_transmit_buffer like this:
local_transmit_buffer -> 0 0 0 0 1 2 3 4
Is there another way of solving this problem without using for loops or a bit_field?
Solved:
I used the memcpy function to solve my problem
// uint_fast8_t i = 0;
// for(i = 0; i < MSQ_Buffer.byte_lenght || i < 4; i++){ // assign data
// local_transmit_buffer[i + 4] = MSQ_Buffer.dataPointer[i];
// }
// copy a defined number data from the message to the local buffer to send
memcpy(&local_transmit_buffer[4], &MSQ_Buffer.dataPointer, local_save_data_length);
Either just unroll the loop manually by typing out each line, or simply use memcpy. In this case there's no reason why you need abstraction layers, so I'd write the sanest possible code, which is just manual unrolling (and getting rid of the icky macros):
uint8_t local_transmit_buffer [8];
...
local_transmit_buffer[0] = EEPROM_CMD_WREN;
local_transmit_buffer[1] = EEPROM_CMD_WRITE;
local_transmit_buffer[2] = (uint8_t) ((MSQ_Buffer.address >> 8) & 0xFFu);
local_transmit_buffer[3] = (uint8_t) (MSQ_Buffer.address & 0xFFu);
local_transmit_buffer[4] = MSQ_Buffer.dataPointer[0];
local_transmit_buffer[5] = MSQ_Buffer.dataPointer[1];
local_transmit_buffer[6] = MSQ_Buffer.dataPointer[2];
local_transmit_buffer[7] = MSQ_Buffer.dataPointer[3];
It is not obvious why you can't use a loop, though; this doesn't look like the actual EEPROM programming (where overhead code might cause hiccups), just the preparation for it. Start questioning such requirements.
Also note that you should not use __UINT_FAST8_TYPE__ but uint8_t. Never use home-brewed types; always use stdint.h. You should also not be using the fast types for a RAM buffer used for EEPROM programming, because such a buffer can never be allowed to contain padding, and the fast types are only guaranteed to be at least 8 bits wide, so they may be larger than one byte. As it stands, this is a bug.
I am given a set of elements from, say, 10 to 21 (always sequential).
I generate arrays of the same size, where the size is determined at runtime.
Example of 3 generated arrays (the number of arrays is dynamic, as is the number of elements in each array; some elements can be 0s, i.e. not used):
A1 = [10, 11, 12, 13]
A2 = [14, 15, 16, 17]
A3 = [18, 19, 20, 21]
These generated arrays will be given to different processes to do some computations on the elements. My aim is to balance the load for every process that gets an array. What I mean is:
With the given example, there are
A1 = 46
A2 = 62
A3 = 78
potential iterations over the elements given to each process.
I want to rearrange the initial arrays to give an equal amount of work to each process, so for example:
A1 = [21, 11, 12, 13] = 57
A2 = [14, 15, 16, 17] = 62
A3 = [18, 19, 20, 10] = 67
(Not an equal distribution, but fairer than the initial one.) Distributions can differ, as long as they approach some optimal distribution and are better than the worst (initial) case of the first and last arrays. As I see it, different distributions can be achieved using different indexing [where the split of the arrays is made {it can be uneven}].
This works fine for the given example, but there may be weird cases...
So, I see this as a reflection problem (for lack of a proper term), where the arrays should be seen with a diagonal through them, like:
10|111213
1415|1617
181920|21
And then an obvious substitution can be done..
I tried to implement it like this:
if(rest == 0)
payload_size = (upper-lower)/(processes-1);
else
payload_size = (upper-lower)/(processes-1) + 1;
//printf("payload size: %d\n", payload_size);
long payload[payload_size];
int m = 0;
int k = payload_size/2;
int added = 0; //track what been added so far (to skip over already added elements)
int added2 = 0; // same as 'added'
int p = 0;
for (i = lower; i <= upper; i=i+payload_size){
for(j = i; j<(i+payload_size); j++){
if(j <= upper){
if((j-i) > k){
if(added2 > j){
added = j;
payload[(j-i)] = j;
printf("1 adding data: %d at location: %d\n", payload[(j-i)], (j-i));
}else{
printf("else..\n");
}
}else{
if(added < upper - (m+1)){
payload[(j-i)] = upper - (p*payload_size) - (m++);
added2 = payload[(j-i)];
printf("2 adding data: %d at location: %d\n", payload[(j-i)], (j-i));
}else{
payload[(j-i)] = j;
printf("2.5 adding data: %d at location: %d\n", payload[(j-i)], (j-i));
}
}
}else{ payload[(j-i)] = '\0'; }
}
p++;
k=k/2;
//printf("send to proc: %d\n", ((i)/payload_size)%(processes-1)+1);
}
..but failed horribly.
You can definitely see the problem with the implementation: it is poorly scalable, incomplete, messy, badly written, and so on, and on, and on...
So, I need help either with the implementation or with an idea of a better approach to do what I want to achieve, given the description.
P.S. I need the solution to be as 'in-liney' as possible (avoid loop nesting) - that is why I am using a bunch of flags and global indexes.
Surely this can be done with extra loops and unnecessary iterations, but I invite people who know and appreciate the art of indexing when it comes to arrays.
I am sure there is a solution somewhere out there, but I just cannot make an appropriate Google query to find it.
Hint? I thought of using index % size_of_my_data to achieve this task..
P.S. Application: described here
Here is an O(n) solution I wrote using a deque (double-ended queue; a deque is not necessary and a simple array could be used, but a deque keeps the code clean thanks to pop() and popleft()). The code is Python, not pseudocode, but it should be easy to understand:
def balancingSumProblem(seqStart = None, seqStop = None, numberOfArrays = None):
    from random import randint
    from collections import deque

    seq = deque(xrange(seqStart or randint(1, 10),
                       seqStop and seqStop + 1 or randint(11, 30)))
    arrays = [[] for _ in xrange(numberOfArrays or randint(1, 6))]

    print "# of elements: {}".format(len(seq))
    print "# of arrays: {}".format(len(arrays))

    averageNumElements = float(len(seq)) / len(arrays)
    print "average number of elements per array: {}".format(averageNumElements)

    oddIteration = True
    try:
        while seq:
            for array in arrays:
                if len(array) < averageNumElements and oddIteration:
                    array.append(seq.pop())       # pop() is like popright()
                elif len(array) < averageNumElements:
                    array.append(seq.popleft())
            oddIteration = not oddIteration
    except IndexError:
        pass

    print arrays
    print [sum(array) for array in arrays]

balancingSumProblem(10, 21, 3)  # Given Example
print "\n---------\n"
balancingSumProblem()           # Randomized Test
Basically, from iteration to iteration it alternates between grabbing large elements and distributing them evenly across the arrays, and grabbing small elements and distributing them evenly across the arrays. It works from the outside in (though you could go from the inside out) and uses what should be the average number of elements per array to balance things out further.
It's not 100 percent accurate with all tests but it does a good job with most randomized tests. You can try running the code here: http://repl.it/cJg
With a simple sequence to assign, you can just iteratively add the min and max elements to each list in turn. There are some termination details to fix up, but that's the general idea. Applied to your example the output would look like:
john-schultzs-macbook-pro:~ jschultz$ ./a.out
10 21 13 18 = 62
11 20 14 17 = 62
12 19 15 16 = 62
A simple reflection assignment like this will be optimal when num_procs evenly divides num_elems. It will be sub-optimal, but still decent, when it doesn't:
#include <stdio.h>
int compute_dist(int lower, int upper, int num_procs)
{
if (lower > upper || num_procs <= 0)
return -1;
int num_elems = upper - lower + 1;
int num_elems_per_proc_floor = num_elems / num_procs;
int num_elems_per_proc_ceil = num_elems_per_proc_floor + (num_elems % num_procs != 0);
int procs[num_procs][num_elems_per_proc_ceil];
int i, j, sum;
// assign pairs of (lower, upper) to each process until we can't anymore
for (i = 0; i + 2 <= num_elems_per_proc_floor; i += 2)
for (j = 0; j < num_procs; ++j)
{
procs[j][i] = lower++;
procs[j][i+1] = upper--;
}
// handle left overs similarly to the above
// NOTE: actually you could use just this loop alone if you set i = 0 here, but the above loop is more understandable
for (; i < num_elems_per_proc_ceil; ++i)
for (j = 0; j < num_procs; ++j)
if (lower <= upper)
procs[j][i] = ((0 == i % 2) ? lower++ : upper--);
else
procs[j][i] = 0;
// print assignment results
for (j = 0; j < num_procs; ++j)
{
for (i = 0, sum = 0; i < num_elems_per_proc_ceil; ++i)
{
printf("%d ", procs[j][i]);
sum += procs[j][i];
}
printf(" = %d\n", sum);
}
return 0;
}
int main()
{
compute_dist(10, 21, 3);
return 0;
}
I have used the following implementation, which I mentioned in this report. (The implementation works for the cases I've used for testing: (1-15K), (1-30K) and (1-100K) datasets. I am not saying that it will be valid for all cases.)
int aFunction(long lower, long upper, int payload_size, int processes)
{
long result, i, j;
MPI_Status status;
long payload[payload_size];
int m = 0;
int k = (payload_size/2)+(payload_size%2)+1;
int lastAdded1 = 0;
int lastAdded2 = 0;
int p = 0;
int substituted = 0;
int allowUpdate = 1;
int s;
int times = 1;
int times2 = 0;
for (i = lower; i <= upper; i=i+payload_size){
for(j = i; j<(i+payload_size); j++){
if(j <= upper){
if(k != 0){
if((j-i) >= k){
payload[(j-i)] = j- (m);
lastAdded2 = payload[(j-i)];
}else{
payload[(j-i)] = upper - (p*payload_size) - (m++) + (p*payload_size);
if(allowUpdate){
lastAdded1 = payload[(j-i)];
allowUpdate = 0;
}
}
}else{
int n;
int from = lastAdded1 > lastAdded2 ? lastAdded2 : lastAdded1;
from = from + 1;
int to = lastAdded1 > lastAdded2 ? lastAdded1 : lastAdded2;
int tempFrom = (to-from)/payload_size + ((to-from)%payload_size>0 ? 1 : 0);
for(s = 0; s < tempFrom; s++){
int restIndex = -1;
for(n = from; n < from+payload_size; n++){
restIndex = restIndex + 1;
payload[restIndex] = '\0';
if(n < to && n >= from){
payload[restIndex] = n;
}else{
payload[restIndex] = '\0';
}
}
from = from + payload_size;
}
return 0;
}
}else{ payload[(j-i)] = '\0'; }
}
p++;
k=(k/2)+(k%2)+1;
allowUpdate = 1;
}
return 0;
}
I am working with some C code and I'm totally stuck in this function. It should compare two buffers with some allowed deviation. For example, if EEPROM_buffer[1] = 80, then TxBuffer values from 78 to 82 should be considered correct.
So the problem is that it always returns -1. I checked both buffers; the data is correct and they should match, but they don't. The program just runs the while loop until i reaches 3 and returns -1.
I compile with Atmel Studio 6.1, targeting an Atmel 32A4U microcontroller.
int8_t CheckMatching(t_IrBuff * tx_buffer, t_IrBuff * tpool)
{
uint8_t i = 0;
uint16_t * TxBuffer = (uint16_t*) tx_buffer->data;
while((TxBuffer->state != Data_match) || (i != (SavedBuff_count))) // Data_match = 7;
{
uint16_t * EEPROM_buffer = (uint16_t*) tpool[i].data;
for(uint16_t j = 0; j < tpool[i].usedSize; j++) // tpool[i].usedSize = 67;
{
if(abs(TxBuffer[j] - EEPROM_buffer[j]) > 3)
{
i++;
continue;
}
}
i++;
TxBuffer->state = Data_match; // state value before Data_match equal 6!
}
tx_buffer->state = Buffer_empty;
if(i == (SavedBuff_count)) // SavedBuff_count = 3;
{
return -1;
}
return i;
}
Both your TxBuffer elements and your EEPROM_buffer elements are uint16_t. On your target int is only 16 bits, so the uint16_t operands are promoted to unsigned int, and subtracting 81 from 80 gives 0xffff, with no chance of abs helping you. Cast to int32_t before subtracting and you will be better off.
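For illustration, a small self-contained sketch of that cast (the helper name within_tolerance is hypothetical, and the tolerance of 3 is taken from the posted code only as an example):

#include <stdint.h>
#include <stdio.h>

/* Returns 1 when the two values differ by no more than 'tolerance'.
   The cast to int32_t keeps the subtraction from wrapping around on
   targets where int is only 16 bits wide (as on AVR/XMEGA). */
static int within_tolerance(uint16_t a, uint16_t b, int32_t tolerance)
{
    int32_t diff = (int32_t)a - (int32_t)b;
    if (diff < 0)
        diff = -diff;
    return diff <= tolerance;
}

int main(void)
{
    printf("%d\n", within_tolerance(80, 81, 3)); /* 1: difference is 1 */
    printf("%d\n", within_tolerance(80, 85, 3)); /* 0: difference is 5 */
    return 0;
}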