Program is taking way longer than expected, is it running properly?

Not sure this is the right place...
I am running a brute-force program to solve an asymmetric traveling salesman problem.
It has 17 cities; one is fixed, so there are 16! (more than 20 trillion) permutations to check.
unsigned long TotalCost(unsigned long *Matrix, short *Path, short Dimention)
{
    unsigned long result = 0;
    unsigned long Cost;
    int iD;
    for (iD = 1; iD <= Dimention; iD++)
    {
        Cost = Matrix[Dimention*Path[iD - 1] + Path[iD]];
        if (Cost > 0)
        {
            result = result + Cost;
        }
        else
        {
            // no usable edge between these two cities: bail out with a huge sentinel cost
            return 4099999999;
        }
    }
    return result;
}
void swapP(short *x, short *y)
{
    short temp;
    temp = *x;
    *x = *y;
    *y = temp;
}
void permute(unsigned long *Matrix, short Dimention, unsigned long *CurrentMin, short *PerPath, short **MinPath, short l, short r)
{
    short i;
    unsigned long CCost;
    if (l == r)
    {
        CCost = TotalCost(Matrix, PerPath, Dimention);
        if (CCost < (*CurrentMin))
        {
            for (i = 0; i <= Dimention; i++)
            {
                (*MinPath)[i] = PerPath[i];
            }
            (*CurrentMin) = CCost;
            PrintResults(Matrix, PerPath, Dimention, 2);
        }
    }
    else
    {
        for (i = l; i <= r; i++)
        {
            swapP((PerPath+l), (PerPath+i));
            permute(Matrix, Dimention, CurrentMin, PerPath, MinPath, l+1, r);
            swapP((PerPath+l), (PerPath+i)); //backtrack
        }
    }
}
int main (void)
{
    // The omitted code here allocates memory for the matrix, HcG and HrGR arrays
    // and initializes them
    permute(Matrix, Dimention, &TotalMin, HcG, &HrGR, 1, Dimention - 1);
}
I tested the above code on an instance of five cities and it returned successfully, as expected, in a few milliseconds.
For the 17 cities, I initially thought it would take a few hours to solve, then a couple of days. It has been running for 4 days now and I'm beginning to suspect the program is, for some reason, no longer running, like it's frozen.
I'm not getting any errors, but it's taking way longer than I expected. The program prints the total cost and the path every time it finds a path with lower cost, but it stopped printing half an hour after it started.
I am using Ubuntu 18.04 and the program is "running" in a terminal. The system monitor says Memory: N/A; does that mean it's not using memory?
It also says CPU: 6%; can I increase it?
Is there a way to check if it is running properly? Or to estimate how long it will take to finish?
I'm so unsure about its integrity that I think I should stop the process, but at the same time I really want to see the results.

I only glanced through your code, but I have done things like this many times in the past. My general approach for this is as follows (although it adds a small cost) ...
Add a print statement in such a way (perhaps with a mod counter) that you would expect the print to come out approximately once every 2 to 3 minutes; a sketch of such a counter appears below. Include some information in the print so that you can tell how far along your search is progressing. Among that information, you probably want to print variables that, if they got trashed, could cause infinite looping, for example "Dimention" (which you have misspelled, by the way).
I would personally not have jumped from 5 cities to 17. Rather, 5 to 7, then maybe 9 or 10, just to confirm all is working and to get an idea of how much the run time grows on your particular CPU.
Finally, in the situation you are in now, is it possible to get another window and run "ps" to see if your job is getting any CPU time? If not, my approach would be to kill it and implement as I described above. HTH.
Note also, the code you have omitted (memory allocation, etc) is critical: the code as written has the potential to go out of bounds, and possibly not crash (if only slightly out of bounds) but rather end up trashing variables (depending on memory layout) that could (as mentioned above) create an infinite or near-infinite loop.
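A minimal sketch of the counter idea above (the helper name and the print interval are mine, not part of the original program): count leaf permutations in the l == r branch of permute and print every so many, together with Dimention so you can spot it getting trashed.
#include <stdio.h>

static unsigned long long PermCount = 0;

/* call this at the top of the (l == r) branch of permute() */
void ReportProgress(short Dimention)
{
    PermCount++;
    if (PermCount % 500000000ULL == 0)  /* tune the modulus so it prints every 2-3 minutes */
    {
        printf("checked %llu permutations so far, Dimention = %d\n",
               PermCount, (int)Dimention);
        fflush(stdout);                 /* make sure it actually reaches the terminal */
    }
}
Since there are 16! leaves in total, each print also tells you roughly what fraction of the search is done.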

Related

How do I create a "twirly" in a C program task?

Hey guys I have created a program in C that tests all numbers between 1 and 10000 to check if they are perfect using a function that determines whether a number is perfect. Once it finds these it prints them to the user, they are 6, 28, 496 and 8128. After this the program then prints out all the factors of each perfect number to the user. This is all fine. Here is my problem.
The final part of my task asks me to:
"Use a "twirly" to indicate that your program is happily working away. A "twirly" is the following characters printed over the top of each other in the following order: '|' '/' '-' '\'. This has the effect of producing a spinning wheel - ie a "twirly". Hint: to do this you can use \r (instead of \n) in printf to give a carriage return only (instead of a carriage return linefeed). (Note: this may not work on some systems - you do not have to do it this way.)"
I have no idea what a twirly is or how to implement one. My tutor said it has something to do with the sleep and delay functions which I also don't know how to use. Can anyone help me with this last stage, it sucks that all my coding is complete but I can't get this "twirly" thing to work.
If you want to test the numbers and display the twirly on screen at the same time while the process goes on, then you should look into using threads. Using POSIX threads, you can run the test on one thread while the other thread displays the twirly to the user on the terminal.
#include <stdlib.h>
#include <pthread.h>

/* thread start routines must take and return void* */
void *Test(void *arg);
void *Display(void *arg);

int main(){
    // create one thread for each task, Test and Display
    // start both threads
    // wait for the Test thread to finish
    // terminate the Display thread after the Test thread completes
    // exit code
}
Refer to chapter 12 of the Beginning Linux Programming book for threads.
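A self-contained sketch of that two-thread idea (the helper names and the placeholder search are mine; a volatile flag stands in for proper synchronization, which is tolerable here only because one thread writes it exactly once):
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

static volatile int done = 0;

static void *Test(void *arg)          /* placeholder for the perfect-number search */
{
    (void)arg;
    for (unsigned x = 1; x <= 10000; x++) {
        unsigned sum = 0;
        for (unsigned i = 1; i <= x / 2; i++)
            if (x % i == 0)
                sum += i;
        if (sum == x)
            printf("%u\n", x);
    }
    done = 1;                         /* tell the twirly thread to stop */
    return NULL;
}

static void *Display(void *arg)       /* spins the twirly until done is set */
{
    const char *twirly = "|/-\\";
    int p = 0;
    (void)arg;
    while (!done) {
        printf("%c\r", twirly[p]);
        fflush(stdout);
        p = (p + 1) % 4;
        usleep(100000);               /* roughly 10 updates per second */
    }
    return NULL;
}

int main(void)
{
    pthread_t worker, spinner;
    pthread_create(&worker, NULL, Test, NULL);
    pthread_create(&spinner, NULL, Display, NULL);
    pthread_join(worker, NULL);       /* wait for the search to finish */
    pthread_join(spinner, NULL);      /* Display returns once done is set */
    return 0;
}
Compile with -pthread; as the next answer points out, the search below 10,000 finishes so quickly that you will barely see the twirly at all.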
Given the program upon which the user is "waiting", I believe the problem as stated and the solutions using sleep() or threads are misguided.
To produce all the perfect numbers below 10,000 using C on a modern personal computer takes about 1/10 of a second. So any device to show the computer is "happily working away" would either never be seen or would significantly interfere with the time it takes to get the job done.
But let's make a working twirly for perfect number search anyway. I've left off printing the factors to keep this simple. Since 10,000 is too low to see the twirly in action, I've upped the limit to 100,000:
#include <stdio.h>
#include <string.h>
int main()
{
    const char *twirly = "|/-\\";
    for (unsigned x = 1; x <= 100000; x++)
    {
        unsigned sum = 0;
        for (unsigned i = 1; i <= x / 2; i++)
        {
            if (x % i == 0)
            {
                sum += i;
            }
        }
        if (sum == x)
        {
            printf("%u\n", x);
        }
        printf("%c\r", twirly[x / 2500 % strlen(twirly)]);
    }
    return 0;
}
No need for sleep() or threads, just key it into the complexity of the problem itself and have it update at reasonable intervals.
Now here's the catch: although the above works, the user will never see a fifth perfect number pop out with a 100,000 limit, and even with a 100,000,000 limit (which should produce one more) they'll likely give up, as this is a bad (slow) algorithm for finding them. But they'll have a twirly to watch.
p as integer, set p = 0
str as character pointer, set str = "|/-\\"
len as integer, set len = 4
loop i: 1 to 10000
    sum as integer, set sum = 0
    loop j: 1 to i/2
        if i % j == 0
            sum += j
    if sum == i
        print i
    if i % 100 == 0
        print str[p] using "%c\r" as format specifier
        set p = (p + 1) % len
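The same idea in C, as a sketch (the is_perfect helper is my own naming, not part of the pseudocode above):
#include <stdio.h>

static int is_perfect(int i)
{
    int sum = 0;
    for (int j = 1; j <= i / 2; j++)
        if (i % j == 0)
            sum += j;
    return sum == i;
}

int main(void)
{
    const char *str = "|/-\\";
    const int len = 4;
    int p = 0;
    for (int i = 1; i <= 10000; i++)
    {
        if (is_perfect(i))
            printf("%d\n", i);
        if (i % 100 == 0)
        {
            printf("%c\r", str[p]);
            fflush(stdout);        /* the twirly never appears without flushing */
            p = (p + 1) % len;
        }
    }
    return 0;
}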

Why does CILK_NWORKERS affect program with only one cilk_spawn?

I am trying to paralellize a matrix processing program. After using OpenMP I decided to also check out CilkPlus and I noticed the following:
In my C code, I only apply parallelism in one part i.e.:
//(test_function declarations)
cilk_spawn highPrep(d, x, half);
d = temp_0;
r = malloc(sizeof(int)*(half));
temp_1 = r;
x = x_alloc + F_EXTPAD;
lowPrep(r, d, x, half);
cilk_sync;
//test_function return
According to the documentation I have read so far, cilk_spawn is expected to -maybe- (CilkPlus does not enforce parallelism) take the highPrep() function and execute it in a different hardware thread should one be available, and then continue with the execution of the rest of the code, including the function lowPrep(). The threads then should synchronize at cilk_sync before the execution proceeds with the rest of the code.
I am running this on an 8-core/16-thread Xeon E5-2680 that does not execute anything else apart from my experiments. My question is this: I notice that when I change the environment variable CILK_NWORKERS and try values such as 2, 4, 8, 16, the time test_function takes to execute varies a lot. In particular, the higher CILK_NWORKERS is set (above 2), the slower the function becomes. This seems counterintuitive to me, since I would expect the number of available threads not to change the operation of cilk_spawn. I would expect that if 2 threads are available then highPrep is executed on another thread, and any threads beyond 2 to remain idle.
The highPrep and lowPrep functions are:
void lowPrep(int *dest, int *src1, int *src2, int size)
{
    double temp;
    int i;
    for(i = 0; i < size; i++)
    {
        temp = -.25 * (src1[i] + src1[i + 1]) + .5;
        if (temp > 0)
            temp = (int)temp;
        else
        {
            if (temp != (int)temp)
                temp = (int)(temp - 1);
        }
        dest[i] = src2[2*i] - (int)(temp);
    }
}
void highPrep(int *dest, int *src, int size)
{
    double temp;
    int i;
    for(i=0; i < size + 1; i++)
    {
        temp = (-1.0/16 * (src[-4 + 2*i] + src[2 + 2*i]) + 9.0/16 * (src[-2 + 2*i] + src[0 + 2*i]) + 0.5);
        if (temp > 0)
            temp = (int)temp;
        else
        {
            if (temp != (int)temp)
                temp = (int)(temp - 1);
        }
        dest[i] = src[-1 + 2*i] - (int)temp;
    }
}
There must be a reasonable explanation behind this, is it reasonable to expect different execution times for a program like this?
Clarification: Cilk does "continuation stealing", not "child stealing", so highPrep always runs on the same hardware thread as its caller. It's the "rest of the code" that might end up running on a different thread. See this primer for a fuller explanation.
As to the slowdown, it's probably an artifact of the implementation being biased towards high degrees of parallelism that can consume all threads. The extra threads are looking for work, and in the process of doing so they eat up some memory bandwidth and, on hyperthreaded processors, some core cycles. The Linux "Completely Fair Scheduler" has given us some grief in this area because sleep(0) no longer gives up the timeslice. It's also possible that the extra threads cause the OS to map software threads onto the machine less efficiently.
The root of the problem is a tricky tradeoff: running thieves aggressively enables them to pick up work faster if it appears, but also causes them to unnecessarily consume resources if no work is available. Putting thieves to sleep when there is no work available saves the resources, but adds significant overhead to spawn, since a spawning thread now has to check whether there are sleeping threads to be woken up. TBB pays this overhead, but it's not much for TBB because TBB's spawn overhead is much higher anyway. The current Cilk implementation does not pay this tax: it only sleeps workers during sequential execution.
The best (but imperfect) advice I can give is to find more parallelism so that no worker threads loiter for long and cause trouble.
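As one illustration of "finding more parallelism" (a sketch only; lowPrep_parallel is my own name, and it assumes the iterations really are independent, as they appear to be in the code above): the lowPrep loop itself could be handed to the workers with cilk_for instead of relying on the single cilk_spawn.
#include <cilk/cilk.h>

void lowPrep_parallel(int *dest, int *src1, int *src2, int size)
{
    cilk_for (int i = 0; i < size; i++)     /* each iteration writes only dest[i] */
    {
        double temp = -.25 * (src1[i] + src1[i + 1]) + .5;
        if (temp > 0)
            temp = (int)temp;
        else if (temp != (int)temp)
            temp = (int)(temp - 1);
        dest[i] = src2[2*i] - (int)temp;
    }
}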

What could be some possible problems with this use of OpenMP?

I was trying to figure out how to parallelize a segment of code in OpenMP, where the inside of the for loop is independent from the rest of it.
Basically the project deals with particle systems, but I don't think that should be relevant to the parallelization of the code. Is it a caching problem, where the for loop divides the work among threads in such a way that the particles are not cached in each core in an efficient manner?
Edit: As mentioned by an answer below, I'm wondering why I'm not getting speedup.
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
    s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
    s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
    // printf("%d", omp_get_thread_num());
}
If you're asking whether it's parallelized correctly, it looks fine. I don't see any data-races or loop-dependencies that could break it.
But I think you're wondering on why you aren't getting any speedup with parallelism.
You mentioned that the trip count, psize-n_dead, will be on the order of 4000. I'd say that's actually pretty small given the amount of work in the loop.
In other words, you don't have much total work to be worth parallelizing. So threading overhead is probably eating up any speedup that you should be gaining. If possible, you should try parallelizing at a higher level.
EDIT: You updated your comment to include up to 200000.
For larger values, it's likely that you'll be memory bound in some way. Your loop merely iterates through all the data doing very little work. So using more threads probably won't help much (if at all).
There are no correctness issues, such as data races, in this piece of code.
Assuming that the number of particles to process is big enough to warrant parallelism, I do not see any OpenMP-related performance issues in this code. By default, OpenMP splits the loop iterations statically into equal portions across all threads, so any cache conflicts may only occur at the boundaries of these portions, i.e. just in a few iterations of the loop.
Unrelated to OpenMP (and so to the parallel speedup problem), a performance improvement can possibly be achieved by switching from array-of-structs to struct-of-arrays, as this might help the compiler vectorize the code (i.e. use SIMD instructions of the target processor):
#pragma omp parallel for
for (unsigned i = 0; i < psize-n_dead; ++i)
{
    s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
    s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
}
Such a reorganization assumes that most of the time all particles are processed in a loop like this one. Working with an individual particle requires more cache lines to be loaded, but if you process them all in a loop, the net amount of cache lines loaded is nearly the same.
How sure are you that you're not getting speedup?
Trying it both ways - array of structs and struct of arrays, compiled with gcc -O3 (gcc 4.6), on a dual quad-core Nehalem - I get, for psize-n_dead = 200000, running 100 iterations for better timer accuracy:
Struct of arrays (reported times are in milliseconds):
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 90.984000
Took time 45.992000
Took time 22.996000
Took time 11.998000
Array of structs:
$ for t in 1 2 4 8; do export OMP_NUM_THREADS=$t; time ./foo; done
Took time 58.989000
Took time 28.995000
Took time 14.997000
Took time 8.999000
However, because the operation is so short (sub-ms), I didn't see any speedup without doing 100 iterations, because of timer accuracy. Also, you'd have to have a machine with good memory bandwidth to get this sort of behaviour; you're only doing ~3 FMAs and another multiplication for every two pieces of data you read in.
Code for array-of-structs follows.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

typedef struct particle_struct {
    double pos;
    double vel;
} particle;

typedef struct simulation_struct {
    particle *particles;
    double force;
} simulation;

void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}

/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}

void update(simulation *s, unsigned psize, double dt) {
#pragma omp parallel for
    for (unsigned i = 0; i < psize; ++i)
    {
        s->particles[i].pos = s->particles[i].pos + dt * s->particles[i].vel;
        s->particles[i].vel = (1 - dt*.1) * s->particles[i].vel + dt*s->force;
    }
}

void init(simulation *s, unsigned np) {
    s->force = 1.;
    s->particles = malloc(np*sizeof(particle));
    for (unsigned i=0; i<np; i++) {
        s->particles[i].pos = 1.;
        s->particles[i].vel = 1.;
    }
}

int main(void)
{
    const unsigned np=200000;
    simulation s;
    struct timeval clock;
    init(&s, np);
    tick(&clock);
    for (int iter=0; iter<100; iter++)
        update(&s, np, 0.75);
    double elapsed = tock(&clock)*1000.;
    printf("Took time %lf\n", elapsed);
    free(s.particles);
}
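For completeness, a sketch of the struct-of-arrays layout used for the first set of timings (this variant is my reconstruction, not code posted in the answer); only the data structures and the update loop change, while tick/tock/init/main stay essentially the same:
typedef struct particles_soa_struct {
    double *pos;                       /* one array per field instead of one struct per particle */
    double *vel;
} particles_soa;

typedef struct simulation_soa_struct {
    particles_soa particles;
    double force;
} simulation_soa;

void update_soa(simulation_soa *s, unsigned psize, double dt) {
#pragma omp parallel for
    for (unsigned i = 0; i < psize; ++i)
    {
        s->particles.pos[i] = s->particles.pos[i] + dt * s->particles.vel[i];
        s->particles.vel[i] = (1 - dt*.1) * s->particles.vel[i] + dt*s->force;
    }
}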

Recursive Fib with Threads, Segmentation Fault?

Any ideas why it works fine for values like 0, 1, 2, 3, 4... and seg faults for values like >15?
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *fib(void *fibToFind);

main(){
    pthread_t mainthread;
    long fibToFind = 15;
    long finalFib;
    pthread_create(&mainthread,NULL,fib,(void*) fibToFind);
    pthread_join(mainthread,(void*)&finalFib);
    printf("The number is: %d\n",finalFib);
}

void *fib(void *fibToFind){
    long retval;
    long newFibToFind = ((long)fibToFind);
    long returnMinusOne;
    long returnMinustwo;
    pthread_t minusone;
    pthread_t minustwo;
    if(newFibToFind == 0 || newFibToFind == 1)
        return newFibToFind;
    else{
        long newFibToFind1 = ((long)fibToFind) - 1;
        long newFibToFind2 = ((long)fibToFind) - 2;
        pthread_create(&minusone,NULL,fib,(void*) newFibToFind1);
        pthread_create(&minustwo,NULL,fib,(void*) newFibToFind2);
        pthread_join(minusone,(void*)&returnMinusOne);
        pthread_join(minustwo,(void*)&returnMinustwo);
        return returnMinusOne + returnMinustwo;
    }
}
Runs out of memory (out of space for stacks), or valid thread handles?
You're asking for an awful lot of threads, which require lots of stack/context.
Windows (and Linux) have a stupid "big [contiguous] stack" idea.
From the documentation on pthread_create:
"On Linux/x86-32, the default stack size for a new thread is 2 megabytes."
If you manufacture 10,000 threads, you need 20 GB of RAM. I built a version of the OP's program, and it bombed with some 3500 (p)threads on Windows XP64.
See this SO thread for more details on why big stacks are a really bad idea:
Why are stack overflows still a problem?
If you give up on big stacks and implement a parallel language with heap allocation for activation records (our PARLANSE is one of these), the problem goes away.
Here's the first (sequential) program we wrote in PARLANSE:
(define fibonacci_argument 45)
(define fibonacci
(lambda(function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? 1)
?
(+ (fibonacci (-- ?))
(fibonacci (- ? 2))
)+
)ifthenelse
)lambda
)define
Here's an execution run on an i7:
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonaccisequential
Starting Sequential Fibonacci(45)...Runtime: 33.752067 seconds
Result: 1134903170
Here's the second, which is parallel:
(define coarse_grain_threshold 30) ; technology constant: tune to amortize fork overhead across lots of work
(define parallel_fibonacci
(lambda (function natural natural )function
`Given n, computes nth fibonacci number'
(ifthenelse (<= ? coarse_grain_threshold)
(fibonacci ?)
(let (;; [n natural ] [m natural ] )
(value (|| (= m (parallel_fibonacci (-- ?)) )=
(= n (parallel_fibonacci (- ? 2)) )=
)||
(+ m n)
)value
)let
)ifthenelse
)lambda
)define
Making the parallelism explicit makes the programs a lot easier to write, too.
We test the parallel version by calling (parallel_fibonacci 45). Here is the execution run on the same i7 (which arguably has 8 processors, but is really 4 hyperthreaded processors, so it isn't quite 8 equivalent CPUs):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallelcoarse
Parallel Coarse-grain Fibonacci(45) with cutoff 30...Runtime: 5.511126 seconds
Result: 1134903170
A speedup near 6+, not bad for not-quite-8 processors. One of the other answers to this question ran the pthreads version; it took "a few seconds" (to blow up) computing Fib(18), and this is 5.5 seconds for Fib(45). This tells you pthreads is a fundamentally bad way to do lots of fine-grain parallelism, because it has really, really high forking overhead. (PARLANSE is designed to minimize that forking overhead.)
Here's what happens if you set the technology constant to zero (forks on every call to fib):
C:\DMS\Domains\PARLANSE\Tools\PerformanceTest>run fibonacciparallel
Starting Parallel Fibonacci(45)...Runtime: 15.578779 seconds
Result: 1134903170
You can see that amortizing fork overhead is a good idea, even if you have fast forks. Fib(45) produces a lot of grains. Heap allocation of activation records solves the OP's first-order problem (thousands of pthreads, each with 1Mb of stack, burns gigabytes of RAM).
But there's a second-order problem: 2^45 PARLANSE "grains" will burn all your memory too, just keeping track of the grains, even if your grain control block is tiny. So it helps to have a scheduler that throttles forks once you have "a lot" (for some definition of "a lot" significantly less than 2^45) of grains, to prevent the explosion of parallelism from swamping the machine with "grain"-tracking data structures. It also has to unthrottle forks when the number of grains falls below a threshold, to make sure there is always lots of logical, parallel work for the physical CPUs to do.
You are not checking for errors - in particular, from pthread_create(). When pthread_create() fails, the pthread_t variable is left undefined, and the subsequent pthread_join() may crash.
If you do check for errors, you will find that pthread_create() is failing. This is because you are trying to generate almost 2000 threads - with default settings, this would require 16GB of thread stacks to be allocated alone.
You should revise your algorithm so that it does not generate so many threads.
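One common way to do that, sketched below under my own names (fib_task, fib_task_run, MAX_DEPTH) rather than taken from this answer: only create threads for the top few levels of the recursion and recurse sequentially below that, so at most 2^MAX_DEPTH - 1 extra threads ever exist. It keeps the OP's fib(0) = 0, fib(1) = 1 convention.
#include <pthread.h>

#define MAX_DEPTH 3                    /* spawn threads only this many levels deep */

struct fib_task { long n; int depth; long result; };

static long fib_seq(long n)
{
    return (n < 2) ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

/* reads n and depth from *arg and writes fib(n) into arg->result */
static void *fib_task_run(void *arg)
{
    struct fib_task *t = arg;
    if (t->n < 2) {
        t->result = t->n;
        return NULL;
    }
    if (t->depth >= MAX_DEPTH) {       /* deep enough: stop spawning threads */
        t->result = fib_seq(t->n);
        return NULL;
    }
    struct fib_task left  = { t->n - 1, t->depth + 1, 0 };
    struct fib_task right = { t->n - 2, t->depth + 1, 0 };
    pthread_t tid;
    if (pthread_create(&tid, NULL, fib_task_run, &left) == 0) {
        fib_task_run(&right);          /* do the other half in this thread */
        pthread_join(tid, NULL);
    } else {                           /* fall back to sequential on failure */
        fib_task_run(&left);
        fib_task_run(&right);
    }
    t->result = left.result + right.result;
    return NULL;
}
Calling it is just struct fib_task t = { 30, 0, 0 }; fib_task_run(&t);, after which t.result holds the answer; with MAX_DEPTH set to 3 the whole run creates at most 7 threads.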
I tried to run your code, and came across several surprises:
printf("The number is: %d\n", finalFib);
This line has a small error: %d means printf expects an int, but it is passed a long int. On most platforms this is the same, or will have the same behavior anyway, but pedantically speaking (or if you just want to stop the warning from coming up, which is a very noble goal too), you should use %ld instead, which expects a long int.
Your fib function, on the other hand, seems non-functional. Testing it on my machine, it doesn't crash, but it yields 1047, which is not a Fibonacci number. Looking closer, it seems your program is incorrect on several aspects:
void *fib(void *fibToFind)
{
    long retval; // retval is never used
    long newFibToFind = ((long)fibToFind);
    long returnMinusOne; // variable is read but never initialized
    long returnMinustwo; // variable is read but never initialized
    pthread_t minusone; // variable is never used (?)
    pthread_t minustwo; // variable is never used
    if(newFibToFind == 0 || newFibToFind == 1)
        // you miss a cast here (but you really shouldn't do it this way)
        return newFibToFind;
    else{
        long newFibToFind1 = ((long)fibToFind) - 1; // variable is never used
        long newFibToFind2 = ((long)fibToFind) - 2; // variable is never used
        // reading undefined variables (and missing a cast)
        return returnMinusOne + returnMinustwo;
    }
}
Always take care of compiler warnings: when you get one, usually, you really are doing something fishy.
Maybe you should revise the algorithm a little: right now, all your function does is returning the sum of two undefined values, hence the 1047 I got earlier.
Implementing the Fibonacci sequence using a recursive algorithm means you need to call the function again. As others noted, it's quite an inefficient way of doing it, but it's easy, so I guess all computer science teachers use it as an example.
The regular recursive algorithm looks like this:
int fibonacci(int iteration)
{
    if (iteration == 0 || iteration == 1)
        return 1;
    return fibonacci(iteration - 1) + fibonacci(iteration - 2);
}
I don't know to what extent you were supposed to use threads: just run the algorithm on a secondary thread, or create new threads for each call? Let's assume the first for now, since it's a lot more straightforward.
Casting integers to pointers and vice-versa is a bad practice because if you try to look at things at a higher level, they should be widely different. Integers do maths, and pointers resolve memory addresses. It happens to work because they're represented the same way, but really, you shouldn't do this. Instead, you might notice that the function called to run your new thread accepts a void* argument: we can use it to convey both where the input is, and where the output will be.
So building upon my previous fibonacci function, you could use this code as the thread main routine:
void* fibonacci_offshored(void* pointer)
{
    int* pointer_to_number = pointer;
    int input = *pointer_to_number;
    *pointer_to_number = fibonacci(input);
    return NULL;
}
It expects a pointer to an integer, takes its input from it, then writes its output there.1 You would then create the thread like this:
int main()
{
    int value = 15;
    pthread_t thread;
    // on input, value should contain the number of iterations;
    // after the end of the function, it will contain the result of
    // the fibonacci function
    int result = pthread_create(&thread, NULL, fibonacci_offshored, &value);
    // error checking is important! try to crash gracefully at the very least
    if (result != 0)
    {
        perror("pthread_create");
        return 1;
    }
    if (pthread_join(thread, NULL))
    {
        perror("pthread_join");
        return 1;
    }
    // now, value contains the output of the fibonacci function
    // (note that value is an int, so just %d is fine)
    printf("The value is %d\n", value);
    return 0;
}
If you need to call the Fibonacci function from new distinct threads (please note: that's not what I'd advise, and others seem to agree with me; it will just blow up for a sufficiently large amount of iterations), you'll first need to merge the fibonacci function with the fibonacci_offshored function. It will considerably bulk it up, because dealing with threads is heavier than dealing with regular functions.
void* threaded_fibonacci(void* pointer)
{
    int* pointer_to_number = pointer;
    int input = *pointer_to_number;
    if (input == 0 || input == 1)
    {
        *pointer_to_number = 1;
        return NULL;
    }
    // we need one argument per thread
    int minus_one_number = input - 1;
    int minus_two_number = input - 2;
    pthread_t minus_one;
    pthread_t minus_two;
    // don't forget to check! especially that in a recursive function where the
    // recursion set actually grows instead of shrinking, you're bound to fail
    // at some point
    if (pthread_create(&minus_one, NULL, threaded_fibonacci, &minus_one_number) != 0)
    {
        perror("pthread_create");
        *pointer_to_number = 0;
        return NULL;
    }
    if (pthread_create(&minus_two, NULL, threaded_fibonacci, &minus_two_number) != 0)
    {
        perror("pthread_create");
        *pointer_to_number = 0;
        return NULL;
    }
    if (pthread_join(minus_one, NULL) != 0)
    {
        perror("pthread_join");
        *pointer_to_number = 0;
        return NULL;
    }
    if (pthread_join(minus_two, NULL) != 0)
    {
        perror("pthread_join");
        *pointer_to_number = 0;
        return NULL;
    }
    *pointer_to_number = minus_one_number + minus_two_number;
    return NULL;
}
Now that you have this bulky function, adjustments to your main function are going to be quite easy: just change the reference to fibonacci_offshored to threaded_fibonacci.
int main()
{
    int value = 15;
    pthread_t thread;
    int result = pthread_create(&thread, NULL, threaded_fibonacci, &value);
    if (result != 0)
    {
        perror("pthread_create");
        return 1;
    }
    pthread_join(thread, NULL);
    printf("The value is %d\n", value);
    return 0;
}
You might have been told that threads speed up parallel processes, but there's a limit somewhere where it's more expensive to set up the thread than run its contents. This is a very good example of such a situation: the threaded version of the program runs much, much slower than the non-threaded one.
For educational purposes, this program runs out of threads on my machine when the number of desired iterations is 18, and takes a few seconds to run. By comparison, using an iterative implementation, we never run out of threads, and we have our answer in a matter of milliseconds. It's also considerably simpler. This would be a great example of how using a better algorithm fixes many problems.
Also, out of curiosity, it would be interesting to see if it crashes on your machine, and where/how.
1. Usually, you should try to avoid to change the meaning of a variable between its value on input and its value after the return of the function. For instance, here, on input, the variable is the number of iterations we want; on output, it's the result of the function. Those are two very different meanings, and that's not really a good practice. I didn't feel like using dynamic allocations to return a value through the void* return value.

Intermittent bugs - sometimes this code works and sometimes it doesn't!

This code intermittently works. It's running on a small microcontroller. It will work fine even after restarting the processor, but if I change some part of the code, it breaks. This makes me think that it's some kind of pointer bug or memory corruption. What's happening is that the coordinate p_res.pos.x is sometimes read as 0 (the incorrect value) and sometimes as 96 (the correct value) when it is passed to write_circle_outlined; y seems to be correct most of the time. If anyone can spot anything obviously wrong, please point it out!
int demo_game()
{
    long int d;
    int x, y;
    struct WorldCamera p_viewer;
    struct Point3D_LLA p_subj;
    struct Point2D_CalcRes p_res;
    p_viewer.hfov = 27;
    p_viewer.vfov = 32;
    p_viewer.width = 192;
    p_viewer.height = 128;
    p_viewer.p.lat = 51.26f;
    p_viewer.p.lon = -1.0862f;
    p_viewer.p.alt = 100.0f;
    p_subj.lat = 51.20f;
    p_subj.lon = -1.0862f;
    p_subj.alt = 100.0f;
    while(1)
    {
        fill_buffer(draw_buffer_mask, 0x0000);
        fill_buffer(draw_buffer_level, 0xffff);
        compute_3d_transform(&p_viewer, &p_subj, &p_res, 10000.0f);
        x = p_res.pos.x;
        y = p_res.pos.y;
        write_circle_outlined(x, y, 1.0f / p_res.est_dist, 0, 0, 0, 1);
        p_viewer.p.lat -= 0.0001f;
        //p_viewer.p.alt -= 0.00001f;
        d = 20000;
        while(d--);
    }
    return 1;
}
The code for compute_3d_transform is:
void compute_3d_transform(struct WorldCamera *p_viewer, struct Point3D_LLA *p_subj, struct Point2D_CalcRes *res, float cliph)
{
    // Estimate the distance to the waypoint. This isn't intended to replace
    // proper lat/lon distance algorithms, but provides a general indication
    // of how far away our subject is from the camera. It works accurately for
    // short distances of less than 1km, but doesn't give distances in any
    // meaningful unit (lat/lon distance?)
    res->est_dist = hypot2(p_viewer->p.lat - p_subj->lat, p_viewer->p.lon - p_subj->lon);
    // Save precious cycles if outside of visible world.
    if(res->est_dist > cliph)
        goto quick_exit;
    // Compute the horizontal angle to the point.
    // atan2(y,x) so atan2(lon,lat) and not atan2(lat,lon)!
    res->h_angle = RAD2DEG(angle_dist(atan2(p_viewer->p.lon - p_subj->lon, p_viewer->p.lat - p_subj->lat), p_viewer->yaw));
    res->small_dist = res->est_dist * 0.0025f; // by trial and error this works well.
    // Using the estimated distance and altitude delta we can calculate
    // the vertical angle.
    res->v_angle = RAD2DEG(atan2(p_viewer->p.alt - p_subj->alt, res->est_dist));
    // Normalize the results to fit in the field of view of the camera if
    // the point is visible. If they are outside of (0,hfov] or (0,vfov]
    // then the point is not visible.
    res->h_angle += p_viewer->hfov / 2;
    res->v_angle += p_viewer->vfov / 2;
    // Set flags.
    if(res->h_angle < 0 || res->h_angle > p_viewer->hfov)
        res->flags |= X_OVER;
    if(res->v_angle < 0 || res->v_angle > p_viewer->vfov)
        res->flags |= Y_OVER;
    res->pos.x = (res->h_angle / p_viewer->hfov) * p_viewer->width;
    res->pos.y = (res->v_angle / p_viewer->vfov) * p_viewer->height;
    return;
quick_exit:
    res->flags |= X_OVER | Y_OVER;
    return;
}
Structure for the results:
typedef struct Point2D_Pixel { unsigned int x, y; };
// Structure for storing calculated results (from camera transforms.)
typedef struct Point2D_CalcRes
{
    struct Point2D_Pixel pos;
    float h_angle, v_angle, est_dist, small_dist;
    int flags;
};
The code is part of an open source project of mine so it's okay to post a lot of code here.
I see that some of your calculations depend on p_viewer->yaw, but I do not see any initialization of p_viewer->yaw. Is this your problem?
A couple of things that seem sketchy:
You can return from compute_3d_transform without setting many of the fields in p_res/res but the caller never checks for this situation.
You consistently read from res->flags without initializing it first.
Whenever the output differs between runs, it possibly means some value is not initialized and the outcome depends on the garbage value present in a variable. Keeping that in mind, I looked for uninitialized variables: the structure p_res is not initialized.
if(res->est_dist > cliph)
    goto quick_exit;
That means the if condition may turn out to be true or false depending on what garbage value is stored in res->est_dist. When the condition turns out to be true, execution goes straight to the quick_exit label and p_res.pos.x is never updated; if the condition turns out to be false, then it is updated.
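A minimal sketch of the kind of fix these observations point toward (not the project's actual patch): clear the whole result struct at the top of compute_3d_transform, so flags, pos.x and pos.y never carry garbage through the quick_exit path or into the |= updates.
#include <string.h>

void compute_3d_transform(struct WorldCamera *p_viewer, struct Point3D_LLA *p_subj, struct Point2D_CalcRes *res, float cliph)
{
    memset(res, 0, sizeof *res);   /* res->flags = 0, res->pos.x = 0, res->pos.y = 0, ... */
    // ... rest of the function exactly as posted above ...
}
The caller should then also check res->flags before trusting res->pos, since the early exit still leaves the position at zero.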
When I used to program C, I would use a divide and conquer debugging technique for this kind of problem to try to isolate the offending operation (paying attention to whether the symptoms change as debugging code is added, which is indicative of dangling pointer type bugs).
Essentially, start with the first line where the value is known to be good (and prove that it is consistently good at that line). Then identify where it is known to be bad. Then, approximately halfway between the two points, insert a test to see if it's bad. If it isn't, insert a test halfway between the mid-point and the known bad location; if it is bad, insert a test halfway between the mid-point and the known good location, and so on.
If the line identified is itself a function call, this process can be repeated in that called function, and so on.
When using this kind of approach, it's important to minimize the amount of added code and the artificial "noise", which can create timing changes.
Use this if you don't have (or can't use) an interactive debugger, or if the problem does not manifest when using one.
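A tiny helper in the spirit of that bisection approach (the PROBE macro is my own, just for illustration): drop probes at candidate points and move them toward the spot where the value first goes bad.
#include <stdio.h>

/* prints a tag, the source location, and the value being watched */
#define PROBE(tag, val) \
    printf("probe %s at %s:%d -> %d\n", (tag), __FILE__, __LINE__, (int)(val))

/* usage inside demo_game(), for example:
       PROBE("after compute_3d_transform", p_res.pos.x);
*/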

Resources