About Dijkstra omp - c

Recently I downloaded source code from the internet for an OpenMP implementation of Dijkstra's algorithm.
But I found that the parallel run time is always larger than the single-threaded run time (whether I use two, four, or eight threads).
Since I'm new to OpenMP I really want to figure out what is happening.

This is due to the overhead of setting up the threads. The execution time of the work itself is theoretically the same, but the system has to set up the threads that manage the work (even if there's only one). For a small amount of work, or for only one thread, this overhead makes your time-to-solution slower than the serial time-to-solution.
Alternatively, if you see the time increasing dramatically as you increase the thread count, you could be using only one core on your computer while oversubscribing it with 2, 4, 8, etc. threads.
Finally, it's possible that the way you're implementing Dijkstra's algorithm is largely serial. But without looking at your code it's hard to say.
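For reference, here is a minimal sketch (not the poster's code; the names are mine) of how the two inner loops of Dijkstra's algorithm are typically parallelized with OpenMP. Note that every iteration of the outer loop forks and joins the thread team, which is exactly the overhead described above:

/* Sketch: Dijkstra with both inner loops parallelized via OpenMP.
 * Every outer iteration pays a fork/join cost, which dominates
 * for small graphs. */
#include <omp.h>
#include <limits.h>

#define INF INT_MAX

void dijkstra(int n, int graph[n][n], int dist[n], int visited[n], int src)
{
    for (int i = 0; i < n; i++) { dist[i] = INF; visited[i] = 0; }
    dist[src] = 0;

    for (int step = 0; step < n; step++) {
        int u = -1, best = INF;

        /* Parallel scan for the closest unvisited vertex. */
        #pragma omp parallel
        {
            int my_u = -1, my_best = INF;
            #pragma omp for nowait
            for (int v = 0; v < n; v++)
                if (!visited[v] && dist[v] < my_best) { my_best = dist[v]; my_u = v; }
            #pragma omp critical
            if (my_best < best) { best = my_best; u = my_u; }
        }
        if (u == -1) break;
        visited[u] = 1;

        /* Parallel relaxation of u's neighbours. */
        #pragma omp parallel for
        for (int v = 0; v < n; v++)
            if (!visited[v] && graph[u][v] && dist[u] != INF
                && dist[u] + graph[u][v] < dist[v])
                dist[v] = dist[u] + graph[u][v];
    }
}

For a small graph, the O(n) scans inside each iteration are cheaper than the fork/join around them, so the serial version wins.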

Related

Getting reliable performance measurements for short bits of code

I'm trying to profile a set of functions which implement different versions of the same algorithm in different ways. I've increased the number of times each function is run so that the total time spent in a single function is roughly one minute (to reveal performance differences).
Now, running the test several times produces baffling results. There is huge variability (±50%) between executions of the same function, and determining which function is the fastest (which is the goal of the test) is nearly impossible because of that.
Is there something special I should take care of before running the tests, so that I get smoother measurements? Failing that, is running the test several times and computing the average for each function the way to go?
There are lots of things to check!
First, make sure your functions are actually CPU-bound. If so, make sure you have all CPU throttling, turbo modes, and power-saving modes disabled (in the BIOS) for the test. If you still have trouble, try pinning your process to a single core, and perhaps disable hyper-threading too.
The goal of all this is to make sure you get your code running hot on a single core without much interruption. If you're on Linux, you can remove a single core from the OS's list of available cores and use that (so there is no chance of interference on that core).
Running the test several times is a good idea, but using the average (arithmetic mean) is not. Instead, use the median or minimum or some other measurement which won't be influenced by outliers. Usually, the occasional long test run can be thrown out entirely (unless you're building a real-time system!).
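As a concrete illustration, here is a sketch of a harness along those lines, assuming POSIX clock_gettime (the function and variable names are illustrative). It runs the function many times, sorts the samples, and reports the minimum and median rather than the mean:

/* Sketch of a benchmark harness: time many runs and report the
 * minimum and median instead of the mean (assumes POSIX clock_gettime). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RUNS 31

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

void benchmark(const char *name, void (*fn)(void))
{
    double samples[RUNS];
    for (int i = 0; i < RUNS; i++) {
        double t0 = now_sec();
        fn();                          /* the function under test */
        samples[i] = now_sec() - t0;
    }
    qsort(samples, RUNS, sizeof samples[0], cmp_double);
    printf("%s: min %.6fs  median %.6fs\n",
           name, samples[0], samples[RUNS / 2]);
}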

Understanding pthreads a little more in C

So I only very recently heard about these pthreads, and my understanding of them is very limited so far, but I just wanted to know if they would be able to do what I want before I really get into learning about them.
I have written a program that generates two output pulses from a microcontroller, with different frequencies, periods, and duty cycles. At the moment the functions that output the pulses run in a loop, and it works well because the timings I am using are multiples of each other, so stopping one without interrupting the other is not too much hassle.
However, I want it to be a lot more dynamic, so I can change the duty cycles or periods easily without having to write some complicated loop specific to those timings... Below is a quick sketch of what I am trying to achieve, and I hope you can understand it...
So basically my question is: is something like this possible with pthreads in C, i.e. do they run simultaneously, so one could be pulsing on and off while the other is waiting for a delay to finish?
If not is there anything that I could use for this instead?
In general, it's not worth using threads for such functionality on a microcontroller. The cost of the extra stacks etc. for such limited operations is not worth it, tempting as it might be from a simplicity point of view.
A hardware timer, an interrupt, and a delta queue of events is probably the best you can do.
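To make that concrete, here is a hardware-independent sketch of a delta queue driven by a periodic timer tick; all names are hypothetical and the action is left as a stub, since pin toggling is hardware-specific. Each entry stores its delay relative to the previous entry, so the interrupt handler only ever touches the head of the queue:

/* Sketch of a delta queue driven by a periodic timer tick. */
#include <stdint.h>
#include <stddef.h>

typedef struct event {
    uint32_t      delta;          /* ticks after the previous event */
    void        (*action)(void); /* e.g. toggle an output pin      */
    struct event *next;
} event_t;

static event_t *queue = NULL;

/* Insert an event 'ticks' from now, keeping deltas consistent. */
void schedule(event_t *ev, uint32_t ticks)
{
    event_t **pp = &queue;
    while (*pp && (*pp)->delta <= ticks) {
        ticks -= (*pp)->delta;
        pp = &(*pp)->next;
    }
    ev->delta = ticks;
    ev->next = *pp;
    if (*pp)
        (*pp)->delta -= ticks;    /* successor is now relative to ev */
    *pp = ev;
}

/* Called from the hardware timer interrupt, once per tick. */
void timer_isr(void)
{
    if (queue && queue->delta)
        queue->delta--;
    while (queue && queue->delta == 0) {
        event_t *ev = queue;
        queue = ev->next;
        ev->action();             /* toggle the pin, then re-schedule */
    }
}

Each pulse would be represented by an action that toggles its pin and re-schedules itself with the on or off duration of that pulse, which makes changing periods or duty cycles a matter of changing the re-schedule values.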

Memory Sharing C - Performance

I'm playing around with process creation/scheduling in Linux. As part of that, I have a number of concurrent threads computing a basic hash function over a shared in-memory buffer. Each thread is created using clone, and I'm playing around with the various flags, stack size, etc. to measure process creation time (hence the use of clone).
My experiments are run on a 2-core i7 with hyper-threading enabled.
In this context, I find that, with all flags enabled (CLONE_VM, CLONE_SIGHAND, CLONE_FILES, CLONE_FS), the time it takes to compute n hash functions doubles when I run 4 processes (i.e. one per logical CPU) compared to when I run 2 processes. My understanding is that hyper-threading helps when a process is waiting on IO, so for a CPU-bound process it has almost no effect. Is this correct?
The second observation is that I see pretty high variance (up to 2 seconds) when computing these hash functions (I compute a hash 1,000,000 times). No other process is running on the system (though there are some background threads). I'm struggling to understand why there is so much variance. Is it strictly due to how the scheduler happens to place the processes? I understand that without using sched_setaffinity there is no guarantee that they will be located on different CPUs, so can the variance be explained simply by them landing on the same CPU?
Are there any other ways to guarantee improved reliability without relying on sched_setaffinity?
The third observation is that, even when I run with just 2 threads (when each should be scheduled on a different CPU), I find that performance goes down (not by much, but a little bit). I'm struggling to understand why that is the case. It's the same read-only buffer, and it fits in the cache. Is there some contention in accessing the page table? Would it be preferable to create two processes with distinct address spaces and explicitly share the segment, marking it read-only?
Different threads still run in the context of one process, so they tend to run on the same CPU the process is run on (usually one process is run on one CPU, but that is not guaranteed).
When you run two threads instead of processes you have the overhead of switching between threads; the more calculations you do, the more often this switching happens, so it will be slower than the same calculations done in one thread.
Furthermore, if you run the same calculations in different processes, there is an even bigger overhead for switching between processes, but there is more chance you will run on different CPUs, so in the long run this will probably be faster; not so much for short calculations.
Even if you don't think you have other processes running, the OS has a lot to do all the time and switches to its own processes that you aren't always aware of.
All of this variance stems from the randomness of switching. Hope I helped a bit.
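On the reliability point: since the question already mentions affinity, here is a minimal Linux-specific sketch of pinning the calling thread to one CPU with sched_setaffinity, which removes the placement randomness discussed above (the helper name is mine):

/* Sketch (Linux-specific): pin the calling thread to one CPU so the
 * scheduler cannot migrate it, removing one source of variance. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Each clone()'d worker could call pin_to_cpu(worker_id % num_cpus) as its first action.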

Is there a way to suspend OS scheduling for the duration of a program?

I have an assignment where I am analyzing the runtime of various sorting algorithms. I have written the code but I think it's an unfair comparison.
My code basically grabs the clock time before and after the sorting is finished to compute the elapsed time. However, what if the OS decides to interrupt more frequently during the runtime of a specific sorting algorithm, or decides that some other background application should be given more of the CPU time when its thread comes back up?
I am not a CS major so I may not be entirely correct here, but from what I've read previously I was concerned this might have an impact on the results.
I also realize that if OS scheduling is suspended and the program hangs then there might be a serious problem; I am just wondering if it possible.
Normally, there's no real reason for it. The scheduler will slightly increase the execution time, but if the code runs for a few seconds, the change will be tiny.
So unless you're running heavy applications on the same computer, the amount of noise this will add to your tests is negligible.
In Linux, you can use the isolcpus parameter to mark CPUs that won't be used by the scheduler. You can find information here. I'm not sure what the minimal kernel version is.
If you use it, you'll need to use sched_setaffinity to put your thread on an isolated CPU, because the scheduler won't put it there.
It is not possible, at least not from user-space code. Otherwise, any malicious process could steal the CPU from others.
If you want precise time accounting for your process only, I suggest using the time command. You can read about it here: What do 'real', 'user' and 'sys' mean in the output of time(1)?
Quick answer: you are most likely interested in user time, assuming your code doesn't make heavy use of syscalls (which would be rather strange for a sorting algorithm).
On an up-to-date POSIX system (basically Linux) you can use clock_gettime with CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID, provided you make sure the process doesn't migrate between CPUs (you can set its affinity, for example).
The difference between the times returned by clock_gettime with those arguments gives the exact time the process/thread spent executing. The only pitfall, as mentioned, is process migration, as the man page says:
The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.
This means that you don't really need to suspend all other processes just to measure the execution time of your program.
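For example, a minimal sketch of timing only the CPU time the process consumed (my_sort is a placeholder for the algorithm under test):

/* Sketch: measure only the CPU time this process consumed, so time the
 * scheduler gives to other processes doesn't inflate the measurement. */
#include <time.h>

extern void my_sort(int *a, int n);   /* the algorithm under test */

double time_sort(int *a, int n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    my_sort(a, n);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}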

Can't get any speedup from parallelizing Quicksort using Pthreads

I'm using Pthreads to create a new thread for each partition after the list is split into the right and left halves (less than and greater than the pivot). I do this recursively until I reach the maximum number of allowed threads.
When I use printfs to follow what goes on in the program, I clearly see that each thread is doing its delegated work in parallel. However, using a single process is always the fastest. As soon as I use more threads, the time it takes to finish almost doubles, and it keeps increasing with the number of threads.
I am allowed to use up to 16 processors on the server I am running it on.
The algorithm goes like this:
Split array into right and left by comparing the elements to the pivot.
Start a new thread for the right and left, and wait until the threads join back.
If there are more available threads, they can create more recursively.
Each thread waits for its children to join.
Everything makes sense to me, and sorting works perfectly well, but more threads makes it slow down immensely.
I tried setting a minimum number of elements per partition for a thread to be started (e.g. 50000).
I tried an approach where, when a thread is done, it allows another thread to be started, which leads to hundreds of threads starting and finishing throughout. I think the overhead was way too much, so I got rid of that; if a thread was done executing, no new thread was created. I got a little more speedup, but it was still a lot slower than a single process.
The code I used is below.
http://pastebin.com/UaGsjcq2
Does anybody have any clue as to what I could be doing wrong?
Starting a thread has a fair amount of overhead. You'd probably be better off creating a threadpool with some fixed number of threads, along with a thread-safe queue to queue up jobs for the threads to do. The threads wait for an item in the queue, process that item, then wait for another item. If you want to do things really correctly, this should be a priority queue, with the ordering based on the size of the partition (so you always sort the smallest partitions first, to help keep the queue size from getting excessive).
This at least reduces the overhead of starting the threads quite a bit -- but that still doesn't guarantee you'll get better performance than a single-threaded version. In particular, a quick-sort involves little enough work on the CPU itself that it's probably almost completely bound by the bandwidth to memory. Processing more than one partition at a time may hurt cache locality to the point that you lose speed in any case.
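To illustrate the suggestion, here is a skeleton of such a pool: a fixed number of workers pulling jobs off a mutex-protected queue. All names are mine, it uses a plain FIFO rather than the priority queue suggested above, and a full parallel quicksort would additionally need completion tracking:

/* Skeleton of the thread-pool idea: fixed workers, shared job queue. */
#include <pthread.h>
#include <stdlib.h>

typedef struct job {
    void      (*run)(void *);
    void       *arg;
    struct job *next;
} job_t;

static job_t          *head = NULL, *tail = NULL;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

void submit(void (*run)(void *), void *arg)
{
    job_t *j = malloc(sizeof *j);
    j->run = run; j->arg = arg; j->next = NULL;
    pthread_mutex_lock(&qlock);
    if (tail) tail->next = j; else head = j;
    tail = j;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (!head)
            pthread_cond_wait(&qcond, &qlock);
        job_t *j = head;
        head = j->next;
        if (!head) tail = NULL;
        pthread_mutex_unlock(&qlock);
        j->run(j->arg);           /* may itself call submit() */
        free(j);
    }
    return NULL;
}

void pool_start(int nthreads)
{
    for (int i = 0; i < nthreads; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_detach(t);
    }
}

pool_start(nthreads) is called once at startup; the sort then calls submit() for each partition instead of pthread_create.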
First guess would be that creating, destroying, and especially syncing your threads is going to eat up any possible gain you might receive, depending on just how many elements you are sorting. I'd actually guess that it would take quite a long while to make up the overhead, and that it probably won't ever be made up.
Because of the way you have structured your sort, you have one thread waiting for another waiting for another... so you aren't really getting that much parallelism to begin with. You'd be better off using a more linear sort, perhaps something like radix sort, that splits the work among the threads with more data each. That still has one thread waiting on others a lot, but at least the threads get to do more work in the meantime. Still, I don't think threads are going to help much even with this.
I just had a quick look at your code, and I have one remark: why are you using a lock?
If I understand correctly, what you are doing is something like:
quickSort(array)
{
    left, right = partition(array);
    newThread(quickSort(left));
    newThread(quickSort(right));
}
You shouldn't need a lock: normally each call to quicksort does not access the other part of the array, so no sharing is involved.
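For what it's worth, here is a runnable version of that pseudocode with no lock, combined with the minimum-partition-size idea from the question (the 50000 cutoff is arbitrary, echoing the value tried above). Only one side of a large split gets a new thread; the current thread keeps working on the other side:

/* Lock-free recursive pthread quicksort sketch with a size cutoff. */
#include <pthread.h>

#define CUTOFF 50000

typedef struct { int *a; int lo, hi; } slice_t;

static void swap(int *a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

static int partition(int *a, int lo, int hi)
{
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) swap(a, i++, j);
    swap(a, i, hi);
    return i;
}

static void *qsort_thread(void *arg);

static void quicksort(int *a, int lo, int hi)
{
    if (lo >= hi) return;
    int p = partition(a, lo, hi);
    if (hi - lo < CUTOFF) {                 /* small: stay serial */
        quicksort(a, lo, p - 1);
        quicksort(a, p + 1, hi);
    } else {                                /* large: one side in a new thread */
        slice_t left = { a, lo, p - 1 };
        pthread_t t;
        pthread_create(&t, NULL, qsort_thread, &left);
        quicksort(a, p + 1, hi);            /* this thread keeps working */
        pthread_join(t, NULL);
    }
}

static void *qsort_thread(void *arg)
{
    slice_t *s = arg;
    quicksort(s->a, s->lo, s->hi);
    return NULL;
}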
Unless each thread is running on a separate processor or core, they will not truly run concurrently, and the context-switch time will be significant. The number of threads should be restricted to the number of available execution units, and even then you have to trust that the OS will distribute them to separate processors/cores, which it may not do if they are also being used for other processes.
Also you should use a static thread pool rather than creating and destroying threads dynamically. Creating/destroying a thread includes allocating/releasing a stack from the heap, which is non-deterministic and potentially time-consuming.
Finally, are the 16 processors on the server real, or are they VMs? And are they exclusively allocated to your process?
