How to make best use of multithreading with Monte Carlo simulations?

I am writing code to benchmark simulation algorithms using the basics of Monte Carlo simulation: I generate a random system (just an integer) and run the simulation algorithms on that randomly generated system. I need to do this many, many times, and although the algorithms are conceptually simple, they take a few seconds to run because they contain many loops.
for a = 1:number of algorithms
    for r = 1:number of repeats
        if a == 1
            // run the first algorithm [for loops]
        if a == 2
            // run the second algorithm [while]
        if a == 3
            // run the third algorithm [while]
where each algorithm works differently. The first algorithm can be further broken down into for loops in which it is run many times and the highest score is selected, so I imagine even that algorithm could be multithreaded. The other two would be much more complex to make multithreaded.
My question is how to split the program into different threads. There appear to be many different ways I could approach this, and I am very new to multithreading, so I have no idea which would be best.
Option 1: Split the threads immediately and run different algorithms on each thread.
Option 2: Split the threads with the second for loop, so the number of repeats are split up over the different threads.
Option 3: Try to break down the algorithm steps into smaller chunks which can be parallelized

It depends on how long each repetition and each algorithm takes to execute. Assuming every repetition of a given algorithm takes about the same time, the best approach for this kind of case is most likely to split the outer for loop trivially across threads (after swapping the two for loops, so that the outer loop runs over repetitions that each take the same time).
Running each algorithm in a different thread instead would have no advantage: the algorithms will not take exactly the same time, so some threads will sit idle and you will end up wasting computational power.
Option 3 sounds very unlikely to pay off in a case like this. Besides requiring a significantly more complex program, I doubt you would gain anything from parallelizing the different parts of the algorithm, and I think it is more likely that the code would be slower because the threads have to wait for each other.
As a side note, as I said in the comments, for such simple cases of parallelization I would recommend splitting the runs outside the C code, in a shell script. The operating system will typically schedule each job you launch on a different core, and you gain a lot of flexibility. You will also be able to run it on a cluster with few changes, if any.
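Splitting the repetitions across threads (Option 2) could look roughly like the sketch below. It assumes OpenMP is available (compile with -fopenmp; without that flag the pragma is ignored and the loop simply runs serially). The names run_algorithm, benchmark and splitmix64 are illustrative and not from the question, and run_algorithm is only a placeholder for the real algorithms. Each repetition derives its own generator state from its index, so no random-number state is shared between threads.

```c
#include <assert.h>
#include <stdint.h>

/* Small self-contained generator (splitmix64). Each repetition owns its
   state; rand() keeps hidden global state and is not safe to share. */
static uint64_t splitmix64(uint64_t *state) {
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Hypothetical stand-in for one simulation algorithm: scores a system. */
static double run_algorithm(int algo, uint64_t system) {
    return (double)(system % 1000u) * algo;
}

/* Mean score of one algorithm over `repeats` randomly generated systems. */
double benchmark(int algo, long repeats, uint64_t base_seed) {
    assert(repeats > 0);
    double total = 0.0;
    /* Repetitions are independent, so they can be shared among threads.
       The reduction clause gives each thread a private partial sum. */
    #pragma omp parallel for reduction(+:total) schedule(dynamic, 64)
    for (long r = 0; r < repeats; r++) {
        uint64_t state = base_seed + (uint64_t)r;  /* seed owned by this repetition */
        uint64_t system = splitmix64(&state);      /* the random "system" */
        total += run_algorithm(algo, system);
    }
    return total / (double)repeats;
}
```

The outer loop over algorithms stays serial and calls benchmark(a, repeats, seed) once per algorithm, so all threads always work on repetitions of similar cost.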

Related

How to print data about the execution of my code?

When programming in Haskell we have the interpreter option :set +s, which prints some information about the code you ran. In GHCi it prints the time spent running the code and the number of bytes used; in Hugs it prints the number of reductions made by the interpreter and the number of bytes used. How can I do the same thing in C? I know how to print the time spent running my C code and how to print the number of clock ticks the processor spent running it. But what about the number of bytes and reductions? I want a good way to compare two different programs that do the same thing and see which is the most efficient for me.
Thanks.

If you want to compare performance, just compare time and memory used. Allow both programs to exploit the same number of processor cores, write equivalent programs in both languages, and run benchmarks. If you are using a Unix, time(1) is your friend.
Everything else is not relevant to performance. If a program performed 10x more function calls than another one but ran in half the time, it is still the one with the better performance.
The Benchmarks Game website compares different languages using time/space criteria. You may wish to follow the same spirit.
For more careful profiling of portions of a program, rather than the whole program, you can either use a profiler (in C) or turn on the profiling options (in GHC Haskell). Criterion is also a popular Haskell library for benchmarking Haskell programs. Profiling is typically useful for spotting the "hot spots" in the code: long-running loops, frequently called functions, etc. This tells the programmer where optimization is needed. For instance, if a function cumulatively runs for 0.05s, a 10x speed increase on it is far less useful than a 5% optimization of a function that cumulatively runs for 20 minutes (a gain of 0.045s vs 60s).
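From inside a C program, time and memory can be read with standard calls. The sketch below assumes a POSIX system; work is a hypothetical workload standing in for the code being compared, and cpu_seconds_for and peak_rss are illustrative names. It uses clock() for CPU time and getrusage() for the peak resident set size.

```c
#include <assert.h>
#include <sys/resource.h>
#include <time.h>

/* Hypothetical workload standing in for the code being compared. */
static unsigned long work(unsigned long n) {
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        acc += i % 7;
    return acc;
}

/* CPU seconds spent running work(n), measured with clock(). */
double cpu_seconds_for(unsigned long n) {
    volatile unsigned long sink;   /* keeps the call from being optimised away */
    clock_t start = clock();
    assert(start != (clock_t)-1);  /* clock() reports failure as (clock_t)-1 */
    sink = work(n);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Peak resident set size of this process, or -1 on error.
   The unit is platform-specific: kilobytes on Linux, bytes on macOS. */
long peak_rss(void) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}
```

There is no direct C counterpart to Hugs' reduction count; time and peak memory are the two figures that carry over.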

Nussinov Parallel

I am a biologist and I am trying to learn programming. While learning the pthread library (linked with -lpthread), I found something odd: my parallel version ran slower than the sequential version.
In fact I am still reading the Tanenbaum book, but my main focus is to learn the basics of calculating the secondary structure of RNAs. I found an explanation of the Nussinov algorithm in a book and implemented it. But when I tried to make a parallel version, I believe I may have missed the whole point, as this is my first contact with parallel implementations.
My questions are:
1. How should I implement a data-parallel version of this algorithm?
2. Why is my implementation slightly slower than the sequential one?
The code is available on: https://gist.github.com/drenge/6395472 (each file is a different version parallel/sequential)

There are two ways to make a parallel version of an algorithm/program.
1. You study the algorithm and write the serial program. Afterwards, you profile the program to see where you can obtain speed gains. Those are the places where parallelism might come in handy (might, not will). I call this method the "desperate man's tool". It is useful (!), but most of the time the method below provides better performance gains. This approach only takes programming and user experience into account.
2. You take the algorithm and try to figure out another algorithm that permits parallel handling of the problem. Are there independent calculations or steps in the algorithm? Are there parts of the algorithm that can be done before other parts completely finish? This could be called "the theoretical approach". Keep in mind that every thread has its overhead, and you don't want the overhead to be bigger than the gain you wish to obtain.
In fact, a combination of both is the best way to go (if parallelism is really necessary): first concentrate on method 2 (rework the algorithm so that it stays scientifically correct but can be handled with multithreading). Then look at the critical thread (it can be found by profiling) and start optimising that thread.
As Kerrek SB already said: parallel programming is a very complex topic, with lots of possible pitfalls. And at the end of the road, you should ask yourself whether it is worth the effort. After all, losing weeks of study and programming time to gain a few minutes is not worth your while.
On the other hand, if your program will run thousands of times, frustrating users with long waiting times or a lack of responsiveness, then maybe it could be useful to make a more performant version after all. But again: can't you reach the same goal by optimising a sequential version without the parallel clutter? Many algorithms have a naive implementation of order O(exp(x)) or worse that can be reduced to O(x) or even O(log(x)).
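For the Nussinov recurrence specifically, one place to look for independent calculations is the anti-diagonal of the score table: every cell (i, j) with the same span j - i depends only on cells with a smaller span, so all cells of one span can be filled at the same time. The sketch below is not the code from the gist. It assumes OpenMP rather than raw pthreads (without -fopenmp the pragma is ignored), a simplified scoring of one point per base pair, and no minimum loop length; can_pair, nussinov and MAX_LEN are illustrative names.

```c
#include <assert.h>
#include <string.h>

#define MAX_LEN 256

/* 1 if the two bases can pair (Watson-Crick or G-U wobble), else 0. */
static int can_pair(char a, char b) {
    return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
           (a == 'G' && b == 'C') || (a == 'C' && b == 'G') ||
           (a == 'G' && b == 'U') || (a == 'U' && b == 'G');
}

/* Maximum number of base pairs in seq. Uses a static table, so it is
   not reentrant; good enough for a sketch. */
int nussinov(const char *seq) {
    static int score[MAX_LEN][MAX_LEN];
    int n = (int)strlen(seq);
    assert(n < MAX_LEN);
    if (n < 2)
        return 0;
    memset(score, 0, sizeof score);

    /* Spans must be processed in order: span s needs all spans < s. */
    for (int span = 1; span < n; span++) {
        /* Cells of one span are independent, so they can be shared among
           threads; each thread writes only its own score[i][j]. */
        #pragma omp parallel for
        for (int i = 0; i < n - span; i++) {
            int j = i + span;
            int best = score[i + 1][j];                  /* base i unpaired */
            if (score[i][j - 1] > best)
                best = score[i][j - 1];                  /* base j unpaired */
            if (can_pair(seq[i], seq[j]) && score[i + 1][j - 1] + 1 > best)
                best = score[i + 1][j - 1] + 1;          /* i pairs with j */
            for (int k = i + 1; k < j; k++) {            /* split into two parts */
                int split = score[i][k] + score[k + 1][j];
                if (split > best)
                    best = split;
            }
            score[i][j] = best;
        }
    }
    return score[0][n - 1];
}
```

For short sequences each span holds only a handful of cheap cells, so the cost of starting and synchronising threads on every span can easily exceed the work being shared, which is one plausible reason for a parallel version to come out slower than the sequential one.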
Kind regards,
PB

Avoiding CUDA thread divergence for MISD type operation

As part of a bigger code, I have a CUDA RK4 solver that integrates a large number of ODEs (can be 1000+) in parallel. One step of this operation is calculating 'xdot', which is different for each equation (or data element). As of now, I have a switch-case branching setup in the kernel to calculate the value for each data element. All the different threads use the same 3-6 data elements to calculate their output, but in different ways. For example, for thread 1, it could be
xdot = data[0]*data[0] + data[1];
while for thread 2 it could be,
xdot = -2*data[0] + data[2];
and so on.
So if I have a hundred data elements, the execution path is different for each of them.
Is there any way to avoid/decrease the thread-divergence penalty in such a scenario?
Would running only one thread per block be of any help?

Running one thread per block simply idles 31 of the 32 threads in the single warp you launch and wastes a lot of cycles and opportunities to hide latency. I would never recommend it, no matter how large a branch divergence penalty your code incurs.
Your application sounds pretty orthogonal to the basic CUDA programming paradigm, and there really isn't going to be much you can do to avoid branch divergence penalties. One approach that could slightly improve things would be to perform some prior analysis of the expressions for each equation and group those with common arithmetic terms together. Recent hardware can run a number of kernels simultaneously, so it might be profitable to group calculations sharing like terms into different kernels and launch them simultaneously, rather than launching a single large kernel. CUDA supports C++ templating, which can be a good way of generating a lot of kernel code from a relatively narrow base and of making a lot of logic statically evaluable, which can help the compiler. But don't expect miracles: your problem is probably better suited to a different architecture than the GPU (Intel's Xeon Phi, for example).

Is there a simple way to run a C/C++ program in parallel without recoding?

I have a multi-core machine, but when I tried to run this old C program (http://www.statmt.org/moses/giza/mkcls.html) it only utilized one core. Is there a way to run the C code and send the cycles/threads to the other cores?
Is recoding the code into CUDA the only way?

"I have a multi-cores machine but when i tried to run this old C program (http://www.statmt.org/moses/giza/mkcls.html) it only utilizes one core. Is there a way to run the C code and send the cycles/threads to the other cores?"
Without recompiling, definitely not.
You may be able to make some minor tweaks and use a tool that takes your source and parallelizes it automatically, but since the cores are quite separate (they are "quite far apart"), you can't just spread the instructions between them. The code has to be compiled in such a way that there are two or more independent "streams of instructions". If you were to just send every other instruction to the other core of a dual-core system, it would probably run 10-100 times slower than running all the code on one core, because of all the extra communication that would be needed between the cores. [Each core already has the ability to run several instructions in parallel, and the main reason for multi-core processors in the first place is that this ability only goes so far in making things faster: only so many instructions can run before you need the result of a previous instruction.]
"Is recoding the code into CUDA the only way?"
No, there are many other alternatives: OpenMP, or hand-coding with multiple threads. Or, the simplest approach: start the program two or four times over with different input data, and let the copies run completely separately. This obviously only works if there is something you can run multiple variants of at the same time.
A word on "making things parallel": it's not a magical thing that will make all code faster. Calculating something where each step needs the result of the previous step is pretty hopeless. Say you want to calculate the Fibonacci series, f(n) = f(n-1) + f(n-2), by stepping through it term by term: you can't spread that over parallel calculations, because each term needs the results of the previous ones before it can proceed. On the other hand, if you have a dozen really large numbers that you want to check for primality, you'd be able to do that about four times faster with a 4-core processor and four threads.
If you have a large matrix that needs to be multiplied by another large matrix or vector, that would be ideal to split up so you do part of the calculation on each core.
I haven't looked at the code for your particular project, but just looking at the description, I think it may parallelise quite well.
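The prime-checking case can be sketched as follows. This assumes OpenMP (compile with -fopenmp; without it the pragma is ignored and the loop runs serially) and has nothing to do with the mkcls code; is_prime and count_primes are illustrative names. Each number is an independent task, which is what makes the loop easy to share among cores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trial division: simple, and slow enough on large inputs that each
   call is a sizeable independent piece of work. */
int is_prime(uint64_t n) {
    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (uint64_t d = 3; d <= n / d; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Counts the primes in values[0..count). */
size_t count_primes(const uint64_t *values, size_t count) {
    assert(values != NULL || count == 0);
    long total = 0;
    /* No iteration needs another's result; dynamic scheduling evens out
       the very different running times of individual checks. */
    #pragma omp parallel for reduction(+:total) schedule(dynamic)
    for (long i = 0; i < (long)count; i++)
        total += is_prime(values[i]);
    return (size_t)total;
}
```

A term-by-term Fibonacci loop offers no such independent iterations, which is the contrast drawn above.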

Yes, this is called automatic parallelization and it is an active area of research.
However, I know of no free tools for this. The Wikipedia article "automatic parallelization" has a list of tools. You will need access to the original source code, and you might have to add parallelization directives to the code.

You can run it in multiple processes and write another program that forwards tasks to either of those processes.
CUDA? You only need that if you want it to run on your graphics-card, so in this case that makes no sense.

The limits of parallelism (job-interview question)

Is it possible to solve a problem of O(n!) complexity within a reasonable time, given an infinite number of processing units and infinite space?
The typical example of O(n!) problem is brute-force search: trying all permutations (ordered combinations).

It sure is. Consider the Traveling Salesman Problem in its strict NP form: given this list of costs for traveling from each point to each other point, can you put together a tour with cost less than K? With the new infinite-core CPU from Intel, you just assign one core to each possible permutation, add up the costs (this is fast), and see if any core flags a success.
More generally, a problem in NP is a decision problem such that a potential solution can be verified in polynomial time (i.e., efficiently), and so (since the potential solutions are enumerable) any such problem can be efficiently solved with sufficiently many CPUs.

It sounds like what you're really asking is whether a problem of O(n!) complexity can be reduced to O(n^a) on a non-deterministic machine; in other words, whether Not-P = NP. The answer to that question is no, there are some Not-P problems that are not NP. For example, a limited halting problem (that asks if a program halts in at most n! steps).

The problem would be distributing the work and collecting the results.
If all the CPUs can read the same piece of memory at once, and if each one has a unique CPU-ID that is known to it, then the ID may be used to select a permutation, and the distribution problem is solvable in constant time.
Gathering the results would be trickier, though. Each CPU could compare with its (numerical) neighbor, then that result could be compared with the result of the two closest neighbors, and so on. This is an O(log(n!)) process. Since log(n!) <= n log(n), the gather step is polynomial in n, so it need not rule this approach out.
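Selecting a permutation from a CPU-ID can be done without any communication by decoding the ID in the factorial number system. A minimal sketch, assuming IDs run from 0 to n!-1 and n is at most 20 so that n! fits in 64 bits; factorial, permutation_from_id and MAX_N are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_N 20   /* 20! still fits in a uint64_t */

uint64_t factorial(int n) {
    assert(n >= 0 && n <= MAX_N);
    uint64_t f = 1;
    for (int i = 2; i <= n; i++)
        f *= (uint64_t)i;
    return f;
}

/* Writes the id-th permutation (lexicographic order, 0-based) of
   0..n-1 into out. Every CPU can run this on its own ID. */
void permutation_from_id(uint64_t id, int n, int *out) {
    assert(n >= 1 && n <= MAX_N && id < factorial(n));
    int pool[MAX_N];                     /* elements not yet used, ascending */
    for (int i = 0; i < n; i++)
        pool[i] = i;
    for (int pos = 0; pos < n; pos++) {
        int remaining = n - pos;
        uint64_t block = factorial(remaining - 1);
        int pick = (int)(id / block);    /* which remaining element comes next */
        id %= block;
        out[pos] = pool[pick];
        for (int k = pick; k < remaining - 1; k++)
            pool[k] = pool[k + 1];       /* remove the chosen element */
    }
}
```

For n = 3, IDs 0 through 5 yield 012, 021, 102, 120, 201 and 210, so each CPU obtains a distinct permutation from its ID alone.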

No, N! is even higher than NP. Even supposing that unlimited parallelism could solve an NP problem in polynomial time, which is usually considered a "reasonable" time complexity, an N! problem would still be higher than polynomial on such a setup.

You mentioned search as a "typical" problem, but were you actually asked specifically about a search problem? If so, then yes, search is typically parallelizable, but as far as I can tell O(n!) in principle does not imply the degree of concurrency available, does it? You could have a completely serial O(n!) problem, which means infinite computers won't help. I once had an unusual O(n^4) problem that actually was completely serial.
So, available concurrency is the first thing, and IMHO you should get points for bringing up Amdahl's law in an interview. Next potential pitfall is inter-processor communication, and in general the nature of the algorithm. Consider, for example, this list of application classes: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine. FWIW the O(n^4) code I mentioned earlier sort of falls into the FSM category.
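For reference, Amdahl's law in its usual form, where p is the fraction of the running time that can be parallelised and N is the number of processors (the symbols are chosen here for illustration):

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

With p = 0.9 the speedup can never exceed 10, however many processors are added, and for a completely serial problem (p = 0) it stays at 1.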
Another somewhat related anecdote: I've heard an engineer from a supercomputer vendor claim that if 10% of their CPU time were being spent in MPI libraries, they consider the parallelization a solid success (though that may have just been limited to codes in the computational chemistry domain).

If the problem is one of checking permutations/answers to a problem of complexity O(n!), then of course you can do it efficiently with an infinite number of processors.
The reason is that you can easily distribute atomic pieces of the problem (an atomic piece of the problem might, say, be one of the permutations to check) with logarithmic efficiency.
As a simple example, you could set up the processors as a 'binary tree', so to speak. You could be at the root, and have the processors deliver permutations of the problem (or whatever the smallest pieces of the problem might be) to the leaf processors to solve, and you'd end up solving the problem in log(n!) time.
Remember it's the delivery of the permutations to the processors that takes a long time. Each part of the problem itself will actually be solved instantly.
Edit: Fixed my post according to the comments below.

Sometimes the correct answer is, "How many times does this come up with your code base?" but in this case, there is a real answer.
The correct answer is no, because not all problems can be solved using perfect parallel processing. For example, a travelling salesman-like problem must commit to one path for the second leg of the journey to be considered.
Assuming a fully connected matrix of cities, should you want to display all possible non-cyclic routes for our weary salesman, you're stuck with an O(n!) problem, which can be decomposed into an O(n)*O((n-1)!) problem. The issue is that you need to commit to one path (on the O(n) side of the equation) before you can consider the remaining paths (on the O((n-1)!) side of the equation).
Since some of the computations must be performed before others, there is no way to scatter the work perfectly in a single scatter/gather pass. That means the solution will be waiting on the results of calculations that must finish before the "next" step can start. This is the key: the need for prior partial solutions creates a bottleneck in the ability to proceed with the computation.
Since we've proven we can make a number of these infinitely fast, infinitely numerous, CPUs wait (even if they are waiting on themselves), we know that the runtime cannot be O(1), and we only need to pick a very large N to guarantee an "unacceptable" run time.

This is like asking whether an infinite number of monkeys typing on a monkey-destruction-proof computer with a word processor can come up with all the works of Shakespeare, given an infinite amount of time. The realist would say no, since the conditions are not physically possible. The idealist would say yes, in theory it can happen. Since Software Engineering (Software Engineering, not Computer Science) focuses on real systems we can see and touch, the answer is no. If you doubt me, then go build it and prove me wrong! IMHO.

Disregarding the cost of setup (whatever that might be...assigning a range of values to a processing unit, for instance), then yes. In such a case, any value less than infinity could be solved in one concurrent iteration across an equal number of processing units.
Setup, however, is something significant to disregard.

Each problem could be solved by one CPU, but who would deliver these jobs to all the infinite CPUs? In general, this task is centralized, so if we have infinite jobs to deliver to infinite CPUs, we could take infinite time to do so.
