Suppose that I am using MCMC to sample from some distribution whose full conditionals can be written as:
$$f(\beta|\alpha_1,\ldots,\alpha_{K}),$$
$$g_s(\alpha_{s}|\beta) \mbox{ for }s=1,\ldots,K,$$
and the computational cost of each iteration is dominated by sampling from the $g_s(\alpha_{s}|\beta)$'s.
It seems that within each iteration we can sample each $\alpha_{s}$ from $g_s(\alpha_{s}|\beta)$ separately (in parallel), collect the $\alpha_{s}$'s, and then sample $\beta$ from $f(\beta|\alpha_1,\ldots,\alpha_{K})$, repeating this for every iteration of the MCMC.
My understanding is that this involves a lot of communication (gathering the $\alpha_{s}$'s before sampling from $f(\beta|\alpha_1,\ldots,\alpha_{K})$), so the parallelization may not be very efficient.
To what extent do I need to worry about this communication overhead in this scenario? Is there a better way to improve the sampling efficiency? Any suggestions would be welcome.
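To make the scheme concrete, here is a minimal sketch in C with OpenMP; sample_alpha and sample_beta are hypothetical stand-ins for the actual conditional samplers (the dummy bodies are only there so the sketch compiles), and each $\alpha_{s}$ update carries its own RNG state so the parallel draws do not share a stream:

/* Minimal sketch of one Gibbs iteration: the K alpha-updates run in
 * parallel, then beta is updated serially once all alphas are collected. */
#define K 8   /* number of alpha blocks; illustrative value */

/* Dummy stand-in for a draw from g_s(alpha_s | beta).  A tiny LCG is used
 * only so the code compiles and runs; in practice each s needs its own
 * properly seeded RNG stream. */
static double sample_alpha(int s, double beta, unsigned int *state)
{
    (void)s;                                     /* the real g_s depends on s */
    *state = *state * 1664525u + 1013904223u;
    return beta + (double)(*state) / 4294967296.0;
}

/* Dummy stand-in for a draw from f(beta | alpha_1,...,alpha_K). */
static double sample_beta(const double *alpha, int k)
{
    double sum = 0.0;
    for (int s = 0; s < k; s++) sum += alpha[s];
    return sum / k;
}

void gibbs_iteration(double *alpha, double *beta, unsigned int *rng_state)
{
    /* Each alpha_s depends only on the current beta, so the K draws are
     * independent and can be split across threads. */
    #pragma omp parallel for
    for (int s = 0; s < K; s++)
        alpha[s] = sample_alpha(s, *beta, &rng_state[s]);

    /* Implicit barrier at the end of the parallel for: all alpha_s are
     * gathered before the (serial) beta update; this is the
     * communication/synchronization point mentioned above. */
    *beta = sample_beta(alpha, K);
}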
I am writing code to benchmark simulation algorithms using basic Monte Carlo simulation: I generate a random system (just an integer) and run the simulation algorithms on the randomly generated system. I need to do this many, many times, and although the algorithms are conceptually simple they take a few seconds to run because they contain many loops.
for algo = 1:number of algorithms
    for rep = 1:number of repeats
        if algo == 1
            //run the first algorithm [for loops]
        else if algo == 2
            //run the second algorithm [while]
        else if algo == 3
            //run the third algorithm [while]
where each algorithm works differently. The first algorithm can be broken down further into for loops where it is run many times and the highest score is selected, so I imagine even that algorithm itself could be multithreaded. The other two would be much more complex to make multithreaded.
My question is how to split the program into different threads. There appear to be many different ways I could approach this, and since I am very new to multithreading I have no idea which would be best.
Option 1: Split the threads immediately and run different algorithms on each thread.
Option 2: Split the threads with the second for loop, so the number of repeats are split up over the different threads.
Option 3: Try to break down the algorithm steps into smaller chunks which can be parallelized
It depends on how long each repetition and each algorithm take to execute. Assuming that each repetition takes about the same time as the others (for a given algorithm), the best choice for this kind of case is usually to trivially split the outer for loop across different threads, swapping the two for loops so that the loop being split contains repetitions of the same algorithm and therefore iterations of similar duration (as sketched below).
Running each algorithm in a different thread instead would give no advantage, since the algorithms are not going to take exactly the same time and you would end up wasting computational power.
Option 3 sounds very unlikely to pay off for a case like this. Besides the fact that you would have to design a significantly more complex program, I doubt you would gain much from parallelizing the different parts of the algorithm, and it is more likely that the code would end up slower because the different threads have to wait for each other.
As a side note, as I said in the comments, for very simple cases of parallelization like this I would recommend considering splitting the runs outside the C code, in a shell script. Each job you launch will run on a different core and you will gain a lot of flexibility. You will also be able to run on a cluster with almost no changes, if any.
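A minimal sketch of that outer-loop split, assuming OpenMP is available; run_algorithm() is a placeholder standing in for the three real algorithms, and a reduction combines the per-repetition scores so the threads do not race on a shared accumulator:

#include <omp.h>
#include <stdio.h>

/* Placeholder for "run algorithm `algo` once and return its score";
 * replace with the real benchmarked algorithms. */
static double run_algorithm(int algo, int rep)
{
    return (double)(algo * 1000 + rep);   /* dummy score */
}

int main(void)
{
    const int n_algorithms = 3;
    const int n_repeats    = 1000;

    for (int algo = 1; algo <= n_algorithms; algo++) {
        double total = 0.0;

        /* The repetitions are independent, so split them over threads. */
        #pragma omp parallel for reduction(+:total)
        for (int rep = 0; rep < n_repeats; rep++)
            total += run_algorithm(algo, rep);

        printf("algorithm %d: mean score %.2f\n", algo, total / n_repeats);
    }
    return 0;
}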
As part of a bigger code, I have a CUDA RK4 solver that integrates a large number of ODEs (can be 1000+) in parallel. One step of this operation is calculating 'xdot', which is different for each equation (or data element). As of now, I have a switch-case branching setup to calculate the value for each data element in the kernel. All the different threads use the same 3-6 data elements to calculate their output, but in a different way. For example, for thread 1, it could be
xdot = data[0]*data[0] + data[1];
while for thread 2 it could be,
xdot = -2*data[0] + data[2];
and so on.
So if I have a hundred data elements, the execution path is different for each of them.
Is there any way to avoid/decrease the thread-divergence penalty in such a scenario?
Would running only one thread per block be of any help?
Running one thread per block simply nulls 31/32 of the threads in the single warp you launch and wastes a lot of cycles and opportunities to hide latency. I would never recommend it, no matter how much branch divergence penalty your code incurs.
Your application sounds pretty orthogonal to the basic CUDA programming paradigm, and there really isn't going to be much you can do to avoid branch divergence penalties. One approach which could slightly improve things would be to perform some prior analysis of the expressions for each equation and group those with common arithmetic terms together. Recent hardware can run a number of kernels simultaneously, so it might be profitable to group calculations sharing like terms into different kernels and launch them simultaneously, rather than as a single large kernel. CUDA supports C++ templating, and that can be a good way of generating a lot of kernel code from a relatively narrow base and making a lot of logic statically evaluable, which can help the compiler. But don't expect miracles: your problem is probably better suited to a different architecture than the GPU (Intel's Xeon Phi, for example).
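As a rough illustration of the grouping idea, one option is to bucket the equation indices by expression type on the host, so that each kernel launch (or each contiguous range of threads) evaluates only one expression form and the per-thread switch disappears. The sketch below is plain host-side C under the assumption that a hypothetical eq_type[] array records which xdot expression each equation uses; the kernels themselves are not shown:

/* Host-side sketch: counting sort of equation indices by expression type.
 * After this, grouped[group_start[t] .. group_start[t+1]-1] holds all
 * equations of type t, which can be handed to a dedicated kernel (or a
 * contiguous thread range) with no divergent switch. */
#define N_EQ    1024   /* illustrative sizes */
#define N_TYPES 6

int eq_type[N_EQ];              /* which xdot expression each equation uses */
int grouped[N_EQ];              /* equation indices, grouped by type        */
int group_start[N_TYPES + 1];   /* offset of each group within grouped[]    */

void group_equations_by_type(void)
{
    int counts[N_TYPES] = {0};
    int cursor[N_TYPES];

    for (int i = 0; i < N_EQ; i++)        /* histogram of types */
        counts[eq_type[i]]++;

    group_start[0] = 0;                   /* prefix sums give group offsets */
    for (int t = 0; t < N_TYPES; t++)
        group_start[t + 1] = group_start[t] + counts[t];

    for (int t = 0; t < N_TYPES; t++)
        cursor[t] = group_start[t];

    for (int i = 0; i < N_EQ; i++)        /* scatter indices into groups */
        grouped[cursor[eq_type[i]]++] = i;
}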
I am constructing the partial derivative of a function in C. The process mainly consists of a large number of small loops, each responsible for filling a column of the matrix. Because the matrix is huge, the code should be written efficiently. I have a number of plans in mind for the implementation whose details I won't get into here.
I know that smart compilers try to take advantage of the cache automatically, but I would like to know more about using the cache and writing efficient code and efficient loops. I would appreciate some resources or websites where I can learn more about writing efficient code in terms of reducing memory access time and taking advantage of the cache.
I know that my request may look sloppy, but I am not a computer guy. I did some research but with no success.
So, any help is appreciated.
Thanks
Well-written code tends to be efficient (though not always optimal). Start by writing good, clean code; then, if you actually find a performance problem, it can be isolated and addressed.
It is probably best to write the code in the most readable and understandable way you can and then profile it to see where the bottlenecks really are. Often, your idea of where you need efficiency doesn't match up with reality.
Modern compilers do a decent job with many aspects of optimization and it seems unlikely that the process of looping will itself be a problem. Perhaps you should consider focusing on simplifying the calculation done by each loop.
Otherwise, you'll be looking at things such as accessing your matrix row by row so that you take advantage of the row-major storage order C uses (see this question).
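For example, here is a small sketch of the difference, assuming the matrix is a plain two-dimensional C array (so it is stored contiguously in row-major order):

/* Row-major traversal: the inner loop walks consecutive memory addresses,
 * so each cache line that is fetched gets fully used. */
#define ROWS 1000
#define COLS 1000

double m[ROWS][COLS];

double sum_row_major(void)              /* cache-friendly */
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

double sum_column_major(void)           /* same arithmetic, but each access
                                           jumps COLS doubles ahead, causing
                                           many more cache misses */
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}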
You'll want to build your for loops without if statements inside, because if statements create what is called "branching". The processor essentially guesses which branch will be taken and pays a sometimes hefty penalty when it guesses wrong.
To extend that theme, you want to do as little inside the for loop as possible. You'll also want to define it with static limits, e.g.:
for(int i=1;i<100;i++) //This is better than
for(int i=1;i<N/i;i++) //this
Static limits mean that very little effort is spent determining whether the for loop should keep going. They also let you use OpenMP to divvy up the work in the loops, which can sometimes speed things up considerably. This is simple to do:
#pragma omp parallel for
for(int i=0;i<100;i++)
And, voilà! The code is parallelized.
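For completeness, here is a self-contained version of that idea (the loop body is just placeholder work); with gcc the OpenMP support is typically enabled with the -fopenmp flag:

/* Each iteration writes to a distinct element of out[], so the iterations
 * are independent and OpenMP can safely split them over the threads. */
#include <math.h>
#include <stdio.h>

#define N 100

double out[N];

int main(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        out[i] = sin(0.01 * i);      /* placeholder computation */

    printf("out[N-1] = %f\n", out[N - 1]);
    return 0;
}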
For the last few days I have been working on C-MEX code in order to speed up a DBSCAN MATLAB implementation. I have now finished a DBSCAN in C-MEX, but it actually takes more time (14.64 seconds in MATLAB, 53.39 seconds in C-MEX) on my test data, which is a 3 x 14414 matrix. I think this is due to the use of the mxRealloc function in several parts of my code. It would be great if someone could give me some suggestions on how to get better results.
Here is the code DBSCAN1.c:
https://www.dropbox.com/sh/mxn757a2qmniy06/PmromUQCbO
Using mxRealloc in every iteration of a loop is indeed a performance killer. You can use a vector or similar container class instead. Dynamic allocation is not needed at all in your distance function.
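If the code has to stay in plain C, a common fix along the same lines is to allocate once to a known upper bound (or grow the buffer geometrically) rather than calling mxRealloc once per element. A rough sketch, with find_neighbors, dist and eps as illustrative names and assuming the neighbor list can never exceed the number of points n:

/* Collect the indices with dist[i] <= eps without per-element
 * reallocation: one mxMalloc sized to the worst case, then at most one
 * mxRealloc to shrink the buffer to the actual count. */
#include "mex.h"

mwSize find_neighbors(const double *dist, mwSize n, double eps,
                      mwSize **out_idx)
{
    mwSize *idx = (mwSize *)mxMalloc(n * sizeof(mwSize));  /* worst case */
    mwSize count = 0;

    for (mwSize i = 0; i < n; i++)
        if (dist[i] <= eps)
            idx[count++] = i;

    /* Shrink once at the end instead of growing inside the loop. */
    *out_idx = (mwSize *)mxRealloc(idx, (count ? count : 1) * sizeof(mwSize));
    return count;
}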
If your goal is not to implement DBSCAN as a mex but to speed it up, I will offer you a different solution.
I don't know which MATLAB implementation you are using, but you won't make a trivial n^2 implementation much faster just by rewriting it in C the same way. Most of the time is spent calculating the nearest neighbors, which won't be faster in C than it is in MATLAB. DBSCAN can run in O(n log n) time by using an index structure to find the nearest neighbors.
For my application I am using this implementation of dbscan, but I have changed the calculation of nearest neighbors to use a kd-tree (available here). The speedup was sufficient for my application and no reimplementation was required. I think this will be faster than any n^2 C implementation, no matter how well you write it.
How can I know if my serial code will run faster on a GPU? I know it depends on a lot of things, e.g. whether the code can be parallelized in a SIMD fashion and so on, but what considerations should I take into account to be "sure" that I will gain speed? Should the algorithm be embarrassingly parallel? Should I not bother trying the GPU if parts of the algorithm cannot be parallelized? Should I take into consideration how much memory is required for a sample input?
What are the "specs" of a serial code that would make it run faster on a GPU? Can a complex algorithm gain speed on a GPU?
I don't want to waste time trying to code my algorithm for the GPU unless I am 100% sure that speed will be gained; that is my problem.
I think that my algorithm could be parallelized on a GPU. Would it be worth trying?
It depends upon two factors:
1) The speedup of having many cores performing the floating point operations
This is dependent upon the inherent parallelization of the operations you are performing, the number of cores on your GPU, and the differences in clock rates between your CPU and GPU.
2) The overhead of transferring the data back and forth between main memory and GPU memory.
This is mainly dependent upon the "memory bandwidth" of your particular GPU, and is greatly reduced by the Sandy Bridge architecture where the CPU and GPU are on the same die. With older architectures, some operations such as matrix multiplication where the inner dimensions are small get no improvement. This is because it takes longer to transfer the inner vectors back and forth across the system bus than it does to dot product the vectors on the CPU.
Unfortunately these two factors are tough to estimate and there is no way to "know" without trying it. If you currently use BLAS for your SIMD operations, it is fairly simple to substitute in cuBLAS, which mirrors the BLAS API except that it performs the operations on the GPU.
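To make that substitution concrete, here is a rough sketch comparing a CPU BLAS matrix multiply with the corresponding cuBLAS call (the modern cuBLAS interface adds a handle and explicit host/device transfers; error checking is omitted for brevity):

/* Sketch: C = A*B, once with CPU BLAS and once with cuBLAS.  The math
 * call is nearly identical; the extra work is moving the matrices to and
 * from GPU memory, which is exactly the transfer overhead discussed above. */
#include <cblas.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_cpu(int m, int n, int k, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, m, B, k, 0.0, C, m);
}

void gemm_gpu(int m, int n, int k, const double *A, const double *B, double *C)
{
    double *dA, *dB, *dC;
    const double alpha = 1.0, beta = 0.0;
    cublasHandle_t handle;

    cublasCreate(&handle);
    cudaMalloc((void **)&dA, (size_t)m * k * sizeof(double));
    cudaMalloc((void **)&dB, (size_t)k * n * sizeof(double));
    cudaMalloc((void **)&dC, (size_t)m * n * sizeof(double));

    /* Host -> device transfers: the overhead that can dominate when the
     * matrices are small. */
    cublasSetMatrix(m, k, sizeof(double), A, m, dA, m);
    cublasSetMatrix(k, n, sizeof(double), B, k, dB, k);

    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    /* Device -> host transfer of the result. */
    cublasGetMatrix(m, n, sizeof(double), dC, m, C, m);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}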
When looking for a parallel solution you should typically ask yourself questions such as:
How much data do you have?
How much floating point computation do you have?
How complicated is your algorithm, i.e. how many conditions and branches does it have? Is there any data locality?
What kind of speedup is required?
Is it real-time computation or not?
Do alternative algorithms exist (even if they are not the most efficient serial algorithms)?
What kind of software/hardware do you have access to?
Depending on the answers, you may want to use GPGPU, cluster computing, distributed computing, or a combination of GPU and cluster/distributed machines.
If you could share any information about your algorithm and the size of your data, it would be easier to comment.
Regular C code can be converted to CUDA remarkably easily. If the heavy hitters in your algorithm's profile can be parallelized, try it and see if it helps.