Dealing with 2D grids with MPI - c

I'm using MPI to parallelise some simple C code, but I'm having a hard time fundamentally understanding it.
Everything was quite reasonable when I was first learning about MPI, but this has stumped me. If you create a 2D grid using two nested loops, i.e.
for(i=0; i<imax; i++) {
    for(j=0; j<jmax; j++) {
        grid[i][j] = ...
    }
}
How would you even go about parallelising this? I initially thought you would use a master thread to gather all of the data that each thread calculates, but apparently this scales badly.
I believe you need to call MPI_Gather(...) for all processes, but I don't really understand how you would then collect this data into one coherent 2D grid.
I should probably note that the ... in the loop is just where it calculates a small value for each grid point. It uses a red/black scheme, so there are no issues with race conditions, etc.
Thanks!
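One common way to approach this is to give each process a contiguous block of rows and to assemble the blocks on one rank with a single collective call. The sketch below only shows the bookkeeping for that, under assumptions the question does not state: the grid is stored as one contiguous row-major array of doubles, and the helper names split_rows and gather_layout are made up for illustration. The MPI_Gatherv call appears only in a comment, so the block compiles without MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split imax rows as evenly as possible over nprocs
 * ranks. On return, rank r owns rows first_row[r] .. first_row[r]+nrows[r]-1. */
void split_rows(int imax, int nprocs, int *first_row, int *nrows)
{
    assert(imax >= 0 && nprocs > 0);
    int base = imax / nprocs;   /* rows that every rank gets */
    int extra = imax % nprocs;  /* the first `extra` ranks get one more row */
    int start = 0;
    for (int r = 0; r < nprocs; r++) {
        nrows[r] = base + (r < extra ? 1 : 0);
        first_row[r] = start;
        start += nrows[r];
    }
}

/* Hypothetical helper: turn the row split into element counts and
 * displacements (in doubles) for a row-major grid with jmax columns. */
void gather_layout(int nprocs, int jmax, const int *first_row,
                   const int *nrows, int *counts, int *displs)
{
    assert(nprocs > 0 && jmax >= 0);
    for (int r = 0; r < nprocs; r++) {
        counts[r] = nrows[r] * jmax;
        displs[r] = first_row[r] * jmax;
    }
}

/*
 * Each rank would then fill only its own rows into a local buffer:
 *
 *     for (i = first_row[rank]; i < first_row[rank] + nrows[rank]; i++)
 *         for (j = 0; j < jmax; j++)
 *             local[(i - first_row[rank]) * jmax + j] = ...;
 *
 * and rank 0 could collect the pieces into one coherent grid with:
 *
 *     MPI_Gatherv(local, counts[rank], MPI_DOUBLE,
 *                 grid, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);
 */
```

With 10 rows on 3 ranks this gives row counts 4, 3, 3 starting at rows 0, 4, 7. If each update reads neighbouring grid points, adjacent ranks would probably also need to exchange their boundary rows between sweeps; that step is not shown here.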

Related

Indexing ones into a matrix of zeros without loops

I'm fairly new to MATLAB and I need some help with this problem.
The problem is to write a function that creates an n-by-n square matrix of zeros with ones on the reverse diagonal.
I tried this code:
function s=reverse_diag(n)
    s=zeros(n);
    i=1;j=n;
    while i<=n && j>=1
        s(i,j)=1;
        i=i+1;
        j=j-1;
    end
But I want another way of solving it, without using loops or the diag and eye commands.
Thanks in advance
The easiest and most obvious way to achieve this without loops would be to use
s=fliplr(eye(n))
Since you stated that you don't want to use eye (for whatever reason), you could also work with the sub2ind command. It would look like this:
s=zeros(n);
s(sub2ind(size(s),1:size(s,1),size(s,2):-1:1))=1

Cleanest way to choose between functions inside loop to avoid overhead?

I have a function that computes the Minimum Spanning Tree of a graph, and I have two different ways to compute the edge weights in the tree.
The edge_weight() function is called inside a few for loops over and over, and which function is used depends on the type of tree, which is specified outside of the main function (by the user).
My problem is that everything inside the for loops besides the edge_weight() function is identical, so I don't want to copy/paste code.
But at the same time I don't want to condition inside the for loops on which edge_weight() function to use, in order to avoid the overhead of a repeated if condition.
So what's the cleanest way to specify which function to use before the for loops start without copy/pasting the same code?
Can I have pointers to functions (does that slow the code down, though?)?
Can I put functions in variables, arrays?
EDIT: Speed is crucial for this application, hence my trying to avoid conditioning inside the for loops.
Here's some pseudo-code:
while(nodes < threshold)
{
    for (int i = 0; i < K; i++)
    {
        if (nodes[i] == something)
        {
            weight = one_of_two_edge_weight_functions();
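To illustrate the function-pointer idea raised in the question, here is a minimal sketch, assuming the code is C (the question does not say). Every name in it (edge_weight_fn, weight_abs_diff, weight_sq_diff, choose_edge_weight, total_weight) is hypothetical, and the loop is only a stand-in for the real spanning-tree loops. The function is chosen once, before the loop starts, and the loop body is written only once.

```c
#include <assert.h>
#include <stddef.h>

/* Common signature shared by both (hypothetical) edge weight functions. */
typedef double (*edge_weight_fn)(double a, double b);

/* Two made-up ways of weighting an edge between values a and b. */
static double weight_abs_diff(double a, double b)
{
    return a > b ? a - b : b - a;
}

static double weight_sq_diff(double a, double b)
{
    double d = a - b;
    return d * d;
}

enum tree_type { TREE_ABS, TREE_SQUARED };

/* Pick the weight function once, outside any loop. */
edge_weight_fn choose_edge_weight(enum tree_type type)
{
    return type == TREE_ABS ? weight_abs_diff : weight_sq_diff;
}

/* Stand-in for the hot loop: sums the weights between consecutive values.
 * The loop body is identical whichever function was chosen. */
double total_weight(const double *values, size_t n, edge_weight_fn edge_weight)
{
    assert(edge_weight != NULL);
    double total = 0.0;
    for (size_t i = 1; i < n; i++)
        total += edge_weight(values[i - 1], values[i]);
    return total;
}
```

A call such as total_weight(v, n, choose_edge_weight(TREE_ABS)) keeps the selection out of the loop. Whether the indirect call ends up faster or slower than a well-predicted if inside the loop depends on the compiler and the hardware, so it is worth timing both versions.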

Efficient/Parallel Apply an operation to each element in an Array in Java <= 1.7?

I am interested in ways of applying an operation (independently) to each element in a Java array. For example, clipping each element of a numeric array to be no more than a given value.
Example:
myArray.clip(5)
Or
Utils.clip(myArray, 5)
would set each element greater than 5 to 5.
The most straightforward method is to iterate over the elements, but I don't like it for two reasons:
It is iterative and doesn't exploit the opportunities for parallelization
The code doesn't look beautiful; a mapping or vectorized syntax would be nicer
I need to do such a clipping operation about 5000 times, each time on a 70 x 10 2D array.
If Java <= 1.7 doesn't provide ways to do that in the standard library, how can I achieve what I want (the two points above) using other libraries or Java 1.8 features?
The traditional Java 7 technique for operating on the data in a collection in parallel is the "fork/join" pattern. The notion is that one decomposes the work to be done into smaller pieces, forks off a task to complete each piece, and then joins each task at the end.
I'm not sure your case warrants this, though. Depending on how complicated a "clip" operation is, it might be more straightforward to create a pair of threads to operate on the two axes simultaneously. Though, I may be misunderstanding your requirements.
Another solution might be a time/space trade-off. It may be more useful to maintain a clipped version of the data as you update the primary collection. Sort of a producer-consumer model, where there is something that looks for changes to the collection, and maintains a clipped version.
I am not sure about Arrays themselves, but if you have a "Collection" compatible type you can use parallel aggregate operations in Java 8.
Refer to https://docs.oracle.com/javase/tutorial/collections/streams/index.html
Using Java 8 features, this is how I went about it: I iterated over the rows and did parallel operations on each one. I didn't find any standard method that takes a
double [] []
and uses it as a stream, or a
parallelSetAll(double [] [] ...)
In my code I use the Apache commons-math library for my matrix implementation, because it provides utility functions based on the RealMatrix type implemented in the library.
Can this code be optimized further without having to go too low level? For example, by taking things out of the for loop or doing the assignments later, etc.
public static RealMatrix clipLower(RealMatrix m, double lowerBound){
    // TODO see if you really need to modify m or just return a modified copy of m
    // process the matrix row by row, each row gets processed in parallel
    // TODO See if you can directly parallelize on the entire matrix not row by row
    // OPEN implement using Arrays.parallelSetAll Method?
    int nrOfRows = m.getRowDimension();
    for(int i = 0; i < nrOfRows; i++){
        // TODO move the declaration outside the for loop
        double[] currRow = m.getRow(i);
        double[] newRow = Arrays.stream(currRow)
                                .parallel()
                                .map((number) -> (number < lowerBound) ? lowerBound : number)
                                .toArray();
        m.setRow(i, newRow);
    }
    return m;
}

Initial Hidden Markov Model for the Baum Welch algorithm

While trying to write a program for hidden Markov models, I made the simplest assumption for the initial HMM of the Baum-Welch algorithm: set everything to a uniform distribution. That is,
A[i][j] = 1/statenumber;
B[i][j] = 1/observationnumber;
P[i] = 1/statenumber;
up to a logarithm to avoid underflow. It has the benefit of not requiring a normalization check.
But so far, I've run into the algorithm not actually doing much. The emission matrix changes at the first iteration, but not after that, and the transition matrix and initialization vector do not evolve at all. It seems that the gamma matrix does not change at all.
At first I thought it was my algorithm not working out too well, but after trying it on some other HMM libraries, I seem to get the same type of results.
Is it impossible to converge to the correct HMM using such an initialization, and what is the ideal method to initialize those arrays?
The Baum Welch algorithm won't work with a uniform initial distribution -- the updates will be degenerate. Try to randomize it instead.
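One way to read "degenerate" here is that with a uniform start every state looks identical, so re-estimation has nothing to tell the states apart. A minimal sketch of a randomized start follows, assuming the matrices are stored as flat row-major arrays of plain probabilities (logarithms could be taken afterwards). The function names and the choice of drawing each entry from [1, 2] before normalizing are illustrative assumptions, not something prescribed by the answer.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill one row of length n with random positive values that sum to 1.
 * Each raw value is drawn from [1, 2], so the row stays close to uniform
 * but no two rows are exactly alike and no entry is zero. */
void random_stochastic_row(double *row, int n)
{
    assert(n > 0);
    double sum = 0.0;
    for (int j = 0; j < n; j++) {
        row[j] = 1.0 + (double)rand() / RAND_MAX;
        sum += row[j];
    }
    for (int j = 0; j < n; j++)
        row[j] /= sum;  /* normalize so the row is a valid distribution */
}

/* Initialize A (nstates x nstates), B (nstates x nobs) and P (nstates),
 * all stored as flat row-major arrays. Hypothetical helper. */
void init_hmm_random(double *A, double *B, double *P, int nstates, int nobs)
{
    assert(nstates > 0 && nobs > 0);
    for (int i = 0; i < nstates; i++) {
        random_stochastic_row(A + i * nstates, nstates);
        random_stochastic_row(B + i * nobs, nobs);
    }
    random_stochastic_row(P, nstates);
}
```

Calling srand() with a fixed seed before init_hmm_random() makes runs reproducible. Since Baum-Welch only finds a local optimum, trying several seeds and keeping the best likelihood is a common precaution.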

How would you avoid False Sharing in a scenario like this?

In the code below I have parallelised using OpenMP's standard parallel for directive.
#pragma omp parallel for private(i, j, k, d_equ) shared(cells, tmp_cells, params)
for(i=0; i<some_large_value; i++)
{
    for(j=0; j<some_large_value; j++)
    {
        ....
        // Some operations performed over here which are using private variables
        ....
        // Accessing a shared array is causing False Sharing
        for(k=0; k<10; k++)
        {
            cells[i * width + j].speeds[k] = some_calculation(i, j, k, cells);
        }
    }
}
This has given me a significant improvement to runtime (~140s to ~40s) but there is still one area I have noticed really lags behind - the innermost loop I marked above.
I know for certain the array above is causing False Sharing because if I make the change below, I see another huge leap in performance (~40s to ~13s).
for(k=0; k<10; k++)
{
    double somevalue = some_calculation(i, j);
}
In other words, as soon as I changed the write target to a private variable, there was a huge speed-up.
Is there any way I can improve my runtime by avoiding False Sharing in the scenario I have just explained? I cannot seem to find many resources online that seem to help with this problem even though the problem itself is mentioned a lot.
I had an idea to create an overly large array (10x what is needed) so that enough margin space is kept between each element to make sure when it enters the cache line, no other thread will pick it up. However this failed to create the desired effect.
Is there any easy (or even hard if needs be) way of reducing or removing the False Sharing found in that loop?
Any form of insight or help will be greatly appreciated!
EDIT: Assume some_calculation() does the following:
(tmp_cells[ii*params.nx + jj].speeds[kk] + params.omega * (d_equ[kk] - tmp_cells[ii*params.nx + jj].speeds[kk]));
I cannot move this calculation out of my for loop because I rely on d_equ which is calculated for each iteration.
Before answering your question, I have to ask: is it really a false sharing situation when you pass the whole cells array as input to some_calculation()? It seems you are actually sharing the whole array. You may want to provide more information about this function.
If yes, go on with the following.
You've already shown that a private variable, double somevalue, improves the performance. Why not just use this approach?
Instead of using a single double variable, you could define a private array private_speed[10] just before the for k loop, fill it in the loop, and copy it back to cells after the loop with something like
memcpy(cells[i*width+j].speeds, private_speed, sizeof(...));
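The private-buffer suggestion could look roughly like the sketch below. Several things in it are assumptions, because the question does not show them: the cell_t layout, the function name update_cells, the helper relax (a stand-in for some_calculation() built from the formula in the EDIT), and the constant 1.0 used in place of d_equ[k]. The buffer is declared inside the loop body, so each thread gets its own copy without needing a private clause.

```c
#include <assert.h>
#include <string.h>

#define NSPEEDS 10

/* Assumed layout of one cell; the real struct is not shown in the question. */
typedef struct { double speeds[NSPEEDS]; } cell_t;

/* Hypothetical stand-in for some_calculation(), following the EDIT formula:
 * old + omega * (equ - old). */
static double relax(double old, double equ, double omega)
{
    return old + omega * (equ - old);
}

void update_cells(cell_t *cells, const cell_t *tmp_cells,
                  int height, int width, double omega)
{
    assert(height >= 0 && width >= 0);
    /* The pragma is ignored when compiled without OpenMP support. */
    #pragma omp parallel for
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            /* Declared inside the loop body, so it is private to each thread. */
            double private_speed[NSPEEDS];
            const cell_t *src = &tmp_cells[i * width + j];
            for (int k = 0; k < NSPEEDS; k++) {
                /* 1.0 is a placeholder for d_equ[k], which is not shown. */
                private_speed[k] = relax(src->speeds[k], 1.0, omega);
            }
            /* One bulk write to the shared array per cell. */
            memcpy(cells[i * width + j].speeds, private_speed,
                   sizeof private_speed);
        }
    }
}
```

Whether this helps depends on whether false sharing really is the bottleneck, so it is worth timing before and after. Since the outer loop hands each thread whole rows, only cells near the boundaries between threads' row ranges are likely to share a cache line.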
