MPI Master/Slave with 2D array

I am quite new to parallel programming with MPI.
I am dealing with the following problem related to the MASTER/SLAVE approach.
I have a square 2D array of SIZE=500, and I need to break it into several square blocks of dimension D < SIZE.
I need to implement a Master/Slave MPI scheme in which each processor receives N blocks and sends them back to the master, where N depends on the number of processors involved and on the dimension D of the sub-blocks.
I managed to solve the problem by dividing the original array into stripes, but I don't know how to deal with squares!

To simplify the problem, D should be a divisor of 500. The total number of blocks is then blocks = (500/D)^2, and N should be something along the lines of N = blocks/cpus.
IMHO the simplest way would be to copy a square of DxD elements out of the array and send that chunk of data to the client. Depending on the language and method, you can build small objects and send them to the client, or just replicate the full matrix and send the coordinates of the chunk.
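As a rough illustration of the bookkeeping described above, here is a minimal Python sketch that lists the top-left coordinates of every DxD block and deals them out to the workers. The function names, the round-robin policy and the values D = 100 and 4 workers are assumptions made for the example; none of them come from the question, and no MPI calls are involved.

```python
def block_origins(size, d):
    """Return the (row, col) top-left corner of every d x d block."""
    if size % d != 0:
        raise ValueError("d must divide size evenly")
    per_side = size // d
    return [(r * d, c * d) for r in range(per_side) for c in range(per_side)]


def assign_blocks(origins, workers):
    """Deal blocks out round-robin: worker w gets origins[w::workers]."""
    return [origins[w::workers] for w in range(workers)]


# Illustrative values only (D and the worker count are not given in the question).
SIZE, D, WORKERS = 500, 100, 4

origins = block_origins(SIZE, D)
plan = assign_blocks(origins, WORKERS)

print("total blocks:", len(origins))  # (SIZE / D) ** 2
for w, blocks in enumerate(plan):
    print("worker", w, "gets", len(blocks), "blocks, starting with", blocks[:2])
```

When the block count is not a multiple of the worker count, some workers simply receive one block more than others; in a real master/slave loop the master could instead hand out the next origin whenever a worker reports back.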

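Another option is to use MPI_Type_create_subarray() to create a derived datatype for a given (sub) square/rectangle of your array.
On the downside, this derived datatype cannot be used as-is with collective operations such as MPI_Scatter[v]() and MPI_Gather[v](), which are usually the "natural" MPI way of distributing / re-assembling data, because its extent spans the whole array. A common workaround is to adjust the extent with MPI_Type_create_resized() and pass explicit displacements to MPI_Scatterv() / MPI_Gatherv().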
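To make the idea of a sub-array datatype more concrete, the following Python sketch emulates what such a type describes: the flat row-major offsets of a sub-block given the full sizes, the sub-block sizes and its start coordinates. This is only an emulation under the assumption of C (row-major) ordering; the helper names are hypothetical and the code does not call MPI.

```python
def subarray_offsets(sizes, subsizes, starts):
    """Flat row-major offsets of a 2-D sub-block.

    The three arguments mirror the sizes / subsizes / starts idea of a
    sub-array datatype; this helper itself is purely illustrative.
    """
    rows, cols = sizes
    sub_rows, sub_cols = subsizes
    r0, c0 = starts
    if r0 < 0 or c0 < 0 or r0 + sub_rows > rows or c0 + sub_cols > cols:
        raise ValueError("sub-block does not fit inside the array")
    return [(r0 + r) * cols + (c0 + c)
            for r in range(sub_rows) for c in range(sub_cols)]


def pack(flat, offsets):
    """Copy the sub-block out of the flat array (what a send would transmit)."""
    return [flat[k] for k in offsets]


def unpack(flat, offsets, values):
    """Write a processed sub-block back into place (what the master would do)."""
    for k, v in zip(offsets, values):
        flat[k] = v


# A 6 x 6 array stored row by row, with a 3 x 3 block starting at (3, 3).
flat = list(range(36))
offsets = subarray_offsets((6, 6), (3, 3), (3, 3))
block = pack(flat, offsets)
print(block)

# Stand-in for the work a slave would do on its block.
unpack(flat, offsets, [2 * v for v in block])
print(flat[21], flat[0])
```

The point of the derived datatype is that MPI performs this gather/scatter of non-contiguous elements itself, so no manual packing loop is needed on either side.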

Related

Optimizing sparse matrix solve time

Our simulations involve large, very sparse SPD systems of equations (a resistive network with current sources). We solve Ax = b, where A is the conductance matrix and b is the current vector. We have effective solution methods (Eigen/sparse and/or Tim Davis's LDL). During the simulation, only a few of the elements of A change between time steps, but we need to factorize the entire matrix for each new solution (though we can avoid the ordering step in many cases).
We are wondering whether there are methods that could segregate the fixed portion of A from the dynamic portions, factorize the fixed portion separately from the dynamic ones, then combine the two for the solution (forward/back substitution). From a top-level understanding of standard solution methods, my sense is that this is not possible. But .... ??
Thanks in advance
Kevin

How do I fill a histogram in MATLAB when there are extremely many copies of the vector to be histogrammed?

I am trying to collect statistics of a 6D vector W and plot a 1D histogram for each coordinate. I get 729000000 different copies of this vector (each 6-dimensional). To store them, I create an array of zeros of size 729000000x6 before I get any of the actual W's, and this seems to be a problem in MATLAB, since it says:
Error using zeros
Requested 729000000x6 (32.6GB) array exceeds maximum array size preference. Creation of arrays
greater than this limit may take a long time and cause MATLAB to become unresponsive. See array
size limit or preference panel for more information.
The reason I did this at first was that it was easy to fill W_history and then just feed it to the histogram plotter:
histogram(W_history(:,d),nbins,'Normalization','probability')
However, filling W_history seemed impossible for a high number of copies of W. Is there a way to do this in MATLAB automatically? It feels like there should be, and I didn't want to re-invent the wheel.
I am sure I could create, for each coordinate, an array of counters where I count how many times the value of that coordinate of W falls in each bin. However, implementing that, including the checks for which bin each value should fall in, seemed inefficient or even unnecessary. Is this really the only solution, or what do MATLAB experts recommend? Is this re-inventing the wheel? It also seems inefficient if I implement it myself.
I also thought I could manually have MATLAB move things between memory and disk (as in, store W_history on disk as it fills, write more to disk as it fills again, and eventually somehow plug it all into the histogram plotter), but that seemed like overkill. I hope I can avoid a solution like this one. It feels like the wrong solution, since using MATLAB should be "easy" and high-level, and going down to disk and memory management doesn't seem to me what MATLAB is intended for.
Currently, thanks to a comment I was given, the best solution I have so far uses histcounts as follows:
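W_hist_counts = zeros(1, numel(edges)-1);  % running total, one entry per bin
for i=2:iter+1
    % draw the next sample vector
    W = get_new_W(W);
    % bin this sample and add its counts to the running total
    W_hist_counts_current = histcounts(W,edges);
    W_hist_counts = W_hist_counts + W_hist_counts_current;
end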
However, after this it seems difficult to convert W_hist_counts to a pdf/probability or other normalizations, since it seems they have to be computed manually. Is there no official way to do this processing without the user having to implement the normalizations again?
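For what it is worth, the manual normalization is short once the counts have been accumulated. The Python sketch below accumulates counts batch by batch and then converts them using the usual definitions: probability = count / total, and pdf = count / (total * bin width). The function names and the sample data are made up for the example, and it assumes every sample falls inside the bin edges, so that the total equals the sum of the counts.

```python
from bisect import bisect_right


def bin_counts(values, edges):
    """Count values per bin; bins are [e_k, e_k+1), the last one also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # out-of-range samples are ignored in this sketch
        k = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[k] += 1
    return counts


def to_probability(counts):
    """Each bin becomes count / total, so the values sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples were counted")
    return [c / total for c in counts]


def to_pdf(counts, edges):
    """Each bin becomes count / (total * width), so the histogram area is 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples were counted")
    return [c / (total * (edges[k + 1] - edges[k])) for k, c in enumerate(counts)]


edges = [0.0, 0.5, 1.0, 2.0]
batches = [[0.1, 0.4, 0.6], [0.9, 1.5, 2.0], [0.2, 1.0]]  # hypothetical samples

running = [0] * (len(edges) - 1)
for batch in batches:  # only one batch is held in memory at a time
    running = [a + b for a, b in zip(running, bin_counts(batch, edges))]

print(running)
print(to_probability(running))
print(to_pdf(running, edges))
```

Only the running counts (one number per bin) are kept, so memory use does not grow with the number of samples.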

Improving performance when looping in a big data set

I am doing some spatio-temporal analysis (with MATLAB) on a quite big data set, and I am not sure what the best strategy is in terms of performance for my script.
The data set is split into 10 yearly arrays of dimension (latitude,longitude,time)=(50,60,8760).
The general structure of my analysis is:
for iterations=1:Big Number
1. Select a specific site of spatial reference (i,j).
2. Do some calculation on the whole time series of site (i,j).
3. Store the result in archive array.
end
My question is:
Is it better (in terms of general performance) to have
1) all data in big yearly (50,60,8760) arrays, loaded once as global variables. At each iteration the script has to extract one particular "site" (i,j,:) from those arrays for processing.
2) 50*60 distinct files stored in a folder, each file containing the time series of one particular site (a vector of dimension (Total time range,1)). At each iteration the script then has to open a specific file from the folder, process its data and close it.
Because your computations run over the entire time series, I would suggest storing the data in a 3000x8760 matrix, one row per site, and doing the computations that way.
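Your accesses will then be simpler, although reshape alone does not move any data: MATLAB stores arrays column-major, so a row of newdata is still strided in memory. To make each time series contiguous, and so cache-friendly, time has to be the first dimension (for example via permute).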
You can reformat your data using the reshape function:
newdata = reshape(olddata,50*60,8760);
Now, instead of accessing olddata(i,j,:), you need to access newdata(sub2ind([50 60],i,j),:).
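The index arithmetic behind that mapping can be sketched in a few lines of Python. The helper below reproduces the 1-based, column-major rule used by sub2ind for a 2-D size; the second helper, flat_offset, is a hypothetical name introduced here to show where an element of the 3000x8760 matrix sits in column-major storage.

```python
def sub2ind_2d(shape, i, j):
    """1-based column-major linear index for a 2-D size (the rule sub2ind follows)."""
    rows, cols = shape
    if not (1 <= i <= rows and 1 <= j <= cols):
        raise IndexError("subscript out of range")
    return i + (j - 1) * rows


def flat_offset(site_row, t, n_sites=3000):
    """Zero-based storage offset of element (site_row, t) in a column-major n_sites x T matrix."""
    return (site_row - 1) + (t - 1) * n_sites


site_row = sub2ind_2d((50, 60), 3, 2)  # row of the reshaped matrix for site (3, 2)
print(site_row)

# Distance in memory between two consecutive time samples of the same site.
print(flat_offset(site_row, 2) - flat_offset(site_row, 1))
```

Consecutive time samples of one site are n_sites elements apart in this layout, which is worth keeping in mind when judging how cache-friendly a given arrangement really is.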
After some experiments, it is clear that the second option, with 3000 distinct files, is much slower than manipulating big arrays loaded in the workspace. But I didn't try loading all 3000 files into the workspace before computing (a tad too much).
It looks like reshaping the data helps a little bit.
Thanks to all contributors for your suggestions.

Which chunk size will yield the best performance using master-worker with MPI?

I'm using MPI to parallelize a program that tries to solve the metric TSP problem. I have P processors and N cities to visit.
Each thread asks the master for work and receives a chunk, which is a range of permutations that it should check, and it computes the minimum among them. I am optimizing this by pruning bad routes in advance.
There are (N-1)! routes in total to evaluate. Each worker gets a chunk described by a number that represents the first route it has to check and another for the last. In addition, the master sends it the most recent best result known, so it can easily prune bad routes in advance using a lower bound on their remainder.
Each time a worker finds a result that is better than the global one, it asynchronously sends it to all other workers and to the master.
I'm not looking for a better solution; I'm just trying to determine which chunk size is best.
The best chunk size I've found so far is (n!)/(n/2)!, but it doesn't yield such good results.
Please help me understand which chunk size is best here. I'm trying to balance the amount of computation against the communication.
Thanks
This depends heavily on factors beyond your control: MPI implementation, total load on the machine, etc. However, I'd hazard a guess that it also heavily depends on how many worker processes there are. On that note, understand that MPI spawns processes, not threads.
Ultimately, as is often the case with most optimization questions, the answer is simply "test a lot of different settings and see which one is best". You may want to do this manually, or write a tester app that implements some sort of heuristic (e.g. a genetic algorithm).
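A sweep over candidate chunk sizes could start from something like the Python sketch below, which splits the (N-1)! route indices into inclusive (first, last) ranges as described in the question. The value N = 8 and the candidate sizes are arbitrary choices for the example, and the timing itself is left out because it only makes sense on the real program and machine.

```python
from math import factorial


def chunk_ranges(total, chunk_size):
    """Yield inclusive (first, last) route indices covering 0 .. total - 1."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for first in range(0, total, chunk_size):
        yield first, min(first + chunk_size, total) - 1


N = 8                             # illustrative number of cities
total_routes = factorial(N - 1)   # (N-1)! routes

for chunk_size in (10, 100, 1000):  # candidate sizes to compare
    chunks = list(chunk_ranges(total_routes, chunk_size))
    # One chunk means one request/reply exchange with the master.
    print(chunk_size, len(chunks), chunks[-1])
```

Smaller chunks mean more exchanges with the master but finer load balancing and fresher pruning bounds; larger chunks mean the opposite, which is the trade-off the measurements would have to settle.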

Fast way to implement 2D convolution in C

I am trying to implement a vision algorithm that includes a prefiltering stage with a 9x9 Laplacian-of-Gaussian filter. Can you point to a document that briefly explains fast filter implementations? I think I should make use of the FFT for the most efficient filtering.
Are you sure you want to use FFT? That will be a whole-array transform, which will be expensive. If you've already decided on a 9x9 convolution filter, you don't need any FFT.
Generally, the cheapest way to do convolution in C is to set up a loop that moves a pointer over the array, summing the convolved values at each point and writing the data to a new array. This loop can then be parallelised using your favourite method (compiler vectorisation, MPI libraries, OpenMP, etc).
Regarding the boundaries:
If you assume the values to be 0 outside the boundaries, then add a 4 element border of 0 to your 2d array of points. This will avoid the need for `if` statements to handle the boundaries, which are expensive.
If your data wraps at the boundaries (i.e. it is periodic), then use a modulo or add a 4-element border which copies the opposite side of the grid (abcdefg -> fgabcdefgab for 2 points). **Note: this is what you are implicitly assuming with any kind of Fourier transform, including FFT**. If that is not the case, you would need to account for it before any FFT is done.
The 4 points are because the maximum boundary overlap of a 9x9 kernel is 4 points outside the main grid. Thus, n points of border are needed for a (2n+1) x (2n+1) kernel.
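The two border strategies above, together with the direct summation loop, can be sketched as follows. This is written in Python for brevity rather than C, the function names are hypothetical, and the code is a readable reference rather than a fast implementation; it assumes a square kernel with odd side length 2n+1.

```python
def pad(image, n, mode="zero"):
    """Add an n-element border: zeros, or a periodic copy of the opposite side."""
    if mode not in ("zero", "wrap"):
        raise ValueError("mode must be 'zero' or 'wrap'")
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(-n, rows + n):
        if mode == "wrap":
            src = image[r % rows]
            out.append([src[c % cols] for c in range(-n, cols + n)])
        elif 0 <= r < rows:
            out.append([0] * n + list(image[r]) + [0] * n)
        else:
            out.append([0] * (cols + 2 * n))
    return out


def convolve2d(image, kernel, mode="zero"):
    """Direct 2-D convolution; the output has the same shape as the input."""
    n = len(kernel) // 2
    size = 2 * n + 1
    padded = pad(image, n, mode)
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for kr in range(size):
                for kc in range(size):
                    # The kernel is flipped, which is what makes this a convolution
                    # rather than a correlation (no difference for a symmetric kernel).
                    acc += kernel[size - 1 - kr][size - 1 - kc] * padded[r + kr][c + kc]
            out[r][c] = acc
    return out


image = [[1, 2, 3], [4, 5, 6]]
shift_right = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]  # moves every value one column right

print(convolve2d(image, shift_right, "zero"))
print(convolve2d(image, shift_right, "wrap"))
print("".join(pad([list("abcdefg")], 2, "wrap")[2]))
```

Because the border is added once up front, the inner loops contain no boundary checks, which is the point made above about avoiding `if` statements.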
If you need this convolution to be really fast, and/or your grid is large, consider partitioning it into smaller pieces that can be held in the processor's cache, and thus calculated far more quickly. This also goes for any GPU-offloading you might want to do (they are ideal for this type of floating-point calculation).
Here is a theory link
http://hebb.mit.edu/courses/9.29/2002/readings/c13-1.pdf
And here is a link to fftw, which is a pretty good FFT library that I've used in the past (check licenses to make sure it is suitable) http://www.fftw.org/
All you do is FFT your image and kernel (the 9x9 matrix, zero-padded to the image size), multiply the two transforms together, then back-transform.
However, with a 9x9 matrix you may still be better off doing it in real coordinates (just with a double loop over the image pixels and the matrix). Try both ways!
Actually, you don't need to use an FFT size large enough to hold the entire image. You can do a lot of smaller overlapping 2D FFTs. You can search for "fast convolution", "overlap save" and "overlap add".
However, for a 9x9 kernel you may not see much of a speed advantage.

Resources