MPI - sending and receiving rows of matrix - c

I am making program, where will be 2-4 processes in MPI with C language. In elimination I am sending each row below actual row to different processes by cyclic mapping.
e.g. for matrix 5 * 6, when there is active row 0, and 4 processes, I am sending row 1 to process 1, row 2 to the process 2, row 3 to the process 3 and row 4 to the process 1 again. In theese processes I want to make some computation and return some values back to the process 0, which will add theese into its original matrix. My question is, which Send and Recv call should I use?
There is some theoretical code:
if(master){
for(...)
//sending values to other processes
}
for(...)
//recieving values from other processes
}
}else{//for other non-master processes
recieve value from master
doing some computing
sending value back to master
}
If I use only simple blocking send, I don't know what will happen on process 1, because it gets row 1 and 4, and before I will cal recv in master, my value in process 1 - row 1 will be overwritten by row 4, so I will miss one set of data, am I right? My question on therfore, what kind of MPI_Send should I use?

You might be able to make use of nonblocking if, say, proc 1 can start doing something with row 1 while waiting for row 4. But blocking should be fine to start with, too.
There is a lot of synchronization built into the algorithm. Everyone has to work based on the current row. So the recieving processes will need to know how much work to expect for each iteration of this procedure. That's easy to do if they know the total number of rows, and which iteration they're currently working on. So if they're expecting two rows, they could do two blocking recvs, or launch 2 nonblocking recvs, wait on one, and start processing the first row right away. But probably easiest to get blocking working first.
You may find it advantageous even at the start to have the master process doing isends so that all the sends can be "launched" simulataneously; then a waitall can process them in whatever order.
But better than this one-to-many communication would probably be to use a scatterv, which one could expect to proceed more efficiently.
Something that has been raised in many answers and comments to this series of questions but which I don't think you've ever actually addressed -- I really hope that you're just working through this project for educational purposes. It would be completely, totally, crazy to actually be implementing your own parallel linear algebra solvers when there are tuned and tested tools like Scalapack, PETSc, and Plapack out there free for the download.

Related

Column Wise Reading a Matrix from text file in C and storing them seprately

So I have 2 text files which arbitrarily contain an matrix whose size I don't know,I have a program running parallelly which computes the matrix multiplication of these two,both these programs(p1 and p2) will be running in a round robin fashion for some time quantum t, I will be using threads to parallelly read the files in p1,and have to pass these to P2 simultaneously, so i was thinking that i will be reading file 1 row wise and file 2 column wise and pass these to p2 so that whenever p1 gets preempted in by p2 ,p2 has something to work on rather than wait for turn of p1 again until it reads the whole matrix, since during the multiplication we need the rows from first matrix and columns from second one
while searching for ways to read the file column wise all solutions I found were to read the whole file simultaneously and parse it into columns or something like that.
what i want to know is how to read the columns of second file without reading the rest so that p2 gets the required data to start the multiplication without waiting for the whole matrix
any other way to do this without reading columns is also welcome
I know what you are trying to do with this question. You are besmirching the name of your institution and I am incredibly heartbroken and disappointed in the actions you are taking.
I wish you had read our honorable Professor's words on the difference between discussion and plagiarism. Will you be mentioning the entire internet as your collaborator in the report?
Indeed, does your group even know this reckless, shameful, inhuman, unfair, unprincipled and disrespectful action you are attempting to do?
I would personally, with a heavy heart, recommend you immediately look inside yourself, and find a way to answer this question that satisfies your inner soul, with honesty, decency, integrity, and honor.

Starvation of one of 2 streams in ConnectedStreams

Background
We have 2 streams, let's call them A and B.
They produce elements a and b respectively.
Stream A produces elements at a slow rate (one every minute).
Stream B receives a single element once every 2 weeks. It uses a flatMap function which receives this element and generates ~2 million b elements in a loop:
(Java)
for (BElement value : valuesList) {
out.collect(updatedTileMapVersion);
}
The valueList here contains ~2 million b elements
We connect those streams (A and B) using connect, key by some key and perform another flatMap on the connected stream:
streamA.connect(streamB).keyBy(AClass::someKey, BClass::someKey).flatMap(processConnectedStreams)
Each of the b elements has a different key, meaning there are ~2 million keys coming from the B stream.
The Problem
What we see is starvation. Even though there are a elements ready to be processed they are not processed in the processConnectedStreams.
Our tries to solve the issue
We tried to throttle stream B to 10 elements in a 1 second by performing a Thread.sleep() every 10 elements:
long totalSent = 0;
for (BElement value : valuesList) {
totalSent++;
out.collect(updatedTileMapVersion);
if (totalSent % 10 == 0) {
Thread.sleep(1000)
}
}
The processConnectedStreams is simulated to take 1 second with another Thread.sleep() and we have tried it with:
* Setting parallelism of 10 to all the pipeline - didn't work
* Setting parallelism of 15 to all the pipeline - did work
The question
We don't want to use all these resources since stream B is activated very rarely and for stream A elements having high parallelism is an overkill.
Is it possible to solve it without setting the parallelism to more than the number of b elements we send every second?
It would be useful if you shared the complete workflow topology. For example, you don't mention doing any keying or random partitioning of the data. If that's really the case, then Flink is going to pipeline multiple operations in one task, which can (depending on the topology) lead to the problem you're seeing.
If that's the case, then forcing partitioning prior to the processConnectedStreams can help, as then that operation will be reading from network buffers.

Flink app-level barriers

I'm trying to figure out what's the proper way to make kind of a barrier on merging multiple streams in Flink.
So let's say I have 4 keyed streams each calculating some aggregated statistics over batches of data. Next I want to combine results of these 4 streams into one stream (Y) and perform some additional computation on received 4 summaries.
The problem is how to make Y node wait until it received all the summaries with X=N before going forward with X=N+1.
In the picture node 3 sent its summary X=N later than node 4 sent its X=N+1
so node Y must wait until it has received node 3 summary while caching summaries with X=N+1 from other nodes somehow.
I couldn't find anything similar in documentation so I'd really appreciate any hints.
I figured out this task can be solved by simply doing the following:
.keyBy(X)
.countWindow(4)
.fold(...)

Rendering image using Multithread

I have a ray tracing algorithm, which works with only 1 thread and I am trying to make it work with any number of threads.
My question is, which way can I divide this task among threads.
At first my Instructor told me to just divide the width of the image, for example if I have an 8x8 image, and I want 2 threads to do the task, let thread 1 render 0 to 3 horizontal area ( of course all the way down vertically ) and thread 2 render 4 to 7 horizontal area.
I found this approach to work perfect when both my image length and number of threads are powers of 2, but I have no idea how can I deal with odd number of threads or any number of threads that cant divide width without a reminder.
My approach to this problem was to let threads render the image by alternating, for example if I have an 8x8 image, andlets say if I have 3 threads.
thread 1 renders pixels 0,3,6 in horizontal direction
thread 1 renders pixels 1,4,7 in horizontal direction
thread 1 renders pixels 2,5 in horizontal direction
Sorry that I cant provide all my code, since there are more than 5 files with few hundreds line of code in each one.
Here is the for loops that loop trough horizontal area, and the vertical loop is inside these but I am not going to provide it here.
My Instructor`s suggestion
for( int px=(threadNum*(width/nthreads)); px < ((threadNum+1)*(width/nthreads)); ++px )
threadNum is the current thread that I am on (meaning thread 0,1,2 and so on)
width is the width of the image
nthreads is the overall number of threads.
My solution to this problem
for( int px= threadNum; px< width; px+=nthreads )
I know my question is not so clear, and sorry but I cant provide the whole code here, but basically all I am asking is which way is the best way to divide the rendering of the image among given number of threads ( can be any positive number). Also I want threads to render the image by columns, meaning I cant touch the part of the code which handles vertical rendering.
Thank you, and sorry for chaotic question.
First thing, let me tell you that under the assumption that the rendering of each pixel is independent from the other pixels, your task is what in the HPC field is called an "embarassing parallel problem"; that is, a problem that can be efficiently divided between any number of thread (until each thread has a single "unit of work"), without any intercommunication between the processes (which is very good).
That said, it doesn't mean that any parallelization scheme is as good as any other. For your specific problem, I would say that the two main factors to keep in mind are load balancing and cache efficiency.
Load balancing means that you should divide the work among threads in a way that each thread has roughly the same amount of work: in this way you prevent one or more threads from waiting for that one last thread that has to finish it's last job.
E.g.
You have 5 threads and you split your image in 5 big chunks (let's say 5 horizontal strips, but they could be vertical and it wouldn't change the point). Being the problem embarassing parallel, you expect a 5x speedup, and instead you get a meager 1.2x.
The reason might be that your image has most of computationally expensive details in the lower part of the image (I know nothing of rendering, but I assume that a reflective object might take far more time to render than a flat empty space), because is composed by a set of polished metal marbles on the floor on an empty frame.
In this scenario, only one thread (the one with the bottom 1/5 of the image) does all the work anyway, while the other 4 remains idling after finishing their brief tasks.
As you can imagine, this isn't a good parallelization: keeping load balancing in mind alone, the best parallelization scheme would be to assign interleaved pixels to each core for them to process, under the (very reasonable) assumption that the complexity of the image would be averaged on each thread (true for natural images, might yield surprises in very very limited scenarios).
With this solution, your image is eavenly distributed among pixels (statistically) and the worst case scenario is N-1 threads waiting for a single thread to compute a single pixel (you wouldn't notice, performance-wise).
To do that you need to cycle over all pixels forgetting about lines, in this way (pseudo code, not tested):
for(i = thread_num; i < width * height; i+=thread_num)
The second factor, cache efficiency deals with the way computers are designed, specifically, the fact that they have many layers of cache to speed up computations and prevent the CPUs to starve (remain idle while waiting for data), and accessing data in the "right way" can speed up computations considerably.
It's a very complex topic, but in your case, a rule of thumb might be "feeding to each thread the right amount of memory will improve the computation" (emphasys on "right amount" intended...).
It means that, even if passing to each thread interleaved pixels is probably the perfect balancing, it's probably the worst possible memory access pattern you could devise, and you should pass "bigger chunks" to them, because this would keep the CPU busy (note: memory aligment comes also heavily into play: if your image has padding after each line keep them multiples of, say, 32 bytes, like some image formats, you should keep it into consideration!!)
Without expanding an already verbose answer to alarming sizes, this is what I would do (I'm assuming the memory of the image is consecutive, without padding between lines!):
create a program that splits the image into N consecutive pixels (use a preprocessor constant or a command argument for N, so you can change it!) for each of M threads, like this:
1111111122222222333333334444444411111111
do some profiling for various values of N, stepping from 1 to, let's say, 2048, by powers of two (good values to test might be: 1 to get a base line, 32, 64, 128, 256, 512, 1024, 2048)
find out where the perfect balance is between perfect load balancing (N=1), and best caching (N <= the biggest cache line in your system)
a try the program on more than one system, and keep the smalles value of N that gives you the best test results among the machines, in order to make your code run fast everywhere (as the caching details vary among systems).
b If you really really want to squeeze every cycle out of every system you install your code on, forget step 4a, and create a code that automatically finds out the best value of N by rendering a small test image before tackling the appointed task :)
fool around with SIMD instructions (just kidding... sort of :) )
A bit theoretical (and overly long...), but still I hope it helps!
An alternating division of the columns will probably lead to a suboptimal cache usage. The threads should operate on a larger continuous range of data. By the way, if your image is stored row-wise it would also be better to distribute the rows instead of the columns.
This is one way to divide the data equally with any number of threads:
#define min(x,y) (x<y?x:y)
/*...*/
int q = width / nthreads;
int r = width % nthreads;
int w = q + (threadNum < r);
int start = threadNum*q + min(threadNum,r);
for( int px = start; px < start + w; px++ )
/*...*/
The remainder r is distributed over the first r threads. This is important when calculating the start index for a thread.
For the 8x8 image this would lead to:
thread 0 renders columns 0-2
thread 1 renders columns 3-5
thread 2 renders columns 6-7

Efficient data structure and strategy for synchronizing several item collections

I want one primary collection of items of a single type that modifications are made to over time. Periodically, several slave collections are going to synchronize with the primary collection. The primary collection should send a delta of items to the slave collections.
Primary Collection: A, C, D
Slave Collection 1: A, C (add D)
Slave Collection 2: A, B (add C, D; remove B)
The slave collections cannot add or remove items on their own, and they may exist in a different process, so I'm probably going to use pipes to push the data.
I don't want to push more data than necessary since the collection may become quite large.
What kind of data structures and strategies would be ideal for this?
For that I use differential execution.
(BTW, the word "slave" is uncomfortable for some people, with reason.)
For each remote site, there is a sequential file at the primary site representing what exists on the remote site.
There is a procedure at the primary site that walks through the primary collection, and as it walks it reads the corresponding file, detecting differences between what currently exists on the remote site and what should exist.
Those differences produce deltas, which are transmitted to the remote site.
At the same time, the procedure writes a new file representing what will exist at the remote site after the deltas are processed.
The advantage of this is it does not depend on detecting change events in the primary collection, because often those change events are unreliable or can be self-cancelling or made irrelevant by other changes, so you cut way down on needless transmissions to the remote site.
In the case that the collections are simple lists of things, this boils down to having local copies of the remote collections and running a diff algorithm to get the delta.
Here are a couple such algorithms:
If the collections can be sorted (like your A,B,C example), just run a merge loop:
while(ix<nx && iy<ny){
if (X[ix] < Y[iy]){
// X[ix] was inserted in X
ix++;
} else if (Y[iy] < X[ix]){
// Y[iy] was deleted from X
iy++;
} else {
// the two elements are equal. skip them both;
ix++; iy++;
}
}
while(ix<nx){
// X[ix] was inserted in X
ix++;
}
while(iy<ny>){
// Y[iy] was deleted from X
iy++;
}
If the collections cannot be sorted (note relationship to Levenshtein distance),
Until we have read through both collections X and Y,
See if the current items are equal
else see if a single item was inserted in X
else see if a single item was deleted from X
else see if 2 items were inserted in X
else see if a single item was replaced in X
else see if 2 items were deleted from X
else see if 3 items were inserted in X
else see if 2 items in X replaced 1 items in Y
else see if 1 items in X replaced 2 items in Y
else see if 3 items were deleted from X
etc. etc. up to some limit
Performance is generally not an issue, because the procedure does not have to be run at high frequency.
There's a crude video demonstrating this concept, and source code where it is used for dynamically changing user interfaces.
If one doesn't push all data, sort of a log is required, which, instead of using pipe bandwidth, uses main memory. The parameter to find a good balance between CPU & memory usage would be the 'push' frequency.
From your question, I assume, you have more than one slave process. In this case, some shared memory or CMA (Linux) approach with double buffering in the master process should outperform multiple pipes by far, as it doesn't even require multithreaded pushing, which would be used to optimize the overall pipe throughput during synchronization.
The slave processes could be notified using a global synchronization barrier for reading from masterCollectionA without copying, while master modifies masterCollectionB (which is initialized with a copy from masterCollectionA) and vice versa. Access to a collection should be interlocked between slaves and master. The slaves could copy that collection (snapshot), if they would block it past the next update attempt from master, thus, allowing it to continue. Modifications in slave processes could be implemented with a copy on write strategy for single elements. This cooperative approach is rather simple to implement and in case the slave processes don't copy whole snapshots everytime, the overall memory consumption is low.

Resources