C - MPI: Parallel Processing of Column Arrays

I have a 10x10 matrix c (M = 10) that I divide by rows among 5 worker processes (slaves = 5), so that each process handles 2 rows of the matrix:
offset = 0;
rows = (M / slaves);
MPI_Send(&c[offset][0], rows*M, MPI_DOUBLE, id_slave,0,MPI_COMM_WORLD);
offset= offset+rows;
Now I want to divide the matrix by columns instead. I tried the following, changing the array indices, but it does not work:
MPI_Send(&c[0][offset], rows*M, MPI_DOUBLE, id_slave,0,MPI_COMM_WORLD);
Do you know how to do it? Thank you.

You are using the wrong datatype. As noted by Jonathan Dursi, you need to create a strided datatype that tells MPI how to access the memory in such a way that it matches the data layout of a column or a set of consecutive columns.
In your case, instead of
MPI_Send(&c[0][offset], rows*M, MPI_DOUBLE, id_slave, 0, MPI_COMM_WORLD);
you have to do:
MPI_Datatype dt_columns;
MPI_Type_vector(M, rows, M, MPI_DOUBLE, &dt_columns);
MPI_Type_commit(&dt_columns);
MPI_Send(&c[0][offset], 1, dt_columns, id_slave, 0, MPI_COMM_WORLD);
MPI_Type_vector(M, rows, M, MPI_DOUBLE, &dt_columns) creates a new MPI datatype that consists of M blocks of rows elements of MPI_DOUBLE each with the heads of the consecutive blocks M elements apart (stride M). Something like this:
|<------------ stride = M ------------->|
|<---- rows --->|                       |
+---+---+---+---+---+---+---+---+---+---+--
| x | x | x | x |   |   |   |   |   |   | ^
+---+---+---+---+---+---+---+---+---+---+ |
| x | x | x | x |   |   |   |   |   |   | |
+---+---+---+---+---+---+---+---+---+---+
  .   .   .   .   .   .   .   .   .   .   M blocks
+---+---+---+---+---+---+---+---+---+---+
| x | x | x | x |   |   |   |   |   |   | |
+---+---+---+---+---+---+---+---+---+---+ |
| x | x | x | x |   |   |   |   |   |   | v
+---+---+---+---+---+---+---+---+---+---+--
 >> ------ C stores such arrays row-wise ------ >>
If you set rows equal to 1, then you create a type that corresponds to a single column. This type cannot be used to send multiple columns though, e.g. two columns, because MPI will look for the second column right where the first one ends, which is at the bottom of the matrix. You have to tell MPI to pretend that a column is just one element wide, i.e. to resize the datatype. This can be done using MPI_Type_create_resized:
MPI_Datatype dt_temp, dt_column;
MPI_Type_vector(M, 1, M, MPI_DOUBLE, &dt_temp);
MPI_Type_create_resized(dt_temp, 0, sizeof(double), &dt_column);
MPI_Type_commit(&dt_column);
You can use this type to send as many columns as you like:
// Send one column
MPI_Send(&c[0][offset], 1, dt_column, id_slave, 0, MPI_COMM_WORLD);
// Send five columns
MPI_Send(&c[0][offset], 5, dt_column, id_slave, 0, MPI_COMM_WORLD);
You can also use dt_column in MPI_Scatter[v] and/or MPI_Gather[v] to scatter and/or gather entire columns.
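For instance, a minimal MPI_Scatterv sketch using dt_column might look like the following. This is only an illustration, not code from the question: it assumes double c[M][M] lives on rank 0, that M is divisible by the number of processes, and that rank 0 keeps a share of the columns itself rather than acting as a pure master:
int nprocs, rank;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int cols_per_proc = M / nprocs;          /* assumes M % nprocs == 0 */
int sendcounts[nprocs], displs[nprocs];
for (int i = 0; i < nprocs; i++) {
    sendcounts[i] = cols_per_proc;       /* number of dt_column elements per rank */
    displs[i]     = i * cols_per_proc;   /* in units of the resized extent, i.e. doubles */
}

/* each column arrives contiguously: local_cols[j][i] == c[i][first_col + j] */
double local_cols[cols_per_proc][M];
MPI_Scatterv(&c[0][0], sendcounts, displs, dt_column,
             &local_cols[0][0], cols_per_proc * M, MPI_DOUBLE,
             0, MPI_COMM_WORLD);
Each rank ends up with its columns stored back to back, so the received block is effectively the transpose of its part of c.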

The problem with your code is the following:
your c array is contiguous in memory, and in C it is stored in row-major order, so dividing it by rows as you do amounts to just adding a constant offset from the beginning of the array.
The way you try to divide it by columns only changes that offset, which gives you the wrong elements.
You can imagine it for 3x3 matrix and 3 slave processes:
a[3][3] = {{a00, a01, a02},
           {a10, a11, a12},
           {a20, a21, a22}}
which in memory actually looks like this:
A = {a00,a01,a02,a10,a11,a12,a20,a21,a22}
For example, suppose we want to send data to the process with id = 1. In this case a[1][0] points to the fourth element of A, while a[0][1] points to the second element of A. In both cases you then send rows*M elements starting from that point in A.
In the first case you would send:
a10,a11,a12
And in the second case:
a01,a02,a10
One way to achieve what you want is to transpose your matrix and then send it.
It is also more natural to use MPI_Scatter than MPI_Send for this problem,
something like what is explained here: scatter
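To illustrate that suggestion, here is a rough sketch (my own, under the assumption that M is divisible by the number of processes and that the root keeps a share as well): the root transposes c into ct, so the columns of c become contiguous rows of ct, and a single MPI_Scatter then distributes them.
int nprocs, rank;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

double ct[M][M];                     /* ct[j][i] = c[i][j], only significant on rank 0 */
if (rank == 0)
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            ct[j][i] = c[i][j];

int cols_per_proc = M / nprocs;      /* columns of c == rows of ct per process */
double my_cols[cols_per_proc][M];
MPI_Scatter(&ct[0][0], cols_per_proc * M, MPI_DOUBLE,
            &my_cols[0][0], cols_per_proc * M, MPI_DOUBLE,
            0, MPI_COMM_WORLD);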

Related

Comparing rows in spark dataframe to obtain a new column

I'm a beginner in Spark and I'm dealing with a large dataset (over 1.5 million rows and 2 columns). I have to evaluate the cosine similarity of the field "features" between each pair of rows. The main problem is iterating over the rows and finding an efficient and fast method. I will have to use this method on another dataset of 42.5 million rows, and it would be a big computational problem if I can't find the most efficient way of doing it.
| post_id | features |
| -------- | -------- |
| Bkeur23 |[cat,dog,person] |
| Ksur312kd |[wine,snow,police] |
| BkGrtTeu3 |[] |
| Fwd2kd |[person,snow,cat] |
I've created an algorithm that evaluates the cosine similarity between the elements of the i-th and j-th rows; I've tried using lists and also creating a Spark DF / RDD for each result and merging them with the "union" function.
The function I use to evaluate the cosine similarity is the following. It takes 2 lists as input (the lists of the i-th and j-th rows) and returns the maximum cosine similarity between each pair of elements in the lists. But this is not the problem.
def cosineSim(lista1, lista2, embed):
    #embed = hub.KerasLayer(os.getcwd())
    eps = sys.float_info.epsilon
    if (lista1 is not None) and (lista2 is not None):
        if (len(lista1) > 0) and (len(lista2) > 0):
            risultati = {}
            for a in lista1:
                tem = a
                x = tf.constant([tem])
                embeddings = embed(x)
                x = np.asarray(embeddings)
                x1 = x[0].tolist()
                for b in lista2:
                    tem = b
                    x = tf.constant([tem])
                    embeddings = embed(x)
                    x = np.asarray(embeddings)
                    x2 = x[0].tolist()
                    sum = 0
                    suma1 = 0
                    sumb1 = 0
                    for i, j in zip(x1, x2):
                        suma1 += i * i
                        sumb1 += j * j
                        sum += i * j
                    cosine_sim = sum / ((sqrt(suma1)) * (sqrt(sumb1)) + eps)
                    risultati[a + '-' + b] = cosine_sim
                    cosine_sim = 0
            risultati = max(risultati.values())
            return risultati
The function I'm using to iterate over the rows is the following one:
def iterazione(df, numero, embed):
    a = 1
    k = 1
    emp_RDD = spark.sparkContext.emptyRDD()
    columns1 = StructType([StructField('Source', StringType(), False),
                           StructField('Destination', StringType(), False),
                           StructField('CosinSim', FloatType(), False)])
    first_df = spark.createDataFrame(data=emp_RDD,
                                     schema=columns1)
    for i in df:
        for j in islice(df, a, None):
            r = cosineSim(i[1], j[1], embed)
            if r > 0.45:
                z = spark.createDataFrame(data=[(i[0], j[0], r)], schema=columns1)
                first_df = first_df.union(z)
            k = k + 1
            if k == numero:
                k = a + 1
                a = a + 1
    return first_df
The output I desire is something like this:
| Source | Dest | CosinSim |
| -------- | ---- | ------ |
| Bkeur23 | Ksur312kd | 0.93 |
| Bkeur23 | Fwd2kd | 0.673 |
| Ksur312kd | Fwd2kd | 0.76 |
But there is a problem in my "iterazione" function.
I'm asking for help finding the best way to iterate over all these rows. I was also thinking about copying the column "features" as "features2" and applying my function using withColumn, but I don't know how to do that or whether it would work. I'd like to know if there is some way to do this directly on a Spark dataframe, avoiding the creation of other datasets and merging them later, or if you know a faster and more efficient method. Thank you!

Transpose matrix with MPI_Alltoallv

I'd like to transpose a matrix b using MPI_Alltoallv and store it in bt.
Each process contains nlocal rows of b. For example:
Proc0: 0 | 10 | 20 | 30
Proc0: 1 | 11 | 21 | 31
Proc1: 100 | 110 | 120 | 130
Proc1: 101 | 111 | 121 | 131
I'd like bt being like:
Proc0: 0 | 1 | 100 | 101
Proc0: 10 | 11 | 110 | 111
Proc1: 20 | 21 | 120 | 121
Proc1: 30 | 31 | 130 | 131
The submatrices are stored in 2d arrays (b[0] contains the first row and b[1] the second row).
As suggested, I used MPI_Alltoallv(). Here is what I did:
int n = 4;               // matrix of size 4x4
int nlocal = n / nbProc; // the submatrices are of size 2x4
int sc[nbProc];
int dis[nbProc];
int rdis[nbProc];
for (int i = 0; i < nbProc; i++) {
    sc[i] = 1;
}
for (int i = 0; i < nlocal; i++) {       // loop on the b[i]
    for (int j = 0; j < nlocal; j++) {   // loop on the bt[j]
        for (int k = 0; k < nbProc; k++) {
            dis[k]  = j + k * nlocal;
            rdis[k] = i + k * nlocal;
        }
        MPI_Alltoallv(b[i], sc, dis, MPI_INT, bt[j], sc, rdis, MPI_INT, MPI_COMM_WORLD);
    }
}
However, I have 3 nested loops, and I thought I might need fewer. Is there anything wrong?
I found a solution that calls MPI_Alltoallv just once. All the submatrices have to be stored in one vector. But instead of storing the first row in the first n values, the second row in the next n values, and so on, we store first the m values that will go to the first process, then the p values that will go to the second process, etc. In the example I gave, we would have buf = {0,10,1,11,20,30,21,31} for the first process. Then we use MPI_Alltoallv to send the values into a vector (called buft, for example), and we then distribute the values into the matrix.
Note that there are two ways of storing the values in buf. The second one is buf = {0,1,10,11,20,21,30,31}, which is the one I used since I thought it was easier to implement, but the idea remains the same.
Also, this works even if the processes don't have submatrices of the same size (for example, if our Proc1 had only 1 row in b).
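For what it's worth, here is a rough sketch of that packing/unpacking around a single MPI_Alltoallv call, using the second ordering (buf = {0,1,10,11,20,21,30,31} on Proc0). It is only an illustration: it assumes the simple case of equal, square nlocal x nlocal blocks per destination, with b and bt declared as in the question:
int buf[nlocal * n], buft[nlocal * n];
int counts[nbProc], displs[nbProc];

/* pack: the block destined for process k holds the columns
   k*nlocal .. k*nlocal+nlocal-1 of my rows of b */
for (int k = 0; k < nbProc; k++)
    for (int j = 0; j < nlocal; j++)          /* column inside the block */
        for (int i = 0; i < nlocal; i++)      /* local row of b          */
            buf[k*nlocal*nlocal + j*nlocal + i] = b[i][k*nlocal + j];

for (int k = 0; k < nbProc; k++) {
    counts[k] = nlocal * nlocal;              /* same amount to/from every process */
    displs[k] = k * nlocal * nlocal;
}

MPI_Alltoallv(buf,  counts, displs, MPI_INT,
              buft, counts, displs, MPI_INT, MPI_COMM_WORLD);

/* unpack: the block received from process k fills the columns
   k*nlocal .. k*nlocal+nlocal-1 of my rows of bt */
for (int k = 0; k < nbProc; k++)
    for (int i = 0; i < nlocal; i++)          /* local row of bt         */
        for (int j = 0; j < nlocal; j++)      /* column inside the block */
            bt[i][k*nlocal + j] = buft[k*nlocal*nlocal + i*nlocal + j];
Since all the counts are equal in this simple case, a plain MPI_Alltoall with count nlocal*nlocal would work just as well; MPI_Alltoallv only becomes necessary when the submatrices differ in size, as mentioned above.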

make x in a cell equal 8 and total

I need an Excel formula that will look at a cell and, if it contains an x, treat it as an 8 and add it to the total at the bottom of the table. I have done this in the past, but I am so rusty that I cannot remember how I did it.
Generally, I try and break this sort of problem into steps. In this case, that'd be:
Determine if a cell is 'x' or not, and create new value accordingly.
Add up the new values.
If your values are in column A (for example), in column B, fill in:
=if(A1="x", 8, 0) (or in R1C1 mode, =if(RC[-1]="x", 8, 0)).
Then just sum those values (eg sum(B1:B3)) for your total.
    A     |    B
+---------+---------+
|  VALUES |  TEMP   |
+---------+---------+
|  0      |  0 <------ '=if(A1="x", 8, 0)'
|  x      |  8      |
|  fish   |  0      |
+---------+---------+
|  TOTAL  |  8 <------ '=sum(B1:B3)'
+---------+---------+
If you want to be tidy, you could also hide the column with your intermediate values in.
(I should add that the way your question is worded, it almost sounds like you want to 'push' a value into the total; as far as I've ever known, you can really only 'pull' values into a total.)
Try this one for total sum:
=SUMIF(<range you want to sum>, "<>" & <x>, <range you want to sum>)+ <x> * COUNTIF(<range you want to sum>, <x>)
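For example, reading <x> as the text to match in the criteria and as its numeric replacement (8) in the multiplication, and with the values in A1:A3 (the cell references are only an illustration), that template works out to something like:
=SUMIF(A1:A3, "<>x", A1:A3) + 8 * COUNTIF(A1:A3, "x")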

Find product of integers at interval of X and update value at position 'i' in an array for N queries

I have been given an array of integers of length up to 10^5 and I want to perform the following operations on it:
1 -> Update the value of the array at any position i. (1 <= i <= n)
2 -> Get the product of the numbers at indexes 0, X, 2X, 3X, 4X, ... (J * X <= n)
The number of operations will be up to 10^5.
Is there a log n approach to answer the queries and update the values?
(The original thought was to use a Segment Tree, but I think it is not needed...)
Let N = 10^5 and A := the original array of size N.
We use 0-based indexing below.
Make a new array B of integers of length up to M = N lg N:
The first integer is equal to A[0];
The next N integers are those at indexes 1, 2, 3, ..., N of A; I call this group 1
The next N/2 integers are those at indexes 2, 4, 6, ...; I call this group 2
The next N/3 integers are those at indexes 3, 6, 9, ...; I call this group 3
Here is an example of visualized B:
B = [A[0] | A[1], A[2], A[3], A[4] | A[2], A[4] | A[3] | A[4]]
I think the original thought can be used without even a Segment Tree.
(It is overkill: for operation 2 we always query a specific, fixed range of B instead of an arbitrary range, i.e. we do not need that much flexibility or the complexity of maintaining the data structure.)
You can create the new array B described above, and also another array C with one entry per group, C[i] := product of the elements of group i.
For operation 1, simply spend O(# of factors of i) to see which group(s) you need to update, and update the values in both B and C (i.e. C[x] = C[x] / old B[y] * new B[y]).
For operation 2, just output the corresponding C[X].
I'm not sure if I'm wrong, but this should be even faster and should pass the judge, assuming the original idea is correct but got TLE.
As the OP has added a new condition (for operation 2 we need to multiply by A[0] as well), we can handle it specially. Here is my thought:
Just declare a new variable z = A[0]; for operation 1, if it updates index 0, update this variable; for operation 2, query using the same method as above and multiply by z afterwards.
I have updated my answer, so now I simply use the first element of B to represent A[0].
Example
A = {1,4,6,2,8,7}
B = {1 | 4,6,2,8,7 | 6,8 | 2 | 8 | 7 } // O(N lg N)
C = {1 | 2688 | 48 | 2 | 8 | 7 } // O(N lg N)
precompute the factorization of all possible indexes X (X is an index, so X <= N) // O(N*sqrt(N))
operation 1:
update A[4] to 5: factors of 4 = 1,2,4 // number of factors of the index, ~ O(sqrt(N))
which means update groups 1,2,4, i.e. the corresponding elements in B & C
(locating the corresponding elements in B & C may be a bit tricky,
but that should not increase the complexity)
B = {1 | 4,6,2,5,7 | 6,5 | 2 | 5 | 7 } // O(sqrt(N))
C = {1 | 2688/8*5 | 48/8*5 | 2 | 8/8*5 | 7 } // O(sqrt(N))
update A[0] to 2:
B = {2 | 4,6,2,5,7 | 6,5 | 2 | 5 | 7 } // O(1)
C = {2 | 2688/8*5 | 48/8*5 | 2 | 8/8*5 | 7 } // O(1)
// Now A is actually {2,4,6,2,5,7}
operation 2:
X = 3
C[3] * C[0] = 2*2 = 4 // O(1)
X = 2
C[2] * C[0] = 30*2 = 60 // O(1)
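The answer gives no code, but a minimal C sketch of the scheme could look like the following. It is my own illustration: it keeps A around instead of materializing the flattened B (A already supplies the old value needed for the divide-and-multiply update), and it assumes the array values stay nonzero so the division is safe.
#include <stdio.h>

#define N 6
long long A[N] = {1, 4, 6, 2, 8, 7};
long long C[N];   /* C[0] = A[0]; for x >= 1, C[x] = product of A[x], A[2x], A[3x], ... */

void build(void) {
    C[0] = A[0];
    for (int x = 1; x < N; x++) {
        C[x] = 1;
        for (int j = x; j < N; j += x)       /* indexes x, 2x, 3x, ... */
            C[x] *= A[j];
    }
}

/* operation 1: set A[i] = v, updating every group that contains index i */
void update(int i, long long v) {
    if (i == 0) { C[0] = v; A[0] = v; return; }
    for (int x = 1; x * x <= i; x++) {       /* enumerate the factors of i in O(sqrt(i)) */
        if (i % x == 0) {
            C[x] = C[x] / A[i] * v;
            if (x != i / x)
                C[i / x] = C[i / x] / A[i] * v;
        }
    }
    A[i] = v;
}

/* operation 2: product of A[0], A[X], A[2X], ... */
long long query(int X) {
    return C[0] * C[X];
}

int main(void) {
    build();                       /* C = {1, 2688, 48, 2, 8, 7} */
    update(4, 5);                  /* A becomes {1,4,6,2,5,7}    */
    update(0, 2);                  /* A becomes {2,4,6,2,5,7}    */
    printf("%lld\n", query(3));    /* 2 * 2  = 4                 */
    printf("%lld\n", query(2));    /* 2 * 30 = 60                */
    return 0;
}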

Halo exchange not working properly in MPI

I'm writing some code which does calculations on a large 3D grid, and uses a halo exchange procedure so that it can work using MPI. I'm getting the wrong results from my code, and I'm pretty sure it's because of the halo exchange not working properly.
Basically I have a large 3D array, a chunk of which is held on each process. Each process has an array which is 2 elements bigger in each dimension than the chunk of data it is holding - so that we can halo exchange into each face of the array without affecting the data stored in the rest of the array. I have the following code to do the halo exchange communication:
MPI_Type_vector(g->ny, g->nx, g->nx, MPI_DOUBLE, &face1);
MPI_Type_commit(&face1);
MPI_Type_vector(2*g->ny, 1, g->nx, MPI_DOUBLE, &face2);
MPI_Type_commit(&face2);
MPI_Type_vector(g->nz, g->nx, g->nx * g->ny, MPI_DOUBLE, &face3);
MPI_Type_commit(&face3);
/* Send to WEST receive from EAST */
MPI_Sendrecv(&(g->data)[current][0][0][0], 1, face1, g->west, tag,
             &(g->data)[current][0][0][0], 1, face1, g->east, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to EAST receive from WEST */
MPI_Sendrecv(&(g->data)[current][g->nz-1][0][0], 1, face1, g->east, tag,
             &(g->data)[current][g->nz-1][0][0], 1, face1, g->west, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to NORTH receive from SOUTH */
MPI_Sendrecv(&(g->data)[current][0][0][0], 1, face2, g->north, tag,
             &(g->data)[current][0][0][0], 1, face2, g->south, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to SOUTH receive from NORTH */
MPI_Sendrecv(&(g->data)[current][0][g->ny-1][0], 1, face2, g->south, tag,
             &(g->data)[current][0][0][0], 1, face2, g->north, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to UP receive from DOWN */
MPI_Sendrecv(&(g->data)[current][0][0][0], 1, face3, g->up, tag,
             &(g->data)[current][0][0][0], 1, face3, g->down, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to DOWN receive from UP */
MPI_Sendrecv(&(g->data)[current][0][0][g->nx-1], 1, face3, g->down, tag,
             &(g->data)[current][0][0][g->nx-1], 1, face3, g->up, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
g->nx, g->ny and g->nz are the sizes of the array chunk that this process is holding, and g->west, g->east, g->north, g->south, g->up and g->down are the ranks of the adjacent processes in each direction, found using the following code:
/* Who are my neighbours in each direction? */
MPI_Cart_shift( cart_comm, 2, 1, &g->north, &g->south);
MPI_Cart_shift( cart_comm, 1, 1, &g->west, &g->east);
MPI_Cart_shift( cart_comm, 0, 1, &g->up, &g->down);
The array on each process is defined as:
array[2][g->nz][g->ny][g->nx]
(It has two copies because I need one to update into each time through my update routine, once I've done the halo exchange).
Can anyone tell me if I'm doing the communication correctly? Particularly the defining of the vector types. Will the vector types I've defined in the code extract each face of a 3D array? And do the MPI_Sendrecv calls look right?
I'm completely lost as to why my code isn't working, but I'm pretty sure it's communications related.
So I'm a big fan of using MPI_Type_create_subarray for pulling out slices of arrays; it's easier to keep straight than vector types. In general, you can't use a single vector type to describe multi-d guardcells (because there are multiple strides, you need to create vectors of vectors), but I think because you're only using 1 guardcell in each direction here that you're ok.
So let's consider the x-face GC; here you're sending an entire y-z plane to your x-neighbour. In memory, this looks like this given your array layout:
 +---------+
 |        #|
 |        #|
 |        #|
 |        #| z=2
 |        #|
 +---------+
 |        #|
 |        #|
 |        #| z=1
 |        #|
 |        #|
 +---------+
 |        #|
^|        #|
||        #| z=0
y|        #|
 |        #|
 +---------+
  x->
so you're looking to send count=(ny*nz) blocks of 1 value, each strided by nx. I'm assuming here nx, ny, and nz include guardcells, and that you're sending the corner values. If you're not sending corner values, subarray is the way to go. I'm also assuming, crucially, that g->data is a contiguous block of nx*ny*nz*2 (or 2 contiguous blocks of nx*ny*nz) doubles, otherwise all is lost.
So your type create should look like
MPI_Type_vector((g->ny*g->nz), 1, g->nx, MPI_DOUBLE, &face1);
MPI_Type_commit(&face1);
Note that we are sending a total of count*blocksize = ny*nz values, which is right, and we are striding over count*stride = nx*ny*nz memory in the process, which is also right.
Ok, so the y face looks like this:
 +---------+
 |#########|
 |         |
 |         |
 |         | z=2
 |         |
 +---------+
 |#########|
 |         |
 |         | z=1
 |         |
 |         |
 +---------+
 |#########|
^|         |
||         | z=0
y|         |
 |         |
 +---------+
  x->
So you have nz blocks of nx values, each separated by stride nx*ny. So your type create should look like
MPI_Type_vector(g->nz, g->nx, (g->nx)*(g->ny), MPI_DOUBLE, &face2);
MPI_Type_commit(&face2);
And again double-checking, you're sending count*blocksize = nz*nx values, striding count*stride = nx*ny*nz memory. Check.
Finally, sending z-face data involves sending an entire x-y plane:
 +---------+
 |#########|
 |#########|
 |#########| z=2
 |#########|
 |#########|
 +---------+
 |         |
 |         |
 |         | z=1
 |         |
 |         |
 +---------+
 |         |
^|         |
||         | z=0
y|         |
 |         |
 +---------+
  x->
MPI_Type_vector(1, (g->nx)*(g->ny), 1, MPI_DOUBLE, &face3);
MPI_Type_commit(&face3);
And again double-checking, you're sending count*blocksize = nx*ny values, striding count*stride = nx*ny memory. Check.
Update:
I didn't take a look at your Sendrecvs, but there might be something there, too. Notice that you have to use a pointer to the first piece of data you're sending with a vector data type.
First off, if you have array size nx in the x direction, and you have two guardcells (one on either side), your left guardcell is 0, right is nx-1, and your 'real' data extends from 1..nx-2. So to send your westmost data to your west neighbour, and to receive into your eastmost guardcell from your east neighbour, you would want
/* Send to WEST receive from EAST */
MPI_Sendrecv(&(g->data)[current][0][0][g->nx-2], 1, face1, g->west, westtag,
             &(g->data)[current][0][0][0], 1, face1, g->east, westtag,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to EAST receive from WEST */
MPI_Sendrecv(&(g->data)[current][0][0][1], 1, face1, g->east, easttag,
             &(g->data)[current][0][0][g->nx-1], 1, face1, g->west, easttag,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
(I like to use different tags for each stage of communication, helps keep things sorted.)
Likewise for the other directions.
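For completeness, the same pattern for the y direction might look like the sketch below. This is my own illustration: face2 here means the corrected y-face type from above, northtag/southtag are arbitrary tag values, and whether "north" sits on the low-y or the high-y side depends on your MPI_Cart_shift convention, so you may need to swap the plane indices:
/* Send to NORTH receive from SOUTH */
MPI_Sendrecv(&(g->data)[current][0][g->ny-2][0], 1, face2, g->north, northtag,
             &(g->data)[current][0][0][0],       1, face2, g->south, northtag,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* Send to SOUTH receive from NORTH */
MPI_Sendrecv(&(g->data)[current][0][1][0],       1, face2, g->south, southtag,
             &(g->data)[current][0][g->ny-1][0], 1, face2, g->north, southtag,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);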
