I am trying to think of an approach to multiply a matrix and a vector using collective communication in MPI. The product of an M x N matrix and an N x 1 vector is an M x 1 vector. To parallelize, we could run M+1 processes, where each of the M worker processes computes one element of the resulting vector. I am trying to come up with an approach that uses a collective communication call such as MPI_Reduce, but I am not sure what an ideal approach would look like.
My Thoughts:
Consider one row of the matrix, x1, x2, ..., xn, and the vector [v1, v2, ..., vn]. The first element of the resulting vector is the single value x1*v1 + x2*v2 + ... + xn*vn. The only way I see to use a reduction is to have n processes each calculate one product xi*vi and then use MPI_Reduce to sum them; the parent process would use the reduction to compute each row's value. This sounds like it would work, but I am not sure whether there is a better way.
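A plain-Python sketch of that decomposition (no MPI here; the `partial` list stands in for the per-rank values that MPI_Reduce with MPI_SUM would combine):

```python
# Sketch of the reduction pattern for one row, without MPI.
# Each of the n "ranks" would hold one product x_i * v_i locally;
# MPI_Reduce with MPI_SUM collapses those into the row's result.

def row_times_vector(row, v):
    # partial[i] is what rank i would compute locally
    partial = [x * y for x, y in zip(row, v)]
    # summing the partials is exactly what MPI_Reduce(..., MPI_SUM) does
    return sum(partial)

M = [[1, 2, 3],
     [4, 5, 6]]
v = [1, 0, 2]

result = [row_times_vector(row, v) for row in M]
print(result)  # [7, 16]
```

In real MPI code one would more commonly scatter whole rows (MPI_Scatter), let each rank compute a full dot product, and gather the results (MPI_Gather), since one rank per scalar product usually means far more communication than computation.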
Related
I have 2 sparse matrix files in matrix-market format:
row col val
1 1 3.0
1 2 1.0
2 3 2.0
etc...
Currently, I have split the files into 6 arrays:
row_A[], col_A[], val_A[], row_B[] …
which contain the row indices, column indices, and values, respectively.
I would like to multiply the two matrices easily without having to convert them to dense matrix format first. Is there an algorithm for doing this?
I found this pseudocode on Quora, but I'm unsure whether it's the best implementation or how it would be implemented in C: https://www.quora.com/What-is-the-C-program-for-the-multiplication-of-two-sparse-matrices
multiply(A,B):
    for r in A.rows:
        for c in A.rows[r]:
            for k in B.rows[c]:
                C[r,k] += A[r,c]*B[c,k]
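One way to turn that pseudocode into real code is with row-indexed maps. Here is a sketch in Python (dict-of-dicts standing in for the C arrays, 1-based indices as in the files) that implements the same triple loop:

```python
from collections import defaultdict

def triplet_to_rows(rows, cols, vals):
    # Build a row-indexed map from triplet data: A[r][c] = value
    A = defaultdict(dict)
    for r, c, v in zip(rows, cols, vals):
        A[r][c] = v
    return A

def sparse_multiply(A, B):
    # C[r][k] += A[r][c] * B[c][k], visiting only stored nonzeros
    C = defaultdict(lambda: defaultdict(float))
    for r, row in A.items():
        for c, a_rc in row.items():
            for k, b_ck in B.get(c, {}).items():
                C[r][k] += a_rc * b_ck
    return C

# A = [[3, 1], [0, 2]] and B = [[1], [4]] in triplet form
A = triplet_to_rows([1, 1, 2], [1, 2, 2], [3.0, 1.0, 2.0])
B = triplet_to_rows([1, 2], [1, 1], [1.0, 4.0])
C = sparse_multiply(A, B)
print(C[1][1], C[2][1])  # 7.0 8.0
```

In C the same structure is usually expressed with a CSR-style layout (row pointer array plus column/value arrays) rather than hash maps, but the loop nest is identical.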
Thanks.
There are sparse packages that do the multiplication for you. Have you tried Csparse inside SuiteSparse?
http://faculty.cse.tamu.edu/davis/suitesparse.html
You don't even need to convert the matrix format, and there are ways to feed in triplet format directly as well.
It is not a good idea to multiply sparse matrices, because the resulting matrix may well be dense anyway. Your algorithm apparently consists of sequential multiplications of sparse matrices by a vector: a matrix is multiplied by a vector, then another matrix is multiplied by the resulting product, and so on. It would be wise to stick to that computational pattern, saving both CPU time and RAM (the latter would be large if you intend to compute and store the product of two sparse matrices).
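To illustrate the pattern with dense numpy stand-ins (illustrative only; real code would keep the matrices in a sparse format):

```python
import numpy as np

# Toy dense stand-ins for two sparse matrices and a vector
A = np.array([[1., 0., 2.],
              [0., 3., 0.],
              [4., 0., 5.]])
B = np.array([[0., 1., 0.],
              [2., 0., 0.],
              [0., 0., 3.]])
v = np.array([1., 2., 3.])

# Preferred pattern: two matrix-vector products, never forming A @ B
y = A @ (B @ v)

# Equivalent result, but materializes the (potentially dense) product A @ B
y_ref = (A @ B) @ v

print(np.allclose(y, y_ref))  # True
```

The two orderings are mathematically identical, but the matrix-vector form costs O(nnz) per step, while forming A @ B can cost far more and produce a matrix with many more nonzeros than either factor.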
I am trying to perform a specific downsampling process. It is described by the following pseudocode.
//Let V be an input image with dimension of M by N (row by column)
//Let U be the destination image of size floor((M+1)/2) by floor((N+1)/2)
//The floor function is to emphasize the rounding for the even dimensions
//U and V are part of a wrapper class of Pixel_FFFF vImageBuffer
for i in 0 ..< U.size.rows {
    for j in 0 ..< U.size.columns {
        U[i,j] = V[(i * 2), (j * 2)]
    }
}
The process takes the pixel values at every other location along both dimensions. The resulting image is approximately half the size of the original in each dimension.
On a one-time call, the process runs relatively fast by itself. However, it becomes a bottleneck when it is called numerous times inside a bigger algorithm, so I am trying to optimize it. Since I use Accelerate in my app, I would like to adapt this process in a similar spirit.
Attempts
First, this process can be expressed as a 2D convolution using the 1x1 kernel [1] with a stride of [2,2]. Hence, I considered the function vImageConvolve_ARGBFFFF, but I couldn't find a way to specify the stride. This function would otherwise be the best solution, since it takes care of the Pixel_FFFF image structure.
Second, I noticed that this is merely transferring data from one array to another. So I thought the vDSP_vgathr function would be a good solution. However, I hit a wall: vectorizing a vImageBuffer yields the interleaved structure A,R,G,B,A,R,G,B,..., in which each term is 4 bytes. The vDSP_vgathr function transfers every 4 bytes to the destination array using a specified indexing vector. I could use a linear indexing formula to build such a vector, but considering both even and odd dimensions, generating the indexing vector would be as inefficient as the original solution: it would still require loops.
Also, neither of the vDSP 2D convolution functions fits the bill.
Are there any other functions in Accelerate that I might have overlooked? I saw that there is a stride option in the vDSP 1D convolution functions. Does someone know an efficient way to translate a strided 2D convolution into a series of 1D convolutions?
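Outside Accelerate, the decimation itself is easy to sanity-check: in numpy it is just a strided slice (illustrative only; a toy M x N x 4 float array stands in for the Pixel_FFFF buffer):

```python
import numpy as np

# Toy stand-in for the Pixel_FFFF buffer: an M x N image with 4 channels
M, N = 5, 7
V = np.arange(M * N * 4, dtype=np.float32).reshape(M, N, 4)

# U[i, j] = V[2*i, 2*j]: a strided view, no copy; odd and even sizes both work
U = V[::2, ::2]

print(U.shape)  # (3, 4, 4) -- i.e. (floor((M+1)/2), floor((N+1)/2), 4)
```

The point of the sketch is that the whole operation is a pure stride-2 copy with no arithmetic, which is why a convolution-based formulation feels heavier than the operation itself.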
I am a beginner in Python trying to implement computer vision algorithms. I have to iterate over an image, read as a two-dimensional array, several times, and I want to avoid using for loops.
For example, I want to multiply the camera matrix P (3x4) with each row of a coordinate matrix, where each row is 1x4; I take the transpose of the row vector for the matrix multiplication. Here is how I have implemented it using for loops. I initialize an empty array. cameras is an object instance, so I loop over it to find the total number of cameras; counter gives me that total. Then I read through each row of the matrix v_h and perform the multiplication. I would like to accomplish the task below without using a for loop. I believe it's possible, but I don't know how. With thousands of points, the for loop becomes very inefficient, and I would appreciate any help.
for c in cameras:
    counter = counter + 1
for c in cameras:
    v_to_s = np.zeros((v_h.shape[0], c.P.shape[0], counter), dtype=float)
    for i in range(0, v_h.shape[0]):
        v_to_s[i, :, cam_count] = np.dot(c.P, v_h[i, :].T)
NumPy has matmul() (the @ operator), which can perform the multiplication for all rows in a single call.
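A sketch of what that can look like (toy shapes assumed: P is 3x4, v_h holds one homogeneous point per row):

```python
import numpy as np

# Toy data mirroring the question's assumed shapes: P is 3x4, v_h is K x 4
rng = np.random.default_rng(42)
P = rng.standard_normal((3, 4))
v_h = rng.standard_normal((1000, 4))

# Loop version, as in the question (one camera shown for comparison)
loop_out = np.empty((v_h.shape[0], 3))
for i in range(v_h.shape[0]):
    loop_out[i, :] = np.dot(P, v_h[i, :].T)

# Vectorized: P @ v_h[i] for every i at once is simply v_h @ P.T
vec_out = v_h @ P.T

print(np.allclose(loop_out, vec_out))  # True
```

If there are several cameras, their P matrices can also be stacked into a (num_cameras, 3, 4) array and applied in one shot, e.g. with np.einsum('cjk,ik->icj', Ps, v_h), which fills the same (points, 3, cameras) layout as the loop.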
I want to have a time series of 2x2 complex matrices, Ot, and I then want 1-line commands to multiply an array of complex vectors, Vt, by the array Ot, where the position in the array is understood as the time instant. I want Vtprime(i) = Ot(i)*Vt(i). Can anyone suggest a simple way to implement this?
Suppose I have a matrix, M(t), where the elements m(j,k) are functions of t and t is an element of some series (t = 0:0.1:3). Can I create an array of matrices very easily?
I understand how to have an array in Matlab, and even a two dimensional array, where each "i" index holds two complex numbers (j=0,1). That would be a way to have a "time series of complex 2-d vectors". A way to have a time series of complex matrices would be a three dimensional array. (i,j,k) denotes the "ith" matrix and j=0,1 and k=0,1 give the elements of that matrix.
If I go ahead and treat Matlab like a programming language with no special packages, then I end up having to write the matrix multiplications in terms of loops, and the same goes for all the other matrix operations. I would prefer to use commands that make all this easy, if I can.
This could be solved with Matlab array iterations like
vtprime(:) = Ot(:)*Vt(:)
if I understand your problem correctly.
Since Ot and Vt are both changing with time index, I think the best way to do this is in a loop. (If only one of Ot or Vt was changing with time, you could set it up in one big matrix multiplication.)
Here's how I would set it up: Ot is a complex 2x2xI 3D matrix, so that
Ot(:,:,i)
references the matrix at time instant i.
Vt is a complex 2xI matrix, so that
Vt(:,i)
references the vector at time instant i.
To do the multiplication:
for i = 1:I
    Vtprime(:,i) = Ot(:,:,i) * Vt(:,i);
end
The resulting Vtprime is a 2xI matrix set up so that Vtprime(:,i) is the output at time instant i.
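For comparison, the same batched multiply collapses to a single call in numpy (a sketch; note the (I, 2, 2) and (I, 2) layouts here put time first, the reverse of the MATLAB layout above):

```python
import numpy as np

I = 5
rng = np.random.default_rng(1)
# Ot: one 2x2 complex matrix per time instant, stored as shape (I, 2, 2)
Ot = rng.standard_normal((I, 2, 2)) + 1j * rng.standard_normal((I, 2, 2))
# Vt: one complex 2-vector per time instant, stored as shape (I, 2)
Vt = rng.standard_normal((I, 2)) + 1j * rng.standard_normal((I, 2))

# Batched multiply: Vtprime[i] = Ot[i] @ Vt[i] for every i, in one call
Vtprime = np.einsum('ijk,ik->ij', Ot, Vt)

# Same result via the explicit loop, mirroring the MATLAB answer
check = np.stack([Ot[i] @ Vt[i] for i in range(I)])
print(np.allclose(Vtprime, check))  # True
```

In MATLAB itself, newer releases (R2020b and later) provide pagemtimes for page-wise products, so something like squeeze(pagemtimes(Ot, reshape(Vt, 2, 1, []))) should give Vtprime without the explicit loop.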
I have sequential code that looks for the maximum value in the columns of a matrix. Because this matrix could be as large as 5000 x 5000, I am thinking of speeding it up with MPI. I don't know how to achieve this yet, but I have looked at MPI_Scatter for distributing items from the columns (maybe block mapping) and MPI_Gather for collecting the max values from all processes (in my case at most 3 processes) and then comparing them... Do you think this could reduce the computing time? If so, can someone give me a kick-off?
Is finding the maximum entry in the matrix (or within a part of it) all you want to do?
If so, the easiest way is probably to split the matrix across the processes, search for the maximum value in the part each process is assigned, and then combine the results using MPI_Allreduce, which can take a variable that holds a different value on each process and distribute its maximum to all processes.
Whether you are dealing with the whole matrix or just a column, this technique can always be applied; you just have to think about a good way of splitting the data across the processes.
Of course this will speed up your computation only above a certain matrix size. If you are dealing with a 10 x 10 matrix and want to split it across 3 processes, the MPI overhead will be larger than the gain from parallelization. :)
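The split/local-search/combine pattern can be sanity-checked without MPI at all. Here is a plain numpy sketch in which taking the max over the local maxima plays the role of MPI_Allreduce with MPI_MAX:

```python
import numpy as np

# numpy sketch of the MPI pattern (no actual MPI involved):
# split the matrix into row blocks, find each block's local maximum,
# then combine the local maxima -- the job MPI_Allreduce(MPI_MAX) does.
rng = np.random.default_rng(7)
M = rng.standard_normal((1000, 1000))

nprocs = 3
blocks = np.array_split(M, nprocs, axis=0)  # what a scatter would hand out
local_max = [b.max() for b in blocks]       # each rank's local search
global_max = max(local_max)                 # MPI_Allreduce with MPI_MAX

print(global_max == M.max())  # True
```

Per-column maxima work the same way if the matrix is split by rows: each rank computes its block's column-wise maxima, and an element-wise MAX reduction over those vectors yields the column maxima of the full matrix.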