Gaussian elimination (with no pivoting) in CUDA

I am trying to implement Gaussian elimination with CUDA.
I have an N×N matrix. To compute the new elements of this matrix, I use the CPU code below, where C.width = N:
for (int z = 0; z < C.width - 1; z++)
{
    for (int c = z + 1; c < C.width; c++)
    {
        for (int d = z; d < C.width; d++)
        {
            C.elements[c*C.width+d] = C.elements[c*C.width+d]
                                    - (B.elements[c*C.width+z] * C.elements[z*C.width+d]);
        }
    }
}
I am trying to implement it in CUDA. For example, for N=512:
dim3 dimBlock(16,16,1);
dim3 dimGrid(32,32,1);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
I think that at iteration i I should use (N-i)*N threads to calculate the updated elements, that is:
if (idx > 511 || idy > 510)
    return;

for (int i = 1; i < 512; i++)
{
    if (idx >= i-1 && idy >= i-1)
        C.elements[(idy+1)*C.width+idx] = C.elements[(idy+1)*C.width+idx]
            - ((C.elements[(idy+1)*C.width+(i-1)] / C.elements[(i-1)*C.width+(i-1)])
               * C.elements[(i-1)*C.width+idx]);
    __syncthreads();
}
The results obtained on the GPU and the CPU are the same, but the processing time is Time(CPU) = 2 × Time(GPU):
For N=512:  Time(CPU) = 1900 ms; Time(GPU) = 980 ms
For N=1024: Time(CPU) = 14000 ms; Time(GPU) = 7766 ms
I think the speed-up should be larger than what I have now. Is there a mistake in my parallel code? Can you help me rewrite it?
Thanks for any help!

Gaussian elimination can be seen as a two-step procedure. The first step aims at transforming the linear system into an upper triangular linear system, and the second consists of solving the resulting upper triangular system. The second step is trivial in CUDA and can be efficiently performed by cublasStrsm. The first step, which you are addressing in your post, is the tricky part.
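For reference, the second step really is a single library call. A minimal sketch (assuming the upper triangular factor d_U and the right-hand side d_b already live on the device in column-major order; handle creation, error checking, and the backSubstitute name are just for this illustration):

#include <cublas_v2.h>

// Solve U * x = b on the GPU; d_b is overwritten with the solution x.
void backSubstitute(cublasHandle_t handle, const float *d_U, float *d_b, int N)
{
    const float alpha = 1.0f;
    cublasStrsm(handle,
                CUBLAS_SIDE_LEFT,        // U multiplies x from the left
                CUBLAS_FILL_MODE_UPPER,  // U is upper triangular
                CUBLAS_OP_N,             // no transpose
                CUBLAS_DIAG_NON_UNIT,    // general (non-unit) diagonal
                N, 1,                    // b is a single N x 1 right-hand side
                &alpha,
                d_U, N,
                d_b, N);
}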
There are several optimized approaches to solving the first step. I think your approach is somewhat naive, and I recommend studying the literature to achieve decent speedups.
Basically, the transformation of the original system into an upper triangular one can be performed by a tiling approach which, in some respects, resembles the tiling approach used to perform matrix-matrix multiplication in the classical example of the CUDA C Programming Guide.
The tiling approach can be performed either by purposely written kernels or by making massive use of cuBLAS routines.
Last month (November 2013), the following paper
Manuel Carcenac, "From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices", Journal of Supercomputing, DOI 10.1007/s11227-013-1043-3
proposed a tiling/striping approach based on the use of cuBLAS.
All the above-mentioned approaches are summarized in a presentation available on M. Carcenac's webpage, entitled Application: linear system resolution with Gauss method.
Furthermore, a downloadable Visual Studio 2010 project implementing all of them, with some performance testing, is available at the Gaussian elimination with CUDA post. From the available code, you can run your own tests for your architecture of interest and see the improvement that M. Carcenac's approach introduces with respect to the others.

Related

Understanding polyline in AI

While using a computer display, polylines are used. You want to use an algorithm which will reduce points in the polyline. The polyline should be decimated within a specified tolerance. Which of the following algorithm would you use?
A) Flood fill Algorithm
B) Lee Algorithm
C) Floyd's Cycle Detection Algorithm
D) Vertex Reduction
First of all, this has nothing to do with artificial intelligence. All of the mentioned algorithms solve some sort of graph-related problem. Simplified, they can be described as follows:
Flood fill is an algorithm that "colors" an arbitrarily shaped, connected region in a "maze"
Lee's algorithm finds the best path between two points through a "maze" with obstacles
Cycle detection algorithms find cycles (a non-trivial path where the start is also the end)
Vertex reduction algorithms remove vertices from a graph such that the graph keeps its original shape, but with fewer features.
The answer to the problem should therefore be obvious ;)
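For illustration, here is a minimal sketch of vertex reduction in its simplest form: keep a vertex only if it lies at least the tolerance away from the last kept vertex (the Point type and the vertexReduce name are just for this example):

#include <cmath>
#include <vector>

struct Point { double x, y; };

// Drop every vertex that lies closer than tol to the last vertex we kept.
std::vector<Point> vertexReduce(const std::vector<Point> &poly, double tol)
{
    std::vector<Point> out;
    if (poly.empty())
        return out;
    out.push_back(poly.front());
    for (const Point &p : poly) {
        if (std::hypot(p.x - out.back().x, p.y - out.back().y) >= tol)
            out.push_back(p);
    }
    return out;
}

More elaborate decimation schemes, such as Douglas-Peucker, bound the deviation of the whole simplified polyline rather than just the distance between consecutive kept points.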

Inverse matrix calculation in real time

I have been developing C control software that works in real time. Among other things, the software implements a discrete state-space observer of the controlled system. For the implementation of the observer it is necessary to calculate the inverse of a 4x4 matrix. The inverse matrix calculation has to be done every 50 microseconds, and it is worth saying that during this time period other pretty time-consuming calculations will also be done. So the inverse matrix calculation has to consume much less than 50 microseconds. It is also necessary to say that the DSP used does not have an ALU with floating-point support.
I have been looking for an efficient way to do that. One idea I have is to prepare a general formula for calculating the determinant of a 4x4 matrix and a general formula for calculating the adjoint matrix of a 4x4 matrix, and then calculate the inverse matrix according to the formula inv(A) = adj(A) / det(A).
What do you think about this approach?
As I understand it, the consensus among those who study numerical linear algebra is that you should avoid computing matrix inverses unnecessarily. For example, if the inverse of A appears in your controller only in expressions such as
z = inv(A)*y
then it is better (faster, more accurate) to solve for z the equation
A*z = y
than to compute inv(A) and then multiply y by inv(A).
A common method to solve such equations is to factorize A into simpler parts. For example, if A is (strictly) positive definite, then the Cholesky factorization finds a lower triangular matrix L such that
A = L*L'
Given that, we can solve A*z = y for z via:
solve L*u = y for u
solve L'*z = u for z
and each of these solves is easy given the triangular nature of L.
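As a concrete sketch of the whole procedure for the 4x4 case (written with doubles for clarity; on your DSP without floating-point hardware this would need a fixed-point port, and the function names are just illustrative):

#include <cmath>

const int N = 4;

// Factor a symmetric positive definite A (row-major) in place:
// afterwards the lower triangle of A holds L, with A = L*L'.
void cholesky(double A[N][N])
{
    for (int j = 0; j < N; ++j) {
        for (int k = 0; k < j; ++k)
            A[j][j] -= A[j][k] * A[j][k];
        A[j][j] = std::sqrt(A[j][j]);
        for (int i = j + 1; i < N; ++i) {
            for (int k = 0; k < j; ++k)
                A[i][j] -= A[i][k] * A[j][k];
            A[i][j] /= A[j][j];
        }
    }
}

// Solve A*z = y using the factor L: forward substitution for L*u = y,
// then back substitution for L'*z = u.
void solve(const double L[N][N], const double y[N], double z[N])
{
    double u[N];
    for (int i = 0; i < N; ++i) {
        u[i] = y[i];
        for (int k = 0; k < i; ++k)
            u[i] -= L[i][k] * u[k];
        u[i] /= L[i][i];
    }
    for (int i = N - 1; i >= 0; --i) {
        z[i] = u[i];
        for (int k = i + 1; k < N; ++k)
            z[i] -= L[k][i] * z[k];
        z[i] /= L[i][i];
    }
}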
Another factorization (that again only applies to positive definite matrices) is the LDL decomposition, which in your case may be easier as it does not involve square roots. It is described in the wiki article linked above.
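Since square roots are expensive without an FPU, the LDL' variant may indeed suit you better. A minimal sketch of the factorization, using the same in-place storage convention as above (the ldlt name is illustrative):

// Factor A = L*D*L' in place: the strictly lower triangle of A ends up
// holding the unit lower triangular L, and D is returned separately.
// No square roots are needed.
void ldlt(double A[4][4], double D[4])
{
    const int N = 4;
    for (int j = 0; j < N; ++j) {
        D[j] = A[j][j];
        for (int k = 0; k < j; ++k)
            D[j] -= A[j][k] * A[j][k] * D[k];
        for (int i = j + 1; i < N; ++i) {
            for (int k = 0; k < j; ++k)
                A[i][j] -= A[i][k] * D[k] * A[j][k];
            A[i][j] /= D[j];
        }
    }
}

The solve then mirrors the Cholesky one: forward substitution with the unit lower triangular L, a division by each D[j], and back substitution with L'.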
More general factorizations include LU and QR. These are more general in that they can be applied to any (invertible) matrix, but are somewhat slower than Cholesky.
Such factorizations can also be used to compute inverses.
To be pedantic, describing adj(A) in your post as the adjoint is, perhaps, a little old-fashioned; I think adjugate or adjunct is more modern. In any case, adj(A) is not just a transpose: the (i,j) element of adj(A) is, up to a sign, the determinant of the matrix obtained from A by deleting the j'th row and i'th column. It is awkward to compute this efficiently.

Polynomial Equations in Q# E=MC^2

I am trying to understand how to use quantum computing, and have started to understand some of the basic gates and other concepts, but I am not able to understand how to put it into practice for real-world problems.
Let's say I want to write a function in Q# that returns the value of E in the equation
E= MC^2
Can someone help me write this operation?
To answer the literal question: if M and C are just floating-point numbers, the calculation can be done using purely classical Q# constructs:
// The function that carries out the computation itself
function Energy (m : Double, c : Double) : Double {
    return m * c ^ 2.0;
}

// The operation that you'll invoke to pass the parameters to the function and to print the results
operation PrintEnergy () : Unit {
    let c = 299792458.0;
    let energy1 = Energy(1.0, c);
    Message($"Calculated energy of 1 gram of mass = {energy1}");
    let energy2 = Energy(2.0, c);
    Message($"Calculated energy of 2 grams of mass = {energy2}");
}
The output is:
Calculated energy of 1 gram of mass = 89875517873681760
Calculated energy of 2 grams of mass = 1.7975103574736352E+17
You will notice that this code fragment does not use any qubits or gates, so it's not really a good example of using quantum computing to solve real-world problems, even though it's implemented using a quantum programming language. This problem involved very simple mathematical computations, which can be done very efficiently using classical computers.
Quantum computers are going to use a co-processor model of computation - we'll use them to do computations that they are well suited to do (such as solving chemistry problems), and use classical computers for the rest of the computations.
To learn to apply quantum computing to solving problems with Q#, you can check out the Quantum Katas - a collection of tutorials and programming exercises. In particular, they show how to translate classical problems such as SAT or graph coloring into a form that can take advantage of quantum computing algorithms. (Disclosure: I'm the maintainer of this project)

computing function of neighbors efficiently on lattice

I'm studying the Ising model, and I'm trying to efficiently compute a function H(σ), where σ is the current state of an L×L lattice (that is, σ_ij ∈ {+1, -1} for i,j ∈ {1,2,...,L}). To compute H for a particular σ, I need to perform the following calculation:

H(σ) = -J Σ_⟨i j⟩ σ_i σ_j

where ⟨i j⟩ indicates that sites σ_i and σ_j are nearest neighbors and (suppose) J is a constant.
A couple of questions:
Should I store my state σ as an L×L matrix or as a length-L² list? Is one better than the other for memory access in RAM (which I guess depends on the way I'm accessing elements...)?
In either case, how can I best compute H?
Really, I think this boils down to how I can access (and manipulate) the neighbors of each site most efficiently.
Some thoughts:
I see that if I loop through each element in the list or matrix I'll be double counting, so is there a "best" way to enumerate only the unique neighbor pairs?
Is there a better data structure that I'm not thinking of?
Your question is a bit broad and a bit confusing to me, so excuse me if my answer is not the one you are looking for, but I hope it helps (a bit).
An array is faster than a list when it comes to indexing. A matrix is a 2D array, for example a[N][M] (where N and M are both L in your case).
That means that you first access a[i] and then a[i][j].
However, you can avoid this double access by emulating a 2D array with a 1D array. In that case, to access element a[i][j] in your matrix, you would do a[i * L + j].
That way you perform a single load at the cost of one multiplication and one addition, which may still be faster in some cases.
Now as for the Nearest Neighbor question, it seems that you are using a square-lattice Ising model, which means that you are working in 2 dimensions.
A very efficient data structure for Nearest Neighbor Search in low dimensions is the kd-tree. The construction of that tree takes O(n log n), where n is the size of your dataset.
Now you should think if it's worth it to build such a data structure.
PS: There is a plethora of libraries implementing the kd-tree, such as CGAL.
I encountered this problem during one of my school assignments, and I think the solution depends on which programming language you are using.
In terms of efficiency, there is no better way than to write a for loop that sums the neighbours (which for a given (i,j) are the four points (i±1, j) and (i, j±1)). However, when SIMD (SSE etc.) instructions are available, you can re-express this as a convolution with the 2D kernel {0 1 0; 1 0 1; 0 1 0}, so if you use a numerical library which exploits SIMD you can obtain a significant performance increase. You can see an example implementation of this here: https://github.com/zawlin/cs5340/blob/master/a1_code/denoiseIsingGibbs.py
Note that in this case the performance improvement is huge, because evaluating it with a plain for loop in Python is expensive.
In terms of work, there is in fact some waste: the unnecessary multiplications and additions with the zeros at the corners and the center of the kernel. So whether you see a performance improvement depends quite a bit on your programming environment (if you are already in C/C++, it can be difficult, and you may need MKL etc. to obtain a good improvement).
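To make the plain-loop option concrete, here is a minimal sketch that avoids double counting by pairing each site only with its right and down neighbors (assuming a flattened row-major array and periodic boundaries; the names are illustrative):

#include <vector>

// H(sigma) = -J * sum over nearest-neighbor pairs, each bond counted once.
double isingEnergy(const std::vector<int> &s, int L, double J)
{
    long sum = 0;
    for (int i = 0; i < L; ++i) {
        for (int j = 0; j < L; ++j) {
            int here  = s[i * L + j];
            int right = s[i * L + (j + 1) % L];   // wrap at the right edge
            int down  = s[((i + 1) % L) * L + j]; // wrap at the bottom edge
            sum += here * (right + down);
        }
    }
    return -J * sum;
}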

Fastest implementation for All-pairs shortest paths problem?

I have a weighted graph with 30k nodes and 160k edges, and no negative weights.
I would like to compute all the shortest paths from all the nodes to the others.
I think I cannot assume any particular heuristics to simplify the problem.
I tried this Dijkstra C implementation http://compprog.wordpress.com/2007/12/01/one-source-shortest-path-dijkstras-algorithm/, which solves the single-source shortest path problem, calling the function dijkstras() for each of my 30k nodes. As you can imagine, it takes ages. At the moment I don't have the time to write and debug the code myself; I have to compute these paths as soon as possible and store them in a database, so I am looking for another, faster solution ready to use. Do you have any tips?
I have to run it on a recent MacBook Pro with 8GB RAM, and I would like to find a solution that takes no more than 24 hours to finish the computation.
Thanks a lot in advance!!
Eugenio
I looked over the Dijkstra's algorithm link that you posted in the comments and I believe that it's the source of your inefficiency. Inside the inner Dijkstra's loop, it's using an extremely unoptimized approach to determine which node to explore next (a linear scan over every node at each step). The problematic code is in two spots. The first is this code, which tries to find the next node to operate on:
mini = -1;
for (i = 1; i <= n; ++i)
    if (!visited[i] && ((mini == -1) || (d[i] < d[mini])))
        mini = i;
Because this code is nested inside of a loop that visits every node, the complexity (as mentioned in the link) is O(|V|^2), where |V| is the number of nodes. In your case, since |V| is 30,000, there will be nine hundred million iterations of this loop overall. This is painfully slow (as you've seen), but there's no reason to have to do this much work.
Another trouble spot is here, which tries to find which edge in the graph should be used to reduce the cost of other nodes:
for (i = 1; i <= n; ++i)
    if (dist[mini][i])
        if (d[mini] + dist[mini][i] < d[i])
            d[i] = d[mini] + dist[mini][i];
This scans over an entire row in the adjacency matrix looking for nodes to consider, which takes time O(n) irrespective of how many outgoing edges leave the node.
While you could try fixing up this version of Dijkstra's into a more optimized implementation, I think the correct option here is just to throw this code away and find a better implementation of Dijkstra's algorithm. For example, if you use the pseudocode from the Wikipedia article implemented with a binary heap, you can get Dijkstra's algorithm running in O(|E| log |V|). In your case, this value is just over two million, which is about 450 times faster than your current approach. That's a huge difference, and I'm willing to bet that with a better Dijkstra's implementation you'll end up getting the code completing in a substantially shorter time than before.
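For concreteness, here is a minimal self-contained sketch of that binary-heap variant over an adjacency list, using lazy deletion of stale queue entries (calling it once per source node gives you the all-pairs distances):

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, long long>;   // (neighbor, weight)

std::vector<long long> dijkstra(const std::vector<std::vector<Edge>> &adj, int src)
{
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(adj.size(), INF);
    // std::priority_queue is a max-heap, so use std::greater to pop the
    // smallest tentative distance first
    std::priority_queue<std::pair<long long, int>,
                        std::vector<std::pair<long long, int>>,
                        std::greater<>> pq;
    dist[src] = 0;
    pq.push({0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top();
        pq.pop();
        if (d > dist[u])
            continue;                     // stale entry, skip it
        for (auto [v, w] : adj[u]) {
            if (d + w < dist[v]) {        // found a shorter path to v
                dist[v] = d + w;
                pq.push({dist[v], v});
            }
        }
    }
    return dist;
}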
On top of this, you might want to consider running all the Dijkstra searches in parallel, as Jacob Eggers has pointed out. This can get you an extra speed boost for each processor that you have. Combined with the above (and more critical) fix, this should probably give you a huge performance increase.
If you plan on running this algorithm on a much denser data set (one where the number of edges approaches |V|^2 / log |V|), then you may want to consider switching to the Floyd-Warshall algorithm. Running Dijkstra's algorithm once per node (sometimes called Johnson's algorithm) takes O(|V| |E| log |V|) time, while Floyd-Warshall takes O(|V|^3) time. However, for the data set you've mentioned, the graph is sufficiently sparse that running multiple Dijkstra instances should be fine.
Hope this helps!
How about the Floyd-Warshall algorithm?
Does your graph have any special structure? Is the graph planar (or nearly so)?
I'd recommend not trying to store all shortest paths: a pretty dense encoding (30k^2 "where to go next" entries) would take up 7 gigs of memory.
What is the application? Are you sure that doing a bidirectional Dijkstra (or A*, if you have a heuristic) won't be fast enough when you need to find a particular shortest path?
If you can modify the algorithm to be multi-threaded you might be able to finish it in less than 24hrs.
The first node may take more than 1 minute. However, the 15,000th node should only take half that time, because you would already have calculated the shortest paths to all of the previous nodes.
The bottleneck can be the data structure you use to store the paths. If you use too much storage, you run out of cache and memory space very soon, causing a fast algorithm to run very slowly, because it picks up a constant multiplier on the order of 100 (cache misses) or 10,000+ (swapped pages).
Because you have to store the paths in a database, I suspect that might easily be a bottleneck. It is probably best to first generate the paths in memory with a very efficient storage format, like N bits per vertex, where N == maximum number of edges per vertex. Then set a bit for each edge that can be used to generate one of the shortest paths. After generating this path information, you can run a recursive algorithm to convert it to a format suitable for database storage.
Of course, the most likely bottleneck is still the database. You want to think very carefully about what format you use to store the information, because inserting, searching, and modifying large datasets in an SQL database is very slow. Also, using transactions for the database operations might reduce the disk write overhead, if the database engine manages to batch multiple insertions into a single disk write operation.
It can be even better to simply store the results in a memory cache and discard solutions when they are no longer actively needed, regenerating them on demand if you happen to need them again. That way you would generate paths only when you actually need them. The runtime for 30k nodes and 160k edges should be clearly below a second for a single single-source Dijkstra run.
For shortest path algorithms I have always chosen C++. There shouldn't be any reason why a C implementation wouldn't be simple too, but C++ offers reduced coding effort with the STL containers, which can be used in an initial implementation, with an optimized queue implemented later only if benchmarks and profiling show that something better than the STL is needed.
#include <queue>

#include "vertex.h"   /* assumed to define the vertex and edge classes used below */

class searchnode {
    vertex *dst;
    unsigned long dist;
public:
    searchnode(vertex *destination, unsigned long distance) :
        dst(destination),   /* was dst(dst), which left the pointer uninitialized */
        dist(distance)
    {
    }

    bool operator<(const searchnode &b) const {
        /* std::priority_queue stores the largest value at the top */
        return dist > b.dist;
    }

    /* renamed: a member function cannot share its name with the data member */
    vertex *destination() const { return dst; }
    unsigned long travelDistance() const { return dist; }
};

static void dijkstra(vertex *src, vertex *dst)
{
    /* assumes every vertex starts with distance() == ULONG_MAX */
    std::priority_queue<searchnode> queue;

    queue.push(searchnode(src, 0));
    while (!queue.empty()) {
        searchnode cur = queue.top();
        queue.pop();

        /* lazy deletion: skip stale entries for already-settled vertices */
        if (cur.travelDistance() >= cur.destination()->distance())
            continue;
        cur.destination()->setDistance(cur.travelDistance());
        if (cur.destination() == dst)
            return;   /* the target's shortest distance is now final */

        for (edge *eiter = cur.destination()->begin();
             eiter != cur.destination()->end(); ++eiter) {
            unsigned long nextDist = cur.travelDistance() + eiter->cost();
            /* push only if this edge improves the neighbor's best known distance */
            if (nextDist >= eiter->otherVertex()->distance())
                continue;
            queue.push(searchnode(eiter->otherVertex(), nextDist));
        }
    }
}
