I have a weighted graph 30k nodes 160k edges, no negative weights.
I would like to compute all the shortest paths from all the nodes to the others.
I think I cannot assume any particular heuristics to simplify the problem.
I tried to use this Dijkstra C implementation http://compprog.wordpress.com/2007/12/01/one-source-shortest-path-dijkstras-algorithm/, that is for a single shortest path problem, calling the function dijkstras() for all my 30 nodes. As you can imagine, it takes ages. At the moment I don't have the time to write and debug the code by myself, I have to compute this paths as soon as possible and store them in a database so I am looking for another faster solution ready to use, do you have any tips?
I have to run it on a recent macbook pro with 8GB ram and I would like to find a solution that takes not more than 24 hours to finish the computation.
Thanks a lot in advance!!
Eugenio
I looked over the Dijkstra's algorithm link that you posted in the comments and I believe that it's the source of your inefficiency. Inside the inner Dijkstra's loop, it's using an extremely unoptimized approach to determine which node to explore next (a linear scan over every node at each step). The problematic code is in two spots. The first is this code, which tries to find the next node to operate on:
mini = -1;
for (i = 1; i <= n; ++i)
if (!visited[i] && ((mini == -1) || (d[i] < d[mini])))
mini = i;
Because this code is nested inside of a loop that visits every node, the complexity (as mentioned in the link) is O(|V|2), where |V| is the number of nodes. In your case, since |V| is 30,000, there will be nine hundred million iterations of this loop overall. This is painfully slow (as you've seen), but there's no reason to have to do this much work.
Another trouble spot is here, which tries to find which edge in the graph should be used to reduce the cost of other nodes:
for (i = 1; i <= n; ++i)
if (dist[mini][i])
if (d[mini] + dist[mini][i] < d[i])
d[i] = d[mini] + dist[mini][i];
This scans over an entire row in the adjacency matrix looking for nodes to consider, which takes time O(n) irrespective of how many outgoing edges leave the node.
While you could try fixing up this version of Dijkstra's into a more optimized implementation, I think the correct option here is just to throw this code away and find a better implementation of Dijkstra's algorithm. For example, if you use the pseudocode from the Wikipedia article implemented with a binary heap, you can get Dijkstra's algorithm running in O(|E| log |V|). In your case, this value is just over two million, which is about 450 times faster than your current approach. That's a huge difference, and I'm willing to bet that with a better Dijkstra's implementation you'll end up getting the code completing in a substantially shorter time than before.
On top of this, you might want to consider running all the Dijkstra searches in parallel, as Jacob Eggers has pointed out. This cam get you an extra speed boost for each processor that you have. Combined with the above (and more critical) fix, this should probably give you a huge performance increase.
If you plan on running this algorithm on a much denser data set (one where the number of edges approaches |V|2 / log |V|), then you may want to consider switching to the Floyd-Warshall algorithm. Running Dijkstra's algorithm once per node (sometimes called Johnson's algorithm) takes time O(|V||E| log |V|) time, while using Floyd-Warshall takes O(|V|3) time. However, for the data set you've mentioned, the graph is sufficiently sparse that running multiple Dijkstra's instances should be fine.
Hope this helps!
How about the Floyd-Warshall algorithm?
Does your graph have any special structure? Is the graph planar (or nearly so)?
I'd recommend not trying to store all shortest paths, a pretty dense encoding (30k^2 "where to go next" entries) will take up 7 gigs of memory.
What is the application? Are you sure that doing a bidirectional Dijkstra (or A*, if you have a heuristic) won't be fast enough when you need to find a particular shortest path?
If you can modify the algorithm to be multi-threaded you might be able to finish it in less than 24hrs.
The first node may be taking more than 1 minute. However, the 15,000th node should only take half that time because you would have calculated the shortest paths to all of the previous nodes.
Bottleneck can be your data structure that you use storing paths. If you use too much storage you run out of cache and memory space very soon causing fast algorithm to run very slowly because it gains order of 100 (cache miss) or 10000+ (swapped pages) constant multiplier.
Because you have to store paths in database I suspect that might be easily a bottleneck. It is probably best to try to first generate paths into memory with very efficient storage mode like N bits per vertex where N == maximum number of edges per vertex. Then set a bit to for each edge that can be used to generate one of the shortest paths. After generating this path information you can run a recursive algorithm generating the path information to a format suitable for database storage.
Of course the most likely bottleneck is still the database. You want to think very carefully what format you use to store the information because insert, search and modifying large datasets in SQL database is very slow. Also using transaction to do database operations might be able to reduce the disk write overhead if database engine manages to path multiple insertions to a single disk write operation.
It can be even better to simple store results in memory cache and discard solutions when they are not actively needed any more. Then generate same results again if you happen to need them again. That means you would generate paths only on demand when you actually need them. Runtime for 30k nodes and 160k edges should be clearly below a second for single all shortest path run of Dijkstra.
For shortest path algorithms I have always chosen C++. There shouldn't be any reason why C implementation wouldn't be simple too but C++ offers reduced coding with STL containers that can be used in initial implementation and only later implement optimized queue algorithm if benchmarks and profiling shows that there is need to have something better than STL offers.
#include <queue>
#include "vertex.h"
class vertex;
class edge;
class searchnode {
vertex *dst;
unsigned long dist;
public:
searchnode(vertex *destination, unsigned long distance) :
dst(dst),
dist(distance)
{
}
bool operator<(const searchnode &b) const {
/* std::priority_queue stores largest value at top */
return dist > b.dist;
}
vertex *dst() const { return dst; }
unsigned long travelDistance() const { return dist; }
};
static void dijkstra(vertex *src, vertex *dst)
{
std::priority_queue<searchnode> queue;
searchnode start(src, 0);
queue.push(start);
while (!queue.empty()) {
searchnode cur = queue.top();
queue.pop();
if (cur.travelDistance() >= cur.dst()->distance())
continue;
cur.dst()->setDistance(cur.travelDistance());
edge *eiter;
for (eiter = cur.dst()->begin(); eiter != cur.dst()->end(); eiter++) {
unsigned nextDist = cur.dist() + eiter->cost();
if (nextDist >= eiter->otherVertex())
continue;
either->otherVertex()->setDistance(nextdist + 1);
searchnode toadd(eiter->othervertex(), nextDist);
queue.push(toadd);
}
}
}
Related
I have an array of structs, where each struct is a 2D position (pair of 32-bit values). This array is for tracking points of interest on a map.
struct Point {
int x;
int y;
};
// ...
struct Point pointsOfInterest[1024];
The problem is, those points of interest are constantly changing, meaning the entries in the array are very frequently being added or removed. On top of that, each reported point of interest can possibly already exist in the array, so I can't blindly add new ones without checking whether they already exist.
At the moment the array is unsorted (new entries added to the end, swap and pop to remove), and I iterate over the entire list to find entries for removal or duplication check. I'd like to know what my options are for speeding up this process.
In other languages, this is where I break out a dictionary or hash set. Neither exist in C, so I have to weigh the complexity of adding something like that.
I've considered sorting the list (i.e. first by X, then Y). But given the frequency of updates, I feel like I'll be thrashing the table far more than when iterating. But my knowledge of sorting algorithms is minimal.
Would a binary tree of some sort be any better here? Or would I again be spending all of my time re-balancing the tree?
Theoretically, given the (perceived) complexity of these algorithms, is there a threshold below which a linear search remains a viable option?
I'm assuming this is a known solved problem, so I'm hoping to get pointed in the right direction before I spend a lot of time reinventing the wheel and testing possible solutions.
Apart from trivial cases, it's often extremely hard to predict where the performance gains are. That's why you should benchmark your code before and after changes. Also profile your code to find where it spends most time.
In other languages, this is where I break out a dictionary or hash set. Neither exist in C, so I have to weigh the complexity of adding something like that.
TBH, it's not that complicated to implement. If you need the performance, it's a no brainer. But it's not guaranteed that it will be faster.
I've considered sorting the list (i.e. first by X, then Y). But given the frequency of updates, I feel like I'll be thrashing the table far more than when iterating. But my knowledge of sorting algorithms is minimal.
It's very likely that this is not optimal. But you can try it out. And you don't need to do a complete sort. Just do a binary search and move everything that comes after.
Would a binary tree of some sort be any better here? Or would I again be spending all of my time re-balancing the tree?
Only one way to find out. Try it and benchmark.
Theoretically, given the (perceived) complexity of these algorithms, is there a threshold below which a linear search remains a viable option?
I'm sure there are, but these always have to be balanced with reality. Like cache misses that can have a great impact on performance. One thing that might improve cache-friendlyness could be changing
struct Point {
int x;
int y;
};
struct Point pointsOfInterest[1024];
to
int pointsOfInterest[2][1024];
And use the first index for x or y. Might work, depending on what you're doing with the data. I guess it would not work in your case, but it could speed up a function that's only loops over one dimension.
The deadline for this project is closing in very quickly and I don't have much time to deal with what it's left. So, instead of looking for the best (and probably more complicated/time consuming) algorithms, I'm looking for the easiest algorithms to implement a few operations on a Graph structure.
The operations I'll need to do is as follows:
List all users in the graph network given a distance X
List all users in the graph network given a distance X and the type of relation
Calculate the shortest path between 2 users on the graph network given a type of relation
Calculate the maximum distance between 2 users on the graph network
Calculate the most distant connected users on the graph network
A few notes about my Graph implementation:
The edge node has 2 properties, one is of type char and another int. They represent the type of relation and weight, respectively.
The Graph is implemented with linked lists, for both the vertices and edges. I mean, each vertex points to the next one and each vertex also points to the head of a different linked list, the edges for that specific vertex.
What I know about what I need to do:
I don't know if this is the easiest as I said above, but for the shortest path between 2 users, I believe the Dijkstra algorithm is what people seem to recommend pretty often so I think I'm going with that.
I've been searching and searching and I'm finding it hard to implement this algorithm, does anyone know of any tutorial or something easy to understand so I can implement this algorithm myself? If possible, with C source code examples, it would help a lot. I see many examples with math notations but that just confuses me even more.
Do you think it would help if I "converted" the graph to an adjacency matrix to represent the links weight and relation type? Would it be easier to perform the algorithm on that instead of the linked lists? I could easily implement a function to do that conversion when needed. I'm saying this because I got the feeling it would be easier after reading a couple of pages about the subject, but I could be wrong.
I don't have any ideas about the other 4 operations, suggestions?
List all users in the graph network given a distance X
A distance X from what? from a starting node or a distance X between themselves? Can you give an example? This may or may not be as simple as doing a BF search or running Dijkstra.
Assuming you start at a certain node and want to list all nodes that have distances X to the starting node, just run BFS from the starting node. When you are about to insert a new node in the queue, check if the distance from the starting node to the node you want to insert the new node from + the weight of the edge from the node you want to insert the new node from to the new node is <= X. If it's strictly lower, insert the new node and if it is equal just print the new node (and only insert it if you can also have 0 as an edge weight).
List all users in the graph network given a distance X and the type of relation
See above. Just factor in the type of relation into the BFS: if the type of the parent is different than that of the node you are trying to insert into the queue, don't insert it.
Calculate the shortest path between 2 users on the graph network given a type of relation
The algorithm depends on a number of factors:
How often will you need to calculate this?
How many nodes do you have?
Since you want easy, the easiest are Roy-Floyd and Dijkstra's.
Using Roy-Floyd is cubic in the number of nodes, so inefficient. Only use this if you can afford to run it once and then answer each query in O(1). Use this if you can afford to keep an adjacency matrix in memory.
Dijkstra's is quadratic in the number of nodes if you want to keep it simple, but you'll have to run it each time you want to calculate the distance between two nodes. If you want to use Dijkstra's, use an adjacency list.
Here are C implementations: Roy-Floyd and Dijkstra_1, Dijkstra_2. You can find a lot on google with "<algorithm name> c implementation".
Edit: Roy-Floyd is out of the question for 18 000 nodes, as is an adjacency matrix. It would take way too much time to build and way too much memory. Your best bet is to either use Dijkstra's algorithm for each query, but preferably implementing Dijkstra using a heap - in the links I provided, use a heap to find the minimum. If you run the classical Dijkstra on each query, that could also take a very long time.
Another option is to use the Bellman-Ford algorithm on each query, which will give you O(Nodes*Edges) runtime per query. However, this is a big overestimate IF you don't implement it as Wikipedia tells you to. Instead, use a queue similar to the one used in BFS. Whenever a node updates its distance from the source, insert that node back into the queue. This will be very fast in practice, and will also work for negative weights. I suggest you use either this or the Dijkstra with heap, since classical Dijkstra might take a long time on 18 000 nodes.
Calculate the maximum distance between 2 users on the graph network
The simplest way is to use backtracking: try all possibilities and keep the longest path found. This is NP-complete, so polynomial solutions don't exist.
This is really bad if you have 18 000 nodes, I don't know any algorithm (simple or otherwise) that will work reasonably fast for so many nodes. Consider approximating it using greedy algorithms. Or maybe your graph has certain properties that you could take advantage of. For example, is it a DAG (Directed Acyclic Graph)?
Calculate the most distant connected users on the graph network
Meaning you want to find the diameter of the graph. The simplest way to do this is to find the distances between each two nodes (all pairs shortest paths - either run Roy-Floyd or Dijkstra between each two nodes and pick the two with the maximum distance).
Again, this is very hard to do fast with your number of nodes and edges. I'm afraid you're out of luck on these last two questions, unless your graph has special properties that can be exploited.
Do you think it would help if I "converted" the graph to an adjacency matrix to represent the links weight and relation type? Would it be easier to perform the algorithm on that instead of the linked lists? I could easily implement a function to do that conversion when needed. I'm saying this because I got the feeling it would be easier after reading a couple of pages about the subject, but I could be wrong.
No, adjacency matrix and Roy-Floyd are a very bad idea unless your application targets supercomputers.
This assumes O(E log V) is an acceptable running time, if you're doing something online, this might not be, and it would require some higher powered machinery.
List all users in the graph network given a distance X
Djikstra's algorithm is good for this, for one time use. You can save the result for future use, with a linear scan through all the vertices (or better yet, sort and binary search).
List all users in the graph network given a distance X and the type of relation
Might be nearly the same as above -- just use some function where the weight would be infinity if it is not of the correct relation.
Calculate the shortest path between 2 users on the graph network given a type of relation
Same as above, essentially, just determine early if you match the two users. (Alternatively, you can "meet in the middle", and terminate early if you find someone on both shortest path spanning tree)
Calculate the maximum distance between 2 users on the graph network
Longest path is an NP-complete problem.
Calculate the most distant connected users on the graph network
This is the diameter of the graph, which you can read about on Math World.
As for the adjacency list vs adjacency matrix question, it depends on how densely populated your graph is. Also, if you want to cache results, then the matrix might be the way to go.
The simplest algorithm to compute shortest path between two nodes is Floyd-Warshall. It's just triple-nested for loops; that's it.
It computes ALL-pairs shortest path in O(N^3), so it may do more work than necessary, and will take a while if N is huge.
After implementing most of the common and needed functions for my Graph implementation, I realized that a couple of functions (remove vertex, search vertex and get vertex) don't have the "best" implementation.
I'm using adjacency lists with linked lists for my Graph implementation and I was searching one vertex after the other until it finds the one I want. Like I said, I realized I was not using the "best" implementation. I can have 10000 vertices and need to search for the last one, but that vertex could have a link to the first one, which would speed up things considerably. But that's just an hypothetical case, it may or may not happen.
So, what algorithm do you recommend for search lookup? Our teachers talked about Breadth-first and Depth-first mostly (and Dikjstra' algorithm, but that's a completely different subject). Between those two, which one do you recommend?
It would be perfect if I could implement both but I don't have time for that, I need to pick up one and implement it has the first phase deadline is approaching...
My guess, is to go with Depth-first, seems easier to implement and looking at the way they work, it seems a best bet. But that really depends on the input.
But what do you guys suggest?
If you’ve got an adjacency list, searching for a vertex simply means traversing that list. You could perhaps even order the list to decrease the needed lookup operations.
A graph traversal (such as DFS or BFS) won’t improve this from a performance point of view.
Finding and deleting nodes in a graph is a "search" problem not a graph problem, so to make it better than O(n) = linear search, BFS, DFS, you need to store your nodes in a different data structure optimized for searching or sort them. This gives you O(log n) for find and delete operations. Candidatas are tree structures like b-trees or hash tables. If you want to code the stuff yourself I would go for a hash table which normally gives very good performance and is reasonably easy to implement.
I think BFS would usually be faster an average. Read the wiki pages for DFS and BFS.
The reason I say BFS is faster is because it has the property of reaching nodes in order of their distance from your starting node. So if your graph has N nodes and you want to search for node N and node 1, which is the node you start your search form, is linked to N, then you will find it immediately. DFS might expand the whole graph before this happens however. DFS will only be faster if you get lucky, while BFS will be faster if the nodes you search for are close to your starting node. In short, they both depend on the input, but I would choose BFS.
DFS is also harder to code without recursion, which makes BFS a bit faster in practice, since it is an iterative algorithm.
If you can normalize your nodes (number them from 1 to 10 000 and access them by number), then you can easily keep Exists[i] = true if node i is in the graph and false otherwise, giving you O(1) lookup time. Otherwise, consider using a hash table if normalization is not possible or you don't want to do it.
Depth-first search is best because
It uses much less memory
Easier to implement
the depth first and breadth first algorithms are almost identical, except for the use of a stack in one (DFS), a queue in the other (BFS), and a few required member variables. Implementing them both shouldn't take you much extra time.
Additionally if you have an adjacency list of the vertices then your look up with be O(V) anyway. So little to nothing will be gained via using one of the two other searches.
I'd comment on Konrad's post but I can't comment yet so... I'd like to second that it doesn't make a difference in performance if you implement DFS or BFS over a simple linear search through your list. Your search for a particular node in the graph doesn't depend on the structure of the graph, hence it's not necessary to confine yourself to graph algorithms. In terms of coding time, the linear search is the best choice; if you want to brush up your skills in graph algorithms, implement DFS or BFS, whichever you feel like.
If you are searching for a specific vertex and terminating when you find it, I would recommend using A*, which is a best-first search.
The idea is that you calculate the distance from the source vertex to the current vertex you are processing, and then "guess" the distance from the current vertex to the target.
You start at the source, calculate the distance (0) plus the guess (whatever that might be) and add it to a priority queue where the priority is distance + guess. At each step, you remove the element with the smallest distance + guess, do the calculation for each vertex in its adjacency list and stick those in the priority queue. Stop when you find the target vertex.
If your heuristic (your "guess") is admissible, that is, if it's always an under-estimate, then you are guaranteed to find the shortest path to your target vertex the first time you visit it. If your heuristic is not admissible, then you will have to run the algorithm to completion to find the shortest path (although it sounds like you don't care about the shortest path, just any path).
It's not really any more difficult to implement than a breadth-first search (you just have to add the heuristic, really) but it will probably yield faster results. The only hard part is figuring out your heuristic. For vertices that represent geographical locations, a common heuristic is to use an "as-the-crow-flies" (direct distance) heuristic.
Linear search is faster than BFS and DFS. But faster than linear search would be A* with the step cost set to zero. When the step cost is zero, A* will only expand the nodes that are closest to a goal node. If the step cost is zero then every node's path cost is zero and A* won't prioritize nodes with a shorter path. That's what you want since you don't need the shortest path.
A* is faster than linear search because linear search will most likely complete after O(n/2) iterations (each node has an equal chance of being a goal node) but A* prioritizes nodes that have a higher chance of being a goal node.
I am working on a simulation system. I will soon have experimental data (histograms) for the real-world distribution of values for several simulation inputs.
When the simulation runs, I would like to be able to produce random values that match the measured distribution. I'd prefer to do this without storing the original histograms. What are some good ways of
Mapping a histogram to a set of parameters representing the distribution?
Generating values that based on those parameters at runtime?
EDIT: The input data are event durations for several different types of events. I expect that different types will have different distribution functions.
At least two options:
Integrate the histogram and invert numerically.
Rejection
Numeric integration
From Computation in Modern Physics by William R. Gibbs:
One can always numerically integrate [the function] and invert the [cdf]
but this is often not very satisfactory especially if the pdf is changing
rapidly.
You literally build up a table that translates the range [0-1) into appropriate ranges in the target distribution. Then throw your usual (high quality) PRNG and translate with the table. It is cumbersome, but clear, workable, and completely general.
Rejection:
Normalize the target histogram, then
Throw the dice to choose a position (x) along the range randomly.
Throw again, and select this point if the new random number is less than the normalized histogram in this bin. Otherwise goto (1).
Again, simple minded but clear and working. It can be slow for distribution with a lot of very low probability (peaks with long tails).
With both of these methods, you can approximate the data with piecewise polynomial fits or splines to generate a smooth curve if a step-function histogram is not desired---but leave that for later as it may be premature optimization.
Better methods may exist for special cases.
All of this is pretty standard and should appear in any Numeric Analysis textbook if I more detail is needed.
More information about the problem would be useful. For example, what sort of values are the histograms over? Are they categorical (e.g., colours, letters) or continuous (e.g., heights, time)?
If the histograms are over categorical data I think it may be difficult to parameterise the distributions unless there are many correlations between categories.
If the histograms are over continuous data you might try to fit the distribution using mixtures of Gaussians. That is, try to fit the histogram using a $\sum_{i=1}^n w_i N(m_i,v_i)$ where m_i and v_i are the mean and variance. Then, when you want to generate data you first sample an i from 1..n with probability proportional to the weights w_i and then sample an x ~ n(m_i,v_i) as you would from any Gaussian.
Either way, you may want to read more about mixture models.
So it seems that what I want in order to generate a given probablity distribution is a Quantile Function, which is the inverse of the
cumulative distribution function, as #dmckee says.
The question becomes: What is the best way to generate and store a quantile function describing a given continuous histogram? I have a feeling the answer will depend greatly on the shape of the input - if it follows any kind of pattern there should be simplifications over the most general case. I'll update here as I go.
Edit:
I had a conversation this week that reminded me of this problem. If I forgo describing the histogram as an equation, and just store the table, can I do selections in O(1) time? It turns out you can, without any loss of precision, at the cost of O(N lgN) construction time.
Create an array of N items. A uniform random selection into the array will find an item with probablilty 1/N. For each item, store the fraction of hits for which this item should actually be selected, and the index of another item which will be selected if this one is not.
Weighted Random Sampling, C implementation:
//data structure
typedef struct wrs_data {
double share;
int pair;
int idx;
} wrs_t;
//sort helper
int wrs_sharecmp(const void* a, const void* b) {
double delta = ((wrs_t*)a)->share - ((wrs_t*)b)->share;
return (delta<0) ? -1 : (delta>0);
}
//Initialize the data structure
wrs_t* wrs_create(int* weights, size_t N) {
wrs_t* data = malloc(sizeof(wrs_t));
double sum = 0;
int i;
for (i=0;i<N;i++) { sum+=weights[i]; }
for (i=0;i<N;i++) {
//what percent of the ideal distribution is in this bucket?
data[i].share = weights[i]/(sum/N);
data[i].pair = N;
data[i].idx = i;
}
//sort ascending by size
qsort(data,N, sizeof(wrs_t),wrs_sharecmp);
int j=N-1; //the biggest bucket
for (i=0;i<j;i++) {
int check = i;
double excess = 1.0 - data[check].share;
while (excess>0 && i<j) {
//If this bucket has less samples than a flat distribution,
//it will be hit more frequently than it should be.
//So send excess hits to a bucket which has too many samples.
data[check].pair=j;
// Account for the fact that the paired bucket will be hit more often,
data[j].share -= excess;
excess = 1.0 - data[j].share;
// If paired bucket now has excess hits, send to new largest bucket at j-1
if (excess >= 0) { check=j--;}
}
}
return data;
}
int wrs_pick(wrs_t* collection, size_t N)
//O(1) weighted random sampling (after preparing the collection).
//Randomly select a bucket, and a percentage.
//If the percentage is greater than that bucket's share of hits,
// use it's paired bucket.
{
int idx = rand_in_range(0,N);
double pct = rand_percent();
if (pct > collection[idx].share) { idx = collection[idx].pair; }
return collection[idx].idx;
}
Edit 2:
After a little research, I found it's even possible to do the construction in O(N) time. With careful tracking, you don't need to sort the array to find the large and small bins. Updated implementation here
If you need to pull a large number of samples with a weighted distribution of discrete points, then look at an answer to a similar question.
However, if you need to approximate some continuous random function using a histogram, then your best bet is probably dmckee's numeric integration answer. Alternatively, you can use the aliasing, and store the point to the left, and pick a uniform number between the two points.
To choose from a histogram (original or reduced),
Walker's alias method
is fast and simple.
For a normal distribution, the following may help:
http://en.wikipedia.org/wiki/Normal_distribution#Generating_values_for_normal_random_variables
I've read in one of my AI books that popular algorithms (A-Star, Dijkstra) for path-finding in simulation or games is also used to solve the well-known "15-puzzle".
Can anyone give me some pointers on how I would reduce the 15-puzzle to a graph of nodes and edges so that I could apply one of these algorithms?
If I were to treat each node in the graph as a game state then wouldn't that tree become quite large? Or is that just the way to do it?
A good heuristic for A-Star with the 15 puzzle is the number of squares that are in the wrong location. Because you need at least 1 move per square that is out of place, the number of squares out of place is guaranteed to be less than or equal to the number of moves required to solve the puzzle, making it an appropriate heuristic for A-Star.
A quick Google search turns up a couple papers that cover this in some detail: one on Parallel Combinatorial Search, and one on External-Memory Graph Search
General rule of thumb when it comes to algorithmic problems: someone has likely done it before you, and published their findings.
This is an assignment for the 8-puzzle problem talked about using the A* algorithm in some detail, but also fairly straightforward:
http://www.cs.princeton.edu/courses/archive/spring09/cos226/assignments/8puzzle.html
The graph theoretic way to solve the problem is to imagine every configuration of the board as a vertex of the graph and then use a breath-first search with pruning based on something like the Manhatten Distance of the board to derive a shortest path from the starting configuration to the solution.
One problem with this approach is that for any n x n board where n > 3 the game space becomes so large that it is not clear how you can efficiently mark the visited vertices. In other words there is no obvious way to assess if the current configuration of the board is identical to one that has previously been discovered through traversing some other path. Another problem is that the graph size grows so quickly with n (it's approximately (n^2)!) that it is just not suitable for a brue-force attack as the number of paths becomes computationally infeasible to traverse.
This paper by Ian Parberry A Real-Time Algorithm for the (n^2 − 1) - Puzzle describes a simple greedy algorithm that iteritively arrives at a solution by completing the first row, then the first column, then the second row... It arrives at a solution almost immediately, however the solution is far from optimal; essentially it solves the problem the way a human would without leveraging any computational muscle.
This problem is closely related to that of solving the Rubik's cube. The graph of all game states it too large to solve by brue force, but there is a fairly simple 7 step method that can be used to solve any cube in about 1 ~ 2 minutes by a dextrous human. This path is of course non-optimal. By learning to recognise patterns that define sequences of moves the speed can be brought down to 17 seconds. However, this feat by Jiri is somewhat superhuman!
The method Parberry describes moves only one tile at a time; one imagines that the algorithm could be made better up by employing Jiri's dexterity and moving multiple tiles at one time. This would not, as Parberry proves, reduce the path length from n^3, but it would reduce the coefficient of the leading term.
Remember that A* will search through the problem space proceeding down the most likely path to goal as defined by your heurestic.
Only in the worst case will it end up having to flood fill the entire problem space, this tends to happen when there is no actual solution to your problem.
Just use the game tree. Remember that a tree is a special form of graph.
In your case the leaves of each node will be the game position after you make one of the moves that is available at the current node.
Here you go http://www.heyes-jones.com/astar.html
Also. be mindful that with the A-Star algorithm, at least, you will need to figure out a admissible heuristic to determine whether a possible next step is closer to the finished route than another step.
For my current experience, on how to solve an 8 puzzle.
it is required to create nodes. keep track of each step taken
and get the manhattan distance from each following steps, taking/going to the one with the shortest distance.
update the nodes, and continue until reaches the goal