Currently I am developing a simple game that implements the alpha-beta pruning algorithm, but it is quite slow when the game board is big. My first thought was to break the alpha-beta search down into 2-3 mini alpha-beta searches to account for 2-3 different kinds of moves. My problem is that the computer I am working on has only one CPU with a single core. Do you think that multithreading will improve its performance?
Alpha-beta is a sequential algorithm, so dividing the tree is not a good solution. To speed up the search you must have good move ordering. Another improvement is to use a hash table to cache moves; then, with the hash table in place, you can use the Lazy SMP multithreaded algorithm.
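To make the move-ordering and hash-table suggestions concrete, here is a minimal sketch in Python. It assumes a hypothetical toy game (take 1, 2 or 3 stones from a pile, taking the last stone wins) in place of your real game, and the names `moves`, `negamax` and `table` are mine, not from any library. Both ideas reduce the number of nodes searched, so they can help even on a single core.

```python
EXACT, LOWER, UPPER = 0, 1, 2  # kind of bound stored in the table

def moves(pile):
    # Hypothetical toy game: remove 1, 2 or 3 stones; taking the last stone wins.
    return [m for m in (1, 2, 3) if m <= pile]

def negamax(pile, alpha, beta, table):
    """Return +1 if the side to move wins from `pile`, -1 if it loses."""
    if pile == 0:
        return -1  # the opponent just took the last stone
    alpha_orig = alpha
    best_first = None
    entry = table.get(pile)
    if entry is not None:
        value, flag, best_first = entry
        if flag == EXACT:
            return value
        if flag == LOWER:
            alpha = max(alpha, value)
        else:  # UPPER
            beta = min(beta, value)
        if alpha >= beta:
            return value
    # Move ordering: try the move remembered in the table first.
    ordered = sorted(moves(pile), key=lambda m: m != best_first)
    best_value, best_move = -2, None
    for m in ordered:
        value = -negamax(pile - m, -beta, -alpha, table)
        if value > best_value:
            best_value, best_move = value, m
        alpha = max(alpha, value)
        if alpha >= beta:
            break  # beta cutoff: the remaining moves cannot change the result
    if best_value <= alpha_orig:
        flag = UPPER
    elif best_value >= beta:
        flag = LOWER
    else:
        flag = EXACT
    table[pile] = (best_value, flag, best_move)
    return best_value

table = {}
print(negamax(10, -1, 1, table), table[10][2])
```

In a real game the table key would be a hash of the position, and each entry would also record the search depth it was computed at.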
I'm trying to implement the MCTS algorithm for a game. I can only use around 0.33 seconds per move. In this time I can generate one or two games per child from the start state, which has around 500 child nodes. My simulations aren't random, but of course I can't make a good choice based on 1 or 2 simulations. Further into the game the tree becomes smaller and my choices are based on more simulations.
So my problem is in the first few moves. Is there a way to improve the MCTS algorithm so it can simulate more games or should I use another algorithm?
Is it possible to come up with some heuristic evaluation function for states? I realise that one of the primary benefits of MCTS is that in theory you wouldn't need this, BUT: if you can create a somewhat reasonable evaluation function anyway, this will allow you to stop simulations early, before they reach a terminal game state. Then you can back-up the evaluation of such a non-terminal game state instead of just a win or a loss. If you stop your simulations early like this, you may be able to run more simulations (because every individual simulation takes less time).
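As a rough illustration of stopping simulations early, here is a sketch in Python. Everything in it is an assumption made for the example: the `TakeAwayGame` class is a hypothetical stand-in for your game, its `evaluate` method is a made-up heuristic, and values are expressed from player 0's point of view.

```python
import random

class TakeAwayGame:
    """Hypothetical toy game: take 1 or 2 stones; taking the last stone wins.
    A state is (pile, player_to_move)."""

    def legal_moves(self, state):
        return [m for m in (1, 2) if m <= state[0]]

    def apply(self, state, move):
        pile, player = state
        return (pile - move, 1 - player)

    def is_terminal(self, state):
        return state[0] == 0

    def result(self, state):
        # The player who just moved took the last stone; score for player 0.
        return 1.0 if state[1] == 1 else 0.0

    def evaluate(self, state):
        # Made-up heuristic: the mover is favoured unless the pile is a multiple of 3.
        pile, player = state
        favoured = player if pile % 3 != 0 else 1 - player
        return 0.75 if favoured == 0 else 0.25

def truncated_rollout(state, game, max_steps, rng):
    """Play at most max_steps random moves, then score the position reached."""
    for _ in range(max_steps):
        if game.is_terminal(state):
            return game.result(state)   # exact win/loss
        state = game.apply(state, rng.choice(game.legal_moves(state)))
    if game.is_terminal(state):
        return game.result(state)
    return game.evaluate(state)         # heuristic estimate backed up instead

rng = random.Random(0)
print(truncated_rollout((20, 0), TakeAwayGame(), 3, rng))
```

The value returned by `truncated_rollout` is what you would back up through the tree in place of a plain win or loss; a smaller `max_steps` buys more simulations per move at the price of leaning harder on the heuristic.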
Apart from that, you'll want to try to find ways to "generalize". If you run one simulation, you should try to see if you can also extract some useful information from that simulation for other nodes in the tree which you didn't go through. Examples of enhancements you may want to consider in this spirit are AMAF, RAVE, Progressive History, and the N-Gram Selection Technique.
Do you happen to know where the bottleneck is for your performance? You could investigate this using a profiler. If most of your processing time is spent in functions related to the game (move generation, advancing from one state to the next, etc.), you know for sure that you're going to be limited in the number of simulations you can do. You should then try to implement enhancements that make each individual simulation as informative as possible. This can for example mean using really good, computationally expensive evaluation functions. If the game code itself already is very well optimized and fast, moving extra computation time into things like evaluation functions will be more harmful to your simulation count and probably pay off less.
For more on this last idea, it may be interesting to have a look through some stuff I wrote on my MCTS-based agent in General Video Game AI, which is also a real-time environment with a very computationally expensive game, meaning that simulation counts are severely constrained (but the branching factor is much, much smaller than it seems to be in your case). PDF files of my publications on this are also available online.
In Algorithm Design: Foundations, Analysis, and Internet Examples by Michael T. Goodrich and Roberto Tamassia, the last paragraph of section 2.5.5, Collision-Handling Schemes, says:
These open addressing schemes save some space over the separate
chaining method, but they are not necessarily faster. In experimental
and theoretical analysis, the chaining method is either competitive or
faster than the other methods, depending upon the load factor of the
methods.
But regarding speed, a previous SO answer says the exact opposite.
Linear probing wins when the load factor α = n/m is small, that is, when the number of elements is small compared to the number of slots. Exactly the reverse happens when the load factor tends to 1. The table becomes saturated and almost every operation has to walk through long runs of occupied slots, so the cost blows up (roughly as 1/(1-α)^2 for an unsuccessful search) rather than growing linearly. Chaining, on the other hand, still grows linearly with the load factor.
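To put numbers on this, the classical textbook approximations (Knuth's analysis, assuming uniformly distributed hash values) can be tabulated with a few lines of Python. The function names are mine, and these are expected probe counts only; they say nothing about cache behaviour, which in practice often favours linear probing at low load factors.

```python
def linear_probing_probes(alpha):
    """Approximate expected probes (successful, unsuccessful) for linear probing."""
    hit = 0.5 * (1 + 1 / (1 - alpha))
    miss = 0.5 * (1 + 1 / (1 - alpha) ** 2)
    return hit, miss

def chaining_probes(alpha):
    """Approximate expected probes (successful, unsuccessful) for separate
    chaining, counting the bucket access itself as one probe."""
    hit = 1 + alpha / 2
    miss = 1 + alpha
    return hit, miss

for alpha in (0.25, 0.5, 0.75, 0.9, 0.99):
    lp_hit, lp_miss = linear_probing_probes(alpha)
    ch_hit, ch_miss = chaining_probes(alpha)
    print(f"alpha={alpha:<5} linear probing: {lp_hit:8.2f} {lp_miss:10.2f}   "
          f"chaining: {ch_hit:5.2f} {ch_miss:5.2f}")
```

At α = 0.5 the two schemes are close (2.5 versus 1.5 probes for a miss); the gap only becomes dramatic as α approaches 1, which is consistent with both the book's statement and the answer above.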
Larger networks with large data sets take a while to train. It would be awesome if there was a way to share the computing time across multiple machines. However, the issue with that is that when a neural network is training, the weights are constantly being altered every iteration, and each iteration is more or less based on the last -- which makes the idea of distributed computing at the very least a challenge.
I've thought that for each portion of the network, the server could send maybe 1000 sets of data to train a network on... but... you'd have roughly the same computing time, as I wouldn't be able to train on different sets of data simultaneously (which is what I want to do).
But even if I could split up the network's training into blocks of different data sets to train on, how would I know when I'm done with that set of data? Especially if the amount of data sent to the client machine isn't enough to achieve the desired error?
I welcome all ideas.
Quoting http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation:
When multicore computers are used multithreaded techniques can greatly decrease the amount of time that backpropagation takes to converge. If batching is being used, it is relatively simple to adapt the backpropagation algorithm to operate in a multithreaded manner.
The training data is broken up into equally large batches for each of the threads. Each thread executes the forward and backward propagations. The weight and threshold deltas are summed for each of the threads. At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network.
which is essentially what other answers here describe.
Depending on your ANN model, you can exploit some parallelism on multiple machines by running the same model with the same training and validation data on each machine, but with different ANN properties (initial values, ANN parameters, noise, etc.) for the different runs.
I used to do this a lot to make sure I'd explored the problem space effectively and wasn't stuck in local minima etc. This is a very easy way to take advantage of multiple machines without having to recode your algorithm. Just another approach you might want to consider.
My assumption is you have more than 1 training set, and you have a gold standard. Also, I assume you have some way of storing the state of the neural network (whether it's a list of probability weights for each node, or something along those lines).
Using as many compute nodes in a cluster as you can, launch the program on a data set on each node. Save the results for each, and test them on the gold standard. Whichever neural network state performs best becomes the input for the next round of training. Repeat as much as you see fit.
If I understand correctly, you're trying to figure out a way to train an ANN on a cluster of machines? As you stated, partitioning the network isn't the right approach, and as far as I know, is seemingly unfeasible for most models. A possible approach might be to partition the training sets and run local copies of your network, and then merge the results. An intuitive way to do this and gain some validation along the way would be with cross-validation. As you stated, knowing when the network has had the right amount of training is a problem, but that variability is a problem inherent to neural nets in general, not in parallelizing the work.
As you also stated, the updates that happen during each iteration of training are dependent on the current state of the weights, but without mixing up training sets/validation, you're likely overfitting. This is why CV is nice, because your training sets will all get a chance to play a role in the training, and the validating, across multiple samples.
If you do batch training, the weights are only altered after you have been through the entire dataset. You can compute the weight update vector for each data point in the set on a separate machine/core and add them up at the end, then proceed with the next epoch.
Here is a link to a question about batch training.
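A minimal sketch of this batch scheme in Python, under loud assumptions: a one-weight linear model stands in for the network, worker threads stand in for the machines/cores, and the names `shard_gradient` and `train` are made up for the example. (For CPU-bound work in pure Python you would normally use processes or separate machines rather than threads; the structure stays the same.)

```python
from concurrent.futures import ThreadPoolExecutor

def shard_gradient(w, b, shard):
    """Summed squared-error gradient of y = w*x + b over one shard of (x, y) pairs."""
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += 2 * err * x
        gb += 2 * err
    return gw, gb

def train(data, workers=4, epochs=200, lr=0.5):
    w = b = 0.0
    shards = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(epochs):
            # Every worker uses the same frozen weights for the whole epoch.
            grads = list(pool.map(shard_gradient,
                                  [w] * workers, [b] * workers, shards))
            # Synchronisation point: add up the partial gradients, then update once.
            gw = sum(g[0] for g in grads) / len(data)
            gb = sum(g[1] for g in grads) / len(data)
            w -= lr * gw
            b -= lr * gb
    return w, b

data = [(k / 10, 3 * k / 10 + 1) for k in range(11)]  # samples of y = 3x + 1
print(train(data))
```

Because the partial gradients are simply added, the result matches what a single machine would compute on the whole batch, up to floating-point rounding; the price is that all workers must wait for each other once per epoch.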
I need insight on how one would go about estimating the computing power he might need to run a program and projecting into the future when the project expands.
Put another way, "How well will my application scale?"
1) Identify the slowest component in the system. Determine what its complexity is (e.g. O(n log n) or O(n^2)).
2) How big is my data now, at what rate will it grow?
3) Identify whether my application can be broken into separately run components
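Steps 1 and 2 can be combined into a back-of-the-envelope projection. The sketch below is only that: the function name, the growth rate and the example sizes are invented for illustration, and it assumes the slowest component dominates the total running time.

```python
import math

def projected_cost_ratio(n_now, growth_per_year, years, cost):
    """How many times more work the slowest component does after `years`,
    given today's data size and a yearly growth rate (1.0 means +100%)."""
    n_future = n_now * (1 + growth_per_year) ** years
    return cost(n_future) / cost(n_now)

# Example: data doubling every year, projected three years ahead.
for name, cost in [("O(n)", lambda n: n),
                   ("O(n log n)", lambda n: n * math.log(n)),
                   ("O(n^2)", lambda n: n * n)]:
    print(name, round(projected_cost_ratio(1000, 1.0, 3, cost), 1))
```

If the ratio comes out larger than the hardware you can reasonably add, step 3 (splitting the application into separately run components) becomes the deciding factor.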
I have to implement an algorithm to decompose 3D volumes into voxels. The algorithm starts by identifying which vertices are on each side of the cutting plane and, in a second step, which edges cross the cutting plane.
This process could be optimized by taking advantage of sorted lists. Identifying the split point is O(log n). But I have to maintain one such sorted list per axis, and this for both vertices and edges. Since this is to be run on a GPU, I also have some constraints on memory management (i.e. CUDA). Intrusive lists/trees and C are imposed.
With a complete "voxelization" I expect to end up with ~4000 points and 12000 edges. Fortunately this can be optimized by using a smarter strategy to get rid of processed voxels and to order the cutting of residual volumes so as to keep their number to a minimum. In this case I would expect to have fewer than 100 points and 300 edges. This makes the process more complex to manage but could end up being more efficient.
The question is thus to help me identify the criteria to determine when the benefit of using a sorted data structure is worth the effort and complexity overhead compared to simple intrusive linked lists.
chmike, this really sounds like the sort of thing you want to do the simpler way first, and see how it behaves. Any sort of GPU voxelization approach is pretty fragile to system details, at least once you get into big volumes (which you don't seem to have). In your shoes I'd definitely want the straightforward implementation first, if for no other reason than to check against....
The question will ALWAYS boil down to which operation is most common: accessing or adding.
If you have an unordered list, adding to it takes no time, and accessing particular items takes extra time.
If you have a sorted list, adding to it takes more time, but accessing it is quicker.
Most applications spend most of their time accessing the data, rather than adding to it, which means that the (running) time overhead of creating a sorted list will usually be balanced or covered by the time saved in accessing the list.
If there is a lot of churn in your data (which it doesn't sound like there is) then maintaining a sorted list isn't necessarily advisable, because you will be constantly re-sorting the list at considerable CPU cost.
The complexity of the data structures only matters if they cannot be sorted in a useful way. If they can be sorted, then you'll have to go by the heuristic of
number of accesses:number of changes
to determine if sorting is a good idea.
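As a very rough way to apply that ratio, here is a toy cost model in Python. The per-operation costs are assumptions I picked for illustration (linear scan versus binary search, append versus shifting insert in a sorted array); they ignore constant factors, which matter a lot at sizes like 100 points, and they do not model GPU memory behaviour at all.

```python
import math

def cheaper_structure(n, accesses, changes):
    """Compare estimated total work for n elements under a simple model:
    unsorted list: lookup ~ n/2 steps, insert ~ 1 step
    sorted array:  lookup ~ log2(n) steps, insert ~ log2(n) + n/2 steps."""
    unsorted_cost = accesses * (n / 2) + changes * 1
    sorted_cost = accesses * math.log2(n) + changes * (math.log2(n) + n / 2)
    return "sorted" if sorted_cost < unsorted_cost else "unsorted"

print(cheaper_structure(100, accesses=1000, changes=10))
print(cheaper_structure(100, accesses=10, changes=1000))
```

Under this model the break-even for 100 elements sits a little above one access per change, so a workload dominated by lookups favours sorting and a workload dominated by insertions and removals does not.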
After considering all the answers, I found that the latter method, intended to avoid duplicate computation, would end up being less efficient because of the effort needed to maintain and navigate the data structure. Besides, the initial method is straightforward to parallelize with a few small kernel routines and is thus more appropriate for a GPU implementation.
Looking back at my initial method, I also found significant optimization opportunities that leave the volume cut method well behind.
Since I had to pick one answer I chose devinb's because he answered the question, but Simon's comment, backed up by Tobias Warre's comment, was just as valuable to me.
Thanks to all of you for helping me sort out this issue.
Stack Overflow is an impressive service.