Fastest way to generate next move in TIC TAC TOE game - c

In a X's and 0's game (i.e. TIC TAC TOE(3X3)) if you write a program for this give a fast way to generate the moves by the computer. I mean this should be the fastest way possible.
All I could think of at that time is to store all the board configurations in a hash so that getting best position of move is a O(1) operation.
Each board square can be either 0,1, or 2.
0 represents empty square. 1 represents a X & 2 represents 0.
So every square can hold any of the three values, giving at most 3^9 board configurations.
Put simply, we need a hash of size 3^9. For hashing, we can use a base-3 representation: each board becomes a 9-digit base-3 number, with one digit per square.
To search the hash, we need the decimal value of this 9-digit number.
Now, each square can be associated with row number & column number. In order to identify each square uniquely, we can again make use of base 3 representation.
say SQ[1][2] will be 12 in base 3 which is equivalent to 5 in decimal.
Thus, we have effectively designed an algorithm which is fast enough to calculate the next move in O(1).
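A minimal sketch of the base-3 indexing described above, written in Python for brevity. The names encode and square_id are illustrative only, and the board is assumed to be a 3x3 list of lists holding 0, 1 or 2:

```python
def encode(board):
    """Map a 3x3 board (cells are 0, 1 or 2) to its base-3 index in [0, 3**9)."""
    index = 0
    for row in board:
        for cell in row:
            index = index * 3 + cell  # append one base-3 digit per square
    return index


def square_id(row, col):
    """Read (row, col) as a two-digit base-3 number, e.g. SQ[1][2] -> 5."""
    return row * 3 + col


board = [[1, 0, 0],
         [0, 2, 0],
         [0, 0, 0]]
print(encode(board), square_id(1, 2))
```

The resulting index can be used directly as an array subscript, so no real hashing is needed.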
But, the interviewer insisted in reducing the space complexity as DOS system doesn't have that much amount of memory.
How can we reduce the space complexity with no change in time complexity?
Please help me so that I do not miss such type of questions in the future.

For a small game like this, a different way of going about this is to pre-compute and store the potential game tree in a table.
Looking first at the situation where the human starts, she obviously has 9 different start positions. A game-play table would then contain 9 entry points, each pointing to the correct response (you could use the guidelines outlined in this question to calculate the responses) as well as to the next-level table of human responses. This time there are only 7 possible responses. For the next level there'll be 5, then 3, then just 1. In total, there will be 9 * 7 * 5 * 3 * 1 = 945 entries in the table, but that can be compressed by exploiting symmetries, i.e. rotations and flipped colors.
Of course, the situation where the computer starts is similar in principle but the table is actually smaller because the computer will probably want to start by playing the middle piece - or at least avoid certain spots.

There are not 3^9 different board configurations. Just as tomdemuyt says, there are 9! different board configurations, i.e., 9 choices at first, 8 choices next, 7 choices after that, and so on.
Also, we can further reduce the space complexity by accounting for symmetry. For example, for the first move, placing an X in [0,0] is the same as placing it in [0,2], [2,0], and [2,2]. I believe this reduces 9! to 9!/4
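One way to exploit the symmetry, sketched in Python under the assumption that a board is a 9-character sequence in row-major order ('X', 'O' or '.'). The helper names rotate, mirror and canonical are made up for this illustration. Every board is mapped to the smallest of its 8 rotated and mirrored variants, so equivalent boards share one table entry:

```python
def rotate(b):
    """Rotate a row-major 3x3 board by 90 degrees."""
    return tuple(b[(2 - c) * 3 + r] for r in range(3) for c in range(3))


def mirror(b):
    """Flip a row-major 3x3 board left to right."""
    return tuple(b[r * 3 + (2 - c)] for r in range(3) for c in range(3))


def canonical(board):
    """Return the smallest of the 8 symmetric variants of the board."""
    b = tuple(board)
    variants = []
    for _ in range(4):
        b = rotate(b)
        variants.append(b)
        variants.append(mirror(b))
    return min(variants)


def single_x(square):
    """Board with one X on the given square (0..8) and nothing else."""
    return tuple("X" if i == square else "." for i in range(9))


# The four corners collapse into one canonical board.
print(len({canonical(single_x(s)) for s in (0, 2, 6, 8)}))
```

With this, the nine possible first moves reduce to three classes: corner, edge and center.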
We can even reduce that by accounting for which board configurations were winning before the final move (the 9th move). I don't know the number, but a detailed explanation can be found on the Stack Overflow cousin http://en.wikipedia.org/wiki/Tic-tac-toe

The assumption of 3^9 is wrong. This would include for example a board that only has X which is impossible as both players place each turn an X or an O.
My initial thought was that there are (9*8*7*6*5*4*3*2) * 2 possibilities.
First player has 9 choices, second player has 8 choices, first player has 7 etc.
I put * 2 because you might have different best moves depending who starts.
Now 3^9 is 19683 and 9! is 362880, so clearly this is not the superior solution; a lot of 'different scenarios' will actually end up looking exactly the same. Still, the basic idea that many of the 19683 board setups are invalid remains.
This piece of code which probably could be replaced by a simple formula tells me that this is the count of positions you want to have a move for.
<script>
a = permuteString( "X........" ); document.write( Object.keys(a).length + "<br>" );console.log( a );
a = permuteString( "XO......." ); document.write( Object.keys(a).length + "<br>" );console.log( a );
a = permuteString( "XOX......" ); document.write( Object.keys(a).length + "<br>" );console.log( a );
a = permuteString( "XOXO....." ); document.write( Object.keys(a).length + "<br>" );console.log( a );
a = permuteString( "XOXOX...." ); document.write( Object.keys(a).length + "<br>" );console.log( a );
a = permuteString( "XOXOXO..." ); document.write( Object.keys(a).length + "<br>" );console.log( a );
a = permuteString( "XOXOXOX.." ); document.write( Object.keys(a).length + "<br>" );console.log( a );
//Subset of the Array.prototype.splice() functionality for a string
function spliceString( s , i )
{
  var a = s.split("");
  a.splice( i , 1 );
  return a.join("");
}
//Permute the possibilities, throw away equivalencies
function permuteString( s )
{
  //Holds result
  var result = {};
  //Sanity
  if( s.length < 2 ) return {};
  //The atomic case, if AB is given return { AB : true , BA : true }
  if( s.length == 2 )
  {
    result[s] = true;
    result[s.charAt(1)+s.charAt(0)] = true;
    return result;
  }
  //Enumerate
  for( var head = 0 ; head < s.length ; head++ )
  {
    var o = permuteString( spliceString( s , head ) );
    for ( var key in o )
      result[ s.charAt( head ) + key ] = true;
  }
  return result;
}
</script>
This gives the following numbers:
1st move : 9
2nd move : 72
3rd move : 252
4th move : 756
5th move : 1260
6th move : 1680
7th move : 1260
So in total 5289 moves, this is without even checking for already finished games or symmetry.
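The "simple formula" is most likely the multinomial coefficient: the number of distinct arrangements of x X's, o O's and e empty squares is 9! / (x! * o! * e!). A short Python check, assuming X moves first (the function name positions_after is made up here):

```python
from math import factorial


def positions_after(moves):
    """Count distinct boards after the given number of moves, X moving first."""
    x = (moves + 1) // 2      # X has made the extra move on odd counts
    o = moves // 2
    empty = 9 - moves
    return factorial(9) // (factorial(x) * factorial(o) * factorial(empty))


counts = [positions_after(m) for m in range(1, 8)]
print(counts, sum(counts))
```

This reproduces the seven counts listed above without enumerating any permutations.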
These numbers allow you to lookup a move through an array, you can generate this array yourself by looping over all possible games.
T.

The game of Tic Tac Toe is sufficiently simple that an optimal algorithm may be implemented by a machine built from Tinker Toys (a brand of sticks and fasteners). Since the level of hardware complexity encapsulated by such a construction is below that of a typical 1970's microprocessor, the time required to find out what moves have been made would in most cases exceed the time required to figure out the next move. Probably the simplest approach would be to have a table which, given the presence or absence of markers of a given player (2^9, or 512 entries), would indicate which squares would turn two-in-a-rows into three-in-a-rows. Start by doing a lookup with the pieces owned by the player on move; if any square which would complete a three-in-a-row is not taken by the opponent, take it. Otherwise look up the opponent's combination of pieces; any square it turns up that isn't already occupied must be taken. Otherwise, if the center is available, take it; if only the center is taken, take a corner. Otherwise take an edge.
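A rough Python sketch of that 512-entry table. It assumes each player's pieces are stored as a 9-bit mask with bit s set when square s (row-major, 0 to 8) is held. The names TABLE and next_move, and the final fallback to any free square, are additions for this illustration and not part of the description above:

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]
CENTER, CORNERS, EDGES = 4, (0, 2, 6, 8), (1, 3, 5, 7)


def completing_squares(mask):
    """Bitmask of squares that would turn a two-in-a-row of `mask` into three."""
    out = 0
    for line in LINES:
        missing = [s for s in line if not (mask >> s) & 1]
        if len(missing) == 1:
            out |= 1 << missing[0]
    return out


TABLE = [completing_squares(m) for m in range(512)]  # precomputed once


def next_move(own, opp):
    """Pick a square using the lookup order described in the text."""
    occupied = own | opp
    for candidates in (TABLE[own], TABLE[opp]):   # win first, then block
        free = candidates & ~occupied
        if free:
            return (free & -free).bit_length() - 1  # lowest set bit
    if not (occupied >> CENTER) & 1:
        return CENTER
    if occupied == 1 << CENTER:
        return CORNERS[0]
    for s in EDGES + CORNERS:   # edges first; corners are an assumed fallback
        if not (occupied >> s) & 1:
            return s
    return None                 # board is full


print(next_move(0b000000011, 0b000011000))
```

Each move then costs two table lookups plus a few bit operations, and the table needs 512 small entries instead of 3^9.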
It might be more interesting to open up your question to 4x4x4 Tic Tac Toe, since that represents a sufficient level of complexity that 1970's-era computer implementations would often take many seconds per move. While today's computers are thousands of times faster than e.g. the Atari 2600, the level of computation at least gets beyond trivial.
If one extends the game to 4x4x4, there will be many possibilities for trading off speed, RAM, and code space. Unlike the original game which has 8 winning lines, the 4x4x4 version has (IIRC) 76. If one keeps track of each line as being in one of 8 states [ten if one counts wins], and for each vacant square one keeps track of how many of the winning lines that pass through it are in what states, it should be possible to formulate some pretty fast heuristics based upon that information. It would probably be necessary to use an exhaustive search algorithm to ensure that heuristics would in fact win, but once the heuristics were validated they should be able to run much faster than would an exhaustive search.
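The figure of 76 can be checked by brute force. The Python sketch below (the function name winning_lines is an assumption for illustration) walks from every cell in every direction and keeps each full-length line once:

```python
from itertools import product


def winning_lines(n=4):
    """Return the set of winning lines of an n x n x n board."""
    directions = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    lines = set()
    for start in product(range(n), repeat=3):
        for d in directions:
            cells = [tuple(s + k * step for s, step in zip(start, d))
                     for k in range(n)]
            if all(0 <= c < n for cell in cells for c in cell):
                lines.add(frozenset(cells))  # same line from both ends counts once
    return lines


print(len(winning_lines(4)))
```

The same set of lines could serve as the per-line state table mentioned above.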

Related

Binary search modification

I have been attempting to solve the following problem. I have a sequence of positive
integer numbers which can be very long (several millions of elements). This
sequence can contain "jumps" in the element values. A jump
means that two consecutive elements differ from each other by more than 1.
Example 01:
1 2 3 4 5 6 7 0
In the above mentioned example the jump occurs between 7 and 0.
I have been looking for some effective algorithm (from time point of view) for
finding of the position where this jump occurs. This issue is complicated by the
fact that there can be a situation when two jumps are present and one of them
is the jump which I am looking for and the other one is a wrap-around which I
am not looking for.
Example 02:
9 1 2 3 4 6 7 8
Here the first jump between 9 and 1 is a wrap-around. The second jump between
4 and 6 is the jump which I am looking for.
My idea is to somehow modify the binary search algorithm but I am not sure whether it is possible due to the wrap-around presence. It is worthwhile to say that only two jumps can occur in maximum and between these jumps the elements are sorted. Does anybody have any idea? Thanks in advance for any suggestions.
You cannot find an efficient solution (efficient meaning better than O(n), i.e. without looking at all numbers) since you cannot conclude anything about your numbers by looking at fewer than all of them. For example, if you only look at every second number (still O(n), but with a better constant factor) you would miss double jumps like this: 1 5 3. You can and must look at every single number and compare it to its neighbours. You could split your workload and use a multicore approach, but that's about it.
Update
If you have the special case that there is only 1 jump in your list and the rest is sorted (e.g. 1 2 3 7 8 9), you can find this jump rather efficiently. You cannot use vanilla binary search, since the list might not be fully sorted and you don't know what number you are searching for, but you could use an adaptation of exponential search, which bears some resemblance.
We need the following assumptions for this algorithm to work:
There is only 1 jump (I ignore the "wrap-around jump" since it is not technically between any two consecutive elements)
The list is otherwise sorted and strictly monotonically increasing
With these assumptions we are now basically searching for an interruption in our monotonicity. That means we are searching for the case where two elements a and b are n positions apart but do not fulfil b = a + n. The equality must hold if there is no jump between the two elements. Now you only need to find elements which do not fulfil it without scanning linearly, hence the exponential approach. This pseudocode could be such an algorithm:
let numbers be an array of length n fulfilling our assumptions
start = 0
stepsize = 1
while (start < n-1)
    // never step past the last element
    while (start + stepsize > n-1)
        stepsize -= 1
    while (numbers[start + stepsize] != numbers[start] + stepsize)
        // the jump must be between start and start + stepsize
        if (stepsize == 1)
            // congratulations, the jump is between start and start + 1
            return start
        else
            stepsize /= 2 // integer division
    start += stepsize
    stepsize *= 2
no jump found
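For reference, a runnable Python version of the same idea, under the same assumptions (a single jump, and every other pair of neighbours differs by exactly 1). The function name find_jump is made up for this sketch:

```python
def find_jump(numbers):
    """Return index i such that the jump lies between i and i + 1, else None."""
    n = len(numbers)
    start, step = 0, 1
    while start < n - 1:
        step = min(step, n - 1 - start)          # never step past the end
        while numbers[start + step] != numbers[start] + step:
            if step == 1:
                return start                     # jump is right after start
            step //= 2                           # jump is inside the window
        start += step                            # window was clean, move on
        step *= 2                                # and grow the window again
    return None


print(find_jump([1, 2, 3, 7, 8, 9]))
```

The window grows while it is clean and shrinks once it contains the jump, so only a logarithmic number of elements is inspected when the assumptions hold.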

Find an element in an array, but the element can jump

There is an array where all but one of the cells are 0, and we want to find the index of that single non-zero cell. The problem is, every time that you check for a cell in this array, that non-zero element will do one of the following:
move forward by 1
move backward by 1
stay where it is.
For example, if that element is currently at position 10, and I check what is in arr[5], then the element may be at position 9, 10 or 11 after I checked arr[5].
We only need to find the position where the element is currently at, not where it started at (which is impossible).
The hard part is, if we write a for loop, there really is no way to know if the element is currently in front of you, or behind you.
Some more context if it helps:
The interviewer did give a hint which is maybe I should move my pointer back after checking x-number of cells. The problem is, when should I move back, and by how many slots?
While "thinking out loud", I started saying a bunch of common approaches hoping that something would hit. When I said recursion, the interviewer did say "recursion is a good start". I don't know recursion really is the right approach, because I don't see how I can do recursion and #1 at the same time.
The interviewer said this problem can't be solved in O(n^2). So we are looking at at least O(n^3), or maybe even exponential.
Tl;dr: Your best bet is to keep checking each even index in the array in turn, wrapping around as many times as necessary until you find your target. On average you will stumble upon your target in the middle of your second pass.
First off, as many have already said, it is indeed impossible to ensure you will find your target element in any given amount of time. If the element knows where your next sample will be, it can always place itself somewhere else just in time. The best you can do is to sample the array in a way that minimizes the expected number of accesses - and because after each sample you learn nothing except if you were successful or not and a success means you stop sampling, an optimal strategy can be described simply as a sequence of indexes that should be checked, dependent only on the size of the array you're looking through. We can test each strategy in turn via automated means to see how well they perform. The results will depend on the specifics of the problem, so let's make some assumptions:
The question doesn't specify the starting position of our target. Let us assume that the starting position is chosen uniformly from across the entire array.
The question doesn't specify the probability that our target moves. For simplicity let's say it's independent of parameters such as the current position in the array, time passed and the history of samples. Using the probability 1/3 for each option gives us the least information, so let's use that.
Let us test our algorithms on an array of 101 elements. Also, let us test each algorithm one million times, just to be reasonably sure about its average case behavior.
The algorithms I've tested are:
Random sampling: after each attempt we forget where we were looking and choose an entirely new index at random. Each sample has an independent 1/n chance of succeeding, so we expect to take n samples on average. This is our control.
Sweep: try each position in sequence until our target is found. If our target wasn't moving, this would take n/2 samples on average. Our target is moving, however, so we may miss it on our first sweep.
Slow sweep: the same, except we test each position several times before moving on. Proposed by Patrick Trentin with a slowdown factor of 30x, tested with a slowdown factor of 2x.
Fast sweep: the opposite of slow sweep. After the first sample we skip (k-1) cells before testing the next one. The first pass starts at ary[0], the next at ary[1] and so on. Tested with each speed up factor (k) from 2 to 5.
Left-right sweep: First we check each index in turn from left to right, then each index from right to left. This algorithm would be guaranteed to find our target if it was always moving (which it isn't).
Smart greedy: Proposed by Aziuth. The idea behind this algorithm is that we track each cell probability of holding our target, then always sampling the cell with the highest probability. On one hand, this algorithm is relatively complex, on the other hand it sounds like it should give us the optimal results.
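To make the sampling order of the fast sweep concrete, here is a small Python sketch. It is not the benchmark code used for the results; the names fast_sweep and simulate are made up for illustration. It models the same rules: check a cell, then the target stays with probability p_stay or otherwise steps left or right, clamped to the array.

```python
import random
from itertools import islice


def fast_sweep(limit, k):
    """Yield indices 0..limit forever, visiting every k-th cell per pass."""
    while True:
        for offset in range(k):
            yield from range(offset, limit + 1, k)


def simulate(strategy, limit, p_stay, rng):
    """Return how many samples the strategy needs to hit the moving target."""
    target = rng.randint(0, limit)
    for samples, probe in enumerate(strategy, start=1):
        if probe == target:
            return samples
        if rng.random() > p_stay:                 # target moves after a miss
            target += rng.choice((-1, 1))
            target = max(0, min(limit, target))   # stay inside the array


print(list(islice(fast_sweep(6, 2), 7)))
rng = random.Random(0)
print(simulate(fast_sweep(100, 2), 100, 1 / 3, rng))
```

With k = 1 the generator degenerates into the plain sweep, so the same function covers both strategies.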
Results:
The results are shown as [average] ± [standard deviation].
Random sampling: 100.889145 ± 100.318212
At this point I have realised a fencepost error in my code. Good thing we have a control sample. This also establishes that we have in the ballpark of two or three digits of useful precision (sqrt #samples), which is in line with other tests of this type.
Sweep: 100.327030 ± 91.210692
The chance of our target squeezing through the net well counteracts the effect of the target taking n/2 time on average to reach the net. The algorithm doesn't really fare any better than a random sample on average, but it's more consistent in its performance and it isn't hard to implement either.
slow sweep (x0.5): 128.272588 ± 99.003681
While the slow movement of our net means our target will probably get caught in the net during the first sweep and won't need a second sweep, it also means the first sweep takes twice as long. All in all, relying on the target moving onto us seems a little inefficient.
fast sweep x2: 75.981733 ± 72.620600
fast sweep x3: 84.576265 ± 83.117648
fast sweep x4: 88.811068 ± 87.676049
fast sweep x5: 91.264716 ± 90.337139
That's... a little surprising at first. While skipping every other step means we complete each lap in half as many turns, each lap also has a reduced chance of actually encountering the target. A nicer view is to compare Sweep and FastSweep in broom-space: rotate each sample so that the index being sampled is always at 0 and the target drifts towards the left a bit faster. In Sweep, the target moves at 0, 1 or 2 speed each step. A quick parallel with the Fibonacci base tells us that the target should hit the broom/net around 62% of the time. If it misses, it takes another 100 turns to come back. In FastSweep, the target moves at 1, 2 or 3 speed each step, meaning it misses more often, but it also takes half as much time to retry. Since the retry time drops more than the hit rate, it is advantageous to use FastSweep over Sweep.
Left-right sweep: 100.572156 ± 91.503060
Mostly acts like an ordinary sweep, and its score and standard deviation reflect that. Not too surprising a result.
Aziuth's smart greedy: 87.982552 ± 85.649941
At this point I have to admit a fault in my code: this algorithm is heavily dependent on its initial behavior (which is unspecified by Aziuth and was chosen to be randomised in my tests). But performance concerns meant that this algorithm will always choose the same randomized order each time. The results are then characteristic of that randomisation rather than of the algorithm as a whole.
Always picking the most likely spot should find our target as fast as possible, right? Unfortunately, this complex algorithm barely competes with Sweep 3x. Why? I realise this is just speculation, but let us peek at the sequence Smart Greedy actually generates: During the first pass, each cell has equal probability of containing the target, so the algorithm has to choose. If it chooses randomly, it could pick up in the ballpark of 20% of cells before the dips in probability reach all of them. Afterwards the landscape is mostly smooth where the array hasn't been sampled recently, so the algorithm eventually stops sweeping and starts jumping around randomly. The real problem is that the algorithm is too greedy and doesn't really care about herding the target so it could pick at the target more easily.
Nevertheless, this complex algorithm does fare better than both simple Sweep and a random sampler. It still can't, however, compete with the simplicity and surprising efficiency of FastSweep. Repeated tests have shown that the initial randomisation could swing the efficiency anywhere between 80% run time (20% speedup) and 90% run time (10% speedup).
Finally, here's the code that was used to generate the results:
class WalkSim
  attr_reader :limit, :current, :time, :p_stay
  def initialize limit, p_stay
    @p_stay = p_stay
    @limit = limit
    @current = rand(limit + 1)
    @time = 0
  end
  # Sample cell n, then let the target move; returns true on a hit.
  def poke n
    r = n == @current
    @current += (rand(2) == 1 ? 1 : -1) if rand > @p_stay
    @current = [0, @current, @limit].sort[1]
    @time += 1
    r
  end
  def WalkSim.bench limit, p_stay, runs
    histogram = Hash.new{0}
    runs.times do
      sim = WalkSim.new limit, p_stay
      gen = yield
      nil until sim.poke gen.next
      histogram[sim.time] += 1
    end
    histogram.to_a.sort
  end
end
class Array; def sum; reduce 0, :+; end; end
def stats histogram
  count = histogram.map{|k,v|v}.sum.to_f
  avg = histogram.map{|k,v|k*v}.sum / count
  variance = histogram.map{|k,v|(k-avg)**2*v}.sum / (count - 1)
  {avg: avg, stddev: variance ** 0.5}
end
RUNS = 1_000_000
PSTAY = 1.0/3
LIMIT = 100
puts "random sampling"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
  Enumerator.new {|y|loop{y.yield rand(LIMIT + 1)}}
}
puts "sweep"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
  Enumerator.new {|y|loop{0.upto(LIMIT){|i|y.yield i}}}
}
puts "x0.5 speed sweep"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
  Enumerator.new {|y|loop{0.upto(LIMIT){|i|2.times{y.yield i}}}}
}
(2..5).each do |speed|
  puts "x#{speed} speed sweep"
  p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
    Enumerator.new {|y|loop{speed.times{|off|off.step(LIMIT, speed){|i|y.yield i}}}}
  }
end
puts "sweep LR"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {
  Enumerator.new {|y|loop{
    0.upto(LIMIT){|i|y.yield i}
    LIMIT.downto(0){|i|y.yield i}
  }}
}
$sg_gen = Enumerator.new do |y|
  probs = Array.new(LIMIT + 1){1.0 / (LIMIT + 1)}
  loop do
    ix = probs.each_with_index.map{|v,i|[v,rand,i]}.max.last
    probs[ix] = 0
    probs = [probs[0] * (1 + PSTAY)/2 + probs[1] * (1 - PSTAY)/2,
             *probs.each_cons(3).map{|a, b, c| (a + c) / 2 * (1 - PSTAY) + b * PSTAY},
             probs[-1] * (1 + PSTAY)/2 + probs[-2] * (1 - PSTAY)/2]
    y.yield ix
  end
end
$sg_cache = []
def sg_enum; Enumerator.new{|y| $sg_cache.each{|n| y.yield n}; $sg_gen.each{|n| $sg_cache.push n; y.yield n}}; end
puts "smart greedy"
p stats WalkSim.bench(LIMIT, PSTAY, RUNS) {sg_enum}
No, forget everything about loops.
Copy this array to another array and then check which cells are now non-zero. For example, if your main array is mainArray[] you can use:
int temp[sizeOfMainArray];
int counter = 0;
while(counter < sizeOfMainArray)
{
    temp[counter] = mainArray[counter]; // assignment, not comparison
    counter++;
}
//then check what is non-zero in copied array
counter = 0;
while(counter < sizeOfMainArray)
{
    if(temp[counter] != 0)
    {
        std::cout<<"I Found It!!!";
    }
    counter++;
}//end of while
One approach perhaps :
i - Have four index variables f,f1,l,l1. f is pointing at 0,f1 at 1, l is pointing at n-1 (end of the array) and l1 at n-2 (second last element)
ii - Check the elements at f1 and l1 - are any of them non zero ? If so stop. If not, check elements at f and l (to see if the element has jumped back 1).
iii - If f and l are still zero, increment the indexes and repeat step ii. Stop when f1 > l1
If only an equality check against an array element makes the non-zero element jump,
why not think of a way where we don't really require an equality check on an array element?
int check = 0;
for(int i = 0 ; i < arr.length ; i++) {
check |= arr[i];
if(check != 0)
break;
}
Orrr. Maybe you can keep reading arr[mid]. The non-zero element will end up there. Some day. Reasoning: Patrick Trentin seems to have put it in his answer (somewhat, its not really that, but you'll get an idea).
If you have some information about the array, maybe we can come up with a niftier approach.
Ignoring the trivial case where the 1 is in the first cell of the array if you iterate through the array testing each element in turn you must eventually get to the position i where the 1 is in cell i+2. So when you read cell i+1 one of three things is going to happen.
The 1 stays where it is, you're going to find it next time you look
The 1 moves away from you, you're back to the starting position with the 1 at i+2 next time
The 1 moves to cell you've just checked, it dodged your scan
Re-reading the i+1 cell will find the 1 in case 3 but just give it another chance to move in cases 1 and 2 so a strategy based on re-reading won't work.
My option would therefore be to adopt a brute force approach: if I keep scanning the array then I'm going to hit case 1 at some point and find the elusive 1.
Assumptions:
The array is no true array. This is obvious given the problem. We got some class that behaves somewhat like an array.
The array is mostly hidden. The only public operations are [] and size().
The array is obfuscated. We cannot get any information by retrieving its address and then analyzing the memory at that position. Even if we iterate through the whole memory of our system, we can't do tricks due to some advanced cryptographic means.
Every field of the array has the same probability to be the first field that hosts the one.
We know the probabilities of how the one changes its position when triggered.
Probability controlled algorithm:
Introduce another array of same size, the probability array (over double).
This array is initialized with all fields to be 1/size.
Every time we use [] on the base array, the probability array changes in this way:
The accessed position is set to zero (did not contain the one)
An entry becomes the sum, over itself and its neighbors, of each cell's probability times the probability of the one moving from that cell to (or staying at) the entry's position. (prob_array_next_it[i] = prob_array_last_it[i-1]*prob_jump_to_right + prob_array_last_it[i+1]*prob_jump_to_left + prob_array_last_it[i]*prob_dont_jump, different for i=0 and i=size-1 of course)
The probability array is normalized (setting one entry to zero set the sum of the probabilities to below one)
The algorithm accesses the field with the highest probability (chooses amongst those that have)
It might be able to optimize this by controlling the flow of probabilities, but that needs to be based on the wandering event and might require some research.
No algorithm that tries to solve this problem is guaranteed to terminate after some time. For a complexity, we would analyze the average case.
Example:
Jump probabilities are 1/3, nothing happens if trying to jump out of bounds
Initialize:
Hidden array: 0 0 1 0 0 0 0 0
Probability array: 1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
First iteration: try [0] -> failure
Hidden array: 0 0 1 0 0 0 0 0 (no jump)
Probability array step 1: 0 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Probability array step 2: 1/24 2/24 1/8 1/8 1/8 1/8 1/8 1/8
Probability array step 3 (step 2 normalized, whole array * 8/7): 1/21 2/21 1/7 1/7 1/7 1/7 1/7 1/7
Second iteration: try [2] as 1/7 is the maximum and this is the first field with 1/7 -> success (example should be clear by now, of course this might not work so fast on another example, had no interest of doing this for a lot of iterations since the probabilities would get cumbersome to compute by hand, would need to implement it. Note that if the one jumped to the left, we wouldn't have checked it so fast, even if it remained there for some time)
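The three update steps can be sketched in Python with exact fractions, which reproduces the hand computation above. The function name update and its parameters are made up for this illustration, and a jump out of bounds is treated as staying in place, as assumed in the example:

```python
from fractions import Fraction


def update(probs, checked, p_left, p_right):
    """One failed check at `checked`: zero it, spread probabilities, normalize."""
    n = len(probs)
    p_stay = 1 - p_left - p_right
    probs = list(probs)
    probs[checked] = Fraction(0)                 # step 1: the one was not there
    nxt = [Fraction(0)] * n
    for i, p in enumerate(probs):                # step 2: let the one wander
        left = i - 1 if i > 0 else i             # out of bounds -> no move
        right = i + 1 if i < n - 1 else i
        nxt[left] += p * p_left
        nxt[right] += p * p_right
        nxt[i] += p * p_stay
    total = sum(nxt)
    return [p / total for p in nxt]              # step 3: normalize


third = Fraction(1, 3)
probs = update([Fraction(1, 8)] * 8, 0, third, third)
print([str(p) for p in probs])
best = max(range(len(probs)), key=lambda i: probs[i])   # next cell to check
print(best)
```

Using Fraction avoids rounding noise when comparing entries that should be equal; a real implementation would probably switch to floats.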

No. of paths in integer array

There is an integer array, for eg.
{3,1,2,7,5,6}
One can move forward through the array either each element at a time or can jump a few elements based on the value at that index. For e.g., one can go from 3 to 1 or 3 to 7, then one can go from 1 to 2 or 1 to 2(no jumping possible here), then one can go 2 to 7 or 2 to 5, then one can go 7 to 5 only coz index of 7 is 3 and adding 7 to 3 = 10 and there is no tenth element.
I have to only count the number of possible paths to reach the end of the array from start index.
I could only do it recursively and naively which runs in exponential time.
Somebody plz help.
My recommendation: use dynamic programming.
If this keyword is sufficient and you want the challenge of finding a possible solution on your own, don't read any further!
Here a possible DP-algorithm on the example input {3,1,2,7,5,6}. It will be your job to adjust on the general problem.
create array sol length 6 with just zeros in it. the array will hold the number of ways.
sol[5] = 1;
for (i = 4; i>=0;i--) {
sol[i] = sol[i+1];
if (i+input[i] < 6 && input[i] != 1)
sol[i] += sol[i+input[i]];
}
return sol[0];
runtime O(n)
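The same DP generalized to any input, as a Python sketch. It assumes a non-empty array of positive integers, and the function name count_paths is made up here:

```python
def count_paths(values):
    """Count paths from index 0 to the last index (step by 1 or by values[i])."""
    n = len(values)
    ways = [0] * n
    ways[n - 1] = 1                      # one way to be at the end: stay there
    for i in range(n - 2, -1, -1):
        ways[i] = ways[i + 1]            # single step forward
        jump = i + values[i]
        if values[i] > 1 and jump < n:   # a jump of 1 is the same as the step
            ways[i] += ways[jump]
    return ways[0]


print(count_paths([3, 1, 2, 7, 5, 6]))
```

The table is filled from right to left, so each entry is computed exactly once.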
As for the directed graph solution hinted in the comments :
Each cell in the array represents a node. Make a directed edge from each node to every node accessible from it. Basically you can then count the number of ways more easily by just looking at the outdegrees of the nodes (since there is no directed cycle); however, it is a lot of boilerplate to actually program it.
Adjusting the recursive solution
Another solution would be pruning. This is basically equivalent to the DP algorithm. The exponential time comes from the fact that you calculate values several times. Say the function is recFunc(index). The initial call recFunc(0) calls recFunc(1) and recFunc(3) and so on. However, recFunc(3) is bound to be called again at some point, which leads to a repeated recursive calculation. To prune this, you add a Map to hold all already calculated values. If you make a call recFunc(x), you look up in the map whether x was already calculated. If yes, return the stored value. If not, calculate, store and return it. This way you get O(n) too.
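A Python sketch of that pruned recursion, using functools.lru_cache in place of a hand-written map. The names count_paths_memo and rec are made up for illustration, and the input is assumed to be a non-empty array of positive integers:

```python
from functools import lru_cache


def count_paths_memo(values):
    """Top-down count of paths from index 0 to the last index."""
    n = len(values)

    @lru_cache(maxsize=None)             # plays the role of the Map
    def rec(i):
        if i == n - 1:
            return 1
        total = rec(i + 1)               # single step forward
        jump = i + values[i]
        if values[i] > 1 and jump < n:   # a jump of 1 is the same as the step
            total += rec(jump)
        return total

    return rec(0)


print(count_paths_memo([3, 1, 2, 7, 5, 6]))
```

For very long arrays the recursion depth can exceed Python's default limit, so the bottom-up table is the safer choice there.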

How to, given a predetermined set of keys, reorder the keys such that the minimum number of nodes are used when inserting into a B-Tree?

So I have a problem which i'm pretty sure is solvable, but after many, many hours of thinking and discussion, only partial progress has been made.
The issue is as follows. I'm building a BTree of, potentially, a few million keys. When searching the BTree, it is paged on demand from disk into memory, and each page-in operation is relatively expensive. This effectively means that we want to traverse as few nodes as possible (although after a node has been traversed to, the cost of traversing through that node, up to that node, is 0). As a result, we don't want to waste space by having lots of nodes nearing minimum capacity. In theory, this should be preventable (within reason) as the structure of the tree is dependent on the order that the keys were inserted in.
So, the question is how to reorder the keys such that after the BTree is built the fewest number of nodes are used. Here's an example:
I did stumble on this question In what order should you insert a set of known keys into a B-Tree to get minimal height? which unfortunately asks a slightly different question. The answers, also don't seem to solve my problem. It is also worth adding that we want the mathematical guarantees that come from not building the tree manually, and only using the insert option. We don't want to build a tree manually, make a mistake, and then find it is unsearchable!
I've also stumbled upon 2 research papers which are so close to solving my question but aren't quite there!
Time-and Space-Optimality in B-Trees and Optimal 2,3-Trees (where I took the above image from in fact) discuss and quantify the differences between space optimal and space pessimal BTrees, but don't go as far as to describe how to design an insert order as far as I can see.
Any help on this would be greatly, greatly appreciated.
Thanks
Research papers can be found at:
http://www.uqac.ca/rebaine/8INF805/Automne2007/Sujets2007Automne/p174-rosenberg.pdf
http://scholarship.claremont.edu/cgi/viewcontent.cgi?article=1143&context=hmc_fac_pub
EDIT:: I ended up filling a btree skeleton constructed as described in the above papers with the FILLORDER algorithm. As previously mentioned, I was hoping to avoid this, however I ended up implementing it before the 2 excellent answers were posted!
The algorithm below should work for B-Trees with a minimum number of keys per node = d and a maximum = 2*d. I suppose it can be generalized to 2*d + 1 max keys if the way of selecting the median is known.
The algorithm below is designed to minimize the number of nodes, not just the height of the tree.
The method is based on the idea of putting keys into any non-full leaf or, if all leaves are full, putting the key under the lowest non-full node.
More precisely, tree generated by proposed algorithm meets following requirements:
It has minimum possible height;
It has no more than two non-full nodes on each level. (It's always the two rightmost nodes.)
Since we know that the number of nodes on any level except the root is strictly equal to the sum of the node count and the total key count on the level above, we can prove that there is no valid rearrangement of nodes between levels which decreases the total number of nodes. For example, increasing the number of keys inserted above any given level will lead to more nodes on that level and consequently to a larger total number of nodes, while any attempt to decrease the number of keys above that level will lead to fewer nodes on that level and fail to fit all keys on that level without increasing the tree height.
It is also obvious that the arrangement of keys on any given level is one of the optimal ones.
Using the reasoning above, a more formal proof by mathematical induction may also be constructed.
The idea is to hold list of counters (size of list no bigger than height of the tree) to track how much keys added on each level. Once I have d keys added to some level it means node filled in half created in that level and if there is enough keys to fill another half of this node we should skip this keys and add root for higher level. Through this way, root will be placed exactly between first half of previous subtree and first half of next subtree, it will cause split, when root will take it's place and two halfs of subtrees will become separated. Place for skipped keys will be safe while we go through bigger keys and can be filled later.
Here is nearly working (pseudo)code, array needs to be sorted:
PushArray(BTree bTree, int d, key[] Array)
{
    int N = Array.Length; // number of keys in this (sub)array
    List<int> counters = new List<int>{0};
    // skip[order] is the index step to take
    // after filling a node of that order in half
    List<int> skip = new List<int>();
    List<Pair<int,int>> skipList = new List<Pair<int,int>>();
    int i = -1;
    while (true)
    {
        int order = 0;
        while (counters[order] == d) order += 1;
        for (int j = order - 1; j >= 0; j--) counters[j] = 0;
        if (counters.Count <= order + 1) counters.Add(0);
        counters[order] += 1;
        if (skip.Count <= order)
            skip.Add(i + 2);
        if (order > 0)
            skipList.Add({i, order}); // list of skipped parts that will be needed later
        i += skip[order];
        if (i >= N) break; // ran past the end of the array
        bTree.Push(Array[i]);
    }
    // now we need to add all skipped keys in correct order
    foreach (Pair<int,int> p in skipList)
    {
        for (int k = p.2; k > 0; k--) // k, so the outer index i is not shadowed
            PushArray(bTree, d, Array.SubArray(p.1 + skip[k - 1], skip[k] - 1));
    }
}
Example:
Here is how the keys and the corresponding counters should be arranged for d = 2 during the first pass through the array. Keys that are pushed into the B-Tree during the first pass (before the loop with recursion) are marked with 'o', and skipped keys with 'x'.
24
4 9 14 19 29
0 1 2 3 5 6 7 8 10 11 12 13 15 16 17 18 20 21 22 23 25 26 27 28 30 ...
o o x x o o o x x o o o x x x x x x x x x x x x o o o x x o o ...
1 2 0 1 2 0 1 2 0 1 2 0 1 ...
0 0 1 1 1 2 2 2 0 0 0 1 1 ...
0 0 0 0 0 0 0 0 1 1 1 1 1 ...
skip[0] = 1
skip[1] = 3
skip[2] = 13
Since we don't iterate through the skipped keys, we have O(n) time complexity, not counting the insertions into the B-Tree itself and assuming the array is already sorted.
In this form it may be unclear how it works when there are not enough keys to fill the second half of a node after a skipped block. However, we can avoid skipping all skip[order] keys if the total length of the array is less than about i + 2 * skip[order], and skip only skip[order - 1] keys instead. To do so, the following line can be added after changing the counters but before changing the variable i:
while(order > 0 && i + 2*skip[order] > N) --order;
This is correct because if the total count of keys on the current level is less than or equal to 3*d, they are still split correctly when added in their original order. It leads to a slightly different arrangement of keys between the last two nodes on some levels, but it does not break any of the described requirements, and it may make the behavior easier to understand.
It may be worth finding an animation and watching how it works. Here is the sequence that should be generated on the 0..29 range: 0 1 4 5 6 9 10 11 24 25 26 29 /end of first pass/ 2 3 7 8 14 15 16 19 20 21 12 13 17 18 22 23 27 28
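The first pass of the pseudocode can be replayed in a few lines of Python. This is only a sketch under stated assumptions: `first_pass_indices` is a hypothetical name, it works on array indices rather than on a real B-Tree, it omits the recursive pass over the skipped keys, and it assumes the loop stops as soon as the index runs past the end of the array.

```python
def first_pass_indices(n, d):
    """Return (pushed, skip) for the first pass over a sorted array of n keys.

    pushed: indices that would be pushed into the B-Tree, in order.
    skip:   the index step learned for each order (level).
    The recursive handling of skipped blocks is deliberately left out.
    """
    counters = [0]   # keys added per level since the last reset
    skip = []        # skip[order] = index step after half-filling a node
    pushed = []
    i = -1
    while True:
        # lowest level whose current half-node is not yet complete
        order = 0
        while counters[order] == d:
            order += 1
        for j in range(order):
            counters[j] = 0
        if len(counters) <= order + 1:
            counters.append(0)
        counters[order] += 1
        if len(skip) <= order:
            skip.append(i + 2)
        i += skip[order]
        if i >= n:   # assumption: stop once we run off the array
            break
        pushed.append(i)
    return pushed, skip


pushed, skip = first_pass_indices(30, 2)
print(pushed)  # first-pass part of the sequence for the 0..29 range
print(skip)    # step sizes per level
```

For d = 2 and 30 keys this reproduces the first-pass portion of the sequence quoted above and the skip values 1, 3 and 13 from the example.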
The algorithm below attempts to prepare the order of the keys so that you need neither control over nor even knowledge of the insertion procedure. The only assumption is that overfilled tree nodes are split either at the middle or at the position of the last inserted element; otherwise the B-tree can be treated as a black box.
The trick is to trigger node splits in a controlled way. First you fill a node exactly, the left half with keys that belong together and the right half with another range of keys that belong together. Finally you insert a key that falls in between those two ranges but which belongs with neither; the two subranges are split into separate nodes and the last inserted key ends up in the parent node. After splitting off in this fashion you can fill the remainder of both child nodes to make the tree as compact as possible. This also works for parent nodes with more than two child nodes; just repeat the trick with one of the children until the desired number of child nodes is created. Below, I use what is conceptually the rightmost child node as the "splitting ground" (steps 5 and 6.1).
Apply the splitting trick recursively, and all elements should end up in their ideal place (which depends on the number of elements). I believe the algorithm below guarantees that the height of the tree is always minimal and that all nodes except for the root are as full as possible. However, as you can probably imagine it is hard to be completely sure without actually implementing and testing it thoroughly. I have tried this on paper and I do feel confident that this algorithm, or something extremely similar, should do the job.
Implied tree T with maximum branching factor M.
Top procedure with keys of length N:
  1. Sort the keys.
  2. Set minimal-tree-height to ceil(log(N+1)/log(M)).
  3. Call insert-chunk with chunk = keys and H = minimal-tree-height.
Procedure insert-chunk with chunk of length L, subtree height H:
  1. If H is equal to 1:
     1.1. Insert all keys from the chunk into T.
     1.2. Return immediately.
  2. Set the ideal subchunk size S to pow(M, H - 1).
  3. Set the number of subtrees T to ceil((L + 1) / S).
  4. Set the actual subchunk size S' to ceil((L + 1) / T).
  5. Recursively call insert-chunk with chunk' = the last floor((S - 1) / 2) keys of chunk and H' = H - 1.
  6. For each of the ceil(L / S') subchunks (of size S') except for the last, with index I:
     6.1. Recursively call insert-chunk with chunk' = the first ceil((S - 1) / 2) keys of subchunk I and H' = H - 1.
     6.2. Insert the last key of subchunk I into T (this insertion purposefully triggers a split).
     6.3. Recursively call insert-chunk with chunk' = the remaining keys of subchunk I (if any) and H' = H - 1.
  7. Recursively call insert-chunk with chunk' = the remaining keys of the last subchunk and H' = H - 1.
Note that the recursive procedure is called twice for each subtree; that is fine, because the first call always creates a perfectly filled half subtree.
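The minimal-tree-height formula in the top procedure follows from the fact that a tree of height H with branching factor M holds at most M^H - 1 keys. A small Python sketch of that step, using integer arithmetic instead of floating-point logarithms (which can round the wrong way at exact powers of M); the function name is illustrative and not part of the answer:

```python
def minimal_tree_height(n_keys, m):
    """Smallest H such that m**H - 1 >= n_keys.

    Equivalent to ceil(log(n_keys + 1) / log(m)), but computed with
    integers so exact powers of m are handled reliably.
    """
    height = 0
    capacity = 0  # keys that fit in a full tree of the current height
    while capacity < n_keys:
        height += 1
        capacity = m ** height - 1
    return height


# With branching factor 3 (a 2-3 tree), 8 keys fit in two levels,
# while a ninth key forces a third level.
print(minimal_tree_height(8, 3))
print(minimal_tree_height(9, 3))
```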
Here is a way that would lead to minimum height in any BST (including a B-tree):
Sort the array.
Say a node of the B-tree can hold m keys.
Divide the array recursively into m+1 equal parts, using m keys in the parent.
Construct each child tree from its n/(m+1) sorted keys using recursion.
Example:
m = 2 array = [1 2 3 4 5 6 7 8 9 10]
divide array into three parts :-
root = [4,8]
recursively solve :-
child1 = [1 2 3]
root1 = [2]
left1 = [1]
right1 = [3]
Solve recursively in the same way for all the other children.
So is this about optimising the creation procedure, or optimising the tree?
You can clearly create a maximally efficient B-Tree by first creating a full Balanced Binary Tree, and then contracting nodes.
At any level in a binary tree, the gap in numbers between two nodes contains all the numbers between those two values by the definition of a binary tree, and this is more or less the definition of a B-Tree. You simply start contracting the binary tree divisions into B-Tree nodes. Since the binary tree is balanced by construction, the gaps between nodes on the same level always contain the same number of nodes (assuming the tree is filled). Thus the BTree so constructed is guaranteed balanced.
In practice this is probably quite a slow way to create a BTree, but it certainly meets your criteria for constructing the optimal B-Tree, and the literature on creating balanced binary trees is comprehensive.
=====================================
In your case, where you might take an off-the-shelf "better" over a constructed optimal version, have you considered simply changing the number of children that nodes can have? Your diagram looks like a classic 2-3 tree, but it's perfectly possible to have a 3-4 tree, or a 3-5 tree, which means that every node will have at least three children.
Your question is about B-tree optimization. It is unlikely that you are doing this just for fun, so I can only assume that you would like to optimize data access, maybe as part of database programming or something like that. You wrote: "When searching the BTree, it is paged on demand from disk into memory", which means that you either do not have enough memory to do any sort of caching, or you have a policy of using as little memory as possible. Either way, this may be the root cause of why any answer to your question will not be satisfying. Let me explain why.
When it comes to data access optimization, memory is your friend. It does not matter whether you do read or write optimization: you need memory. Any sort of write optimization works on the assumption that it can read information quickly (from memory); sorting needs data. If you do not have enough memory for read optimization, you will not have it for write optimization either.
As soon as you are willing to accept at least some memory use, you can rethink your statement "When searching the BTree, it is paged on demand from disk into memory", which makes room for balancing between read and write optimization. A maximally optimized B-tree is maximized write optimization. In most data access scenarios I know, you get one write for every 10-100 reads. That means that a maximized write optimization is likely to give poor performance in terms of data access optimization. That is why databases accept restructuring cycles, key space waste, unbalanced B-trees and things like that...

Information Gain and Entropy

I recently read this question regarding information gain and entropy. I think I have a semi-decent grasp on the main idea, but I'm curious as to what to do in situations such as the following:
If we have a bag of 7 coins, 1 of which is heavier than the others, and 1 of which is lighter than the others, and we know the heavier coin + the lighter coin is the same as 2 normal coins, what is the information gain associated with picking two random coins and weighing them against each other?
Our goal here is to identify the two odd coins. I've been thinking this problem over for a while, and can't frame it correctly in a decision tree, or any other way for that matter. Any help?
EDIT: I understand the formula for entropy and the formula for information gain. What I don't understand is how to frame this problem in a decision tree format.
EDIT 2: Here is where I'm at so far:
Assuming we pick two coins and they both end up weighing the same, we can assume our new chances of picking H+L come out to 1/5 * 1/4 = 1/20, easy enough.
Assuming we pick two coins and the left side is heavier. There are three different cases where this can occur:
HM: Which gives us 1/2 chance of picking H and a 1/4 chance of picking L: 1/8
HL: 1/2 chance of picking high, 1/1 chance of picking low: 1/1
ML: 1/2 chance of picking low, 1/4 chance of picking high: 1/8
However, the odds of us picking HM are 1/7 * 5/6 which is 5/42
The odds of us picking HL are 1/7 * 1/6 which is 1/42
And the odds of us picking ML are 1/7 * 5/6 which is 5/42
If we weight the overall probabilities with these odds, we are given:
(1/8) * (5/42) + (1/1) * (1/42) + (1/8) * (5/42) = 3/56.
The same holds true for option B.
option A = 3/56
option B = 3/56
option C = 1/20
However, option C should be weighted heavier because there is a 5/7 * 4/6 chance to pick two mediums. So I'm assuming from here I weight THOSE odds.
I am pretty sure I've messed up somewhere along the way, but I think I'm on the right path!
EDIT 3: More stuff.
Assuming the scale is unbalanced, the odds are (10/11) that only one of the coins is the H or L coin, and (1/11) that both coins are H/L
Therefore we can conclude:
(10 / 11) * (1/2 * 1/5) and
(1 / 11) * (1/2)
EDIT 4: Going to go ahead and say that it is a total 4/42 increase.
You can construct a decision tree from information-gain considerations, but that's not the question you posted, which is only to compute the information gain (presumably the expected information gain;-) from one "information extraction move" -- picking two random coins and weighing them against each other. To construct the decision tree, you need to know what moves are affordable from the initial state (presumably the general rule is: you can pick two sets of N coins, N < 4, and weigh them against each other -- and that's the only kind of move, parametric over N), the expected information gain from each, and that gives you the first leg of the decision tree (the move with highest expected information gain); then you do the same process for each of the possible results of that move, and so on down.
So do you need help to compute that expected information gain for each of the three allowable values of N, only for N==1, or can you try doing it yourself? If the third possibility obtains, then that would maximize the amount of learning you get from the exercise -- which after all IS the key purpose of homework. So why don't you try, edit your question to show us how you proceeded and what you got, and we'll be happy to confirm you got it right, or try and help correct any misunderstanding your procedure might reveal!
Edit: trying to give some hints rather than serving the OP the ready-cooked solution on a platter;-). Call the coins H (for heavy), L (for light), and M (for medium -- five of those). When you pick 2 coins at random you can get (out of 7 * 6 == 42 possibilities including order) HL, LH (one each), HM, MH, LM, ML (5 each), MM (5 * 4 == 20 cases) -- 2 plus 20 plus 20 is 42, check. In the weighing you get 3 possible results, call them A (left heavier), B (right heavier), C (equal weight). HL, HM, and ML, 11 cases, will be A; LH, MH, and LM, 11 cases, will be B; MM, 20 cases, will be C. So A and B aren't really distinguishable (which one is left, which one is right, is basically arbitrary!), so we have 22 cases where the weight will be different, 20 where they will be equal -- it's a good sign that the cases giving each result are in pretty close numbers!
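The case counts above can be checked by brute force, and the same enumeration gives the expected information gain of the weighing. The Python sketch below fixes which two coins go on the scale and enumerates the 42 ordered (heavy, light) placements instead; by symmetry this should give the same 11 / 11 / 20 split. It assumes that all 42 placements are equally likely and that the scale result is fully determined by the placement; the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log2


def weigh_outcome_counts(n_coins=7, left=0, right=1):
    """Count scale results over all (heavy, light) placements.

    Coins `left` and `right` are weighed against each other;
    every other coin is either normal or one of the odd coins.
    """
    counts = Counter()
    for heavy, light in permutations(range(n_coins), 2):
        if left == heavy or right == light:
            counts["A"] += 1  # left pan heavier
        elif left == light or right == heavy:
            counts["B"] += 1  # right pan heavier
        else:
            counts["C"] += 1  # balanced: both coins are normal
    return counts


def expected_information_gain(counts):
    """Prior entropy minus expected posterior entropy, in bits.

    Assumes each result leaves its placements equiprobable.
    """
    total = sum(counts.values())
    prior = log2(total)
    posterior = sum(c / total * log2(c) for c in counts.values())
    return prior - posterior


counts = weigh_outcome_counts()
print(dict(counts))
print(expected_information_gain(counts))  # roughly 1.52 bits
```

Because the result is a deterministic function of the placement, this gain equals the entropy of the three-way result distribution (11/42, 11/42, 20/42).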
So now consider how many (equiprobable) possibilities existed a priori, how many a posteriori, for each of the experiment's results. You're tasked to pick the H and L choice. If you did it at random before the experiment, what would your chances be? 1 in 7 for the random pick of the H; given that succeeds 1 in 6 for the pick of the L -- overall 1 in 42.
After the experiment, how are you doing? If C, you can rule out those two coins and you're left with a mystery H, a mystery L, and three Ms -- so if you picked at random you'd have 1 in 5 to pick H, if successful 1 in 4 to pick L, overall 1 in 20 -- your success chances have slightly more than doubled. It's trickier to see "what next" for the A (and equivalently B) cases because there are several, as listed above (and, less obviously, not equiprobable...), but obviously you won't pick the known-lighter coin for H (and vice versa) and if you pick one of the 5 unweighed coins for H (or L) only one of the weighed coins is a candidate for the other role (L or H respectively). Ignoring for simplicity the "non equiprobable" issue (which is really kind of tricky) can you compute what your chances of guessing (with a random pick not inconsistent with the experiment's result) would be...?

Resources