A-star implementation with no path finding - artificial-intelligence

I'm dealing with a task from the ai class that is the following
I need to use the A* algorithm
I have an n-digit display; each digit is between 0 and 9.
Every digit is identified by Ci, i = 0...n-1.
Under every digit there is a button that the agent can press to modify its value.
Every time the agent presses button i, the digits change following 2 rules:
1) Ci = (Ci + 1) % 10
2) C[(i+j) % n] = (C[(i+j) % n] + k) % 10
Rule 1) is known to the agent while rule 2) isn't. Constants j and k are unknown to the agent.
My goal is to reach a goal-state starting from a 00...0 Configuration.
The goal-state can be generated applying random actions by the agent, aka pressing buttons randomly, so I'm sure there is at least one way to solve the problem.
My difficulties are:
How do I represent an n-digits display as a node?
How do I choose a right heuristics?
I'm stuck and frustrated with this exam.
(Sorry for English mistakes, I'm italian!)

An A* algorithm is characterized mainly by:
The nodes forming the search tree.
The definition of the "neighbor" concept.
The definition of an optimistic heuristic for a node.
Nodes in the search tree
As you put the problem, it seems that each node of the search tree is a configuration of the n digits of the display. Note that this can be easily represented as an array of integers of size n, where each position of the array represents a digit in the display, and the value represents the actual value of the digit in the display. For example, a display of 5 digits showing 25847 could be represented by [2, 5, 8, 4, 7]. Easy, right?
"Neighbor" concept
That is, how do you change the display status? As you said, you can only use one of the n buttons below each digit. So, for every status, you will have exactly n possible "neighbors", or "descendants", if you think of the search tree. You will need a function that gives you the node resulting from pressing a particular button (which would simulate the agent pressing the button). Something like the following (in Java):
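// Requires: import java.util.Arrays;
static int[] pressButton(int[] node, int button, int j, int k) {
    int n = node.length;
    int[] newNode = Arrays.copyOf(node, n);
    // Rule 1: the pressed digit advances by one.
    newNode[button] = (newNode[button] + 1) % 10;
    // Rule 2: the digit j positions away advances by k.
    // floorMod keeps the results non-negative even if j or k is negative.
    int other = Math.floorMod(button + j, n);
    newNode[other] = Math.floorMod(newNode[other] + k, 10);
    return newNode;
}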
Now you have this, you can generate every child of the current node with something as simple as:
for (int i = 0; i < node.length; i++) {
    int[] newNode = pressButton(node, i, j, k);
    // compute heuristic of the new node
    // add it to the A* priority queue
}
Optimistic heuristic
Now you just need a heuristic that provides an optimistic estimation of the distance from a given node to the goal. One simple idea would be to assume that, for a given node, you will have to press the buttons at least as many times as the number of digits that differ from the goal; for example, if the goal is 41243 and the current node is 37253, you will have to press a button at least three times, because there are three digits that differ from the goal. In Java:
static int heuristic(int[] currentNode, int[] goal) {
    int h = 0;
    for (int i = 0; i < goal.length; i++) {
        if (goal[i] != currentNode[i]) {
            h = h + 1;
        }
    }
    return h;
}
Note, however, that this heuristic is not admissible. For example, if the goal is 81730, your current node is 71230, j = 2 and k = 5, this heuristic would give a value of 2; however, we could reach the goal state from the current node in just 1 step by pressing the first button, so the heuristic would be pessimistic in this case. This is because each time a button is pressed, two digits are affected, which could make us get to the solution faster than we thought. Just subtracting one from the count is not enough either: with four differing digits, two presses may suffice. Since one press changes at most two digits, a safe lower bound is half the number of differing digits, rounded up:
static int heuristic(int[] currentNode, int[] goal) {
    int h = 0;
    for (int i = 0; i < goal.length; i++) {
        if (goal[i] != currentNode[i]) {
            h = h + 1;
        }
    }
    // Each press changes at most two digits,
    // so at least ceil(h / 2) presses are needed.
    return (h + 1) / 2;
}
More accurate heuristics could be defined based not only on the number of digits that differ from the goal, but also on how much they differ (the bigger the difference, the more times you need to press the button); however, I don't think there is an easy way to define an optimistic heuristic based on this (especially keeping in mind corner cases like j = 0 or j = n, negative k, etc.). If you are allowed to use the values of j and k in your heuristic (which I assumed not, because that's what I interpreted when you said that these are unknown to the agent), maybe there could be room for some sophistication, but even then, I would have a hard time trying to define it.
Finally, given the nature of the problem, it's very easy to reach already visited states when going from one node to another. If you want to keep your tree as an actual tree (and finite), and not as an infinite graph, you will have to use a set of already visited nodes, to avoid creating cycles in the search space.
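To tie the three pieces together, here is a minimal Python sketch of the whole search, under some assumptions: the names `press`, `heuristic` and `a_star` are made up for illustration, states are tuples so they can be stored in a dictionary of visited states, and the simulation is given j and k (the heuristic itself does not use them). The heuristic is the bound ceil(mismatches / 2), which relies only on the fact that one press changes at most two digits.

```python
import heapq
from itertools import count


def press(state, button, j, k):
    """Return the display reached by pressing `button` (rules 1 and 2)."""
    n = len(state)
    digits = list(state)
    digits[button] = (digits[button] + 1) % 10
    other = (button + j) % n
    digits[other] = (digits[other] + k) % 10
    return tuple(digits)


def heuristic(state, goal):
    """Lower bound on presses: one press changes at most two digits."""
    mismatches = sum(a != b for a, b in zip(state, goal))
    return (mismatches + 1) // 2


def a_star(goal, j, k):
    """Return a shortest list of button presses from 00...0 to `goal`."""
    goal = tuple(goal)
    start = (0,) * len(goal)
    tie = count()  # tie-breaker so the heap never compares states
    frontier = [(heuristic(start, goal), next(tie), 0, start, [])]
    best_cost = {start: 0}  # doubles as the set of visited states
    while frontier:
        _, _, cost, state, presses = heapq.heappop(frontier)
        if state == goal:
            return presses
        if cost > best_cost[state]:
            continue  # stale queue entry
        for button in range(len(state)):
            child = press(state, button, j, k)
            new_cost = cost + 1
            if new_cost < best_cost.get(child, float("inf")):
                best_cost[child] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(child, goal), next(tie),
                     new_cost, child, presses + [button]),
                )
    return None  # goal not reachable
```

For example, `a_star((1, 1, 1, 1), 2, 1)` returns a list of two button indices, since with j = 2 and k = 1 each press raises two digits by one.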

Solving Lights out for AI Course

So I was given the following task: Given that all lights in a 5x5 version of a game are turned on, write an algorithm using UCS / A* / BFS / Greedy best first search that finds a solution.
What I did first was realize that UCS would be unnecessary, as the cost of moving from one state to another is 1 (pressing a button that flips itself and neighbouring ones). So I wrote BFS instead. It turned out that it runs too long and fills up the queue, even though I was careful to delete parent nodes when I was finished with them so as not to run out of memory. It would run for around 5-6 minutes and then crash because of memory.
Next, what I did is write DFS(even though it was not mentioned as one of possibilities) and it did find a solution in 123 secs, at depth 15(I used depth-first limited because I knew that there was a solution at depth 15).
What I am wondering now is am I missing something? Is there some good heuristics to try to solve this problem using A* search? I figured out absolutely nothing when it's about heuristics, because it doesn't seem any trivial to find one in this problem.
Thanks very much. Looking forward to some help from you guys
Here is my source code(I think it's pretty straightforward to follow):
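#include <cstdio>   // printf
#include <cstdlib>  // exit
#include <queue>
using namespace std;

struct state
{
    bool board[25];    // light pattern
    bool clicked[25];  // switches pressed so far
    int cost;
    int h;
    struct state* from;
};
int visited[1<<25];  // indexed by bijection_mapping(); 0 = not seen yet
int dx[5] = {0, 5, -5};
int MAX_DEPTH = 1<<30;
bool found=false;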
struct state* MakeStartState()
{
struct state* noviCvor = new struct state();
for(int i = 0; i < 25; i++) noviCvor->board[i] = false, noviCvor->clicked[i] = false;
noviCvor->cost = 0;
//h=...
noviCvor->from = NULL;
return noviCvor;
};
struct state* MakeNextState(struct state* temp, int press_pos)
{
struct state* noviCvor = new struct state();
for(int i = 0; i < 25; i++) noviCvor->board[i] = temp->board[i], noviCvor->clicked[i] = temp->clicked[i];
noviCvor->clicked[press_pos] = true;
noviCvor->cost = temp->cost + 1;
//h=...
noviCvor->from = temp;
int temp_pos;
for(int k = 0; k < 3; k++)
{
temp_pos = press_pos + dx[k];
if(temp_pos >= 0 && temp_pos < 25)
{
noviCvor->board[temp_pos] = !noviCvor->board[temp_pos];
}
}
if( ((press_pos+1) % 5 != 0) && (press_pos+1) < 25 )
noviCvor->board[press_pos+1] = !noviCvor->board[press_pos+1];
if( (press_pos % 5 != 0) && (press_pos-1) >= 0 )
noviCvor->board[press_pos-1] = !noviCvor->board[press_pos-1];
return noviCvor;
};
bool CheckFinalState(struct state* temp)
{
for(int i = 0; i < 25; i++)
{
if(!temp->board[i]) return false;
}
return true;
}
int bijection_mapping(struct state* temp)
{
int temp_pow = 1;
int mapping = 0;
for(int i = 0; i < 25; i++)
{
if(temp->board[i])
mapping+=temp_pow;
temp_pow*=2;
}
return mapping;
}
void BFS()
{
queue<struct state*> Q;
struct state* start = MakeStartState();
Q.push(start);
struct state* temp;
visited[ bijection_mapping(start) ] = 1;
while(!Q.empty())
{
temp = Q.front();
Q.pop();
visited[ bijection_mapping(temp) ] = 2;
for(int i = 0; i < 25; i++)
{
if(!temp->clicked[i])
{
struct state* next = MakeNextState(temp, i);
int mapa = bijection_mapping(next);
if(visited[ mapa ] == 0)
{
if(CheckFinalState(next))
{
printf("NADJENO RESENJE\n"); // "solution found"
exit(0);
}
visited[ mapa ] = 1;
Q.push(next);
}
}
}
delete temp;
}
}
PS. As I am not using map anymore(switched to array) for visited states, my DFS solution improved from 123 secs to 54 secs but BFS still crashes.
First of all, you may already recognize that in Lights Out you never have to flip the same switch more than once, and it doesn't matter in which order you flip the switches. You can thus describe the current state in two distinct ways: either in terms of which lights are on, or in terms of which switches have been flipped. The latter, together with the starting pattern of lights, gives you the former.
To employ a graph-search algorithm to solve the problem, you need a notion of adjacency. That follows more easily from the second characterization: two states are adjacent if there is exactly one switch about which they differ. That characterization also directly encodes the length of the path to each node (= the number of switches that have been flipped), and it reduces the number of subsequent moves that need to be considered for each state considered, since all possible paths to each node are encoded in the pattern of switches.
You could use that in a breadth-first search relatively easily (and this may be what you in fact tried). BFS is equivalent to Dijkstra's algorithm in that case, even without using an explicit priority queue, because you enqueue new nodes to explore in priority (path-length) order.
You can also convert that to an A* search with addition of a suitable heuristic. For example, since each move turns off at most five lights, one could take as the heuristic the number of lights still on after each move, divided by 5. Though that's a bit crude, I'm inclined to think that it would be of some help. You do need a real priority queue for that alternative, however.
As far as implementation goes, do recognize that you can represent both the pattern of lights currently on and the pattern of switches that have been pressed as bit vectors. Each pattern fits in a 32-bit integer, and a list of visited states requires 2^25 bits (4 MiB), which is well within the capacity of modern computing systems. Even if you use that many bytes instead (32 MiB), you ought to be able to handle it. Moreover, you can perform all needed operations using bitwise arithmetic operators, especially XOR. Thus, this problem (at its given size) ought to be computable relatively quickly.
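As a rough illustration of the bit-vector idea and of the crude heuristic, here is a small Python sketch. The names (`light_masks`, `MASKS`, `press`, `heuristic`, `seen`, `mark`) are invented for this example, and the heuristic rounds the "lights on divided by 5" estimate up to a whole number of moves.

```python
def light_masks(size=5):
    """masks[i] = bit pattern of the lights toggled by pressing switch i."""
    masks = []
    for pos in range(size * size):
        row, col = divmod(pos, size)
        mask = 1 << pos                      # the pressed light itself
        if row > 0:
            mask |= 1 << (pos - size)        # light above
        if row < size - 1:
            mask |= 1 << (pos + size)        # light below
        if col > 0:
            mask |= 1 << (pos - 1)           # light to the left
        if col < size - 1:
            mask |= 1 << (pos + 1)           # light to the right
        masks.append(mask)
    return masks


MASKS = light_masks()
ALL_ON = (1 << 25) - 1


def press(lights, switch):
    """Toggle the lights affected by one switch with a single XOR."""
    return lights ^ MASKS[switch]


def heuristic(lights):
    """At most five lights go off per move, so ceil(lights_on / 5) moves remain."""
    return (bin(lights).count("1") + 4) // 5


# One bit per light pattern: 2**25 bits = 2**22 bytes = 4 MiB.
visited = bytearray(1 << 22)


def seen(lights):
    return (visited[lights >> 3] >> (lights & 7)) & 1


def mark(lights):
    visited[lights >> 3] |= 1 << (lights & 7)
```

Pressing the same switch twice restores the pattern, which is why no switch ever needs to be flipped more than once.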
Update:
As I mentioned in comments, I decided to solve the problem for myself, with -- it seemed to me -- very good success. I used a variety of techniques to achieve good performance and minimize memory usage, and in this case, those mostly were complementary. Here are some of my tricks:
I represented each whole-system state with a single uint64_t. The top 32 bits contain a bitmask of which switches have been flipped, and the bottom 32 contain a bitmask of which lights are on as a result. I wrapped these in a struct along with a single pointer to link them together as elements of a queue. A given state can be tested as a solution with one bitwise-and operation and one integer comparison.
I created a pre-initialized array of 25 uint64_t bitmasks representing the effect of each move. One bit set among the top 32 of each represents the switch that is flipped, and between three and five bits set among the bottom 32 represent the lights that are toggled as a result. The effect of flipping one switch can then be computed simply as new_state = old_state ^ move[i].
I implemented plain breadth-first search instead of A*, in part because I was trying to put something together quickly, and in particular because that way I could use a regular queue instead of a priority queue.
I structured my BFS in a way that naturally avoided visiting the same state twice, without having to actually track which states had ever been enqueued. This was based on some insight into how to efficiently generate distinct bit patterns without repeating, with those having fewer bits set generated before those having more bits set. The latter criterion was satisfied fairly naturally by the queue-based approach required anyway for BFS.
I used a second (plain) queue to recycle dynamically-allocated queue nodes after they were removed from the main queue, to minimize the number of calls to malloc().
Overall code was a bit less than 200 lines, including blank and comment lines, data type declarations, I/O, queue implementation (plain C, no STL) -- everything.
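The C program itself is not shown, so the following Python sketch is only one possible reading of the tricks listed above, not the original code. It packs the switches into the top 32 bits and the lights into the bottom 32, applies a move with one XOR, and avoids revisiting states by only ever flipping switches with a higher index than the last one flipped (an assumption about what "generating distinct bit patterns without repeating" means). The function names and the `size` parameter are invented here; a small board is used in the example because pure Python is slow on the 5x5 case.

```python
from collections import deque


def build_moves(size):
    """moves[i]: switch bit i in the top 32 bits, toggled lights in the bottom 32."""
    moves = []
    for pos in range(size * size):
        row, col = divmod(pos, size)
        lights = 1 << pos
        if row > 0:
            lights |= 1 << (pos - size)
        if row < size - 1:
            lights |= 1 << (pos + size)
        if col > 0:
            lights |= 1 << (pos - 1)
        if col < size - 1:
            lights |= 1 << (pos + 1)
        moves.append((1 << (32 + pos)) | lights)
    return moves


def solve(size):
    """BFS from 'all lights on'; returns the switches of a shortest solution."""
    cells = size * size
    moves = build_moves(size)
    light_bits = (1 << cells) - 1
    start = light_bits                 # no switches flipped, every light on
    queue = deque([(start, 0)])        # (packed state, lowest switch still allowed)
    while queue:
        state, first = queue.popleft()
        if (state & light_bits) == 0:  # one AND plus one comparison
            switches = state >> 32
            return [i for i in range(cells) if (switches >> i) & 1]
        # Flipping only higher-numbered switches generates every subset once,
        # smaller subsets before larger ones.
        for i in range(first, cells):
            queue.append((state ^ moves[i], i + 1))
    return None
```

On a 3x3 board, `solve(3)` returns `[0, 2, 4, 6, 8]`: the four corners and the centre.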
Note, by the way, that the priority queue employed in standard Dijkstra and in A* is primarily about finding the right answer (shortest path), and only secondarily about doing so efficiently. Enqueueing and dequeueing from a standard queue can both be O(1), whereas those operations on a priority queue are O(log m) in the number of elements in the queue. A* and BFS both have worst-case queue size upper bounds of O(n) in the total number of states. Thus, BFS will scale better than A* with problem size; the only question is whether the former reliably gives you the right answer, which in this case, it does.

Generating a connected graph and checking if it has eulerian cycle

So, I wanted to have some fun with graphs and now it's driving me crazy.
First, I generate a connected graph with a given number of edges. This is the easy part, which became my curse. Basically, it works as intended, but the results I'm getting are quite bizarre (well, maybe they're not, and I'm the issue here). The algorithm for generating the graph is fairly simple.
I have two arrays, one of them is filled with numbers from 0 to n - 1, and the other is empty.
At the beginning I shuffle the first one and move its last element to the empty one.
Then, in a loop, I create an edge between the last element of the first array and a random element from the second one, and after that I again move the last element from the first array to the other one.
After that part is done, I have to create random edges between the vertices until I get as many as I need. This is, again, very easy. I just pick two random numbers in the range from 0 to n - 1 and, if there is no edge between these vertices, I create one.
This is the code:
void generate(int n, double d) {
initMatrix(n); // <- creates an adjacency matrix n x n, filled with 0s
int *array1 = malloc(n * sizeof(int));
int *array2 = malloc(n * sizeof(int));
int j = n - 1, k = 0;
for (int i = 0; i < n; ++i) {
array1[i] = i;
array2[i] = 0;
}
shuffle(array1, 0, n); // <- Fisher-Yates shuffle
array2[k++] = array1[j--];
int edges = d * n * (n - 1) * .5;
if (edges % 2) {
++edges;
}
while (j >= 0) {
int r = rand() % k;
createEdge(array1[j], array2[r]);
array2[k++] = array1[j--];
--edges;
}
free(array1);
free(array2);
while (edges > 0) { // edges is already negative here if fewer than n - 1 were requested
int a = rand() % n;
int b = rand() % n;
if (a == b || checkEdge(a, b)) {
continue;
}
createEdge(a, b);
--edges;
}
}
Now, if I print it out, it's a fine graph. Then I want to find a Hamiltonian cycle. This part works. Then I get to my bane - Eulerian cycle. What's the problem?
Well, first I check if all vertices have even degree. And they don't. Always. Every single time, unless I choose to generate a complete graph.
I now feel destroyed by my own code. Is something wrong? Or is it supposed to be like this? I knew that Eulerian circuits would be rare, but not that rare. Please, help.
Let's analyze the probability of having an Eulerian cycle, and for simplicity let's do it for all graphs with n vertices, regardless of the number of edges.
Given a graph G of size n, choose an arbitrary vertex v. The probability of its degree being even is roughly 1/2 (assuming for each u1,u2, P((v,u1) exists) = P((v,u2) exists)).
Now, remove v from G, and create a new graph G' with n-1 vertices, and without all edges connected to v.
Similarly, for any arbitrary vertex v' in G': if (v,v') was an edge in G, we need d(v') to be odd. Otherwise, we need d(v') to be even (both in G'). Either way, the probability of that is still roughly 1/2 (independent of the degree of v).
....
For the ith round, let #(v) be the number of discarded edges until reaching the current graph that are connected to v. If #(v) is odd, the probability of its current degree being odd is ~1/2, and if #(v) is even, the probability of its current degree being even is also ~1/2, and we remain with current probability of ~1/2
We can now understand how it works, and make a recurrence formula for the probability of the graph being eulerian cyclic:
P(n) ~= 1/2*P(n-1)
P(1) = 1
This is going to give us P(n) ~= 2^-(n-1), i.e. on the order of 2^-n, which is very unlikely for reasonable n.
Note, 1/2 is just a rough estimation (and is correct when n->infinity), probability is in fact a bit higher, but it is still exponential in -n - which makes it very unlikely for reasonable size graphs.
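For small n the estimate can be checked by brute force. The sketch below (the function name is made up) enumerates every labeled graph on n vertices and counts those in which all degrees are even. That condition is necessary for an Eulerian cycle but not sufficient, since the graph must also be connected, so the count is an upper bound. It also weights all graphs equally, which is not the same distribution as the generator in the question.

```python
from itertools import combinations


def even_degree_fraction(n):
    """Return (graphs with all degrees even, total graphs) on n labeled vertices."""
    pairs = list(combinations(range(n), 2))   # every possible edge
    even = 0
    for mask in range(1 << len(pairs)):       # each mask selects a set of edges
        degree = [0] * n
        for bit, (u, v) in enumerate(pairs):
            if (mask >> bit) & 1:
                degree[u] += 1
                degree[v] += 1
        if all(d % 2 == 0 for d in degree):
            even += 1
    return even, 1 << len(pairs)
```

For n = 4 this gives 8 out of 64 graphs and for n = 5 it gives 64 out of 1024, i.e. a fraction of 2^-(n-1) in both cases.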

Feasibility of non-self-intersecting path according to array constraints

I have two arrays, each containing a different ordering of the same set of integers. Each integer is a label for a point in which two closed paths intersect in the plane. The two arrays are interpreted as giving the circular ordering (in clockwise order) of points along each of two closed paths in the plane, with no particular starting point. The two paths intersect with each other as many times as there are points in the arrays, but a path may not self-intersect at all. How do I determine, from these two arrays, whether it is possible to draw the two paths in the plane without self-crossings? (The integer labels have no inherent meaning.)
Example 1: A = {3,4,2,1,10,7} and B = {1,2,4,10,7,3}: it is possible
Example 2: A = {2,3,0,10,8,11} and B = {10,2,3,8,11,0}: it is not possible.
Try it by drawing a circle, with 6 points labelled around it according to A, then attempt to connect the 6 points in a second closed path, according to the ordering in B, without crossing the new line you are drawing. (I believe it makes no difference to the possibility/impossibility of drawing the line whether you start by exiting or entering the first loop.) You will be able to do it for example 1, but not for example 2.
I am currently using a very elaborate method where I look at adjacent pairs in one array, e.g. in Example 1, array A is divided into {3,4}, {2,1}, {10,7}, then I find the groupings in the array B as partitioned by the two members listed in each case:
{3,4} --> {{1,2}, {10,7}}
{2,1} --> {{4,10,7,3}, {}}
{10,7} --> {{3,1,2,4}, {}}
and check that each pair on the left-hand-side finds itself in the same grouping of the right-hand-side partition in each of the other 2 rows. Then I do the same, offset by one position:
{4,2} --> {{10,7,3,1}, {}}
{1,10} --> {{2,4}, {7,3}}
{7,3} --> {{1,2,4,10}, {}}
Everything checks out here.
In Example 2, though, the method shows that it is impossible to draw the path. Among the "offset by 1" pairs from array A we find {10,8} causes a partition of array B into {{2,3}, {11,0}}. But we need 11 and 2 to be in the same grouping, as they are the next pair of points in array A.
This idea is unwieldy, and my implementation is even more unwieldy. I'm not even 100% convinced it always works. Could anyone suggest an algorithm for deciding? Target language is C, if that matters.
EDIT: I've added an illustration here: http://imgur.com/TS8xDIk. Here the paths to be reconciled share points 0, 1, 2 and 3. On the black path they are visited in order (A = {0,1,2,3}). On the blue path we have B = {0,2,1,3}. You can see on the left-hand side that this is impossible--the blue path will have to self-intersect in order to do it (or have additional intersections with the black path, which is also not allowed).
On the right-hand side is an illustration of the same problem interpreted as a graph with edges, responding to the suggestion that the problem boils down to a check for planarity. Well, as you can see, it's quite possible to form a planar graph from this collection of edges, but we cannot read the graph as two closed paths with n intersections--the blue path has "intersections" with the other path that don't actually cross. The paths are required to cross from inside to outside or vice-versa at each node, they cannot simply kiss and turn back.
I hope this clarifies the problem and I apologise for any lack of clarity the first time around.
By the way introducing coordinates would be a complete red herring: any point can be given any coordinates, and the problem remains the same. In a sense it is topological more than geometrical. Thanks for any additional suggestions on how to accomplish this feasibility check.
SECOND EDIT to show my current code. Like in the suggestion below by svinja, I first reduced the two arrays to a permutation of 0..2n-1. The input to the function is two arrays (which contain different orderings of the same 2n integers) and the length of these arrays. I am a hobbyist with no training in programming so I expect you will find several infelicities in the approach to coding. The idea is to return 1 if the arrays A and B are in a permutational relationship that allows the path to be drawn, and 0 if not.
int isGoodPerm(int A[], int B[], int len)
{
int i,j,a,b;
int P[max_len];
for (i=0; i<len; i++)
for (j=0; j<len; j++)
if (B[j] == A[i])
{
P[i] = j;
break;
}
for (i=0; i<len; i++)
{
if (P[i] < P[(i+1)%len])
{
a = P[i];
b = P[(i+1)%len];
}
else
{
a = P[(i+1)%len];
b = P[i];
}
for (j=i+2; j<i+len; j+=2)
if ((P[j%len] > a && P[j%len] < b) != (P[(j+1)%len] > a && P[(j+1)%len] < b))
return 0;
}
return 1;
}
I'm actually still testing another part of this project, and have only tested this part in isolation. I tweaked a couple of things when pasting it into the larger codebase and have copied that version--I hope I didn't introduce any errors.
I think the long question is hiding the true intent. I might be missing something, but it looks like the only thing you really need to check is if the points in an array can be drawn without self-intersecting. I'm assuming you can map the integers to the actual coordinates. If so, you might find the solution posted on the related math.stackexchange site, describing either the determinant-based method or the Bentley-Ottmann algorithm for crossings, to be helpful.
I am not sure if this is correct, but as nobody is posting an answer, here it is:
We can convert any instance of this problem to one where the first path is (0, 1, 2, ... N-1). In your example 2, this would be (0, 1, 2, 3, 4, 5) and (3, 0, 1, 4, 5, 2). I only mention this because I do this conversion in my code to simplify further code.
Now, imagine the first path are points on a circle. I think we can assume this without loss of generality. I also assume we can start the second path either inside or outside of the circle, if one works the other should, too. If I am wrong about either, the algorithm is certainly wrong.
So we always start by connecting the first and second point of the second path on the, let's say, outside. If we connect 2 points X and Y which are not right next to each other on the circle, we divide the remaining points into group A - the ones from X to Y clockwise, and group B - the ones from Y to X clockwise. Now we remember that points from group A can no longer be connected to points from group B on the outside part.
After this, we continue connecting the second and third point of the second path, but we are now on the inside. So we check "can we connect X and Y on the inside?" if we can't, we return false. If we can, we again find groups A and B and remember that none of them can be connected to each other, but now on the inside.
Now we're back on the outside, and we connect the third and fourth point of the second path... And so on.
Here is an image that shows how it works, for your examples 1 and 2:
And here is the code (in C#, but should be easy to translate):
static bool Check(List<int> path1, List<int> path2)
{
// Translate into a problem where the first path is (0, 1, 2, ..., N-1)
var path = new List<int>();
foreach (var path2Element in path2)
path.Add(path1.IndexOf(path2Element));
var N = path.Count;
var blocked = new bool[N, N, 2];
var subspace = 0;
var currentElementIndex = 0;
var nextElementIndex = 1;
for (int step = 1; step <= N; step++)
{
var currentElement = path[currentElementIndex];
var nextElement = path[nextElementIndex];
// If we're blocked before finishing, return false
if (blocked[currentElement, nextElement, subspace])
return false;
// Mark appropriate pairs as blocked
for (int i = (currentElement + 1) % N; i != nextElement; i = (i + 1) % N)
for (int j = (nextElement + 1) % N; j != currentElement; j = (j + 1) % N)
blocked[i, j, subspace] = blocked[j, i, subspace] = true;
// Move to the next edge
currentElementIndex = (currentElementIndex + 1) % N;
nextElementIndex = (nextElementIndex + 1) % N;
// Outside -> Inside, or Inside -> Outside
subspace = 1 - subspace;
}
return true;
}
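For quickly trying the examples from the question, here is a direct Python port of the routine above (the name `can_draw` is invented for the port). It inherits the same caveat: the approach is a proposal and has not been proven correct.

```python
def can_draw(path1, path2):
    """Port of Check(): True if the second path can be drawn without crossing itself."""
    # Relabel so that the first path becomes (0, 1, ..., n-1).
    index = {label: i for i, label in enumerate(path1)}
    path = [index[label] for label in path2]
    n = len(path)
    # blocked[a][b][side]: a and b can no longer be joined on that side.
    blocked = [[[False, False] for _ in range(n)] for _ in range(n)]
    side = 0  # 0 = outside the circle, 1 = inside
    for step in range(n):
        cur, nxt = path[step], path[(step + 1) % n]
        if blocked[cur][nxt][side]:
            return False
        # Points strictly between cur and nxt (clockwise) are cut off, on this
        # side, from points strictly between nxt and cur.
        i = (cur + 1) % n
        while i != nxt:
            j = (nxt + 1) % n
            while j != cur:
                blocked[i][j][side] = blocked[j][i][side] = True
                j = (j + 1) % n
            i = (i + 1) % n
        side = 1 - side  # each shared point is a crossing, so the side alternates
    return True
```

With the arrays from the question, `can_draw([3, 4, 2, 1, 10, 7], [1, 2, 4, 10, 7, 3])` is True, while Example 2 and the four-point example from the illustration both give False.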
Old answer:
I am not sure I understood this problem correctly, but if I have, I think this can be reduced to planarity testing. I will use your example 2 for the numbers:
Create graph G1 from the first array; it has edges 2-3, 3-0, 0-10, 10-8, 8-11, 11-2
Create graph G2 from the second array; 10-2, 2-3, 3-8, 8-11, 11-0, 0-10
Create graph G whose set of edges is the union of the sets of edges of G1 and G2: 2-3, 3-0, 10-8, 8-11, 11-2, 10-2, 3-8, 11-0, 0-10
Check if G is planar.
This is if I correctly interpreted the question in the sense that the second path must not cross itself but must not cross the first path either (except for the unavoidable 1 intersection per vertex due to shared vertices). If this is not the case, then Example 2 does have solutions (note how the 11-2 and 8-10 edges are crossed by the second path).

Efficiently choose an integer distinct from all elements of a list

I have a linked list of objects each containing a 32-bit integer (and provably fewer than 2^32 such objects) and I want to efficiently choose an integer that's not present in the list, without using any additional storage (so copying them to an array, sorting the array, and choosing the minimum value not in the array would not be an option). However, the definition of the structure for list elements is under my control, so I could add (within reason) additional storage to each element as part of solving the problem. For example, I could add an extra set of prev/next pointers and merge-sort the list. Is this the best solution? Or is there a simpler or more efficient way to do it?
Given the conditions that you outline in the comments, especially your expectation of many identical values, you must expect a sparse distribution of used values.
Consequently, it might actually be best to just guess a value randomly and then check whether it coincides with a value in the list. Even if half the available value range were used (which seems extremely unlikely from your comments), you would only traverse the list twice on average. And you can drastically decrease this factor by simultaneously checking a number of guesses in one pass. Done correctly, the factor should always be close to one.
The advantage of such a probabilistic approach is that you are immune to bad sequences of values. Such sequences are always possible with range-based approaches: if you calculate the min and max of the data, you run the risk that the data contains both 0 and 2^32-1. If you sequentially subdivide an interval, you run the risk of always getting values in the middle of the interval, which can shrink it to zero in 32 steps. With a probabilistic approach, these sequences can't hurt you.
I think I would use something like four guesses for very small lists, and crank it up to roughly 16 as the size of the list approaches the limit. The high starting value is due to the fact that any such algorithm will be memory bound, i.e. your CPU has ample amounts of time to check a value while it waits for the next values to arrive from memory, so you had better make good use of that time to reduce the number of passes required.
A further optimization would instantly replace a busted guess with a new one and keep track of where the replacement happened, so that you can avoid a complete second pass through the data. Also, move the busted guess to the end of the list of guesses, so that you only need to check against the start position of the first guess in your loop to stop as early as possible.
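A minimal Python sketch of the basic multi-guess pass follows. It leaves out the replace-a-busted-guess optimization described above. The function name, the `bits` parameter (added so the idea can be tried on a small value range) and the use of a plain list in place of the linked list are all assumptions made for this example.

```python
import random


def find_unused(values, guesses=16, bits=32, rng=random):
    """Return a value in [0, 2**bits) that does not occur in `values`.

    `values` must be re-iterable (e.g. a list) and must not cover the whole
    range, otherwise this loops forever.
    """
    while True:
        # Draw several candidates and test them all in a single pass.
        candidates = {rng.randrange(1 << bits) for _ in range(guesses)}
        for v in values:
            candidates.discard(v)      # this guess is busted
            if not candidates:
                break                  # every guess busted, start a new pass
        if candidates:
            return candidates.pop()
```

Even with 200 of 256 possible values in use, a pass with 16 guesses fails only about 2% of the time, so a second pass is rarely needed.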
If you can spare one pointer in each object, you get an O(n) worst-case algorithm easily (standard divide-and-conquer):
Divide the range of possible IDs equally.
Make a singly-linked list covering each subrange.
If one subrange is empty, choose any id in it.
Otherwise repeat with the elements of the subrange with fewest elements.
Example code using two sub-ranges per iteration:
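unsigned getunusedid(element* head) {
    unsigned start = 0, stop = -1;     /* inclusive range [start, stop] */
    element* h = head;
    /* The work list (next) starts out as a copy of the main list (mainnext). */
    for(element* e = head; e; e = e->mainnext)
        e->next = e->mainnext;
    while(h) {
        element *l = 0, *r = 0;
        unsigned cl = 0, cr = 0;
        unsigned mid = start + (stop - start) / 2;
        /* Split into [start, mid] and [mid + 1, stop]. */
        while(h) {
            element* next = h->next;
            if(h->id <= mid) {
                h->next = l;
                cl++;
                l = h;
            } else {
                h->next = r;
                cr++;
                r = h;
            }
            h = next;
        }
        /* Continue with the half holding fewer elements; on a tie take the
           left half, which is never the smaller of the two. */
        if(cl <= cr) {
            h = l;
            stop = mid;
        } else {
            h = r;
            start = mid + 1;
        }
    }
    return start;                      /* [start, stop] holds no element */
}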
Some more remarks:
Beware of bugs in the above code; I have only proved it correct, not tried it.
Using more buckets (best keep to a power of 2 for easy and efficient handling) each iteration might be faster due to better data-locality (though only try and measure if it's not fast enough otherwise), as #MarkDickson rightly remarks.
Without those extra-pointers, you need full sweeps each iteration, raising the bound to O(n*lg n).
An alternative would be using 2+ extra-pointers per element to maintain a balanced tree. That would speed up id-search, at the expense of some memory and insertion/removal time overhead.
If you don't mind an O(n) scan for each change in the list and two extra bits per element, whenever an element is inserted or removed, scan through and use the two bits to represent whether an integer (element + 1) or (element - 1) exists in the list.
For example, inserting the element, 2, the extra bits for each 3 and 1 in the list would be updated to show that 3-1 (in the case of 3) and 1+1 (in the case of 1) now exist in the list.
Insertion/deletion time can be reduced by adding a pointer from each element to the next element with the same integer.
I am supposing that integers have random values not controlled by your code.
Add two unsigned integers in your list class:
unsigned int rangeMinId = 0;
unsigned int rangeMaxId = 0xFFFFFFFF ;
Or if not possible to change the List class add them as global variables.
When the list is empty you will always know that the range is free. When you add a new item to the list, check if its ID is between rangeMinId and rangeMaxId and, if so, change the nearer of the two to this ID.
It may happen after a long time that rangeMinId becomes equal to rangeMaxId-1; then you need a simple function which traverses the whole list and searches for another free range. But this will not happen very frequently.
Other solutions are more complex and involve using sets, binary trees or sorted arrays.
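The bookkeeping described above could look roughly like the Python sketch below. The class and method names are hypothetical, and the two bounds are treated as exclusive, which is an assumption based on the "rangeMinId equal to rangeMaxId-1" condition.

```python
class FreeRange:
    """Tracks one range (lo, hi), exclusive on both ends, known to be unused."""

    def __init__(self, lo=0, hi=0xFFFFFFFF):
        self.lo = lo
        self.hi = hi

    def note_insert(self, value):
        """Call whenever `value` is added to the list."""
        if self.lo < value < self.hi:
            # Move the nearer bound, which keeps the larger free part.
            if value - self.lo < self.hi - value:
                self.lo = value
            else:
                self.hi = value

    def pick(self):
        """Return an unused value, or None if the list must be rescanned."""
        if self.hi - self.lo > 1:
            return self.lo + 1
        return None
```

When `pick()` returns None, a full traversal is needed to find a new free range, as described above.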
Update:
The free range search function can be done in O(n*log(n)). An example of such a function is given below (I have not extensively tested it). The example is for an integer array but can easily be adapted for a list.
int g_Calls = 0;
bool _findFreeRange(const int* value, int n, int& left, int& right)
{
g_Calls ++ ;
int l=left, r=right,l2,r2;
int m = (right + left) / 2 ;
int nl=0, nr=0;
for(int k = 0; k < n; k++)
{
const int& i = value[k] ;
if(i > l && i < r)
{
if(i-l < r-i)
l = i;
else
r = i;
}
if(i < m)
nl ++ ;
else
nr ++ ;
}
if ( (r - l) > 1 )
{
left = l;
right = r;
return true ;
}
if( nl < nr)
{
// check first left then right
l2 = left;
r2 = m;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
l2 = m;
r2 = right;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
}
else
{
// check first right then left
l2 = m;
r2 = right;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
l2 = left;
r2 = m;
if(r2-l2 > 1 && _findFreeRange(value, n, l2, r2))
{
left = l2 ;
right = r2 ;
return true;
}
}
return false;
}
bool findFreeRange(const int* value, int n, int& left, int& right, int maxx)
{
g_Calls = 1;
left = 0;
right = maxx;
if(!_findFreeRange(value, n, left, right))
return false ;
left++;
right--;
return (right - left) >= 0 ;
}
If it returns false, the list is full and there is no free range (very unlikely). maxx is the upper limit of the range, in this case 0xFFFFFFFF.
The idea is to first search for the biggest free range in the list and then, if no free hole is found, to recursively search the subranges for holes that may have been missed during the first pass. If the list is sparsely filled, it is very unlikely that the function will be called more than once. However, when the list becomes almost completely filled, the range search can take longer. Thus, in this worst-case scenario, when the list is close to full, it's better to start keeping all free ranges in a list.
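For the last suggestion (keeping all free ranges explicitly), one possible shape is an ordered map from range start to range end. This is only a sketch under my own assumptions: the class FreeRanges and its method names are invented here and are not part of the code above.

```cpp
#include <cstdint>
#include <map>

// Hypothetical sketch: keeps every free range as start -> inclusive end.
class FreeRanges {
    std::map<std::uint32_t, std::uint32_t> free_;
public:
    // Initially the whole interval [lo, hi] is free.
    FreeRanges(std::uint32_t lo, std::uint32_t hi) { free_[lo] = hi; }

    // Marks id as used; returns false if it was not free.
    bool markUsed(std::uint32_t id) {
        auto it = free_.upper_bound(id);       // first range starting after id
        if (it == free_.begin()) return false; // no range starts at or before id
        --it;
        const std::uint32_t start = it->first, end = it->second;
        if (id > end) return false;            // id falls in a gap: already used
        free_.erase(it);
        if (start < id) free_[start] = id - 1; // left remainder stays free
        if (id < end) free_[id + 1] = end;     // right remainder stays free
        return true;
    }

    // Writes the smallest free id to out; returns false if nothing is free.
    bool firstFree(std::uint32_t& out) const {
        if (free_.empty()) return false;
        out = free_.begin()->first;
        return true;
    }
};
```

Each operation costs O(log r), where r is the number of free ranges, and no scan of the list is needed.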
This reminds me of the book Programming Pearls, and in particular the very first column, "Cracking the Oyster". What is the real problem you are trying to solve?
If your list is small, then a simple linear search to find max/min would work and it would work quickly.
When your list gets large and linear search becomes unwieldy, you can build a bitmap to represent the unused numbers for much less memory than adding 2 extra pointers at each node in the linked list. In fact, covering all 32-bit values would take only 2^32 / 8 = 2^29 bytes = 512MB of RAM, compared to your linked list being potentially >10GB.
Then to find an unused number, you can just traverse the bitmap one machine-word at a time, checking if it's non-zero. If it is, then at least one number in that 32- or 64- bit block is unused, and you can inspect the word to find out exactly which bit is set. As you add numbers to the list, all you have to do is clear the corresponding bit in the bitmap.
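A small sketch of that bitmap, assuming a set bit means "unused" as described above. The class FreeIdBitmap and its method names are hypothetical, and the universe size is a constructor parameter so the idea can be tried on small ranges before committing memory for the full 32-bit range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: one bit per possible ID; a set bit means "unused".
class FreeIdBitmap {
    std::vector<std::uint64_t> words_;
public:
    explicit FreeIdBitmap(std::size_t universe)
        : words_((universe + 63) / 64, ~std::uint64_t{0}) {
        // Clear the padding bits past the end of the universe.
        const std::size_t extra = words_.size() * 64 - universe;
        if (extra != 0) words_.back() >>= extra;
    }

    // Clear the bit when an ID is added to the list.
    void markUsed(std::size_t id) { words_[id / 64] &= ~(std::uint64_t{1} << (id % 64)); }

    // Scan one machine word at a time; a non-zero word holds a free ID.
    bool findFree(std::size_t& out) const {
        for (std::size_t w = 0; w < words_.size(); ++w) {
            if (words_[w] != 0) {
                std::size_t bit = 0;
                while (((words_[w] >> bit) & 1u) == 0) ++bit; // locate the lowest set bit
                out = w * 64 + bit;
                return true;
            }
        }
        return false;
    }
};
```

The inner loop could be replaced by a count-trailing-zeros intrinsic where one is available; the portable loop is used here to avoid assuming a particular compiler.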
One possible solution is to take the min and max of the list with a simple O(n) iteration, then pick a number between max and min + (1 << 32). This is simple to do since overflow/underflow behavior is well-defined for unsigned integers:
uint32_t min, max;
// TODO: compute min and max here
// exclude max from choice space (min will be an exclusive upper bound)
max++;
uint32_t choice = rand32() % (min - max) + max; // where rand32 is a random unsigned 32-bit integer
Of course, if it doesn't need to be random, then you can just use one more than the maximum of the list.
Note: the only case where this fails is if min is 0 and max is UINT32_MAX (aka 4294967295).
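For the non-random variant, the whole thing fits in a few lines. This is a sketch only: the function name pickUnusedId and the out-parameter convention are my own choices, not part of the snippet above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: returns an ID that is not in ids by stepping just past
// the maximum. Unsigned arithmetic wraps, so max + 1 becomes 0 when max is
// 0xFFFFFFFF, which is unused as long as the minimum is above 0.
bool pickUnusedId(const std::vector<std::uint32_t>& ids, std::uint32_t& out) {
    if (ids.empty()) { out = 0; return true; }
    const auto mm = std::minmax_element(ids.begin(), ids.end());
    const std::uint32_t lo = *mm.first, hi = *mm.second;
    if (lo == 0 && hi == 0xFFFFFFFFu) return false; // the one failing case noted above
    out = hi + 1u;
    return true;
}
```

Note that returning false does not mean the list is full, only that this particular trick cannot find a gap.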
Ok. Here is one really simple solution. Some of the answers have become too theoretical and complicated for optimization. If you need a quick solution do this:
1. In your List add a member:
unsigned int NextFreeId = 1;
Also add a std::set<unsigned int> ids;
When you add an item to the list, also add the integer to the set and keep track of NextFreeId:
int insert(unsigned int id)
{
    ids.insert(id);
    if (NextFreeId == id) // will not happen too frequently
    {
        unsigned int nextid = id, previd = id;
        while (true)
        {
            // step outward from id, clamping at both ends so nothing wraps around
            if (nextid < 0xFFFFFFFF) nextid++;
            if (previd > 0) previd--;
            if (nextid < 0xFFFFFFFF && !ids.count(nextid))
            {
                NextFreeId = nextid;
                break;
            }
            if (previd > 0 && !ids.count(previd))
            {
                NextFreeId = previd;
                break;
            }
            if (previd == 0 && nextid == 0xFFFFFFFF)
                break; // the whole range is filled, there is no free id
        }
    }
    return 1;
}
Sets are very efficient for membership checks, so each lookup costs O(log(N)). It is quick to implement. Also, the set is not searched on every insert, only when NextFreeId itself gets taken. The list is not traversed at all.

How can I find a number which occurs an odd number of times in a SORTED array in O(n) time?

I have a question and I tried to think over it again and again... but got nothing so posting the question here. Maybe I could get some view-point of others, to try and make it work...
The question is: we are given a SORTED array, which consists of a collection of values occurring an EVEN number of times, except one, which occurs an ODD number of times. We need to find that value in log n time.
It is easy to find the solution in O(n) time, but it looks pretty tricky to perform in log n time.
Theorem: Every deterministic algorithm for this problem probes Ω(log² n) memory locations in the worst case.
Proof (completely rewritten in a more formal style):
Let k > 0 be an odd integer and let n = k². We describe an adversary that forces (log₂ (k + 1))² = Ω(log² n) probes.
We call the maximal subsequences of identical elements groups. The adversary's possible inputs consist of k length-k segments x1 x2 … xk. For each segment xj, there exists an integer bj ∈ [0, k] such that xj consists of bj copies of j - 1 followed by k - bj copies of j. Each group overlaps at most two segments, and each segment overlaps at most two groups.
Group boundaries
|   |     |   |   |
 0 0 1 1 1 2 2 3 3
|     |     |     |
Segment boundaries
Wherever there is an increase of two, we assume a double boundary by convention.
Group boundaries
|     ||      |   |
 0 0 0 2 2 2 2 3 3
Claim: The location of the jth group boundary (1 ≤ j ≤ k) is uniquely determined by the segment xj.
Proof: It's just after the ((j - 1) k + bj)th memory location, and xj uniquely determines bj. //
We say that the algorithm has observed the jth group boundary in case the results of its probes of xj uniquely determine xj. By convention, the beginning and the end of the input are always observed. It is possible for the algorithm to uniquely determine the location of a group boundary without observing it.
Group boundaries
|   X   |   |     |
 0 0 ? 1 2 2 3 3 3
|     |     |     |
Segment boundaries
Given only 0 0 ?, the algorithm cannot tell for sure whether ? is a 0 or a 1. In context, however, ? must be a 1, as otherwise there would be three odd groups, and the group boundary at X can be inferred. These inferences could be problematic for the adversary, but it turns out that they can be made only after the group boundary in question is "irrelevant".
Claim: At any given point during the algorithm's execution, consider the set of group boundaries that it has observed. Exactly one consecutive pair is at odd distance, and the odd group lies between them.
Proof: Every other consecutive pair bounds only even groups. //
Define the odd-length subsequence bounded by the special consecutive pair to be the relevant subsequence.
Claim: No group boundary in the interior of the relevant subsequence is uniquely determined. If there is at least one such boundary, then the identity of the odd group is not uniquely determined.
Proof: Without loss of generality, assume that each memory location not in the relevant subsequence has been probed and that each segment contained in the relevant subsequence has exactly one location that has not been probed. Suppose that the jth group boundary (call it B) lies in the interior of the relevant subsequence. By hypothesis, the probes to xj determine B's location up to two consecutive possibilities. We call the one at odd distance from the left observed boundary odd-left and the other odd-right. For both possibilities, we work left to right and fix the location of every remaining interior group boundary so that the group to its left is even. (We can do this because they each have two consecutive possibilities as well.) If B is at odd-left, then the group to its left is the unique odd group. If B is at odd-right, then the last group in the relevant subsequence is the unique odd group. Both are valid inputs, so the algorithm has uniquely determined neither the location of B nor the odd group. //
Example:
Observed group boundaries; relevant subsequence marked by […]
[             ]   |
 0 0 Y 1 1 Z 2 3 3
|     |     |     |
Segment boundaries
Possibility #1: Y=0, Z=2
Possibility #2: Y=1, Z=2
Possibility #3: Y=1, Z=1
As a consequence of this claim, the algorithm, regardless of how it works, must narrow the relevant subsequence to one group. By definition, it therefore must observe some group boundaries. The adversary now has the simple task of keeping open as many possibilities as it can.
At any given point during the algorithm's execution, the adversary is internally committed to one possibility for each memory location outside of the relevant subsequence. At the beginning, the relevant subsequence is the entire input, so there are no initial commitments. Whenever the algorithm probes an uncommitted location of xj, the adversary must commit to one of two values: j - 1, or j. If it can avoid letting the jth boundary be observed, it chooses a value that leaves at least half of the remaining possibilities (with respect to observation). Otherwise, it chooses so as to keep at least half of the groups in the relevant interval and commits values for the others.
In this way, the adversary forces the algorithm to observe at least log₂ (k + 1) group boundaries, and in observing the jth group boundary, the algorithm is forced to make at least log₂ (k + 1) probes.
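For readers who want the final arithmetic spelled out, the two factors combine as follows. This is just the step implied by n = k² written in LaTeX; nothing new is assumed.

```latex
\[
n = k^2 \;\Rightarrow\; \log_2 (k+1) > \log_2 k = \tfrac{1}{2}\log_2 n,
\qquad
\bigl(\log_2 (k+1)\bigr)^2 > \tfrac{1}{4}\,(\log_2 n)^2 = \Omega(\log^2 n).
\]
```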
Extensions:
This result extends straightforwardly to randomized algorithms by randomizing the input, replacing "at best halved" (from the algorithm's point of view) with "at best halved in expectation", and applying standard concentration inequalities.
It also extends to the case where no group can be larger than s copies; in this case the lower bound is Ω(log n log s).
A sorted array suggests a binary search. We have to redefine equality and comparison. Equality simply means that the group has an odd number of elements. We can do comparison by observing the index of the first or last element of the group. The first element will be at an even index (0-based) before the odd group, and at an odd index after the odd group. We can find the first and last elements of a group using binary search. The total cost is O((log N)²).
PROOF OF O((log N)²)
T(2) = 1 //to make the summation nice
T(N) = log(N) + T(N/2) //log(N) is finding the first/last elements
For some N=2^k,
T(2^k) = (log 2^k) + T(2^(k-1))
= (log 2^k) + (log 2^(k-1)) + T(2^(k-2))
= (log 2^k) + (log 2^(k-1)) + (log 2^(k-2)) + ... + (log 2^2) + 1
= k + (k-1) + (k-2) + ... + 1
= k(k+1)/2
= (k² + k)/2
= (log(N)² + log(N))/ 2
= O(log(N)²)
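As a rough illustration of this approach, here is a sketch that uses std::equal_range for the inner binary searches. The function name findOddOccurrence is made up, and the code assumes a non-empty sorted input in which exactly one value occurs an odd number of times.

```cpp
#include <algorithm>
#include <vector>

// Sketch only. Precondition (assumed, not checked): a is sorted, non-empty,
// and exactly one value occurs an odd number of times.
int findOddOccurrence(const std::vector<int>& a) {
    auto lo = a.begin();
    auto hi = a.end(); // the odd group always lies inside [lo, hi)
    while (true) {
        auto mid = lo + (hi - lo) / 2;
        // Two binary searches: first and one-past-last position of *mid's group.
        auto group = std::equal_range(lo, hi, *mid);
        if ((group.second - group.first) % 2 != 0)
            return *mid;                        // odd-sized group: found it
        if ((group.first - a.begin()) % 2 == 0)
            lo = group.second;                  // even start index: odd group is to the right
        else
            hi = group.first;                   // odd start index: odd group is to the left
    }
}
```

Each iteration discards at least half of the remaining range and costs O(log N) for the group lookup, which matches the recurrence above.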
Look at the middle element of the array. With a couple of appropriate binary searches, you can find its first and last appearance in the array. E.g., if the middle element is 'a', you need to find i and j as shown below:
[* * * * a a a a * * *]
         ^     ^
         |     |
         |     |
         i     j
Is j - i an even number? You are done! Otherwise (and this is the key here), the question to ask is: is i an even or an odd number? Do you see what this piece of knowledge implies? Then the rest is easy.
This answer is in support of the answer posted by "throwawayacct". He deserves the bounty. I spent some time on this question and I'm totally convinced that his proof is correct that you need Ω(log(n)^2) queries to find the number that occurs an odd number of times. I'm convinced because I ended up recreating the exact same argument after only skimming his solution.
In the solution, an adversary creates an input to make life hard for the algorithm, but also simple for a human analyzer. The input consists of k pages that each have k entries. The total number of entries is n = k^2, and it is important that O(log(k)) = O(log(n)) and Ω(log(k)) = Ω(log(n)). To make the input, the adversary makes a string of length k of the form 00...011...1, with the transition in an arbitrary position. Then each symbol in the string is expanded into a page of length k of the form aa...abb...b, where on the ith page, a=i and b=i+1. The transition on each page is also in an arbitrary position, except that the parity agrees with the symbol that the page was expanded from.
It is important to understand the "adversary method" of analyzing an algorithm's worst case. The adversary answers queries about the algorithm's input, without committing to future answers. The answers have to be consistent, and the game is over when the adversary has been pinned down enough for the algorithm to reach a conclusion.
With that background, here are some observations:
1) If you want to learn the parity of a transition in a page by making queries in that page, you have to learn the exact position of the transition and you need Ω(log(k)) queries. Any collection of queries restricts the transition point to an interval, and any interval of length more than 1 has both parities. The most efficient search for the transition in that page is a binary search.
2) The most subtle and most important point: There are two ways to determine the parity of a transition inside a specific page. You can either make enough queries in that page to find the transition, or you can infer the parity if you find the same parity in both an earlier and a later page. There is no escape from this either-or. Any set of queries restricts the transition point in each page to some interval. The only restriction on parities comes from intervals of length 1. Otherwise the transition points are free to wiggle to have any consistent parities.
3) In the adversary method, there are no lucky strikes. For instance, suppose that your first query in some page is toward one end instead of in the middle. Since the adversary hasn't committed to an answer, he's free to put the transition on the long side.
4) The end result is that you are forced to directly probe the parities in Ω(log(k)) pages, and the work for each of these subproblems is also Ω(log(k)).
5) Things are not much better with random choices than with adversarial choices. The math is more complicated, because now you can get partial statistical information, rather than a strict yes you know a parity or no you don't know it. But it makes little difference. For instance, you can give each page length k^2, so that with high probability, the first log(k) queries in each page tell you almost nothing about the parity in that page. The adversary can make random choices at the beginning and it still works.
Start at the middle of the array and walk backward until you get to a value that's different from the one at the center. Check whether the number above that boundary is at an odd or even index. If it's odd, then the number occurring an odd number of times is to the left, so repeat your search between the beginning and the boundary you found. If it's even, then the number occurring an odd number of times must be later in the array, so repeat the search in the right half.
As stated, this has both a logarithmic and a linear component. If you want to keep the whole thing logarithmic, instead of just walking backward through the array to a different value, you want to use a binary search instead. Unless you expect many repetitions of the same numbers, the binary search may not be worthwhile though.
I have an algorithm which works in log(N/C)*log(K), where K is the length of maximum same-value range, and C is the length of range being searched for.
The main difference of this algorithm from most posted before is that it takes advantage of the case where all same-value ranges are short. It finds boundaries not by binary-searching the entire array, but by first quickly finding a rough estimate by jumping back by 1, 2, 4, 8, ... (log(K) iterations) steps, and then binary-searching the resulting range (log(K) again).
The algorithm is as follows (written in C#):
// Finds the start of the range of equal numbers containing the index "index",
// which is assumed to be inside the array
//
// Complexity is O(log(K)) with K being the length of range
static int findRangeStart (int[] arr, int index)
{
int candidate = index;
int value = arr[index];
int step = 1;
// find the boundary for binary search:
while(candidate>=0 && arr[candidate] == value)
{
candidate -= step;
step *= 2;
}
// binary search:
int a = Math.Max(-1,candidate); // exclusive lower bound; -1 when the range may start at index 0
int b = candidate+step/2;
while(a+1!=b)
{
int c = (a+b)/2;
if(arr[c] == value)
b = c;
else
a = c;
}
return b;
}
// Finds the index after the only "odd" range of equal numbers in the array.
// The result should be in the range (start; end]
// The "end" is considered to always be the end of some equal number range.
static int search(int[] arr, int start, int end)
{
if(arr[start] == arr[end-1])
return end;
int middle = (start+end)/2;
int rangeStart = findRangeStart(arr,middle);
if((rangeStart & 1) == 0)
return search(arr, middle, end);
return search(arr, start, rangeStart);
}
// Finds the index after the only "odd" range of equal numbers in the array
static int search(int[] arr)
{
return search(arr, 0, arr.Length);
}
Take the middle element e. Use binary search to find its first and last occurrence. O(log(n))
If the number of occurrences is odd, return e.
Otherwise, recurse into the side that has an odd number of elements [....]eeee[....]
Runtime will be log(n) + log(n/2) + log(n/4).... = O(log(n)^2).
AHhh. There is an answer.
Do a binary search and as you search, for each value, move backwards until you find the first entry with that same value. If its index is even, it is before the oddball, so move to the right.
If its array index is odd, it is after the oddball, so move to the left.
In pseudocode (this is the general idea, not tested...):
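private static int FindOddBall(int[] ary)
{
    int l = 0,
        r = ary.Length;   // the oddball's group lies in [l, r); l is always even
    while (true)
    {
        int n = (l + r) / 2;
        int first = FindBreakIndex(ary, l, n, ary[n]);             // first index holding ary[n]
        int next = FindBreakIndex(ary, n + 1, r, (long)ary[n] + 1); // first index holding a larger value
        if ((next - first) % 2 == 1)
            return ary[n];   // this group has odd size: it is the oddball
        if (first % 2 == 0)  // even index: we are to the left of the oddball
            l = next;
        else                 // odd index: we are to the right of the oddball
            r = first;
    }
}
// Binary search in the sorted range [lo, hi) for the first index whose value is >= t;
// returns hi if there is none.
private static int FindBreakIndex(int[] ary, int lo, int hi, long t)
{
    while (lo < hi)
    {
        int m = (lo + hi) / 2;
        if (ary[m] < t)
            lo = m + 1;
        else
            hi = m;
    }
    return lo;
}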
You can use this algorithm:
int GetSpecialOne(int[] array, int length)
{
int specialOne = array[0];
for(int i=1; i < length; i++)
{
specialOne ^= array[i];
}
return specialOne;
}
Solved with the help of a similar question which can be found here on http://www.technicalinterviewquestions.net
We don't have any information about the distribution of lengths inside the array, or of the array as a whole, right?
So the arraylength might be 1, 11, 101, 1001 or something, 1 at least with no upper bound, and must contain at least 1 type of elements ('number') up to (length-1)/2 + 1 elements, for total sizes of 1, 11, 101: 1, 1 to 6, 1 to 51 elements and so on.
Shall we assume every possible size of equal probability? This would lead to a middle length of subarrays of size/4, wouldn't it?
An array of size 5 could be divided into 1, 2 or 3 sublists.
What seems to be obvious is not that obvious, if we go into details.
An array of size 5 can be 'divided' into one sublist in just one way, with arguable right to call it 'dividing'. It's just a list of 5 elements (aaaaa). To avoid confusion let's assume the elements inside the list to be ordered characters, not numbers (a,b,c, ...).
Divided into two sublists, they might be (1, 4), (2, 3), (3, 2), (4, 1). (abbbb, aabbb, aaabb, aaaab).
Now let's look back at the claim made before: Shall the 'division' (5) be assumed the same probability as those 4 divisions into 2 sublists? Or shall we mix them together, and assume every partition as evenly probable, (1/5)?
Or can we calculate the solution without knowing the probability of the length of the sublists?
The clue is you're looking for log(n). That's less than n.
Stepping through the entire array, one at a time? That's n. That's not going to work.
We know the first two indexes in the array (0 and 1) should be the same number. Same with 50 and 51, if the odd number in the array is after them.
So find the middle element in the array, compare it to the element right after it. If the change in numbers happens on the wrong index, we know the odd number in the array is before it; otherwise, it's after. With one set of comparisons, we figure out which half of the array the target is in.
Keep going from there.
Use a hash table
For each element E in the input set
    if E is set in the hash table
        increment its value
    else
        set E in the hash table and initialize it to 1
For each key K in hash table
    if the value stored for K is odd
        return K
As this algorithm takes about 2n steps, it belongs to O(n).
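The pseudocode above maps almost directly onto std::unordered_map. The following is a minimal sketch; the function name findOddByCounting and the bool/out-parameter convention are my own, and note that it does not rely on the input being sorted.

```cpp
#include <unordered_map>
#include <vector>

// Sketch: count every value, then report one whose count is odd.
// Returns false if no value occurs an odd number of times.
bool findOddByCounting(const std::vector<int>& values, int& out) {
    std::unordered_map<int, int> counts;
    for (int v : values)
        ++counts[v];                 // a missing key starts at 0, so the first hit gives 1
    for (const auto& entry : counts) {
        if (entry.second % 2 == 1) { // test the stored count, not the key
            out = entry.first;
            return true;
        }
    }
    return false;
}
```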
Try this:
int getOddOccurrence(int ar[], int ar_size)
{
    int i;
    int res = 0;
    for (i = 0; i < ar_size; i++)
        res = res ^ ar[i];
    return res;
}
XOR cancels out every time you XOR with the same number, so 1^1=0 but 1^1^1=1; every pair cancels out, leaving only the number that occurs an odd number of times.
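Assume indexing starts at 0. Binary search for the smallest even i such that x[i] != x[i+1]; your answer is x[i]. Note that the binary search is only valid when every other value occurs exactly twice: with longer even runs, x[i] == x[i+1] can also hold at an even i after the odd group, so the test no longer tells you which side you are on.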
edit: due to public demand, here is the code
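/* x[min..max] (inclusive, min even) holds an odd number of sorted elements;
   every value except one is assumed to occur exactly twice */
int f(int *x, int min, int max) {
    int lo = min / 2;   /* pair i covers x[2*i] and x[2*i+1] */
    int hi = max / 2;   /* the last slot, x[2*hi], has no partner */
    while (lo < hi) {
        int i = (lo + hi) / 2;
        if (x[2*i] == x[2*i+1])
            lo = i + 1;   /* pair intact: the odd element is to the right */
        else
            hi = i;       /* pair broken: the odd element is here or to the left */
    }
    return x[2*lo];
}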

Resources