Overhead effect in Iterative Deepening by increasing branching factor and depth - artificial-intelligence

I am studying Iterative Deepening from this link. My main concern is with the overhead. That link says that
The higher the branching factor, the lower the overhead of repeatedly expanded states.
No explanation or convincing argument is given there for this statement. I am looking for the reasoning behind it, because I would expect the overhead to increase as the branching factor increases: a larger branching factor means more nodes, so how can the overhead be lower?
So far I have not found anything reasonable and helpful. If someone can help me correct my understanding, I would be thankful.

The answer to your question is in the formula just above that statement. The total number of node expansions done by iterative deepening is
db + (d-1)b^{2} + \cdots + 3b^{d-2} + 2b^{d-1} + b^{d}
The cost of doing a BFS (which is what you should compare against) is
b + b^{2} + \cdots + b^{d-2} + b^{d-1} + b^{d}
Hence the overhead is
(d-1)b + (d-2)b^{2} + \cdots + 2b^{d-2} + 1 \cdot b^{d-1}
Obviously this is heavily influenced by the branching factor, especially the last term b^{d-1}: its ratio to the dominant term b^{d} of the search itself is 1/b, so the larger b is, the smaller the overhead becomes relative to the total work.
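To see this numerically, here is a small sketch of my own that evaluates both sums for a few branching factors; the relative overhead shrinks roughly like 1/b:

```python
def bfs_cost(b, d):
    """Nodes generated by a single search to depth d: b + b^2 + ... + b^d."""
    return sum(b**i for i in range(1, d + 1))

def ids_cost(b, d):
    """Nodes generated by iterative deepening: d*b + (d-1)*b^2 + ... + b^d."""
    return sum(bfs_cost(b, limit) for limit in range(1, d + 1))

d = 5
for b in (2, 5, 10):
    ids, bfs = ids_cost(b, d), bfs_cost(b, d)
    print(f"b={b:2d}  IDS={ids:8d}  BFS={bfs:8d}  relative overhead={(ids - bfs) / bfs:.1%}")
# b=2 -> ~84% overhead, b=10 -> ~11% overhead
```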

Related

What is the time complexity of this algorithmic problem?

*A search method has time complexity O(n^2), where n is the number of states in the space to be searched. If it takes 1 second to search a space of a thousand states, roughly how long will it take to search a space of a million states?*
I found that it is approximately 12 days, but I think the way I got there is questionable.
I scaled by the square of the size ratio to get 10^6 seconds, then divided by 86,400 (seconds in a day) and got about 11.57, so approximately 12 days. Is there a better and more rigorous way to do this?
There is not nearly enough information to answer this question. See Big-O description.
O(N^2) means only that the algorithm's execution time will be dominated by an N^2 term. As N grows large, the ratio between two execution times will asymptotically approach the square of their ratios. It says nothing about the execution time for particular values.
Let's keep this simple, assuming some set-up overhead: an O(N) array initialization plus a constant for system start-up. This makes the execution time
t = a * N^2 + b * N + c
for some values of a, b, and c. Even if we know that this is the equation form, we do not have enough information to solve given only one (t, N) data point. We don't know enough to derive t for N= 10^6.
I suspect that whoever posed this problem is looking for the (strictly invalid) solution, making the unwarranted assumption that N = 1000 has already reduced all smaller terms to insignificance. In this case, simply scale up by the square of the size ratio:
N1 / N2 = 10^6 / 10^3 = 10^3
Scale up by N^2, or (10^3)^2 = 10^6
That gives you 10^6 seconds, or roughly 11.6 days; I'll leave the exact arithmetic to you.
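For what it's worth, here is a sketch of that "intended" (but strictly unwarranted) calculation written out:

```python
t_small = 1.0                 # seconds to search 10^3 states (given)
n_small, n_large = 1e3, 1e6

t_large = t_small * (n_large / n_small) ** 2   # scale by the square of the size ratio
print(t_large, "seconds")                      # 1e6 seconds
print(t_large / 86_400, "days")                # about 11.6 days
```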

Fast algorithm mapping int to monotonically increasing int subset

I have encountered variations of this problem multiple times, and most recently it became a bottleneck in my arithmetic coder implementation. Given N (<= 256) segments of known non-negative size S_i laid out in order starting from the origin, and a given x, I want to find n such that
S_0 + S_1 + ... + S_{n-1} <= x < S_0 + S_1 + ... + S_n
The catch is that lookups and updates are done at about the same frequency, and almost every update is in the form of increasing the size of a segment by 1. Also, the bigger a segment, the higher the probability it will be looked up or updated again.
Some sort of tree seems like the obvious approach, but I have been unable to come up with any tree implementation that satisfactorily takes advantage of the known domain-specific details.
Given the relatively small size of N, I also tried linear approaches, but they turned out to be considerably slower than a naive binary tree (even after some optimization, like starting from the back of the list for values above half the total).
Similarly, I tested introducing an intermediate step that remaps values in such a way as to keep segments ordered by size, to make access faster for the most frequently used, but the added overhead exceeded gains.
Sorry for the unclear title -- despite it being a fairly basic problem, I am not aware of any specific names for it.
I suppose some BST would do... You may try to add a new numeric member (int or long) to each node that keeps the sum of the values of all its left descendants. Then you'll find each item in approximately logarithmic time, and once an item is added, removed, or modified you only have to update its ancestors on the path back up from the recursion. You may apply a self-balancing or self-organizing tree structure, for example AVL to keep the worst-case search optimal, or a splay tree to optimize searches for the most often used items. Take care to update the left-subtree sums during rebalancing or splaying.
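For the same O(log N) bounds, a Fenwick (binary indexed) tree is a compact way to get the prefix-sum bookkeeping described above; this is a rough sketch of my own, not part of the answer, and it ignores the self-organizing aspect:

```python
class Fenwick:
    """Prefix sums over N segment sizes; add(i, 1) bumps segment i, and
    find(x) returns n with S_0 + ... + S_{n-1} <= x < S_0 + ... + S_n
    (assuming 0 <= x < total size)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)          # 1-based internal array

    def add(self, i, delta):
        """Increase the size of segment i (0-based) by delta."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def find(self, x):
        """Locate the segment containing offset x by binary lifting."""
        pos, bit = 0, 1 << self.n.bit_length()
        while bit:
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] <= x:
                x -= self.tree[nxt]
                pos = nxt
            bit >>= 1
        return pos                          # 0-based segment index

# Example: sizes [3, 2, 5]; offset 4 falls in segment 1.
f = Fenwick(3)
for i, s in enumerate([3, 2, 5]):
    f.add(i, s)
print(f.find(4))                            # -> 1
```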
You could use a binary tree where each node n contains two integers A_n and U_n, where initially
A_n = S_0 + ... + S_n and U_n = 0.
Let T_n = S_0 + ... + S_n denote the prefix sum at any later time.
When looking for the place of a query x, you walk down the tree, knowing that for each node m the current value of T_m is
A_m + U_m + sum_{p : ancestors of m such that m lies in the right subtree of p} U_p.
This solves look up in O(log(N)).
To update the n-th interval (increasing its size by y), you just look it up in the tree, increasing U_m by y for each node m on the search path at which you stop or turn left (i.e. every visited m with m >= n, so that every prefix sum from n onward is raised by y). This also solves update in O(log(N)).

Neural Network Architecture Design

I'm playing around with Neural Networks trying to understand the best practices for designing their architecture based on the kind of problem you need to solve.
I generated a very simple data set composed of a single convex region as you can see below:
Everything works fine when I use an architecture with L = 1, or L = 2 hidden layers (plus the output layer), but as soon as I add a third hidden layer (L = 3) my performance drops down to slightly better than chance.
I know that the more complexity you add to a network (number of weights and parameters to learn) the more you tend to go towards over-fitting your data, but I believe this is not the nature of my problem for two reasons:
my performance on the Training set is also around 60% (whereas over-fitting typically means a very low training error and a high test error),
and I have a very large number of data examples (don't judge by the figure, it's only a toy figure I uploaded).
Can anybody help me understand why adding an extra hidden layer gives me this drop in performance on such a simple task?
Here is an image of my performance as a function of the number of layers used:
ADDED PART DUE TO COMMENTS:
I am using a sigmoid function taking values between 0 and 1: L(s) = 1 / (1 + exp(-s)).
I am stopping early (after 40000 iterations of backprop) as the criterion to end learning. I know it is not the best way to stop, but I thought it would be OK for such a simple classification task; if you believe this is the main reason I'm not converging, I might implement a better criterion.
At least on the surface of it, this appears to be a case of the so-called "vanishing gradient" problem.
Activation functions
Your neurons activate according to the logistic sigmoid function, f(x) = 1 / (1 + e^-x):
This activation function is used frequently because it has several nice properties. One of these nice properties is that the derivative of f(x) is expressible computationally using the value of the function itself, as f'(x) = f(x)(1 - f(x)). This derivative has a sizable value for x near zero, but quickly goes to zero as |x| gets large.
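A quick numerical illustration of my own (not part of the original answer) of how fast this derivative collapses:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (0.0, 2.0, 5.0, 10.0):
    f = logistic(x)
    print(f"x = {x:4.1f}   f(x) = {f:.5f}   f'(x) = {f * (1 - f):.6f}")
# f'(0) = 0.25, f'(5) ~ 0.0066, f'(10) ~ 0.000045
```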
Gradient descent
In a feedforward neural network with logistic activations, the error is typically propagated backwards through the network using the first derivative as a learning signal. The usual update for a weight in your network is proportional to the error attributable to that weight times the current weight value times the derivative of the logistic function.
delta_w(w) ~= w * f'(err(w)) * err(w)
As the product of three potentially very small values, the first derivative in such networks can become small very rapidly if the weights in the network fall outside the "middle" regime of the logistic function's derivative. In addition, this rapidly vanishing derivative becomes exacerbated by adding more layers, because the error in a layer gets "split up" and partitioned out to each unit in the layer. This, in turn, further reduces the gradient in layers below that.
In networks with more than, say, two hidden layers, this can become a serious problem for training the network, since the first-order gradient information will lead you to believe that the weights cannot usefully change.
However, there are some solutions that can help! The ones I can think of involve changing your learning method to use something more sophisticated than first-order gradient descent, generally incorporating some second-order derivative information.
Momentum
The simplest way to approximate some second-order information is to include a momentum term in your network parameter updates. Instead of updating parameters using:
w_new = w_old - learning_rate * delta_w(w_old)
incorporate a momentum term:
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old)
w_new = w_old + w_dir_new
Intuitively, you want to use information from past derivatives to help determine whether you want to follow the new derivative entirely (which you can do by setting mu = 0), or to keep going in the direction you were heading on the previous update, tempered by the new gradient information (by setting mu > 0).
You can actually do even better than this by using "Nesterov's Accelerated Gradient":
w_dir_new = mu * w_dir_old - learning_rate * delta_w(w_old + mu * w_dir_old)
w_new = w_old + w_dir_new
I think the idea here is that instead of computing the derivative at the "old" parameter value w, compute it at what would be the "new" setting for w if you went ahead and moved there according to a standard momentum term. Read more in a neural-networks context here (PDF).
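For concreteness, here is a minimal sketch of my own of the two update rules above, written for a generic parameter vector w and gradient function grad; the learning rate, mu, and the toy objective are illustrative values only:

```python
import numpy as np

def descend(w, grad, lr=0.01, mu=0.9, steps=100, nesterov=False):
    """Gradient descent with classical momentum or Nesterov's variant."""
    v = np.zeros_like(w)                      # w_dir in the formulas above
    for _ in range(steps):
        lookahead = w + mu * v if nesterov else w
        v = mu * v - lr * grad(lookahead)     # update the direction
        w = w + v                             # take the step
    return w

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w0 = np.array([5.0, -3.0])
print(descend(w0, grad=lambda w: 2 * w))                  # classical momentum
print(descend(w0, grad=lambda w: 2 * w, nesterov=True))   # Nesterov
```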
Hessian-Free
The textbook way to incorporate second-order gradient information into your neural network training algorithm is to use Newton's Method to compute the first and second order derivatives of your objective function with respect to the parameters. However, the second order derivative, called the Hessian matrix, is often extremely large and prohibitively expensive to compute.
Instead of computing the entire Hessian, some clever research in the past few years has indicated a way to compute just the values of the Hessian in a particular search direction. You can then use this process to identify a better parameter update than just the first-order gradient.
You can learn more about this by reading through a research paper (PDF) or looking at a sample implementation.
Others
There are many other optimization methods that could be useful for this task -- conjugate gradient (PDF -- definitely worth a read), Levenberg-Marquardt (PDF), L-BFGS -- but from what I've seen in the research literature, momentum and Hessian-free methods seem to be the most common ones.
Because the number of iterations of training required for convergence increases as you add complexity to a neural network, holding the length of training constant while adding layers to a neural network will certainly result in you eventually observing a drop like this. To figure out whether that is the explanation for this particular observation, try increasing the number of iterations of training that you're using and see if it improves. Using a more intelligent stopping criterion is also a good option, but a simple increase in the cut-off will give you answers faster.

What is the difference between Q-learning and SARSA?

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (to me) to see any difference between these two algorithms.
According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (for state s and action a at timestep t), i.e. Q(s_t, a_t), can be updated as follows
Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
On the other hand, the update step for the Q-learning algorithm is the following
Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*max_a Q(s_{t+1}, a) - Q(s_t, a_t))
which can also be written as
Q(s_t, a_t) = (1 - α) * Q(s_t, a_t) + α * (r_t + γ*max_a Q(s_{t+1}, a))
where γ (gamma) is the discount factor and r_t is the reward received from the environment at timestep t.
Is the difference between these two algorithms the fact that SARSA only looks up the next policy value while Q-learning looks up the next maximum policy value?
TLDR (and my own answer)
Thanks to all those who have answered this question since I first asked it. I've made a GitHub repo playing with Q-Learning and empirically understood what the difference is. It all comes down to how you select your next best action, which from an algorithmic standpoint can be a mean, max, or best action depending on how you choose to implement it.
The other main difference is when this selection happens (e.g., online vs. offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences.
One last important note is that both Sutton & Barto and Wikipedia often have mixed, confusing, or wrong formulaic representations with regard to the next-state best/max action and reward: the r(t+1) there is in fact r(t).
When I was learning this part, I found it very confusing too, so I put together the two pseudo-codes from R. Sutton and A. G. Barto hoping to make the difference clearer.
Blue boxes highlight the part where the two algorithms actually differ. Numbers highlight the more detailed difference to be explained later.
TL;DR:
| | SARSA | Q-learning |
|:-----------:|:-----:|:----------:|
| Choosing A' | π | π |
| Updating Q | π | μ |
where π is an ε-greedy policy (e.g. ε > 0, with exploration) and μ is a greedy policy (e.g. ε == 0, NO exploration).
Q-learning uses different policies for choosing the next action A' and for updating Q. In other words, it tries to evaluate π while following another policy μ, so it is an off-policy algorithm.
In contrast, SARSA uses π all the time, hence it is an on-policy algorithm.
More detailed explanation:
The most important difference between the two is how Q is updated after each action. SARSA uses Q' following an ε-greedy policy exactly, as A' is drawn from it. In contrast, Q-learning uses the maximum Q' over all possible actions for the next step. This makes it look like it is following a greedy policy with ε = 0, i.e. NO exploration in this part.
However, when actually taking an action, Q-learning still uses the action taken from an ε-greedy policy. This is why "Choose A ..." is inside the repeat loop.
Following the loop logic in Q-learning, A' is still from the ε-greedy policy.
Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-Learning does so relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-Learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-Learning is not guaranteed to converge when combined with linear approximation.
In practical terms, under the ε-greedy policy, Q-Learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes the difference between Q(s,a) and the weighted sum of the average action value and the maximum:
Q-Learning: Q(s_{t+1}, a_{t+1}) = max_a Q(s_{t+1}, a)
SARSA: Q(s_{t+1}, a_{t+1}) = ε·mean_a Q(s_{t+1}, a) + (1-ε)·max_a Q(s_{t+1}, a)
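To make the two update targets concrete, here is a small tabular sketch of my own (the Q array, the epsilon_greedy helper, and the hyperparameters are illustrative, not from the book); note that plain SARSA samples A' rather than averaging over it:

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Random action with probability eps, otherwise the greedy one."""
    return rng.integers(Q.shape[1]) if rng.random() < eps else int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # target bootstraps from the action actually chosen by the behaviour policy
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # target bootstraps from the greedy (max) action, whatever will actually be taken
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
```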
What is the difference mathematically?
As is already described in most other answers, the difference between the two updates mathematically is indeed that, when updating the Q-value for a state-action pair (S_t, A_t):
Sarsa uses the behaviour policy (meaning, the policy used by the agent to generate experience in the environment, which is typically epsilon-greedy) to select an additional action A_{t+1}, and then uses Q(S_{t+1}, A_{t+1}) (discounted by gamma) as expected future returns in the computation of the update target.
Q-learning does not use the behaviour policy to select an additional action A_{t+1}. Instead, it estimates the expected future returns in the update rule as max_A Q(S_{t+1}, A). The max operator used here can be viewed as "following" the completely greedy policy. The agent is not actually following the greedy policy though; it only says, in the update rule, "suppose that I would start following the greedy policy from now on, what would my expected future returns be then?".
What does this mean intuitively?
As mentioned in other answers, the difference described above means, using technical terminology, that Sarsa is an on-policy learning algorithm, and Q-learning is an off-policy learning algorithm.
In the limit (given an infinite amount of time to generate experience and learn), and under some additional assumptions, this means that Sarsa and Q-learning converge to different solutions / "optimal" policies:
Sarsa will converge to a solution that is optimal under the assumption that we keep following the same policy that was used to generate the experience. This will often be a policy with some element of (rather "stupid") randomness, like epsilon-greedy, because otherwise we are unable to guarantee that we'll converge to anything at all.
Q-Learning will converge to a solution that is optimal under the assumption that, after generating experience and training, we switch over to the greedy policy.
When to use which algorithm?
An algorithm like Sarsa is typically preferable in situations where we care about the agent's performance during the process of learning / generating experience. Consider, for example, that the agent is an expensive robot that will break if it falls down a cliff. We'd rather not have it fall down too often during the learning process, because it is expensive. Therefore, we care about its performance during the learning process. However, we also know that we need it to act randomly sometimes (e.g. epsilon-greedy). This means that it is highly dangerous for the robot to be walking alongside the cliff, because it may decide to act randomly (with probability epsilon) and fall down. So, we'd prefer it to quickly learn that it's dangerous to be close to the cliff; even if a greedy policy would be able to walk right alongside it without falling, we know that we're following an epsilon-greedy policy with randomness, and we care about optimizing our performance given that we know that we'll be stupid sometimes. This is a situation where Sarsa would be preferable.
An algorithm like Q-learning would be preferable in situations where we do not care about the agent's performance during the training process, but we just want it to learn an optimal greedy policy that we'll switch to eventually. Consider, for example, that we play a few practice games (where we don't mind losing due to randomness sometimes), and afterwards play an important tournament (where we'll stop learning and switch over from epsilon-greedy to the greedy policy). This is where Q-learning would be better.
There's an index mistake in your formula for Q-Learning.
On page 148 of Sutton and Barto's book, the update reads:
Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]
The typo is in the argument of the max: the indexes are s_{t+1} and a, while in your question they are s_{t+1} and a_{t+1} (those are correct for SARSA).
Hope this helps a bit.
In Q-Learning, this:
Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + γ * max Q(S_{t+1}, A_t) - Q(S_t, A_t) ]
should be changed to
Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + γ * max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]
As you said, you have to find the maximum Q-value for the update equation by varying a; then you will have a new Q(S_t, A_t). Be careful: the a that gives you the maximum Q-value is not the next action. At this stage you only know the next state (S_{t+1}), and before going to the next round you update S_t with S_{t+1} (S_t <- S_{t+1}).
For each step of the loop:
choose A_t from S_t using the Q-values,
take A_t and observe R_{t+1} and S_{t+1},
update the Q-value using the equation above,
set S_t <- S_{t+1},
until S_t is terminal.
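Here is a minimal runnable version of that loop, as a sketch of my own; env is a hypothetical Gym-style environment with reset() returning a state and step(a) returning (next_state, reward, done):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:                                   # until S_t is terminal
            # choose A_t from S_t, epsilon-greedily w.r.t. the current Q
            a = (rng.integers(n_actions) if rng.random() < eps
                 else int(np.argmax(Q[s])))
            s_next, r, done = env.step(a)                 # observe R_{t+1}, S_{t+1}
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])         # max over a, not the action taken next
            s = s_next                                    # S_t <- S_{t+1}
    return Q
```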
The only difference between SARSA and Q-learning is that SARSA takes the next action based on the current policy, while Q-learning bootstraps from the action with maximum utility in the next state.
I haven't read any book on this; I'm just describing the implication as I see it:
Q-learning focuses only on the action values themselves (the action grid),
whereas SARSA follows the transition from state s to state s', observes the action actually taken there, and then updates the state-to-state values accordingly.
Both SARSA and Q-learning agents follow an ε-greedy policy to interact with the environment.
A SARSA agent updates its Q-function using the next timestep's Q-value for whatever action the policy provides (mostly still greedy, but a random action is also accepted). The policy being executed and the policy being updated towards are the same.
A Q-learning agent updates its Q-function using only the action that brings the maximum next-state Q-value (totally greedy with respect to the policy). The policy being executed and the policy being updated towards are different.
Hence, SARSA is on-policy and Q-learning is off-policy.

Exhaustive searches vs sorting followed by binary search

This is a direct quote from the textbook, Invitation to Computer Science by G. Michael Schneider and Judith L. Gersting.
At the end of Section 3.4.2, we talked about the tradeoff between using sequential search on an unsorted list as opposed to sorting the list and then using binary search. If the list size is n=100,000 about how many worst-case searches must be done before the second alternative is better in terms of number of comparisons?
I don't really get what the question is asking for.
Sequential search is of order O(n) and binary search is of order O(lg n), which in any case will always be less than n. And in this case n is already given, so what am I supposed to find?
This is one of my homework assignments, but I don't really know what to do. Could anyone explain the question in plain English for me?
and binary search is of order O(lg n), which in any case will always be less than n
This is where you're wrong. In the assignment, you're asked to consider the cost of sorting the array too.
Obviously, if you need only one search, the first approach is better than sorting the array and then doing a binary search: n < n*log n + log n. You're being asked how many searches it takes for the second approach to become more effective.
End of hint.
The question is how to decide which approach to choose - to just use linear search or to sort and then use binary search.
If you only search a couple of times linear search is better - it is O(n), while sorting is already O(n*logn). If you search very often on the same collection sorting is better - searching multiple times can become O(n*n) but sorting and then searching with binary search is again O(n*logn) + NumberOfSearches*O(logn) which can be less or more than using linear search depending on how NumberOfSearches and n relate.
The task is to determine the exact value of NumberOfSearches (not the exact number, but a function of n) which will make one of the options preferable:
NumberOfSearches * O(n) <> O(n*logn) + NumberOfSearches * O(logn)
don't forget that each O() can have a different constant value.
The order of the methods is not the important thing here. It tells you how well the algorithms scale as the problem becomes bigger and bigger. You can't do any exact calculations if you only know O(n), i.e. that the complexity grows linearly in the size of the problem. It won't give you any numbers.
This can well mean that an algorithm with O(n) complexity is faster than an O(log n) algorithm for some n. Because O(log n) scales better as n gets larger, we know for sure that there is an n (a problem size) beyond which the algorithm with O(log n) complexity is faster. We just don't know when (for what n).
In plain english:
If you want to know 'how many searches', you need exact equations to solve, you need exact numbers. How many comparisons does it take to search sequential? (Remember n is given, so you can give a number.) How many comparisons (in the worst case!) does it take to search with a binary search? Before you can do a binary search, you have to sort. Let's add the number of comparisons needed to sort to the cost of binary search. Now compare the two numbers, which one is less?
The binary search is fast, but the sorting is slow. The sequential search is slower than binary search, but faster than sorting. However the sorting needs to be done only once, no matter how many times you search. So, when does one heavy sort outweigh having to do a slow (sequential) search every time?
Good luck!
For sequential search, the worst case is n = 100000, so for p searches p × 100000 comparisons are required.
Using a Θ(n^2) sorting algorithm would require 100000 × 100000 comparisons.
Binary search would require about 1 + log2(n) = 1 + log2(100000) ≈ 17 comparisons for each search,
so together there would be 100000 × 100000 + 17p comparisons.
The second alternative becomes better once the first expression exceeds the second:
100000p > 100000^2 + 17p
which holds for p > 100017.
The question is about estimating the number NUM_SEARCHES needed to compensate for the cost of sorting. So we want:
time( NUM_SEARCHES * O(n) ) > time( NUM_SEARCHES * O(log(n)) + O(n * log(n)) )
Thank you guys, I think I get the point now. Could you take a look at my answer and see whether I'm on the right track?
For worst-case searches:
Number of comparisons for one sequential search is n = 100,000.
Number of comparisons for one binary search is lg(n) ≈ 17.
Number of comparisons for sorting is n(n-1)/2 = (99,999)(50,000).
(I'm following my textbook and used the selection sort algorithm covered in my class.)
So let p be the number of worst-case searches; then 100,000p > (99,999)(50,000) + 17p, which gives p > 50,008.
In conclusion, I need just over 50,008 worst-case searches to make sorting and then using binary search better than sequential search for a list of n = 100,000.
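To double-check that break-even point, here is a small sketch of my own that just counts comparisons directly:

```python
import math

n = 100_000
sort_cost = n * (n - 1) // 2                  # selection sort: n(n-1)/2 comparisons
binary_per_search = 1 + int(math.log2(n))     # ~17 comparisons per binary search

for p in (50_008, 50_009, 60_000):
    sequential = p * n
    sort_then_binary = sort_cost + p * binary_per_search
    print(p, "second alternative wins:", sort_then_binary < sequential)
# p = 50,008 -> False, p = 50,009 -> True
```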
