Q-learning and never-ending episodes

Let's imagine we have an (x,y) plane where a robot can move. Now we define the middle of our world as the goal state, which means that we are going to give a reward of 100 to our robot once it reaches that state.
Now, let's say that there are 4 states(which I will call A,B,C,D) that can lead to the goal state.
The first time we are in A and go to the goal state, we will update our Q-value table as follows:
Q(state = A, action = going to goal state) = 100 + 0
One of two things can happen. I can end the episode here and start a new one where the robot has to find the goal state again, or I can continue exploring the world even after reaching the goal state. If I try the second option, I see a problem though. If I am in the goal state and go back to state A, its Q-value will be the following:
Q(state = goalState, action = going to A) = 0 + gamma * 100
Now, if I try to go again to the goal state from A:
Q(state = A, action = going to goal state) = 100 + gamma * (gamma * 100)
Which means that if I keep doing this, since 0 <= gamma <= 1, both Q-values are going to keep rising.
Is this the expected behavior of Q-learning? Am I doing something wrong? If this is the expected behavior, can't it lead to problems? I know that, probabilistically, all four states (A, B, C and D) will grow at the same rate, but even so it kind of bugs me having them grow forever.
The idea of allowing the agent to continue exploring even after finding the goal is that the nearer it is to the goal state, the more likely it is to be in states that can be updated at that moment.
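For concreteness, a minimal sketch of the two updates feeding each other (assuming a learning rate of 1, only the A ↔ goal loop described above, and the usual update Q(s,a) = r + gamma * max_a' Q(s',a')):

gamma = 0.9            # set gamma = 1.0 to see the values grow without bound
q_a_to_goal = 0.0      # Q(state = A, action = go to goal)
q_goal_to_a = 0.0      # Q(state = goal, action = go to A)

for _ in range(200):                           # keep bouncing A -> goal -> A -> ...
    q_a_to_goal = 100 + gamma * q_goal_to_a    # reward 100 for entering the goal
    q_goal_to_a = 0 + gamma * q_a_to_goal      # reward 0 for leaving it again

print(q_a_to_goal, q_goal_to_a)
# With gamma < 1 both values rise toward 100 / (1 - gamma**2) and
# 100 * gamma / (1 - gamma**2); with gamma = 1 they grow without bound.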

This is as expected: the Q estimate isn't the expected reward, it's the expected return, which is the (possibly gamma-discounted) amount of reward I'd expect to reap from that state/action if I started there and followed my policy until the end of the episode, or forever.
If you give me some buttons, and one of those buttons always produces $1 when pressed, then the true expected reward for pressing that button is $1. But the true expected return for pressing the button is infinite dollars, assuming I get an infinite number of chances to push it.
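A tiny sketch of that button example, comparing the discounted return for a few values of gamma with the undiscounted case:

def approx_return(reward_per_press=1.0, gamma=0.99, presses=100_000):
    """Sum of discounted rewards over a long (finite) run of button presses."""
    return sum(reward_per_press * gamma ** t for t in range(presses))

print(approx_return(gamma=0.9))    # ~10   = 1 / (1 - 0.9)
print(approx_return(gamma=0.99))   # ~100  = 1 / (1 - 0.99)
print(approx_return(gamma=1.0))    # 100000.0: grows with `presses`, unbounded in the limit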

Related

Need help finding a logical solution to a problem

Given the variable 'points', which increases every time the variable 'player' collects a point, how do I logically find a way to reward the user for collecting 30 points within a 5-minute limit? There's no countdown timer.
E.g. the player may have 4 points now, but if 5 minutes later he has 34 points, that also counts.
I was thinking about using timestamps but I don't really know how to do that.
What you are talking about is a "sliding window". Your window is time based. Record each point's timestamp and slide your window over these timestamps. You will need to pick a time increment to slide your window.
Upon each "slide", count your points. When you get the amount you need, "reward your user". The "upon each slide" means you need some sort of timer that calls a function each time to evaluate the result and do what you want.
For example, set a window of 5 minutes and a slide of 1 second. Don't keep a single variable called points. Instead, simply create an array of timestamps. On every timer tick (of 1 second in this case), count the number of timestamps that fall between t - 5 minutes and t (now); if there are 30 or more, you've met your threshold and can reward your super-fast user. If you need the actual value (which may be 34), well, you've just computed it, so you can use it.
There may be ways to optimize this. I've provided the naive approach. Timestamps that have gone out of range can be deleted to save space.
If there are "points going into the window" that count, then just add them to the sum.
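A minimal sketch of that approach (the function names and the one-second tick are illustrative choices, not a prescribed API):

import time
from collections import deque

WINDOW_SECONDS = 5 * 60     # the 5-minute window
THRESHOLD = 30              # points needed inside the window

timestamps = deque()        # one timestamp per point collected

def record_point():
    """Call whenever the player collects a point."""
    timestamps.append(time.time())

def on_timer_tick():
    """Call from a 1-second timer: slide the window and check the threshold."""
    cutoff = time.time() - WINDOW_SECONDS
    while timestamps and timestamps[0] < cutoff:   # drop out-of-range timestamps
        timestamps.popleft()
    if len(timestamps) >= THRESHOLD:
        reward_user(len(timestamps))               # e.g. 34 if 34 points are in range

def reward_user(points_in_window):
    print(f"Reward! {points_in_window} points in the last 5 minutes")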

Generated unique id with 6 characters - handling when too many ids are already used

In my program you can book an item. This item has an id of 6 characters drawn from 32 possible characters.
So my possibilities are 32^6. Every id must be unique.
func tryToAddItem() {
    if !db.contains(generateId()) {
        addItem()
    } else {
        tryToAddItem()
    }
}
For example, 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
So that is quite high. That means 5 database queries against a lot of data.
When the probability is that high, I want to implement a prefix "A-xxxxxx".
What is a good condition for that? At what point will I need a prefix?
In my example 90% of the ids were used. What about the rest? Do I throw them away?
What about database performance when I call tryToAddItem 5 times? I imagine this is not best practice.
For example, 90% of my ids are used. So the probability that I call tryToAddItem 5 times is 0.9^5 * 100 = 59%, isn't it?
Not quite. Let's represent the number of calls you make with the random variable X, and let's call the probability of an id collision p. You want the probability that you make the call at most five times, or in general at most k times:
P(X≤k) = P(X=1) + P(X=2) + ... + P(X=k)
= (1-p) + (1-p)*p + (1-p)*p^2 +... + (1-p)*p^(k-1)
= (1-p)*(1 + p + p^2 + .. + p^(k-1))
If we expand this out, all but two terms cancel and we get:
= 1 - p^k
Which we want to be greater than some probability, x:
1 - p^k > x
Or with p in terms of k and x:
p < (1-x)^(1/k)
where you can adjust x and k for your specific needs.
If you want less than a 50% probability of 5 or more calls, then no more than (1-0.5)^(1/5) ≈ 87% of your ids should be taken.
First of all make sure there is an index on the id columns you are looking up. Then I would recommend thinking more in terms of setting a very low probability of a very bad event occurring. For example maybe making 20 calls slows down the database for too long, so we'd like to set the probability of this occurring to <0.1%. Using the formula above we find that no more than 70% of ids should be taken.
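A small helper for evaluating that bound; the two example calls reproduce the 87% and 70% figures above:

def max_fill_ratio(k, x):
    """Largest fraction p of ids that may already be taken so that the chance of
    succeeding within k generation attempts, 1 - p**k, still exceeds x."""
    return (1 - x) ** (1 / k)

print(max_fill_ratio(k=5, x=0.5))     # ~0.87: the 50%-within-5-calls example above
print(max_fill_ratio(k=20, x=0.999))  # ~0.71: the 0.1%-risk-of-20-calls example above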
But you should also consider alternative solutions. Is remapping all ids to a larger space one time only a possibility?
Or if adding ids with prefixes is not a big deal then you could generate longer ids with prefixes for all new items going forward and not have to worry about collisions.
Thanks for the response. I searched for alternatives and want to show three possibilities.
First possibility: Create an UpcomingItemIdTable with 200 (more or less) valid itemIds. A background task can calculate them every minute (or as often as needed), so the action tryToAddItem will always get a valid itemId.
Second possibility:
Is remapping all ids to a larger space one time only a possibility?
In my case, yes. I think for other problems the answer will be: it depends.
Third possibility: Try to generate an itemId and, when there is a collision, try again.
Possible collision handling: Do some tests beforehand. Measure the time it takes to generate itemIds when there are already 1,000, 10,000, 100,000, 1,000,000 etc. entries in the table. When the tryToAddItem method needs more than 100 ms (or whatever you prefer), increase your id length from 6 to 7, 8, 9 characters.
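A rough sketch of that third possibility with the length-increase fallback (an in-memory set stands in for the database lookup here, and the exact 32-character alphabet is just an example):

import random
import string
import time

ALPHABET = string.ascii_uppercase + "234567"   # 32 possible characters
existing_ids = set()                           # stand-in for the db.contains lookup
id_length = 6

def generate_id(length):
    return "".join(random.choice(ALPHABET) for _ in range(length))

def try_to_add_item(budget_ms=100):
    """Retry on collision; if it takes longer than the budget, grow the id length."""
    global id_length
    start = time.monotonic()
    while True:
        new_id = generate_id(id_length)
        if new_id not in existing_ids:
            existing_ids.add(new_id)
            return new_id
        if (time.monotonic() - start) * 1000 > budget_ms:
            id_length += 1                     # e.g. from 6 to 7 characters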
Some thoughts:
- every request must be atomic
- create an index on itemId
Disadvantages for long UUIDs in API: See https://zalando.github.io/restful-api-guidelines/#144
less usable, because...
- cannot be memorized and easily communicated by humans
- harder to use in debugging and logging analysis
- less convenient for consumer facing usage
- quite long: readable representation requires 36 characters and comes with higher memory and bandwidth consumption
- not ordered along their creation history and no indication of used id volume
- may be in conflict with additional backward compatibility support of legacy ids
[...]
TL;DR: For my case every possibility works. As so often, it depends on the problem. Thanks for the input.

MDP: How to calculate the chances of each possible result for a sequence of actions?

I've got an MDP problem with the following environment (a 3x4 map), with the possible actions Up/Down/Right/Left and a 0.8 chance of moving in the intended direction plus 0.1 for each adjoining direction (e.g. for Up: 0.1 chance to go Left, 0.1 chance to go Right).
Now what I need to do is calculate the possible results starting in (1,1) running the following sequence of actions:
[Up, Up, Right, Right, Right]
And also calculate the chance of reaching each field (for each possible outcome) with this action sequence. How can I do this efficiently (i.e. without going through the at least 2^5 and at most 3^5 possible outcomes)?
Thanks in advance!
Well, I wonder if you are treating this as an RL problem.
We usually solve RL problems with the Bellman equation and Q-learning.
You will also benefit from this lecture:
http://cs229.stanford.edu/notes/cs229-notes12.pdf
And once you have finished learning, repeat the whole process and you will know the probability of [Up, Up, Right, Right, Right].
After learning, the second constraint becomes meaningless, because the agent reaches the correct answer almost immediately.
I think this example is from AIMA, right?
Actually I have a few questions about the approach.
I think my answer may not be right if you approach it very theoretically.
# Fragment: assumes Q, env, state, done, e (epsilon), lr and r (gamma) are already set up.
while not done:
    if np.random.rand(1) < e:                    # epsilon-greedy exploration
        action = env.action_space.sample()
    else:
        action = rargmax(Q[state, :])            # argmax with random tie-breaking (defined elsewhere)
    new_state, reward, done, _ = env.step(action)
    # Q-learning update: move Q(s, a) toward reward + r * max_a' Q(s', a')
    Q[state, action] = Q[state, action] + lr * (reward + r * np.max(Q[new_state, :]) - Q[state, action])
    state = new_state                            # advance to the next state
And this is the code I wrote with gym.
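Coming back to the original question of computing the outcome probabilities for the fixed sequence [Up, Up, Right, Right, Right]: one way to avoid enumerating every branch is to push a probability distribution over the grid cells through the transition model, one action at a time. The map's walls and terminal cells aren't shown here, so the blocked cell below (and the bottom-left (1,1) convention) are assumptions to adjust to the actual map.

from collections import defaultdict

COLS, ROWS = 4, 3
BLOCKED = {(2, 2)}           # assumed obstacle; change to match the actual map

MOVES = {"Up": (0, 1), "Down": (0, -1), "Right": (1, 0), "Left": (-1, 0)}
SLIPS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
         "Left": ("Up", "Down"), "Right": ("Up", "Down")}

def move(cell, direction):
    dx, dy = MOVES[direction]
    nxt = (cell[0] + dx, cell[1] + dy)
    if nxt in BLOCKED or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return cell                           # bumping into a wall: stay put
    return nxt

def propagate(dist, action):
    """One action: 0.8 intended direction, 0.1 for each adjoining direction."""
    out = defaultdict(float)
    for cell, p in dist.items():
        out[move(cell, action)] += 0.8 * p
        for slip in SLIPS[action]:
            out[move(cell, slip)] += 0.1 * p
    return out

dist = {(1, 1): 1.0}
for action in ["Up", "Up", "Right", "Right", "Right"]:
    dist = propagate(dist, action)

for cell, p in sorted(dist.items()):          # chance of ending on each field
    print(cell, round(p, 4))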

How to calculate the value function in reinforcement learning

Could anybody help explain how the following value function was generated? The problem and solution are attached; I just don't know how the solution is derived. Thank you!
STILL NEED HELP WITH THIS!!!
Since no one else has taken a stab at it, I'll present my understanding of the problem (disclaimer: I'm not an expert on reinforcement learning, and I'm posting this as an answer because it's too long to be a comment).
Think of it this way: when starting at, for example, node d, a random walker has a 50% chance to jump to either node e or node a. Each such jump reduces the reward (r) with the multiplier y (gamma in the picture). You continue jumping around until you get to the target node (f in this case), after which you collect the reward r.
If I've understood correctly, the two smaller 3x2 squares represent the expected values of reward when starting from each node. Now, it's obvious why in the first 3x2 square every node has a value of 100: because y = 1, the reward never decreases. You can just keep jumping around until you eventually end up in the reward node, and gather the reward of r=100.
However, in the second 3x2 square, with every jump the reward is decreased by a multiplier of 0.9. So, to get the expected value of the reward when starting from node c, you sum together the reward you get from the different paths, multiplied by their probabilities. Going from c to f has a chance of 50% and it's 1 jump, so you get r = 0.5*0.9^0*100 = 50. Then there's the path c-b-e-f: 0.5*(1/3)*(1/3)*0.9^2*100 = 4.5. Then there's c-b-c-f: 0.9^2*0.5^2*(1/3)^1*100 = 6.75. You keep going this way until the reward from the path you're examining is insignificantly small, and sum together the rewards from all the paths. This should give you the result of the corresponding node, that is, 50+6.75+4.5+... = 76.
I guess the programmatic way of doing this would be to use a modified DFS/BFS to explore all the paths of length N or less, and sum together the rewards from those paths (with N chosen so that 0.9^N is negligibly small).
Again, take this with a grain of salt; I'm not an expert on reinforcement learning.
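A rough sketch of that summation, done one jump at a time instead of an explicit DFS; the 3x2 graph (a b c over d e f, with the reward at f) is reconstructed from the description above, so double-check it against the picture:

NEIGHBORS = {
    "a": ["b", "d"], "b": ["a", "c", "e"], "c": ["b", "f"],
    "d": ["a", "e"], "e": ["b", "d", "f"], "f": [],
}
REWARD, TARGET = 100.0, "f"

def expected_return(start, gamma=0.9, max_jumps=500):
    """Sum prob(path) * gamma**(jumps - 1) * reward over all paths into f,
    tracking the random walker's probability mass jump by jump."""
    if start == TARGET:
        return REWARD
    dist = {start: 1.0}        # probability mass still walking, per node
    total = 0.0
    for t in range(1, max_jumps + 1):
        nxt = {}
        for node, p in dist.items():
            share = p / len(NEIGHBORS[node])
            for n in NEIGHBORS[node]:
                if n == TARGET:
                    total += share * gamma ** (t - 1) * REWARD
                else:
                    nxt[n] = nxt.get(n, 0.0) + share
        dist = nxt
    return total

for node in "abcdef":
    print(node, round(expected_return(node, gamma=0.9), 1))   # c comes out near 76
# With gamma = 1 every node approaches 100, matching the first 3x2 square.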

How do I use a PID controller?

I'm currently working on a temperature controller.
I have a Temperature_PID() function that returns the manipulated variable (which is the sum of the P, I, and D terms) but what do I do with this output?
The temperature is controlled by PWM, so 0% duty cycle = heater off and 100% duty cycle = heater on.
So far I tried
Duty_Cycle += Temperature_PID();
if(Duty_Cycle > 100) Duty_Cycle = 100;
else if(Duty_Cycle < 0) Duty_Cycle = 0;
This didn't work for me because the I term basically makes this system very unstable. Imagine integrating an area, adding another small data point, integrating the area again, and summing them. Over and over. That means each data point makes this control scheme exponentially worse.
The other thing I would like to try is
Duty_Cycle = Expected_Duty_Cycle + Temperature_PID();
where Expected_Duty_Cycle is what the temperature should be set to once the controller reaches a stable point and Temperature_PID() is 0. However, this also doesn't work because the Expected_Duty_Cycle would always be changing depending on the conditions of the heater, e.g. different weather.
So my question is what exactly do I do with the output of PID? I don't understand how to assign a duty cycle based on the PID output. Ideally this will stay at 100% duty cycle until the temperature almost reaches the set point and start dropping off to a lower duty cycle. But using my first method (with my I gain set to zero) it only starts lowering the duty cycle after it already overshoots.
This is my first post. Hope I find my answer. Thank you stackoverflow.
EDIT:
Here's my PID function.
double TempCtrl_PID(PID_Data *pid)
{
    Thermo_Data tc;
    double error, pTerm, iTerm, dTerm;

    Thermo_Read(CHIP_TC1, &tc);
    pid->last_pv = pid->pv;
    pid->pv = Thermo_Temperature(&tc);

    error = pid->sp - pid->pv;
    if (error / pid->sp < 0.1)                    /* only accumulate the integral when error < 10% of the setpoint */
        pid->err_sum += error;

    pTerm = pid->kp * error;                      /* proportional */
    iTerm = pid->ki * pid->err_sum;               /* integral */
    dTerm = pid->kd * (pid->last_pv - pid->pv);   /* derivative on measurement */

    return pTerm + iTerm + dTerm;
}
EDIT 2:
Never used this before so let me know if the link is broken.
https://picasaweb.google.com/113881440334423462633/January302013
Sorry, Excel is crashing on me when I try to rename the axes or the title. Note: there isn't a fan in the system yet, so I can't cool the heater as fast as I can heat it up, which is why it spends very little time below the set point compared to above it.
The first picture is a simple on-off controller.
The second picture is my PD controller. As you can see, it takes a lot longer for the temperature to decrease because it doesn't start subtracting before the temperature overshoots; it waits until the temperature has overshot before subtracting from the duty cycle, and does so too slowly. How exactly do I tell my controller to lower the duty cycle before it hits the maximum temperature?
The output of the PID is the duty cycle. You must adjust kp, ki, and kd to put the PID output in the range of the Duty_Cycle, e.g., 0 to 100. It is usually a good idea to explicitly limit the output in the PID function itself.
You should "tune" your PID in simple steps.
Turn off the integral and derivative terms (set ki and kd to zero)
Slowly increase your kp until a 10% step change in the setpoint makes the output oscillate
Reduce kp by 30% or so, which should eliminate the oscillations
Set ki to a fraction of kp and adjust to get your desired tradeoff of overshoot versus time to reach setpoint
Hopefully, you will not need kd, but if you do, make it smaller still
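A rough sketch of such a controller in code, with the output used directly as the duty cycle and clamped inside the step function (the gains and limits are placeholders to be tuned as described above):

class PID:
    def __init__(self, kp, ki, kd, out_min=0.0, out_max=100.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.err_sum = 0.0
        self.last_pv = None

    def step(self, setpoint, pv, dt):
        error = setpoint - pv
        self.err_sum += error * dt
        # derivative on measurement, like the dTerm in the question's code
        d_pv = 0.0 if self.last_pv is None else (self.last_pv - pv) / dt
        self.last_pv = pv
        out = self.kp * error + self.ki * self.err_sum + self.kd * d_pv
        # clamp inside the controller: the output IS the duty cycle
        # (a real heater loop may also want anti-windup on err_sum when saturated)
        return max(self.out_min, min(self.out_max, out))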
Your PID controller output should be setting the value of the duty cycle directly.
Basically you are going to be controlling the heater settings based on the difference in the actual temperature versus the temperature setpoint.
You will need to adjust the values of the PID parameters to obtain the performance you are looking for.
First, set I and D to zero and put in a value for P, say 2 to start.
Change the setpoint and see what your response is. Increase P and make another setpoint change and see what happens. Eventually you will see the temperature oscillate consistently and never settle at any stable value. This value of P is known as the "ultimate gain". Pay attention to the frequency of the oscillation as well. Set P equal to half of the ultimate gain.
Start with a value of 1.2 × (ultimate gain) × (oscillation frequency) for I and change the setpoint. Adjust the values of P and I from those values to get to where you want to go, tracking the process and seeing whether increasing or decreasing the values improves things.
Once you have P and I you can work on D but depending on the process dynamics giving a value for D might make your life worse.
The Ziegler-Nichols method gives you some guidelines for PID values which should get you in the ballpark. From there you can make adjustments to get better performance.
You will have to weigh the options of having overshoot with the amount of time the temperature takes to reach the new setpoint. The faster the temperature adjusts the more overshoot you will have. To have no overshoot will increase that amount of time considerably.
A few suggestions:
You seem to be integrating twice: once inside your TempCtrl_PID function and once outside, with Duty_Cycle +=. So now your P term is really an I term.
Start with only a P term and keep increasing it until the system becomes unstable. Then back off (e.g. use 1/2 to 1/4 the value where it becomes unstable) and start adding an I term. Start with very low values on the I term and then gradually increase. This process is a way of tuning the loop. Because the system will probably have a pretty long time constant this may be time consuming...
You can add some feed-forward as you suggest (expected duty cycle for a given setpoint - map it out by setting the duty cycle and letting the system stabilize.). It doesn't matter if that term isn't perfect since the loop will take out the remaining error. You can also simply add some constant bias to the duty cycle. Keep in mind a constant wouldn't really make any difference as the integrator will take it out. It will only affect a cold start.
Make sure you have some sort of fixed time base for this loop. E.g. make an adjustment every 10ms.
I would not worry about the D term for now. A PI controller should be good enough for most applications.
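Putting those suggestions together, a sketch of the outer loop on a fixed time base with an optional feed-forward term (read_temperature, set_pwm_duty and the feed-forward table are placeholder names; pid is a controller object like the sketch in the first answer above):

import time

DT = 0.01                                   # fixed 10 ms loop period
FEED_FORWARD = {50.0: 20.0, 75.0: 45.0}     # measured stable duty cycle per setpoint (example values)

def feed_forward(setpoint):
    # use the closest measured operating point; interpolation would be a refinement
    nearest = min(FEED_FORWARD, key=lambda s: abs(s - setpoint))
    return FEED_FORWARD[nearest]

def control_loop(pid, setpoint, read_temperature, set_pwm_duty):
    # pid can be built with out_min=-100.0 so it is able to pull the duty cycle
    # below the feed-forward value as well as above it
    while True:
        pv = read_temperature()                                  # placeholder sensor read
        duty = feed_forward(setpoint) + pid.step(setpoint, pv, DT)
        set_pwm_duty(max(0.0, min(100.0, duty)))                 # clamp to 0..100 % duty
        time.sleep(DT)                                           # fixed time base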
