How is the optimal policy for recurrent utilities calculated? - artificial-intelligence

Exam Solutions
I am learning the Markov Decision Process and for Question 6 of the exam (see the link attached above), I understand how utility is calculated when the same state is obtained after an action (part a of Question 6).
J*(cool) = 4 + 0.9 * J*(cool)
But I don't get how the calculations for the other states and actions can be made (part b of Question 6). I am assuming the equations would be something like this:
For action "fast" in state "cool":
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * J*(warm))
For action "slow" in state "warm":
J*(warm) = 4 + 0.9 * (0.5 * J*(cool) + 0.5 * J*(warm))
For action "fast" in state "warm":
J*(warm) = 10 + 0.9 * (0.875 * J*(warm) + 0.125 * J*(off))
But these equations contain more than one unknown, and we don't yet know the utilities of these states. How can we get the expected utilities associated with each action?

You're on the right track with those equations. You just need to consider each of the four possible policies in turn: (slow, slow), (fast, slow), (slow, fast), (fast, fast).
Consider (slow, fast):
From a) you have already seen J*(cool) = 40.
J*(warm) = 10 + 0.9 * (0.875 * J*(warm) + 0.125 * J*(off))
J*(warm) = 10 + 0.9 * (0.875 * J*(warm) + 0.125 * 0)
J*(warm) = 47.06
For (slow, slow):
Again J*(cool) is independent of your action in the warm state so J*(cool) = 40.
J*(warm) = 4 + 0.9 * (0.5 * J*(cool) + 0.5 * J*(warm))
J*(warm) = 4 + 0.9 * (0.5 * 40 + 0.5 * J*(warm))
J*(warm) = 40
And for (fast, fast):
This time the value of being in the warm state is independent of the cool action and is J*(warm) = 47.06, from above.
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * J*(warm))
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * 47.06)
J*(cool) = 53.89
Lastly (fast, slow):
This is the hardest case, but we have two equations and two unknowns so we can solve using simultaneous equations.
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * J*(warm))
J*(warm) = 4 + 0.9 * (0.5 * J*(cool) + 0.5 * J*(warm))
J*(warm) = (4 + 0.45 * J*(cool))/0.55
J*(cool) = 10 + 0.9 * (0.25 * J*(cool) + 0.75 * (4 + 0.45 * J*(cool))/0.55)
J*(cool) = 66.94
J*(warm) = 62.04
As we can see, the highest value that can be obtained starting in the warm state is 62.04, and the highest value starting in cool is 66.94. Both of these occur when our policy is (fast, slow), i.e. fast in cool and slow in warm, hence this is the optimal policy.
As it turns out, it is not possible to have a policy that is optimal if you start in state A but not optimal if you start in state B. It is also worth noting that for these types of infinite time horizon MDPs, you can prove that the optimal policy will always be stationary; that is, if it is optimal to take the slow action in the cool state at time 1, it will be optimal to take the slow action at all times.
Finally, in practice the number of states and actions is much larger than in this question, and more advanced techniques, such as value iteration, policy iteration or dynamic programming, are typically required.
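If it helps to check the numbers, here is a minimal Python sketch (my own illustration, not part of the exam) that evaluates all four stationary policies by solving the linear system J = r + gamma * P * J, using the rewards and transition probabilities assumed in the equations above. Running it reproduces the four pairs of values derived here.

import numpy as np

gamma = 0.9
# (reward, transition probabilities over [cool, warm, off]) for each state-action pair,
# taken from the equations above
model = {
    ("cool", "slow"): (4,  [1.0,  0.0,   0.0]),
    ("cool", "fast"): (10, [0.25, 0.75,  0.0]),
    ("warm", "slow"): (4,  [0.5,  0.5,   0.0]),
    ("warm", "fast"): (10, [0.0,  0.875, 0.125]),
}

def evaluate(policy):
    """Solve J = r + gamma * P * J for the two non-terminal states (off is worth 0)."""
    states = ["cool", "warm"]
    r = np.array([model[(s, policy[s])][0] for s in states], dtype=float)
    P = np.array([model[(s, policy[s])][1][:2] for s in states], dtype=float)
    return np.linalg.solve(np.eye(2) - gamma * P, r)

for cool_action in ("slow", "fast"):
    for warm_action in ("slow", "fast"):
        J = evaluate({"cool": cool_action, "warm": warm_action})
        print(f"({cool_action}, {warm_action}): J*(cool) = {J[0]:.2f}, J*(warm) = {J[1]:.2f}")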

Related

Calculation of countdown timer

How are the numerators and denominators for days, hours and minutes calculated in this code, and why is the modulus taken in the numerator?
var countDownDate = new Date("Sep 5, 2018 15:37:25").getTime();

var x = setInterval(function() {
    var now = new Date().getTime();
    var distance = countDownDate - now;

    var days = Math.floor(distance / (1000 * 60 * 60 * 24));
    var hours = Math.floor((distance % (1000 * 60 * 60 * 24)) / (1000 * 60 * 60));
    var minutes = Math.floor((distance % (1000 * 60 * 60)) / (1000 * 60));
    var seconds = Math.floor((distance % (1000 * 60)) / 1000);

    document.getElementById("demo").innerHTML = days + "d " + hours + "h " + minutes + "m " + seconds + "s ";

    if (distance < 0) {
        clearInterval(x);
        document.getElementById("demo").innerHTML = "EXPIRED";
    }
}, 1000);
Let me explain it line by line:
var countDownDate = new Date("Sep 5, 2018 15:37:25").getTime();
In the above line, you are getting the milliseconds for the date Sep 5, 2018 15:37:25 measured from Jan 1, 1970 (which is the reference date used by getTime()).
var now = new Date().getTime();
var distance = countDownDate - now;
The above two lines are simple: now gets the current time in milliseconds, and distance is the difference between the two times (also in milliseconds).
var days = Math.floor(distance / (1000 * 60 * 60 * 24));
The total number of seconds in a day is 60 * 60 * 24, and to convert that to milliseconds we multiply by 1000, so 1000 * 60 * 60 * 24 is the total number of milliseconds in a day. Dividing the difference (distance) by this number and discarding everything after the decimal point gives the number of days.
var hours = Math.floor((distance % (1000 * 60 * 60 * 24)) / (1000 * 60 * 60));
The above line is a little trickier as there are two operations. The first operation (%) is used to discard the part of the difference representing whole days (% returns the remainder of the division, so the days portion of the difference is taken out).
In the next step (the division), 1000 * 60 * 60 is the total number of milliseconds in an hour, so dividing the remainder of the difference by this number gives the number of hours (and as before we discard everything after the decimal point).
var minutes = Math.floor((distance % (1000 * 60 * 60)) / (1000 * 60));
This is similar to how the hours are calculated. The first operation (%) takes out the hours portion of the difference, and the division by 1000 * 60 returns the minutes (as 1000 * 60 is the number of milliseconds in a minute).
var seconds = Math.floor((distance % (1000 * 60)) / 1000);
Here the first operation (%) takes out the minutes part and the second operation (division) returns the number of seconds.
Note: You might have noticed that every operation reuses the original distance, yet the code still works fine. Let me give you an example (I am using difference instead of distance as this name makes more sense).
difference = 93234543
days = Math.floor(93234543 / (1000 * 60 * 60 * 24))
=> days = 1
hours = Math.floor((93234543 % (1000 * 60 * 60 * 24)) / (1000 * 60 * 60));
(the result of the modulus operation is 6834543, and the division gives 1.898...)
=> hours = 1
This is a very important operation to understand:
var minutes = Math.floor((distance % (1000 * 60 * 60)) / (1000 * 60));
distance (difference) divided by (1000 * 60 * 60) gives 25 (hours). As you can see, we have already accounted for 1 day and 1 hour (25 hours), so distance % (1000 * 60 * 60) wipes out all of these 25 hours, and the division then calculates the minutes, and so on.
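To make the reuse of distance concrete, here is a small Python sketch (my own illustration, not from the original code) that applies the same modulus-and-divide breakdown to the example difference of 93234543 ms:

# Same breakdown as the JavaScript above, using the example difference of 93234543 ms.
distance = 93234543

MS_PER_DAY    = 1000 * 60 * 60 * 24
MS_PER_HOUR   = 1000 * 60 * 60
MS_PER_MINUTE = 1000 * 60

days    = distance // MS_PER_DAY                     # 1
hours   = (distance % MS_PER_DAY) // MS_PER_HOUR     # 1
minutes = (distance % MS_PER_HOUR) // MS_PER_MINUTE  # 53
seconds = (distance % MS_PER_MINUTE) // 1000         # 54

print(f"{days}d {hours}h {minutes}m {seconds}s")     # 1d 1h 53m 54s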

Multiple boost of each matching value in one field

I have one multivalued field with the following values:
"itm_field_skills":[1, 2]
Now I have the following query:
q=itm_field_skills:(1+OR+2)^5
I get the result, but the score is 5.
I want to make a search request that boosts each matching value so that the score is 10.
Absolute score values aren't something you can rely on. Your query does not mean that your score will be 5 or 10 - just that those terms are five/ten times more important than other parts of your query.
If you look at the output of debugQuery, you'll see that the boost (5) is being applied separately to each term and then the scores for the terms are summed together afterwards.
4.8168015 = sum of:
  1.2343608 = weight(..) [SchemaSimilarity], result of:
    1.2343608 = score(doc=0, freq=1.0 = termFreq=1.0), product of:
      5.0 = boost <----
      0.3254224 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
        6.0 = docFreq
        8.0 = docCount
      0.7586207 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        1.125 = avgFieldLength
        2.0 = fieldLength
  3.5824406 = weight(..) [SchemaSimilarity], result of:
    3.5824406 = score(doc=0, freq=1.0 = termFreq=1.0), product of:
      5.0 = boost <---
      0.9444616 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
        3.0 = docFreq
        8.0 = docCount
      0.7586207 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        1.125 = avgFieldLength
        2.0 = fieldLength
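If you want to verify the arithmetic, each term's contribution is boost * idf * tfNorm, and the document score is the sum over the matching terms. A quick Python check (my own, using only the numbers from the debug output above):

boost = 5.0
term1 = boost * 0.3254224 * 0.7586207   # approx. 1.2343608
term2 = boost * 0.9444616 * 0.7586207   # approx. 3.5824406
print(term1 + term2)                    # approx. 4.8168015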

Correct way to get weighted average of concrete array-values along continuous interval

I've been searching the web for a while, but I am probably missing the right terminology.
I have arbitrary sized arrays of scalars ...
array = [n_0, n_1, n_2, ..., n_m]
I also have a function f: x -> y, with 0 <= x <= 1, where y is a value interpolated from array. Examples:
array = [1,2,9]
f(0) = 1
f(0.5) = 2
f(1) = 9
f(0.75) = 5.5
My problem is that I want to compute the average value over some interval r = [a..b], where a ∈ [0..1] and b ∈ [0..1], i.e. I want to generalize my interpolation function f to compute the average along r.
I'm struggling slightly to find the right weighting. Imagine I want to compute f([0.2, 0.8]):
array      -->   1     |     2     |     9
[0..1]     -->  0.00  0.25  0.50  0.75  1.00
[0.2,0.8]  -->        ^______________^
The latter being the range of values I want to compute the average of.
Would it be mathematically correct to compute the average like this?
          1 * (1-0.8)   <- 0.2 'translated' to [0..0.25]
        + 2 * 1
avg =   + 9 * 0.2       <- 0.8 'translated' to [0.75..1]
        -------------
          1.4           <-- the sum of weights
This looks correct.
In your example, your interval's length is 0.6. In that interval, your number 2 takes up (0.75-0.25)/0.6 = 0.5/0.6 = 10/12 of the space. Your number 1 takes up (0.25-0.2)/0.6 = 0.05/0.6 = 1/12 of the space, and likewise your number 9.
This sums up to 10/12 + 1/12 + 1/12 = 1.
For better intuition, think about it like this: the problem is to determine how much of the interval each array element covers. The rest is just plugging into the machinery described in http://en.wikipedia.org/wiki/Weighted_average#Mathematical_definition .
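Here is a minimal Python sketch of that weighting idea (my own illustration; the function name interval_average is hypothetical). It assumes, as in the reasoning above, that each element "owns" an equal-width cell centred on its sample position, and weights each value by how much of the query interval [a, b] falls inside its cell.

def interval_average(array, a, b):
    m = len(array) - 1                   # element i sits at position i / m
    total = 0.0
    for i, value in enumerate(array):
        lo = max(0.0, (i - 0.5) / m)     # left edge of element i's cell
        hi = min(1.0, (i + 0.5) / m)     # right edge of element i's cell
        overlap = max(0.0, min(hi, b) - max(lo, a))
        total += value * overlap
    return total / (b - a)

print(interval_average([1, 2, 9], 0.2, 0.8))   # 2.5, i.e. weights 1/12, 10/12, 1/12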

round to the nearest multiple of 1/16 in C

Please let me know how to round a decimal number like 0.53124 to the nearest multiple of 1/16, which is 0.5. Similarly, rounding 0.46875 must also give 0.5. Thanks
floor(0.53124 * 16 + 0.5) / 16
floor(0.46875 * 16 + 0.5) / 16
floor(x * 16 + 0.5) / 16
I suppose that you can multiply by 16, call round(double x), and divide by 16. noob code:
#include <math.h>

double x = 0.53124;   /* example input */
x = x * 16;           /* scale so multiples of 1/16 become whole numbers */
x = round(x);         /* round to the nearest integer */
x = x / 16;           /* scale back down: x is now 0.5 */
and the one-line code:
x = round(x * 16) / 16;
C Code:
answer = (int) ((x + 1.0/32.0) * 16) / 16.0;   /* add half of 1/16, then truncate; assumes x >= 0 */
Python verification:
>>> int(((.53124 + 1.0/32) * 16)) / 16.0
0.5
>>> int(((.46875 + 1.0/32) * 16)) / 16.0
0.5
>>>
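For completeness, a small Python sketch of the same idea (my own, mirroring the floor-based answer above; the name round_to_sixteenth is mine). floor(x * 16 + 0.5) always rounds halfway cases up, whereas Python's built-in round() rounds halfway cases to the nearest even integer.

import math

def round_to_sixteenth(x):
    # scale by 16, round to the nearest integer, then scale back
    return math.floor(x * 16 + 0.5) / 16

print(round_to_sixteenth(0.53124))   # 0.5
print(round_to_sixteenth(0.46875))   # 0.5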

What does "linear interpolation" mean?

I often hear the term "linear interpolation" in context with animations in WPF. What exactly does "linear interpolation" mean? Could you give me an example where to use "linear interpolation"?
Linear means lines (straight ones).
Interpolation is the act of finding a point within two other points. Contrast this with extrapolation, which is finding a point beyond the ends of a line.
So linear interpolation is the use of a straight line to find a point between two others.
For example:
*(5,10)
/
/
/
/
*(0,0)
You can use the two endpoints with linear interpolation to get the points along the line:
(1,2)
(2,4)
(3,6)
(4,8)
and linear extrapolation to get (for example):
(1000,2000)
(-1e27,-2e27)
In animation, let's say you have a bouncing ball that travels from the (x,y) position of (60,22) to (198,12) in 10 seconds.
With an animation rate of 10 frames per second, you can calculate its position at any time with:
x0 = 60, y0 = 22
x1 = 198, y1 = 12
frames = 100
for t = 0 to frames:
x = (x1 - x0) * (t / frames) + x0
y = (y1 - y0) * (t / frames) + y0
Those two formulae at the bottom are examples of linear interpolation. At 50% (where t == 50):
x = (198 - 60) * (50 / 100) + 60
= 138 * 0.5 + 60
= 69 + 60
= 129
y = (12 - 22) * (50 / 100) + 22
= -10 * 0.5 + 22
= -5 + 22
= 17
and (129,17) is the midpoint between the starting and ending positions.
For example, when you want a storyboard to move an element from one position to another at a fixed speed, you'd use linear interpolation between the start and end positions.
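Here is a small Python sketch of the same calculation (my own example, mirroring the pseudocode above; the helper name lerp is mine):

def lerp(a, b, t):
    # linear interpolation: t = 0 gives a, t = 1 gives b
    return (b - a) * t + a

x0, y0 = 60, 22
x1, y1 = 198, 12
frames = 100

for t in (0, 50, 100):
    x = lerp(x0, x1, t / frames)
    y = lerp(y0, y1, t / frames)
    print(t, x, y)   # t = 50 gives the midpoint (129.0, 17.0)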
