My question might be easy, but I am not sure about the time indices in the well-known Q-learning equation.
The equation:
Qt+1(St, At) = Qt(St, At) + alpha * (Rt+1 + gamma * max_A(Qt(St+1, A)) - Qt(St, At))
and I don't understand what Rt+1 stands for. Simple example:
1. We are at state X at time T.
2. Pick a new action based on epsilon-greedy.
3. Apply the action.
4. We are at state Y at time T + 1.
5. (Now we want to update the Q-values for state Y.) Is the reward calculated from the action X -> Y (?), or is it the reward from the action Y -> Z after evaluating all the next Q-values (max_A(Q(Y, A)))?
6. Repeat from 1.
On the previous turn you were in state s(t) and took action a(t). Now you are in state s(t+1), receive reward r(t+1) and (greedily) choose action a(t+1). You adjust the value of the previous action towards the sum of the discounted value of the new action and the reward.
A few misconceptions in your example:
You are actually updating action values, not state values.
You are updating the value of the action taken at state X, not at Y.
The specific action taken at state X may lead to various states, not just Y, so there is no such thing as the X→Y action.
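To make the indices concrete, here is a minimal Python sketch of a single update. The table layout (a dict keyed by `(state, action)`), the function name `q_learning_update`, and the numbers are my own assumptions for illustration; only the update rule itself comes from the equation in the question.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r_next, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update to the pair (s, a) and return its new value.

    s, a    -- state and action at time t (state X and the action taken there)
    r_next  -- reward R(t+1), received after taking a in s
    s_next  -- state at time t+1 (state Y)
    """
    # max_A Q(S(t+1), A): best currently estimated value available from Y
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_target = r_next + gamma * best_next
    # Only the entry for (X, a) changes; nothing stored for Y is modified.
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)          # hypothetical table, all values start at 0
actions = ["left", "right"]     # hypothetical action set
Q[("Y", "left")] = 2.0          # pretend we already learned something about Y

new_value = q_learning_update(Q, "X", "right", r_next=1.0, s_next="Y", actions=actions)
print(new_value)
```

With these numbers the update is 0 + 0.5 * (1 + 0.9 * 2 - 0) = 1.4, and it is written into the entry for ("X", "right"), while the entries for Y are only read.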
What is the practical and theoretical difference between these 3 states, which ultimately produce the same output result?
Could you tell me some examples of different results obtained starting from these 3 states and doing the same operations below.
The concept is unclear to me.
Thank you
|0> -> RY(pi/2) -> RX(pi) -> cnot q[0] q[1]
|0> -> RX(pi/2) -> cnot q[0] q[1]
|0> -> H -> cnot q[0] q[1]
Not all of these states are the same, assuming that you're talking about the single-qubit states obtained before application of the CNOT gate (otherwise please specify which single-qubit gates are applied to which qubit in the 2-qubit state).
The last state is H|0⟩ = 1/sqrt(2) (|0⟩ + |1⟩).
The first state ends up being the same state, up to a global phase, which means there is no way to observe a difference between these two states.
But the second state is 1/sqrt(2) (|0⟩ - i|1⟩), which behaves differently.
To observe the difference between the second and the last states, apply a Hadamard gate to each and measure them multiple times: you'll always get 0 for the last state, but you'll get both 0 and 1 for the second state.
To quickly run this experiment, you can use Q#: running the following snippet will give you roughly 50 zero measurements out of 100 for the state prepared using Rx and 100 zero measurements for the state prepared using H.
open Microsoft.Quantum.Diagnostics;
open Microsoft.Quantum.Math;

operation RunTests (prep : (Qubit => Unit)) : Unit {
    mutable n0 = 0;
    use q = Qubit();
    for _ in 1 .. 100 {
        // Prepare the qubit in the given state.
        prep(q);
        // Apply Hadamard gate and measure.
        H(q);
        if M(q) == Zero {
            set n0 += 1;
        }
        Reset(q);
    }
    Message($"{n0} zeros measured");
}

operation QubitsDemo () : Unit {
    RunTests(Rx(PI() / 2.0, _));
    RunTests(H);
}
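If you don't have Q# at hand, the same comparison can be checked with exact amplitudes instead of sampling. This is a rough plain-Python sketch; the helper names (`apply`, `rx`, `ry`, `prob_zero_after_h`) are mine, and the rotation matrices use the usual convention R(theta) = exp(-i theta P / 2), which I am assuming matches the gates in the question.

```python
import math

def apply(gate, state):
    """Multiply a 2x2 gate (nested lists) by a 2-element state vector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]

def rx(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]

def ry(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]

H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

zero = [1, 0]
# The three single-qubit states, before the CNOT is applied.
states = {
    "RY(pi/2) then RX(pi)": apply(rx(math.pi), apply(ry(math.pi / 2), zero)),
    "RX(pi/2)": apply(rx(math.pi / 2), zero),
    "H": apply(H, zero),
}

def prob_zero_after_h(state):
    """Probability of measuring 0 after applying one more Hadamard."""
    return abs(apply(H, state)[0]) ** 2

probs = {name: prob_zero_after_h(state) for name, state in states.items()}
for name, p in probs.items():
    print(f"{name}: P(0 after H) = {p:.3f}")
```

The first and last states give a probability of 1 (they differ only by a global phase), while the RX(pi/2) state gives 0.5, which is the difference the Q# experiment samples.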
By definition, the gate 1/sqrt(5) (I + 2iZ) should act on a qubit a|0> + b|1> to transform it into 1/sqrt(5) ((1+2i)a|0> + (1-2i)b|1>), but the transformations in each RUS step do the following:
The ancillas are in |+> state at first
Starting form: 1/sqrt(2) (a,b,a,b,a,b,a,b)
CCNOT(ancillas, input): 1/sqrt(2) (a,b,a,b,a,b,b,a)
S(input): 1/sqrt(2) (a,ib,a,ib,a,ib,b,ia)
CCNOT(ancillas, input): 1/sqrt(2) (a,ib,a,ib,a,ib,ia,b)
Z(input) : 1/sqrt(2) (a,-ib,a,-ib,a,-ib,ia,-b)
Now, measuring the ancillas in the PauliX basis is equivalent to a PauliZ measurement after applying H() to the state. I have two points of confusion. First, should I apply H x H x I or H x H x H to the combined state? Second, neither of these transformations turns out to be equivalent to the V-gate defined in the first paragraph when both measurements are Zero. Where did I go wrong?
Reference: https://github.com/microsoft/Quantum/blob/master/samples/diagnostics/unit-testing/RepeatUntilSuccessCircuits.qs (1st sample code)
The transformation is correct, though it takes some time with pen and paper to verify it.
As a side note, we start with a state |+>|+>(a|0> + b|1>), which is 0.5 (a,b,a,b,a,b,a,b) in vector form (both |+> states contribute a 1/sqrt(2) to the coefficients). It will not affect our calculations of the state after the measurement, since it will have to be renormalized, but it's still worth noting.
After a sequence of CCNOT, S, CCNOT, Z we get 0.5 (a,-ib,a,-ib,a,-ib,ia,-b). Since we're measuring only the first two qubits in PauliX basis, we need to apply Hadamards only to the first two qubits, or H x H x I to the combined state.
I'll take the liberty to skip writing out the whole expression after applying Hadamards and fast-forward to the results of measurements, and here is why. We're only interested in the state of the input qubit if both measurements yielded 0, so it's sufficient to gather only the terms of the overall state which have |00> as the state of the first two qubits.
The state of the third qubit after measuring |00> on the first two qubits will be: (3+i)a |0> - (3i+1)b |1>, multiplied by some normalization coefficient c.
Since |3+i|^2 = |3i+1|^2 = 10 and |a|^2 + |b|^2 = 1, we get c = 1/sqrt(10).
Now we need to check whether the state we got, |S_actual> = 1/sqrt(10) ((3+i)a |0> - (3i+1)b |1>)
is the same state as we'd expect to get from applying the V gate,
|S_expected> = 1/sqrt(5) ((1+2i)a |0> + (1-2i)b |1>). They do not look the same, but remember that in quantum computing the states are defined up to a global phase. Thus, if we can find a complex number p with an absolute value 1 for which |S_actual> = p * |S_expected>, the states will be effectively the same.
This translates into the following equations for p and amplitudes of |0> and |1>: (3+i)/sqrt(2) = p (1+2i) and -(3i+1)/sqrt(2) = p (1-2i). We solve both equations to get p = (1-i)/sqrt(2) which has indeed the absolute value 1.
Thus, we can conclude that the state we got after all the transformations is indeed equivalent to the state we'd get by applying a V gate.
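If you'd rather not redo the pen-and-paper part, the calculation can be checked numerically. This is only a sketch in plain Python: the function name `v_gate_check` and the test amplitudes a = 0.6, b = 0.8i are arbitrary choices of mine, and the amplitudes are ordered as in the question (two ancillas first, input qubit last).

```python
import math

def v_gate_check(a, b):
    """Return (actual, expected) input-qubit states for ancillas measured as 00."""
    # State after CCNOT, S, CCNOT, Z, in the order 000, 001, ..., 111.
    state = [0.5 * x for x in (a, -1j * b, a, -1j * b, a, -1j * b, 1j * a, -b)]
    # Applying H x H x I and keeping the |00> ancilla terms is the same as
    # contracting with <00|(H x H) = 0.5 * (<00| + <01| + <10| + <11|).
    amp0 = 0.5 * sum(state[0::2])   # input qubit in |0>
    amp1 = 0.5 * sum(state[1::2])   # input qubit in |1>
    norm = math.sqrt(abs(amp0) ** 2 + abs(amp1) ** 2)
    actual = (amp0 / norm, amp1 / norm)
    expected = ((1 + 2j) * a / math.sqrt(5), (1 - 2j) * b / math.sqrt(5))
    return actual, expected

a, b = 0.6, 0.8j                    # any normalized pair works here
actual, expected = v_gate_check(a, b)
p = actual[0] / expected[0]         # candidate global phase
print("global phase p =", p, "|p| =", abs(p))
print("second amplitude matches:", abs(actual[1] - p * expected[1]) < 1e-9)
```

The ratio p comes out as (1-i)/sqrt(2) for both amplitudes, so the two states differ only by a global phase.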
I would like to make a real-time filter using Flink.
The idea is to have a value per key stored as an accumulator and to calculate its ratio to the total sum over all keys.
I know it's impossible to share state between keyed operators, so I'm not able to calculate the total value.
example :
k1,1
k2,3
k1,1
k2,5
k3,0
I need to calculate on the stream the following ratio
1/1 , 3/4, 2/5, 8/10, 0 (is always filtered) etc...
Thanks for help
Create a custom stateful operator with the following state:
int totalSum;
Map<Key,Ratio> map;
Every event increments the total sum, then updates the map entry for the event's key.
Example:
After 1st event k1,1 your state is:
totalSum 1
map
k1, 1/1
And you emit the event: k1, 1/1
======
After 2nd event k2,3 your state is:
totalSum 4
map
k1, 1/1
k2, 3/4
And you emit the event: k2, 3/4
[.. continue]
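The bookkeeping can be sketched in plain Python. This is not Flink code: the class name `RatioOperator` and its `process` method are hypothetical, and I keep a per-key running sum in the map (rather than a ready-made ratio) so the numerator can be updated on each event. In Flink itself, I would expect this to mean routing all events through a single operator instance so that the total is visible in one place, but treat that as an assumption.

```python
class RatioOperator:
    """Keeps a running total and per-key sums; emits (key, key_sum, total)."""

    def __init__(self):
        self.total_sum = 0
        self.key_sums = {}

    def process(self, key, value):
        # Every event increments the total, then updates the entry for its key.
        self.total_sum += value
        self.key_sums[key] = self.key_sums.get(key, 0) + value
        if self.key_sums[key] == 0:
            return None  # a zero ratio is always filtered out
        return key, self.key_sums[key], self.total_sum

events = [("k1", 1), ("k2", 3), ("k1", 1), ("k2", 5), ("k3", 0)]
op = RatioOperator()
results = [op.process(key, value) for key, value in events]
for out in results:
    if out is not None:
        key, numerator, denominator = out
        print(f"{key}, {numerator}/{denominator}")
```

For the sample input from the question this yields the ratios 1/1, 3/4, 2/5 and 8/10, and drops the k3 event.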
In the chapter about Value Iteration algorithm to calculate optimal policy for MDPs, there is an algorithm:
function Value-Iteration(mdp,ε) returns a utility function
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s'|s,a),
            rewards R(s), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U', vectors of utilities for states in S, initially zero
                   δ, the maximum change in the utility of any state in an iteration
  repeat
    U ← U'; δ ← 0
    for each state s in S do
      U'[s] ← R(s) + γ max(a in A(s)) ∑ over s' (P(s'|s,a) U[s'])
      if |U'[s] - U[s]| > δ then δ ← |U'[s] - U[s]|
  until δ < ε(1-γ)/γ
  return U
(I apologize for the formatting, but I need 10 rep to post picture and $latex formatting$ doesn't seem to work here.)
and also a chapter earlier there was a statement:
A discount factor of γ is equivalent to an interest rate of (1/γ) − 1.
Could anyone explain to me what the interest rate (1/γ)-1 means? How did they get it? Why is it used in the termination condition in the algorithm above?
The reward at t-1 is considered discounted by a factor gamma (y). That is to say, old = y * new. So new = (1/y) * old, and new - old = ((1/y) - 1) * old. That relative growth from old to new over one step is your interest rate.
I am not so sure why it is used in the termination condition. The value of epsilon is arbitrary, anyway.
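In fact, I believe this termination criterion is very bad. It does not work when y = 1: the threshold becomes 0, so δ < 0 can never hold, even though y = 1 is exactly the case where many iterations are necessary. When y = 0, the iteration should stop immediately, since a single sweep is enough to estimate perfect values, but the threshold then involves a division by zero.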
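For experimenting with the stopping rule, here is a small Python sketch of the pseudocode from the question. The two-state MDP at the bottom is made up by me, and the data layout (dicts for `P`, `R` and `actions`) is just one convenient choice. As far as I remember the textbook argument, the threshold ε(1-γ)/γ is picked so that the newest utilities end up within ε of the true ones, but take that as my reading rather than something established in this thread. The sketch only accepts 0 < γ < 1, for the reasons given above.

```python
def value_iteration(states, actions, P, R, gamma, eps):
    """Value iteration with the stopping rule delta < eps * (1 - gamma) / gamma.

    P[(s, a)] maps each successor state s2 to its probability P(s2 | s, a).
    """
    if not 0 < gamma < 1:
        raise ValueError("this stopping rule needs 0 < gamma < 1")
    threshold = eps * (1 - gamma) / gamma
    U_new = {s: 0.0 for s in states}
    while True:
        U = dict(U_new)          # U <- U'
        delta = 0.0
        for s in states:
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in P[(s, a)].items())
                for a in actions[s]
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        if delta < threshold:
            return U             # the pseudocode returns U, not U'

# Hypothetical toy MDP: from "a" you can stay or go to "b"; "b" is absorbing.
states = ["a", "b"]
actions = {"a": ["stay", "go"], "b": ["stay"]}
P = {("a", "stay"): {"a": 1.0}, ("a", "go"): {"b": 1.0}, ("b", "stay"): {"b": 1.0}}
R = {"a": 0.0, "b": 1.0}

U = value_iteration(states, actions, P, R, gamma=0.9, eps=1e-3)
print(U)
```

For this toy problem the exact utilities are U(b) = 1/(1 - 0.9) = 10 and U(a) = 0.9 * 10 = 9, so the printed values should land close to those.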
I want to move something a set distance. However, in my system there is inertia/drag/negative acceleration. I'm using a simple calculation like this for it:
v = oldV + ((targetV - oldV) * inertia)
Applying that over a number of frames makes the movement 'ramp up' or decay, eg:
v = 10 + ((0 - 10) * 0.25) = 7.5 // velocity changes from 10 to 7.5 this frame
So I know the distance I want to travel and the acceleration, but not the initial velocity that will get me there. Maybe a better explanation is I want to know how hard to hit a billiard ball so that it stops on a certain point.
I've been looking at Equations of motion (http://en.wikipedia.org/wiki/Equations_of_motion) but can't work out what the correct one for my problem is...
Any ideas? Thanks - I am from a design not science background.
Update: Fiirhok has a solution with a fixed acceleration value; HTML+jQuery demo:
http://pastebin.com/ekDwCYvj
Is there any way to do this with a fractional value or an easing function? The benefit of that in my experience is that fixed acceleration and frame based animation sometimes overshoots the final point and needs to be forced, creating a slight snapping glitch.
This is a simple kinematics problem.
At some time t, the velocity (v) of an object under constant acceleration is described by:
v = v0 + at
Where v0 is the initial velocity and a is the acceleration. In your case, the final velocity is zero (the object is stopped) so we can solve for t:
t = -v0/a
To find the total distance traveled, we take the integral of the velocity (the first equation) over time. I haven't done an integral in years, but I'm pretty sure this one works out to:
d = v0t + 1/2 * at^2
We can substitute in the equation for t we developed earlier:
d = -v0^2/a + 1/2 * v0^2 / a = -v0^2 / (2a)
And then solve for v0:
v0 = sqrt(-2ad)
Or, in a more programming-language format:
initialVelocity = sqrt( -2 * acceleration * distance );
The acceleration in this case is negative (the object is slowing down), and I'm assuming that it's constant, otherwise this gets more complicated.
If you want to use this inside a loop with a finite number of steps, you'll need to be a little careful. Each iteration of the loop represents a period of time. The object will move an amount equal to the average velocity times the length of time. A sample loop with the length of time of an iteration equal to 1 would look something like this:
position = 0;
currentVelocity = initialVelocity;
while( currentVelocity > 0 )
{
    averageVelocity = currentVelocity + (acceleration / 2);
    position = position + averageVelocity;
    currentVelocity += acceleration;
}
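Putting the formula and the loop together, a rough Python version could look like this. The function names and the idea of shortening the last frame (so the object stops exactly at the target instead of overshooting) are my additions, not part of the derivation above; the frame length is still assumed to be 1.

```python
import math

def initial_velocity(distance, acceleration):
    """v0 = sqrt(-2 * a * d); the acceleration must be negative (slowing down)."""
    if acceleration >= 0:
        raise ValueError("acceleration must be negative")
    return math.sqrt(-2 * acceleration * distance)

def simulate(distance, acceleration):
    """Step frame by frame and return the position after each frame."""
    velocity = initial_velocity(distance, acceleration)
    position = 0.0
    positions = []
    while velocity > 0:
        # Time left until the object stops; the last frame may be shorter than 1.
        dt = min(1.0, velocity / -acceleration)
        # Distance this frame = average velocity over the frame * dt.
        position += velocity * dt + 0.5 * acceleration * dt * dt
        velocity = velocity + acceleration * dt if dt == 1.0 else 0.0
        positions.append(position)
    return positions

positions = simulate(distance=100, acceleration=-2)
print(initial_velocity(100, -2), len(positions), positions[-1])
```

With a distance of 100 and an acceleration of -2 per frame, the initial velocity is sqrt(400) = 20 and the object comes to rest at position 100 after 10 frames.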
If you want to move a set distance, use the following:
Distance travelled is just the integral of velocity with respect to time. You need to integrate your expression with respect to time with limits [v, 0] and this will give you an expression for distance in terms of v (initial velocity).