Bernoulli bandit - Thompson sampling: an alternative sampling idea

In Thompson sampling, we maintain Beta parameters for each arm and sample from each arm's Beta distribution to pick the best arm.
Why can't we just maintain the mean of the Beta distribution for each arm, alpha_k/(alpha_k+beta_k), and pick arms with probability proportional to these means?
For example, suppose we have 3 arms.
Arm 1 (alpha = 1, beta = 1), i.e. mean = 0.5
Arm 2 (alpha = 2, beta = 1), i.e. mean = 0.67
Arm 3 (alpha = 1, beta = 1), i.e. mean = 0.5
Why can't we pick them proportionally to their means?
Arm 1 with probability 0.5/(0.5+0.67+0.5) = 0.3
Arm 2 with probability 0.67/(0.5+0.67+0.5) = 0.4
Arm 3 with probability 0.5/(0.5+0.67+0.5) = 0.3
Would it not converge?
I understand the Beta-distribution-based analysis, but I am not able to see what the problem is with choosing each arm in proportion to its estimated mean.
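For concreteness, here is a plain-Python sketch of both selection rules (the function names and structure are my own, using only the standard library):

```python
import random

def thompson_select(params):
    """Thompson sampling: draw one sample from each arm's Beta
    posterior and play the arm with the largest draw."""
    draws = [random.betavariate(a, b) for a, b in params]
    return max(range(len(params)), key=lambda k: draws[k])

def mean_proportional_select(params):
    """The proposed alternative: pick each arm with probability
    proportional to its posterior mean alpha/(alpha+beta)."""
    means = [a / (a + b) for a, b in params]
    return random.choices(range(len(params)), weights=means)[0]

# The three arms from the question, as (alpha, beta) pairs.
params = [(1, 1), (2, 1), (1, 1)]

# Selection probabilities under the proportional rule:
means = [a / (a + b) for a, b in params]
probs = [m / sum(means) for m in means]  # ~[0.3, 0.4, 0.3]
```

Note that the proportional rule depends only on the posterior means, while a Thompson draw also reflects how spread out each arm's posterior still is.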


Julia way to write k-step look ahead function?

Suppose I have two arrays representing a probabilistic graph:
2
/ \
1 -> 4 -> 5 -> 6 -> 7
\ /
3
Where the probability of going to state 2 is 0.81 and the probability of going to state 3 is (1-0.81) = 0.19. My arrays represent the estimated values of the states as well as the rewards. (Note: Each index of the array represents its respective state)
V = [0, 3, 8, 2, 1, 2, 0]
R = [0, 0, 0, 4, 1, 1, 1]
The context doesn't matter so much; it's just to give an idea of where I'm coming from. I need to write a k-step look-ahead function where I sum the discounted values of the rewards and add them to the estimated value of the kth state.
I have been able to do this so far by creating a separate function for each step of look-ahead. My goal in asking this question is to figure out how to refactor this code so that I don't repeat myself, in idiomatic Julia.
Here is an example of what I am talking about:
function E₁(R::Array{Float64,1}, V::Array{Float64,1}, P::Float64)
    V[1] + 0.81*(R[1] + V[2]) + 0.19*(R[2] + V[3])
end
function E₂(R::Array{Float64,1}, V::Array{Float64,1}, P::Float64)
    V[1] + 0.81*(R[1] + R[3]) + 0.19*(R[2] + R[4]) + V[4]
end
function E₃(R::Array{Float64,1}, V::Array{Float64,1}, P::Float64)
    V[1] + 0.81*(R[1] + R[3]) + 0.19*(R[2] + R[4]) + R[5] + V[5]
end
...
So on and so forth. If I could ignore E₁(), this would be exceptionally easy to refactor. But because I have to discount the value estimate at two different states, I'm having trouble thinking of a way to generalize this to k steps.
Obviously I could write a single function that takes an integer and then use a bunch of if-statements, but that doesn't seem in the spirit of Julia. Any ideas on how I could refactor this? A closure of some sort? A different data type to store R and V?
It seems like you essentially have a discrete Markov chain. So the standard way would be to store the graph as its transition matrix:
T = zeros(7,7)
T[1,2] = 0.81
T[1,3] = 0.19
T[2,4] = 1
T[3,4] = 1
T[4,5] = 1
T[5,6] = 1
T[6,7] = 1
Then you can calculate the probabilities of ending up at each state, given an initial distribution, by multiplying by T' from the left (the transition matrix is conventionally defined with rows as source states, hence the transpose):
julia> T' * [1,0,0,0,0,0,0] # starting from (1)
7-element Array{Float64,1}:
0.0
0.81
0.19
0.0
0.0
0.0
0.0
Likewise, the probability of ending up at each state after k steps can be calculated by using powers of T':
julia> T' * T' * [1,0,0,0,0,0,0]
7-element Array{Float64,1}:
0.0
0.0
0.0
1.0
0.0
0.0
0.0
Now that you have all the probabilities after k steps, you can easily calculate expectations as well. It may also pay off to define T as a sparse matrix.
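The same computation can be sketched in plain Python (function and variable names are mine; the matrix encodes the chain 1 -> {2,3} -> 4 -> 5 -> 6 -> 7, with 0-based indices):

```python
def mat_vec(M, v):
    """Multiply matrix M by vector v."""
    n = len(v)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

# Transition matrix: T[i][j] = P(next state j | current state i).
n = 7
T = [[0.0] * n for _ in range(n)]
T[0][1] = 0.81
T[0][2] = 0.19
T[1][3] = 1.0
T[2][3] = 1.0
T[3][4] = 1.0
T[4][5] = 1.0
T[5][6] = 1.0

# Transpose, so multiplying from the left advances a distribution.
T_t = [[T[j][i] for j in range(n)] for i in range(n)]

v = [1.0, 0, 0, 0, 0, 0, 0]          # start in state 1
one_step = mat_vec(T_t, v)           # [0, 0.81, 0.19, 0, 0, 0, 0]
two_steps = mat_vec(T_t, one_step)   # all mass reaches state 4
```

With the per-step distributions in hand, the k-step expectation is just a dot product with the reward (or value) vector, which removes the need for a separate function per step.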

Stuck implementing a method for mapping symbols to an interval - if-else loop not working properly, implementation does not match theory

I am trying out an encoding-decoding method that was asked about in this post:
https://stackoverflow.com/questions/40820958/matlab-help-in-implementing-a-mathematical-equation-for-generating-multi-level
and a related one Generate random number with given probability matlab
There are 2 parts to this question - encoding and decoding. Encoding of a symbolic sequence is done by inverse interval mapping using the map f_inv. The inverse interval mapping yields a real-valued number; based on this number, we iterate the map f(). The solution in the post in the first link does not work, because once the final interval is found, iterating the map f() as proposed does not yield exactly the same symbolic array. So I tried directly implementing the equations for the forward iteration f() given in the paper for the decoding process, but the decoding still does not generate the same symbolic sequence.
Here is a brief explanation of the problem.
Let there be an array b = [1,3,2,6,1] containing N = 5 integer-valued elements, with the probability of occurrence of each unique integer being 0.4, 0.2, 0.2, 0.2 respectively. The array b can take any integer from the unique symbol set {1,2,...,8}; let n = 8 be the number of elements in the symbol set. In essence, the probabilities for the above data b are
p = [0.4 (symbol 1), 0.2 (symbol 2), 0.2 (symbol 3), 0 (symbol 4, not occurring), 0 (symbol 5), 0.2 (symbol 6), 0 (symbol 7), 0 (symbol 8)]
The interval [0,1] is split into 8 regions. Let the interval boundaries for the data b be assumed known as
Interval_b = [0, 0.4, 0.6, 0.8, 1];
In general, for n = 8 unique symbols there are n = 8 intervals I_1, I_2, ..., I_8, and each interval is assigned a symbol from [1 2 3 4 5 6 7 8].
Let x = 0.2848 be the value obtained from the reverse interval mapping of the symbol array b, using the encoding solution in the link. There is a mapping rule which maps x to a symbol depending on the interval in which x lies, and iterating it should recover the same symbol elements as in b.
It looks like the argument Interval passed to the function ObtainSymbols should contain entries for all symbols, including the ones with probability 0. This can be done by adding the statement
Interval = cumsum([0, p_arr]);
immediately before the calls to function ObtainSymbols.
The following is the output with this modification:
...
p_arr = [p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8];
% unchanged script above this
% recompute Interval for all symbols
Interval = cumsum([0, p_arr]);
% [0 0.4 0.6 0.8 0.8 0.8 1.0 1.0 1.0]
% unchanged script below
[y1,symbol1] = ObtainSymbols(x(1),p_arr,Interval);
[y2,symbol2] = ObtainSymbols(y1,p_arr,Interval);
[y3,symbol3] = ObtainSymbols(y2,p_arr,Interval);
[y4,symbol4] = ObtainSymbols(y3,p_arr,Interval);
[y5,symbol5] = ObtainSymbols(y4,p_arr,Interval);
Symbols = [symbol1,symbol2,symbol3,symbol4,symbol5]
y = [y1,y2,y3,y4,y5]
% Symbols = [1 3 2 6 1]
% y = [0.7136 0.5680 0.8400 0.2000 0.5000]
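For illustration, here is a plain-Python sketch of the decoding loop, using exact rationals so the interval-boundary comparisons are not disturbed by floating point. The renormalization (x - lower)/p[k] is my inference of the forward map f() from the printed y values, not a quote of the paper, and obtain_symbol only loosely mirrors the MATLAB ObtainSymbols:

```python
from fractions import Fraction

# p includes entries for all 8 symbols, zero-probability ones included,
# matching the cumsum fix above.
p = [Fraction(k, 10) for k in (4, 2, 2, 0, 0, 2, 0, 0)]

# interval = cumsum([0, p]) -> [0, 0.4, 0.6, 0.8, 0.8, 0.8, 1, 1, 1]
interval = [Fraction(0)]
for pk in p:
    interval.append(interval[-1] + pk)

def obtain_symbol(x):
    """Return the 1-based symbol k with x in I_k = [interval[k-1], interval[k])
    (zero-width intervals never match), plus the renormalized value.
    The map (x - lower) / p[k] is an assumption inferred from the output."""
    for k in range(1, len(interval)):
        lo, hi = interval[k - 1], interval[k]
        if lo <= x < hi:
            return k, (x - lo) / p[k - 1]
    raise ValueError("x must lie in [0, 1)")

x = Fraction(2848, 10000)  # the (rounded) encoded value from the question
symbols = []
for _ in range(5):
    s, x = obtain_symbol(x)
    symbols.append(s)
# symbols == [1, 3, 2, 6, 1], matching b
```

Skipping the zero-width intervals is what the corrected Interval vector buys: x = 0.8 falls through the empty I_4 and I_5 straight to symbol 6.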

PWM signal generation based on Mic input

I am using the MPC 7555 controller. It has a 16-bit sigma-delta ADC.
A signal called mic input is fed to this ADC pin. Based upon the voltage, a PWM signal at the same frequency as the ADC sampling should be generated.
For e.g.
0.1 V = 2 percent
0.2 V = 4 percent
0.3 V = 6 percent....and so on
So I thought of the following logic:
5V - 0xFFFF in digital
0.1V - 1310
0.2V - 2620 and so on
So dividing the digital value by 655 will give the exact duty-cycle value:
1310/655 = 2
2620/655 = 4 ...
But the digital pin could also read 1309 for 0.1 V, which when divided by 655 would yield 1 and not 2.
Is there any way I can avoid this? If anyone has a better solution, please share.
The task is to output PWM at the same rate as the ADC conversion rate.
Suppose the ADC conversion time is T (you can establish this by reading a free-run timer counter). And suppose the ADC conversion value is V. Then the PWM output time H spent "high" must be
H = T * V / 0xFFFF
Every time an ADC conversion is available, you (cancel any pending one-shot timer interrupt and) set the PWM output to 1 and trigger a one-shot timer at time H. When it interrupts, you set the PWM output to 0 (or the other way round if you have inverse logic).
If the input is 0x0000 or 0xFFFF you can employ an alternative strategy - set the output to 0 or 1, but don't deploy the one-shot timer.
To get the best fidelity in the PWM signal, you would do better to work directly at the resolution of the PWM rather than calculate a percentage only to then convert that to a PWM count. Using an integer percentage, you are effectively limiting your resolution to 6.64 bits per sample (i.e. log10(100)/log10(2)).
So let's say your PWM count per cycle is PWM_MAX, and your ADC maximum ADC_MAX, then the PWM high period would be:
pwm_high = adc_val * PWM_MAX / ADC_MAX ;
It is important to perform the multiplication first to avoid loss of information. If PWM_MAX is sufficiently high, there is probably no need to worry about integer division rounding toward zero rather than to the nearest integer, but if that is a concern (for low PWM_MAX) then:
pwm_high = ((adc_val * PWM_MAX) + (ADC_MAX / 2)) / ADC_MAX ;
For example, say your PWM_MAX is only 100 (i.e. the resolution truly is in integer percent); then in the first case:
pwm_high = 1310 * 100 / 0xFFFF = 1
and in the second:
pwm_high = ((1310 * 100) + 0x7FFF) / 0xFFFF = 2
However if PWM_MAX is a more suitable 4096 perhaps, then:
pwm_high = 1310 * 4096 / 0xFFFF = 81
or
pwm_high = ((1310 * 4096) + 0x7fff) / 0xFFFF = 82
With PWM_MAX at 4096 you have effectively 12 bits of resolution and will maintain much higher fidelity as well as directly calculating the correct PWM value.
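The arithmetic in this answer can be sanity-checked with a short sketch; // is integer division, mirroring the C expressions (the helper name is mine):

```python
ADC_MAX = 0xFFFF

def pwm_high(adc_val, pwm_max, rounded=False):
    """Scale an ADC reading to a PWM count. Multiply first to keep
    precision; optionally round to nearest instead of truncating."""
    num = adc_val * pwm_max
    if rounded:
        num += ADC_MAX // 2  # same as adding ADC_MAX / 2 in the C code
    return num // ADC_MAX

# 0.1 V reading (~1310 counts) at PWM_MAX = 100 vs 4096:
print(pwm_high(1310, 100))         # 1  (truncated)
print(pwm_high(1310, 100, True))   # 2  (rounded)
print(pwm_high(1310, 4096))        # 81
print(pwm_high(1310, 4096, True))  # 82
```

In C the same multiply-first expression needs an intermediate type wide enough for adc_val * pwm_max (e.g. uint32_t), or the product overflows before the division.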

Reduction or atomic operator on unknown global array indices

I have the following algorithm:
__global__ void Update(int N, double* x, double* y, int* z, double* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        x[i] += y[i];
        if (y[i] >= 0.)
            out[z[i]] += x[i];
        else
            out[z[i]] -= x[i];
    }
}
It is important to note that out is smaller than x. Say x, y and z are always the same size, say 1000, and out is always smaller, say 100; z holds the index in out that each element of x and y corresponds to.
This is all fine except for the updates to out. There may be clashes across threads, since z does not contain only unique values and has duplicates. Therefore I currently have this implemented with atomics: atomicAdd, and a subtraction built on atomic compare-and-swap. This is obviously expensive and makes my kernel take 5-10x longer to run.
I would like to avoid this, but the only way I can think of is for each thread to have its own copy of out (which can be large, 10000+ elements, times 10000+ threads). This would mean setting up 10000 double[10000] arrays (perhaps in shared memory?), calling my kernel, and then summing across these arrays, perhaps in another kernel. Surely there must be a more elegant way to do this?
It might be worth noting that x, y, z and out reside in global memory. As my kernels (I have others like this one) are very simple, I have decided not to copy parts into shared memory (nvvp shows equal computation and memory for the kernel, so I suspect there is not much performance to be gained once you add the overhead of moving data from global to shared and back; any thoughts?).
Method 1:
Build a set of "transactions". Since you only have one update per thread, you can easily build a fixed-size "transaction" record, one entry per thread. Suppose I have 7 threads (for simplicity of presentation) and some arbitrary number of entries in my out table. Let's suppose my 7 threads wanted to do 7 transactions like this:
thread ID (i):  0      1      2      3      4      5      6
z[i]:           2      3      4      4      3      2      3
x[i]:           1.5    0.5    1.0    0.5    0.1   -0.2   -0.1
"transaction":  2,1.5  3,0.5  4,1.0  4,0.5  3,0.1  2,-0.2 3,-0.1
Now do a sort_by_key on the transactions, to arrange them in order of z[i]:
sorted: 2,1.5 2,-0.2 3,0.5 3,-0.1 3,0.1 4,1.0 4,0.5
Now do a reduce_by_key operation on the transactions:
keys: 2 3 4
values: 1.3 0.5 1.5
Now update out[i] according to the keys:
out[2] += 1.3
out[3] += 0.5
out[4] += 1.5
thrust and/or cub might be pre-built options for the sort and reduce operations.
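A CPU-side plain-Python sketch of Method 1, using the example transactions above (on the GPU, the two stages would be thrust::sort_by_key and thrust::reduce_by_key; the variable names here are mine):

```python
from itertools import groupby

# One "transaction" (out-index z[i], signed value) per thread:
transactions = [(2, 1.5), (3, 0.5), (4, 1.0), (4, 0.5),
                (3, 0.1), (2, -0.2), (3, -0.1)]

# sort_by_key: order the transactions by their out-index
transactions.sort(key=lambda t: t[0])

# reduce_by_key: sum the values for each distinct key
reduced = {k: sum(v for _, v in grp)
           for k, grp in groupby(transactions, key=lambda t: t[0])}

# apply the per-key totals to out, one update per distinct index
out = [0.0] * 5
for k, v in reduced.items():
    out[k] += v
# out[2] ~= 1.3, out[3] ~= 0.5, out[4] ~= 1.5
```

After the reduction there is exactly one update per distinct out index, so no atomics are needed in the final scatter.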
Method 2:
As you say, you have arrays x, y, z, and out in global memory. If you are going to use the "mapping" z repeatedly, you might want to rearrange (group) or sort your arrays in order of z:
index (i): 0 1 2 3 4 5 6 7
z[i]: 2 8 4 8 3 1 4 4
x[i]: 0.2 0.4 0.3 0.1 -0.1 -0.4 0.0 1.0
group by z[i]:
index (i): 0 1 2 3 4 5 6 7
z[i]: 1 2 3 4 4 4 8 8
x[i]:-0.4 0.2 -0.1 0.3 0.0 1.0 0.4 0.1
This, or some variant of it, would allow you to eliminate having to repeatedly do the sorting operation in method 1 (again, if you were using the same "mapping" vector repeatedly).

Sorting coordinates of point cloud in accordance with X, Y or Z value

A is a series of points coordinates in 3D (X,Y,Z), for instance:
>> A = [1 2 0;3 4 7;5 6 9;9 0 5;7 8 4]
A =
1 2 0
3 4 7
5 6 9
9 0 5
7 8 4
I would like to sort the matrix with respect to "Y" (second column) values.
Here is the code that I am using:
>> tic;[~, loc] = sort(A(:,2));
SortedA = A(loc,:)
toc;
SortedA =
9 0 5
1 2 0
3 4 7
5 6 9
7 8 4
Elapsed time is 0.001525 seconds.
However, it can be very slow for a large set of data. I would appreciate it if anyone knows a more efficient approach.
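For reference, the argsort-then-index idiom in the MATLAB snippet looks like this in plain Python (variable names are mine):

```python
A = [[1, 2, 0],
     [3, 4, 7],
     [5, 6, 9],
     [9, 0, 5],
     [7, 8, 4]]

# [~, loc] = sort(A(:,2)); SortedA = A(loc,:) in MATLAB terms:
loc = sorted(range(len(A)), key=lambda i: A[i][1])  # argsort of column 2
sorted_A = [A[i] for i in loc]
# rows ordered by their second entry: the first row is [9, 0, 5]
```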
Introductory Discussion
This answer mainly discusses how one can harness a compute-efficient GPU to solve the stated problem. The solution code presented in the question was:
[~, loc] = sort(A(:,2));
SortedA = A(loc,:);
There are essentially two parts to it:
Select the second column, sort it, and get the sorted indices.
Index into the rows of the input matrix with the sorted indices.
Now, Part 1 is compute-intensive and can be ported to the GPU, while Part 2, being indexing work, can be done on the CPU itself.
Proposed solution
So, considering all these, an efficient GPU solution would be -
gA = gpuArray(A(:,2)); %// Port only the second column of input matrix to GPU
[~, gloc] = sort(gA); %// compute sorted indices on GPU
SortedA = A(gather(gloc),:); %// get the sorted indices back to CPU with `gather`
%// and then use them to get sorted A
Benchmarking
Presented next is the benchmark code comparing the GPU version against the original solution. Do keep in mind that the GPU code runs on different hardware than the originally stated CPU-only solution, so the benchmark results might vary from system to system.
Here's the benchmark code -
N = 3000000; %// datasize (number of rows in input)
A = rand(N,3); %// generate random large input
disp('------------------ With original solution on CPU')
tic
[~, loc] = sort(A(:,2));
SortedA = A(loc,:);
toc, clear SortedA loc
disp('------------------ With proposed solution on GPU')
tic
gA = gpuArray(A(:,2));
[~, gloc] = sort(gA);
SortedA = A(gather(gloc),:);
toc
Here are the benchmark results -
------------------ With original solution on CPU
Elapsed time is 0.795616 seconds.
------------------ With proposed solution on GPU
Elapsed time is 0.465643 seconds.
So, if you have a decent enough GPU, it is worth trying it out for sorting-related problems, all the more so since MATLAB provides such easy GPU porting.
System Configuration
MATLAB Version: 8.3.0.532 (R2014a)
Operating System: Windows 7
RAM: 3GB
CPU Model: Intel® Pentium® Processor E5400 (2M Cache, 2.70 GHz)
GPU Model: GTX 750Ti 2GB
Try sortrows, specifying column 2:
Asorted = sortrows(A,2)
Simpler, but actually slower now that I test it... Apparently sortrows is not so great if you're only sorting on one column. It's probably best when you sort on multiple columns in a certain order.
MATLAB does have a function called sortrows() to do this, but in my experience it tends to be as slow as what you're doing for a general unstructured matrix.
Test:
N = 1e4;
A = rand(N,N);
tic;[~, loc] = sort(A(:,2));
SortedA = A(loc,:);
toc;
tic; sortrows(A,2); toc;
Gives:
Elapsed time is 0.515903 seconds.
Elapsed time is 0.525725 seconds.
