Random number generation with probability in MATLAB [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
I need to simulate an information source with alphabet "a,b,c,d" with the respective probabilities of 0.1, 0.5, 0.2, 0.2. I do not know how to do it using MATLAB. Help is most appreciated.

You could first create an array in which each character appears a number of times proportional to its probability.
First set the maximum number of copies of any one letter in that array; it does not have to match the number of random samples drawn later:
maxSamplesEach = 100;
Define the data for the problem:
strings = ['a' 'b' 'c' 'd'];
probability = [0.1 0.5 0.2 0.2];
Construct a sample space weighted by relative probabilities:
count = 0;
for k = 1:numel(strings)
    numCopies = round(probability(k)*maxSamplesEach); % copies of this letter in the sample space
    for i = 1:numCopies
        count = count + 1;
        totalSampleSpace(count) = strings(k);
    end
end
Now draw N random indices uniformly from 1 to count. randi does this directly and, unlike rounding a scaled rand, does not under-weight the endpoints:
N = 100;
randomSelections = randi(count, 1, N);
Now here are your random samples taken from the distribution:
randomSamples = totalSampleSpace(randomSelections);
Next just count them up:
for k = 1:numel(strings)
    indices = find(randomSamples == strings(k));
    disp(['Count samples for ', strings(k), ' = ', num2str(numel(indices))]);
end
Keep in mind that these results are statistical, so it is highly unlikely that you will get exactly the same relative counts each time.
Example output:
Count samples for a = 11
Count samples for b = 49
Count samples for c = 19
Count samples for d = 21
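For reference, the same weighted draw can be written more compactly by comparing uniform random numbers against the cumulative probabilities. This is just a sketch of the idea in base MATLAB; it assumes implicit expansion (R2016b or later):
% Sketch: inverse-CDF style weighted sampling.
strings = ['a' 'b' 'c' 'd'];
probability = [0.1 0.5 0.2 0.2];
N = 100;
edges = cumsum(probability);   % cumulative probabilities: 0.1 0.6 0.8 1.0
u = rand(N,1);                 % uniform draws in (0,1)
% For each draw, pick the first bin whose cumulative edge is not exceeded.
[~, bin] = max(u <= edges, [], 2);
randomSamples = strings(bin);  % N-by-1 char vector of weighted samples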

You could do something as simple as the following. Create a large random vector using rand; this produces values uniformly distributed between 0 and 1. If you want a letter to have a 10 percent chance of occurring, you give it a range of width 0.1, for example 0 to 0.1. You then assign further ranges of the appropriate widths to the remaining letters.
vals = rand(1,10000);
letters = cell(size(vals));
[letters{vals <= 0.1}] = deal('a');
[letters{vals > 0.1 & vals <= 0.6}] = deal('b');
[letters{vals > 0.6 & vals <= 0.8}] = deal('c');
[letters{vals > 0.8 & vals <= 1}] = deal('d');
The above code returns a 1-by-10000 cell array of letters with approximately the described percentages.
Or you can do this dynamically as follows:
vals = rand(1,10000);
output = cell(size(vals));
letters2use = {'a','b','c','d'};
percentages = [0.1,0.5,0.2,0.2];
lowerBounds = [0,cumsum(percentages(1:end-1))];
upperBounds = cumsum(percentages);
for i = 1:numel(percentages)
    [output{vals > lowerBounds(i) & vals <= upperBounds(i)}] = deal(letters2use{i});
end
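To sanity-check the result, you can tally the fraction of each letter in output (a small follow-up sketch; the fractions should be close to the requested percentages):
for i = 1:numel(letters2use)
    fprintf('%s: %.3f\n', letters2use{i}, mean(strcmp(output, letters2use{i})));
end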
UPDATE
The above code does not guarantee an exact number of occurrences of each letter; however, the following does. Since your comment suggests you need exactly a certain count of each letter, the following code fixes the counts and assigns the letters to random positions.
numElements = 10000;
letters2use = {'a','b','c','d'};
percentages = [0.1,0.5,0.2,0.2];
numEach = round(percentages*numElements);
while sum(numEach) < numElements
    [~,idx] = max(mod(percentages*numElements,1));
    numEach(idx) = numEach(idx) + 1;
end
while sum(numEach) > numElements
    [~,idx] = min(mod(percentages*numElements,1));
    numEach(idx) = numEach(idx) - 1;
end
indices = randperm(numElements);
output = cell(size(indices));
lower = [0,cumsum(numEach(1:end-1))]+1;
upper = cumsum(numEach);
for i = 1:numel(lower)
    [output{indices(lower(i):upper(i))}] = deal(letters2use{i});
end
output
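And, if you want, you can verify that the counts are now exact rather than approximate (a small check using the variables defined above):
for i = 1:numel(letters2use)
    fprintf('%s: %d\n', letters2use{i}, sum(strcmp(output, letters2use{i})));
end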

Related

Random sampling of elements from an array based on a target condition

I have an array (call it ElmInfo) of size N-by-2 representing a geometry. Column 1 holds the element number and column 2 the element volume. The element volumes vary widely. The sum of all element volumes gives a value V, which can be obtained in MATLAB as:
V=sum(ElmInfo(:,2));
I want to randomly sample elements from the array ElmInfo (with no repetition) in such a way that the total volume of the sampled elements reaches a target volume V1. Note: V1 is less than V, and the number of elements to be sampled is not known in advance: one sampling run might select 10 elements while another selects 15.
There is no straightforward built-in MATLAB function that meets the target condition. How can I implement the code in MATLAB?
Finally I got the answer to my question. Here is the solution I got from a contributor on MATLAB Central. For the convenience of the Stack Overflow community I am posting the answer here.
TotVol = sum(ElmInfo(:,2));
DefVf = 1.5; % volume fraction I want to sample, in percent
% Target sample volume
DefVolm_target = TotVol*(DefVf/100);
% **************************************
n = size(ElmInfo,1);   % number of elements
v = ElmInfo(:,2);      % element volumes
tol = 1e-6;            % tolerance on matching the target volume
sample = [];
maxits = 10000;
for count = 1:maxits
    p = randperm(n);                         % random ordering of the elements
    s = cumsum(v(p));                        % running volume along that ordering
    k = find(abs(s - DefVolm_target) < tol); % positions where the running volume hits the target
    if ~isempty(k)
        sample_indices = p(1:k(1));
        sample = v(sample_indices);
        fprintf('Sample found after %d iterations\n', count);
        break
    end
end
DefVol_sim = sum(sample);           % achieved sample volume
sampled_Elm = sort(sample_indices); % element numbers in the sample

How to subdivide an array?

I want to split an array into segments by percentages. For example, divide 100 elements into segments occupying [1/3, 1/4, 5/12]. But [1/3, 1/4, 5/12]*100 = [33.3, 25, 41.7], so the values need to be adjusted to integers such as [33, 25, 42] (others like [34, 24, 42] or [33, 26, 41] are also acceptable; being off by 1 is not important).
Currently I use a loop that carries the remainders forward:
function x = segment(n,pct)
x = n * pct;
y = fix(x);
r = x - y;
for ii = 1 : length(r)-1
    [r(ii),r(ii+1)] = deal(fix(r(ii)), r(ii+1)+r(ii)-fix(r(ii)));
end
x = y + r;
segment(100,[1/3,1/4,5/12]) gives [33,25,42].
Is there a better way that avoids the loop?
You can just use Adiel's suggestion from the comments (rounding) for most cases, then catch any rogue result afterwards and correct it, only if it needs to be corrected:
function out = segmt( n , pct )
% This works by itself in many cases
out = round(n.*pct);
% For the other cases:
% if the total is not equal to the initial number of points, the
% difference is applied to the largest value (to minimize the
% percentage imbalance).
delta = sum(out) - n;
if delta ~= 0
    [~,idx] = max(out);
    out(idx) = out(idx) - delta;
end
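For example, a quick call (the split can shift by one for other inputs, but the total always equals n):
out = segmt(100, [1/3 1/4 5/12])   % should return [33 25 42]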

matlab: how to speed up the count of consecutive values in a cell array

I have a 137x19 cell array Location(1,4).loc and I want to count how many times horizontally consecutive values are equal in Location(1,4).loc. I have used this code:
x=Location(1,4).loc;
y={x(:,1),x(:,2)};
for ii=1:137
cnt(ii,1)=sum(strcmp(x(:,1),y{1,1}{ii,1})&strcmp(x(:,2),y{1,2}{ii,1}));
end
y={x(:,1),x(:,2),x(:,3)};
for ii=1:137
cnt(ii,2)=sum(strcmp(x(:,1),y{1,1}{ii,1})&strcmp(x(:,2),y{1,2}{ii,1})&strcmp(x(:,3),y{1,3}{ii,1}));
end
y={x(:,1),x(:,2),x(:,3),x(:,4)};
for ii=1:137
cnt(ii,3)=sum(strcmp(x(:,1),y{1,1}{ii,1})&strcmp(x(:,2),y{1,2}{ii,1})&strcmp(x(:,3),y{1,3}{ii,1})&strcmp(x(:,4),y{1,4}{ii,1}));
end
y={x(:,1),x(:,2),x(:,3),x(:,4),x(:,5)};
for ii=1:137
cnt(ii,4)=sum(strcmp(x(:,1),y{1,1}{ii,1})&strcmp(x(:,2),y{1,2}{ii,1})&strcmp(x(:,3),y{1,3}{ii,1})&strcmp(x(:,4),y{1,4}{ii,1})&strcmp(x(:,5),y{1,5}{ii,1}));
end
... and so on for all the columns. This code runs and gives me the correct result, but it is not automated and it is slow. Can you give me ideas to automate and speed up the code?
I think I will write an answer to this since I've not done so for a while.
First convert your cell array to a matrix; this will make the following steps much easier. Then diff is the way to go:
A = randi(5,[137,19]);
DiffA = diff(A,1,2); %// Difference along each row: DiffA is 137 by 18, where each value is an element minus the element to its left.
So a 0 in DiffA means 2 consecutive numbers in A are equal, and 2 consecutive 0s mean 3 consecutive numbers in A are equal.
idx = DiffA==0;
cnt(:,1) = sum(idx,2);
To count runs of 3 consecutive numbers, you could do something like:
idx2 = abs(DiffA(:,1:end-1))+abs(DiffA(:,2:end)) == 0;
cnt(:,2) = sum(idx2,2);
Or use another diff. The abs is used to avoid a negative number plus a positive number that also happens to give 0; with abs, only 0 + 0 gives 0. You can now continue this pattern by doing:
idx3 = abs(DiffA(:,1:end-2))+abs(DiffA(:,2:end-1))+abs(DiffA(:,3:end)) == 0;
cnt(:,3) = sum(idx3,2);
In loop format (W is the number of further overlaps to evaluate; since absDiffA is already non-negative, no extra abs is needed inside the loop, and each pass fills the next column of cnt):
absDiffA = abs(DiffA);
W = size(DiffA,2) - 1;
for ii = 1:W
    absDiffA = absDiffA(:,1:end-1) + absDiffA(:,2:end);
    idx = (absDiffA == 0);
    cnt(:,ii+1) = sum(idx,2);
end
NOTE: this method counts a run of three equal values (e.g. [0,0,0]) twice when evaluating 2-consecutives, and once when evaluating 3-consecutives.
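As a quick illustration of that note, take a single hypothetical row containing one run of three equal values:
A = [7 7 7 2];
DiffA = diff(A,1,2);                                              % gives [0 0 -5]
cnt2 = sum(DiffA == 0, 2)                                         % 2: the run is counted twice as 2-consecutive
cnt3 = sum(abs(DiffA(:,1:end-1)) + abs(DiffA(:,2:end)) == 0, 2)   % 1: and once as 3-consecutive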

How to solve logistic regression using gradient descent?

I was solving an exercise from an online machine learning course on Coursera. The problem statement is:
Suppose that a high school has a dataset representing 40 students who were admitted to college and 40 students who were not admitted. Each ( x(i), y(i) ) training example contains a student's score on two standardized exams and a label of whether the student was admitted.
Our task is to build a binary classification model that estimates college admission chances based on a student's scores on two exams. In the training data,
a. The first column of your x array represents all Test 1 scores, and the second column represents all Test 2 scores.
b. The y vector uses '1' to label a student who was admitted and '0' to label a student who was not admitted.
I have solved it using the built-in function fminunc. Now I am solving it with gradient descent, but my plot of cost versus iteration number is not converging, i.e. the cost function value is not decreasing with the number of iterations. My theta values also do not match the answer I should get.
Theta values that I got:
[-0.085260 0.047703 -0.022851]
Theta values that I should get (expected answer):
[-16.38 0.1483 0.1589]
My source code:
clear ; close all; clc
x = load('/home/utlesh/Downloads/ex4x.txt');
y = load('/home/utlesh/Downloads/ex4y.txt');
theta = [0,0,0];
alpha = 0.00002;
a = [0,0,0];
m = size(x,1);
x = [ones(m,1) x];
n = size(x,2);
y_hyp = y*ones(1,n);
for kk = 1:100000
hyposis = 1./(1 + exp(-(x*theta')));
x_hyp = hyposis*ones(1,n);
theta = theta - alpha*1/m*sum((x_hyp - y_hyp).*x);
a(kk,:) = theta ;
end
cost = [0];
for kk = 1:100000
h = 1./(1 + exp(-(x*a(kk,:)')));
cost(kk,:) = sum(-y .* log(h) - (1 - y) .* log(1 - h));
end
x_axis = [0];
for kk = 1:100000
x_axis(kk,:) = kk;
end
plot(x_axis,cost);
The graph that I got looks like that of 1/x.
Please tell me where I am making a mistake. If there is anything that I misunderstood, please let me know.
What I can see missing is the use of a learning rate and proper weight initialization. The weights can be adjusted in two modes: online and batch.
The weights should be randomly assigned values in [-0.01, 0.01]. I did a similar exercise as part of my homework during my Master's; below is the snippet.
Assign values to the weights in [-0.01, 0.01]; the number of weights is the number of features + 1:
weights = -.01 + 0.02 * rand(3,1);
learnRate = 0.001;
Here the code runs for a set number of iterations (it also converged within 100 iterations):
% numericdata is assumed to be a rows-by-cols matrix whose first cols-1
% columns are the inputs (including a bias column) and whose last column
% (here column 4) is the 0/1 label.
[rows, cols] = size(numericdata);
iter = 0;  i = 1;
new_output = zeros(rows,1);
while iter < 100
    old_output = new_output;
    delta = zeros(cols-1,1);
    for t = 1:rows
        input = 0;
        for j = 1:cols-1
            input = input + weights(j) * numericdata(t,j);
        end
        new_output(t) = 1 ./ (1 + exp(-input));   % sigmoid output
        for j = 1:cols-1
            delta(j) = delta(j) + (numericdata(t,4) - new_output(t)) * numericdata(t,j);
        end
    end
    % Adjusting weights (batch mode):
    for j = 1:cols-1
        weights(j) = weights(j) + learnRate * delta(j);
    end
    err = abs(numericdata(:,4) - new_output);   % per-sample absolute error
    errorStr(i) = sum(err);                     % total absolute error for this pass
    iter = iter + 1;
    i = i + 1;
end
Also, I discussed this with my professor while studying it. He said that if the given dataset has the property of converging, you will see it converge when you run it for different numbers of iterations.
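For completeness, here is a self-contained sketch of the same batch-mode idea applied directly to the two-exam logistic regression problem. The data below are made up for illustration, and standardizing the features (my addition, not part of the course code) is what lets a larger learning rate converge; it uses implicit expansion (R2016b or later):
% Hypothetical toy data: exam scores (m-by-2) and admission labels (m-by-1).
X = [34 78; 30 44; 36 73; 60 87; 79 75; 45 57; 61 96; 75 47];
y = [0; 0; 0; 1; 1; 0; 1; 1];
mu = mean(X);
sigma = std(X);
Xs = [ones(size(X,1),1), (X - mu) ./ sigma];   % standardize, then add an intercept column
theta = zeros(3,1);
alpha = 0.1;                                   % learning rate
m = size(Xs,1);
for k = 1:5000
    h = 1 ./ (1 + exp(-Xs*theta));             % sigmoid hypothesis
    grad = (Xs' * (h - y)) / m;                % gradient of the logistic cost
    theta = theta - alpha * grad;              % batch update
end
theta                                          % coefficients on the standardized scale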

Store values from a time series function in an array using a for loop in R

I am working with Bank of America time series data for stock prices. I am trying to store the forecast value for a specific step ahead (in this case steps 1 to 20) in an array. I then need to subtract each value of that array from the corresponding value of the test array, square each difference, sum all the squared values, and divide by N (N = number of steps forecast ahead).
I have the following so far. Also, the quantmod and fpp libraries are needed for this.
# ---------Bank of America----------
library(quantmod)
library(fpp)
BAC = getSymbols('BAC',from='2009-01-02',to='2014-10-15',auto.assign=FALSE)
BAC.adj = BAC$BAC.Adjusted
BAC.daily=dailyReturn(BAC.adj,type='log')
test = tail(BAC.daily, n = 20)
train = head(BAC.daily, n = 1437)
I am trying to write a function that produces the forecast, extracts the requisite value (the point forecast for time i), and stores it in an array on which I can perform operations (add, multiply, exponentiate, sum the values).
MSE = function(N){
for(i in 1:(N)){
x = forecast(model1, h = i)
y = x$mean
w = as.matrix(as.double(as.matrix(unclass(y))))
p = array(test[i,]-w[i,])
}
}
and we also have:
model1 = Arima(train, order = c(0,2,0))
MSE = function(N){
result = vector("list", length = (N))
for(i in 1:(N)){
x = forecast(model1, h = i)
point_forecast = as.double(as.matrix(unclass(x$mean)))
result[i] = point_forecast
}
result = as.matrix(do.call(cbind, result))
}
Neither of these functions has worked so far. When I run the MSE function, I get the following errors:
> MSE(20)
There were 19 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In result[i] = point_forecast :
number of items to replace is not a multiple of replacement length
2: In result[i] = point_forecast :
number of items to replace is not a multiple of replacement length
3: In result[i] = point_forecast :
number of items to replace is not a multiple of replacement length
4: In result[i] = point_forecast :
When I run the MSE2 function, I get the following output:
MSE2(20)
[1] -0.15824
When I put a print statement inside, it printed 'p' as a single number, just like above (even though it had run up to i = 20). The x, y, and w variables in the MSE2 function act as vectors as far as storing the output goes, so I do not understand why p does not as well.
I appreciate any help in this matter, thank you.
Sincerely,
Mitchell Healy
Your question has two MSE functions: one in the first code block and one in the second code block.
Also, library(forecast) is needed to run Arima and forecast.
My understanding of what you are trying to do in the first paragraph is to compute the 20-step ahead forecast error. That is, what is the error in forecasts from model1 20 days ahead, based on your test data. This can be done in the code below:
model1 <- Arima(train, order = c(0,2,0))
y_fcst<-forecast(model1,h=20)$mean
errors<-as.vector(y_fcst)-as.vector(test)
MSE.fcst<-mean(errors^2)
However, I'm not sure what you're trying to do here: an ARIMA(0,2,0) model is simply modelling the differences in returns as a random walk. That is, this model just differences the returns twice and assumes the twice-differenced data is white noise. There are no parameters other than $\sigma^2$ being estimated.
Rob Hyndman has a blog post covering computing errors from rolling forecasts.
My solution to finding the MSE is below. I used log adjusted daily return data from Bank of America gathered through quantmod. Then I subsetted the data (which had length 1457) into training[1:1437] and testing[1438:1457].
The solution is:
forc = function(N){
  fcst = matrix(data = NA, nrow = N)   # renamed from 'forecast' so it does not mask forecast()
  for(i in 1:N){
    fit = Arima(BAC.adj[(1 + (i-1)):(1437 + (i-1))], order = c(0,0,4))
    x = forecast(fit, h = 1)
    fcst[i,] = as.numeric(x$mean)
  }
  error = test - fcst
  error_squared = error^2
  sum_error_squared = sum(error_squared)
  MSE = sum_error_squared/N
  MSE
}
