Adding an array to an existing array

I perform calculations using 5-fold cross-validation. I want to collect all the predictions in one array so that I can compute the statistics once instead of per fold. I have tried extending the array of predictions by concatenating each fold's array onto the existing one. For example:
for train_index, test_index in skf:
    fold += 1
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    rf.fit(x_train, y_train)
    predicted = rf.predict_proba(x_test)
    round_predicted = rf.predict(x_test)
    if fold > 1:
        allFolds_pred = np.concatenate((predicted, allFolds_pred), axis=1)
        allFolds_rpred = np.concatenate((round_predicted, allFolds_rpred), axis=1)
        allFolds_y = np.concatenate((y_test, allFolds_y), axis=1)
    else:
        allFolds_pred = predicted
        allFolds_rpred = round_predicted
        allFolds_y = y_test

fpr, tpr, _ = roc_curve(allFolds_y, allFolds_pred[:, 1])
roc_auc = auc(fpr, tpr)
cm = confusion_matrix(allFolds_y, allFolds_rpred, labels=[0, 1])
Then I calculate the statistics.
However, it is not working. What is the best way to proceed? Is there a better way to do the same thing?
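For what it's worth, two likely culprits in the snippet above: predict_proba returns one row per sample, so folds should be stacked along axis=0 rather than axis=1, and y_test is 1-D, so np.concatenate(..., axis=1) on it raises an error outright. Below is a minimal sketch of a tidier pattern, assuming skf, X, y, and rf are set up as in the question: collect the per-fold results in lists and concatenate once after the loop, which also removes the if fold > 1 special case.
import numpy as np
from sklearn.metrics import roc_curve, auc, confusion_matrix

pred_folds, rpred_folds, y_folds = [], [], []
for train_index, test_index in skf:
    rf.fit(X[train_index], y[train_index])
    pred_folds.append(rf.predict_proba(X[test_index]))  # (n_samples, n_classes) per fold
    rpred_folds.append(rf.predict(X[test_index]))       # (n_samples,) per fold
    y_folds.append(y[test_index])

# One concatenation per array, along the sample axis.
allFolds_pred = np.concatenate(pred_folds, axis=0)
allFolds_rpred = np.concatenate(rpred_folds)
allFolds_y = np.concatenate(y_folds)

fpr, tpr, _ = roc_curve(allFolds_y, allFolds_pred[:, 1])
roc_auc = auc(fpr, tpr)
cm = confusion_matrix(allFolds_y, allFolds_rpred, labels=[0, 1])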


Bootstrapping the uncertainty on an RMSE estimate of a location-scale generalized additive model

I have height data of plants (numeric, in cm; Height) measured over time (numeric, expressed in days of the year; Doy). These data are grouped per genotype (factor; Genotype) and individual plant (factor; Individual). I've managed to calculate the RMSE of the location-scale GAM, but I can't figure out how to bootstrap the uncertainty estimate on the RMSE calculation, given that it is a hierarchical location-scale generalized additive model.
The code to extract the RMSE value looks something like this:
# The GAM
model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
                    s(Doy, Individual, bs = "re") +
                    Genotype,
                  ~ s(Doy, bs = 'ps', by = Genotype) +
                    s(Doy, Individual, bs = "re") +
                    Genotype),
             family = gaulss(), # Gaussian location-scale
             method = "REML",
             data = data)
# Extract the model formula
form <- formula.gam(model)
# Cross-validation for the location
CV <- CVgam(form[[1]], data, nfold = 10, debug.level = 0, method = "GCV.Cp",
            printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
# The root mean square error is given by taking the square root of the MSE
sqrt(CV$cvscale[1])
There is only one height measurement per Individual per day of the year. I figure this is problematic for maintaining the exact same formulation of the GAM. In this regard, I was thinking of making sure that the same few Individuals of each genotype (say n = 4) were randomly sampled over each day of the year. I can't figure out how to proceed, though. Any ideas?
I've tried several methods, such as the boot package and for loops. An example of one of the things I've tried is:
RMSE <- numeric()
loops <- 3
for (i in 1:loops) {
  datax <- data %>%
    group_by(Doy, Genotype) %>%
    slice_sample(prop = 0.6, replace = TRUE)
  model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
                      s(Doy, Individual, bs = "re") +
                      Genotype,
                    ~ s(Doy, bs = 'ps', by = Genotype) +
                      s(Doy, Individual, bs = "re") +
                      Genotype),
               family = gaulss(),
               method = "REML",
               data = datax)
  # Extract the model formula
  form <- formula.gam(model)
  # Cross-validation for the location
  CV <- CVgam(form[[1]], datax, nfold = 10, debug.level = 0, method = "GCV.Cp",
              printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
  RMSE[i] <- sqrt(CV$cvscale[1])
}
RMSE
This loop runs very slowly and just returns the same RMSE value 3 times; surely there is an issue with the sampling.
Unfortunately, I can't share my data, but maybe somebody has an idea of how to proceed?
Many thanks!
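One direction worth sketching (my suggestion, not something from the post): because the observations are clustered within plants, a cluster bootstrap that resamples whole Individuals within each Genotype keeps each plant's full Doy trajectory together, which preserves the GAM's formulation; duplicated plants get fresh Individual labels so the "re" smooth treats them as distinct levels. The column names follow the question, and this is untested without the data.
library(dplyr)

n_boot <- 100
RMSE <- numeric(n_boot)
for (i in seq_len(n_boot)) {
  # Resample whole plants, with replacement, within each Genotype.
  sampled <- data %>%
    distinct(Genotype, Individual) %>%
    group_by(Genotype) %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    mutate(boot_id = paste(Individual, row_number(), sep = "_")) %>%
    ungroup()
  # Pull in each sampled plant's full trajectory; relabel so repeated
  # draws of the same plant become distinct random-effect levels.
  datax <- sampled %>%
    left_join(data, by = c("Genotype", "Individual")) %>%
    mutate(Individual = factor(boot_id))
  # Refit the GAM on datax exactly as above, then, with a varying seed so
  # the CV folds differ between replicates:
  # CV <- CVgam(form[[1]], datax, nfold = 10, debug.level = 0,
  #             method = "GCV.Cp", printit = FALSE, gamma = 1, seed = i)
  # RMSE[i] <- sqrt(CV$cvscale[1])
}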

How to shape 2-feature input data for LSTM

I am using an RNN with LSTM nodes in Keras for time series prediction. I have two input features and one output feature, and I'm using a sliding window of size 4 with step size 1.
So I'm trying to prepare the arrays accordingly for the LSTM to handle. However, the dimensions don't seem to be right. I've got it to the point where the 3D array has the right shape for the network to accept it, but the way the data is laid out inside the array does not seem right to me.
So, looking only at the training data, this is the raw data from the file:
train_input = df[['a','b']].values    # shape (354, 2)
train_output = df[['c']].values       # shape (354, 1)
Next I scale the data, after which the shapes stay the same. Then I use a loop to bring the data into sliding-window form (window size 4, range 354):
train_input_window = []
train_output_window = []
for i in range(4, 354):
    train_input_window.append(train_input_scaled[i-4:i, 0])
    train_input_window.append(train_input_scaled[i-4:i, 1])
    train_output_window.append(train_output_scaled[i, 0])
train_input_window = np.array(train_input_window)
train_output_window = np.array(train_output_window)
Now train_input_window has shape (700, 4)
and train_output_window has shape (350,).
So this is where the problem lies, I think, because I can reshape the data into a 3D array that the network will accept:
train_input3D = np.reshape(train_input_window, (350,4,2))
train_output3D = np.reshape(train_output_window, (350,1,1))
but I just don't think that the data is arranged correctly inside the arrays.
The training input looks something like this:
print(train_input3D)
[[[a a]
  [a a]
  [b b]
  [b b]]

 [[a a]
  [a a]
  [b b]
  [b b]] .....
shouldn't it be
[[[a b]
  [a b]
  [a b]
  [a b]]

 [[a b]
  [a b]
  [a b]
  [a b]] .....
I tried so many different things, and by now I'm so confused that I just hope I'm not also confusing everyone here by trying to explain it.
So, is the input shape that I think my array should have correct for what I'm trying to do? If so, how do I arrange it that way?
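Yes: Keras LSTMs expect (samples, timesteps, features), so each window should be a (4, 2) slice with both features side by side at every timestep. Here is a minimal sketch of the windowing, using the scaled arrays from the question; slicing both columns at once makes the later reshape unnecessary:
import numpy as np

window = 4
X_windows, y_windows = [], []
for i in range(window, len(train_input_scaled)):      # 354 rows in the question
    # One (4, 2) slice per window: rows are timesteps, columns are features.
    X_windows.append(train_input_scaled[i - window:i, :])
    y_windows.append(train_output_scaled[i, 0])

train_input3D = np.array(X_windows)        # (350, 4, 2), no reshape needed
train_output_window = np.array(y_windows)  # (350,)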
Here is my complete code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

# Read data
df = pd.ExcelFile('GPT.xlsx').parse('7Avg')
# Training data
train_input = df[['Precip_7Sum','Temp_7Avg']].values
train_output = df[['GWL_7Avg']].values
# Testing data
test_input = df[['Precip_7SumT','Temp_7AvgT']].values
test_output = df[['GWL_7AvgT']].values
# Normalize / scale data
input_scaler = MinMaxScaler(feature_range = (0, 1))
output_scaler = MinMaxScaler(feature_range = (0, 1))
train_input_scaled = input_scaler.fit_transform(train_input)
train_output_scaled = output_scaler.fit_transform(train_output)
test_input_scaled = input_scaler.transform(test_input)
test_output_scaled = output_scaler.transform(test_output)
# Convert data into sliding-window format
train_input_window = []
train_output_window = []
for i in range(4, 354):
    train_input_window.append(train_input_scaled[i-4:i, 0])
    train_input_window.append(train_input_scaled[i-4:i, 1])
    train_output_window.append(train_output_scaled[i, 0])
train_input_window = np.array(train_input_window)
train_output_window = np.array(train_output_window)
test_input_window = []
test_output_window = []
for i in range(4, 354):
    test_input_window.append(test_input_scaled[i-4:i, 0])
    test_input_window.append(test_input_scaled[i-4:i, 1])
    test_output_window.append(test_output_scaled[i, 0])
test_input_window = np.array(test_input_window)
test_output_window = np.array(test_output_window)
# Convert data into 3-D format: (batch_size, timesteps, input_dim),
# i.e. (nr. of samples, nr. of timesteps, nr. of features)
train_input3D = np.reshape(train_input_window, (350, train_input_window.shape[1], 2))
train_output3D = np.reshape(train_output_window, (350, 1, 1))
test_input3D = np.reshape(test_input_window, (350, test_input_window.shape[1], 2))
# Instantiate model class
model = Sequential()
# Add LSTM layer
model.add(LSTM(units=1, return_sequences=True, input_shape=(4, 2)))
# Add dropout layer to avoid over-fitting
model.add(Dropout(0.2))
# Add three more LSTM and dropout layers
model.add(LSTM(units=1, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=1, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(units=1, return_sequences=True))
model.add(Dropout(0.2))
# Create dense layer at the end of the model to make the model more robust
model.add(Dense(units=1))
# Compile model
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
# Training
model.fit(train_input3D, train_output_window, epochs = 100, batch_size = 4)
# Testing / predictions
train_predictions = model.predict(train_input3D)
test_predictions = model.predict(test_input3D)
# Reverse scaling of data for output data
train_predictions = input_scaler.inverse_transform(train_predictions)
test_predictions = input_scaler.inverse_transform(test_predictions)
orig_data = np.concatenate((train_output, test_output))
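Two more pitfalls worth flagging here (my observations, not from the original post): the predictions approximate the output variable, so the inverse transform should use output_scaler rather than input_scaler, and with return_sequences=True on the last LSTM, model.predict returns a (samples, timesteps, 1) tensor that the 2-D-only MinMaxScaler.inverse_transform will reject. A minimal sketch of one way around both, keeping only the last timestep of each predicted window:
last_step = test_predictions[:, -1, :]                       # (350, 1)
test_pred_orig = output_scaler.inverse_transform(last_step)  # back to original units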
Any help with this would be much appreciated. I hope I could get my problem across clearly enough, and that someone who can help actually reads all of this :D

Reshaping tensors in a 3D numpy matrix

I'm essentially trying to accomplish this and then this, but with a 3D matrix, say (128,128,60,6). The 4th dimension is a vector that represents the diffusion array at that voxel, e.g.:
d[30,30,30,:] = [dxx, dxy, dxz, dyy, dyz, dzz] = D_array
where dxx etc. are the diffusion in a particular direction. D_array can also be seen as a triangular matrix (since dxy == dyx etc.). So I can use those two other answers to get from D_array to D_square, e.g.
D_square = [[dxx, dxy, dxz], [dyx, dyy, dyz],[dzx, dzy, dzz]]
I can't seem to figure out the next step, however: how to apply that single-tensor transformation from D_array to D_square to the whole 3D volume.
Here's the code snippet that works on a single tensor:
#this solves an linear eq. that provides us with diffusion arrays at each voxel in a 3D space
D = np.einsum('ijkt,tl->ijkl',X,bi_plus)
#our issue at this point is we have a vector that represents a triangular matrix.
# first make a tri matx from the vector, testing on unit tensor first
D_tri = np.zeros((3,3))
D_array = D[30][30][30]
D_tri[np.triu_indices(3)] = D_array
# then getting the full sqr matrix
D_square = D_tri.T + D_tri
np.fill_diagonal(D_square, np.diag(D_tri))
So what would be the numpy way of applying that per-voxel transformation from D_array to D_square to the whole 3D volume at once?
Approach #1
Here's one using row, col indices from triu_indices for indexing along last two axes into an initialized output array -
def squareformnd_rowcol_integer(ar, n=3):
    out_shp = ar.shape[:-1] + (n,n)
    out = np.empty(out_shp, dtype=ar.dtype)
    row,col = np.triu_indices(n)
    # Get a "rolled-axis" view with which the last two axes come to the front
    # so that we could index into them just like for a 2D case
    out_rolledaxes_view = out.transpose(np.roll(range(out.ndim),2,0))
    # Assign permuted version of input array into rolled output version
    arT = np.moveaxis(ar,-1,0)
    out_rolledaxes_view[row,col] = arT
    out_rolledaxes_view[col,row] = arT
    return out
Approach #2
Another one with the last two axes merged into one and then indexing with linear indices -
def squareformnd_linear_integer(ar, n=3):
    out_shp = ar.shape[:-1] + (n,n)
    out = np.empty(out_shp, dtype=ar.dtype)
    row,col = np.triu_indices(n)
    idx0 = row*n+col
    idx1 = col*n+row
    ar2D = ar.reshape(-1,ar.shape[-1])
    out.reshape(-1,n**2)[:,idx0] = ar2D
    out.reshape(-1,n**2)[:,idx1] = ar2D
    return out
Approach #3
Finally, a new method altogether using masking, which should be better performance-wise, as most masking-based approaches are when it comes to indexing -
def squareformnd_masking(ar, n=3):
    out = np.empty((n,n)+ar.shape[:-1], dtype=ar.dtype)
    r = np.arange(n)
    m = r[:,None]<=r
    arT = np.moveaxis(ar,-1,0)
    out[m] = arT
    out.swapaxes(0,1)[m] = arT
    new_axes = list(range(2, out.ndim)) + [0, 1]  # materialize the range for Python 3
    return out.transpose(new_axes)
Timings on (128,128,60,6) shaped random array -
In [635]: ar = np.random.rand(128,128,60,6)
In [636]: %timeit squareformnd_linear_integer(ar, n=3)
...: %timeit squareformnd_rowcol_integer(ar, n=3)
...: %timeit squareformnd_masking(ar, n=3)
10 loops, best of 3: 103 ms per loop
10 loops, best of 3: 103 ms per loop
10 loops, best of 3: 53.6 ms per loop
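As a quick sanity check (my addition, not part of the original answer), the vectorized output at a single voxel should reproduce the per-voxel construction from the question:
import numpy as np

ar = np.random.rand(128, 128, 60, 6)
out = squareformnd_masking(ar, n=3)      # (128, 128, 60, 3, 3)

# Rebuild one voxel the single-tensor way and compare.
D_tri = np.zeros((3, 3))
D_tri[np.triu_indices(3)] = ar[30, 30, 30]
D_square = D_tri.T + D_tri
np.fill_diagonal(D_square, np.diag(D_tri))

assert np.allclose(out[30, 30, 30], D_square)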
A vectorized way to do it:
# Get the triangle matrix (np.zeros takes a shape tuple)
d_tensor = np.zeros((128, 128, 60, 3, 3))
triu_idx = np.triu_indices(3)
d_tensor[:, :, :, triu_idx[0], triu_idx[1]] = d
# Make it symmetric
diagonal = np.zeros((128, 128, 60, 3, 3))
idx = np.arange(3)
diagonal[:, :, :, idx, idx] = d_tensor[:, :, :, idx, idx]
d_tensor = np.transpose(d_tensor, (0, 1, 2, 4, 3)) + d_tensor - diagonal

Sort array elements by the frequency of its elements

Is it possible in MATLAB/Octave to use the sort function to sort an array based on the relative frequency of its elements?
For example, the array
m = [4,4,4,10,10,10,4,4,5]
should result in this array:
[5,10,10,10,4,4,4,4,4]
5 is the least frequent element and comes first, while 4 is the most frequent and comes last.
Should one use the indices provided by histcounts?
The following code first calculates how often each element occurs and then uses runLengthDecode to expand the unique elements.
m = [4,4,4,10,10,10,4,4,5];
u_m = unique(m);
elem_count = histc(m,u_m);
[elem_count, idx] = sort(elem_count);
m_sorted = runLengthDecode(elem_count, u_m(idx));
The definition of runLengthDecode is copied from this answer:
For MATLAB R2015a+:
function V = runLengthDecode(runLengths, values)
if nargin<2
    values = 1:numel(runLengths);
end
V = repelem(values, runLengths);
end
For versions before R2015a:
function V = runLengthDecode(runLengths, values)
%// Actual computation using column vectors
V = cumsum(accumarray(cumsum([1; runLengths(:)]), 1));
V = V(1:end-1);
%// In case of second argument
if nargin>1
    V = reshape(values(V),[],1);
end
%// If original was a row vector, transpose
if size(runLengths,2)>1
    V = V.';
end
end
One way would be to use accumarray to find the count of each number (I suspect you could use histcounts(m, max(m)), but then you would have to clear out all the 0s).
m = [4,4,4,10,10,10,4,4,5];
[~,~,subs] = unique(m);
freq = accumarray(subs,subs,[],@numel);
[~,i2] = sort(freq(subs));   % ascending frequency, so the rarest come first
m(i2)
By combining my approach with that of m.s., you can get a simpler solution:
m = [4,4,4,10,10,10,4,4,5];
[U,~,i1] = unique(m);
freq = histc(m,U);
[~,i2] = sort(freq(i1));   % ascending frequency
m(i2)
You could count the number of repetitions with bsxfun, sort that, and apply that sorting to m:
[~, ind] = sort(sum(bsxfun(@eq,m,m.')));
result = m(ind);
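A quick consistency check (my addition): on the example vector, the counting routes above agree and produce the requested order.
m = [4,4,4,10,10,10,4,4,5];
[U,~,i1] = unique(m);
freq = histc(m, U);                 % counts per unique value: [5 1 3]
[~, i2] = sort(freq(i1));           % stable sort keeps equal-count runs in order
disp(m(i2))                         % 5 10 10 10 4 4 4 4 4
[~, ind] = sort(sum(bsxfun(@eq, m, m.')));
isequal(m(i2), m(ind))              % ans = 1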

Calculating data from one array to another

I have two arrays: the first is data_array (50x210), the second is dest_array (210x210). The goal is to use the data from data_array to calculate the values of dest_array at specific indices, without using a for-loop.
I do it this way:
function [ out ] = grid_point( row,col,cg_row,cg_col,data,kernel )
ker_len2 = floor(length(kernel)/2);
op1_vals = data((row - ker_len2:row + ker_len2),(col - ker_len2:col + ker_len2));
out(cg_row,cg_col) = sum(sum(op1_vals.*kernel)); %incorrect
end

function [ out ] = sm(dg_X, dg_Y)
%dg_X, dg_Y - arrays of size 210x210; the values are coordinates of data in data_array,
%and the index of each element is its position on the 210x210 grid
data_array = randi(100,50,210); %data array
kernel = kernel_sinc2d(17,'hamming'); %sinc kernel for calculations
ker_len2 = floor(length(kernel)/2);
%add padding to the array, to avoid
%errors related to the boundaries of data_array
data_array = vertcat(data_array(linspace(ker_len2+1,2,ker_len2),:),...
                     data_array,...
                     data_array(linspace(size(data_array,1)-1,size(data_array,1) - ker_len2,ker_len2),:));
data_array = horzcat(data_array(:,linspace(ker_len2+1,2,ker_len2)),...
                     data_array,...
                     data_array(:,linspace(size(data_array,2)-1,size(data_array,2) - ker_len2,ker_len2)));
%cg_X, cg_Y - arrays of indices for the X and Y directions
[cg_X,cg_Y] = meshgrid(linspace(1,210,210),linspace(1,210,210));
%for each point of the grid (210x210) formed by cg_X and cg_Y,
%we should calculate the value, using the data from data_array.
%after padding, data_array will have size (50 + ker_len2*2, 210 + ker_len2*2)
dest_array = arrayfun(@(y,x,cy,cx) grid_point(y, x, cy, cx, data_array, kernel),...
                      dg_Y, dg_X, cg_Y, cg_X);
end
But it seems that arrayfun cannot solve my problem, because I am using arrays of different sizes. Does anybody have ideas about this?
I am not completely sure, but judging from the title, this may be what you want:
%Your data
data_array_small = rand(50,210)
data_array_large = zeros(210,210)
%Indicating the points of interest
idx = randperm(size(data_array_large,1));
idx = idx(1:size(data_array_small,1))
%Now actually use the information:
data_array_large(idx,:) = data_array_small
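Another angle that may be closer to the question (my assumption: the same kernel is applied at every grid point): the per-point sum(sum(op1_vals.*kernel)) over a sliding window is exactly a 2-D correlation, so filter2 computes every output value in one pass, with no arrayfun. A minimal sketch with a placeholder kernel, since the question's kernel_sinc2d helper isn't available here; non-integer sampling coordinates would still need an interp2 step afterwards:
data_array = randi(100, 50, 210);   % data, as in the question
kernel = ones(17) / 17^2;           % placeholder for kernel_sinc2d(17,'hamming')
% 'same' keeps the output the size of data_array and zero-pads the borders;
% mirror padding like the question's vertcat/horzcat can be applied beforehand.
smoothed = filter2(kernel, data_array, 'same');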
