How to loop over the columns of data frames stored in a list in R - loops

I'm looking for a way to run a beta regression across the columns of a list of data frames. The data come from two data frames: one containing the dependent variables (biotic indices) and another containing the independent (environmental) variables. To experiment with other methods, I decided to paste each index column onto the environmental data frame.
Environmental data frame
Ev = data.frame(a = runif(20, 1, 999), b = runif(20, 1, 7000), c = runif(20, 1, 3000), d = runif(20, 1, 250))
Biotic index data frame
Index = data.frame(Pielou = runif(20,0,1), Simpson = runif(20,0,1), LCBD = runif(20, 0, 1), Q = runif(20, 0, 0.8), D = runif(20,0,0.6))
Here I pasted each index column onto the environmental data frame:
LT = list()
for (i in seq_along(Index)) {
  LT[[i]] = data.frame(Index[, i], Ev)
}
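For reference, the same list can also be built without the explicit loop (a sketch using lapply; naming the first column is my own addition, to keep track of which index each data frame carries):
LT = lapply(seq_along(Index), function(i) {
  df = data.frame(Index[, i], Ev)
  names(df)[1] = names(Index)[i]  # label the pasted index column
  df
})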
Then, I performed the regression.
Ldf1 = list(); LBM1 = list()
for (i in seq_along(LT)) {
  LBM1[[i]] = betareg(LT[[i]][[1]] ~ LT[[i]][[i + 1]])
  Ldf1[[i]] = summary(LBM1[[i]])
}
However, the loop only provided me with four results for the first data frame.
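For clarity, the result I'm after is one model per index/predictor pair, i.e. something like this sketch (betareg() from the betareg package; the nested-list layout is just for illustration):
library(betareg)
# One beta regression per (biotic index, environmental variable) pair;
# LBM1[[i]][[j]] holds the model for index i and predictor j
LBM1 = list()
for (i in seq_along(LT)) {
  LBM1[[i]] = list()
  for (j in 2:ncol(LT[[i]])) {  # columns 2 to 5 are the predictors a, b, c, d
    LBM1[[i]][[j - 1]] = betareg(LT[[i]][[1]] ~ LT[[i]][[j]])
  }
}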

Related

Bootstrapping the uncertainty on an RMSE estimate of a location-scale generalized additive model

I have height data of plants (numeric, in cm; Height) measured over time (numeric, day of year; Doy). The data are grouped per genotype (factor; Genotype) and individual plant (factor; Individual). I've managed to calculate the RMSE of the location-scale GAM, but I can't figure out how to bootstrap an uncertainty estimate on the RMSE calculation, given that it is a hierarchical location-scale generalized additive model.
The code to extract the RMSE value looks something like this:
# Packages: mgcv provides gam() and gaulss(); CVgam() below is from gamclass
library(mgcv)
library(gamclass)
# The GAM
model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
                    s(Doy, Individual, bs = "re") +
                    Genotype,
                  ~ s(Doy, bs = 'ps', by = Genotype) +
                    s(Doy, Individual, bs = "re") +
                    Genotype),
             family = gaulss(), # Gaussian location-scale
             method = "REML",
             data = data)
# Extract the model formula
form <- formula.gam(model)
# Cross-validation for the location
CV <- CVgam(form[[1]], data, nfold = 10, debug.level = 0, method = "GCV.Cp",
            printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
# The root mean square error is given by taking the square root of the MSE
sqrt(CV$cvscale[1])
There is only one height measurement per Individual per day of the year. I figure this is problematic for maintaining the exact same formulation of the GAM. In this regard, I was thinking of making sure that the same few Individuals of each genotype (say, n = 4) were randomly sampled for each day of the year. I can't figure out how to proceed, though. Any ideas?
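Concretely, the resampling I have in mind would look roughly like this (a rough sketch, assuming dplyr; resample_individuals is a name I made up, and the join deliberately replicates rows for individuals drawn more than once):
library(dplyr)
# Draw n individuals per genotype (with replacement), then keep all of
# their daily measurements; duplicate picks replicate rows via the join
resample_individuals <- function(data, n = 4) {
  picked <- data %>%
    distinct(Genotype, Individual) %>%
    group_by(Genotype) %>%
    slice_sample(n = n, replace = TRUE) %>%
    ungroup()
  inner_join(data, picked, by = c("Genotype", "Individual"))
}
The RMSE values from refitting on such resamples could then be summarised with, e.g., quantile(RMSE, c(0.025, 0.975)).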
I've tried several methods, such as the boot package and for loops. An example of one of the things I've tried is:
library(dplyr)           # for group_by() and slice_sample()
loops <- 3
RMSE <- numeric(loops)   # preallocate: one RMSE per bootstrap iteration
for (i in 1:loops) {
  datax <- data %>%
    group_by(Doy, Genotype) %>%
    slice_sample(prop = 0.6, replace = TRUE) %>%
    ungroup()
  model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
                      s(Doy, Individual, bs = "re") +
                      Genotype,
                    ~ s(Doy, bs = 'ps', by = Genotype) +
                      s(Doy, Individual, bs = "re") +
                      Genotype),
               family = gaulss(),
               method = "REML",
               data = datax)
  # Extract the model formula
  form <- formula.gam(model)
  # Cross-validation for the location; note that seed = 29 re-seeds the RNG
  # inside CVgam on every iteration, which may be why the resamples repeat
  CV <- CVgam(form[[1]], datax, nfold = 10, debug.level = 0, method = "GCV.Cp",
              printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
  RMSE[i] <- sqrt(CV$cvscale[1])
}
RMSE
This loop runs very slowly and just returns the same RMSE value three times; surely there is an issue with the sampling.
Unfortunately, I can't share my data, but maybe somebody has an idea of how to proceed?
Many thanks!

R: Adding columns from one data frame to another, non-matching number of rows

I have a .txt file with millions of rows of data - DateTime (1-min intervals) and Precipitation.
I have a .csv file with thousands of rows of data - DateTime (daily intervals), MaxTemp, MinTemp, WindSpd, WindDir.
I import the .txt file as a data frame and do a few transformations. I then move this into a new data frame.
I import the .csv file as a data frame and do a few transformations. I then want to add the columns from this data frame to the new data frame (a total of 7 columns). However, R throws an error: "Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 10382384, 32868, 1"
I know the number of rows is different; however, this is the format I need for the next step in processing. This could easily be done in Excel were it not for the crazy number of rows.
Simulated code is below, which produces the same error:
a <- as.character(c(1,2,3,4,5,6,7,8,9,10))
b <- c(paste("Date", a))
c <- c(rnorm(10, mean = 5, sd = 2.1))
Frame1 <- data.frame(b,c)
d <- as.character(c(1,2,3))
e <- c(paste("Date", d))
f <- c(rnorm(3, mean = 1, sd = 0.7))
g <- c(rnorm(3, mean = 3, sd = 2))
h <- c(rnorm(3, mean = 8, sd = 1))
Frame2 <- data.frame(e,f,g,h)
NewFrame <- cbind(Frame1)
NewFrame <- cbind(NewFrame, Frame2)
I have tried a *_join but it throws the error: "Error: `by` must be supplied when `x` and `y` have no common variables. Use `by = character()` to perform a cross-join.", which to me reads like it wants to match things up, which I don't need. I really just need to plop these two datasets side by side for the next processing step. Help?
The data frames MUST have an equal number of rows. To compensate, I just added enough rows to the smaller dataset (in my case, it will always be the .csv file) to match the number of rows in the larger dataset, and filled them with NA values. The application I use for downstream processing knows how to handle NA values, so this works well for me.
I've run the solution with a representative dataset and I am able to cbind the two data frames together.
Sample code with the simulated dataset:
#create data frame 1
a <- as.character(c(1:10))
b <- c(paste("Date", a))
c <- c(rnorm(10, mean = 5, sd = 2.1))
Frame1 <- data.frame(b,c)
#create data frame 2
d <- as.character(c(1,2,3))
e <- c(paste("Date", d))
f <- c(rnorm(3, mean = 1, sd = 0.7))
g <- c(rnorm(3, mean = 3, sd = 2))
h <- c(rnorm(3, mean = 8, sd = 1))
Frame2 <- data.frame(e,f,g,h)
#find the maximum number of rows
maxlen <- max(nrow(Frame1), nrow(Frame2))
#find the minimum number of rows
rowrow <- min(nrow(Frame1), nrow(Frame2))
#add enough rows to the smaller dataset to equal the number of rows in the
#larger dataset; assigning to row maxlen makes R fill the new rows with NA
Frame2[maxlen, ] <- NA
#creates the new data frame from the two frames
NewFrame <- cbind(Frame1, Frame2)
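The same idea can be wrapped in a small helper for reuse (a sketch; pad_rows is a hypothetical name):
#pad a data frame with NA rows up to n rows; assigning past the last
#existing row makes R fill the new rows with NA automatically
pad_rows <- function(df, n) {
  if (nrow(df) < n) df[(nrow(df) + 1):n, ] <- NA
  df
}
NewFrame <- cbind(pad_rows(Frame1, maxlen), pad_rows(Frame2, maxlen))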

How to select corresponding values in two data sets using MATLAB?

I have two datasets (x, y) in a table:
x = [4.14;5.07;3.61;4.07;3.68;4.13;3.95;3.88;5.41;6.14]
y = [69.78;173.07;19.28;32.88;15.87;53.73;41.69;35.14;228.08;267.11];
tb = table(x,y)
edges = linspace(30, 0, 61);
Based on this, I have written the following program:
New = zeros(size(x));               % preallocate the result
for k = 1:length(x)
    New(k) = find(x(k) > edges, 1, 'last');
end
I want to extract the y values corresponding to the x values that satisfy the condition above.

Parallel MATLAB - Create a distributed vector

I have a relatively small vector in MATLAB
R = randn(1,1000);
Now I would like to create a much bigger vector by selecting a specified set of elements like so
Q = R([1 5 8 5 8 1 3 4 19 1, etc]);
The number of the selected elements numel(Q) is 1,000,000+, very big. Is it possible to do this step such that the resulting vector Q is automatically a distributed array, ready for parallel processing on a multicore machine?
Thanks!
The approaches mentioned here assume that you want at least R and Q to be distributed arrays.
Approach #1
This approach builds on this very smart solution -
N = 3;
R = randn(1,N,'distributed');
[~,ind] = sort(rand(numel(R)));
Q = R(ind(:));
Note that for the above code, ind would be on the client side. If you would like to have it as a distributed array too, use this -
N = 3;
R = randn(1,N,'distributed');
ind = ones(N,'distributed');
[~,ind(:,:)] = sort(rand(numel(R)));
Q = R(ind(:))
Output -
R =
0.3080 0.8227 0.4248
Q =
0.8227 0.3080 0.4248 0.4248 0.8227 0.3080 0.3080 0.4248 0.8227
In your case, N = 1000.
Approach #2
If you don't care about how many times an element from R is repeated in Q, then you may use this -
R = randn(1,N,'distributed');
Q = R(reshape(ceil(N*rand(N)),1,[]));

Calculating data from one array to another

I have two arrays: the first is data_array (50x210), the second is dest_array (210x210). The goal is to calculate the values of dest_array at specific indices, using the data from data_array and without a for-loop.
I do it this way:
function [ out ] = grid_point( row,col,cg_row,cg_col,data,kernel )
ker_len2 = floor(length(kernel)/2);
op1_vals = data((row - ker_len2:row + ker_len2),(col - ker_len2:col + ker_len2));
out(cg_row,cg_col) = sum(sum(op1_vals.*kernel)); %incorrect
end
function [ out ] = sm(dg_X, dg_Y)
%dg_X, dg_Y - 210x210 arrays; their values are coordinates of data in data_array,
%and the index of each element is its position on the 210x210 grid
data_array = randi(100,50,210); %data array
kernel = kernel_sinc2d(17,'hamming'); %sinc kernel for calculations
ker_len2 = floor(length(kernel)/2);
%pad the array to avoid errors at the boundaries of data_array
data_array = vertcat(data_array(linspace(ker_len2+1,2,ker_len2),:),...
data_array,...
data_array(linspace(size(data_array,1)-1,size(data_array,1) - ker_len2,ker_len2),:));
data_array = horzcat(data_array(:,linspace(ker_len2+1,2,ker_len2)),...
data_array,...
data_array(:,linspace(size(data_array,2)-1,size(data_array,2) - ker_len2,ker_len2)));
%cg_X, cg_Y - arrays of indices for the X and Y directions
[cg_X,cg_Y] = meshgrid(linspace(1,210,210),linspace(1,210,210));
%for each point at grid(210x210) formed by cg_X and cg_Y,
%we should calculate the value, using the data from data_array(210,210).
%after padding, data_array will have size (50 + ker_len2*2, 210 + ker_len2*2)
dest_array = arrayfun(@(y,x,cy,cx) grid_point(y, x, cy, cx, data_array, kernel),...
dg_Y, dg_X, cg_Y, cg_X);
end
However, it seems that arrayfun cannot solve my problem, because I use arrays of different sizes. Does anybody have any ideas?
I am not completely sure, but judging from the title, this may be what you want:
%Your data
data_array_small = rand(50,210)
data_array_large = zeros(210,210)
%Indicating the points of interest
idx = randperm(size(data_array_large,1));
idx = idx(1:size(data_array_small,1))
%Now actually use the information:
data_array_large(idx,:) = data_array_small
