I have been trying to run a logistic regression for each column of the raw data frame (the predictors) against a binary outcome (read from outcome.csv), looping over all columns.
raw <- data.frame(matrix(nrow=500, ncol=10))
out <- read.csv(file="outcome.csv", header=T)
models <- list()
res <- list()
a <- colnames(raw)
for(i in 1:length(raw)){
models[[i]] <- summary(glm(out$blue ~ raw[,i] + out$sex, data=raw , family= binomial ) )
res[[i]] <- paste("logistic", a[[i]], ".txt", sep="")
write.table(models, res, row.names=FALSE, quote=FALSE, sep="\t")
}
It keeps giving this error:
Error in model.frame.default(formula = out$blue ~ raw[, i] + out$MALE, :
variable lengths differ (found for 'raw[, i]')
Any suggestions using a for loop or apply would be very much appreciated.
Thanks.
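One possible cause of the "variable lengths differ" error is that raw and out do not have the same number of rows, since the formula mixes out$ columns with raw[, i]. The following is a minimal sketch under that assumption, using simulated data in place of the real files. The names dat and coef_tables, the simulated values, and the output location under tempdir() are all hypothetical.

```r
set.seed(1)
n <- 200
# Simulated stand-ins for the real predictors and outcome (assumption)
raw <- data.frame(matrix(rnorm(n * 3), nrow = n))          # columns X1..X3
out <- data.frame(blue = rbinom(n, 1, 0.5), sex = rbinom(n, 1, 0.5))

# Both tables must describe the same observations, row for row
stopifnot(nrow(raw) == nrow(out))
dat <- cbind(out, raw)

# Fit one model per predictor, building each formula from column names
models <- lapply(colnames(raw), function(v) {
  glm(reformulate(c(v, "sex"), response = "blue"), data = dat, family = binomial)
})
names(models) <- colnames(raw)

# Keep only the coefficient table of each summary; it is a plain matrix
coef_tables <- lapply(models, function(m) coef(summary(m)))

# Write one tab-separated file per predictor
for (v in names(coef_tables)) {
  write.table(coef_tables[[v]],
              file = file.path(tempdir(), paste0("logistic_", v, ".txt")),
              quote = FALSE, sep = "\t")
}
```

Putting everything in one data frame and referring to columns by name means glm() never has to line up vectors of different lengths, and write.table() receives a matrix rather than a list of summary objects.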
I have a data frame called test2, and I would like to make multiple plots or subplots that share the same x axis (the month variable), with each of the other variables in the data frame as the y axis of its own plot. My code is below, and it gives me the warning messages shown after it. Can you help me see how to fix it? Thanks in advance.
plot_analysis <- list()
col<-names(test2)[!names(test2)%in%"month"]
for(i in col){
print(i)
plot_analysis[i] <- ggplot(data=test2, aes(month))+
geom_bar(aes(fill=as.factor(col), position="fill")) +
xlab("month") + ylab("") + scale_y_continuous(labels=scales::percent) + scale_x_discrete(limits = month.abb)
}
Warning messages:
1: Ignoring unknown aesthetics: position
2: In plot_analysis[i] <- ggplot(data = test2, aes(month)) + geom_bar(aes(fill = as.factor(col), :
number of items to replace is not a multiple of replacement length
I don't have your data, so here is an example using mtcars.
library(tidyverse)
#library(dplyr)
#library(purrr) # provides map(), used here to keep a functional style
colnames(mtcars) %>%
map(function(x) mtcars %>%
ggplot(aes(carb)) +
# position belongs outside aes(); .data[[x]] looks the column up by its name
geom_bar(aes(fill = factor(.data[[x]])), position = "fill"))
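A for loop needs an explicit print() to display each plot:
for (i in colnames(mtcars)) {
print(
mtcars %>%
ggplot(aes(carb)) +
# position belongs outside aes(); .data[[i]] looks the column up by its name
geom_bar(aes(fill = factor(.data[[i]])), position = "fill")
)
}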
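The second warning in the question ("number of items to replace is not a multiple of replacement length") is consistent with assigning a whole plot object through single brackets, plot_analysis[i]. Below is a small sketch of storing the plots in a named list instead, again using mtcars as a stand-in for test2. The names vars and plots are hypothetical. lapply() is used rather than a for loop so that each plot keeps its own value of v when it is drawn later.

```r
library(ggplot2)

# Every column except the shared x variable
vars <- setdiff(names(mtcars), "carb")

# One plot per variable, returned as a named list
plots <- lapply(setNames(vars, vars), function(v) {
  ggplot(mtcars, aes(x = factor(carb), fill = factor(.data[[v]]))) +
    geom_bar(position = "fill") +                      # position is an argument, not an aesthetic
    scale_y_continuous(labels = scales::percent) +
    labs(x = "carb", y = NULL, fill = v)
})

# Display a single plot, for example: print(plots[["cyl"]])
```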
I fitted a logistic regression model with 10-fold cross-validation. I can get an AUC with the pROC package, but it does not seem to be the AUC for the whole 10-fold CV, because the cvAUC library gives a different value. I suspect the AUC from pROC is for a single fold. How can I extract the joint AUC across the 10 folds using pROC?
data(iris)
data <- iris[which(iris$Species=="setosa" | iris$Species=="versicolor"),]
data$ID <- seq.int(nrow(data))
table(data$Species)
data$Species <-as.factor(data$Species)
confusion_matrices <- list()
accuracy <- c()
for (i in c(1:10)) {
set.seed(3456)
folds <- caret::createFolds(data$Species, k = 10)
test <- data[data$ID %in% folds[[i]], ]
train <- data[data$ID %in% unlist(folds[-i]), ]
model1 <- glm(as.factor(Species)~ ., family = binomial, data = train)
summary(model1)
pred <- predict(model1, newdata = test, type = "response")
predR <- as.factor( pred >= 0.5)
df <- data.frame(cbind(test$Species, predR))
df_list <- lapply(df, as.factor)
confusion_matrices[[i]] <- caret::confusionMatrix(df_list[[2]], df_list[[1]])
accuracy[[i]] <- confusion_matrices[[i]]$overall["Accuracy"]
}
library(pander)
library(dplyr)
names(accuracy) <- c("Fold 1",....,"Fold 10")
accuracy %>%
pander::pandoc.table()
mean(accuracy)
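One way to get a single AUC over all folds with pROC is to collect the out-of-fold predictions for every row and call roc() once on the pooled vector. The sketch below assumes this pooled definition is what is wanted; averaging per-fold AUCs is a different quantity and can give a different number. The fold assignment with sample() is a base R stand-in for caret::createFolds, and the single predictor Sepal.Length is chosen only to keep the example small. The names dat, fold_id, oof_pred and pooled_auc are hypothetical.

```r
library(pROC)

set.seed(3456)
dat <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])

# Assign every row to one of 10 folds once, outside the loop
fold_id <- sample(rep(1:10, length.out = nrow(dat)))

# Out-of-fold predicted probabilities, filled in fold by fold
oof_pred <- numeric(nrow(dat))
for (k in 1:10) {
  test_rows <- fold_id == k
  fit <- glm(Species ~ Sepal.Length, family = binomial, data = dat[!test_rows, ])
  oof_pred[test_rows] <- predict(fit, newdata = dat[test_rows, ], type = "response")
}

# One ROC curve over all held-out predictions
pooled_roc <- roc(response = dat$Species, predictor = oof_pred,
                  levels = c("setosa", "versicolor"), direction = "<", quiet = TRUE)
pooled_auc <- as.numeric(auc(pooled_roc))
```

Each row is predicted exactly once by a model that did not see it, so pooled_auc summarises the whole cross-validation rather than one fold.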
I have a list that contains 20 matrices. I want to calculate Pearson's correlation between all pairs of matrices, but I cannot find any code or function that does this. Could you please give me some tips?
Something like:
a=matrix(1:8100, ncol = 90)
b=matrix(8100:16199, ncol = 90)
c=matrix(sample(16200:24299),ncol = 90)
z=list(a,b,c)
I found this:
https://rdrr.io/cran/lineup/man/corbetw2mat.html and tried it:
library(lineup)
corbetw2mat(z[a], z[b], what = "all")
I got the following error:
Error in corbetw2mat(z[a], z[b], what = "all") :
(list) object cannot be coerced to type 'double'
I want a list like this for the result:
a & b
correlations
a & c
correlations
b & c
correlations
Thanks
I will create a smaller data set to illustrate the solution below.
To get pairwise combinations, the best option is to compute a matrix of combinations with combn and then loop over its columns, in this case with lapply.
set.seed(1234) # Make the results reproducible
a <- matrix(1:9, ncol = 3)
b <- matrix(rnorm(9), ncol = 3)
c <- matrix(sample(1:9), ncol = 3)
sample_list <- list(a, b, c)
cmb <- combn(3, 2)
res <- lapply(seq.int(ncol(cmb)), function(i) {
cor(sample_list[[ cmb[1, i] ]], sample_list[[ cmb[2, i] ]])
})
The results are in the list res.
Note that sample is a base R function, so I changed the name to sample_list.
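To get the "a & b", "a & c", "b & c" labels asked for in the question, the same idea can be applied to a named list, with combn run over the names. This is a small sketch on the same toy matrices; the names mats, pairs and res_named are hypothetical.

```r
set.seed(1234)
mats <- list(a = matrix(1:9, ncol = 3),
             b = matrix(rnorm(9), ncol = 3),
             c = matrix(sample(1:9), ncol = 3))

# Each column of pairs holds the names of one pair of matrices
pairs <- combn(names(mats), 2)

# Column-by-column Pearson correlations for every pair
res_named <- lapply(seq_len(ncol(pairs)), function(i) {
  cor(mats[[pairs[1, i]]], mats[[pairs[2, i]]])
})

# Label each result with the pair it belongs to
names(res_named) <- apply(pairs, 2, paste, collapse = " & ")
```

For the 20 matrices in the question, the same code would produce choose(20, 2) = 190 labelled correlation matrices.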
I am using a netCDF file that stores one variable and has the following dimensions: lon, lat, time.
I want to compare it against other data that I already have in R as a data frame, in which the first two columns are WGS84 coordinates and the remaining columns hold the values for specific times.
So I wrote the following code.
# since # ncFile$dim$time$units say: [1] "days since 1900-1-1"
daysFromDate <- function(data1, data2="1900-01-01")
{
round(as.numeric(difftime(data1,data2,units = "days")))
}
#study area:
lon <- c(40.25, 48)
lat <- c(16, 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
varName <- "spei"
require(ncdf4)
require(RCurl)
x <- getBinaryURL("http://digital.csic.es/bitstream/10261/104742/3/SPEI_01.nc")
ncFile <- nc_open(x)
LonIdx <- which( ncFile$dim$lon$vals >= lon[1] | ncFile$dim$lon$vals <= lon[2])
LatIdx <- which( ncFile$dim$lat$vals >= lat[1] & ncFile$dim$lat$vals <= lat[2])
TimeIdx <- which( ncFile$dim$time$vals >= myTime[1] & ncFile$dim$time$vals <= myTime[2])
MyVariable <- ncvar_get( ncFile, varName)[ LonIdx, LatIdx, TimeIdx]
I expected a data frame to be returned, so that I could easily manipulate the data (for example, check correlations or create a plot).
Unfortunately, a 3-dimensional array was returned instead.
How can I reshape it into a data frame with the columns X, Y, Time1, Time2, ...?
For example, the data would look as follows:
X Y 2014-01-01 2014-01-02 2014-01-02
50 17 0.5 0.4 0.3
where 0.5, 0.4 and 0.3 are example variable values
Or maybe there is a different solution?
OK, try the following code. It assumes that the ranges are densely filled. I also changed the lon test from or (|) to and (&).
require(ncdf4)
nc <- nc_open("SPEI_01.nc")
print(nc)
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
time <- ncvar_get(nc, "time")
lonIdx <- which( lon >= 40.25 & lon <= 48.00)
latIdx <- which( lat >= 16.00 & lat <= 24.25)
myTime <- c(daysFromDate("2008-01-16"), daysFromDate("2011-12-31"))
timeIdx <- which(time >= myTime[1] & time <= myTime[2])
data <- ncvar_get(nc, "spei")[lonIdx, latIdx, timeIdx]
indices <- expand.grid(lon[lonIdx], lat[latIdx], time[timeIdx])
print(length(indices))
class(indices)
summary(indices)
str(indices)
df <- data.frame(cbind(indices, as.vector(data)))
summary(df)
str(df)
UPDATE
OK, it looks like I understand what you want, but I have no direct solution. What I've got so far is this: split the data frame using either the split() function or the data.table package. After splitting by X and Y, you get a list of small data frames in which X and Y are constant within each frame. It is probably possible to transpose and recombine them, but I have no idea how. It might be a good idea to keep working with the data as columns. The lists are nested but could be flattened. Here is a link on splitting in R: http://www.uni-kiel.de/psychologie/rexrepos/posts/dfSplitMerge.html
Code, continued from the previous example:
require(data.table)
colnames(df) <- c("X","Y","Time","spei")
df$Time <- as.Date(df$Time, origin="1900-01-01")
dt <- as.data.table(df)
summary(dt)
# Taken from https://github.com/Rdatatable/data.table/issues/1389
# x data.table
# f use `by` argument instead - unlike data.frame
# drop logical default FALSE will include `by` columns in resulting data.tables - unlike data.frame
# by character column names on which split into lists
# flatten logical default FALSE will result in recursive nested list having data.table as leafs
# ... ignored
split.data.table <- function(x, f, drop = FALSE, by, flatten = FALSE, ...){
if(missing(by) && !missing(f)) by = f
stopifnot(!missing(by), is.character(by), is.logical(drop), is.logical(flatten), !".ll" %in% names(x), by %in% names(x), !"nm" %in% by)
if(!flatten){
.by = by[1L]
tmp = x[, list(.ll=list(.SD)), by = .by, .SDcols = if(drop) setdiff(names(x), .by) else names(x)]
setattr(ll <- tmp$.ll, "names", tmp[[.by]])
if(length(by) > 1L) return(lapply(ll, split.data.table, drop = drop, by = by[-1L])) else return(ll)
} else {
tmp = x[, list(.ll=list(.SD)), by=by, .SDcols = if(drop) setdiff(names(x), by) else names(x)]
setattr(ll <- tmp$.ll, 'names', tmp[, .(nm = paste(.SD, collapse = ".")), by = by, .SDcols = by]$nm)
return(ll)
}
}
# here is data.table split
q <- split.data.table(dt, by = c("X","Y"), drop=FALSE)
str(q)
# here is data frame split
qq <- split(df, list(df$X, df$Y))
str(qq)
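One possible way to reach the X, Y, Time1, Time2, ... layout asked for in the question is to reshape the long data frame to wide format with base R's reshape(). The sketch below uses a small made-up long table in place of df, with the same column names X, Y, Time and spei. The values and the names long and wide are hypothetical.

```r
# Made-up long-format data: 2 lon x 2 lat x 3 time steps
long <- expand.grid(X = c(40.25, 40.75),
                    Y = c(16.25, 16.75),
                    Time = c("2008-01-16", "2008-02-15", "2008-03-16"),
                    stringsAsFactors = FALSE)
long$spei <- seq_len(nrow(long)) / 10

# One row per (X, Y) location, one column per time step
wide <- reshape(long, idvar = c("X", "Y"), timevar = "Time", direction = "wide")

# reshape() prefixes the new columns with "spei."; strip it to leave the dates
names(wide) <- sub("^spei\\.", "", names(wide))
```

With the real data, the Time column would first need to be converted to a character label (for example with format()) if it is stored as a Date.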
numberofusers=75000
numberofitems=65000
number.of.factors=10
# N is a numberofusers*numberofitems sparse Matrix (loaded from a dataset).
#X,Y matrices are already available and have dimensions
# (numberofusers,number.of.factors) and
#(numberofitems,number.of.factors) respectively
ptempuser<-rep(0,numberofitems)
tempuser<-rep(0,numberofitems)
Y.big<-t(Y)%*%Y
for (i in 1:numberofusers) {
matrixproduct1 <- matrix(0,numberofitems,number.of.factors)
nonzerolistforthatuser <- which(N[i,]!=0)
tempuser[nonzerolistforthatuser] <- alpha*N[i,nonzerolistforthatuser]
ptempuser[nonzerolistforthatuser] <- 1
matrixproduct1[nonzerolistforthatuser,] <-tempuser[nonzerolistforthatuser]*Y[nonzerolistforthatuser,]
finalproductmatrix1 <- matrix(0,number.of.factors,number.of.factors)
finalproductmatrix1 <- t(Y)[,nonzerolistforthatuser] %*% matrixproduct1[nonzerolistforthatuser,]
tempuser <- 1+tempuser
matrixproduct2 <- t(Y)
matrixproduct2[,nonzerolistforthatuser] <- t(Y)[,nonzerolistforthatuser]*tempuser[nonzerolistforthatuser]
Agen<-Y.big + finalproductmatrix1
dim1<-dim(Y.big)
dim2<-dim(finalproductmatrix1)
if(dim1[1]!=dim2[1]){
print(i)
print(dim1[1])
print(dim2[1])
}
if(dim1[2]!=dim2[2]){
print(i)
print(dim1[2])
print(dim2[2])
}
finalproductmatrix2 <- matrixproduct2[,nonzerolistforthatuser] %*% cbind(ptempuser[nonzerolistforthatuser])
X[i,] <- (ginv(Y.big+finalproductmatrix1+diag(rep(lambda,number.of.factors))))%*%(finalproductmatrix2)
}
I get the error 'Error in Y.big + finalproductmatrix1 : non-conformable arrays'. However, when I ran Agen <- Y.big + finalproductmatrix1 on its own, there was no problem, so I assumed the dimensions were not the cause. Still, I get the non-conformable error.
Please tell me what to do. I have been stuck on this for hours. I also added the dimension checks shown above, and they print nothing, so I am confused.
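One possible explanation, offered as an assumption because the data is not available: for a user with exactly one non-zero item, single-index subsetting drops the matrix to a plain vector, so the %*% product becomes a 1 x 1 inner product instead of a number.of.factors x number.of.factors matrix. The dimension checks in the loop sit after the Agen line, so the error would be raised before they can print. The sketch below reproduces this on a small made-up Y; idx, dropped and kept are hypothetical names.

```r
set.seed(42)
Y <- matrix(rnorm(50), nrow = 5, ncol = 10)   # made-up: 5 items x 10 factors
Y.big <- t(Y) %*% Y                           # 10 x 10

idx <- 3L                                     # a user with a single non-zero item

# Without drop = FALSE both operands collapse to vectors of length 10
dropped <- t(Y)[, idx] %*% Y[idx, ]
dim(dropped)                                  # 1 x 1, so Y.big + dropped fails

# With drop = FALSE the operands stay matrices: (10 x 1) %*% (1 x 10)
kept <- t(Y)[, idx, drop = FALSE] %*% Y[idx, , drop = FALSE]
dim(kept)                                     # 10 x 10

Agen <- Y.big + kept                          # conformable
```

If this is the cause, adding drop = FALSE to every row or column subset by nonzerolistforthatuser inside the loop should remove the error.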