In R how to combine dataframes - arrays

I am a complete novice when it comes to R and posting on here so apologies in advance. What I am trying to do is combine 2 dataframes
data <- seq(0,1000)
info <- data.frame(x,a,b)
where info can have up to approx. 50,000 rows. What I need to do is divide each entry in data by each x in info and then use pbeta on the result, e.g.
data[1,1] = 1
and
info$x = seq(1:10)
then I would need
sum(pbeta(1/1,a,b), pbeta(1/2,a,b), pbeta(1/3,a,b) ... pbeta(1/10,a,b))
At the moment I am using a loop to go through each element of data and perform the calculations. Is there a way to avoid using a loop? My current code is shown below:
while (value <= max) {
  x   <- value / info$x
  alp <- info$Alpha
  bet <- info$beta
  rat <- info$rate
  ans <- 1 / (1 - exp(-sum((1 - pbeta(x, alp, bet)) * rat)))
  data  <- rbind(data, data.frame(ans, value))
  value <- value + (max - 1) / 1000
}
Apologies for my lack of knowledge on this and how to post. Any help would be greatly appreciated.

Your while loop is confusing me because it doesn't seem to do what you describe before. However, maybe this is helpful:
data <- (1:5)/10
info <- data.frame(x = 1:3, a = 1 + (1:3)/10, b = 1 - (1:3)/10)
vapply(data, function(x, info) sum(pbeta(x/info$x, info$a, info$b)),
info = info,
FUN.VALUE = 0.1)
#[1] 0.1009253 0.2234949 0.3571645 0.4994744 0.6493979
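
The shape of that calculation (for every entry of data, sum a CDF over all rows of info) can also be sketched outside R. The Python below is only an illustration, not a port of the R code: the standard library has no equivalent of pbeta, so it assumes the special case b = 1, where the Beta(a, 1) CDF is x^a in closed form. The function names and sample values are made up for the example.

```python
def beta_cdf_b1(x, a):
    """CDF of Beta(a, 1), i.e. x**a, clamped to [0, 1] the way a CDF is."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return x ** a


def summed_cdf(values, xs, alphas):
    """For each value, sum CDF(value / x_i; a_i, 1) over all rows (x_i, a_i)."""
    return [
        sum(beta_cdf_b1(v / x, a) for x, a in zip(xs, alphas))
        for v in values
    ]


# Hypothetical inputs: three "info" rows and five "data" values.
xs = [1, 2, 3]
alphas = [1.0, 2.0, 3.0]
values = [0.1, 0.2, 0.3, 0.4, 0.5]

result = summed_cdf(values, xs, alphas)
```

The outer list comprehension plays the role of vapply over data, and the inner sum plays the role of sum(pbeta(...)) over the rows of info.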

Interpret a bootstrap output

I applied bootstrapping to a logistic regression model. As far as I understand, the bias in the bootstrap output should help to evaluate whether my logistic regression model is representative of the true population, right? I have presence-only data from two time points in a rather small study area (about 500 ha) and I want to test for an elevational upward shift of the species distribution since the earlier time point. To do this, I randomly selected pseudo-absences in the study area, so it might be good to evaluate the reliability of the model.
If bootstrapping is a good way to go for me, I am still stuck on interpreting the bootstrap output.
The output looks like this:
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = comdat, statistic = boot.h, R = 999)
Bootstrap Statistics :
          original           bias     std. error
t1*   4.776178e+01   2.190671e+00   2.853876e+01
t2*   1.709549e-02   1.580777e-04   9.383476e-03
t3*  -4.969446e-05  -8.607768e-07   2.099336e-05
t4*  -9.436566e+01  -3.473776e+00   3.345420e+01
t5*  -5.454165e-02  -2.403711e-03   3.085717e-02
t6*   1.488151e-05   6.536446e-07   8.284733e-06
t7*   1.024497e-01   3.803788e-03   3.604996e-02
t8*  -2.725339e-05  -1.028595e-06   9.672357e-06
Questions:
Although some predictors turned out to be significant in the original model, I wonder whether that is meaningful, as the original coefficients are very close to zero?
How can I judge whether the bias is large, and therefore whether my original model is okay?
Thanks in advance, and sorry if I have misunderstood something in general. That's possible, as I am very unsure about the bootstrapping and also about the reliability of my model (I included 150 randomly selected pseudo-absences).
library(boot)

# Statistic for boot(): refit the model on the resampled rows and return its coefficients
boot.h <- function(data, indices) {
  data <- data[indices, ]
  mod <- glm(mound ~ aspect + I(aspect^2) + year +
               elevation + I(elevation^2) + elevation:year + year:I(elevation^2),
             family = binomial, data = data)
  coefficients(mod)
}

boot.k <- boot(data = comdat, statistic = boot.h, R = 999)
plot(boot.k, index = 2)
plot(boot.k, index = 3)
plot(boot.k, index = 4)
plot(boot.k, index = 5)
plot(boot.k, index = 6)
plot(boot.k, index = 7)
plot(boot.k, index = 8)

# Normal, percentile and BCa intervals for coefficients 2 to 4
boot.conf.2 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 2)
boot.conf.3 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 3)
boot.conf.4 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 4)
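
For reference, the "bias" column printed by boot is the mean of the bootstrap replicates minus the original estimate, and "std. error" is the standard deviation of the replicates. The sketch below reproduces those two quantities in plain Python for a much simpler statistic (a sample mean rather than glm coefficients). The data, the seeds and the function name are assumptions made up for illustration.

```python
import random
import statistics


def bootstrap(sample, statistic, n_resamples=999, seed=0):
    """Nonparametric bootstrap: resample with replacement and recompute the statistic."""
    rng = random.Random(seed)
    original = statistic(sample)
    replicates = [
        statistic(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    ]
    bias = statistics.mean(replicates) - original  # the "bias" column
    std_error = statistics.stdev(replicates)       # the "std. error" column
    return original, bias, std_error


# Hypothetical data: 50 draws from a normal distribution.
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

original, bias, std_error = bootstrap(sample, statistics.mean)
print(f"original={original:.4f} bias={bias:.4f} std.error={std_error:.4f}")
print(f"|bias| / std.error = {abs(bias) / std_error:.3f}")
```

One way to gauge whether a bias is large is to compare it with the standard error in the same row, as the last line does, rather than looking at its absolute size.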

Error message: operands could not be broadcast together with shapes (65536,) (2,)

I'm trying to remove outliers from a dataframe:
df.shape
(65536, 3)
To do so, I created a function, where Tag is the label of the columns:
def outliers(dataset, Tag):
    Q1 = dataset[Tag].quantile(0.25)
    Q3 = dataset[Tag].quantile(0.75)
    IQR = Q3 -Q1
    Lsup = Q3 + 1,5*IQR
    Linf = Q1 - 1,5*IQR
    list = dataset.index[(dataset[Tag] > upper_bound) or (dataset[Tag] < lower_bound)]
    return list
Then I created an empty list to store the output indices from the multiple columns:
index_list = []
for columns in ['L4553', 'F5432']:
    index_list.extend(outliers(df, columns))
After this, the error appears:
operands could not be broadcast together with shapes (65536,) (2,)
Could you guys help me, please? I don't know what to do.
This code needs 3 fixes to work:
1. You don't set the variables upper_bound or lower_bound inside the function, so I'm assuming these are actually supposed to be Lsup and Linf.
2. I'm assuming you meant to use decimal points instead of commas when setting Lsup and Linf. This should fix the error you were getting (since using a comma makes them tuples).
3. You'll then get another error, because Python's or operator cannot compare two Series element-wise. This can be fixed by using np.logical_or() when checking for values that fall outside the upper or lower bound.
With these changes your code should look like the following:
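import numpy as np

def outliers(dataset, Tag):
    # Quartiles and interquartile range of the chosen column
    Q1 = dataset[Tag].quantile(0.25)
    Q3 = dataset[Tag].quantile(0.75)
    IQR = Q3 - Q1
    # Upper and lower fences at 1.5 * IQR
    Lsup = Q3 + 1.5*IQR
    Linf = Q1 - 1.5*IQR
    # Index labels of the rows that fall outside the fences
    outlier_index = dataset.index[np.logical_or(dataset[Tag] > Lsup, dataset[Tag] < Linf)]
    return outlier_index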
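
To see why the comma produced the broadcast error, here is a small standalone check with made-up quartile values (q3 and iqr are placeholders, not values from your data). Python parses the comma as a tuple separator, so the "bound" becomes a pair of numbers, and comparing a 65536-element column with a 2-element tuple is what triggers the shape mismatch.

```python
q3, iqr = 10.0, 4.0  # made-up quartile values for illustration

with_comma = q3 + 1,5 * iqr   # parsed as the tuple (q3 + 1, 5 * iqr)
with_point = q3 + 1.5 * iqr   # the intended upper fence

print(with_comma)  # (11.0, 20.0)
print(with_point)  # 16.0
```

As an alternative to np.logical_or, the element-wise | operator does the same thing, provided each comparison is wrapped in its own parentheses.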

R Multiplying each element of an array by a different number

I am trying to multiply each element of an array by an integer (along the first dimension). The tricky thing is that this integer will change for each element.
An example :
test <- array(dim = c(3,5,7))
test[1,,] <- 1
test[2,,] <- 10
test[3,,] <- 100
vec <- c(1,2,3)
The result I want is an array with the same dimension (3,5,7) and along the first dimension :
test[1,,] * vec[1]
test[2,,] * vec[2]
test[3,,] * vec[3]
This means
Result <- array(dim = c(3,5,7))
Result[1,,] <- 1
Result[2,,] <- 20
Result[3,,] <- 300
I think I am quite close with different functions like outer or apply but I think there is an easier way, as I have a lot of data to treat. For now, I found the outer function, and I should select something like the diagonal of the result.
Can someone help ?
slice.index might be helpful here
Result <- test * vec[slice.index(test, 1)]
How about
test*replicate(7, replicate(5, vec))
What's wrong with using apply like this?
sapply(1:length(vec), function(i) test[i,,]<<- test[i,,]*vec[i])
In this case you can just do
Result <- test*vec
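Note that this only works because the dimension being multiplied is the first one and vec has the same length as that dimension: R stores arrays in column-major order and recycles vec along the first dimension.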
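
The reason test*vec lines up with the first dimension can be checked with a short simulation of column-major storage and recycling. This is a sketch in plain Python, not R, and the helper name first_index is made up for the example.

```python
dims = (3, 5, 7)       # same shape as `test` in the question
vec = [1, 2, 3]
base = [1, 10, 100]    # test[i,,] holds base[i] everywhere


def first_index(k, dims):
    """First-dimension index (0-based) of flat column-major position k."""
    return k % dims[0]


n = dims[0] * dims[1] * dims[2]

# Flat column-major storage of `test`: the first index varies fastest.
flat_test = [base[first_index(k, dims)] for k in range(n)]

# Recycling pairs flat element k with vec[k % len(vec)].
flat_result = [flat_test[k] * vec[k % len(vec)] for k in range(n)]

# Every element of slice i ends up scaled by vec[i].
expected = [base[first_index(k, dims)] * vec[first_index(k, dims)] for k in range(n)]
assert flat_result == expected
```

Because the first index varies fastest in memory, position k in the flat storage belongs to slice k mod 3, which is exactly the element of vec that recycling pairs it with. For any other dimension the two indices would no longer coincide.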

Create an array 1*3 containing only one 1 and rest 0

I am just learning MATLAB. I am having difficulty creating a row array of 3 elements.
I wrote this code:
Source = randi ([0,1],1,3);
which gave me outputs such as
[1,1,0].....
[0,1,1]....
but I wanted exactly one 1 and two zeros in the output, not two 1s and one zero.
I know this happens because randi returns random values of 0 and 1, so the output can also be [0,0,1] or [1,0,0].
My problem is to get exactly one 1 no matter how many times I repeat it, i.e. I should only get [0,0,1] or [0,1,0] or [1,0,0].
I hope someone has a solution.
Thank you.
Ujwal
Here's a way using randperm:
n = 3; %// total number of elements
m = 1; %// number of ones
x = [ones(1,m) zeros(1,n-m)];
x = x(randperm(numel(x)));
Here are a couple of alternative solutions to your problem.
Create zero-filled matrix and set random element to one:
x = zeros(1, 3);
x(randi(3)) = 1;
Create 1x3 eye matrix and randomly circshift it:
x = circshift(eye(1,3), [0, randi(3)]);
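
For comparison, the "zero-filled vector with one random element set to 1" idea looks like this outside MATLAB. It is a minimal Python sketch using only the standard library, and the function name random_one_hot is made up for the example.

```python
import random


def random_one_hot(n=3):
    """Return a list of n zeros with a single 1 at a uniformly random position."""
    x = [0] * n
    x[random.randrange(n)] = 1
    return x


print(random_one_hot())
```

Each call returns one of [1, 0, 0], [0, 1, 0] or [0, 0, 1], so there is never more than one 1.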

Referencing an array value from in a function in R

I imagine I am missing something quite simple here, or I am barking up the wrong tree completely, however I have been trying to sort this out over a number of days and my novice R skills haven't been able to crack it.
I am looking for a method to reference an array of values from within an R function. I am creating a simulated population: I have each individual's age, sex and ethnicity, and I want to simulate the presence or absence of diabetes. I have the prevalence of diabetes by age bracket, gender and ethnicity, which I have made into a 2 (gender) x 11 (age bracket) x 6 (ethnicity) array. What I want to do is reference the correct cell within the array and use that with a runif call to run a Bernoulli trial per individual.
The code below is the current version however I have tried a number of different methods with varying results:
diabgen <- function(AB, sex, eth) {
  AB <- AB
  sex <- sex
  eth <- as.numeric(eth)
  # make matrix reference
  # make 'european' equal to 'other'
  eth <- ifelse(eth == 7, 6, eth)
  # change male from a 0 coding to a 2 for array lookup
  sex <- ifelse(sex == 1, 1, 2)
  # remove seven from AB due to diab data starting at 30-34 age bracket
  agebracket <- AB - 7
  # random number drawn
  diabbase <- runif(census$Total.Sex[AB], 0, 1)
  # census$Total.Sex gives the total number in each age bracket
  # array assignment
  arrayvalue <- Darray[agebracket, sex, eth]
  diab <- ifelse((diabbase >= (Darray[agebracket, sex, eth])), 1, 0)
  return(diab)
}
If I call the function from the command line with "arrayvalue" returned rather than "diab", and with individual values submitted rather than variables (i.e. diabtest <- diabgen(10,1,1)), it returns the correct value from the array, but if I submit the variables (i.e. diabtest <- diabgen(AB,sex,eth)) it returns an empty array.
If I can give further info that might make what I am talking about clearer, please let me know and I would be more than happy to do so. It seems so easy but it is doing my head in. I am open to any suggestions on other or better ways of doing the same thing; any hints appreciated.
This maybe doesn't solve your problem (I'll update as needed), but it is a simple simulated dataframe for your conditions (2x11x6 factors)
brackets <- round(seq(15, 85, length.out = 12))
brlabels <- character()
for (i in 1:11) {
brlabels[i] <- paste(brackets[i], "to", brackets[i + 1], sep = " ")
}
AB <- cut(round(runif(100, 18, 80)), breaks = brackets, labels = brlabels)
sex <- factor(sample(c(1,2), 100, replace = TRUE), levels = c(1,2), labels = c("Male", "Female"))
eth <- factor(sample(c(1:6), 100, replace = TRUE), levels = c(1:6), labels = c("French", "German", "Swedish", "Polish", "Greek", "Italian"))
somerandombusiness <- rnorm(100, 50, 4)
sim.df <- data.frame(somerandombusiness)
sim.df$AB <- AB
sim.df$sex <- sex
sim.df$eth <- eth
It may be more cumbersome to select a specific intersection of the three at first, but most of the tools to deal with factor variables expect a dataframe.
Edit 1
You could do something like:
runif(1,0) >= (sim.df[which(sim.df$AB=="34 to 40"&sim.df$sex=="Male"&sim.df$eth=="German"), 1])
But I'm still not sure why you would want to. For one, with my method there is no way to be sure that all possible combinations are enumerated. You could up the sample size to a few thousand without much trouble, but that would only make it very likely that every combination existed. In this case I've chosen one that does exist.
You could do this more easily with something like table(sim.df$eth, sim.df[, 1] > 60), which will give a cross-tab of all the somerandombusiness values > 60 and the various ethnicities.
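
The per-individual lookup the question describes (pick one prevalence per person from a 3-D table, then run one Bernoulli trial) can be sketched as follows. This is plain Python rather than R, with a made-up table and made-up individuals. It also assumes that diabetes should occur with probability equal to the prevalence, so the draw is compared with <, not >=. In R, indexing an array with three vectors selects every combination of those indices, whereas indexing with a matrix such as cbind(agebracket, sex, eth) returns one value per row, which may be related to the behaviour described in the question. The order of the index columns also has to match the order of the array's dimensions.

```python
import random


def simulate_diabetes(age_brackets, sexes, eths, prevalence, rng=random):
    """One Bernoulli draw per individual, with p looked up in a 3-D table.

    prevalence[a][s][e] is the probability for age bracket a, sex s and
    ethnicity e (all 0-based indices in this sketch).
    """
    outcomes = []
    for a, s, e in zip(age_brackets, sexes, eths):
        p = prevalence[a][s][e]
        outcomes.append(1 if rng.random() < p else 0)
    return outcomes


# Hypothetical 11 x 2 x 6 table with a flat 10% prevalence everywhere.
prevalence = [[[0.10 for _ in range(6)] for _ in range(2)] for _ in range(11)]

# Three made-up individuals.
age_brackets = [0, 5, 10]
sexes = [0, 1, 1]
eths = [2, 0, 5]

print(simulate_diabetes(age_brackets, sexes, eths, prevalence))
```

The zip over the three index vectors is what keeps the lookup to one cell per individual instead of a block of cells.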
