Interpret a bootstrap output - logistic-regression

I applied bootstrapping to a logistic regression model. If I understood correctly, the bias in the bootstrap output should help me evaluate whether my logistic regression model is representative of the true population, right? I have presence-only data from two time points in a rather small study area (about 500 ha), and I want to test for an elevational upward shift of the species distribution since the past time point. To do so, I randomly selected pseudo-absences in the study area, and it might be good to evaluate the reliability of the model.
Even if bootstrapping is a good way to go for me, I am still stuck on interpreting the bootstrap output.
The output looks like this:
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = comdat, statistic = boot.h, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 4.776178e+01 2.190671e+00 2.853876e+01
t2* 1.709549e-02 1.580777e-04 9.383476e-03
t3* -4.969446e-05 -8.607768e-07 2.099336e-05
t4* -9.436566e+01 -3.473776e+00 3.345420e+01
t5* -5.454165e-02 -2.403711e-03 3.085717e-02
t6* 1.488151e-05 6.536446e-07 8.284733e-06
t7* 1.024497e-01 3.803788e-03 3.604996e-02
t8* -2.725339e-05 -1.028595e-06 9.672357e-06
Questions:
Although some predictors turned out to be significant in the original model, I wonder whether that is meaningful, as the original coefficients are far below zero.
How can I judge whether the bias is large or not, i.e. whether my original model is okay?
Thanks in advance, and sorry if I have not got it all right in general. That is possible, as I am very unsure about the bootstrapping and also about the reliability of my model (I included 150 randomly selected pseudo-absences).
library(boot)
boot.h <- function(data, indices) {
  data <- data[indices, ]
  mod <- glm(mound ~ aspect + I(aspect^2) + year
             + elevation + I(elevation^2)
             + elevation:year + year:I(elevation^2),
             family = binomial, data = data)
  coefficients(mod)
}
boot.k <- boot(data = comdat, statistic = boot.h, R = 999)
plot(boot.k, index = 2)
plot(boot.k, index = 3)
plot(boot.k, index = 4)
plot(boot.k, index = 5)
plot(boot.k, index = 6)
plot(boot.k, index = 7)
plot(boot.k, index = 8)
boot.conf.2 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 2)
boot.conf.3 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 3)
boot.conf.4 <- boot.ci(boot.out = boot.k, conf = 0.95,
                       type = c("norm", "perc", "bca"), index = 4)

Related

MatchIt - how to make matching date specific?

I'm trying to use MatchIt to create two sets of matched investment companies (treatment vs control).
I need to match the treatment companies to the control companies using only data from the 1-3 years preceding the treatment.
For example, if a company received treatment in 2009, then I would want to match it using data from 2009, 2008 and 2007 (my after-treatment-effects dummy would hold a value from 2010 onwards in this case).
I am unsure how to add this selection into my matching code, which currently looks like this:
matchit(signatory ~ totalUSD + brownUSD + country + strategy, data = panel6, method = "full")
Should I consider using the after-treatment-effects dummy in some way?
Any tips for how I add this in would be greatly appreciated!
There is no straightforward way to do this in MatchIt. You can set a caliper, which requires the control companies to be within a certain number of years from a treated company, but there isn't a way to require that control companies have a year strictly before the treated company. You can perform exact matching on year so that the treated and control companies have exactly the same year using the exact argument.
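If exact matching on year is acceptable, that could look like the following minimal sketch (assuming a recent version of MatchIt, where exact accepts a one-sided formula):
#Sketch: require treated and control companies to share the same year
m_exact <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
                   data = panel6, method = "full", exact = ~year)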
Another, slightly more involved way is to construct a distance matrix yourself and set to Inf any distances between units that are forbidden to match with each other. The first step is to estimate a propensity score, which you can do manually or using matchit(). Then you construct a distance matrix and, for each entry in the distance matrix, decide whether to set the distance to Inf. Finally, you supply the distance matrix to the distance argument of matchit(). Here's how you would do that:
#Estimate the propensity score
ps <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
              data = panel6, method = NULL)$distance
#Create the distance matrix
dist <- optmatch::match_on(signatory ~ ps, data = panel6)
#Loop through the matrix and set disallowed matches to Inf
t <- which(panel6$signatory == 1)
u <- which(panel6$signatory != 1)
for (i in seq_along(t)) {
  for (j in seq_along(u)) {
    if (panel6$year[u[j]] > panel6$year[t[i]] ||
        panel6$year[u[j]] < panel6$year[t[i]] - 2)
      dist[i, j] <- Inf
  }
}
#Note: the loop can be vectorized for speed (see the sketch after this
#block) but shouldn't take long regardless
#Supply the distance matrix to matchit() and match
m <- matchit(signatory ~ totalUSD + brownUSD + country + strategy,
             data = panel6, method = "full", distance = dist)
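As an aside, the vectorized version of the double loop could look like this sketch (my own, using outer() to compare every treated/control pair of years at once, and assuming panel6$year is numeric):
#Sketch: vectorized replacement for the double loop
dist <- as.matrix(dist)   #coerce to a plain matrix so logical subsetting works
bad <- outer(panel6$year[t], panel6$year[u],
             function(ty, uy) uy > ty | uy < ty - 2)
dist[bad] <- Inf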
That should work. You can verify by looking at individual groups of matched companies using match.data():
md <- match.data(m, data = panel6)
md <- md[with(md, order(subclass, signatory)),]
View(md) #assuming you're using RStudio
You should see that within subclasses, the control units are 0-2 years below the treated units.

Matlab: Sum corresponding values if index is within a range

I have been going crazy trying to figure out a way to speed this up. Right now my current code takes ~200 sec looping over 77000 events. I was hoping someone might be able to help me speed this up, because I have to do about 500 of these.
Problem:
I have arrays (both 200000x1) that correspond to the energy and position of each hit over 77000 events. I have the range of each event separated into two arrays, event_start and event_end. The first thing I do is find the positions in a specific range, then I put the corresponding energies in their own array. To get what I need out of this information, I loop through each event and its corresponding start/end to sum up all the energies from each hit. My code is below:
indx_pos = find(pos>0.7 & pos<2.0);
energy = HitEnergy(indx_pos);
for i = 1:n_events
    Etotal(i) = sum(energy(find(indx_pos>=event_start(i) ...
                                & indx_pos<=event_end(i))));
end
Sample input & output:
% Sample input
% pos and HitEnergy are the same length
n_events = 3;
event_start = [1 3 7]';
event_end = [2 6 8]';
pos = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';
% Sample Output
Etotal = 0.0060
0.0800
0
Approach #1: Generic case
One approach with bsxfun and matrix-multiplication -
mask = bsxfun(@ge, indx_pos, event_start.') & bsxfun(@le, indx_pos, event_end.')
Etotal = energy.'*mask
This could be a bit memory-hungry if indx_pos has lots of elements in it.
Approach #2: Non-overlapping start/end ranges case
One can use accumarray for this special case like so -
%// Setup ID array for use in accumarray later on
loc(numel(pos))=0; %// Fast pre-allocation scheme
valids = event_end+1<=numel(pos);
loc(event_end(valids)+1) = -1*(1:sum(valids));
loc(event_start) = loc(event_start)+(1:numel(event_end));
id = cumsum(loc);
%// Set elements as zeros in HitEnergy that do not satisfy the criteria:
%// pos>0.7 & pos<2.0
HitEnergy_select = (pos>0.7 & pos<2.0).*HitEnergy(:);
%// Discard elements in HitEnergy_select & id that have IDs as zeros
HitEnergy_select = HitEnergy_select(id~=0);
id = id(id~=0);
%// Accumulate summations as done inside the loop in the original code
Etotal = accumarray(id(:),HitEnergy_select);
The problem is that for every event you are searching the entire vector indx_pos.
Constrain your search inside the loop to only the range from event_start(i) to event_end(i):
for i = 1:n_events
    I = event_start(i):event_end(i);
    posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
    Etotal(i) = sum(HitEnergy(I(posIIsWithinRange)));
end
You could also use a vectorized version based on run length decoding and vectorizing the notion of colon. (Download the functions coloncatrld and runLengthDecode.)
I = coloncatrld(event_start, event_end);
energy = HitEnergy(I);
eventNum = runLengthDecode(event_end - event_start+1);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal = accumarray(eventNum(posIIsWithinRange), energy(posIIsWithinRange), [n_events,1]);
This is similar to Divakar's Approach #2 with the addition that it should work for overlapping ranges too.

Theano - logistic regression example weight vector becomes NaN?

I am doing a tutorial (code here) and video here (13:00 minutes in).
My only change is using the MNIST training set from a different location (and creating a one-hot encoding), but it is not working. I literally copy-pasted all the code (except for the MNIST loading) from this example. Here is the code:
import theano
from theano import tensor as T
import numpy as np
from sklearn.datasets import fetch_mldata
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn
from sklearn.preprocessing import OneHotEncoder

mnist = fetch_mldata("MNIST Original")
trX, teX, trY_digit, teY_digit = train_test_split(mnist.data, mnist.target, test_size=.4)

# Get one-hot encoding (sparse_to_floatX is my helper that densifies
# the sparse matrix and casts it to theano.config.floatX)
enc = OneHotEncoder()
enc.fit([[n] for n in range(10)])
trY = sparse_to_floatX(enc.transform(trY_digit[:, np.newaxis]))
teY = sparse_to_floatX(enc.transform(teY_digit[:, np.newaxis]))

def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.1))

def model(X, w):
    return T.nnet.softmax(T.dot(X, w))

X = T.fmatrix()
Y = T.fmatrix()
w = init_weights((784, 10))
py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * 0.05]]
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)

for i in range(10):
    print w.get_value()
    cost = train(trX, trY)
    print i, predict(teX)
The weight vector updates once, and becomes all NaN on the second update. I am very new to theano, but I am looking for tips to figure this out, especially if someone has already done this tutorial.
UPDATE.
It looks like the gradient is the issue.
When I add this:
the_grad = T.sum(gradient)
f_grad = theano.function(inputs=[X, Y], outputs=the_grad, allow_input_downcast=True)
print f_grad(trX, trY)
it prints NaN. This appears to be the correct usage of T.grad, though.
UPDATE 2.
When I change the cost function to this:
cost = T.mean(T.sum(T.sqr(py_x - Y), axis=1), axis=0)
it works now, but I only get 70% accuracy, which is really bad.
UPDATE 3.
I downloaded the MNIST data used in the tutorial and it worked with 92% accuracy.
I am not sure why my first MNIST data source was dying with the cross-entropy cost and then performing really poorly with the mean-squared-error cost function.

In R how to combine dataframes

I am a complete novice when it comes to R and posting on here, so apologies in advance. What I am trying to do is combine two dataframes:
data <- seq(0,1000)
info <- data.frame(x,a,b)
where info can have up to approx. 50,000 rows. What I need to do is divide each entry in data by each x in info and then use pbeta with the answer, e.g. if
data[1,1] = 1
and
info$x = seq(1:10)
then I would need
sum(pbeta(1/1,a,b), pbeta(1/2,a,b), pbeta(1/3,a,b) ... pbeta(1/10,a,b))
At the moment I am using a loop, shown below, to go through each element of data and perform the calculations. Is there a way to avoid using a loop?
while (value <= max) {
  x <- value/info$x
  alp <- info$Alpha
  bet <- info$beta
  rat <- info$rate
  ans <- 1/(1 - exp(-1*(sum((1 - pbeta(x, alp, bet))*rat))))
  data <- rbind(data, data.frame(ans, value))
  value <- value + (max - 1)/1000
}
Apologies for my lack of knowledge on this and how to post. Any help would be greatly appreciated.
Your while loop is confusing me because it doesn't seem to do what you described above. However, maybe this is helpful:
data <- (1:5)/10
info <- data.frame(x = 1:3, a = 1 + (1:3)/10, b = 1 - (1:3)/10)
vapply(data, function(x, info) sum(pbeta(x/info$x, info$a, info$b)),
info = info,
FUN.VALUE = 0.1)
#[1] 0.1009253 0.2234949 0.3571645 0.4994744 0.6493979
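Since pbeta is vectorized, the same computation can also be done on the full grid of data/x ratios at once. Here is a sketch of that alternative (my own, reusing the toy data and info from above; outer() builds the length(data) x nrow(info) matrix of ratios):
#Sketch: evaluate pbeta over the whole grid of ratios, then sum by row
ratios <- outer(data, info$x, `/`)
vals <- pbeta(ratios,
              rep(info$a, each = length(data)),
              rep(info$b, each = length(data)))
rowSums(matrix(vals, nrow = length(data)))
#should match the vapply result above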

Referencing an array value from within a function in R

I imagine I am missing something quite simple here, or I am barking up the wrong tree completely; however, I have been trying to sort this out over a number of days and my novice R skills haven't been able to crack it.
I am looking for a method to reference an array of values from within an R function. I am creating a simulated population; I have individuals' age, sex and ethnicity, and I want to simulate the presence or absence of diabetes. I have the prevalence of diabetes by age bracket, gender and ethnicity, which I have made into a 2(gender) x 11(age bracket) x 6(ethnicity) array. What I want to do is reference the correct cell within the array and use that with a runif call to run a Bernoulli trial per individual.
The code below is the current version however I have tried a number of different methods with varying results:
diabgen <- function(AB, sex, eth) {
  eth <- as.numeric(eth)
  #make the array reference: recode 'european' (7) to 'other' (6)
  eth <- ifelse(eth == 7, 6, eth)
  #change male from a 0 coding to a 2 for the array lookup
  sex <- ifelse(sex == 1, 1, 2)
  #remove seven from AB because the diabetes data start at the 30-34 age bracket
  agebracket <- AB - 7
  #random numbers drawn, one per individual
  #(census$Total.Sex gives the total number in each age bracket)
  diabbase <- runif(census$Total.Sex[AB], 0, 1)
  #array lookup
  arrayvalue <- Darray[agebracket, sex, eth]
  diab <- ifelse(diabbase >= arrayvalue, 1, 0)
  return(diab)
}
If I call the function from the command line with arrayvalue returned rather than diab, and individual values submitted rather than variables (i.e. diabtest <- diabgen(10,1,1)), it returns the correct value from the array; but if I submit the variables (i.e. diabtest <- diabgen(AB,sex,eth)) it returns an empty array.
If I can give further info that might make what I am talking about clearer, please let me know; I would be more than happy to do so. It seems so easy, but it is doing my head in. I am open to any suggestions on other/better ways of doing the same thing; any hints appreciated.
This maybe doesn't solve your problem (I'll update as needed), but here is a simple simulated dataframe for your conditions (2 x 11 x 6 factors):
brackets <- round(seq(15, 85, length.out = 12))
brlabels <- character()
for (i in 1:11) {
  brlabels[i] <- paste(brackets[i], "to", brackets[i + 1], sep = " ")
}
AB <- cut(round(runif(100, 18, 80)), breaks = brackets, labels = brlabels)
sex <- factor(sample(c(1,2), 100, replace = TRUE), levels = c(1,2), labels = c("Male", "Female"))
eth <- factor(sample(c(1:6), 100, replace = TRUE), levels = c(1:6), labels = c("French", "German", "Swedish", "Polish", "Greek", "Italian"))
somerandombusiness <- rnorm(100, 50, 4)
sim.df <- data.frame(somerandombusiness)
sim.df$AB <- AB
sim.df$sex <- sex
sim.df$eth <- eth
It may be more cumbersome to select a specific intersection of the three at first, but most of the tools to deal with factor variables expect a dataframe.
Edit 1
You could do something like:
runif(1,0) >= (sim.df[which(sim.df$AB=="34 to 40"&sim.df$sex=="Male"&sim.df$eth=="German"), 1])
But I'm still not sure why you would want to. For one, with my method there is no way to be sure that all possible combinations are enumerated. You could up the sample size to a few thousand without much trouble, but that would only make it really, really likely that every combination exists. In this case I've chosen one that does exist.
You could do this more easily with something like table(sim.df$eth, sim.df[, 1] > 60), which will give a cross-tab of the various ethnicities against whether somerandombusiness exceeds 60.
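Coming back to the array lookup in the original question: R can also index a three-dimensional array row-wise with a matrix of indices, which gives one value per individual without any loop. A sketch, assuming Darray and the recoded AB, sex and eth vectors from the question:
#Sketch: one prevalence value per individual via matrix indexing
agebracket <- AB - 7
prev <- Darray[cbind(agebracket, sex, eth)]
#Bernoulli trial: runif() < prev is 1 with probability prev
diab <- as.integer(runif(length(prev)) < prev)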
