How to get sjPlot::tab_model to show probabilities for a logistic regression with interaction

I have a logistic regression model with an interaction term. I am trying to figure out how to present probabilities for the interaction terms with tab_model.
Example data and model:
dat <- data.frame(Species = rep(letters[1:10], each = 5),
                  threat_cat = rep(c("recreation", "climate", "pollution", "fire", "invasive_spp"), 10),
                  impact.pres = sample(0:1, size = 50, replace = TRUE),
                  threat.pres = sample(0:1, size = 50, replace = TRUE))
mod <- glm(impact.pres ~ 0 + threat_cat/threat.pres, data = dat, family = "binomial")
summary(mod)
Call:
glm(formula = impact.pres ~ 0 + threat_cat/threat.pres, family = "binomial",
data = dat)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.89302  -0.66805   0.00013   0.66805   1.79412

Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)
threat_catclimate                   5.108e-01  7.303e-01   0.699    0.484
threat_catfire                      1.609e+00  1.095e+00   1.469    0.142
threat_catinvasive_spp             -1.386e+00  1.118e+00  -1.240    0.215
threat_catpollution                 1.386e+00  1.118e+00   1.240    0.215
threat_catrecreation               -1.386e+00  1.118e+00  -1.240    0.215
threat_catclimate:threat.pres      -5.108e-01  1.592e+00  -0.321    0.748
threat_catfire:threat.pres         -2.018e+01  3.261e+03  -0.006    0.995
threat_catinvasive_spp:threat.pres  1.792e+00  1.443e+00   1.241    0.214
threat_catpollution:threat.pres     3.514e-16  1.581e+00   0.000    1.000
threat_catrecreation:threat.pres    1.995e+01  2.917e+03   0.007    0.995
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 69.315 on 50 degrees of freedom
Residual deviance: 45.511 on 40 degrees of freedom
AIC: 65.511
Number of Fisher Scoring iterations: 17
If I run tab_model(mod), it returns odds ratios for both the categorical variables and the interaction terms.
However, I am interested in probabilities, to go along with a nice figure made with:
plot_model(mod, type = "int")+
coord_flip()
I know that I can create a function to calculate probabilities for the categorical coefficients, and have tried to use that with tab_model:
prob <- function(p) {exp(coef(p)) / (1 + exp(coef(p)))}
tab_model(mod, transform = prob)
This only gave me correct probabilities for the categorical coefficients, and not for the interaction terms. The second time I tried, it threw an error (Error: $ operator is invalid for atomic vectors), even though I hadn't changed anything.
Am I missing something? Is there a way to get tab_model to print the same probabilities that were shown in my figure, at least for the interaction, where threat.pres = 1?
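From what I can tell from the sjPlot documentation, transform expects (the name of) a function that is applied directly to the estimates, so perhaps a self-contained inverse logit such as base R's plogis would avoid the coef() call that is likely behind the "$ operator is invalid for atomic vectors" error. An untested sketch:
# plogis(x) = exp(x) / (1 + exp(x)), applied to every estimate in the table
tab_model(mod, transform = "plogis")
Even so, transforming each coefficient separately would presumably still not reproduce the predicted probabilities in the figure for the interaction, since those combine the main effect and the interaction term.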
If not, how can I extract the data that plot_model used, e.g. in the format of a data frame?
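From a quick look, plot_model(type = "int") seems to build its predictions via the ggeffects package, so perhaps the plotted probabilities can be pulled out as a data frame directly. An untested sketch (the term order and the "[0,1]" value specification are my guesses at matching the figure):
library(ggeffects)
pred <- ggpredict(mod, terms = c("threat_cat", "threat.pres [0,1]"))
as.data.frame(pred)   # predicted probabilities with confidence intervals per combination
# sjPlot::get_model_data(mod, type = "int") may return much the same thing, if available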
A related bit of code I could use help with, for the figure: in plot_model, the show.values argument doesn't seem to work if type = "int", nor do any of the options to display significance...
Any help would be appreciated!

Related

Creating a composite biomarker score using logistic regression coefficients

I have fitted a standard logistic regression model including 4 cytokines, looking at whether they can predict relapse or remission of disease. I want to create a composite biomarker score from these 4 markers so that I can then enter it into further predictive analyses of outcome, e.g. ROC curves and Kaplan-Meier. I was planning to do this by extracting the β coefficients from the multivariable logistic regression with all (standardized) biomarkers and then multiplying those by the (standardized) biomarker levels to create a composite. I just wondered whether this method is OK and how I can go about it in R.
This is my logistic regression model and output. I want to use combinations of these four variables to make a composite biomarker score weighted by their respective coefficients, and then to produce ROC curves looking at whether these biomarkers can predict outcome (a sketch of this is given after the model output below).
Thanks for your help.
summary(m1)
Call:
glm(formula = Outcome ~ TRAb + TACI + BCMA + BAFF, family = binomial,
data = Timepoint.1)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4712   0.1884   0.3386   0.5537   1.6212

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.340e+00  2.091e+00   3.032  0.00243 **
TRAb        -9.549e-01  3.574e-01  -2.672  0.00755 **
TACI        -6.576e-04  2.715e-04  -2.422  0.01545 *
BCMA        -1.485e-05  1.180e-05  -1.258  0.20852
BAFF        -2.351e-03  1.206e-03  -1.950  0.05120 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 72.549 on 64 degrees of freedom
Residual deviance: 48.068 on 60 degrees of freedom
AIC: 58.068
Number of Fisher Scoring iterations: 5
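To make the plan concrete, here is roughly what I have in mind (a sketch I have not run yet, not final code): standardize the biomarkers, refit the model, and weight each standardized biomarker by its coefficient, so the composite is just the linear predictor without the intercept. Data and variable names (Timepoint.1, Outcome, TRAb, TACI, BCMA, BAFF) are as in the model above; pROC is just one of several packages that could then produce the ROC curve.
vars <- c("TRAb", "TACI", "BCMA", "BAFF")
dat_std <- Timepoint.1
dat_std[vars] <- scale(dat_std[vars])                      # standardize the biomarkers

m1_std <- glm(Outcome ~ TRAb + TACI + BCMA + BAFF, family = binomial, data = dat_std)

# composite score = sum over biomarkers of (coefficient * standardized level)
composite <- as.numeric(as.matrix(dat_std[vars]) %*% coef(m1_std)[vars])

library(pROC)                                              # install.packages("pROC") if needed
roc_obj <- roc(dat_std$Outcome, composite)
plot(roc_obj)
auc(roc_obj)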

If I have a large list of coordinates, how can I extract the y-values that correspond to a specific x-value?

I have three datasets that compile into one big dataset.
Data1 has x-values ranging from 0-47 (ordered), with many y-values (a small error) attached to each x-value. In total there are approximately 100000 y-values.
Data2 and Data3 are similar, but with x-values 48-80 and 80-95 respectively.
The end goal is to produce a standard deviation for each x-value (therefore 96 in total), based on the numerous y-values. Therefore, I think I should first extract the y-values for each x-value out of these datasets and then determine the standard deviation as per the norm.
In Mathematica, I have tried using the Select and Part functions to no avail.
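For what it's worth, the equivalent of what I am after would be a one-liner in R, which may help clarify the goal (made-up example data, just to illustrate "group the y-values by their x-value and take the standard deviation of each group"):
exampledata <- data.frame(x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          y = c(1, 2, 4, 1, 2, 2, 4, 5, 12))
tapply(exampledata$y, exampledata$x, sd)   # one standard deviation per distinct x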
Statistically it would be better to provide a prediction interval with the predicted value of y.
There is a video about that here:
Intervals (for the Mean Response and a Single Response) in Simple Linear Regression
Illustrating with some example data, stored here as a QR code.
qrimage = Import["https://i.stack.imgur.com/s7Ul7.png"];
data = Uncompress@BarcodeRecognize@qrimage;
ListPlot[data, Frame -> True, Axes -> None]
Setting ~68% and 95% confidence levels (1 and 2 σ)
cl = Map[Function[σ, 2 (CDF[NormalDistribution[0, 1], σ] - 0.5)], {1, 2}];
(* trying a quadratic linear fit *)
lm = LinearModelFit[data, {1, a, a^2}, a];
bands = lm["SinglePredictionBands", ConfidenceLevel -> #] & /@ cl;
(* x value for an observation outside of the sample observations *)
x0 = 50;
(* Predicted value of y *)
y0 = lm[x0]
39.8094
(* Least-squares regression of Y on X *)
Normal[lm]
26.4425 - 0.00702613 a + 0.0054873 a^2
(* Confidence interval for y0 given x0 *)
b1 = bands /. a -> x0;
(* R^2 goodness of fit *)
lm["RSquared"]
0.886419
b2 = {bands, {Normal[lm]}};
(* Prediction intervals plotted over the data range *)
Show[
Plot[b2, {a, 0, 100}, PlotRange -> {{0, 100}, Automatic}, Filling -> {1 -> {2}}],
ListPlot[data],
ListPlot[{{x0, lm[x0]}}, PlotStyle -> Red],
Graphics[{Red, Line[{{x0, Min[b1]}, {x0, Max[b1]}}]}],
Frame -> True, Axes -> None]
Row[{"For x0 = ", x0, ", y0 = ", y0,
" with 95% prediction interval ", y0, " ± ", y0 - Min[b1]}]
For x0 = 50, y0 = 39.8094 with 95% prediction interval 39.8094 ± 12.1118
Addressing your requirement:
The end goal is to produce a standard deviation for each x value (therefore 96 in total), based on the numerous y-values.
The best measure for this may be the standard errors, which can be found via
lm["SinglePredictionConfidenceIntervalTable"] and lm["SinglePredictionErrors"]
They will provide "standard errors for the predicted response of single observations". If you have multiple y values for a single x there will still just be one standard error for each x value.
Ref: https://reference.wolfram.com/language/ref/LinearModelFit.html (Details & Options)
See if you can adapt this
exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
(*first a manual calculation to see what the answer should be*)
{StandardDeviation[{1,2,4}],StandardDeviation[{1,2,2}],StandardDeviation[{4,5,12}]}
(*and now automate the calculation*)
(*if your x values are not exact this will need to be changed*)
x=Union[Map[First,exampledata]];
y[x_]:=Map[Last,Cases[exampledata,{x,_}]];
std=Map[StandardDeviation[y[#]]&,x]
(*{Sqrt[7/3], 1/Sqrt[3], Sqrt[19]}*)
(*{Sqrt[7/3], 1/Sqrt[3], Sqrt[19]}*)
Since you have 100000 pairs this might speed it up.
You have said that your data is sorted on x so I won't sort it here.
If your data isn't sorted this will produce incorrect results.
exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
y[x_]:=Map[Last,x];
std=Map[StandardDeviation[y[#]]&, SplitBy[exampledata,First]]
That should give exactly the same results, with fewer passes through the data. You might compare the timing of the two methods and verify that they do produce exactly the same results.
Reading this over, I am not absolutely certain that I have correctly understood your verbal description of the form of your data structure. I thought you had a long list of {x,y} points with lots of repeated x values. If it looks like I misunderstood, and you could include a tiny example bit of Mathematica code holding some of your sample data, then I would edit my code to match.

Best way to pick random elements from an array with at least a min diff in R

I would like to randomly choose a certain number of elements from an array in such a way that they always respect a minimum pairwise distance.
For example, given a vector a <- seq(1,1000), how can I pick 20 elements with a minimum distance of 15 between each other?
For now, I am using a simple iteration in which I reject a choice whenever it is too close to any already-chosen element, but it is cumbersome and tends to be slow when the number of elements to pick is high. Is there a best practice or a function for this?
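For reference, the reject-and-retry loop I describe looks roughly like this (a simplified sketch, not my actual code):
a <- seq(1, 1000)
min_dist <- 15
n_picks <- 20
picked <- numeric(0)
while (length(picked) < n_picks) {
  candidate <- sample(a, 1)
  if (all(abs(candidate - picked) >= min_dist))   # vacuously TRUE for the first pick
    picked <- c(picked, candidate)
}
picked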
EDIT - Summary of answers and analysis
So far I have had two working answers, which I wrapped in two specific functions.
# dash2 approach
# ---------------
rand_pick_min <- function(ar, min.dist, n.picks){
  stopifnot(is.numeric(min.dist),
            is.numeric(n.picks), n.picks %% 1 == 0)
  if(length(ar)/n.picks < min.dist)
    stop('The number of picks exceeds the maximum number of divisions that the array allows which is: ',
         floor(length(ar)/min.dist))
  picked <- array(NA, n.picks)
  copy <- ar
  for (i in 1:n.picks) {
    stopifnot(length(copy) > 0)
    picked[i] <- sample(copy, 1)
    copy <- copy[ abs(copy - picked[i]) >= min.dist ]
  }
  return(picked)
}
# denis approach
# ---------------
rand_pick_min2 <- function(ar, min.dist, n.picks){
  require(Surrogate)
  stopifnot(is.numeric(min.dist),
            is.numeric(n.picks), n.picks %% 1 == 0)
  if(length(ar)/n.picks < min.dist)
    stop('The number of picks exceeds the maximum number of divisions that the array allows which is: ',
         floor(length(ar)/min.dist))
  lar <- length(ar)
  dist <- Surrogate::RandVec(a=min.dist, b=(lar-(n.picks)*min.dist),
                             s=lar, n=(n.picks+1), m=1, Seed=sample(1:lar, size = 1))$RandVecOutput
  return(cumsum(round(dist))[1:n.picks])
}
Using the same example as proposed, I ran 3 tests. First, the effective validity of the minimum-distance limit:
# Libs
require(ggplot2)
require(microbenchmark)
# Inputs
a <- seq(1, 1000) # test vector
md <- 15 # min distance
np <- 20 # number of picks
# Run
dist_vec <- c(sapply(1:500, function(x) c(dist(rand_pick_min(a, md, np))))) # sol 1
dist_vec2 <- c(sapply(1:500, function(x) c(dist(rand_pick_min2(a, md, np))))) # sol 2
# Tests - break the min
cat('Any distance breaking the min in sol 1?', any(dist_vec < md), '\n') # FALSE
cat('Any distance breaking the min in sol 2?', any(dist_vec2 < md), '\n') # FALSE
Secondly, I tested the distribution of the resulting distances, obtaining the first two plots, in order of solution (sol1 [A] is dash2's solution, while sol2 [B] is denis' one).
pa <- ggplot() + theme_classic() +
  geom_density(aes(x = dist_vec), fill = 'lightgreen') +
  geom_vline(aes(xintercept = mean(dist_vec)), col = 'darkred') + xlab('Distances')
pb <- ggplot() + theme_classic() +
  geom_density(aes(x = dist_vec2), fill = 'lightgreen') +
  geom_vline(aes(xintercept = mean(dist_vec2)), col = 'darkred') + xlab('Distances')
print(pa)
print(pb)
Lastly, I computed the computation times needed for the two approaches as follows, obtaining the last figure.
comp_times <- microbenchmark::microbenchmark(
'solution_1' = rand_pick_min(a, md, np),
'solution_2' = rand_pick_min2(a, md, np),
times = 500
)
ggplot2::autoplot(comp_times); ggsave('stckoverflow2.png')
In light of the results, I am asking myself whether the distance distribution as it is should be expected, or whether it is a deviation due to the applied methods.
EDIT2 - Answer to the last question, following the comment made by denis
Using many more sampling runs (5000), I produced a pdf of the resulting positions, and indeed your approach contains some artefact that makes your solution (B) deviate from the one I needed. Nonetheless, it would be interesting to have the ability to enforce a specific final distribution of positions.
If you want to avoid hit-and-miss methods, you will have to translate your problem into sampling distances, with a constraint on the sum of the distances.
Basically, here is how I translate what you want: your N sampled positions are equivalent to N+1 distances, each ranging from the minimum distance up to the size of your vector minus N*mindist (the case where all your samples are packed together). You then need to constrain the sum of the distances to equal 1000 (the size of your vector).
In this case the solution uses Surrogate::RandVec from the Surrogate package (see Random sampling to give an exact sum), which allows sampling with a fixed sum.
library(Surrogate)
a <- seq(1,1000)
mind <- 15
N <- 20
dist <- Surrogate::RandVec(a=mind, b=(1000-(N)*mind), s=1000, n=(N+1), m=1, Seed=sample(1:1000, size = 1))$RandVecOutput
pos <- cumsum(round(dist))[1:20]
pos
> pos
[1] 22 59 76 128 204 239 289 340 389 440 489 546 567 607 724 773 808 843 883 927
dist holds the sampled distances. You reconstruct your positions by taking the cumulative sum of the distances; that gives you pos, the vector of your index positions.
The advantage is that any value can be obtained, and the sampling is supposed to be random. As for speed, I don't know; you'll need to compare it to your method for your big-data case.
Here is a histogram of 1000 tries:
I think the best solution, which guarantees randomness in some sense (I'm not exactly sure what sense!) may be:
1. Pick a random element
2. Remove all elements that are too close to that element
3. Pick another element
4. Return to 2.
So:
min_dist <- 15
a <- seq(1, 1000)
picked <- integer(20)
copy <- a
for (i in 1:20) {
  stopifnot(length(copy) > 0)
  picked[i] <- sample(copy, 1)
  copy <- copy[ abs(copy - picked[i]) >= min_dist ]
}
Whether this is faster than sample-and-reject may depend on the characteristics of the original vector. Also, as you can see, you are not guaranteed to be able to get all the elements you want, though in your particular case there won't be a problem because 19 intervals of width 30 could never cover the whole of seq(1, 1000).
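A quick numeric check of that last claim, under the question's parameters: each pick can exclude at most 2 * min_dist - 1 candidates (the pick itself plus 14 on either side), so as long as the first 19 picks cannot exhaust the vector, a 20th pick always exists.
min_dist <- 15; n_picks <- 20; n <- length(seq(1, 1000))
(n_picks - 1) * (2 * min_dist - 1)       # at most 551 candidates excluded before the last pick
(n_picks - 1) * (2 * min_dist - 1) < n   # TRUE, so 20 picks are always feasible here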

How to efficiently evaluate or approximate a road Clothoid?

I'm facing the problem of computing values of a clothoid in C in real-time.
First I tried using the MATLAB Coder to obtain auto-generated C code for the quadgk integrator applied to the Fresnel formulas. This essentially works great in my test scenarios. The only issue is that it runs incredibly slowly (in MATLAB as well as in the auto-generated code).
Another option was interpolating a data table of the unit clothoid, connecting the sample points via straight lines (linear interpolation). I gave up after I found out that for only small changes in curvature (tiny steps along the clothoid) the results were obviously degrading to lines. What a surprise...
I know that circles may be plotted using a different formula, but small changes in curvature are often encountered in real-world scenarios, and 30k sampling points between headings of 0° and 360° didn't provide enough angular resolution for my problems.
Then I tried a Taylor approximation around the R = inf point, hoping that there would be significant curvature everywhere I wanted it to be. I soon realized I couldn't use more than 4 terms (power of 15), as the polynomial otherwise quickly becomes unstable (probably due to numerical inaccuracies in double-precision floating-point computation). Thus accuracy quickly degrades for large t values, and by "large t values" I'm talking about every point on the clothoid that represents a curve of more than 90° w.r.t. the zero-curvature point.
For instance, when evaluating a road that goes from R = 150 m to R = 125 m while making a 90° turn, I'm way outside the region of valid approximation. Instead I'm in the range of 204.5°-294.5°, whereas my Taylor limit would be at around 90° of the unit clothoid.
I'm kind of done randomly trying things out now. I could spend time on the dozens of papers one finds on the topic, or try to improve or combine some of the methods described above. Maybe there even exists an integrate function in MATLAB that is compatible with the Coder and fast enough.
This problem is so fundamental that it feels like I shouldn't have this much trouble solving it. Any suggestions?
About the 4 terms in the Taylor series: you should be able to use many more. A total theta of 2π is certainly doable with doubles.
You're probably calculating each term in isolation, according to the full formula, computing full factorial and power values. That is the reason for losing precision extremely fast.
Instead, calculate the terms progressively, each one from the previous one: find the formula for the ratio of the next term over the previous one in the series, and use it.
For increased precision, do not calculate in theta but rather in the distance s (so as not to lose precision in the scaling).
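To make the term-ratio idea concrete, here is a small sketch (in R for brevity; the loop structure carries over directly to C) of the Fresnel-style cosine integral, the integral of cos(u^2) from 0 to t, built term by term from the series sum over k >= 0 of (-1)^k * t^(4k+1) / ((2k)! * (4k+1)). The tolerance and term cap are arbitrary choices of mine.
fresnel_c <- function(t, tol = 1e-12, max_terms = 200) {
  term  <- t            # k = 0 term of the series
  total <- term
  for (k in 0:(max_terms - 1)) {
    # ratio of term (k+1) to term k, so no explicit factorials or large powers
    term  <- term * (-t^4) * (4 * k + 1) / ((2 * k + 1) * (2 * k + 2) * (4 * k + 5))
    total <- total + term
    if (abs(term) < tol * abs(total)) break
  }
  total
}
fresnel_c(sqrt(2 * pi))                                  # upper limit corresponding to theta = 2*pi
integrate(function(u) cos(u^2), 0, sqrt(2 * pi))$value   # cross-check with numerical integration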
Your example is an extremely flat clothoid. If I've made no mistake, it goes from (25/22) π ≈ 204.545° to (36/22) π ≈ 294.545° (why not include these details in your question?). Nevertheless it should be OK. Even 2π = 360°, the full circle (and twice that), should pose no problem.
given: r = 150 -> 125, 90-degree turn:

r s = A^2:   150 s = 125 (s + x)
  =>  1 + x/s = 150/125 = 1 + 25/125   =>   x/s = 1/5

theta  = s^2 / (2 A^2) = s^2 / (300 s) = s / 300        = (pi/2) * (25/11) = 204.545°
theta2 = (s + x)^2 / (300 s) = (6/5)^2 s / 300          = (pi/2) * (36/11) = 294.545°

theta2 - theta = (36/25 - 1) s / 300 = pi/2
  =>  s = 300 * (pi/2) * (25/11) = 1070.99749554,   x = s/5 = 214.1994991

A^2 = 150 s = 150 * 300 * (pi/2) * (25/11)
a = sqrt(2 A^2) = 300 sqrt((pi/2) * (25/11)) = 566.83264608
The reference point is at r = Infinity, where theta = 0.
We have x = a * INT[u = 0 .. s/a] cos(u^2) du, where a = sqrt(2 r s) and theta = (s/a)^2. Write out the Taylor series for cos and integrate it term by term to get your Taylor approximation for x as a function of the distance s along the curve from the 0-point. That's all.
Next you have to decide with what density to calculate your points along the clothoid. You can find it from a desired tolerance value above the chord, for your minimal radius of 125. These points will then define the approximation of the curve by line segments drawn between the consecutive points.
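As a tiny numeric illustration of that last step (the 1 cm chord tolerance is my own assumed value, not from the question): for a segment of length ds on a circle of radius r, the gap between arc and chord (the sagitta) is roughly ds^2 / (8 r), so the spacing that keeps the gap below a tolerance e is about sqrt(8 * r * e).
r  <- 125               # tightest radius in the example, in metres
e  <- 0.01              # assumed chord tolerance, 1 cm
ds <- sqrt(8 * r * e)   # ~3.2 m between sampled points on the tightest part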
I am doing my thesis in the same area right now.
My approach is the following.
At each point on your clothoid, calculate the change in heading divided by the distance traveled along the clothoid; with this simple formula you can calculate the curvature at each point.
Plot each curvature value, with the distance along the clothoid on the x-axis and the curvature on the y-axis. By plotting this and applying a simple piecewise-linear fitting algorithm (search for a Douglas-Peucker implementation in your language of choice),
you can easily identify the sections whose curvature is zero (a line has no curvature), linearly increasing or decreasing (an Euler spiral, CCW/CW), or constant and non-zero (an arc has constant curvature at all of its points).
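A rough sketch of that curvature profile in R (the x/y path columns are my assumption; any sampled centreline works), before handing the profile to the line-simplification step:
curvature_profile <- function(x, y) {
  heading <- atan2(diff(y), diff(x))           # heading of each segment
  ds      <- sqrt(diff(x)^2 + diff(y)^2)       # length of each segment
  dtheta  <- diff(heading)
  dtheta  <- atan2(sin(dtheta), cos(dtheta))   # wrap heading changes to (-pi, pi]
  data.frame(s = cumsum(ds)[-1],               # arc length at each estimate
             curvature = dtheta / ds[-1])      # change in heading per unit distance
}
# constant ~0 curvature: straight line; constant non-zero: arc; linear ramp: Euler spiral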
I hope this will help you a little bit.
You can find my code on GitHub, where I implemented some algorithms for such problems, like the Douglas-Peucker algorithm.

How to determine the most repeated values in an interval of a vector array in MATLAB

This is my question:
I want to know which value is repeated, and how many times, within an interval of a vector array. I know that many people will tell me to use "hist", but I did that and the result isn't exact enough. Let me show you my problem in a picture:
In that picture you can see the "Data" in blue, and I have used 3 kinds of values: 1st the "Mode", 2nd the "Mean", and finally the "Most repeated value in the histogram", which means that I used something like [a,b] = hist(Data) and then MajorValue = b(a==max(a)); it is very important NOT to use a predefined range. But this picture doesn't represent the most repeated values, so let me show you another pic, which is a closer view of the data:
That blue "Data", which varies between approximately (0-0.5)E-5, is the interval that I need to obtain, but as you can see the other three values are not close enough, and the "mode" value is just "0". I hope that you can help me solve this problem. Thanks by the way!
OK, to be more clear, I have added this new pic:
What exactly I'm looking for is to get an interval, like in this example where I manually marked 0.1-0.4 E-4 (in purple), so the function would say:
[A,B]=magicfunction(Data);
A=[0.1E-4 0.4E-4]; B=[123];
where B = 123 means the amount of data contained in that interval. As you can see, I just pass in the vector "Data", nothing else.
In the next link you can get the "Data":
https://drive.google.com/file/d/0B4WGV21GqSL5Vk0tRUdLNk5XVnc/edit?usp=sharing
Isn't taking the max of a hist in a range what you want? You almost got it; you just didn't define the bins well. For example:
range=4750:5050;
[counts val]=hist(data(range),unique(data(range)));
most_repeated_value_in_range=val(counts==max(counts));
Edit:
Following the clarification, what you want is a statistical bound on the histogram width around its maximum (most frequent value); here's a solution:
[c, v]=hist(data,linspace(min(data),max(data),num_of_bins));
range=find(c>1/exp(1)*max(c)); % can be also c>0.5*max(c) etc...
A=[v(range(1)) v(range(end))];
B=sum(c(range));
Let's test with some fake data:
t=linspace(-50,50,1e3);
data=0.3*exp(-(t-30).^2)+0.2*exp(-(t-10).^2)+0.3*exp(-(t+10).^2)+0.01*randn(1,numel(t));
[c, v]=hist(data,linspace(min(data),max(data),numel(t)));
range=find(c>1/exp(1)*max(c));
A=[v(range(1)) v(range(end))];
B=sum(c(range));
plot(t,data,'b'); hold on
plot([min(t) max(t)],[A(1) A(1)] ,'--r');
plot([min(t) max(t)],[A(2) A(2)] ,'--r');
B
B =
518
Of course you can change the definition of the "width" of the histogram; I took the 1/e levels on either side of the peak, but you can take the full width at half maximum (c>0.5*max(c)), or narrower, according to the type of data used, etc...
The function below is designed based on several assumptions:
The "interval" of interest is close to 0.
The majority of the samples are small.
The basic idea is to first filter out the samples that are too big, and then define the interval based on the sorted array of the remaining samples.
function [A, B] = magicfunction(data)
% Assuming the outlier samples only exist on the positive side, some
% samples with big, positive values can be excluded in order to obtain a
% better estimation of "the interval". Here we exclude the
% samples that are greater than mean(data)+K1*std(data), where K1 is
% empirically selected as 1.0
K1 = 1.0;
filtered_data = data( data < mean(data)+K1*std(data) );
sorted_data = sort(filtered_data);

% Define the interval in terms of percentiles of the
% sorted_data. Here the interval is empirically selected as [0, 0.75]
interval = [0 0.75];

% Map the percentile interval to actual indices in sorted_data.
% Note that interval_index(1) cannot be smaller than 1, and
% interval_index(2) cannot be greater than length(sorted_data)
interval_index = round( length(sorted_data)*interval );
interval_index(1) = max(1, interval_index(1));
interval_index(2) = min(length(sorted_data), interval_index(2));

% Assign output A in terms of the values in sorted_data
A = sorted_data(interval_index)

% Assign output B
B = sum( data>A(1) & data<A(2) )

% Visualization
x = [1:length(data)];
figure;
subplot(211);
plot(x, data, ...
     x, repmat(A(:)', length(data),1) ); grid on;
legend('data', 'lower bound', 'upper bound');
xlim([1 20000]);
subplot(212);
plot(x, data, ...
     x, repmat(A(:)', length(data),1) ); grid on;
legend('data', 'lower bound', 'upper bound');
ylim([0, 3*10^-5]);
xlim([1 20000]);
Feeding the data provided in your question into the function yields the following plot:
You may want to empirically tune the two variables in the function, K1 and interval, to obtain the desired result.
