Creating a composite biomarker score using logistic regression coefficients

I have fitted a standard logistic regression model with 4 cytokines to see whether they can predict relapse or remission of disease. I want to create a composite biomarker score from these 4 markers that I can then carry into further predictive analyses of outcome, e.g. ROC curves and Kaplan-Meier. My plan was to extract the β coefficients from the multivariable logistic regression of all (standardized) biomarkers and multiply them by the (standardized) biomarker levels to form the composite. Is this method OK, and how can I go about it in R?
This is my logistic regression model and output. I want to use a combination of these four variables, weighted by their respective coefficients, to make a composite biomarker score, and then produce ROC curves to see whether the composite predicts outcome.
Thanks for your help.
summary(m1)
Call:
glm(formula = Outcome ~ TRAb + TACI + BCMA + BAFF, family = binomial,
data = Timepoint.1)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -2.4712   0.1884   0.3386   0.5537   1.6212
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.340e+00  2.091e+00   3.032  0.00243 **
TRAb        -9.549e-01  3.574e-01  -2.672  0.00755 **
TACI        -6.576e-04  2.715e-04  -2.422  0.01545 *
BCMA        -1.485e-05  1.180e-05  -1.258  0.20852
BAFF        -2.351e-03  1.206e-03  -1.950  0.05120 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 72.549 on 64 degrees of freedom
Residual deviance: 48.068 on 60 degrees of freedom
AIC: 58.068
Number of Fisher Scoring iterations: 5
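Since the question asks how to do this in R, here is a minimal sketch of the coefficient-weighted composite. The data below are simulated stand-ins (the real Timepoint.1 is not available here; only the variable names are taken from the question). After standardizing, the composite is simply the model's linear predictor without the intercept:

```r
set.seed(1)  # simulated stand-in for Timepoint.1 (hypothetical values)
n <- 65
Timepoint.1 <- data.frame(TRAb = rlnorm(n), TACI = rlnorm(n, 7),
                          BCMA = rlnorm(n, 10), BAFF = rlnorm(n, 6))
lp <- with(Timepoint.1, 1 - 0.9 * scale(TRAb) - 0.5 * scale(TACI))
Timepoint.1$Outcome <- rbinom(n, 1, plogis(lp))

# Standardize the biomarkers, then fit the model on the z-scores
vars <- c("TRAb", "TACI", "BCMA", "BAFF")
Timepoint.1[vars] <- scale(Timepoint.1[vars])
m1 <- glm(Outcome ~ TRAb + TACI + BCMA + BAFF,
          family = binomial, data = Timepoint.1)

# Composite score: sum of coefficient * standardized biomarker,
# i.e. the linear predictor minus the intercept
score <- as.matrix(Timepoint.1[vars]) %*% coef(m1)[vars]

# Sanity check: adding back the intercept recovers predict(type = "link")
all.equal(unname(as.numeric(score) + coef(m1)[[1]]),
          unname(predict(m1, type = "link")))   # TRUE
```

One caveat: because the score is a monotone function of the model's fitted probabilities, a ROC curve computed from it on the same data is the ROC of the model itself and will be optimistic; ideally the composite would be validated on independent data.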


Using brms for a logistic regression when the outcome is imperfectly defined

I'm trying to use the brms package to fit a model where my dependent variable Y is an estimate of a latent variable (Disease = 0, absent, or Disease = 1, present, with probability p).
I have a data frame bd containing a dichotomous variable Y (the result of a test, positive 1 or negative 0, for assessing disease status) and 3 covariates (X1 numeric, X2 and X3 factors).
Y ~ q # where q is a Bernoulli event; q depends on the false-positive and false-negative fractions of the test
q ~ p*Se + (1-p)*(1-Sp) # true positives plus false positives, depending on the true probability of disease p
The model I finally want to obtain is of the form:
logit(p) ~ X1 + X2 + X3 # I want to determine the impact of my Xi's on the latent variable p
I used the brms package with a non-linear formula but am struggling with a specific problem.
bform <- bf(
  Y ~ q,                                 # defining my Bernoulli event
  nlf(q ~ Se * p + (1 - p) * (1 - Sp)),
  nlf(p ~ inv_logit(X1 + X2 + X3)),
  Se + Sp ~ 1,
  nl = TRUE,
  family = bernoulli("identity"))
I put beta priors on the test sensitivity and specificity, keeping the default priors for the logistic regression coefficients:
bprior <- set_prior("beta(4.6, 0.86)", nlpar = "Se", lb = 0, ub = 1) +
  set_prior("beta(77.55, 4.4)", nlpar = "Sp", lb = 0, ub = 1)
My final model call, using the bform and bprior objects created above, is:
brm(bform, data = bd, prior = bprior, init = "0")
When running the model I only get posteriors for the Se and Sp parameters; I am not able to see any coefficients associated with my covariates X1, X2, X3.
I guess my model has a mistake, but I'm not able to see what it is.
Any help would be greatly appreciated!
I expected output from the line p ~ inv_logit(X1 + X2 + X3) so that I could determine the coefficients of this logistic regression (which accounts for the imperfectly measured dependent variable).
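A hedged sketch of one way to make the covariate coefficients appear (this is an assumption about the intended model, not a confirmed fix): give the linear predictor its own non-linear parameter, here named eta (a name introduced only for illustration), defined by an ordinary linear sub-formula, and keep nlf() for the deterministic transformations. Not run here, since fitting requires Stan:

```r
library(brms)

bform <- bf(
  Y ~ Se * p + (1 - p) * (1 - Sp),  # observed result: mixture of true
                                    # positives and false positives
  nlf(p ~ inv_logit(eta)),          # latent disease probability
  eta ~ X1 + X2 + X3,               # linear predictor: the X coefficients
                                    # are estimated here
  Se + Sp ~ 1,                      # intercept-only Se and Sp
  nl = TRUE,
  family = bernoulli("identity"))
```

With a specification of this shape, the covariate coefficients should show up in the fitted model's summary as population-level effects of eta (e.g. eta_X1).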

Exercise: odds ratio and probability of belonging to group 1 in a logistic regression

I have a question about an exercise that is probably very easy, but I'm getting quite desperate with it. I would be very grateful if someone could explain the solution to me.
A logistic regression yields the following result:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2406     0.2078   1.158   0.2470
x             0.5015     0.2214   2.265   0.0235 *
(a) Calculate the odds ratio for 𝑥 and interpret the value. (Note: 𝑂𝑅 = exp(𝑏).)
(b) Calculate the probability of belonging to group 1 for a person with 𝑥 = 2.
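For a concrete check, both parts can be computed directly from the printed coefficients in R:

```r
b0 <- 0.2406   # intercept
b1 <- 0.5015   # coefficient of x

# (a) Odds ratio for a one-unit increase in x
or <- exp(b1)
or                          # about 1.65: each unit of x multiplies the
                            # odds of belonging to group 1 by ~1.65

# (b) Probability of belonging to group 1 when x = 2
p <- plogis(b0 + b1 * 2)    # inverse logit of the linear predictor
p                           # about 0.776
```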

How to get sjPlot::tab_model to show probabilities for a logistic regression with interaction

I have a logistic regression model with an interaction term. I am trying to figure out how to present probabilities for the interaction terms with tab_model.
Example data and model:
dat <- data.frame(
  Species     = rep(letters[1:10], each = 5),
  threat_cat  = rep(c("recreation", "climate", "pollution", "fire", "invasive_spp"), 10),
  impact.pres = sample(0:1, size = 50, replace = TRUE),
  threat.pres = sample(0:1, size = 50, replace = TRUE))
mod <- glm(impact.pres ~ 0 + threat_cat/threat.pres, data = dat, family = "binomial")
summary(mod)
Call:
glm(formula = impact.pres ~ 0 + threat_cat/threat.pres, family = "binomial",
data = dat)
Deviance Residuals:
      Min        1Q    Median        3Q       Max
 -1.89302  -0.66805   0.00013   0.66805   1.79412
Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)
threat_catclimate                   5.108e-01  7.303e-01   0.699    0.484
threat_catfire                      1.609e+00  1.095e+00   1.469    0.142
threat_catinvasive_spp             -1.386e+00  1.118e+00  -1.240    0.215
threat_catpollution                 1.386e+00  1.118e+00   1.240    0.215
threat_catrecreation               -1.386e+00  1.118e+00  -1.240    0.215
threat_catclimate:threat.pres      -5.108e-01  1.592e+00  -0.321    0.748
threat_catfire:threat.pres         -2.018e+01  3.261e+03  -0.006    0.995
threat_catinvasive_spp:threat.pres  1.792e+00  1.443e+00   1.241    0.214
threat_catpollution:threat.pres     3.514e-16  1.581e+00   0.000    1.000
threat_catrecreation:threat.pres    1.995e+01  2.917e+03   0.007    0.995
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 69.315 on 50 degrees of freedom
Residual deviance: 45.511 on 40 degrees of freedom
AIC: 65.511
Number of Fisher Scoring iterations: 17
If I run tab_model(mod), it returns odds ratios for both the categorical variables and the interaction terms.
However, I am interested in probabilities, to go along with a nice figure, made with:
plot_model(mod, type = "int")+
coord_flip()
I know that I can create a function to calculate probabilities for categorical coefficients, and have tried to use that with tab_model:
prob <- function(p) {exp(coef(p)) / (1 + exp(coef(p)))}
tab_model(mod, transform = prob)
This only gave me correct probabilities for the categorical coefficients, and not for the interaction terms. The second time I tried, it threw an error (Error: $ operator is invalid for atomic vectors), even though I didn't change anything.
Am I missing something? Is there a way to get tab_model to print the same probabilities that are shown in my figure, at least for the interaction where threat.pres = 1?
If not, how can I extract the data that plot_model used, such as in the format of a dataframe?
A related bit of code I could use help with, for the figure: in plot_model, show_values doesn't seem to work when type = "int", nor do any of the options for displaying significance.
Any help would be appreciated!
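As a fallback, the probabilities that plot_model() draws can be reproduced with base predict() on the response scale, which also yields them as a data frame. A self-contained sketch (a set.seed() is added so it is reproducible, and data.frame replaces cbind so the columns keep their types; the numbers will therefore differ from the question's output):

```r
set.seed(42)
dat <- data.frame(
  Species     = rep(letters[1:10], each = 5),
  threat_cat  = rep(c("recreation", "climate", "pollution",
                      "fire", "invasive_spp"), 10),
  impact.pres = sample(0:1, size = 50, replace = TRUE),
  threat.pres = sample(0:1, size = 50, replace = TRUE))

mod <- glm(impact.pres ~ 0 + threat_cat/threat.pres,
           data = dat, family = "binomial")

# One row per category x presence combination; type = "response"
# gives the same probabilities that plot_model(mod, type = "int") plots
newdat <- expand.grid(threat_cat  = unique(dat$threat_cat),
                      threat.pres = c(0, 1))
newdat$prob <- predict(mod, newdata = newdat, type = "response")
```

The rows with threat.pres = 1 correspond to the interaction panel of the figure.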

Probability: Estimating NoSQL Query Size / COUNT Using Random Samples

I have a very large NoSQL database. Each item in the database is assigned a uniformly distributed random value between 0 and 1. This database is so large that performing a COUNT on queries does not yield acceptable performance, but I'd like to use the random values to estimate COUNT.
The idea is this:
Run a query and order the query by the random value. Random values are indexed, so it's fast.
Grab the lowest N values, and see how big the largest value is, say R.
Estimate COUNT as N / R
The question is two-fold:
Is N / R the best way to estimate COUNT? Maybe it should be (N+1)/R? Maybe we could look at the other values (average, variance, etc), and not just the largest value to get a better estimate?
What is the error margin on this estimated value of COUNT?
Note: I thought about posting this in the math stack exchange, but given this is for databases, I thought it would be more appropriate here.
This actually would be better on math or statistics stack exchange.
The reasonable estimate is the following. Let R be the true count and x the value of the n-th smallest random number (note this swaps the question's notation, where R was the largest sampled value and N the sample size). If R is large, then R is approximately n / x - 1. About 95% of the time the error will be within 2 R / sqrt(n) of this. So looking at the 100th element estimates the right answer to within about 20%, the 10,000th element to within about 2%, and the millionth element to within about 0.2%.
To see this, start with the fact that the n-th order statistic of R uniform values has a Beta distribution with parameters α = n and β = R + 1 - n. This means the mean of the n-th smallest of R values is n/(R+1), and its variance is αβ / ((α + β)^2 (α + β + 1)). If R is much larger than n, the variance is approximately nR / R^3 = n / R^2, so the standard deviation is sqrt(n) / R.
If x is our order statistic, (n / x) - 1 is therefore a reasonable estimate of R. How far off is it? By the tangent-line approximation: the function (n / x) - 1 has derivative -n / x^2, whose magnitude at x = n/(R+1) is (R + 1)^2 / n, roughly R^2 / n for large R. Multiplying by the standard deviation sqrt(n) / R gives an error proportional to R / sqrt(n). Since a 95% confidence interval spans about 2 standard deviations, you will typically see an error of around 2 R / sqrt(n).
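A quick simulation (with arbitrary choices of COUNT and n for the demonstration) confirms both the estimator and the error band, using the fact quoted above that the n-th smallest of R uniforms is Beta(n, R + 1 - n) distributed:

```r
set.seed(7)
COUNT <- 1e5   # true count (R in the derivation above)
n     <- 100   # index of the order statistic we inspect

# n-th smallest of COUNT uniforms, drawn directly from its Beta distribution
x <- rbeta(10000, n, COUNT + 1 - n)
estimates <- n / x - 1

# Fraction of estimates inside the claimed 2*R/sqrt(n) error band
coverage <- mean(abs(estimates - COUNT) <= 2 * COUNT / sqrt(n))
coverage   # about 0.95
```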

Compute trigram probability from bigrams probabilities

Given bigram probabilities for words in a text, how would one compute trigram probabilities?
For example, if we know that P(dog cat) = 0.3 and P(cat mouse) = 0.2
how do we find the probability of P(dog cat mouse)?
Thank you!
In the following I consider a trigram as three random variables A,B,C. So dog cat horse would be A=dog, B=cat, C=horse.
Using the chain rule: P(A,B,C) = P(A,B) * P(C|A,B). Now you're stuck if you want to stay exact.
What you can do is assume C is independent of A given B. Then P(C|A,B) = P(C|B), and P(C|B) = P(C,B) / P(B), which you can compute from your bigram probabilities. Note that in your case P(C|B) should really be the probability of C following B, so it is the probability of the bigram BC divided by the probability of B*.
So to sum it up, when using the conditional independence assumption:
P(ABC) = P(AB) * P(BC) / P(B*)
And to compute P(B*) you have to sum the probabilities of all bigrams beginning with B.
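Plugging the question's toy numbers into this formula (the value used for P(B*) below is invented for illustration; it cannot be derived from the two given bigram probabilities alone):

```r
p_ab <- 0.3    # P(dog cat), given
p_bc <- 0.2    # P(cat mouse), given
p_b  <- 0.4    # hypothetical P(cat *): sum of the probabilities of
               # all bigrams whose first word is "cat"

# Approximation under the conditional independence assumption
p_abc <- p_ab * p_bc / p_b
p_abc   # 0.15
```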
