Logistic regression exercise: odds ratio and probability of belonging

I have a question related to an exercise that is probably very easy, but I'm getting quite desperate with it. I would be very grateful if someone could explain the solution to me.
A logistic regression yields the following result:
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.2406     0.2078   1.158   0.2470
x              0.5015     0.2214   2.265   0.0235 *
(a) Calculate the odds ratio for x and interpret the value.
Note: OR = exp(b)
(b) Calculate the probability of belonging to group 1 for a person with x = 2.
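For what it's worth, a minimal R sketch of both parts, using the rounded estimates from the output above (so the results are approximate):

b0 <- 0.2406        # intercept estimate
b1 <- 0.5015        # coefficient for x
exp(b1)             # (a) odds ratio, about 1.65: the odds of being in group 1
                    #     are multiplied by ~1.65 for each one-unit increase in x
eta <- b0 + b1 * 2  # (b) linear predictor for x = 2
1 / (1 + exp(-eta)) #     P(group 1 | x = 2), about 0.78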

Related

Creating a composite biomarker score using logistic regression coefficients

I have done a standard logistic regression model including 4 cytokines, looking at whether they can predict relapse or remission of disease. I want to create a composite biomarker score from these 4 markers that I can then enter into further predictive analysis of outcome, e.g. ROC curves and Kaplan-Meier. I was planning to do this by extracting the β coefficients from the multivariable logistic regression with all (standardized) biomarkers and then multiplying those by the (standardized) biomarker levels to create a composite. I just wondered whether this method is OK, and how I can go about it in R?
This is my logistic regression model and output. I wanted to use combinations of these four variables to make a composite biomarker score weighted by their respective coefficients and then to produce ROC curves looking at whether these biomarkers can predict outcome.
Thanks for your help.
summary(m1)

Call:
glm(formula = Outcome ~ TRAb + TACI + BCMA + BAFF, family = binomial,
    data = Timepoint.1)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.4712  0.1884  0.3386  0.5537  1.6212

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  6.340e+00  2.091e+00   3.032  0.00243 **
TRAb        -9.549e-01  3.574e-01  -2.672  0.00755 **
TACI        -6.576e-04  2.715e-04  -2.422  0.01545 *
BCMA        -1.485e-05  1.180e-05  -1.258  0.20852
BAFF        -2.351e-03  1.206e-03  -1.950  0.05120 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 72.549  on 64  degrees of freedom
Residual deviance: 48.068  on 60  degrees of freedom
AIC: 58.068

Number of Fisher Scoring iterations: 5
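One way to read "composite score weighted by the coefficients" is simply the model's linear predictor. A hedged R sketch, reusing m1 and Timepoint.1 from the question (pROC is just one choice of ROC package, not something the question prescribes):

# the linear predictor b0 + b1*TRAb + b2*TACI + ... is already the coefficient-weighted sum
Timepoint.1$composite <- predict(m1, type = "link")

# or build it by hand from the coefficients; dropping the intercept only shifts the
# score by a constant, so it does not change the ROC curve
b <- coef(m1)[c("TRAb", "TACI", "BCMA", "BAFF")]
X <- as.matrix(Timepoint.1[, c("TRAb", "TACI", "BCMA", "BAFF")])
Timepoint.1$composite_by_hand <- drop(X %*% b)

# ROC curve of the composite against outcome
library(pROC)
roc_composite <- roc(Timepoint.1$Outcome, Timepoint.1$composite)
auc(roc_composite)
plot(roc_composite)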

Probability: Estimating NoSQL Query Size / COUNT Using Random Samples

I have a very large NoSQL database. Each item in the database is assigned a uniformly distributed random value between 0 and 1. This database is so large that performing a COUNT on queries does not yield acceptable performance, but I'd like to use the random values to estimate COUNT.
The idea is this:
Run a query and order the query by the random value. Random values are indexed, so it's fast.
Grab the lowest N values, and see how big the largest value is, say R.
Estimate COUNT as N / R
The question is two-fold:
Is N / R the best way to estimate COUNT? Maybe it should be (N+1)/R? Maybe we could look at the other values (average, variance, etc), and not just the largest value to get a better estimate?
What is the error margin on this estimated value of COUNT?
Note: I thought about posting this in the math stack exchange, but given this is for databases, I thought it would be more appropriate here.
This actually would be better on math or statistics stack exchange.
The reasonable estimate is that if R is large and x is your order statistic, then R is approximately n/x - 1. (Note that here R denotes the total count you want to estimate, not the largest sampled value from the question, and x is the value of the n-th smallest random number.) About 95% of the time the error will be within 2R/sqrt(n) of this. So looking at the 100th element will estimate the right answer to within about 20%, the 10,000th element to within about 2%, and the millionth element to within about 0.2%.
To see this, start with the fact that the n-th order statistic has a Beta distribution with parameters α = n and β = R + 1 - n. This means that the mean value of the n-th smallest of R values is n/(R+1), and its variance is αβ / ((α + β)^2 (α + β + 1)). If we assume that R is much larger than n, this is approximately nR / R^3 = n / R^2, so the standard deviation is sqrt(n) / R.
If x is our order statistic, this means that (n/x) - 1 is a reasonable estimate of R. And how much is it off by? We can use the tangent-line approximation. The function (n/x) - 1 has derivative -n/x^2; at x = n/(R+1) this is -(R+1)^2 / n, which for large R is roughly R^2 / n in magnitude. Plug in our standard deviation of sqrt(n)/R and we get an error proportional to R/sqrt(n). Since a 95% confidence interval is about 2 standard deviations, you will typically see an error of around 2R/sqrt(n).

How to efficiently evaluate or approximate a road Clothoid?

I'm facing the problem of computing values of a clothoid in C in real-time.
First I tried using the Matlab Coder to obtain auto-generated C code for the quadgk integrator for the Fresnel formulas. This essentially works great in my test scenarios. The only issue is that it runs incredibly slowly (in Matlab as well as in the auto-generated code).
Another option was interpolating a data-table of the unit clothoid connecting the sample points via straight lines (linear interpolation). I gave up after I found out that for only small changes in curvature (tiny steps along the clothoid) the results were obviously degrading to lines. What a surprise...
I know that circles may be plotted using a different formula but low changes in curvature are often encountered in real-world-scenarios and 30k sampling points in between the headings 0° and 360° didn't provide enough angular resolution for my problems.
Then I tried a Taylor approximation around the R = inf point, hoping that there would be significant curvatures everywhere I wanted them to be. I soon realized I couldn't use more than 4 terms (power of 15), as the polynomial otherwise quickly becomes unstable (probably due to numerical inaccuracies in double-precision floating-point computation). Thus accuracy obviously degrades quickly for large t values. And by "large t values" I'm talking about every point on the clothoid that represents a curve of more than 90° w.r.t. the zero-curvature point.
For instance when evaluating a road that goes from R=150m to R=125m while making a 90° turn I'm way outside the region of valid approximation. Instead I'm in the range of 204.5° - 294.5° whereas my Taylor limit would be at around 90° of the unit clothoid.
I'm kinda done randomly trying out things now. I mean I could just try to spend time on the dozens of papers one finds on that topic. Or I could try to improve or combine some of the methods described above. Maybe there even exists an integrate function in Matlab that is compatible with the Coder and fast enough.
This problem is so fundamental that it feels like I shouldn't have this much trouble solving it. Any suggestions?
About the 4 terms in the Taylor series: you should be able to use many more. A total theta of 2*pi is certainly doable with doubles.
You're probably calculating each term in isolation, according to the full formula, computing the full factorial and power values. That is the reason for losing precision so quickly.
Instead, calculate the terms progressively, each one from the previous one: find the formula for the ratio of the next term over the previous one in the series, and use it.
For increased precision, do not calculate in theta but rather in the distance s (so as not to lose precision on the scaling).
Your example is an extremely flat clothoid. If I made no mistake, it goes from (25/22) pi ≈ 204.545° to (36/22) pi ≈ 294.545° (why not include these details in your question?). Nevertheless it should be OK. Even 2 pi = 360°, the full circle (and twice that), should pose no problem.
Given: r = 150 -> 125, a 90-degree turn:

r s = A^2, so 150 s = 125 (s + x)
=> 1 + x/s = 150/125 = 1 + 25/125, so x/s = 1/5

theta  = s^2 / (2 A^2) = s^2 / (300 s) = s / 300        ( = (pi/2) * (25/11) ≈ 204.545° )
theta2 = (s + x)^2 / (300 s) = (6/5)^2 * s / 300        ( = (pi/2) * (36/11) ≈ 294.545° )

theta2 - theta = (36/25 - 1) * s / 300 = pi/2
=> s = 300 * (pi/2) * (25/11) ≈ 1070.99749554,  x = s/5 ≈ 214.1994991

A^2 = 150 s = 150 * 300 * (pi/2) * (25/11)
a = sqrt(2 A^2) = 300 * sqrt((pi/2) * (25/11)) ≈ 566.83264608
The reference point is at r = Infinity, where theta = 0.
We have x = a * INT[u = 0 .. s/a] cos(u^2) du, where a = sqrt(2 r s) and theta = (s/a)^2 (this x is the Cartesian coordinate of the point, not the arc-length increment above). Write out the Taylor series for cos, and integrate it term by term, to get your Taylor approximation for x as a function of the distance s along the curve from the zero-curvature point. That's all.
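As an illustration of the progressive-term idea, here is a hedged sketch in R rather than C; the term ratio is derived from the series for cos(u^2) above:

# integral_0^t cos(u^2) du = sum_{k>=0} (-1)^k * t^(4k+1) / ((4k+1) * (2k)!)
# build each term from the previous one via the ratio
#   term_k / term_(k-1) = -t^4 * (4k - 3) / ((4k + 1) * 2k * (2k - 1))
fresnel_cos <- function(t, nterms = 30) {
  term  <- t                       # k = 0 term
  total <- term
  for (k in 1:(nterms - 1)) {
    term  <- term * (-t^4) * (4*k - 3) / ((4*k + 1) * 2*k * (2*k - 1))
    total <- total + term
  }
  total
}

# x-coordinate along the clothoid at arc length s, with a as defined above
clothoid_x <- function(s, a) a * fresnel_cos(s / a)

The y-coordinate follows the same pattern with the series for sin(u^2).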
Next you have to decide at what density to calculate your points along the clothoid. You can find it from a desired tolerance above the chord, for your minimal radius of 125. These points will then define the approximation of the curve by line segments drawn between consecutive points.
I am doing my thesis in the same area right now.
My approach is the following.
At each point on your clothoid, calculate (change in heading) / (distance traveled along the clothoid); this simple formula gives you the curvature at each point.
Then plot each curvature value: the x-axis is the distance along the clothoid, the y-axis the curvature. By plotting this and applying a very simple piecewise-linear fitting (search for a Peuker algorithm implementation in your language of choice),
you can easily identify which sections of the curve have a value of zero (a line has no curvature), a linearly increasing or decreasing value (an Euler spiral, CCW/CW), or a constant value != 0 (an arc has constant curvature at every point).
I hope this will help you a little bit.
You can find my code on GitHub; I implemented some algorithms for such problems, like the Peuker algorithm.
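A minimal sketch of the curvature profile described above, assuming you have sampled (x, y) points along the path:

# discrete curvature estimate: change in heading divided by distance traveled
curvature_profile <- function(x, y) {
  heading <- atan2(diff(y), diff(x))           # heading of each segment
  ds      <- sqrt(diff(x)^2 + diff(y)^2)       # length of each segment
  dtheta  <- diff(heading)
  dtheta  <- atan2(sin(dtheta), cos(dtheta))   # wrap heading changes to (-pi, pi]
  data.frame(s = cumsum(ds)[-1],               # distance along the path
             curvature = dtheta / ds[-1])
}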

Implementing Geometric Median

When I googled for geometric median, I got this link: Geometric median,
but I have no clue how to implement it in C. I am not very good at understanding the mathematical explanation. Let's say I have 11 pairs of coordinates; how will I calculate the geometric median for them?
I am trying to solve this problem: Grid City. I was given a hint that the geometric median would help me achieve it. I am not looking for a final solution; if someone can guide me onto the right path, that would help.
Thanks in advance.
Below is the list of coordinates (a test case). Result: 3 4
1 2
1 7
2 2
2 3
2 5
3 4
4 2
4 5
4 6
5 3
6 5
I don't think this is solvable without an iterative algorithm.
Here is a pseudocode solution similar to the hill-climbing version, except that it works to arbitrary accuracy, and in higher dimensions.
CurrentPoint = Mean(Points)
PreviousPoint = some point far from CurrentPoint (so the loop runs at least once)
While (CurrentPoint - PreviousPoint) Length > 0.01 Do
    Points2 = empty list
    For Each Point In Points Do
        Vector = CurrentPoint - Point
        Vector Length = Vector Length - 1.0
        Point2 = Point + Vector
        Add Point2 To Points2
    Loop
    PreviousPoint = CurrentPoint
    CurrentPoint = Mean(Points2)
Loop
Notes:
The constant 0.01 does not guarantee the result to be within 0.01 of the true value. Use smaller values for better precision.
The constant 1.0 should be adjusted to (I'm guessing) about 1/5 the distance between the furthest points. Too small a value will slow down the algorithm, but too large a value will cause inaccuracies, probably leading to an infinite loop.
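For comparison, here is a hedged R sketch of Weiszfeld's algorithm, the textbook fixed-point iteration for the geometric median (a different scheme from the pseudocode above; the small epsilon guards against dividing by zero when the estimate lands on a data point):

geometric_median <- function(P, tol = 1e-7, max_iter = 1000) {
  m <- colMeans(P)                               # P: one point per row; start at the centroid
  for (i in seq_len(max_iter)) {
    d <- sqrt(rowSums((P - matrix(m, nrow(P), ncol(P), byrow = TRUE))^2))
    w <- 1 / pmax(d, 1e-12)                      # inverse-distance weights
    m_new <- colSums(P * w) / sum(w)             # distance-weighted average
    if (sqrt(sum((m_new - m)^2)) < tol) break
    m <- m_new
  }
  m_new
}

pts <- matrix(c(1,2, 1,7, 2,2, 2,3, 2,5, 3,4, 4,2, 4,5, 4,6, 5,3, 6,5),
              ncol = 2, byrow = TRUE)
geometric_median(pts)   # should come out at roughly (3, 4) for the test case above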
To solve this problem, you just have to compute the mean for each coordinate and round the result.
It should solve your problem.
You are not obliged to use the concept of the geometric median; seeing that it is not easy to calculate, you are better off solving your problem without calculating it!
Here is an idea for an algorithm/implementation.
1. Start at any point (e.g. the first point in the given data).
2. Calculate the sum of distances for the current point and the 8 neighboring points (+/-1 in each direction, x and y).
3. If one of the neighbors is better than the current point, update the current point and start again from step 1.
4. (You have found the optimal distance; now choose the best point among those with equal distance.)
5. Calculate the sum of distances for the current point and the 3 neighboring points (-1 in each direction, x and y).
6. If one of the neighbors is the same (equal in sum of distances) as the current point, update the current point and continue from step 5.
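A rough R sketch of the descent part of this procedure (steps 1-3), with the tie-breaking refinement of the later steps left out:

# sum of Euclidean distances from candidate point p to all points
total_dist <- function(p, pts) sum(sqrt((pts[, 1] - p[1])^2 + (pts[, 2] - p[2])^2))

hill_climb <- function(pts) {
  p <- pts[1, ]                                       # step 1: start at the first given point
  repeat {
    cand  <- expand.grid(dx = -1:1, dy = -1:1)        # the 8 neighbors plus the point itself
    costs <- apply(cand, 1, function(d) total_dist(p + d, pts))
    best  <- which.min(costs)
    if (costs[best] >= total_dist(p, pts)) return(p)  # no strictly better neighbor: stop
    p <- p + c(cand$dx[best], cand$dy[best])          # step 3: move and repeat
  }
}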
The answer is (xi, yj), where xi is the median of all the x's and yj is the median of all the y's.
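A quick check of this against the test case from the question (coordinates copied from the list above):

x <- c(1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 6)
y <- c(2, 7, 2, 3, 5, 4, 2, 5, 6, 3, 5)
c(median(x), median(y))   # 3 4, matching the expected result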
As I commented, the solution to your problem is not the geometric mean but the arithmetic mean.
If you have to calculate the arithmetic mean, you sum all the values of the column and divide the result by the number of elements.

Information Gain and Entropy

I recently read this question regarding information gain and entropy. I think I have a semi-decent grasp on the main idea, but I'm curious what to do in situations such as the following:
If we have a bag of 7 coins, 1 of which is heavier than the others, and 1 of which is lighter than the others, and we know the heavier coin + the lighter coin is the same as 2 normal coins, what is the information gain associated with picking two random coins and weighing them against each other?
Our goal here is to identify the two odd coins. I've been thinking this problem over for a while, and can't frame it correctly in a decision tree, or any other way for that matter. Any help?
EDIT: I understand the formula for entropy and the formula for information gain. What I don't understand is how to frame this problem in a decision tree format.
EDIT 2: Here is where I'm at so far:
Assuming we pick two coins and they both end up weighing the same, we can assume our new chances of picking H+L come out to 1/5 * 1/4 = 1/20, easy enough.
Assuming we pick two coins and the left side is heavier. There are three different cases where this can occur:
HM: Which gives us 1/2 chance of picking H and a 1/4 chance of picking L: 1/8
HL: 1/2 chance of picking high, 1/1 chance of picking low: 1/1
ML: 1/2 chance of picking low, 1/4 chance of picking high: 1/8
However, the odds of us picking HM are 1/7 * 5/6 which is 5/42
The odds of us picking HL are 1/7 * 1/6 which is 1/42
And the odds of us picking ML are 1/7 * 5/6 which is 5/42
If we weight the overall probabilities with these odds, we are given:
(1/8) * (5/42) + (1/1) * (1/42) + (1/8) * (5/42) = 3/56.
The same holds true for option B.
option A = 3/56
option B = 3/56
option C = 1/20
However, option C should be weighted heavier because there is a 5/7 * 4/6 chance to pick two mediums. So I'm assuming from here I weight THOSE odds.
I am pretty sure I've messed up somewhere along the way, but I think I'm on the right path!
EDIT 3: More stuff.
Assuming the scale is unbalanced, the odds are (10/11) that only one of the coins is the H or L coin, and (1/11) that both coins are H/L
Therefore we can conclude:
(10 / 11) * (1/2 * 1/5) and
(1 / 11) * (1/2)
EDIT 4: Going to go ahead and say that it is a total 4/42 increase.
You can construct a decision tree from information-gain considerations, but that's not the question you posted, which is only to compute the information gain (presumably the expected information gain;-) from one "information extraction move" -- picking two random coins and weighing them against each other. To construct the decision tree, you need to know what moves are available from the initial state (presumably the general rule is: you can pick two sets of N coins, N < 4, and weigh them against each other -- and that's the only kind of move, parametric over N), and the expected information gain from each; that gives you the first leg of the decision tree (the move with the highest expected information gain). Then you do the same process for each of the possible results of that move, and so on down.
So do you need help computing that expected information gain for each of the three allowable values of N, only for N == 1, or can you try doing it yourself? If the third possibility obtains, that would maximize the amount of learning you get from the exercise -- which after all IS the key purpose of homework. So why don't you try, edit your question to show how you proceeded and what you got, and we'll be happy to confirm you got it right, or try and help correct any misunderstanding your procedure might reveal!
Edit: trying to give some hints rather than serving the OP the ready-cooked solution on a platter;-). Call the coins H (for heavy), L (for light), and M (for medium -- five of those). When you pick 2 coins at random you can get (out of 7 * 6 == 42 possibilities, including order) HL, LH (one each), HM, MH, LM, ML (5 each), and MM (5 * 4 == 20 cases) -- 2 plus 20 plus 20 is 42, check. In the weighing you get 3 possible results; call them A (left heavier), B (right heavier), and C (equal weight). HL, HM, and ML, 11 cases, will be A; LH, MH, and LM, 11 cases, will be B; MM, 20 cases, will be C. So A and B aren't really distinguishable (which one is left and which one is right is basically arbitrary!), so we have 22 cases where the weights will differ and 20 where they will be equal -- it's a good sign that the numbers of cases giving each result are pretty close!
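If it helps, a tiny R sketch that brute-forces the enumeration above, just to confirm the 11 / 11 / 20 split (the weights 3, 2, 1 are arbitrary placeholders; only their ordering matters):

coins  <- c("H", "L", "M", "M", "M", "M", "M")
weight <- c(H = 3, M = 2, L = 1)                 # placeholder weights

outcome <- character(0)
for (i in 1:7) for (j in setdiff(1:7, i)) {      # all 7 * 6 ordered picks of two coins
  d <- weight[coins[i]] - weight[coins[j]]
  outcome <- c(outcome, if (d > 0) "A" else if (d < 0) "B" else "C")
}
table(outcome)                                   # A: 11, B: 11, C: 20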
So now consider how many (equiprobable) possibilities existed a priori, and how many a posteriori, for each of the experiment's results. You're tasked with picking the H and the L. If you did it at random before the experiment, what would your chances be? 1 in 7 for the random pick of the H; given that succeeds, 1 in 6 for the pick of the L -- overall 1 in 42.
After the experiment, how are you doing? If C, you can rule out those two coins and you're left with a mystery H, a mystery L, and three Ms -- so if you picked at random you'd have 1 in 5 to pick H and, if successful, 1 in 4 to pick L, overall 1 in 20 -- your success chances have slightly more than doubled. It's trickier to see "what next" for the A (and equivalently B) cases because they're several, as listed above (and, less obviously, not equiprobable...), but obviously you won't pick the known-lighter coin for H (and vice versa), and if you pick one of the 5 unweighed coins for H (or L), only one of the weighed coins is a candidate for the other role (L or H respectively). Ignoring for simplicity the "non-equiprobable" issue (which is really kind of tricky), can you compute what your chances of guessing (with a random pick not inconsistent with the experiment's result) would be...?
