Using brms for a logistic regression when the outcome is imperfectly defined

I'm trying to use the brms package to run a model where my dependent variable Y is an estimate of a latent variable (Disease = 0 absent or Disease = 1 present, with probability p).
I have a data frame bd that contains a dichotomous variable Y (the result of a test for the disease, either positive 1 or negative 0) and 3 covariates (X1 numeric, X2 and X3 factors).
Y ~ Bernoulli(q) # the test result is a Bernoulli event whose probability q depends on the false-positive and false-negative fractions of the test
q ~ p*Se + (1-p)*(1-Sp) # true positives plus false positives, given the true probability of disease p
The model I ultimately want is of the form:
logit(p) ~ X1 + X2 + X3 # I want to determine the impact of my Xi's on the latent probability p
I used the brms non-linear formula syntax but am struggling with a specific problem.
bform <- bf(
  Y ~ q,                                # defining my Bernoulli event
  nlf(q ~ Se * p + (1 - p) * (1 - Sp)),
  nlf(p ~ inv_logit(X1 + X2 + X3)),
  Se + Sp ~ 1,
  nl = TRUE,
  family = bernoulli("identity")
)
I put beta priors on the test sensitivity and specificity, leaving the default priors for the logistic regression coefficients:
bprior <- set_prior("beta(4.6, 0.86)", nlpar = "Se", lb = 0, ub = 1) +
  set_prior("beta(77.55, 4.4)", nlpar = "Sp", lb = 0, ub = 1)
My final model call looks like this (using the previously created objects bform and bprior):
brm(bform, data=bd, prior=bprior, init="0")
When running the model I only get posteriors for the Se and Sp parameters; I cannot see any coefficients associated with my covariates X1, X2, X3.
I guess my model has a mistake, but I'm not able to see what is happening.
Any help would be greatly appreciated!!!
I expected output from the line p ~ inv_logit(X1 + X2 + X3) so that I could determine the coefficients of this logistic regression (which accounts for the imperfectly measured dependent variable).

Related

If I have a large list of coordinates, how can I extract the y-values that correspond to a specific x-value?

I have three datasets that compile into one big dataset.
Data1 has x-values ranging from 0-47 (ordered), with many y-values (differing by small errors) attached to each x-value. In total there are approximately 100,000 y-values.
Data2 and Data3 are similar but with x-values 48-80 and 80-95 respectively.
The end goal is to produce a standard deviation for each x-value (therefore 96 in total), based on the numerous y-values. Therefore, I think I should first extract the y-values for each x-value out of these datasets and then compute the standard deviation in the usual way.
In Mathematica, I have tried using the Select and Part functions, to no avail.
Statistically it would be better to provide a prediction interval with the predicted value of y.
There is a video about that here:
Intervals (for the Mean Response and a Single Response) in Simple Linear Regression
Illustrating with some example data, stored here as a QR code.
qrimage = Import["https://i.stack.imgur.com/s7Ul7.png"];
data = Uncompress@BarcodeRecognize@qrimage;
ListPlot[data, Frame -> True, Axes -> None]
Setting ≈68% and ≈95% confidence levels (1σ and 2σ)
cl = Map[Function[σ, 2 (CDF[NormalDistribution[0, 1], σ] - 0.5)], {1, 2}];
(* trying a quadratic linear fit *)
lm = LinearModelFit[data, {1, a, a^2}, a];
bands = lm["SinglePredictionBands", ConfidenceLevel -> #] & /@ cl;
(* x value for an observation outside of the sample observations *)
x0 = 50;
(* Predicted value of y *)
y0 = lm[x0]
39.8094
(* Least-squares regression of Y on X *)
Normal[lm]
26.4425 - 0.00702613 a + 0.0054873 a^2
(* Confidence interval for y0 given x0 *)
b1 = bands /. a -> x0;
(* R^2 goodness of fit *)
lm["RSquared"]
0.886419
b2 = {bands, {Normal[lm]}};
(* Prediction intervals plotted over the data range *)
Show[
Plot[b2, {a, 0, 100}, PlotRange -> {{0, 100}, Automatic}, Filling -> {1 -> {2}}],
ListPlot[data],
ListPlot[{{x0, lm[x0]}}, PlotStyle -> Red],
Graphics[{Red, Line[{{x0, Min[b1]}, {x0, Max[b1]}}]}],
Frame -> True, Axes -> None]
Row[{"For x0 = ", x0, ", y0 = ", y0,
" with 95% prediction interval ", y0, " ± ", y0 - Min[b1]}]
For x0 = 50, y0 = 39.8094 with 95% prediction interval 39.8094 ± 12.1118
Addressing your requirement:
The end goal is to produce a standard deviation for each x value (therefore 96 in total), based on the numerous y-values.
The best measure for this may be the standard errors, which can be found via
lm["SinglePredictionConfidenceIntervalTable"] and lm["SinglePredictionErrors"]
They will provide "standard errors for the predicted response of single observations". If you have multiple y values for a single x there will still just be one standard error for each x value.
Ref: https://reference.wolfram.com/language/ref/LinearModelFit.html (Details & Options)
See if you can adapt this
exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
(*first a manual calculation to see what the answer should be*)
{StandardDeviation[{1,2,4}],StandardDeviation[{1,2,2}],StandardDeviation[{4,5,12}]}
(*and now automate the calculation*)
(*if your x values are not exact this will need to be changed*)
x=Union[Map[First,exampledata]];
y[x_]:=Map[Last,Cases[exampledata,{x,_}]];
std=Map[StandardDeviation[y[#]]&,x]
(*{Sqrt[7/3], 1/Sqrt[3], Sqrt[19]}*)
Since you have 100000 pairs this might speed it up.
You have said that your data is sorted on x so I won't sort it here.
If your data isn't sorted this will produce incorrect results.
exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
y[x_]:=Map[Last,x];
std=Map[StandardDeviation[y[#]]&, SplitBy[exampledata,First]]
That should give exactly the same results, with fewer passes through the data. You might compare the timing of the two methods and verify that they do produce exactly the same results.
Reading this over, I am not absolutely certain that I correctly understood your verbal description of the form of your data structure. I thought you had a long list of {x,y} points with lots of repeated x-values. If it looks like I misunderstood, and you could include a tiny example bit of Mathematica code holding some of your sample data, then I would edit my code to match.

How to form the logic to solve differential equations using Euler's method in C

We have the following mathematical formulas to solve differential equations by Euler's method
x_(n+1) = x_n + h
y_(n+1) = y_n + h*f(x_n, y_n)
Suppose we are given y(x0) (some value), so we have x0 and y0; the step size h is also provided by the user.
I am having trouble understanding how to accept the function f(x_n, y_n) from the user, since the function can be algebraic, trigonometric, exponential, or logarithmic, and it is given in the form dy/dx = (expression). The program should be able to solve any differential equation entered by the user, and the approximations must be correct to 4 decimal places.
It is possible to accept the expression as a string, but then I would not be able to perform calculations on it. Any suggestions or solutions would be appreciated.
Reference for Euler's method: http://calculuslab.deltacollege.edu/ODE/7-C-1/7-C-1-h-c.html
Example Input:
expression input:
dy/dx=x+2y
initial conditions input:
x0=0, y0=0
enter step size: h=0.25
enter the value at which you want the solution: x=1
Example output: (shown as an image in the original post)
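For reference, here is a minimal C sketch of the update rule itself, with the example right-hand side f(x, y) = x + 2y hard-coded; parsing an arbitrary user-typed expression (the actual question) is not attempted here.
#include <stdio.h>

/* hard-coded right-hand side of dy/dx = f(x, y); here f(x, y) = x + 2y */
static double f(double x, double y) {
    return x + 2.0 * y;
}

int main(void) {
    double x = 0.0, y = 0.0;   /* initial conditions x0, y0            */
    double h = 0.25;           /* step size                            */
    double xEnd = 1.0;         /* point where the solution is wanted   */

    while (x < xEnd - 1e-12) {
        y = y + h * f(x, y);   /* y_(n+1) = y_n + h*f(x_n, y_n)        */
        x = x + h;             /* x_(n+1) = x_n + h                    */
    }
    printf("y(%.4f) = %.4f\n", x, y);   /* prints y(1.0000) = 0.5156 for this toy case */
    return 0;
}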

Resampling two vectors with interp1 or spline

Situation:
I was trying to compare two signal vectors (y1 & y2, with time vectors x1 & x2) of different lengths (length(y1) = 1000 > length(y2) = 800). For this, I followed the advice given almost everywhere: use interp1 or spline to 'expand' y2 towards y1 in number of samples through interpolation.
So I want:
length(y1)=length(y2_interp)
However, these functions require the points xq at which to interpolate, so I generate a vector with the resampled points I want to compute:
xq = x2(1):(length(x2))/length(x1):x2(length(x2));
y2_interp = interp1(x2,y2,xq,'spline'); % or spline method directly
RMS = rms(y1-y2_interp)
The problem:
When I resample the x vector into the 'xq' variable, the ratio of lengths is not an integer, so 'y2_interp' does not come out the same length as 'y1'. Rounding runs into the same problem.
I tried interpolate using the 'resample' function:
y2_interp=resample(y2,length(y1),length(y2),n);
But I get an aliasing problem, and I want to avoid filters if possible. And if n = 0 (no filter) I get some sampling problems and a larger RMS.
The two vectors are quite long, so the length mismatch is only 2 or 3 points.
What I'm looking for:
I would like to find a way of interpolating one vector but having as a reference the length of another one, and not the points where I want to interpolate.
I hope I have explained it well... Maybe I have some misconception; more than anything, I'm curious about any possible idea.
Thanks!!
The function you are looking for here is linspace
To get an evenly spaced vector xq with the same endpoints as x2 but the same length as x1:
xq = linspace(x2(1), x2(end), length(x1));
It is not sufficient to interpolate y2 to get the right number of samples; the samples should be at locations corresponding to the samples of y1.
Thus, you want to interpolate y2 at the x-coordinates where you have samples for y1, which is given by x1:
y2_interp = interp1(x2,y2,x1,'spline');
RMS = rms(y1-y2_interp)

AI: estimate the mass of a spaceship by prodding it (exerting a small force) and sensing the change in its velocity

Problem
I have to code an AI to find the mass of a spaceship in a game.
My AI can exert a small force c on the spaceship, to measure the mass via the change in velocity.
However, my AI can only access the current position of the spaceship, x, at each time-step.
The mass is not constant, but it is safe to assume that it will not change too fast.
For simplicity:
The space is 1D and has no gravity.
The timestep is always 1 second.
Forces
There are many forces acting on the spaceship at any moment, e.g. gravity, an automatic propulsion system controlled by an unknown AI, collision impulses, etc.
The sum of these forces is b, which depends on t (time).
The acceleration a for a given timestep is calculated by a game-play formula that is out of my control:
a = (b + c) / m        (1)
The velocity v is updated as:
v = vOld + a           (2)
The position x is updated as:
x = xOld + v           (3)
The order of execution of (1)-(3) is also unknown, i.e. the AI should not rely on any particular order.
My poor solution
I will exert c0 = 0.001 for a few seconds and compare the result against exerting c1 = -0.001.
I assume that b and m are constant over that time period.
I calculate acceleration as follows:
t 0 1 2 3 (Exert force `c0` at `t1`, `c1` at `t2`)
x 0 1 2 3 (The number are points in timeline that I sampling x.)
v 0 1 2 (v0=x1-x0, v1=x2-x1, ... )
a 0 1 (a0=v1-v0, ... )
Now I know the acceleration at 2 points of the timeline, and I know c because I am the one who exerts it.
With a = (b + c) / m, unknown b and m, and known a0, a1, c0 and c1:
a0 = (b+c0)/m
a1 = (b+c1)/m
I can solve them to find b and m.
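Spelling out that algebra (my own step, not in the original post): subtracting the two equations gives a0 - a1 = (c0 - c1)/m, so m = (c0 - c1)/(a0 - a1) and then b = m*a0 - c0.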
However, my assumption is wrong from the start: b and m are not actually constants.
This problem might be viewed in a more casual way:
Many people are trying to lift a heavy rock.
I am one of them.
How can I measure the mass of the rock (by the feel in my hands) without interrupting them too much?
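For concreteness, here is a minimal C sketch of the two-window probe described above (my own illustration, with made-up probe forces and position logs); it leans on the same shaky assumption, namely that b and m stay roughly constant across both windows.
#include <stdio.h>

/* Average acceleration from a position log, using second differences:
   a_t = (x[t+2] - x[t+1]) - (x[t+1] - x[t])                           */
static double mean_accel(const double *x, int n) {
    double sum = 0.0;
    for (int t = 0; t + 2 < n; t++)
        sum += (x[t + 2] - x[t + 1]) - (x[t + 1] - x[t]);
    return sum / (n - 2);
}

#define N 8

int main(void) {
    double c0 = 0.001, c1 = -0.001;   /* probe forces for the two windows */

    /* made-up position samples recorded while c0, then c1, was applied   */
    double xc0[N] = {0.0, 1.0, 2.2, 3.6, 5.2, 7.0, 9.0, 11.2};
    double xc1[N] = {11.2, 13.5, 15.9, 18.4, 21.0, 23.7, 26.5, 29.4};

    double a0 = mean_accel(xc0, N);   /* average acceleration under c0    */
    double a1 = mean_accel(xc1, N);   /* average acceleration under c1    */

    /* a0 = (b + c0)/m, a1 = (b + c1)/m  =>  m and b as derived above     */
    double m = (c0 - c1) / (a0 - a1);
    double b = m * a0 - c0;

    printf("estimated m = %g, estimated b = %g\n", m, b);
    return 0;
}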

Interest Rate in Value Iteration Algorithm

In the chapter about the value iteration algorithm for computing an optimal policy for MDPs, there is this algorithm:
function Value-Iteration(mdp, ε) returns a utility function
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s'|s,a),
          rewards R(s), discount γ
          ε, the maximum error allowed in the utility of any state
  local variables: U, U', vectors of utilities for states in S, initially zero
                   δ, the maximum change in the utility of any state in an iteration
  repeat
    U ← U'; δ ← 0
    for each state s in S do
      U'[s] ← R(s) + γ max(a in A(s)) ∑ over s' (P(s'|s,a) U[s'])
      if |U'[s] - U[s]| > δ then δ ← |U'[s] - U[s]|
  until δ < ε(1-γ)/γ
  return U
A chapter earlier there was also this statement:
A discount factor of γ is equivalent to an interest rate of (1/γ) − 1.
Could anyone explain what the interest rate (1/γ) − 1 means? How did they get it? Why is it used in the termination condition of the algorithm above?
A reward one step in the future is discounted by a factor γ: old = γ × new. So new = (1/γ) × old, and new − old = ((1/γ) − 1) × old. That is your interest rate.
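As a concrete example (my numbers): with γ = 0.9, a reward of 1 received one step later is worth 0.9 now, and growing 0.9 back into 1 over that step requires a return of 1/0.9 − 1 ≈ 0.111, i.e. an 11.1% interest rate.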
I am not so sure why it is used in the termination condition; the value of epsilon is arbitrary anyway.
In fact, I believe this termination criterion is quite bad. It does not work when γ = 1. When γ = 0, the iteration should stop immediately, since a single sweep already gives the exact values; when γ = 1, many iterations are necessary.
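For reference, a minimal sketch of the algorithm above in C, with a made-up two-state, two-action MDP (not from the book), showing where the δ < ε(1−γ)/γ test sits:
#include <stdio.h>
#include <math.h>

#define NS 2   /* states  */
#define NA 2   /* actions */

int main(void) {
    /* made-up MDP: P[s][a][s'] transition probabilities, R[s] rewards */
    double P[NS][NA][NS] = {
        {{0.9, 0.1}, {0.2, 0.8}},
        {{1.0, 0.0}, {0.5, 0.5}}
    };
    double R[NS] = {0.0, 1.0};
    double gamma = 0.9, eps = 1e-4;
    double U[NS] = {0.0, 0.0}, Unew[NS];
    double delta;

    do {
        delta = 0.0;
        for (int s = 0; s < NS; s++) {
            double best = -INFINITY;
            for (int a = 0; a < NA; a++) {
                double q = 0.0;
                for (int s2 = 0; s2 < NS; s2++)
                    q += P[s][a][s2] * U[s2];
                if (q > best) best = q;
            }
            Unew[s] = R[s] + gamma * best;              /* Bellman update   */
            if (fabs(Unew[s] - U[s]) > delta)
                delta = fabs(Unew[s] - U[s]);
        }
        for (int s = 0; s < NS; s++) U[s] = Unew[s];
    } while (delta >= eps * (1.0 - gamma) / gamma);     /* termination test */

    for (int s = 0; s < NS; s++)
        printf("U[%d] = %.4f\n", s, U[s]);
    return 0;
}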

Resources