If I have a large list of coordinates, how can I extract the y-values that correspond to a specific x-value? - database

I have three datasets that compile into one big dataset.
Data1 has x-values ranging from 0-47 (ordered), with many y-values (a small error) attached to an x-value. In total there are approx 100000 y values.
Data 2 and 3 are similar but with x-values 48-80 and 80-95 respectively.
The end goal is to produce a standard deviation for each x value (therefore 96 in total), based on the numerous y-values. Therefore, I think I should first extract the y-values for each x-value out of these datasets and then determine the standard deviation as per the norm.
In Mathematica, I have tried using the Select and Part functions to no avail.

Statistically it would be better to provide a prediction interval with the predicted value of y.
There is a video about that here:
Intervals (for the Mean Response and a Single Response) in Simple Linear Regression
Illustrating with some example data, stored here as a QR code.
qrimage = Import["https://i.stack.imgur.com/s7Ul7.png"];
data = Uncompress@BarcodeRecognize@qrimage;
ListPlot[data, Frame -> True, Axes -> None]
Setting 68% & 95% confidence levels
cl = Map[Function[σ, 2 (CDF[NormalDistribution[0, 1], σ] - 0.5)], {1, 2}];
(* trying a quadratic linear fit *)
lm = LinearModelFit[data, {1, a, a^2}, a];
bands = lm["SinglePredictionBands", ConfidenceLevel -> #] & /@ cl;
(* x value for an observation outside of the sample observations *)
x0 = 50;
(* Predicted value of y *)
y0 = lm[x0]
39.8094
(* Least-squares regression of Y on X *)
Normal[lm]
26.4425 - 0.00702613 a + 0.0054873 a^2
(* Confidence interval for y0 given x0 *)
b1 = bands /. a -> x0;
(* R^2 goodness of fit *)
lm["RSquared"]
0.886419
b2 = {bands, {Normal[lm]}};
(* Prediction intervals plotted over the data range *)
Show[
Plot[b2, {a, 0, 100}, PlotRange -> {{0, 100}, Automatic}, Filling -> {1 -> {2}}],
ListPlot[data],
ListPlot[{{x0, lm[x0]}}, PlotStyle -> Red],
Graphics[{Red, Line[{{x0, Min[b1]}, {x0, Max[b1]}}]}],
Frame -> True, Axes -> None]
Row[{"For x0 = ", x0, ", y0 = ", y0,
" with 95% prediction interval ", y0, " ± ", y0 - Min[b1]}]
For x0 = 50, y0 = 39.8094 with 95% prediction interval 39.8094 ± 12.1118
Addressing your requirement:
The end goal is to produce a standard deviation for each x value (therefore 96 in total), based on the numerous y-values.
The best measure for this may be the standard errors, which can be found via
lm["SinglePredictionConfidenceIntervalTable"] and lm["SinglePredictionErrors"]
They will provide "standard errors for the predicted response of single observations". If you have multiple y values for a single x there will still just be one standard error for each x value.
Ref: https://reference.wolfram.com/language/ref/LinearModelFit.html (Details & Options)

See if you can adapt this
exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
(*first a manual calculation to see what the answer should be*)
{StandardDeviation[{1,2,4}],StandardDeviation[{1,2,2}],StandardDeviation[{4,5,12}]}
(*and now automate the calculation*)
(*if your x values are not exact this will need to be changed*)
x=Union[Map[First,exampledata]];
y[x_]:=Map[Last,Cases[exampledata,{x,_}]];
std=Map[StandardDeviation[y[#]]&,x]
(*{Sqrt[7/3], 1/Sqrt[3], Sqrt[19]}*)
Since you have 100000 pairs this might speed it up.
You have said that your data is sorted on x so I won't sort it here.
If your data isn't sorted this will produce incorrect results.
exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
y[x_]:=Map[Last,x];
std=Map[StandardDeviation[y[#]]&, SplitBy[exampledata,First]]
That should give exactly the same results, with fewer passes through the data. You might compare the timing of the two methods and verify that they do produce exactly the same results.
Reading this over, I am not absolutely certain that I correctly understood your verbal description of the form of your data structure. I thought you had a long list of {x,y} points with lots of repeated x values. If it looks like I misunderstood, and you could include a tiny bit of Mathematica code holding some of your sample data, then I would edit my code to match.
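For readers who want to cross-check the grouping idea outside Mathematica, here is a minimal Python sketch of the same approach as the SplitBy version. It assumes the data is a list of (x, y) pairs already sorted by x; the names points and std_by_x are hypothetical, chosen only for this illustration.

```python
from itertools import groupby
from operator import itemgetter
from statistics import stdev

# Hypothetical sample data: (x, y) pairs already sorted by x.
points = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (2, 2), (3, 4), (3, 5), (3, 12)]

def std_by_x(pairs):
    """Return {x: sample standard deviation of the y-values sharing that x}."""
    result = {}
    # groupby only merges consecutive items, so the input must be sorted by x.
    for x, group in groupby(pairs, key=itemgetter(0)):
        ys = [y for _, y in group]
        # stdev needs at least two values; x-values with a single y are skipped.
        if len(ys) >= 2:
            result[x] = stdev(ys)
    return result

std = std_by_x(points)
```

Like StandardDeviation, statistics.stdev uses the sample (n - 1) definition, so the values should agree with the manual calculation above.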

Related

Resampling two vectors with interp1 or spline

Situation:
I was trying to compare two signal vectors (y1 & y2 with time vectors x1 & x2) with different lengths (len(y1)=1000>len(y2)=800). For this, I followed the main piece of advice given nearly everywhere: use interp1 or spline to 'expand' y2 towards y1 in number of samples through interpolation.
So I want:
length(y1)=length(y2_interp)
However, in these functions you have to give the points 'x' where to interpolate (xq), so I generate a vector with the resampled points I want to compute:
xq = x2(1):(length(x2))/length(x1):x2(length(x2));
y2_interp = interp1(x2,y2,xq,'spline'); % or spline method directly
RMS = rms(y1-y2_interp)
The problem:
When I resample the x vector in the 'xq' variable, the fraction of lengths is not an integer, so 'y2_interp' does not end up with the same length as 'y1'. Rounding does not help, for the same reason.
I tried interpolate using the 'resample' function:
y2_interp=resample(y2,length(y1),length(y2),n);
But I get an aliasing problem and I want to avoid filters if possible. And if n=0 (no filters) I get some sampling problems and more RMS.
The two vectors are quite long, so my misalignment is just of 2 or 3 points.
What I'm looking for:
I would like to find a way of interpolating one vector but having as a reference the length of another one, and not the points where I want to interpolate.
I hope I have explained it well... Maybe I have some misconception. It's more that I'm curious about any possible idea.
Thanks!!
The function you are looking for here is linspace
To get an evenly spaced vector xq with the same endpoints as x2 but the same length as x1:
xq = linspace(x2(1),x2(end),length(x1));
It is not sufficient to interpolate y2 to get the right number of samples, the samples should be at locations corresponding to samples of y1.
Thus, you want to interpolate y2 at the x-coordinates where you have samples for y1, which is given by x1:
y2_interp = interp1(x2,y2,x1,'spline');
RMS = rms(y1-y2_interp)
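The same idea (evaluate y2 at the sample locations x1, then compare) can be sketched in plain Python. This is only an illustration under stated assumptions: it uses piecewise-linear interpolation rather than the 'spline' method, it clamps queries outside the range of x2 to the end values, and the function names interp_linear and rms_difference are hypothetical.

```python
from bisect import bisect_right
from math import sqrt

def interp_linear(xs, ys, xq):
    """Piecewise-linear interpolation of (xs, ys) at one query point xq.

    xs must be strictly increasing; xq is clamped to [xs[0], xs[-1]].
    """
    if xq <= xs[0]:
        return ys[0]
    if xq >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, xq) - 1            # xs[i] <= xq < xs[i + 1]
    p = (xq - xs[i]) / (xs[i + 1] - xs[i])  # relative position in the interval
    return ys[i] + p * (ys[i + 1] - ys[i])

def rms_difference(x1, y1, x2, y2):
    """RMS of y1 - y2 after resampling y2 at the sample locations x1."""
    y2_interp = [interp_linear(x2, y2, x) for x in x1]
    return sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2_interp)) / len(y1))

# Hypothetical signals: the same straight line sampled on two different grids.
x1 = [0.0, 0.5, 1.0, 1.5, 2.0]
y1 = [2 * x for x in x1]
x2 = [0.0, 1.0, 2.0]
y2 = [2 * x for x in x2]
rms = rms_difference(x1, y1, x2, y2)
```

Because the result always has one value per entry of x1, the length mismatch from the question cannot occur.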

Matlab - Distances of two lines

I have two lines, one straight and one curvy. Both have an arbitrary number of x and y values defining the lines - the number of x and y values are not the same for either line. I am attempting to get separate distances of points between the curved line coordinates and the straight line coordinates. You can think of discrete integration to get a better picture of what I'm talking about, something along the lines of this: http://www.scientific-solutions.ch/tech/origin/products/images/calculus_integral.gif
By adding the different distances, I would get the area. The part on which I am stuck is the actual synchronization of the points. I cannot simply compare the x and y values of the straight and curve coordinates every ten indices, for example, because the curved coordinates are time dependent (as in the points do not change at a general rate). I need a way to synchronize the actual coordinates of the two sets of points. I thought about interpolating both sets of points to a specific number of points, but again, the time dependence of the curved set of points makes that solution void.
Could someone please suggest a good way of doing this, outlining the basics? I really appreciate the help.
Code to be tried (pseudo):
xLine = [value1 value2 ...]
yLine = [value1 value2 ...]
xCurve = [value1 value2 ...]
yCurve = [value1 value2 ...]
xLineInterpolate = %interpolate of every 10 points of x until a certain value. same with yLineInterpolate, xCurveInterpolate and yCurveInterpolate.
Then, I could just take the same index from each array and do some algebra to get the distance. My worry is that my line values increase at a constant rate whereas my curve values sometimes do not change (x and y values have different rates of change) and sometimes do. Would such an interpolation method be wrong then?
If I understand correctly, you want to know the distance between a straight line and a curve. The easiest way is to perform a coordinate transformation such that the straight line is the new x-axis. In that frame, the y-values of the curved line are the distances you seek.
This coordinate transformation is equal to a rotation and a translation, as in the following:
% estimate coefficients for straight line
sol = [x1 ones(size(x1))] \ y1;
m = sol(1); %# slope
b = sol(2); %# y-offset at x=0
% shift curved line down by b
% (makes the straight line go through the origin)
y2 = y2 - b;
% rotate the curved line by -atan(m)
% (makes the straight line run parallel to the x-axis)
a = -atan(m);
R = [+cos(a) -sin(a)
+sin(a) +cos(a)];
XY = R*[x2; y2];
% the distances are then the values of y3.
x3 = XY(1,:);
y3 = XY(2,:);
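Carrying the shift and the rotation through by hand, with cos(atan(m)) = 1/sqrt(1+m^2) and sin(atan(m)) = m/sqrt(1+m^2), the rotated y-coordinate collapses to (y - b - m*x)/sqrt(1+m^2). A minimal Python sketch of that closed form follows; the function name signed_distances and the sample values are hypothetical, and the slope m and offset b are assumed to be known already.

```python
from math import sqrt

def signed_distances(m, b, xs, ys):
    """Signed perpendicular distance of each point (x, y) from the line y = m*x + b.

    Positive values lie above the line, negative values below it.
    """
    norm = sqrt(1.0 + m * m)
    return [(y - b - m * x) / norm for x, y in zip(xs, ys)]

# Hypothetical example: the line y = x and three points of a curve.
d = signed_distances(1.0, 0.0, [0.0, 1.0, 2.0], [1.0, 1.0, 0.0])
```

This avoids building the rotation matrix when only the distances are needed.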
You need to use interpolation. I don't see how the time-dependence is relevant here - perhaps you are thinking of fitting a straight line to both curves? That's a bad idea.
You can do a simple interpolation for any curve just by assuming that every two adjacent points are connected by a straight line. This can be shown to be a reasonable approximation for the curve.
So, let's say you are looking at (x1,y1) and (x2,y2), which are adjacent to each other, and now you choose an x3 that is between x1 and x2 (x1 < x3 < x2) and want to find the y3 value.
A simple way to find y3 is the following:
p=(x3-x1)/(x2-x1)
y3=y1+p*(y2-y1)
The idea is that p shows the relative position between x1 and x2 (0.5 would be the middle, for example), and then you use p as the relative position between y1 and y2.

Uniformly sampling on hyperplanes

Given the vector size N, I want to generate a vector <s1,s2, ..., sn> such that s1+s2+...+sn = S.
It is known that 0<S<1 and si < S. The generated vectors should also be uniformly distributed.
Any code in C that helps explain would be great!
The code here seems to do the trick, though it's rather complex.
I would probably settle for a simpler rejection-based algorithm, namely: pick an orthonormal basis in n-dimensional space starting with the hyperplane's normal vector. Transform each of the points (S,0,0,0..0), (0,S,0,0..0) into that basis and store the minimum and maximum along each of the basis vectors. Sample uniformly each component in the new basis, except for the first one (the normal vector), which is always S, then transform back to the original space and check if the constraints are satisfied. If they are not, sample again.
P.S. I think this is more of a maths question, actually, could be a good idea to ask at http://maths.stackexchange.com or http://stats.stackexchange.com
[I'll skip "hyper-" prefix for simplicity]
One of possible ideas: generate many uniformly distributed points in some enclosing volume and project them on the target part of plane.
To get uniform distribution the volume must be shaped like the part of plane but with added margins along plane normal.
To uniformly generate points in such a volume we can enclose it in a cube and reject everything outside of the volume.
1. Select a margin M; let's take M=S for simplicity (as long as the margin is positive it affects only performance).
2. Generate a point in the cube [-M,S+M]x[-M,S+M]x[-M,S+M].
3. If the distance to the plane is more than M, reject the point and go to #2.
4. Project the point onto the plane.
5. Check that the projection falls into [0,S]x[0,S]x[0,S]; if not, reject it and go to #2.
6. Add this point to the resulting set and go to #2 if you need more points.
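The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not tuned code: the function name sample_on_plane and its parameters are hypothetical, the margin defaults to S as in the text, and the acceptance rate drops quickly as the dimension n grows.

```python
import random
from math import sqrt

def sample_on_plane(n, S, margin=None, rng=random):
    """Draw one point with x_1 + ... + x_n = S and 0 <= x_i <= S by rejection."""
    M = S if margin is None else margin
    while True:
        # Uniform point in the cube [-M, S+M]^n.
        p = [rng.uniform(-M, S + M) for _ in range(n)]
        excess = sum(p) - S
        # Reject points farther than M from the plane sum(x) = S.
        if abs(excess) / sqrt(n) > M:
            continue
        # Orthogonal projection onto the plane (move along the normal (1,...,1)).
        q = [c - excess / n for c in p]
        # Keep only projections that land inside [0, S]^n.
        if all(0.0 <= c <= S for c in q):
            return q

random.seed(0)
samples = [sample_on_plane(3, 0.5) for _ in range(200)]
```

Every accepted point lies on the plane by construction; the box check is what restricts it to the wanted part of the plane.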
The problem can be mapped to that of sampling on linear polytopes, for which the common approaches are Monte Carlo methods, random walks, and hit-and-run methods (see https://www.jmlr.org/papers/volume19/18-158/18-158.pdf for examples and a short comparison). It is related to linear programming, and can be extended to manifolds.
There is also the analysis of polytopes in compositional data analysis, e.g. https://link.springer.com/content/pdf/10.1023/A:1023818214614.pdf, which provide an invertible transformation between the plane and the polytope that can be used for sampling.
If you are working in low dimensions, you can also use rejection sampling. This means you first sample on the plane containing the polytope and then reject the samples that violate your inequalities. This latter method is easy to implement (and wasteful, of course); the GNU Octave code below is an example (I let the author of the question re-implement it in C).
The first requirement is to get a vector orthogonal to the hyperplane. For a sum of N variables this is n = (1,...,1). The second requirement is a point on the plane. For your example that could be p = (S,...,S)/N.
Now any point on the plane satisfies n^T * (x - p) = 0
we assume also that x_i >= 0
With these given you compute an orthonormal basis on the plane (the null space of n^T) and then create random combinations in that basis. Finally you map back to the original space and apply your constraints to the generated samples.
# Example in 3D
dim = 3;
S = 1;
n = ones(dim, 1); # perpendicular vector
p = S * ones(dim, 1) / dim;
# null-space of the perpendicular vector (transposed, i.e. row vector)
# this generates a basis in the plane
V = null (n.');
# These steps are just to reduce the amount of samples that are rejected
# we build a tight bounding box
bb = S * eye(dim); # each column is a corner of the constrained region
# project on the null-space
w_bb = V \ (bb - repmat(p, 1, dim));
wmin = min (w_bb(:));
wmax = max (w_bb(:));
# random combinations and map back
nsamples = 1e3;
w = wmin + (wmax - wmin) * rand(dim - 1, nsamples);
x = V * w + p;
# mask the points inside the polytope
msk = true(1, nsamples);
for i = 1:dim
msk &= (x(i,:) >= 0);
endfor
x_in = x(:, msk); # inside the polytope (your samples)
x_out = x(:, !msk); # outside the polytope
# plot the results
scatter3 (x(1,:), x(2,:), x(3,:), 8, double(msk), 'filled');
hold on
plot3(bb(1,:), bb(2,:), bb(3,:), 'xr')
axis image

MATLAB: Interpolating to find the x value of the intersection between a line and a curve

Here is the graph I currently have:
The dotted blue line represents the y-value that corresponds to the x-values I am looking for. I am trying to find the x-values of the line's intersections with the blue curve (Upper). Since the intersections do not fall on points that have already been defined, we need to interpolate points that fall on the Upper plot.
Here is the information I have:
LineValue - The y value of the intersection and the value of the dotted line( y = LineValue)
Frequency - an array containing the x value coordinates seen on this plot. The interpolated values of Frequency that corresponds to LineValue are what we are looking for
Upper/Lower - arrays containing the y value info for this graph
This solution is an improvement on Amro's answer. Instead of using fzero you can simply calculate the intersection of the line by looking for transition in the first-difference of the series created by a logical comparison to LineValue. So, using Amro's sample data:
>> x = linspace(-100,100,100);
>> y = 1-2.*exp(-0.5*x.^2./20)./(2*pi) + randn(size(x))*0.002;
>> LineValue = 0.8;
Find the starting indices of those segments of consecutive points that exceed LineValue:
>> idx = find(diff(y >= LineValue))
idx =
48 52
You can then calculate the x positions of the intersection points using weighted averages (i.e. linear interpolation):
>> x2 = x(idx) + (LineValue - y(idx)) .* (x(idx+1) - x(idx)) ./ (y(idx+1) - y(idx))
x2 =
-4.24568579887939 4.28720287203057
Plot these up to verify the results:
>> figure;
>> plot(x, y, 'b.-', x2, LineValue, 'go', [x(1) x(end)], LineValue*[1 1], 'k:');
The advantages of this approach are:
The determination of the intersection points is vectorized so will work regardless of the number of intersection points.
Determining the intersection points arithmetically is presumably faster than using fzero.
Example solution using FZERO:
%# data resembling your curve
x = linspace(-100,100,100);
f = @(x) 1-2.*exp(-0.5*x.^2./20)./(2*pi) + randn(size(x))*0.002;
VALUE = 0.8;
%# solve f(x)=VALUE
z1 = fzero(@(x)f(x)-VALUE, -10); %# find solution near x=-10
z2 = fzero(@(x)f(x)-VALUE, 10); %# find solution near x=+10
%# plot
plot(x,f(x),'b.-'), hold on
plot(z1, VALUE, 'go', z2, VALUE, 'go')
line(xlim(), [VALUE VALUE], 'Color',[0.4 0.4 0.4], 'LineStyle',':')
hold off
Are the step sizes in your data series the same?
Is the governing equation assumed to be cubic, sinusoidal, etc.?
doc interp1
Find the zero crossings

Need some help calculating percentile

An rpc server is given which receives millions of requests a day. Each request i takes processing time Ti to get processed. We want to find the 65th percentile processing time (when processing times are sorted according to their values in increasing order) at any moment. We cannot store processing times of all the requests of the past as the number of requests is very large. And so the answer need not be exact 65th percentile, you can give some approximate answer i.e. processing time which will be around the exact 65th percentile number.
Hint: Its something to do how a histogram (i.e. an overview) is stored for a very large data without storing all of data.
Take one day's data. Use it to figure out what size to make your buckets. Say one day's data shows that the vast majority (95%?) of your data is within 0.5 seconds of 1 second (ridiculous values, but hang in there).
To get the 65th percentile, you'll want at least 20 buckets in that range, but be generous, and make it 80. So you divide your 1 second window (from -0.5 seconds to +0.5 seconds around the center) into 80 buckets by making each 1/80th of a second wide.
Each bucket is 1/80th of 1 second. Bucket 0 runs from (center - deviation) = (1 - 0.5) = 0.5 to 0.5 + 1/80 seconds. Bucket 1 runs from 0.5 + 1/80 to 0.5 + 2/80 seconds, and so on.
For every value, find out which bucket it falls in, and increment a counter for that bucket.
To find 65th percentile, get the total count, and walk the buckets from zero until you get to 65% of that total.
Whenever you want to reset, set the counters all to zero.
If you always want to have good data available, keep two of these, and alternate resetting them, using the one you reset least recently as having more useful data.
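The bucket scheme described above could be sketched in Python like this. It is a minimal illustration under assumptions: the class name BucketPercentile and its methods are hypothetical, values outside the chosen range are clamped into the first or last bucket, and the reported percentile is the upper edge of the bucket where the running count reaches the target.

```python
class BucketPercentile:
    """Fixed-width histogram for approximate percentiles of a stream."""

    def __init__(self, low, high, buckets):
        self.low = low
        self.width = (high - low) / buckets
        self.counts = [0] * buckets

    def add(self, value):
        # Find the bucket index; clamp out-of-range values to the end buckets.
        i = int((value - self.low) / self.width)
        i = min(max(i, 0), len(self.counts) - 1)
        self.counts[i] += 1

    def percentile(self, pct):
        """Upper edge of the bucket where the running count reaches pct percent."""
        target = pct * sum(self.counts) / 100
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.low + (i + 1) * self.width
        return self.low + len(self.counts) * self.width

    def reset(self):
        self.counts = [0] * len(self.counts)

# Hypothetical stream: 1000 processing times spread evenly over [0.5, 1.5).
hist = BucketPercentile(0.5, 1.5, 80)
for k in range(1000):
    hist.add(0.5 + k / 1000.0)
p65 = hist.percentile(65)
```

The answer can be off by up to one bucket width, which is the price of storing only 80 counters instead of every processing time.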
Use an updown filter:
if q < x:
    q += .01 * (x - q)  # up a little
else:
    q += .005 * (x - q)  # down a little
Here a quantile estimator q tracks the x stream,
moving a little towards each x.
If both factors were .01, it would move up as often as down,
tracking the 50 th percentile.
With .01 up, .005 down, it floats up, 67 th percentile;
in general, it tracks the up / (up + down) th percentile.
Bigger up/down factors track faster but noisier --
you'll have to experiment on your real data.
(I have no idea how to analyze updowns, would appreciate a link.)
The updown() below works on long vectors X, Q in order to plot them:
#!/usr/bin/env python3
import sys
import numpy as np
import pylab as pl

def updown( X, Q, up=.01, down=.01 ):
    """ updown filter: running ~ up / (up + down) th percentile
        here vecs X in, Q out to plot
    """
    q = X[0]
    for j, x in np.ndenumerate(X):
        if q < x:
            q += up * (x - q)  # up a little
        else:
            q += down * (x - q)  # down a little
        Q[j] = q
    return q

#...............................................................................
if __name__ == "__main__":
    N = 1000
    up = .01
    down = .005
    plot = 0
    seed = 1
    # override the defaults above from the command line:
    exec( "\n".join( sys.argv[1:] ))  # python this.py N= up= down=
    np.random.seed(seed)
    np.set_printoptions( 2, threshold=100, suppress=True )  # .2f

    title = "updown random.exponential: N %d up %.2g down %.2g" % (N, up, down)
    print( title )
    X = np.random.exponential( size=N )
    Q = np.zeros(N)
    updown( X, Q, up=up, down=down )
    # M = np.zeros(N)
    # updown( X, M, up=up, down=up )
    print( "last 10 Q:", Q[-10:] )
    if plot:
        fig = pl.figure( figsize=(8,3) )
        pl.title(title)
        x = np.arange(N)
        pl.plot( x, X, "," )
        pl.plot( x, Q )
        pl.ylim( 0, 2 )
        png = "updown.png"
        print( "writing", png, file=sys.stderr )
        pl.savefig( png )
        pl.show()
An easier way to get the value that represents a given percentile of a list or array is the scoreatpercentile function in the scipy.stats module.
>>> import scipy.stats as ss
>>> ss.scoreatpercentile(v, 65)
There's a sibling, percentileofscore, to return the percentile given the value.
You will need to store a running sum and a total count.
Then check out standard deviation calculations.
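One possible reading of this hint, sketched in Python: keep running statistics instead of the raw data, then estimate the percentile from the mean and standard deviation. Two things here are assumptions, not part of the answer: the sketch swaps in Welford's online algorithm in place of a plain running sum (it is numerically steadier), and it assumes the processing times are roughly normally distributed, which is a strong assumption for latencies. The class name RunningStats is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

class RunningStats:
    """Welford's online algorithm: mean and variance without storing the data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until at least two values are seen.
        return sqrt(self.m2 / (self.count - 1)) if self.count > 1 else 0.0

    def percentile(self, pct):
        # ASSUMPTION: the data are approximately normally distributed.
        z = NormalDist().inv_cdf(pct / 100.0)
        return self.mean + z * self.std()

# Hypothetical processing times in seconds.
stats = RunningStats()
for t in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.add(t)
p65 = stats.percentile(65)
```

If the real distribution is skewed, as request latencies often are, a histogram or updown filter as described in the other answers is likely to track the 65th percentile better.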
