While I appreciate that this question is math-heavy, the answer will be helpful to anyone who is dealing with MongoDB's $bucket operator (or its SQL analogues) and building cluster/heatmap chart data.
Long Description of the Problem:
I have an array of unique/distinct values of prices from my DB (it's always an array of numbers, with 0.01 precision).
As you can see, most of its values are between ~8 and 40 (in this particular case).
[
7.9, 7.98, 7.99, 8.05, 8.15, 8.25, 8.3, 8.34, 8.35, 8.39,
8.4, 8.49, 8.5, 8.66, 8.9, 8.97, 8.98, 8.99, 9, 9.1,
9.15, 9.2, 9.28, 9.3, 9.31, 9.32, 9.4, 9.46, 9.49, 9.5,
9.51, 9.69, 9.7, 9.9, 9.98, 9.99, 10, 10.2, 10.21, 10.22,
10.23, 10.24, 10.25, 10.27, 10.29, 10.49, 10.51, 10.52, 10.53, 10.54,
10.55, 10.77, 10.78, 10.98, 10.99, 11, 11.26, 11.27, 11.47, 11.48,
11.49, 11.79, 11.85, 11.9, 11.99, 12, 12.49, 12.77, 12.8, 12.86,
12.87, 12.88, 12.89, 12.9, 12.98, 13, 13.01, 13.49, 13.77, 13.91,
13.98, 13.99, 14, 14.06, 14.16, 14.18, 14.19, 14.2, 14.5, 14.53,
14.54, 14.55, 14.81, 14.88, 14.9, 14.98, 14.99, 15, 15.28, 15.78,
15.79, 15.8, 15.81, 15.83, 15.84, 15.9, 15.92, 15.93, 15.96, 16,
16.5, 17, 17.57, 17.58, 17.59, 17.6, 17.88, 17.89, 17.9, 17.93,
17.94, 17.97, 17.99, 18, 18.76, 18.77, 18.78, 18.99, 19.29, 19.38,
19.78, 19.9, 19.98, 19.99, 20, 20.15, 20.31, 20.35, 20.38, 20.39,
20.44, 20.45, 20.49, 20.5, 20.69, 20.7, 20.77, 20.78, 20.79, 20.8,
20.9, 20.91, 20.92, 20.93, 20.94, 20.95, 20.96, 20.99, 21, 21.01,
21.75, 21.98, 21.99, 22, 22.45, 22.79, 22.96, 22.97, 22.98, 22.99,
23, 23.49, 23.78, 23.79, 23.8, 23.81, 23.9, 23.94, 23.95, 23.96,
23.97, 23.98, 23.99, 24, 24.49, 24.5, 24.63, 24.79, 24.8, 24.89,
24.9, 24.96, 24.97, 24.98, 24.99, 25, 25.51, 25.55, 25.88, 25.89,
25.9, 25.96, 25.97, 25.99, 26, 26.99, 27, 27.55, 28, 28.8,
28.89, 28.9, 28.99, 29, 29.09, 30, 31.91, 31.92, 31.93, 33.4,
33.5, 33.6, 34.6, 34.7, 34.79, 34.8, 35, 38.99, 39.57, 39.99,
40, 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000
]
The problem itself, or: how to clear the tail of a (non-)normal distribution of its non-normal elements
I need to find the irrelevant values in this array, a kind of «dirty tail», and remove them. Actually, I don't even need to remove them from the array; the real goal is to find the last relevant number and define it as a cap value, so that I get a range between a floor (min relevant) and a cap (max relevant), like:
floor value => 8
cap value => 40
What am I talking about?
For example, for the array above: these would be all the values after 40 (or maybe even 60), like 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000.
I consider all of them non-normal.
What will be counted as an answer?
S tier. Clear/optimal code (language doesn't matter, but JavaScript is preferred) or a formula (if math has one) that solves the problem in a small, non-resource-hungry amount of time. It would be perfect if I didn't even need to check every element in the array, or could skip some of them, e.g. by starting from the peak / most frequent value in the array.
A tier. Your own experience or a code attempt with any relevant results, or an improvement of the current formula for better performance.
B tier. Something useful: a blog article or a Google link. The main requirement is that it makes sense. Non-obvious solutions are welcome, even if your code is terribly formatted, and so on.
TL;DR: Visual clarification
By which criteria, and how, should I «target the tail» / remove the non-relevant elements from the array, i.e. the x values that rise dramatically and occur rarely?
The given data set has some huge outliers, which make it somewhat hard to analyze using standard statistical methods (if it were better behaved, I would recommend fitting several candidate distributions to it and finding out which fits best - log normal distribution, beta distribution, gamma distribution, etc).
The problem of determining which outliers to ignore can be solved in general through simpler but less rigorous methods; one method is to compare the values of the data at various percentiles and throw away the ones where the differences become "too high" (for a suitably chosen value of "too high").
For example, here are the last few entries if we go up by two percentile slots; the delta column gives the difference between the previous percentile and this one.
Here, you can see that the difference with the previous entry increases by almost 2 once we hit 87, and goes up (mostly) from there. To use a "nice" number, let's make the cut-off the 85th percentile, and ignore all values above that.
Given the sorted list above in an array named data, we ignore any index above
Math.floor(data.length*85/100)
The analysis above can be repeated in code if it should change dynamically (or to call attention to deviations where 85 is not the right value), but I leave this as an exercise for the reader.
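That said, here is a minimal sketch to start from (assuming data is the sorted array from the question; the starting percentile, the 2-slot step, and the "too high" multiplier are all arbitrary knobs you would tune for your data):

/** Sketch: find the percentile where the percentile-to-percentile delta "explodes". */
function findCapPercentile(data, step = 2, tooHigh = 2) {
  const valueAt = p => data[Math.min(data.length - 1, Math.floor(data.length * p / 100))];
  let prevDelta = null;
  for (let p = 50 + step; p <= 100; p += step) {
    const delta = valueAt(p) - valueAt(p - step);
    if (prevDelta !== null && prevDelta > 0 && delta > prevDelta * tooHigh) {
      return p - step; // last percentile before the suspicious jump
    }
    prevDelta = delta;
  }
  return 100; // no suspicious jump found
}
/** Usage: const cap = data[Math.floor(data.length * findCapPercentile(data) / 100)]; */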
This is version 2 of the code, and the exact version that is running in production at the moment. It covers 80%+ of the cases, but there is still a bottleneck.
/** priceRangeArray ALWAYS SORTED ASC */
let priceRangeArray = [1,2,3...]
/** Resulting array */
let priceArray = []
/** Control variables */
let prev_sV = 0
let sV = 0
/** Array length is always more than 3 elements */
const L = priceRangeArray.length;
/** Sample Variance algorithm */
for (let i = 2; i < L-1; i++) {
/**
 * We skip the first two values, because the 1st sV could be too low;
 * sV becomes the previous sV
 */
if (prev_sV === 0) {
/** prev_sV of 2nd element */
prev_sV = ( 1 / L * (Math.pow(priceRangeArray[1],2))) - (Math.pow((1 / L * priceRangeArray[1]),2));
} else {
prev_sV = sV
}
/**
* sample variance, right?
* 1 / L * (el ^ 2) - ( 1 / L * el) ^ 2
* @type {number}
*/
sV = ( 1 / L * (Math.pow(priceRangeArray[i],2))) - (Math.pow((1 / L * priceRangeArray[i]),2));
/** User-defined, 1.1 is a control constant */
if (prev_sV * 1.1 < sV) {
break;
}
/** Control passed values to new array */
priceArray.push(priceRangeArray[i]);
}
console.log(priceArray)
It is based on Wikipedia's Variance article. The logic is quite simple: since I can't remove the beginning (the first 2 values, even if they are too low), I start the for loop from the 3rd element of the array and check every next one with my control formula (something with sqrt/pow of the current and previous values).
The first version of this code had linear logic and simply compared the previous value with the current one by one of these simple principles (see the sketch after this example):
If the current value is more than twice (xN) the previous one, then break.
If the current value is more than the previous one by 10%, then break.
The real problem is that it doesn't work well with the beginning or with small values, in arrays like: [1, 2, 3, 4, 13, 14, 16, 22, 100, 500000],
where, as you can see, the cap value could be determined as 4 instead of 22 or 100.
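For illustration only, here is a rough sketch of that version-1 principle (not my original code); with the «twice the previous value» rule it reproduces exactly the failure described above:

/** Sketch: stop as soon as the next sorted value jumps by more than `factor` times the previous one */
function capByRelativeJump(sorted, factor) {
  let cap = sorted[0];
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i] > sorted[i - 1] * factor) break;
    cap = sorted[i];
  }
  return cap;
}
capByRelativeJump([1, 2, 3, 4, 13, 14, 16, 22, 100, 500000], 2); // => 4, because 13 > 4 * 2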
I also found other code that helps me in production, and as of now the current working version combines the best practices from my previous answer and James McLead's:
priceRange(
quotes: number[],
blocks: number,
): number[] {
if (!quotes.length) return [];
const length = quotes.length > 3 ? quotes.length - 3 : quotes.length;
const start = length === 1 ? 0 : 1;
const cap = Math.round(quotes[Math.floor(length * 0.9)]);
const floor = Math.round(quotes[start]);
const price_range = cap - floor;
/** Step represents 2.5% for each cluster */
const tick = price_range / blocks;
return Array(Math.ceil((cap + tick - floor) / tick))
.fill(floor)
.map((x, y) => parseFloat((x + y * tick).toFixed(4)));
}
For an array like this:
[1, 20, ..., 40, 432, 567, 345346]
floor value will be determined as: 20,
cap as ~40, step ~0.5 and result will be:
[20, 20.5, 21, ... 39.5, 40]
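For clarity, a hypothetical usage example (assuming priceRange is exposed as a standalone function and quotes is already sorted ascending; 40 blocks is just the value that makes each tick 2.5% of the range):

// Hypothetical call; quotes must already be sorted ascending
const quotes = [1, 20, 20.5, 21, /* ... */ 39.5, 40, 432, 567, 345346];
const clusters = priceRange(quotes, 40);
// clusters ~= [20, 20.5, 21, ..., 39.5, 40]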
I am writing a numpy based .PLY importer. I am only interested in binary files, and vertices, faces and vertex colors. My target data format is a flattened list of x,y,z floats for the vertex data and r,g,b,a integers for the color data.
[x0,y0,z0,x1,y1,z1....xn,yn,zn]
[r0,g0,b0,a0,r1,g1,b1,a1....rn,gn,bn,an]
This allows me to use fast built-in C++ methods to construct the mesh in the target program (Blender).
I am using a modified version of this code (example) to read the data into numpy arrays:
valid_formats = {'binary_big_endian': '>','binary_little_endian': '<'}
ply = open(filename, 'rb')
# get binary_little/big or ascii
fmt = ply.readline().split()[1].decode()
# get the byte-order prefix ('<' or '>') for building the numpy dtypes
ext = valid_formats[fmt]
# end_header = (byte offset of the end of the header, previously parsed)
ply.seek(end_header)
#v_dtype = [('x','<f4'),('y','<f4'), ('z','<f4'), ('red','<u1'), ('green','<u1'), ('blue','<u1'),('alpha','<u1')]
#points_size = (previously read in from header)
points_np = np.fromfile(ply, dtype=v_dtype, count=points_size)
The results being
print(points_np.shape)
print(points_np[0:3])
print(points_np.ravel()[0:3])
>>>(158561,)
>>>[ (20.781816482543945, 11.767952919006348, 15.565438270568848, 206, 216, 186, 255)
(20.679922103881836, 11.754084587097168, 15.560364723205566, 189, 196, 157, 255)
(20.72969627380371, 11.823691368103027, 15.51106071472168, 192, 193, 157, 255)]
>>>[ (20.781816482543945, 11.767952919006348, 15.565438270568848, 206, 216, 186, 255)
(20.679922103881836, 11.754084587097168, 15.560364723205566, 189, 196, 157, 255)
(20.72969627380371, 11.823691368103027, 15.51106071472168, 192, 193, 157, 255)]
So the ravel (I've also tried flatten, reshape, etc.) does not work, and I presume it is because the data types are (float, float, float, int, int, int).
What I have tried
- I've tried vectorizing a function that just pulls out the xyz and rgb separately into a new array.
- I've tried stack, vstack, etc.
- List comprehension (yuck).
- Things like these take 1 to 10s of seconds to execute, compared to hundredths of a second to read in the data.
- I have tried using astype on the verts data, but that seems to return only the first element.
convert to structured array
accessing first element of each element
Most efficient way to map function over numpy array
What I want to Try/Would Like to Know
Is there a better way to read the data in in the first place, so I don't lose all this time reshaping, flattening, etc.? Perhaps by telling np.fromfile to skip over the color data on one pass and then coming back and reading it again?
Is there a numpy trick I don't know for reshaping/flattening data of this kind?
I am developing an AI algorithm for a robot that needs to follow a line. The floor will be black, with a white line, and there will be different marks that determine different types of "obstacles". I'm using a sensor that gives me an array of 8 measurements of the floor, as seen in Figure 1: 8 readings from 0 to 1000, where 0 means no white and 1000 means totally white. The examples below show a measurement of a white line in the middle of the sensor array, and other cases.
int array[] = {50, 24, 9, 960, 1000, 150, 50, 45} // white line in the middle
int array2[] = {50, 24, 9, 960, 1000, 150, 50, 960} // white line in the middle and a square box on the right
int array3[] = {1000, 24, 9, 960, 1000, 150, 50, 40} // white line in the middle and a square box on the left
int array4[] = {1000, 980, 950, 0, 10, 980, 1000, 960} // black square box in the middle
Which algorithms could I use to detect the patterns in the images below, given this array of measurements? I do not want to use several "hardcoded" conditionals as templates, as I think that will not scale well. I'm thinking of implementing a "peak counter" algorithm, but I do not know if it will be robust enough.
In the figures we can see the different cases; the cases I want to detect are the ones with the red circle.
Thanks!
How about doing something simple, like treating each measurement as an N-dimensional vector? In your case N = 8. Then all your measurements are contained in a hypercube with sides of length up to 1000. For N = 8 there will be 256 corners. For each of your cases of interest, associate the hypercube corners that best match up to it. Note that some corners may not get associated. Then, for each measurement, find its nearest associated hypercube corner. This tells you which case it is. You can mitigate errors by implementing some checks. For example, if the measurement is close to multiple corners (within some uncertainty threshold), then you label the measurement as ambiguous and skip it.
It's easier to see this for the case of 3 measurements. The 8 corners of the cube could represent
[0,0,0] = no white
[0,0,1] = white on right
[0,1,0] = white in middle
[0,1,1] = white in middle and right
[1,0,0] = white on left
[1,0,1] = white on left and right
[1,1,0] = white on left and middle
[1,1,1] = all white
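A minimal sketch of this nearest-corner idea for N = 8 (the labelled corners below are just the four example arrays from the question turned into 0/1 patterns and scaled to the 0..1000 sensor range, and the ambiguity threshold is an arbitrary value to tune):

// Sketch: classify a reading by its nearest labelled hypercube corner (N = 8).
const LABELED_CORNERS = [
  { label: 'white line in the middle',             corner: [0, 0, 0, 1, 1, 0, 0, 0] },
  { label: 'white line in the middle + box right', corner: [0, 0, 0, 1, 1, 0, 0, 1] },
  { label: 'white line in the middle + box left',  corner: [1, 0, 0, 1, 1, 0, 0, 0] },
  { label: 'black square box in the middle',       corner: [1, 1, 1, 0, 0, 1, 1, 1] },
];

function classify(measurement, ambiguityGap = 200) {
  // Euclidean distance between the reading and a corner scaled to 0..1000
  const dist = (m, c) => Math.sqrt(m.reduce((s, v, i) => s + (v - c[i] * 1000) ** 2, 0));
  const ranked = LABELED_CORNERS
    .map(({ label, corner }) => ({ label, d: dist(measurement, corner) }))
    .sort((a, b) => a.d - b.d);
  // If two corners are almost equally close, call the reading ambiguous and skip it.
  if (ranked.length > 1 && ranked[1].d - ranked[0].d < ambiguityGap) return 'ambiguous';
  return ranked[0].label;
}

classify([50, 24, 9, 960, 1000, 150, 50, 45]); // => 'white line in the middle'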
The case shown below is an ambiguous measurement in the middle.
(source: ctralie.com)