While I appreciate that this question is math-heavy, the real answer will be helpful to everyone dealing with MongoDB's $bucket operator (or its SQL analogues) and building cluster/heatmap chart data.
Long Description of the Problem:
I have an array of unique/distinct price values from my DB (it is always an array of numbers, with 0.01 precision).
As you can see, most of its values lie between ~8 and 40 (in this particular case).
[
7.9, 7.98, 7.99, 8.05, 8.15, 8.25, 8.3, 8.34, 8.35, 8.39,
8.4, 8.49, 8.5, 8.66, 8.9, 8.97, 8.98, 8.99, 9, 9.1,
9.15, 9.2, 9.28, 9.3, 9.31, 9.32, 9.4, 9.46, 9.49, 9.5,
9.51, 9.69, 9.7, 9.9, 9.98, 9.99, 10, 10.2, 10.21, 10.22,
10.23, 10.24, 10.25, 10.27, 10.29, 10.49, 10.51, 10.52, 10.53, 10.54,
10.55, 10.77, 10.78, 10.98, 10.99, 11, 11.26, 11.27, 11.47, 11.48,
11.49, 11.79, 11.85, 11.9, 11.99, 12, 12.49, 12.77, 12.8, 12.86,
12.87, 12.88, 12.89, 12.9, 12.98, 13, 13.01, 13.49, 13.77, 13.91,
13.98, 13.99, 14, 14.06, 14.16, 14.18, 14.19, 14.2, 14.5, 14.53,
14.54, 14.55, 14.81, 14.88, 14.9, 14.98, 14.99, 15, 15.28, 15.78,
15.79, 15.8, 15.81, 15.83, 15.84, 15.9, 15.92, 15.93, 15.96, 16,
16.5, 17, 17.57, 17.58, 17.59, 17.6, 17.88, 17.89, 17.9, 17.93,
17.94, 17.97, 17.99, 18, 18.76, 18.77, 18.78, 18.99, 19.29, 19.38,
19.78, 19.9, 19.98, 19.99, 20, 20.15, 20.31, 20.35, 20.38, 20.39,
20.44, 20.45, 20.49, 20.5, 20.69, 20.7, 20.77, 20.78, 20.79, 20.8,
20.9, 20.91, 20.92, 20.93, 20.94, 20.95, 20.96, 20.99, 21, 21.01,
21.75, 21.98, 21.99, 22, 22.45, 22.79, 22.96, 22.97, 22.98, 22.99,
23, 23.49, 23.78, 23.79, 23.8, 23.81, 23.9, 23.94, 23.95, 23.96,
23.97, 23.98, 23.99, 24, 24.49, 24.5, 24.63, 24.79, 24.8, 24.89,
24.9, 24.96, 24.97, 24.98, 24.99, 25, 25.51, 25.55, 25.88, 25.89,
25.9, 25.96, 25.97, 25.99, 26, 26.99, 27, 27.55, 28, 28.8,
28.89, 28.9, 28.99, 29, 29.09, 30, 31.91, 31.92, 31.93, 33.4,
33.5, 33.6, 34.6, 34.7, 34.79, 34.8, 35, 38.99, 39.57, 39.99,
40, 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000
]
The problem itself, or: how to clear the tail of a (non-)normal distribution of its non-normal elements
I need to find the irrelevant values in this array, a kind of «dirty tail», and remove them. Actually, I don't even need to remove them from the array; the real task is to find the last relevant number and define it as the cap value, so I get a range between a floor (min relevant) and a cap (max relevant), like:
floor value => 8
cap value => 40
What am I talking about?
For example, for the array above it will be all values after 40 (or maybe even 60): 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000.
I consider all of these to be non-normal.
What will be counted as an answer?
S tier. Clean/optimal code (language doesn't matter, but JavaScript is preferred) or a formula (if math has one) that solves the problem in a small, non-resource-hungry amount of time. It would be perfect if I didn't even need to check every element of the array, or could skip some of them, e.g. by starting from the peak / most popular value in the array.
A tier. Your own experience or a code attempt with any relevant results, or an improvement of the current formula with better performance.
B tier. Something useful: a blog article or a Google link. The main requirement is that it makes sense. Non-obvious solutions are welcome, even if your code is terribly formatted and so on.
TL;DR: VISUAL CLARIFICATION
By which criteria, and how, should I «target the tail», i.e. remove the non-relevant (dramatically rising and rarely occurring) elements from the array?
The given data set has some huge outliers, which make it somewhat hard to analyze using standard statistical methods (if it were better behaved, I would recommend fitting several candidate distributions to it and finding out which fits best - log normal distribution, beta distribution, gamma distribution, etc).
The problem of determining which outliers to ignore can be solved in general through simpler but less rigorous methods; one method is to compare the values of the data at various percentiles and throw away the ones where the differences become "too high" (for a suitably chosen value of "too high").
For example, take the last few entries as we move up two percentile slots at a time, with a delta column giving the difference between the previous percentile's value and this one's.
There, you can see that the difference from the previous entry increases by almost 2 once we hit the 87th percentile, and (mostly) keeps growing from there. To use a "nice" number, let's make the cut-off the 85th percentile, and ignore all values above that.
Given the sorted list above in an array named data, we ignore any index above
Math.floor(data.length*85/100)
The analysis above can be repeated in code if it should change dynamically (or to call attention to deviations where 85 is not the right value), but I leave this as an exercise for the reader.
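For reference, here is a minimal JavaScript sketch of that exercise (findCapPercentile is a hypothetical helper; the 2-slot step and the ×2 jump factor are assumptions to tune against your data):

/** Walk the percentiles of an ASC-sorted array and return the percentile
    just before the inter-percentile delta first jumps by `jumpFactor`. */
function findCapPercentile(data, step = 2, jumpFactor = 2) {
  const valueAt = (p) => data[Math.floor((data.length - 1) * p / 100)];
  let prevDelta = 0;
  for (let p = step; p <= 100; p += step) {
    const delta = valueAt(p) - valueAt(p - step);
    if (prevDelta > 0 && delta > prevDelta * jumpFactor) return p - step;
    prevDelta = delta;
  }
  return 100; // no jump found: keep everything
}
// usage: const cutoff = findCapPercentile(sortedPrices);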
This is version 2 of the code, the exact version that is running in production at the moment. It covers 80%+ of the problem cases, but there is still a bottleneck.
/** priceRangeArray is ALWAYS SORTED ASC */
let priceRangeArray = [1, 2, 3 /* ... */];
/** Resulting array */
let priceArray = [];
/** Control variables */
let prev_sV = 0;
let sV = 0;
/** The array is always longer than 3 elements */
const L = priceRangeArray.length;
/** Sample variance algorithm */
for (let i = 2; i < L - 1; i++) {
  /**
   * We skip the first two values, because the 1st sV could be too low.
   * sV becomes the previous sV.
   */
  if (prev_sV === 0) {
    /** prev_sV of the 2nd element */
    prev_sV = (1 / L) * Math.pow(priceRangeArray[1], 2) - Math.pow((1 / L) * priceRangeArray[1], 2);
  } else {
    prev_sV = sV;
  }
  /**
   * Sample variance, right?
   * 1/L * el^2 - (1/L * el)^2
   * @type {number}
   */
  sV = (1 / L) * Math.pow(priceRangeArray[i], 2) - Math.pow((1 / L) * priceRangeArray[i], 2);
  /** User-defined; 1.1 is a control constant */
  if (prev_sV * 1.1 < sV) {
    break;
  }
  /** Values that pass the check go to the resulting array */
  priceArray.push(priceRangeArray[i]);
}
console.log(priceArray);
It is based on Wikipedia's Variance article. The logic is quite simple: since I can't remove the beginning (the first 2 values, even if they are too low), I start the for loop from the 3rd element of the array and check every next one with my control formula (something with sqrt/pow² of the current and previous values).
The first version of this code had linear logic and simply compared the current value with the previous one, breaking by one of these simple principles:
If the current value is more than twice (×N) the previous one, then break.
If the current value is more than 10% above the previous one, then break.
The real problem is that it doesn't work well with the beginning or with small values, in arrays like: [1, 2, 3, 4, 13, 14, 16, 22, 100, 500000],
where, as you may see, the cap value could be determined as 4 instead of 22 or 100 (the sketch below illustrates this).
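To illustrate, here is a minimal sketch of that version-1 linear logic (findCapLinear is a hypothetical name; the ×2 threshold is the first principle above):

/** Walk the ASC-sorted array and stop at the first jump above `growth` */
function findCapLinear(sorted, growth = 2) {
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i] > sorted[i - 1] * growth) return sorted[i - 1];
  }
  return sorted[sorted.length - 1];
}
// findCapLinear([1, 2, 3, 4, 13, 14, 16, 22, 100, 500000]) returns 4
// instead of 22 or 100: the small-values failure described above.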
I also found other code that helps me in production; as of now, the current working version combines the best practices from my previous answer and James McLead's:
priceRange(
quotes: number[],
blocks: number,
): number[] {
if (!quotes.length) return [];
const length = quotes.length > 3 ? quotes.length - 3 : quotes.length;
const start = length === 1 ? 0 : 1;
const cap = Math.round(quotes[Math.floor(length * 0.9)]);
const floor = Math.round(quotes[start]);
const price_range = cap - floor;
/** The step represents 2.5% of the range for each cluster */
const tick = price_range / blocks;
return Array(Math.ceil((cap + tick - floor) / tick))
.fill(floor)
.map((x, y) => parseFloat((x + y * tick).toFixed(4)));
}
For an array like this:
[1, 20, ..., 40, 432, 567, 345346]
the floor value will be determined as 20,
the cap as ~40, the step as ~0.5, and the result will be:
[20, 20.5, 21, ... 39.5, 40]
I have 25 values from a sensor between 0 and 40.
Most values will be around the true real value but some can be far away.
How can I get a reliable measure based on the values?
Mean will not work as lower or higher values will blow up the average.
Let's suppose we have these values:
10, 13, 22, 21, 19, 18, 20, 21, 22, 21, 19, 30, 30, 21, 20, 21, 22, 19, 18, 20, 22, 21, 10, 18, 19
I think the best approach would be to use a histogram. Defining manual ranges will fail, as we may reject some good values next to the range bounds.
Automatic ranges calculated from the incoming data won't work either, because they can overlap each other.
I am coding on an Arduino, so heavy memory allocation would not be the best idea.
Maybe a moving average would be a nice choice: define a length, say d; then at any instant your filtered value is the average of the previous d sensor values.
If you are really concerned about deleting some strange values, you may set a threshold in addition to your moving average, but I don't recommend it; I guess you have to observe the real pattern your sensor is following.
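A minimal sketch of such a filter (in JavaScript for illustration; makeMovingAverage is a hypothetical name, and d = 5 is an arbitrary window length):

/** Returns a filter that averages the last `d` readings */
function makeMovingAverage(d) {
  const buf = []; // holds the most recent `d` readings
  let sum = 0;
  return (value) => {
    buf.push(value);
    sum += value;
    if (buf.length > d) sum -= buf.shift(); // drop the oldest reading
    return sum / buf.length;
  };
}
const smooth = makeMovingAverage(5);
// smooth(reading) -> average of the last 5 readings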
P.S.: My answer may be opinion-based, but this is the way I deal with sensors [in my work], and it has been working perfectly for the company up to now.
Maybe calculating the Trimmed Mean of these values could help you, or just
the median.
How to calculate the truncated or trimmed mean?
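For reference, a minimal trimmed-mean sketch (the 10% trim fraction is an assumption to tune):

/** Sort, drop `trim` of the values from each end, and average the rest */
function trimmedMean(values, trim = 0.1) {
  const sorted = [...values].sort((a, b) => a - b);
  const cut = Math.floor(sorted.length * trim);
  const kept = sorted.slice(cut, sorted.length - cut);
  return kept.reduce((sum, v) => sum + v, 0) / kept.length;
}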
The filter I ended up using is a multi-pass mean.
I compute the mean of all values, then discard the values that are more than a threshold away from that first mean, and finally compute the mean of the remaining values.
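A sketch of that idea (multiPassMean is a hypothetical name; the threshold is whatever fits your sensor):

/** Mean, then mean again over only the values close to the first mean */
function multiPassMean(values, threshold) {
  const mean1 = values.reduce((s, v) => s + v, 0) / values.length;
  const kept = values.filter((v) => Math.abs(v - mean1) <= threshold);
  return kept.length ? kept.reduce((s, v) => s + v, 0) / kept.length : mean1;
}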
If you're actually after an average, this won't work for you, but this method requires no extra memory to manage a buffer of samples:
void loop() {
  // Seeded once with the first reading, then updated exponentially.
  static float mean = analogRead(A0);
  int newInput = analogRead(A0);
  // Exponentially weighted moving average: the old mean decays by 0.95
  // while each new input mixes in with a weight of 0.05.
  mean = (mean * 0.95) + (newInput * 0.05);
  Serial.println(mean);
}
You can adjust the constants (0.95 and 0.05) to whatever you like, as long as they add up to 1. The smaller the multiplier of mean, the faster mean will track new values.
If you don't like the overhead of floating point math, the same idea works quite well in fixed point.
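For example, a sketch of the same filter in fixed point (JavaScript for illustration; the 243/256 and 13/256 weights approximate 0.95 and 0.05):

let meanFP = 0; // running mean scaled by 256; seed it with firstReading << 8
function updateFixedPoint(newInput) {
  // mean = mean * (243/256) + input * (13/256), integer arithmetic only
  meanFP = ((meanFP * 243) >> 8) + newInput * 13;
  return meanFP >> 8; // back to the unscaled value
}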
I have an array of n pairwise different elements and a number k with 1<=k<=n.
Now I am looking for an algorithm that calculates the k numbers with the minimal absolute difference to the median of the array. I need linear complexity (O(n)).
My approach:
I find the median:
I sort the numbers.
I get the middle element, or, if the number of elements is even, the average of the two middle elements, rounded.
After that:
I take every number and find its absolute distance from the median. These results I save in a different array.
I sort the newly obtained array.
I take the first k elements of the result array and I'm done.
I don't know if my solution is in O(n), or whether my idea is even right. Can someone verify that? Can someone show me how to solve it in O(n)?
You can solve your problem like this:
You can find the median in O(n), e.g. using the O(n) nth_element algorithm.
You loop through all elements, substituting each with a pair: <the absolute difference to the median>, <element's value>. Once more you apply nth_element, this time with n = k. After applying this algorithm you are guaranteed to have the k elements with the smallest absolute difference first in the new array. You take their indices and you are DONE!
Your algorithm, on the other hand, uses sorting, and this makes it O(n log n).
EDIT: The requested example:
Let the array be [14, 6, 7, 8, 10, 13, 21, 16, 23].
After the step of finding the median it will be reordered to, say: [8, 7, 6, 10, 13, 16, 23, 14, 21]; notice that the array is not sorted, but the median (13) is still exactly in the middle.
Now let's do the pair substitution that got you confused: we create a new array of pairs: [<abs(14-13), 14>, <abs(6-13), 6>, <abs(7-13), 7>, <abs(8-13), 8>, <abs(10-13), 10>, <abs(13-13), 13>, <abs(21-13), 21>, <abs(16-13), 16>, <abs(23-13), 23>]. Thus we obtain the array: [<1, 14>, <7, 6>, <6, 7>, <5, 8>, <3, 10>, <0, 13>, <8, 21>, <3, 16>, <10, 23>].
If, e.g., k is 4, we apply nth_element once more (using the first element of each pair for comparisons) and obtain: [<1, 14>, <3, 16>, <0, 13>, <3, 10>, <8, 21>, <7, 6>, <10, 23>, <6, 7>, <5, 8>], so the numbers you are searching for are the second elements of the first 4 pairs: 14, 16, 13 and 10.
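For completeness, here is a JavaScript sketch of the whole approach, with quickSelect (the classic Wirth/Hoare selection) standing in for C++'s nth_element:

/** Partially reorder `arr` so arr[k] is the k-th smallest by `key`,
    with no larger element before it: average O(n). */
function quickSelect(arr, k, key = (x) => x) {
  let lo = 0, hi = arr.length - 1;
  while (lo < hi) {
    const pivot = key(arr[k]);
    let i = lo, j = hi;
    while (i <= j) {
      while (key(arr[i]) < pivot) i++;
      while (key(arr[j]) > pivot) j--;
      if (i <= j) { [arr[i], arr[j]] = [arr[j], arr[i]]; i++; j--; }
    }
    if (j < k) lo = i;
    if (k < i) hi = j;
  }
}

function kClosestToMedian(numbers, k) {
  const arr = numbers.slice();
  const mid = Math.floor(arr.length / 2);
  quickSelect(arr, mid); // the median is now at index `mid`
  const median = arr[mid];
  // Pair each value with its absolute distance to the median,
  // then select the k smallest distances: O(n) on average overall.
  const pairs = arr.map((v) => [Math.abs(v - median), v]);
  quickSelect(pairs, k - 1, (p) => p[0]);
  return pairs.slice(0, k).map((p) => p[1]);
}

console.log(kClosestToMedian([14, 6, 7, 8, 10, 13, 21, 16, 23], 4));
// -> 13, 14, 10 and 16, in some order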