Filter data using histogram in C

I have 25 values from a sensor, ranging between 0 and 40.
Most values will be close to the true value, but some can be far off.
How can I get a reliable measurement based on these values?
A plain mean will not work, as the stray low or high values will skew the average.
Let's suppose we have these values:
10, 13, 22, 21, 19, 18, 20, 21, 22, 21, 19, 30, 30, 21, 20, 21, 22, 19, 18, 20, 22, 21, 10, 18, 19
I think the best approach would be a histogram. Defining manual ranges will fail, as we could reject good values that fall just outside a range boundary.
Automatic ranges calculated from the incoming data won't work either, because they can overlap each other.
I am coding on an Arduino, so dynamic memory allocation is not a good idea.

A moving average may be a good choice: define a window length, say d; at any instant your filtered value is the average of the previous d sensor readings.
If you are really concerned about rejecting strange values, you may add a threshold on top of the moving average, but I don't recommend it; I suggest you first observe the real pattern your sensor follows.
P.S.: My answer may be opinion-based, but this is the way I deal with sensors in my work, and it has worked well for the company so far.
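The moving-average idea above can be sketched as follows (a minimal illustration, not the answerer's code; the window length d and the behavior before d readings arrive are assumptions):

```javascript
// Simple moving average: each output is the mean of the last d readings
// (fewer at the start, before d readings have arrived).
function movingAverage(readings, d) {
  const out = [];
  for (let i = 0; i < readings.length; i++) {
    const windowStart = Math.max(0, i - d + 1);
    const window = readings.slice(windowStart, i + 1);
    out.push(window.reduce((sum, v) => sum + v, 0) / window.length);
  }
  return out;
}
```

With d around 5, a single outlier like the 30s in the question's data only shifts the filtered value by a fifth of its deviation.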

Maybe calculating the trimmed mean of these values could help you, or just the median.
How to calculate the truncated or trimmed mean?
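Both suggestions can be sketched in a few lines (an illustration; the 20% trim fraction below is an arbitrary choice, not part of the answer):

```javascript
// Trimmed mean: drop a fraction of the lowest and highest values, then average.
function trimmedMean(values, trimFraction) {
  const sorted = [...values].sort((a, b) => a - b);
  const k = Math.floor(sorted.length * trimFraction);
  const kept = sorted.slice(k, sorted.length - k);
  return kept.reduce((sum, v) => sum + v, 0) / kept.length;
}

// Median: middle element of the sorted values (mean of the two middle ones
// for even-length arrays).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

For the question's 25 values the median is 20, which ignores the 10s and 30s entirely.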

The filter I ended up using is a multiple-pass mean:
I compute the mean of all values, discard the values farther than a threshold from that first mean, and then compute the mean of the remaining values.
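That two-pass filter can be sketched like this (a reconstruction from the description above; the threshold value is an assumption):

```javascript
// Two-pass mean: compute the mean of all values, discard values farther
// than `threshold` from it, then average what is left.
function twoPassMean(values, threshold) {
  const mean1 = values.reduce((sum, v) => sum + v, 0) / values.length;
  const kept = values.filter(v => Math.abs(v - mean1) <= threshold);
  return kept.reduce((sum, v) => sum + v, 0) / kept.length;
}
```

For the values in the question with a threshold of 5, the first mean is 19.88, the 10s, the 13, and the 30s are discarded, and the second mean is 20.2.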

If you're actually after an average, this won't work for you, but this method requires no extra memory to manage a buffer of samples:
void loop() {
    static float mean = analogRead(A0);
    int newInput = analogRead(A0);
    mean = (mean * 0.95) + (newInput * 0.05);
    Serial.println(mean);
}
You can adjust the constants (0.95 and 0.05) to whatever you like, as long as they add up to 1. The smaller the multiplier of mean, the faster mean will track new values.
If you don't like the overhead of floating point math, the same idea works quite well in fixed point.
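For instance, the fixed-point variant can be sketched like this (a sketch with an assumed weight of 1/16, i.e. a shift by 4; shown in JavaScript for brevity, but the same integer operations work in Arduino C):

```javascript
// Fixed-point exponential moving average: the state is the mean scaled
// by 16, so the update mean += (input - mean)/16 needs no floating point.
// A larger shift gives smoother but slower tracking.
function makeFixedPointEwma(initial) {
  let state = initial << 4;          // mean * 16
  return function update(input) {
    state += input - (state >> 4);   // state += input - mean
    return state >> 4;               // current mean
  };
}
```

Each update computes mean_new = (15 * mean + input) / 16, the integer analogue of the 0.95/0.05 weighting above.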

Related

Removing irrelevant values (end tail) from (non)normal distribution array of numbers

While I appreciate that this question is math-heavy, a real answer to it will be helpful to everyone dealing with MongoDB's $bucket operator (or its SQL analogues) and building cluster/heatmap chart data.
Long Description of the Problem:
I have an array of unique/distinct price values from my DB (it is always an array of numbers, with 0.01 precision).
As you can see, most of its values lie between ~8 and 40 (in this particular case).
[
7.9, 7.98, 7.99, 8.05, 8.15, 8.25, 8.3, 8.34, 8.35, 8.39,
8.4, 8.49, 8.5, 8.66, 8.9, 8.97, 8.98, 8.99, 9, 9.1,
9.15, 9.2, 9.28, 9.3, 9.31, 9.32, 9.4, 9.46, 9.49, 9.5,
9.51, 9.69, 9.7, 9.9, 9.98, 9.99, 10, 10.2, 10.21, 10.22,
10.23, 10.24, 10.25, 10.27, 10.29, 10.49, 10.51, 10.52, 10.53, 10.54,
10.55, 10.77, 10.78, 10.98, 10.99, 11, 11.26, 11.27, 11.47, 11.48,
11.49, 11.79, 11.85, 11.9, 11.99, 12, 12.49, 12.77, 12.8, 12.86,
12.87, 12.88, 12.89, 12.9, 12.98, 13, 13.01, 13.49, 13.77, 13.91,
13.98, 13.99, 14, 14.06, 14.16, 14.18, 14.19, 14.2, 14.5, 14.53,
14.54, 14.55, 14.81, 14.88, 14.9, 14.98, 14.99, 15, 15.28, 15.78,
15.79, 15.8, 15.81, 15.83, 15.84, 15.9, 15.92, 15.93, 15.96, 16,
16.5, 17, 17.57, 17.58, 17.59, 17.6, 17.88, 17.89, 17.9, 17.93,
17.94, 17.97, 17.99, 18, 18.76, 18.77, 18.78, 18.99, 19.29, 19.38,
19.78, 19.9, 19.98, 19.99, 20, 20.15, 20.31, 20.35, 20.38, 20.39,
20.44, 20.45, 20.49, 20.5, 20.69, 20.7, 20.77, 20.78, 20.79, 20.8,
20.9, 20.91, 20.92, 20.93, 20.94, 20.95, 20.96, 20.99, 21, 21.01,
21.75, 21.98, 21.99, 22, 22.45, 22.79, 22.96, 22.97, 22.98, 22.99,
23, 23.49, 23.78, 23.79, 23.8, 23.81, 23.9, 23.94, 23.95, 23.96,
23.97, 23.98, 23.99, 24, 24.49, 24.5, 24.63, 24.79, 24.8, 24.89,
24.9, 24.96, 24.97, 24.98, 24.99, 25, 25.51, 25.55, 25.88, 25.89,
25.9, 25.96, 25.97, 25.99, 26, 26.99, 27, 27.55, 28, 28.8,
28.89, 28.9, 28.99, 29, 29.09, 30, 31.91, 31.92, 31.93, 33.4,
33.5, 33.6, 34.6, 34.7, 34.79, 34.8, 35, 38.99, 39.57, 39.99,
40, 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000
]
The problem itself, or: how to clear the tail of a (non-)normal distribution of non-normal elements
I need to find the irrelevant values in this array, some kind of «dirty tail», and remove them. Actually, I don't even need to remove them from the array; the real goal is to find the last relevant number and define it as a cap value, giving a range between floor (min relevant) and cap (max relevant), like:
floor value => 8
cap value => 40
What am I talking about?
For example, for the array above it will be all values after 40 (or maybe even 60): 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000.
These are all values I consider non-normal.
What will be counted as an answer?
S tier. Clean/optimal code (language doesn't matter, but JavaScript is preferred) or a formula (if math has one) that solves the problem in a small amount of time with few resources. It would be perfect if I didn't even need to check every element of the array, or could skip some of them, e.g. by starting from the peak / most popular value in the array.
A tier. Your own experience or code attempt with any relevant results, or an improvement of the current formula with better performance.
B tier. Something useful. A blog article or Google link. The main requirement is that it makes sense. Non-obvious solutions are welcome, even if your code is terribly formatted and so on.
TL;DR: visual clarification
By which criteria, and how, should I «target the tail» and remove the non-relevant elements (the dramatically rising, rarely occurring values) from the array?
The given data set has some huge outliers, which make it somewhat hard to analyze using standard statistical methods (if it were better behaved, I would recommend fitting several candidate distributions to it and finding out which fits best - log normal distribution, beta distribution, gamma distribution, etc).
The problem of determining which outliers to ignore can be solved in general through more simplistic but less rigorous methods; one method is to compare the values of the data at various percentiles and throw away the ones where the differences become "too high" (for a suitably chosen value of "too high").
For example, here are the last few entries if we go up two percentile slots at a time; the delta column gives the difference between the previous percentile's value and this one's.
Here you can see that the difference from the previous entry increases by almost 2 once we hit the 87th percentile, and (mostly) keeps growing from there. To use a "nice" number, let's make the cut-off the 85th percentile and ignore all values above it.
Given the sorted list above in an array named data, we ignore any index above
Math.floor(data.length*85/100)
The analysis above can be repeated in code if it should change dynamically (or to call attention to deviations where 85 is not the right value), but I leave this as an exercise for the reader.
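One way to sketch that repeatable analysis in code (the 2-percentile step, the starting point at the median, and the jump factor are assumed constants on my part, not part of the answer):

```javascript
// Walk up the percentiles of a sorted array and stop where the
// step-to-step delta jumps; everything above that percentile is "tail".
function findCapPercentile(sorted, stepPct = 2, jumpFactor = 3) {
  let prevValue = sorted[Math.floor(sorted.length * 0.5)]; // start at the median
  let prevDelta = null;
  for (let p = 50 + stepPct; p <= 100; p += stepPct) {
    const idx = Math.min(sorted.length - 1, Math.floor(sorted.length * p / 100));
    const delta = sorted[idx] - prevValue;
    if (prevDelta !== null && prevDelta > 0 && delta > prevDelta * jumpFactor) {
      return p - stepPct; // the previous percentile is the cap
    }
    prevDelta = delta;
    prevValue = sorted[idx];
  }
  return 100; // no jump found: keep everything
}
```

The returned percentile plays the role of the 85 above; values beyond it are treated as outliers.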
This is version 2 of the code, and the exact version running in production at the moment. It covers 80%+ of the cases, but there is still a bottleneck.
/** priceRangeArray ALWAYS SORTED ASC */
let priceRangeArray = [1, 2, 3 /* , ... */];
/** Resulting array */
let priceArray = [];
/** Control variables */
let prev_sV = 0;
let sV = 0;
/** Array length is always more than 3 elements */
const L = priceRangeArray.length;
/** Sample-variance algorithm */
for (let i = 2; i < L - 1; i++) {
  /**
   * We skip the first two values, because the 1st sV could be too low;
   * sV becomes the previous sV.
   */
  if (prev_sV === 0) {
    /** prev_sV of the 2nd element */
    prev_sV = (1 / L * Math.pow(priceRangeArray[1], 2)) - Math.pow(1 / L * priceRangeArray[1], 2);
  } else {
    prev_sV = sV;
  }
  /**
   * Sample variance, right?
   * 1 / L * (el ^ 2) - (1 / L * el) ^ 2
   * @type {number}
   */
  sV = (1 / L * Math.pow(priceRangeArray[i], 2)) - Math.pow(1 / L * priceRangeArray[i], 2);
  /** User-defined; 1.1 is a control constant */
  if (prev_sV * 1.1 < sV) {
    break;
  }
  /** Push values that pass the check into the new array */
  priceArray.push(priceRangeArray[i]);
}
console.log(priceArray);
It is based on Wikipedia's Variance article. The logic is quite simple: since I can't remove the beginning (the first two values, even if they are too low), I start the for loop from the 3rd element of the array and check each subsequent one with my control formula (something built from the squares of the current and previous values).
The first version of this code used linear logic and simply compared the current value with the previous one, using simple rules like:
If the current value is more than twice (xN) the previous one, then break.
If the current value is more than 10% greater than the previous one, then break.
The real problem is that it doesn't work well with the beginning or with small values, in arrays like [1, 2, 3, 4, 13, 14, 16, 22, 100, 500000],
where, as you can see, the cap value could be determined as 4 instead of 22 or 100.
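For illustration, those version-1 rules can be reconstructed like this (a reconstruction from the description, not the original code). It reproduces exactly the failure just described: on small starting values the 10% rule fires immediately:

```javascript
// Version-1-style cutoff: walk the sorted array and stop when the next
// value jumps too much relative to the previous one.
function linearCap(sorted, ratio = 2, pct = 0.1) {
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    if (sorted[i] > prev * ratio || sorted[i] > prev * (1 + pct)) {
      return prev; // last value before the jump
    }
  }
  return sorted[sorted.length - 1];
}
```

On [1, 2, 3, 4, 13, 14, 16, 22, 100, 500000] it returns 1, since 2 is already more than 10% above 1, which is precisely why relative rules break down at the small end.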
I also found other code that helps me in production; the current working version combines the best practices from my previous answer and James McLead's:
function priceRange(
  quotes: number[],
  blocks: number,
): number[] {
  if (!quotes.length) return [];
  const length = quotes.length > 3 ? quotes.length - 3 : quotes.length;
  const start = length === 1 ? 0 : 1;
  const cap = Math.round(quotes[Math.floor(length * 0.9)]);
  const floor = Math.round(quotes[start]);
  const price_range = cap - floor;
  /** Each step represents one cluster (2.5% of the range here) */
  const tick = price_range / blocks;
  return Array(Math.ceil((cap + tick - floor) / tick))
    .fill(floor)
    .map((x, y) => parseFloat((x + y * tick).toFixed(4)));
}
For an array like this:
[1, 20, ..., 40, 432, 567, 345346]
the floor value will be determined as 20,
the cap as ~40, the step as ~0.5, and the result will be:
[20, 20.5, 21, ..., 39.5, 40]

Labview Array Index Tracking

I am currently working with a pretty simple XY plot (Y values from a random number generator, X values from the while-loop count). Both are stored in arrays, and at certain X thresholds the Y array is decimated by certain factors (10, 100, 1000, ...).
However, my goal with this VI is to be able to decimate in "chunks". In other words: every 1,000-point chunk, decimate the array by a factor of 10; every 10,000-point chunk, decimate by a factor of 100. After each of these chunks, the arrays should continue to index at +1 until they reach another chunk boundary and then be decimated appropriately.
For example;
Index: 998, 999, 1000, 1001... Decimate Factor 10
1998, 1999, 2000, 2001... Decimate Factor 10
...
9998, 9999, 10000, 10001... Decimate Factor 100
(My current setup permanently changes the decimation factor once it reaches a certain X value, and from then on only records data points in increments of 10, 100, 1000, ...)
Thanks for any help! See code below
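One possible reading of the chunking rule can be sketched as follows (an assumption on my part, since the actual VI is an image: a factor is chosen per chunk boundary, and points in between are kept one by one):

```javascript
// Assumed rule: every 10,000th point triggers a factor-100 decimation of
// the buffered chunk, every other 1,000th point a factor-10 decimation;
// all other samples are appended one by one (factor 1).
function chunkFactor(index) {
  if (index > 0 && index % 10000 === 0) return 100;
  if (index > 0 && index % 1000 === 0) return 10;
  return 1;
}

// Keep every `factor`-th element of a chunk.
function decimate(chunk, factor) {
  return chunk.filter((_, i) => i % factor === 0);
}
```

In LabVIEW terms this would correspond to a case structure selecting the factor from the quotient-and-remainder of the loop count, applied to the buffered chunk rather than to the whole history.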
Answered as an edit on the original thread where this question was asked:
Labview - Increasing Array Index with Array Size Limiting
Copying the info from there:
EDIT: #JonathanVahala asked about using configurable decimation below. See this image, which shows a way to do this:

Match the values of 2 arrays

I am trying to create a program that will rate a bunch of arrays to find the one that most closely matches a given array. Say the given array is
[1, 80, 120, 155, 281, 301]
and one of the arrays to compare against is
[-6, 78, 108, 121, 157, 182, 218, 256, 310, 408, 410]
How can I match up the values in the first array with their closest values in the second array so as to give the lowest total difference?
In case this is unclear:
1 => -6, 80 => 78, 120 => 121, 155 => 157
Matching 281 to 310 might look attractive, but 256 is actually closer (25 vs 29), and taking 310 for 281 would force 301 to match 256. So the best overall match is
281 => 256 and 301 => 310
The program would then simply calculate a rating by doing
abs(-6 - 1) + abs(78 - 80), etc., for all matches; the array with the lowest rating is the best match.
*******NOTE*******
The given array will be the same size as or smaller than the matching array and will only contain positive values. The matching array can contain negative values.
I was thinking of using cosine similarity, but I am unsure how to implement it for this problem.
In general, a computed distance is more accurate. There are different approaches, each with advantages and disadvantages. In your example you compute the sum of one-dimensional Euclidean distances, but there are more advanced comparisons, such as dynamic time warping (DTW): an algorithm which finds the best alignment between two 'arrays' and computes the optimal distance.
You can install and use this package; here you can see a visual example. One other advantage of DTW is that the lengths of the arrays don't have to match.
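A minimal DTW distance can be written with the classic dynamic-programming recurrence (a sketch, not the package mentioned above):

```javascript
// Dynamic time warping distance between two numeric arrays.
// d[i][j] = cost of aligning a[0..i) with b[0..j); each cell adds the
// local cost |a[i-1] - b[j-1]| to the cheapest of the three predecessors.
function dtw(a, b) {
  const n = a.length, m = b.length;
  const d = Array.from({ length: n + 1 }, () => Array(m + 1).fill(Infinity));
  d[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      const cost = Math.abs(a[i - 1] - b[j - 1]);
      d[i][j] = cost + Math.min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]);
    }
  }
  return d[n][m];
}
```

Note that dtw([1, 2, 3], [1, 2, 2, 3]) is 0, because the repeated 2 aligns with the same element; this is the unequal-length property mentioned above.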

ILMath function to arrange values given start, end and step size?

I am trying to get a linearly spaced array given the step size.
For example:
arange(10, 15, 0.5) = 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15
arange(10, 15, 1) = 11, 12, 13, 14
There is a linspace function, but it accepts only the number of elements to be generated. Is there a way to provide the step size instead of the number of elements?
For now, I calculate the number of elements manually and use linspace to get the result.
Is there a ready-to-use API to get the desired output? Thanks
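The manual workaround described boils down to something like this (a JavaScript sketch for illustration only, since ILNumerics itself is C#; unlike the expected outputs above, this variant includes both endpoints, in the style of Matlab's colon operator):

```javascript
// arange-style helper: derive the element count from the step size,
// then generate start, start + step, ..., up to and including stop
// (when stop - start is a whole number of steps).
function arange(start, stop, step) {
  const n = Math.floor((stop - start) / step) + 1;
  return Array.from({ length: n }, (_, i) => start + i * step);
}
```

The count calculation `floor((stop - start) / step) + 1` is the piece you would feed into linspace.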
Try:
ILArray<double> A = ILMath.vec<double>(10.0,0.5,15.0);
More array creation functions can be found in the Array section of the documentation. A number of quick-reference charts are also available:
ILNumerics' getting started:
http://ilnumerics.net/media/oldres/img/ILNumerics_ArraysUsage.pdf
ILNumerics for Matlab users:
http://ilnumerics.net/media/oldres/img/ILNumerics4MatlabUsers.pdf
Last but not least the class reference for all ILMath functions:
http://ilnumerics.net/apidoc/?topic=html/Methods_T_ILNumerics_ILMath.htm

Finding [index of] the minimal value in array which satisfies a condition in Fortran

I am looking for the minimal value in an array which is larger than a certain number. I found this discussion, which I don't understand. There is MINLOC, but it looks like it does not do what I want on its own, though I didn't fully parse the arguments passed to it in the given examples. (It is also possible to do this with a loop, but that could be clumsy.)
You probably want MINVAL.
If your array is say,
array = (/ 21, 52, 831, 46, 125, 68, 7, 8, 549, 10 /)
and you want to find the minimum value greater than, say, 65:
variable = minval(array, mask=(array > 65))
which would obviously give 68.
It sounds like MINVAL is what you want.
You just need to do something like:
min_above_cutoff = MINVAL(a, MASK=(a > cutoff))
The optional MASK parameter should be a logical array with the same shape as a; it tells MINVAL which elements of a to consider when searching for the minimum value.
Take a look at the documentation here: MINVAL
If you would instead like to get the index of the minimum value, rather than the value itself, you can use MINLOC. In this case the code would look like:
index = MINLOC(a, MASK=(a > cutoff))
Documentation can be found here: MINLOC
