Match the values of 2 arrays

I am trying to create a program that will rate a set of arrays to find the one that most closely matches a given array. Say the given array is
[1, 80, 120, 155, 281, 301]
and one of the arrays to compare against is
[-6, 78, 108, 121, 157, 182, 218, 256, 310, 408, 410]
How can I match up the values in the first array to their closest values in the second array so that the total difference is lowest?
In case this is unclear:
1 => -6, 80 => 78, 120 => 121, 155 => 157
281 could be matched to either 256 or 310, but whichever it takes constrains 301: if 281 took 310, 301 would be forced onto 256. The best overall match is
281 => 256 and 301 => 310
Then the program would simply calculate a rating by doing
abs(-6 - 1) + abs(78 - 80) etc. for all matches, and the array with the lowest rating is the best match.
Note:
The given array will be the same size as or smaller than the matching array and will only have positive values. The matching array can have negative values.
I was thinking of using cosine similarity but I am unsure how to implement that for this problem.
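Since both arrays are sorted ascending, an optimal one-to-one assignment never needs to cross pairs, so a small dynamic program over the two arrays finds the lowest total difference. A sketch in Python (the function name and DP layout are my own, not a standard library routine):

```python
def best_match_cost(given, candidate):
    """Minimal total |difference| when each element of `given` is matched to a
    distinct element of `candidate`, both sorted ascending, keeping order.
    dp[i][j] = best cost of matching the first i given values
    using only the first j candidate values."""
    inf = float("inf")
    n, m = len(given), len(candidate)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dp[0][j] = 0  # matching nothing costs nothing
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            dp[i][j] = min(
                dp[i][j - 1],                                             # skip candidate j
                dp[i - 1][j - 1] + abs(given[i - 1] - candidate[j - 1]),  # pair them
            )
    return dp[n][m]
```

On the example above this yields 7 + 2 + 1 + 2 + 25 + 9 = 46, with 281 => 256 and 301 => 310 as described.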

In general a computed distance is more accurate. There are different approaches, each with advantages and disadvantages. In your example you compute the sum of one-dimensional Euclidean distances. But there are more elaborate comparisons, like dynamic time warping (DTW): an algorithm that finds the best alignment between two 'arrays' and computes the optimal distance.
You can install and use this package. Here you can see a visual example. Another advantage of DTW is that the lengths of the arrays don't have to match.
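If you'd rather not install anything, the classic DTW recurrence is short enough to write yourself. A Python sketch with |x - y| as the local cost (note that DTW may align one element to several, which is slightly different from the one-to-one matching in the question):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance in O(len(a) * len(b)) time and space."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of: advance in a, advance in b, or advance in both
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```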


Removing irrelevant values (end tail) from (non)normal distribution array of numbers

While I appreciate that this question is math-heavy, the real answer to it will be helpful for all those who are dealing with MongoDB's $bucket operator (or its SQL analogues), and building cluster/heatmap chart data.
Long Description of the Problem:
I have an array of unique/distinct values of prices from my DB (it's always an array of numbers, with 0.01 precision).
As you may see, most of its values are between ~8 and 40 (in this certain case).
[
7.9, 7.98, 7.99, 8.05, 8.15, 8.25, 8.3, 8.34, 8.35, 8.39,
8.4, 8.49, 8.5, 8.66, 8.9, 8.97, 8.98, 8.99, 9, 9.1,
9.15, 9.2, 9.28, 9.3, 9.31, 9.32, 9.4, 9.46, 9.49, 9.5,
9.51, 9.69, 9.7, 9.9, 9.98, 9.99, 10, 10.2, 10.21, 10.22,
10.23, 10.24, 10.25, 10.27, 10.29, 10.49, 10.51, 10.52, 10.53, 10.54,
10.55, 10.77, 10.78, 10.98, 10.99, 11, 11.26, 11.27, 11.47, 11.48,
11.49, 11.79, 11.85, 11.9, 11.99, 12, 12.49, 12.77, 12.8, 12.86,
12.87, 12.88, 12.89, 12.9, 12.98, 13, 13.01, 13.49, 13.77, 13.91,
13.98, 13.99, 14, 14.06, 14.16, 14.18, 14.19, 14.2, 14.5, 14.53,
14.54, 14.55, 14.81, 14.88, 14.9, 14.98, 14.99, 15, 15.28, 15.78,
15.79, 15.8, 15.81, 15.83, 15.84, 15.9, 15.92, 15.93, 15.96, 16,
16.5, 17, 17.57, 17.58, 17.59, 17.6, 17.88, 17.89, 17.9, 17.93,
17.94, 17.97, 17.99, 18, 18.76, 18.77, 18.78, 18.99, 19.29, 19.38,
19.78, 19.9, 19.98, 19.99, 20, 20.15, 20.31, 20.35, 20.38, 20.39,
20.44, 20.45, 20.49, 20.5, 20.69, 20.7, 20.77, 20.78, 20.79, 20.8,
20.9, 20.91, 20.92, 20.93, 20.94, 20.95, 20.96, 20.99, 21, 21.01,
21.75, 21.98, 21.99, 22, 22.45, 22.79, 22.96, 22.97, 22.98, 22.99,
23, 23.49, 23.78, 23.79, 23.8, 23.81, 23.9, 23.94, 23.95, 23.96,
23.97, 23.98, 23.99, 24, 24.49, 24.5, 24.63, 24.79, 24.8, 24.89,
24.9, 24.96, 24.97, 24.98, 24.99, 25, 25.51, 25.55, 25.88, 25.89,
25.9, 25.96, 25.97, 25.99, 26, 26.99, 27, 27.55, 28, 28.8,
28.89, 28.9, 28.99, 29, 29.09, 30, 31.91, 31.92, 31.93, 33.4,
33.5, 33.6, 34.6, 34.7, 34.79, 34.8, 35, 38.99, 39.57, 39.99,
40, 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000
]
The problem itself, or: how to clear the tail of a (non-)normal distribution of non-normal elements
I need to find the irrelevant values in this array, some kind of «dirty tail», and remove them. Actually, I don't even need to remove them from the array; the real task is to find the last relevant number, to define it as a cap value, for finding a range between floor (min relevant) and cap (max relevant), like:
floor value => 8
cap value => 40
What am I talking about?
For example, for the array above it would be all values after 40 (or maybe even 60): 49, 50, 50.55, 60.89, 99.99, 20000, 63000, 483000.
I consider all of these non-normal.
What will be counted as an answer?
S tier. Clean/optimal code (language doesn't matter, but JavaScript preferred) or a formula (if math has one) that solves the problem in a small amount of time and resources. It would be perfect if I didn't even need to check every element of the array, or could skip some of them, e.g. starting from the peak / most popular value in the array.
A tier. Your own experience or code attempt with any relevant results, or an improvement of the current formula with better performance.
B tier. Something useful: a blog article / Google link. The main requirement is that it makes sense. Non-obvious solutions are welcome, even if your code is terribly formatted and so on.
TL;DR VISUAL CLARIFICATION
By which criteria, and how, should I «target the tail» / remove the non-relevant elements from an array with dramatically rising, rarely occurring values?
The given data set has some huge outliers, which make it somewhat hard to analyze using standard statistical methods (if it were better behaved, I would recommend fitting several candidate distributions to it and finding out which fits best - log normal distribution, beta distribution, gamma distribution, etc).
The problem of determining which outliers to ignore can be solved in general through more simplistic but less rigorous methods; one method is to compare the values of the data at various percentiles and throw away the ones where the differences become "too high" (for a suitably chosen value of "too high").
For example, here are the last few entries if we go up by two percentile slots; the delta column gives the difference between the previous percentile and this one.
Here, you can see that the difference with the previous entry increases by almost 2 once we hit 87, and goes up (mostly) from there. To use a "nice" number, let's make the cut-off the 85th percentile, and ignore all values above that.
Given the sorted list above in an array named data, we ignore any index above
Math.floor(data.length*85/100)
The analysis above can be repeated in code if it should change dynamically (or to call attention to deviations where 85 is not the right value), but I leave this as an exercise for the reader.
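The percentile walk can be automated along these lines. A rough Python sketch; the 2-percentile step and the 2x "too high" jump factor are assumptions you would tune against your data:

```python
def percentile_cutoff(data, step=2, jump=2.0):
    """Walk sorted `data` in percentile steps and return the last percentile
    before the inter-percentile delta grows by more than `jump` times."""
    n = len(data)
    prev_val = data[0]
    prev_delta = None
    for p in range(step, 101, step):
        idx = min(n - 1, n * p // 100)
        delta = data[idx] - prev_val
        if prev_delta is not None and prev_delta > 0 and delta > jump * prev_delta:
            return p - step          # last "good" percentile
        prev_val, prev_delta = data[idx], delta
    return 100                       # no suspicious jump found
```

The cap value is then roughly data[n * cutoff // 100], mirroring the Math.floor(data.length*85/100) index above.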
This is version 2 of the code, and the exact version that is running in production at the moment. It covers 80%+ of the cases, but there is still a bottleneck.
/** priceRangeArray ALWAYS SORTED ASC */
let priceRangeArray = [1,2,3...]
/** Resulting array */
let priceArray = []
/** Control variables */
let prev_sV = 0
let sV = 0
/** Array length is always more than 3 elements */
const L = priceRangeArray.length;
/** Sample Variance algorithm */
for (let i = 2; i < L - 1; i++) {
  /**
   * We skip the first two values, because the 1st sV could be too low;
   * sV becomes the previous sV
   */
  if (prev_sV === 0) {
    /** prev_sV of the 2nd element */
    prev_sV = (1 / L * Math.pow(priceRangeArray[1], 2)) - Math.pow(1 / L * priceRangeArray[1], 2);
  } else {
    prev_sV = sV;
  }
  /**
   * sample variance, right?
   * 1 / L * (el ^ 2) - (1 / L * el) ^ 2
   * @type {number}
   */
  sV = (1 / L * Math.pow(priceRangeArray[i], 2)) - Math.pow(1 / L * priceRangeArray[i], 2);
  /** User-defined, 1.1 is a control constant */
  if (prev_sV * 1.1 < sV) {
    break;
  }
  /** Push values that pass the check to the new array */
  priceArray.push(priceRangeArray[i]);
}
console.log(priceArray)
It is based on Wikipedia's Variance article. The logic is quite simple: since I can't remove the beginning (the first 2 values, even if they are too low), I start the for loop from the 3rd element of the array and check every next one with my control formula (something with the squares of the current and previous values).
The first version of this code had linear logic and simply compared the current value to the previous one by one of these simple principles:
If the current value is more than N times (e.g. twice) the previous one, then break.
If the current value is more than the previous one by 10%, then break.
The real problem is that it doesn't work well with the beginning, or with small values, in arrays like: [ 1,2,3,4,13,14,16,22,100,500000],
where, as you can see, the cap value could be determined as 4 instead of 22, or 100.
I also found another piece of code that helps me in production, and as of now the current working version combines the best practices from my previous answer and James McLead's:
priceRange(
quotes: number[],
blocks: number,
): number[] {
if (!quotes.length) return [];
const length = quotes.length > 3 ? quotes.length - 3 : quotes.length;
const start = length === 1 ? 0 : 1;
const cap = Math.round(quotes[Math.floor(length * 0.9)]);
const floor = Math.round(quotes[start]);
const price_range = cap - floor;
/** Step represents 2.5% for each cluster */
const tick = price_range / blocks;
return Array(Math.ceil((cap + tick - floor) / tick))
.fill(floor)
.map((x, y) => parseFloat((x + y * tick).toFixed(4)));
}
For an array like this:
[1, 20, ..., 40, 432, 567, 345346]
the floor value will be determined as 20,
the cap as ~40, the step as ~0.5, and the result will be:
[20, 20.5, 21, ... 39.5, 40]

Filter data using histogram in C

I have 25 values from a sensor, between 0 and 40.
Most values will be around the true value, but some can be far away.
How can I get a reliable measure based on these values?
The mean will not work, as the low or high outliers will blow up the average.
Let's suppose we have these values:
10, 13, 22, 21, 19, 18, 20, 21, 22, 21, 19, 30, 30, 21, 20, 21, 22, 19, 18, 20, 22, 21, 10, 18, 19
I think the best approach will be a histogram. Defining manual ranges will fail, as we may reject some good values next to the range bounds.
Automatic ranges calculated from the incoming data won't work either, because they can overlap each other.
I am coding on an Arduino, so dynamic memory allocation is not the best idea.
Maybe a moving average would be a nice choice: define a length, say d; then at any instant your filtered value is the average of the previous d sensor values.
If you are really concerned about rejecting strange values, you could add a threshold on top of the moving average, but I don't recommend it; I guess you have to observe the real pattern your sensor is following.
P.S.: My answer may be opinion-based, but this is the way I deal with sensors [in my work], and it has worked perfectly for the company up to now.
Maybe calculating the trimmed mean of these values could help you, or just
the median.
How to calculate the truncated or trimmed mean?
The filter I ended up using is a repeated mean:
I compute the mean of all values, discard the values farther than a threshold from that first mean, and then compute the mean of the remaining values.
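For reference, a trimmed mean simply drops a fixed fraction of the sorted samples from each end before averaging. A Python sketch (the 10% default trim is an arbitrary choice):

```python
def trimmed_mean(values, trim=0.10):
    """Drop the lowest and highest `trim` fraction of samples, average the rest."""
    s = sorted(values)
    cut = int(len(s) * trim)              # samples dropped from each end
    kept = s[cut:len(s) - cut] if cut else s
    return sum(kept) / len(kept)
```

On the 25 sensor readings above this drops exactly the two 10s and the two 30s and lands at about 19.86.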
If you're actually after an average, this won't work for you, but this method requires no extra memory to manage a buffer of samples:
void loop() {
static float mean = analogRead(A0);
int newInput = analogRead(A0);
mean = (mean * 0.95) + (newInput * 0.05);
Serial.println(mean);
}
You can adjust the constants (0.95 and 0.05) to whatever you like, as long as they add up to 1. The smaller the multiplier of mean, the faster mean will track new values.
If you don't like the overhead of floating point math, the same idea works quite well in fixed point.
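To sketch the fixed-point variant: keep the filter state scaled up by a power of two so the fractional part survives, and replace the multiplies with shifts. Illustrated in Python for clarity (shift = 4, i.e. alpha = 1/16, is an arbitrary choice; on an Arduino the same arithmetic works on plain int variables):

```python
def fixed_point_ema(samples, shift=4):
    """Exponential moving average with alpha = 1/2**shift, integer math only.
    `state` holds the mean scaled by 2**shift so no precision is lost."""
    state = samples[0] << shift          # seed with the first sample
    out = []
    for s in samples:
        state += s - (state >> shift)    # state/2**shift tracks the mean
        out.append(state >> shift)
    return out
```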

Finding [index of] the minimal value in array which satisfies a condition in Fortran

I am looking for the minimal value in an array that is larger than a certain number. I found this discussion, which I don't understand. There is MINLOC, but it looks like it does not do as much as I would like on its own, though I didn't parse the arguments passed to it in the given examples. (It is also possible to do this using a loop, but that could be clumsy.)
You probably want MINVAL.
If your array is say,
array = (/ 21, 52, 831, 46, 125, 68, 7, 8, 549, 10 /)
And you want to find the minimum value greater than say 65,
variable = minval(array, mask=(array > 65))
which would obviously give 68.
It sounds like MINVAL is what you want.
You just need to do something like:
min_above_cutoff = MINVAL(a, MASK=(a > cutoff))
The optional parameter MASK should be a logical array with the same size as a. It tells MINVAL which elements in a to consider when searching for the minimum value.
Take a look at the documentation here: MINVAL
If you would instead like the index of the minimum value, rather than the value itself, you can use MINLOC. Note that without the DIM argument MINLOC returns a rank-one array of size 1, so for a 1-D array the code would look like:
index = MINLOC(a, MASK=(a > cutoff), DIM=1)
Documentation can be found here: MINLOC

C Abscissas fitting in length algorithm

I couldn't find an algorithm for my problem.
There are several predefined kinds of abscissas (segments) of different sizes.
The lengths are integers.
And then there is a defined total length to create from the abscissas.
I need an algorithm that finds the best way to merge/fit/compose abscissas into the defined length.
(we are in 1D)
The fewer segments the better, and I need to find the best combination.
The supply of every predefined abscissa is infinite.
The smallest abscissa always has size 1, so the problem is always solvable.
Trying all combinations and picking the best is not an option.
For example:
number of abscissas: 5;
types: 321, 215, 111, 9, 1;
length: 900;
result: 2x321 + 2x111 + 4x9 => 8 abscissas
The above problem is similar to the knapsack problem with the following parameters:
knapsack capacity = length = 900
item weights: 321 (fits 900/321 = 2 times), 215 (900/215 = 4 times), 111 (900/111 = 8 times), ...
values = weights
Maximize the profit, and store the minimum number of abscissas for each subproblem.
If max profit == knapsack capacity, a solution exists: retrace it with the minimum number of abscissas.
Otherwise it doesn't exist.
Knapsack problem
There is a DP solution for knapsack in pseudo-polynomial time.
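Because the smallest abscissa has size 1, every length is reachable, and the task reduces to the classic "minimum number of coins" DP in O(length x types) time. A Python sketch:

```python
def min_abscissas(types, length):
    """dp[v] = fewest segments summing exactly to v (dp[0] = 0)."""
    INF = float("inf")
    dp = [0] + [INF] * length
    for v in range(1, length + 1):
        for t in types:
            if t <= v and dp[v - t] + 1 < dp[v]:
                dp[v] = dp[v - t] + 1
    return dp[length]
```

min_abscissas([321, 215, 111, 9, 1], 900) gives 8, matching the 2x321 + 2x111 + 4x9 example; recording which t achieved each dp[v] would let you retrace the actual pieces.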

How do I calculate the k nearest numbers to the median?

I have an array of n pairwise different elements and a number k with 1 <= k <= n.
Now I am looking for an algorithm that calculates the k numbers with the minimal absolute difference to the median of the array. I need linear complexity (O(n)).
My approach:
I find the median:
I sort the numbers.
I take the middle element, or, if the number of elements is even, the average of the two middle elements, rounded.
After that:
I take every number and find its absolute distance from the median. These results I save in a different array.
I sort the newly obtained array.
I take the first k elements of the result array and I'm done.
I don't know whether my solution is in O(n), or whether this idea is even right. Can someone verify that? Can someone show me how to solve it in O(n)?
You can solve your problem like this:
You can find the median in O(n), e.g. using the O(n) nth_element algorithm.
You loop through all elements, substituting each with a pair: <the absolute difference to the median>, <element's value>. Once more you do nth_element with n = k. After applying this algorithm you are guaranteed to have the k smallest elements in absolute difference first in the new array. You take their indices and you're DONE!
Your algorithm, on the other hand, uses sorting, and this makes it O(n log n).
EDIT: The requested example:
Let the array be [14, 6, 7, 8, 10, 13, 21, 16, 23].
After the step of finding the median it will be reordered to, say: [8, 7, 6, 10, 13, 16, 23, 14, 21]; notice that the array is not sorted, but the median (13) is still exactly in the middle.
Now let's do the pair substitution that got you confused: we create a new array of pairs: [<abs(14-13), 14>, <abs(6-13), 6>, <abs(7-13), 7>, <abs(8-13), 8>, <abs(10-13), 10>, <abs(13-13), 13>, <abs(21-13), 21>, <abs(16-13), 16>, <abs(23-13), 23>]. Thus we obtain the array: [<1, 14>, <7, 6>, <6, 7>, <5, 8>, <3, 10>, <0, 13>, <8, 21>, <3, 16>, <10, 23>].
If e.g. k is 4, we apply nth_element once more (using the first element of each pair for comparisons) and obtain: [<1, 14>, <3, 16>, <0, 13>, <3, 10>, <8, 21>, <7, 6>, <10, 23>, <6, 7>, <5, 8>], so the numbers you are searching for are the second elements of the first 4 pairs: 14, 16, 13 and 10.
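The same two-phase selection can be sketched in Python, with a throwaway quickselect standing in for C++'s nth_element (expected O(n); the lower median is used for even lengths here, a simplification of the question's rounding rule):

```python
import random

def quickselect(items, k):
    """k-th smallest (0-indexed) element of a list of distinct values, expected O(n)."""
    pivot = random.choice(items)
    lt = [x for x in items if x < pivot]
    eq = [x for x in items if x == pivot]
    gt = [x for x in items if x > pivot]
    if k < len(lt):
        return quickselect(lt, k)
    if k < len(lt) + len(eq):
        return pivot
    return quickselect(gt, k - len(lt) - len(eq))

def k_nearest_to_median(a, k):
    median = quickselect(a, (len(a) - 1) // 2)
    pairs = [(abs(x - median), x) for x in a]   # <distance, value> pairs
    kth = quickselect(pairs, k - 1)             # k-th pair by distance
    return [x for d, x in pairs if (d, x) <= kth]
```

On the example array with k = 4 this returns 10, 13, 14 and 16 (in some order), matching the walkthrough above.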
