How do I search for most similar sequence in Excel?

How do I search for most similar sequence in Excel? - arrays

I'm hoping to search an excel column for the sequence in it most similar to a sequence I enter.
For instance, in the following example, the sequence I provide is: 1, 2.5, 3.5, 2.5, 1. It's depicted on the following graph as black.
In the column I'm searching, there are a few sequences. The most similar one to mine is colored blue. It goes: 1, 2, 3, 2, 1.
Graph
Do any of you know an excel formula, or series of formulas and steps, that would allow Excel to determine this -- so that when I enter the black sequence, for instance, it will match it with the blue sequence as the most similar one?
Thanks tothis Stack overflow answer, I already know how to search a set of numbers for an exact sequence by using the following formula:
=MATCH([Criteria 1]&[Criteria 2],[Data 1st val]:[Data last val]&[Data 2nd val]:[Data last + 1 val],0)
For instance, if I have the following numbers: 1, 3, 5, 1, 4, and I am hoping to find the sequence, 1, 4, this formula will direct me towards it in that set of numbers.
I ALSO already know how to find the closest match to a number I enter, using this formula (which will make more sense if you look in the example image below): =INDEX($A$1:$A$10,MATCH(MIN(ABS(C1-B1:B10)),ABS(C1-$B$1:$B$10),0))
Example
When I press control+shift+enter, this formula will produce the number 4, indicating row 4, because the number I entered in C1, which was 39, is closest to the number 40, which is located in the 4th row.
So I have both the components -- finding exact sequences, and finding the closest number -- but now the question is, how do I combine these two formulas to show me the closest sequence of numbers, the one which would look most similar if drawn on a graph like in my first example with the blue and black line?
And bonus points if you can help find not only the closest sequence but the closest sequences in order of most similar to least similar.
And once again, I don't need this to be rolled into one formula; I am happy to go through a couple steps and different formulas manually to arrive at the answer.
And if you think this would be better solved in some other way, please let me know! But I do not have any coding experience so I figured Excel would be my best bet.
Thank you so much!!!

Not sure how you exactly have set this up, but if I visualize your graph in a table you could use the below (if one has Microsoft365):
Formula in H2:
=INDEX(SORTBY(B2:F4,MMULT(ABS(B2:F4-B1:F1),SEQUENCE(5,,,0))),1)
With all your data in a single column, below you can find an example for if you'd have sequences of 5.
Formula in C2:
=TRANSPOSE(INDEX(SORTBY(INDEX(A2:A16,SEQUENCE(11,5)-ROUNDDOWN(SEQUENCE(11,5,0,0.2),0)*4),MMULT(ABS(INDEX(A2:A16,SEQUENCE(11,5)-ROUNDDOWN(SEQUENCE(11,5,0,0.2),0)*4)-TRANSPOSE(B2:B6)),SEQUENCE(5,,,0))),1))
If you would want to make this applicable for your dataset from A1:A500 with sequence of 10 numbers:
=TRANSPOSE(INDEX(SORTBY(INDEX(A1:A500,SEQUENCE(COUNT(A1:A500)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(A1:A500)-9,10,0,0.1),0)*9),MMULT(ABS(INDEX(A1:A500,SEQUENCE(COUNT(A1:A500)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(A1:A500)-9,10,0,0.1),0)*9)-TRANSPOSE(B1:B10)),SEQUENCE(10,,,0))),1))
And if will be even better if you had acces to LET() and it will be a piece of cake to just change the range reference:
=LET(X,A2:A500,Y,INDEX(X,SEQUENCE(COUNT(X)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(X)-9,10,0,0.1),0)*9),TRANSPOSE(INDEX(SORTBY(Y,MMULT(ABS(Y-TRANSPOSE(B2:B11)),SEQUENCE(10,,,0))),1)))
EDIT2:
To make it more dynamic you can use:
=LET(W,1,X,A2:A500,Y,11,Z,INDEX(X,SEQUENCE(COUNT(X)-(Y-1),Y)-ROUNDDOWN(SEQUENCE(COUNT(X)-(Y-1),Y,0,1/Y),0)*(Y-1)),TRANSPOSE(INDEX(SORTBY(Z,MMULT(ABS(Z-TRANSPOSE(B2:INDEX(B:B,Y+1))),SEQUENCE(Y,,,0))),W)))
Where "W" is the nth closest match and where "Y" is the length of the sequence, 11 in the example.

My approach would be to calculate a match-value between each color and the input values, like the sum of the differences for each point.
The formula for this is:
=SUM(IF([inputrange]<>"",ABS([inputrange]-[colorrange]),0))
Where [inputrange] is the range of your input (indicated red in the picture below, $C$6:$G$6) and [colorrange] is the range of that color (indicated blue, C2:G2).
The color with the lowest difference is the match:
=VLOOKUP(MIN([matchvalues],[rangeofmatchandcolors],2,0)
Where [matchvalues] is the range of match values (indicated blue in the picture below, Cells A2:A4) and [rangeofmatchandcolors] is both the match values as well as the colors (indicated red, A2:B4)

Related

Multiple IF QUARTILEs returning wrong values

I am using a nested IF statement within a Quartile wrapper, and it only kind of works, for the most part because it's returning values that are slightly off from what I would have expected if I calculate the range of values manually.
I've looked around but most of the posts and research is about designing the fomrula, I haven't come across anything compelling in terms of this odd behaviour I'm observing.
My formula (ctrl+shift enter as it's an array): =QUARTILE(IF(((F2:$F$10=$W$4)($Q$2:$Q$10=$W$3))($E$2:$E$10=W$2),IF($O$2:$O$10<>"",$O$2:$O$10)),1)
The full dataset:
0.868997877*
0.99480118
0.867040346*
0.914032128*
0.988150438
0.981207615*
0.986629288
0.984750004*
0.988983643*
*The formula has 3 AND conditions that need to be met and should return range:
0.868997877
0.867040346
0.914032128
0.981207615
0.984750004
0.988983643
At which 25% is calculated based on the range.
If I take the output from the formula, 25%-ile (QUARTILE,1) is 0.8803, but if I calculate it manually based on the data points right above, it comes out to 0.8685 and I can't see why.
I feel it's because the IF statements identifies slight off range but the values that meet the IF statements are different rows or something.

If you look at the table here you can see that there is more than one way of estimating quartile (or other percentile) from a sample and Excel has two. The one you are doing by hand must be like Quartile.exc and the one you are using in the formula is like Quartile.inc
Basically both formulas work out the rank of the quartile value. If it isn't an integer it interpolates (e.g. if it was 1.5, that means the quartile lies half way between the first and second numbers in ascending order). You might think that there wouldn't be much difference, but for small samples there is a massive difference:
Quartile.exc Rank=(N+1)/4
Quartile.inc Rank=(N+3)/4
Here's how it would look with your data

Trying to returned the 2nd last position of a matched value

As it is really difficult to make understand the problem. I have uploaded the pic. It might help you to understand my problem. Please help me to solve it out. Thanks in advance

I cannot reconcile why you would want to have the pattern in column D. It seems to be 'look up the first position for the first two matches then progressive positions for the remainder of any matches'.
Here is the solution to your problem.
In D1, as a standard formula,
=AGGREGATE(15, 6, ROW($1:$12)/(B$1:B$12=B1), COUNTIF(B$1:B1, B1)-(COUNTIF(B$1:B1, B1)>1))
'to retrieve the values from column A as,
=INDEX(A:A AGGREGATE(15, 6, ROW($1:$12)/(B$1:B$12=B1), COUNTIF(B$1:B1, B1)-(COUNTIF(B$1:B1, B1)>1)))
However, I strongly suspect that it should be closer to this (as in G1),
=AGGREGATE(15, 6, ROW($1:$12)/(B$1:B$12=B1), COUNTIF(B$1:B1, B1)-(COUNTIF(B$1:B1, B1)>1))
'to retrieve the values from column A as,
=INDEX(A:A, AGGREGATE(15, 6, ROW($1:$12)/(B$1:B$12=B1), COUNTIF(B$1:B1, B1)-(COUNTIF(B$1:B1, B1)>1)))

sequences with the same order in an array - Identify sequences

I'm looking for a hint towards a solution of the problem:
Suppose there's an array with some numbers in ascending order and some in descending, for example [1,2,5,9,6,3,2,4,7,8] has sequences asc [1,2,5,9], desc [(9),6,3,2], asc [(2),4,7,8].
Now this isn't a problem, I could simply loop through an array and add them to some data structure, and when the direction changes - I store this structure somwhere and start filling next one.
What I've found tricky is if I want to have threshold of some sort. For example: [0,50,100,99,98,97,105,160]
So the sequence in descending order [(100), 99, 98, 97] could be neglected, because overall change is -3, whereas the sequence was increasing much more dramatically (+100) and as a result, the algorithm identifies only one sequence in ascending order.
I have tried the same method as above, simply adding all sequences in a data structure and then comparing the change in values of two consequtive items: (100 vs -3 means -3 can be neglected). But then the problem is if I have say this situation:
(example only in change of values from start to end of sequense)
[+100, -3, +1, -50]
in this situation I cannot neglect descending movement, because the numbers start to descend, then slightly ascend and again go down pretty significantly.
and it gets really confusing with stuff like that:
[+100, -3, +3, -3, +3, -50]
this is quick sketch of representation of what I am trying to achieve:
black lines represent initial data in an array, red thin lines are desired resulting output
Could somebody point me out in right direction? How would I approach this situation? Compare multiple sequences at a time slowly combining sequences together? Maybe I would need to go through sequences multiple times?
I'm not sure If I've come across problem like that and don't know working algorithm. This is a problem I've faced myself trying to analyse some data.

If I understand correctly, you expect your curve to be a succession of alternatively increasing and decreasing sequences, with a bit of added noise.
The usual way to get rid of noise is to filter data. There are millions of ways to do that, most of them requiring frequency analysis, but in your case you could probably get good enough results with something simple.
The main point is that the relevant variable is not the values in the array, but their variations.
Given N values, consider the array of N-1 elements holding the differences between two consecutive values.
[0,50,100,99,98,97,105,160] -> 50,100,-1,-1,-1,6,45
Now eliminate all values whose absolute value is below a given threshold (say 10 for instance)
-> 50,100,0,0,0,0,45
you can then detect a rising sequence by looking at streaks of all positive or null values (and the same for decreasing sequences, considering zero or negative values).
As for all filtering processes, you will have to find a sweet spot for your threshold. Too low and it will fail to eliminate insignificant variations, too high and it will wipe out significant slope inversions.

I don't know if I understand your problem correctly, but I had to do this kind of dimensionality reduction many times before, so I wrote a small javascript library to do so. It uses the Perceptually Important Points algorithm.
In the algorithm you can define a custom metric of the distance between three consecutive points (to measure how much a single point adds in entropy).
Here is a demonstration (in JS). It works kind like a heap, where you remove points that do not contribute so much to the overall entropy:
for(var i=0; i<data.length; i++)
heap.add(data[i]);
while(heap.minValue() < threshold)
heap.removeMin();
And here is the library.

Test if a column is a superset of predefined set of data

I'm trying to compare teams' compositions to known configurations in order to see where I might have a problem :
The trials columns are to be compare against the differents scenarios to see if a column is a superset of a particular scenario (error being default).
Can it be done using index+match/lookup, or do I have to write some VB macro ?
EDIT : I've updated the question with a worksheet with input data.
Worksheet : https://drive.google.com/file/d/0BxwDbXStIEAsUmpONHp1RVRzR2s/edit?usp=sharing
Github Gist : https://gist.github.com/lucasg/11177852 (python script for data gen)
(xlwt module needed to create excel workbooks).
I've simplified the problem using soccer teams : given 7 positions ( 1 goalie, 2 defenders, 2 midfield and 2 forward) and list of presence to certains week-end, I would like to know whether I'm gonna be able to provide a full team or am I to forfeit the match due to lack of key-players.
The positions :
styles = {
"Goalkeeper" : ["Goalkeeper"],
"Defender" : ["Centre back", "Wing"],
"Midfielder" : ["Centre midfield", "Wide"],
"Forward" : ["Centre forward","Winger"]
}
Most football players can play only one position, but some are more versatile and can play any positions in their own field (defense-midfield-attack).
Example of a team (18 pers.):
example_players = {
"Forward": [
[1, "Winger"],
[2, "Winger"],
[3, "Centre forward"],
[4, "Centre forward"]
],
"Defender": [
[5, "Centre back"],
[6, "Centre back", "Wing"],
[7, "Centre back", "Wing"],
[8, "Wing"],
[9, "Centre back"]
],
"Goalkeeper": [
[10, "Goalkeeper"],
[11, "Goalkeeper"]
],
"Midfielder": [
[12, "Centre midfield"],
[13, "Centre midfield"],
[14, "Wide", "Centre midfield"],
[15, "Centre midfield"],
[16, "Centre midfield"],
[17, "Wide", "Centre midfield"],
[18, "Wide", "Centre midfield"]
]
}
To make it more simple, I need at least one person in each zone (goal-def-mid-attack) to be able to play, the most comfortable situation being one person in each of the 7 positions.
ex scenario :
"no_defense_4" : ["Goalkeeper", "Wide", "Winger" ] ,
"no_attack_1" : ["Goalkeeper", "Centre midfield", "Centre back", ] ,
Now, given a list of a hundred weekends, and the list of the presence/abscence of players, I want to know the resulting situation.
I'm looking preferentially for a formula-based solution, since the worksheet will be uploaded and used in google drive

You can represent sets as bit vectors and then use bit operators "equal" or "AND" to test which sets get matched. Using bit vectors as set representation will solve problem of ordering and duplicate values automatically as position of each value in the bit vector is fixed and each bit will be "set" only once, regardless of how many times the value appears in the column that defines the set.
Simple to use bit vector representation in Excel including operators OR, AND, NOT is listed here: http://chandoo.org/wp/2011/07/29/bitwise-operations-in-excel/#comment-207723
For example following function
=POWER(10;0)*MIN(COUNTIF($B$3:$B$12;"T1");1)+POWER(10;1)*MIN(COUNTIF($B$3:$B$12;"T2");1)+POWER(10;2)*MIN(COUNTIF($B$3:$B$12;"S");1)+POWER(10;3)*MIN(COUNTIF($B$3:$B$12;"PL");1)+POWER(10;4)*MIN(COUNTIF($B$3:$B$12;"CC");1)+POWER(10;5)*MIN(COUNTIF($B$3:$B$12;"GC");1)
Converts values in the range $B$3->$B$12 into a bit vector representation having bits 0..5 defined so that the bit is set if the value in any column in the range is equal to:
bit 0 = T1
bit 1 = T2
bit 2 = S
bit 3 = PL
bit 4 = CC
bit 5 = GC
You can add more bits with other values easily by following the same copy/paste pattern.
So to check if certain column matches certain scenario, just compare the bit vectors. Use expression like IF(x=y;"warn2";IF(..)) and substitute bit vector of the column for x and bit vector of the warn2 scenario for y.
If partial matching is needed, you can use the bitwise AND operator as defined in the above article.
This solution as opposed to a VBA-based solution will require some copy/pasting discipline, e.g. when new trial column or new scenario will be added few expressions will have to be copy/pasted and few will have to be updated.
VBA-based solution might solve this maintenance problem automatically for you by using auto-detected CurrentRegions, all necessary logic hidden behind one macro-click.
EDIT: The bit vectors concept applied to the new soccer teams dataset
Worksheet: https://docs.google.com/spreadsheet/ccc?key=0AtZPnBk7a3pvdHcyWDV6ZFFoUTNyWWF0bjl3VFpaRkE&usp=drive_web#gid=0
As it is ambiguous what will be the exact team setup on a given day as one player may be assigned different positions, I have simplified the problem in such a way that instead of "present" or "absent" I expect the table to contain player's position. It should not be a problem to achieve as if you know what positions the player can play then instead of absent,present you can define the set of valid values to be (empty or anything else),Midfielder,Centre midfield,Wide for players 14,17,18. List of valid available values can be configured for each cell using the "Data validation" rules. The abstract role Midfielder stands for "this player can play a midfielder, exact position is not known yet".
To represent positions I use bit vector calculated with this formula
=POWER(10;6)*MIN(COUNTIF(D2:ZZ2;"Goalkeeper");1)+POWER(10;5)*MIN(COUNTIF(D2:ZZ2;"Centre back");1)+POWER(10;4)*MIN(COUNTIF(D2:ZZ2;"Wing");1)+POWER(10;3)*MIN(COUNTIF(D2:ZZ2;"Centre midfield");1)+POWER(10;2)*MIN(COUNTIF(D2:ZZ2;"Wide");1)+POWER(10;1)*MIN(COUNTIF(D2:ZZ2;"Centre forward");1)+POWER(10;0)*MIN(COUNTIF(D2:ZZ2;"Winger");1)
the formula calculates bit vector from a range D2:ZZ2 in such a way so that each position in the range is counted only once and in final vector each position has a fixed place. It is useful to set number format of the vector to custom numeric format 0000000. For example a row containing Wide,Winger,Goalkeeper in any order with any number of repeats will evaluate to vector 1000101 where the left-most bit 6 stands for Goalkeeper and 2nd from the right goes bit 2 standing for Wide. The most comfortable situation is the one with bit vector evaluating to 1111111. The only purpose of this bit vector is to detect the comfortable situation
For matching scenarios to team setups I use another vector composed of 4 digits with this meaning:
leftmost digit 3 - number of goalies (at most 1 counts)
digit 2 - number of defenders (at most 2 counts)
digit 1 - number of midfielders (at most 2 counts)
rightmost digit 0 - number of forwards (at most 2 counts)
The formula to calculate this vector for range D2:ZZ2 looks like this
=POWER(10;3)*MIN(COUNTIF(D2:ZZ2;"Goalkeeper");1)+POWER(10;2)*MIN(COUNTIF(D2:ZZ2;"Defender")+COUNTIF(D2:ZZ2;"Centre back")+COUNTIF(D2:ZZ2;"Wing");2)+POWER(10;1)*MIN(COUNTIF(D2:ZZ2;"Midfielder")+COUNTIF(D2:ZZ2;"Centre midfield")+COUNTIF(D2:ZZ2;"Wide");2)+POWER(10;0)*MIN(COUNTIF(D2:ZZ2;"Forward")+COUNTIF(D2:ZZ2;"Centre forward")+COUNTIF(D2:ZZ2;"Winger");2)
It is useful to set number format of the vector to custom numeric format 0000. This same formula can calculate the 4-digit vector for team setup and for scenario.
Besides position names it can count also abstract position names like Defender.
For example in a row containing Centre back,Centre back,Goalkeeper,Goalkeeper,Goalkeeper,Defender,Defender,Midfielder,Midfielder,Winger the vector looks like 1221.
There are (1+1)*(2+1)*(2+1)*(2+1) = 54 different possible scenarios. I assume each of them is listed in the constraints sheet. You should be able to generate them all in python quite easily.
There are 2 sheets constraints with scenarios and events with days and team setups. The lookup formula that takes the vector calculated for a team setup in row #2 and searches the constraints sheet for a row with exactly the same vector and returns the value from the value column looks like this
=IFERROR(VLOOKUP($A2;constraints!$A:$B;2;FALSE);"?")
$A2 - contains the 4-digit vector formula for the team setup
constraints!$A - column in the sheet with scenarios containing the 4-digit vector formula for the scenario
constraints!$B - column in the sheet with scenarios containing the scenario name - the thing you are looking for
2 - index of column constraints!$B
FALSE - means the lookup column does not have to be sorted
? - fallback value if no matching scenario was found (should not occur)
The Google docs link above contains the formulas, example 3 days and example 11 scenarios.
If there's something unclear let me know and I'll improve the answer as the Google docs link will vanish some day

How to extract specific parts of an equation in MuPAD or Maple

I have MuPAD and Maple and I would like to do the following with one of those softwares:
I have an equation containing several cosines with different amplitudes and different arguments as depictetd in the picture below in the first (blue) row.
I want to extract only those cosines which contain at least the argument "+at-bt" (so "+at-bt+alpha" is OK, too) - see second (blue row).
I want to display the summ of amplitudes of this specific cosines - see third (red) row.
The second picture shows a real example.

Let's say that your long expression is named expr. Then do this
TypeTools:-AddType(
MyCos,
cos(satisfies(x-> x::`+` and {a*t, -b*t} subset {op(x)} or x = b*t-a*t))
):
subex:= select(T-> T::MyCos or T::`*` and membertype(MyCos, {op(T)}), expr);
Now subex is your desired subexpression. If you want to add up the coefficients, then simply do eval(subex, cos= 1).
Note that this will not find partially factored arguments like (a-b)*t+alpha. If you need to find these, let me know.