I'm hoping to search an Excel column for the sequence in it most similar to a sequence I enter.
For instance, in the following example, the sequence I provide is: 1, 2.5, 3.5, 2.5, 1. It's depicted on the following graph as black.
In the column I'm searching, there are a few sequences. The most similar one to mine is colored blue. It goes: 1, 2, 3, 2, 1.
Graph
Do any of you know an Excel formula, or a series of formulas and steps, that would allow Excel to determine this -- so that when I enter the black sequence, for instance, it will match it with the blue sequence as the most similar one?
Thanks to this Stack Overflow answer, I already know how to search a set of numbers for an exact sequence by using the following formula:
=MATCH([Criteria 1]&[Criteria 2],[Data 1st val]:[Data last val]&[Data 2nd val]:[Data last + 1 val],0)
For instance, if I have the following numbers: 1, 3, 5, 1, 4, and I am hoping to find the sequence, 1, 4, this formula will direct me towards it in that set of numbers.
I ALSO already know how to find the closest match to a number I enter, using this formula (which will make more sense if you look at the example image below): =INDEX($A$1:$A$10,MATCH(MIN(ABS(C1-B1:B10)),ABS(C1-$B$1:$B$10),0))
Example
When I press Ctrl+Shift+Enter, this formula will produce the number 4, indicating row 4, because the number I entered in C1, which was 39, is closest to the number 40, which is located in the 4th row.
So I have both the components -- finding exact sequences, and finding the closest number -- but now the question is, how do I combine these two formulas to show me the closest sequence of numbers, the one which would look most similar if drawn on a graph like in my first example with the blue and black line?
And bonus points if you can help find not only the closest sequence but the closest sequences in order of most similar to least similar.
And once again, I don't need this to be rolled into one formula; I am happy to go through a couple steps and different formulas manually to arrive at the answer.
And if you think this would be better solved in some other way, please let me know! But I do not have any coding experience so I figured Excel would be my best bet.
Thank you so much!!!
I'm not sure exactly how you have set this up, but if I visualize your graph in a table, you could use the below (if you have Microsoft 365):
Formula in H2:
=INDEX(SORTBY(B2:F4,MMULT(ABS(B2:F4-B1:F1),SEQUENCE(5,,,0))),1)
With all your data in a single column, below is an example for sequences of 5.
Formula in C2:
=TRANSPOSE(INDEX(SORTBY(INDEX(A2:A16,SEQUENCE(11,5)-ROUNDDOWN(SEQUENCE(11,5,0,0.2),0)*4),MMULT(ABS(INDEX(A2:A16,SEQUENCE(11,5)-ROUNDDOWN(SEQUENCE(11,5,0,0.2),0)*4)-TRANSPOSE(B2:B6)),SEQUENCE(5,,,0))),1))
If you want to make this applicable to your dataset from A1:A500 with a sequence of 10 numbers:
=TRANSPOSE(INDEX(SORTBY(INDEX(A1:A500,SEQUENCE(COUNT(A1:A500)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(A1:A500)-9,10,0,0.1),0)*9),MMULT(ABS(INDEX(A1:A500,SEQUENCE(COUNT(A1:A500)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(A1:A500)-9,10,0,0.1),0)*9)-TRANSPOSE(B1:B10)),SEQUENCE(10,,,0))),1))
And it will be even better if you have access to LET(); then it's a piece of cake to just change the range reference:
=LET(X,A2:A500,Y,INDEX(X,SEQUENCE(COUNT(X)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(X)-9,10,0,0.1),0)*9),TRANSPOSE(INDEX(SORTBY(Y,MMULT(ABS(Y-TRANSPOSE(B2:B11)),SEQUENCE(10,,,0))),1)))
EDIT2:
To make it more dynamic you can use:
=LET(W,1,X,A2:A500,Y,11,Z,INDEX(X,SEQUENCE(COUNT(X)-(Y-1),Y)-ROUNDDOWN(SEQUENCE(COUNT(X)-(Y-1),Y,0,1/Y),0)*(Y-1)),TRANSPOSE(INDEX(SORTBY(Z,MMULT(ABS(Z-TRANSPOSE(B2:INDEX(B:B,Y+1))),SEQUENCE(Y,,,0))),W)))
Where "W" is the nth closest match and where "Y" is the length of the sequence, 11 in the example.
My approach would be to calculate a match value between each color and the input values, such as the sum of the absolute differences at each point.
The formula for this is:
=SUM(IF([inputrange]<>"",ABS([inputrange]-[colorrange]),0))
Where [inputrange] is the range of your input (indicated red in the picture below, $C$6:$G$6) and [colorrange] is the range of that color (indicated blue, C2:G2).
The color with the lowest difference is the match:
=VLOOKUP(MIN([matchvalues]),[rangeofmatchandcolors],2,0)
Where [matchvalues] is the range of match values (indicated blue in the picture below, cells A2:A4) and [rangeofmatchandcolors] is both the match values and the colors (indicated red, A2:B4).
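The same two steps as a hedged Python sketch (the row labels and numbers are illustrative, not the poster's sheet): compute a match value per color, then look up the color with the smallest one, which is what the VLOOKUP(MIN(...)) does.

# Score each labeled row, then pick the label with the minimum score.
rows = {"blue": [1, 2, 3, 2, 1], "red": [5, 4, 3, 2, 1], "green": [2, 2, 2, 2, 2]}
query = [1, 2.5, 3.5, 2.5, 1]

scores = {name: sum(abs(a - b) for a, b in zip(vals, query))
          for name, vals in rows.items()}
best = min(scores, key=scores.get)  # the VLOOKUP(MIN(...)) step
print(best, scores[best])           # blue 1.5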
I'm trying to compare teams' compositions to known configurations in order to see where I might have a problem:
The trial columns are to be compared against the different scenarios to see if a column is a superset of a particular scenario (error being the default).
Can it be done using INDEX+MATCH/LOOKUP, or do I have to write a VBA macro?
EDIT : I've updated the question with a worksheet with input data.
Worksheet : https://drive.google.com/file/d/0BxwDbXStIEAsUmpONHp1RVRzR2s/edit?usp=sharing
Github Gist : https://gist.github.com/lucasg/11177852 (python script for data gen)
(The xlwt module is needed to create Excel workbooks.)
I've simplified the problem using soccer teams: given 7 positions (1 goalie, 2 defenders, 2 midfielders, and 2 forwards) and a list of who is present on certain weekends, I would like to know whether I'm going to be able to field a full team or will have to forfeit the match due to a lack of key players.
The positions:
styles = {
"Goalkeeper" : ["Goalkeeper"],
"Defender" : ["Centre back", "Wing"],
"Midfielder" : ["Centre midfield", "Wide"],
"Forward" : ["Centre forward","Winger"]
}
Most football players can play only one position, but some are more versatile and can play any position in their own zone (defense, midfield, or attack).
Example of a team (18 players):
example_players = {
"Forward": [
[1, "Winger"],
[2, "Winger"],
[3, "Centre forward"],
[4, "Centre forward"]
],
"Defender": [
[5, "Centre back"],
[6, "Centre back", "Wing"],
[7, "Centre back", "Wing"],
[8, "Wing"],
[9, "Centre back"]
],
"Goalkeeper": [
[10, "Goalkeeper"],
[11, "Goalkeeper"]
],
"Midfielder": [
[12, "Centre midfield"],
[13, "Centre midfield"],
[14, "Wide", "Centre midfield"],
[15, "Centre midfield"],
[16, "Centre midfield"],
[17, "Wide", "Centre midfield"],
[18, "Wide", "Centre midfield"]
]
}
To keep it simple, I need at least one person in each zone (goal, defense, midfield, attack) to be able to play; the most comfortable situation is one person in each of the 7 positions.
Example scenarios:
"no_defense_4" : ["Goalkeeper", "Wide", "Winger" ] ,
"no_attack_1" : ["Goalkeeper", "Centre midfield", "Centre back", ] ,
Now, given a list of a hundred weekends and the presence/absence of each player, I want to know the resulting situation.
I'm looking preferentially for a formula-based solution, since the worksheet will be uploaded to and used in Google Drive.
You can represent sets as bit vectors and then use the bit operators "equal" or "AND" to test which sets match. Using bit vectors as the set representation automatically solves the problems of ordering and duplicate values, since the position of each value in the bit vector is fixed and each bit is set only once, regardless of how many times the value appears in the column that defines the set.
A simple-to-use bit vector representation in Excel, including the operators OR, AND, and NOT, is listed here: http://chandoo.org/wp/2011/07/29/bitwise-operations-in-excel/#comment-207723
For example, the following formula
=POWER(10;0)*MIN(COUNTIF($B$3:$B$12;"T1");1)+POWER(10;1)*MIN(COUNTIF($B$3:$B$12;"T2");1)+POWER(10;2)*MIN(COUNTIF($B$3:$B$12;"S");1)+POWER(10;3)*MIN(COUNTIF($B$3:$B$12;"PL");1)+POWER(10;4)*MIN(COUNTIF($B$3:$B$12;"CC");1)+POWER(10;5)*MIN(COUNTIF($B$3:$B$12;"GC");1)
converts the values in the range $B$3:$B$12 into a bit vector representation with bits 0..5 defined so that a bit is set if any cell in the range equals:
bit 0 = T1
bit 1 = T2
bit 2 = S
bit 3 = PL
bit 4 = CC
bit 5 = GC
You can add more bits with other values easily by following the same copy/paste pattern.
So to check whether a certain column matches a certain scenario, just compare the bit vectors. Use an expression like IF(x=y;"warn2";IF(..)), substituting the bit vector of the column for x and the bit vector of the warn2 scenario for y.
If partial matching is needed, you can use the bitwise AND operator as defined in the above article.
This solution, as opposed to a VBA-based one, requires some copy/paste discipline: e.g., when a new trial column or a new scenario is added, a few expressions will have to be copy/pasted and a few updated.
A VBA-based solution could handle this maintenance automatically, using auto-detected CurrentRegions, with all the necessary logic hidden behind one macro click.
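To make the mechanics concrete, here is a rough Python analogue of the decimal bit vector idea, using the same six tokens as above (the trial and scenario rows are made up):

# Each token owns one fixed decimal digit; duplicates count only once,
# mirroring the MIN(COUNTIF(...);1) pattern of the formula above.
TOKENS = ["T1", "T2", "S", "PL", "CC", "GC"]

def bit_vector(column):
    return sum(10**i * min(column.count(t), 1) for i, t in enumerate(TOKENS))

trial = ["T1", "S", "S", "GC"]
scenario = ["T1", "S", "GC"]
print(bit_vector(trial) == bit_vector(scenario))  # True: order and repeats are ignored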
EDIT: The bit vectors concept applied to the new soccer teams dataset
Worksheet: https://docs.google.com/spreadsheet/ccc?key=0AtZPnBk7a3pvdHcyWDV6ZFFoUTNyWWF0bjl3VFpaRkE&usp=drive_web#gid=0
As it is ambiguous what the exact team setup will be on a given day (one player may be assigned different positions), I have simplified the problem so that instead of "present" or "absent" I expect the table to contain the player's position. This should not be hard to achieve: if you know which positions a player can play, then instead of absent/present you can define the set of valid values, e.g. (empty or anything else), Midfielder, Centre midfield, Wide for players 14, 17, and 18. The list of valid values can be configured for each cell using "Data validation" rules. The abstract role Midfielder stands for "this player can play midfielder, exact position not known yet".
To represent positions I use a bit vector calculated with this formula:
=POWER(10;6)*MIN(COUNTIF(D2:ZZ2;"Goalkeeper");1)+POWER(10;5)*MIN(COUNTIF(D2:ZZ2;"Centre back");1)+POWER(10;4)*MIN(COUNTIF(D2:ZZ2;"Wing");1)+POWER(10;3)*MIN(COUNTIF(D2:ZZ2;"Centre midfield");1)+POWER(10;2)*MIN(COUNTIF(D2:ZZ2;"Wide");1)+POWER(10;1)*MIN(COUNTIF(D2:ZZ2;"Centre forward");1)+POWER(10;0)*MIN(COUNTIF(D2:ZZ2;"Winger");1)
The formula calculates a bit vector from the range D2:ZZ2 in such a way that each position in the range is counted only once and each position has a fixed place in the final vector. It is useful to set the number format of the vector to the custom numeric format 0000000. For example, a row containing Wide, Winger, Goalkeeper in any order, with any number of repeats, evaluates to the vector 1000101, where the left-most bit 6 stands for Goalkeeper and the third digit from the right is bit 2, standing for Wide. The most comfortable situation is the one whose bit vector evaluates to 1111111. The only purpose of this bit vector is to detect the comfortable situation.
For matching scenarios to team setups I use another vector, composed of 4 digits with this meaning:
leftmost digit 3 - number of goalies (at most 1 counts)
digit 2 - number of defenders (at most 2 counts)
digit 1 - number of midfielders (at most 2 counts)
rightmost digit 0 - number of forwards (at most 2 counts)
The formula to calculate this vector for the range D2:ZZ2 looks like this:
=POWER(10;3)*MIN(COUNTIF(D2:ZZ2;"Goalkeeper");1)+POWER(10;2)*MIN(COUNTIF(D2:ZZ2;"Defender")+COUNTIF(D2:ZZ2;"Centre back")+COUNTIF(D2:ZZ2;"Wing");2)+POWER(10;1)*MIN(COUNTIF(D2:ZZ2;"Midfielder")+COUNTIF(D2:ZZ2;"Centre midfield")+COUNTIF(D2:ZZ2;"Wide");2)+POWER(10;0)*MIN(COUNTIF(D2:ZZ2;"Forward")+COUNTIF(D2:ZZ2;"Centre forward")+COUNTIF(D2:ZZ2;"Winger");2)
It is useful to set the number format of the vector to the custom numeric format 0000. The same formula can calculate the 4-digit vector both for a team setup and for a scenario.
Besides concrete position names, it also counts abstract position names like Defender.
For example, for a row containing Centre back, Centre back, Goalkeeper, Goalkeeper, Goalkeeper, Defender, Defender, Midfielder, Midfielder, Winger, the vector is 1221.
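The same counting logic, sketched in Python for clarity (the zone names and caps mirror the COUNTIF formula above; this is an illustration, not part of the worksheet):

# One decimal digit per zone, capped at the zone's maximum useful head count.
ZONES = [  # (digit weight, cap, names counting toward the zone)
    (1000, 1, ["Goalkeeper"]),
    (100, 2, ["Defender", "Centre back", "Wing"]),
    (10, 2, ["Midfielder", "Centre midfield", "Wide"]),
    (1, 2, ["Forward", "Centre forward", "Winger"]),
]

def setup_vector(row):
    return sum(weight * min(sum(row.count(n) for n in names), cap)
               for weight, cap, names in ZONES)

row = ["Centre back", "Centre back", "Goalkeeper", "Goalkeeper", "Goalkeeper",
       "Defender", "Defender", "Midfielder", "Midfielder", "Winger"]
print(f"{setup_vector(row):04d}")  # 1221, as in the example above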
There are (1+1)*(2+1)*(2+1)*(2+1) = 54 different possible scenarios. I assume each of them is listed in the constraints sheet. You should be able to generate them all in Python quite easily.
There are two sheets: constraints, with the scenarios, and events, with the days and team setups. The lookup formula below takes the vector calculated for the team setup in row 2, searches the constraints sheet for a row with exactly the same vector, and returns the value from the value column:
=IFERROR(VLOOKUP($A2;constraints!$A:$B;2;FALSE);"?")
$A2 - contains the 4-digit vector formula for the team setup
constraints!$A - column in the sheet with scenarios containing the 4-digit vector formula for the scenario
constraints!$B - column in the sheet with scenarios containing the scenario name - the thing you are looking for
2 - index of column constraints!$B
FALSE - means the lookup column does not have to be sorted
? - fallback value if no matching scenario was found (should not occur)
The Google Docs link above contains the formulas, an example with 3 days, and 11 example scenarios.
If something is unclear, let me know and I'll improve the answer, as the Google Docs link will vanish some day.
I am implementing a VM compiler and, naturally, I've come to the point of implementing switches. Also naturally, for short switches a sequential lookup array is optimal, but what about bigger switches?
So far I've come up with a data structure that gives me a pretty good lookup time. I don't know the name of the structure, but it is similar to a binary tree, except monolithic, with the difference that it only applies to a static set of integers; you cannot add or remove elements. It looks like a table where values increase upward and to the right. Here is an example:
Integers -89, -82, -72, -68, -65, -48, -5, 0, 1, 3, 7, 18, 27, 29, 32, 37, 38, 42, 45, 54, 76, 78, 87, 89, 92
and the table:
-65 3 32 54 92
-68 1 29 45 89
-82 -5 18 38 78
-89 -48 7 37 76
This gives me a worst case of width + height iterations. Say the case is 37: -65 is less than 37, so move right; same for 3, move right; same for 32, move right; 54 is bigger, so move down (stepping a whole width, since it is a sequential array anyway); 45 is bigger, so move down; 38 is bigger, so move down; and there we have 37, in 7 hops.
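For reference, the walk just described looks roughly like this in Python (the table is the one above; the hop counting is only for illustration):

# Start at the top-left corner: moving right increases values, moving down
# decreases them, so each comparison safely discards a row or a column.
table = [
    [-65, 3, 32, 54, 92],
    [-68, 1, 29, 45, 89],
    [-82, -5, 18, 38, 78],
    [-89, -48, 7, 37, 76],
]

def find(case):
    row, col, hops = 0, 0, 0
    while row < len(table) and col < len(table[0]):
        hops += 1
        v = table[row][col]
        if v == case:
            return row, col, hops
        if v < case:
            col += 1   # value too small: move right
        else:
            row += 1   # value too big: move down
    return None        # unknown case: fall through to the default branch

print(find(37))  # (3, 3, 7): found in 7 hops, matching the example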
Is there any possible faster lookup algorithm?
Also, is there a name for this kind of arrangement? I came up with it on my own, but most likely someone else did before me, so it probably already has a name.
EDIT: OK, as far as I can tell, a "perfect hash" offers better THEORETICAL performance. But how will this play out in real life? If I understand correctly, a two-level "perfect hash" will be rather spread out instead of being a continuous block of memory, so while the theoretical complexity is lower, there is a potential penalty of tens if not hundreds of cycles before that memory is fetched. In contrast, a slower theoretical worst case may actually perform better just because it is more cache-friendly than a perfect hash... Or not?
When implementing switches among a diverse set of alternatives, you have several options:
Make several groups of flat lookup arrays. For example, if you see the numbers 1, 2, 3, 20000, 20001, 20002, you could do a single if to take you to the 1s or to the 20000s, and then employ two flat lookup arrays (see the sketch after this list).
Discover a pattern. For example, if you see the numbers 100, 200, 300, 400, 500, 600, divide the number by 100 and then go for a flat lookup array.
Make a hash table. Since all the numbers that you are hashing are known to you, you can play with the load factor of the table to make sure that the lookup is not going to take a lot of probing.
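A hedged sketch of the first option, in Python for brevity (the case values follow the example above; the handler names are made up):

# One branch splits the two dense groups; each group is a flat array
# indexed by (case - group base).
LOW_BASE, HIGH_BASE = 1, 20000
low = ["h1", "h2", "h3"]    # handlers for cases 1, 2, 3
high = ["h4", "h5", "h6"]   # handlers for cases 20000, 20001, 20002

def dispatch(case):
    if case < 10000:  # the single if picking the group
        return low[case - LOW_BASE] if LOW_BASE <= case < LOW_BASE + len(low) else "default"
    return high[case - HIGH_BASE] if HIGH_BASE <= case < HIGH_BASE + len(high) else "default"

print(dispatch(2), dispatch(20001), dispatch(42))  # h2 h5 default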
Your algorithm is similar to binary search, in the sense that it's from the "divide and conquer" family. Such algorithms have logarithmic time complexity, which may not be acceptable for switches, because they are expected to be O(1).
Is there any possible faster lookup algorithm?
Binary search is faster.
Binary search completes in log2(w*h) = log2(w) + log2(h) steps.
Your algorithm completes in w+h.
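In Python terms this is just a binary search over the sorted case values; the standard bisect module gives the log2(n) probes directly (the values are taken from the question):

import bisect

cases = [-89, -82, -72, -68, -65, -48, -5, 0, 1, 3, 7, 18, 27, 29, 32,
         37, 38, 42, 45, 54, 76, 78, 87, 89, 92]

def find(x):
    i = bisect.bisect_left(cases, x)   # at most about log2(25) ~ 5 probes
    return i if i < len(cases) and cases[i] == x else -1

print(find(37))  # 15: an index into a parallel table of jump targets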
So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar questions, but most people suggested using k-means to cluster points (e.g., with scipy), which is quite confusing for a beginner like me. Also, I think k-means is more suitable for clustering in two or more dimensions, right? Are there any ways to partition an array of N numbers into many partitions/clusters depending on the numbers?
Some people also suggest rigid range partitioning, but it doesn't always render the results as expected.
Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you might naively think, because you can actually sort it, which makes things a lot easier.
In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.
You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in density are good places to split the data into clusters, with statistical reasons to do so. KDE is maybe the most sound method for clustering 1-dimensional data.
With KDE, it again becomes obvious that 1-dimensional data is much better behaved. In 1D you have local minima, but in 2D you may have saddle points and such "maybe" splitting points. See the Wikipedia illustration of a saddle point for how such a point may or may not be appropriate for splitting clusters.
See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y-axis is the log-likelihood of the density):
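As a rough illustration of the KDE approach, assuming scipy is available (the default bandwidth may merge or split groups, so treat this as a sketch, not a recipe):

import numpy as np
from scipy.signal import argrelextrema
from scipy.stats import gaussian_kde

data = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71], dtype=float)

# Estimate the density on a grid, then cut the data at local density minima.
grid = np.linspace(data.min(), data.max(), 1000)
density = gaussian_kde(data)(grid)
cuts = grid[argrelextrema(density, np.less)[0]]

clusters = [data[(data >= lo) & (data < hi)].tolist()
            for lo, hi in zip([-np.inf, *cuts], [*cuts, np.inf])]
print(clusters)  # the exact split depends on the bandwidth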
This simple algorithm works:
points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]
clusters = []
eps = 0.2

points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
    if point <= curr_point + eps:
        curr_cluster.append(point)
    else:
        clusters.append(curr_cluster)
        curr_cluster = [point]
    curr_point = point
clusters.append(curr_cluster)

print(clusters)
The above example clusters points into groups such that each element in a group is at most eps away from another element in the group. This is like the clustering algorithm DBSCAN with eps=0.2, min_samples=1. As others noted, 1D data lets you solve the problem directly, instead of using bigger guns like DBSCAN.
The above algorithm is 10-100x faster than DBSCAN on some small datasets (<1000 elements) that I tested.
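For comparison, here is roughly the same grouping with scikit-learn's DBSCAN, assuming scikit-learn is installed (sklearn expects 2-D input, hence the reshape):

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]).reshape(-1, 1)

# min_samples=1 makes every point a core point, so clusters are exactly the
# chains of points whose consecutive gaps are within eps, as in the loop above.
labels = DBSCAN(eps=0.2, min_samples=1).fit(points).labels_
print(labels)  # e.g. [0 1 1 1 1 1 1]: 0.1 stands alone, the rest chain together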
You may look at discretization algorithms. The 1D discretization problem is a lot like what you are asking. They decide cut-off points according to frequency, binning strategy, etc.
Weka uses the following algorithms in its discretization process:
weka.filters.supervised.attribute.Discretize
uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion
weka.filters.unsupervised.attribute.Discretize
uses simple binning
CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.
Example Usage
pip install ckwrap
import numpy as np
import ckwrap

nums = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])
km = ckwrap.ckmeans(nums, 3)
print(km.labels)
# [0 0 0 0 1 1 1 2 2]

buckets = [[], [], []]
for i in range(len(nums)):
    buckets[km.labels[i]].append(nums[i])
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
I expect the authors intended you to make use of the ndarray functionality rather than create a list of lists.
Other measures:
km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss
The underlying algorithm is based on this article.
Late response, and just for the record. You can partition a 1D array using Ckmeans.1d.dp.
This method guarantees optimality and runs in O(n^2), where n is the number of observations. The implementation is in C++, and there is a wrapper in R.
The code below implements Has QUIT--Anony-Mousse's answer to Clustering values by their proximity in python (machine learning?):
"When you have 1-dimensional data, sort it, and look for the largest gaps."
I only added that the gaps need to be relatively large.
import numpy as np
from scipy.signal import argrelextrema

# lst = [1, 1, 5, 6, 1, 5, 10, 22, 23, 23, 50, 51, 51, 52, 100, 112, 130, 500, 512, 600, 12000, 12230]
lst = [1, 1, 2, 3, 10, 11, 13, 67, 71]
lst.sort()

diff = [lst[i] - lst[i - 1] for i in range(1, len(lst))]
rel_diff = [diff[i] / lst[i] for i in range(len(diff))]
arg = argrelextrema(np.array(rel_diff), np.greater)[0]

last = 0
for x in arg:
    print(f'{last}:{x + 1} {lst[last:x + 1]}')
    last = x + 1
print(f'{last}: {lst[last:]}')
output:
0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]
For my Advanced Algorithms and Data Structures class, my professor asked us to pick any topic that interested us. He also told us to research it and to try and implement a solution in it. I chose Neural Networks because it's something that I've wanted to learn for a long time.
I've been able to implement AND, OR, and XOR using a neural network whose neurons use a step function as the activation function. After that I tried to implement a back-propagating neural network that learns to recognize the XOR operator (using a sigmoid function as the activation). I was able to get this to work 90% of the time by using a 3-3-1 network (1 bias at the input and hidden layers, with weights initialized randomly). At other times it seems to get stuck in what I think is a local minimum, but I am not sure (I've asked questions about this before, and people have told me that there shouldn't be a local minimum).
For the 90% of the time it was working, I was consistently presenting my inputs in this order: [0, 0], [0, 1], [1, 0], [1, 1], with the expected outputs set to [0, 1, 1, 0]. When I present the values in the same order consistently, the network eventually learns the pattern. It actually doesn't matter in what order I send them, as long as it is the exact same order for each epoch.
I then implemented a randomization of the training set, so this time the order of the inputs is sufficiently randomized. I've noticed that my neural network now gets stuck: the errors decrease, but at a very small rate (which gets smaller with each epoch). After a while, the errors start oscillating around a value (so the error stops decreasing).
I'm a novice at this topic and everything I know so far is self-taught (reading tutorials, papers, etc.). Why does the order of presentation of inputs change the behavior of my network? Is it because the change in error is consistent from one input to the next (because the ordering is consistent), which makes it easy for the network to learn?
What can I do to fix this? I'm going over my backpropagation algorithm to make sure I've implemented it right; currently it is implemented with a learning rate and a momentum. I'm considering looking at other enhancements like an adaptive learning-rate. However, the XOR network is often portrayed as a very simple network and so I'm thinking that I shouldn't need to use a sophisticated backpropagation algorithm.
The order in which you present the observations (input vectors) comprising your training set to the network only matters in one respect: a randomized arrangement of the observations with respect to the response variable is strongly preferred over an ordered arrangement.
For instance, suppose you have 150 observations comprising your training set, and for each the response variable is one of three class labels (class I, II, or III), such that observations 1-50 are in class I, 51-100 in class II, and 101-150 in class III. What you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 observations in class I, then all 50 in class II, then all 50 in class III.
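In practice, the standard fix is simply to reshuffle the training set at the start of every epoch; an illustrative skeleton (not the poster's code):

import random

# XOR training set: (input vector, target)
training = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

for epoch in range(1000):
    random.shuffle(training)        # a new presentation order each epoch
    for x, target in training:
        pass                        # forward pass + backprop update go here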
What happened during the training of your classifier? Well, initially you were presenting the four observations to your network unordered with respect to the response: [0, 1, 1, 0].
I wonder what the ordering of the input vectors was in those instances in which your network failed to converge. If it was [1, 1, 0, 0] or [0, 1, 1, 1], this is consistent with the well-documented empirical rule I mentioned above.
On the other hand, I have to wonder whether this rule even applies in your case. The reason is that you have so few training instances that even if the order is [1, 1, 0, 0], training over multiple epochs (which I am sure you must be doing) means this ordering looks more "randomized" than the exemplar I mentioned above (i.e., [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0] is how the network would be presented with the training data over three epochs).
Some suggestions to diagnose the problem:
As I mentioned above, look at the ordering of your input vectors in the non-convergence cases -- are they sorted by response variable?
In the non-convergence cases, look at your weight matrices (I assume you have two of them). Look for any values that are very large (e.g., 100x the others, or 100x the value they were initialized with). Large weights can cause overflow.