App Engine - Precomputing bounding boxes for proximity search - google-app-engine

I'm trying to do a location-based search on App Engine, but since the data store doesn't support multiple inequality operators, I can't search "where lat between a and b and lon between c and d".
One of the solutions is to pre-compute bounding boxes to search on, as explained here:
http://code.google.com/appengine/articles/geosearch.html
http://mutiny.googlecode.com
However, I'm a little confused about "slices". I'm trying to figure out:
Why have slices? Why not just increase the resolution? Don't they do the same thing?
Why does the sample have 5 configs - won't one do?
GEOBOX_CONFIGS = (
    (4, 5, True),
    (3, 2, True),
    (3, 8, False),
    (3, 16, False),
    (2, 5, False),
)
I'm trying to figure out what to set the config to for my own app, but there are so many variables, it's not clear what to do. Do I increase the resolution (first number), the number of slices (second number), add/remove config?
Ultimately, I'm interested in points within 10-15 miles (the code already sorts them by distance), but I don't understand why it can't be done with 1 config and the resolution set high enough.

I found another example which seems to wrap everything up nicely, and I don't need to worry about all those crazy config values!
http://code.google.com/p/geomodel/wiki/Usage


How do I search for most similar sequence in Excel?

I'm hoping to search an excel column for the sequence in it most similar to a sequence I enter.
For instance, in the following example, the sequence I provide is: 1, 2.5, 3.5, 2.5, 1. It's depicted on the following graph as black.
In the column I'm searching, there are a few sequences. The most similar one to mine is colored blue. It goes: 1, 2, 3, 2, 1.
Graph
Do any of you know an excel formula, or series of formulas and steps, that would allow Excel to determine this -- so that when I enter the black sequence, for instance, it will match it with the blue sequence as the most similar one?
Thanks to this Stack Overflow answer, I already know how to search a set of numbers for an exact sequence by using the following formula:
=MATCH([Criteria 1]&[Criteria 2],[Data 1st val]:[Data last val]&[Data 2nd val]:[Data last + 1 val],0)
For instance, if I have the following numbers: 1, 3, 5, 1, 4, and I am hoping to find the sequence, 1, 4, this formula will direct me towards it in that set of numbers.
I ALSO already know how to find the closest match to a number I enter, using this formula (which will make more sense if you look in the example image below): =INDEX($A$1:$A$10,MATCH(MIN(ABS(C1-B1:B10)),ABS(C1-$B$1:$B$10),0))
Example
When I press control+shift+enter, this formula will produce the number 4, indicating row 4, because the number I entered in C1, which was 39, is closest to the number 40, which is located in the 4th row.
So I have both the components -- finding exact sequences, and finding the closest number -- but now the question is, how do I combine these two formulas to show me the closest sequence of numbers, the one which would look most similar if drawn on a graph like in my first example with the blue and black line?
And bonus points if you can help find not only the closest sequence but the closest sequences in order of most similar to least similar.
And once again, I don't need this to be rolled into one formula; I am happy to go through a couple steps and different formulas manually to arrive at the answer.
And if you think this would be better solved in some other way, please let me know! But I do not have any coding experience so I figured Excel would be my best bet.
Thank you so much!!!
Not sure how you exactly have set this up, but if I visualize your graph in a table you could use the below (if one has Microsoft365):
Formula in H2:
=INDEX(SORTBY(B2:F4,MMULT(ABS(B2:F4-B1:F1),SEQUENCE(5,,,0))),1)
With all your data in a single column, below is an example for sequences of 5.
Formula in C2:
=TRANSPOSE(INDEX(SORTBY(INDEX(A2:A16,SEQUENCE(11,5)-ROUNDDOWN(SEQUENCE(11,5,0,0.2),0)*4),MMULT(ABS(INDEX(A2:A16,SEQUENCE(11,5)-ROUNDDOWN(SEQUENCE(11,5,0,0.2),0)*4)-TRANSPOSE(B2:B6)),SEQUENCE(5,,,0))),1))
If you would want to make this applicable for your dataset from A1:A500 with sequence of 10 numbers:
=TRANSPOSE(INDEX(SORTBY(INDEX(A1:A500,SEQUENCE(COUNT(A1:A500)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(A1:A500)-9,10,0,0.1),0)*9),MMULT(ABS(INDEX(A1:A500,SEQUENCE(COUNT(A1:A500)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(A1:A500)-9,10,0,0.1),0)*9)-TRANSPOSE(B1:B10)),SEQUENCE(10,,,0))),1))
It would be even better if you had access to LET(); then it is a piece of cake to just change the range reference:
=LET(X,A2:A500,Y,INDEX(X,SEQUENCE(COUNT(X)-9,10)-ROUNDDOWN(SEQUENCE(COUNT(X)-9,10,0,0.1),0)*9),TRANSPOSE(INDEX(SORTBY(Y,MMULT(ABS(Y-TRANSPOSE(B2:B11)),SEQUENCE(10,,,0))),1)))
EDIT2:
To make it more dynamic you can use:
=LET(W,1,X,A2:A500,Y,11,Z,INDEX(X,SEQUENCE(COUNT(X)-(Y-1),Y)-ROUNDDOWN(SEQUENCE(COUNT(X)-(Y-1),Y,0,1/Y),0)*(Y-1)),TRANSPOSE(INDEX(SORTBY(Z,MMULT(ABS(Z-TRANSPOSE(B2:INDEX(B:B,Y+1))),SEQUENCE(Y,,,0))),W)))
Where "W" is the nth closest match and where "Y" is the length of the sequence, 11 in the example.
My approach would be to calculate a match-value between each color and the input values, like the sum of the differences for each point.
The formula for this is:
=SUM(IF([inputrange]<>"",ABS([inputrange]-[colorrange]),0))
Where [inputrange] is the range of your input (indicated red in the picture below, $C$6:$G$6) and [colorrange] is the range of that color (indicated blue, C2:G2).
The color with the lowest difference is the match:
=VLOOKUP(MIN([matchvalues]),[rangeofmatchandcolors],2,0)
Where [matchvalues] is the range of match values (indicated blue in the picture below, Cells A2:A4) and [rangeofmatchandcolors] is both the match values as well as the colors (indicated red, A2:B4)
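For readers who can run a script, the sum-of-differences idea used in both answers can be sketched in Python. This is only an illustration, not a replacement for the formulas: the function name rank_windows and the sample column are made up, and the data is assumed to sit in one list with the query being the "black" sequence from the question.

```python
def rank_windows(data, query):
    """Return (start_index, distance) pairs, most similar window first."""
    n = len(query)
    scored = []
    for start in range(len(data) - n + 1):
        window = data[start:start + n]
        # Sum of absolute point-by-point differences between window and query.
        distance = sum(abs(w - q) for w, q in zip(window, query))
        scored.append((start, distance))
    # sorted() is stable, so ties keep their left-to-right order.
    return sorted(scored, key=lambda pair: pair[1])


column = [5, 1, 2, 3, 2, 1, 7, 4, 4, 9]  # hypothetical data column
query = [1, 2.5, 3.5, 2.5, 1]            # the "black" sequence
ranking = rank_windows(column, query)
best_start, best_distance = ranking[0]
print(column[best_start:best_start + len(query)], best_distance)
```

The first entry of ranking is the closest sequence, the second entry is the next closest, and so on, which corresponds to changing "W" in the LET() formula.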

getting combination of indices of on bits from 2 bitmaps

This problem has 2 levels.
Level 1.
I have a 64-bit bitmap and I know that only a few of its bits are on (set to 1). Is there a way to find which bits are set without branching?
e.g.___(0)___________________________________________________________(63)
BMP = 000000001000010000000000010000000000000000000000011000000000000
f(BMP) = {9, 14, 26, 51, 52}
Level 2.
Now I have two 64-bit bitmaps and I need every combination of set bits across the two.
e.g.____(0)___________________________________________________________(63)
BMP1 = 000000001000000000000000000000000000000000000000011000000000000
BMP2 = 000000000000010000000000010000000000000000000000000000000000000
f(BMP1, BMP2) = {(9,14), (9, 26), (51, 14), (51, 26), (52, 14), (52, 26)}
I know that the bitmap almost always is sparse.
It would be great if the suggested solution could be extended to more than 2 bitmaps at a time, but I would rather have a method that works extremely fast for up to 2 and a little slower for more than that.
Even if a solution without branching is not possible, please suggest the fastest possible method with branching.
(Sorry for bad formatting)
You could store the possible bitfields in a hash table, if there are only a relative few of them, such as if you know no more than two bits are set and there are at most a few thousand possibilities.
Failing that, there are a few tricks you can do with two's-complement arithmetic and signed numbers to get the first bit set in a vector. v & -v gives you a value in which only the lowest-order set bit of v is set. You can mask that bit off and repeat to get them all.
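A minimal sketch of the v & -v trick, written in Python only to show the arithmetic. Two assumptions: bit 0 is the least significant bit (the question draws bit 0 on the left), and the helper names set_bits and bit_pairs are invented here. The while loop is still a branch, so this illustrates the technique rather than a branch-free routine.

```python
from itertools import product


def set_bits(v):
    """Yield the indices of the set bits of a non-negative integer, lowest first."""
    while v:
        low = v & -v                # value with only the lowest set bit of v
        yield low.bit_length() - 1  # index of that bit
        v ^= low                    # clear the bit and repeat


def bit_pairs(a, b):
    """All (i, j) such that bit i is set in a and bit j is set in b."""
    return list(product(set_bits(a), set_bits(b)))


bmp1 = (1 << 9) | (1 << 51) | (1 << 52)
bmp2 = (1 << 14) | (1 << 26)
print(list(set_bits(bmp1)))
print(bit_pairs(bmp1, bmp2))
```

Extending this to more than two bitmaps only needs more arguments to product.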

Test if a column is a superset of predefined set of data

I'm trying to compare teams' compositions to known configurations in order to see where I might have a problem:
The trial columns are to be compared against the different scenarios to see whether a column is a superset of a particular scenario (error being the default).
Can it be done using index+match/lookup, or do I have to write a VB macro?
EDIT : I've updated the question with a worksheet with input data.
Worksheet : https://drive.google.com/file/d/0BxwDbXStIEAsUmpONHp1RVRzR2s/edit?usp=sharing
Github Gist : https://gist.github.com/lucasg/11177852 (python script for data gen)
(xlwt module needed to create excel workbooks).
I've simplified the problem using soccer teams: given 7 positions (1 goalie, 2 defenders, 2 midfielders and 2 forwards) and a list of who is present on certain weekends, I would like to know whether I will be able to field a full team or will have to forfeit the match for lack of key players.
The positions :
styles = {
    "Goalkeeper" : ["Goalkeeper"],
    "Defender" : ["Centre back", "Wing"],
    "Midfielder" : ["Centre midfield", "Wide"],
    "Forward" : ["Centre forward","Winger"]
}
Most football players can play only one position, but some are more versatile and can play any position in their own zone (defense-midfield-attack).
Example of a team (18 pers.):
example_players = {
    "Forward": [
        [1, "Winger"],
        [2, "Winger"],
        [3, "Centre forward"],
        [4, "Centre forward"]
    ],
    "Defender": [
        [5, "Centre back"],
        [6, "Centre back", "Wing"],
        [7, "Centre back", "Wing"],
        [8, "Wing"],
        [9, "Centre back"]
    ],
    "Goalkeeper": [
        [10, "Goalkeeper"],
        [11, "Goalkeeper"]
    ],
    "Midfielder": [
        [12, "Centre midfield"],
        [13, "Centre midfield"],
        [14, "Wide", "Centre midfield"],
        [15, "Centre midfield"],
        [16, "Centre midfield"],
        [17, "Wide", "Centre midfield"],
        [18, "Wide", "Centre midfield"]
    ]
}
To keep it simple, I need at least one person in each zone (goal-def-mid-attack) to be able to play, the most comfortable situation being one person in each of the 7 positions.
ex scenario :
"no_defense_4" : ["Goalkeeper", "Wide", "Winger" ] ,
"no_attack_1" : ["Goalkeeper", "Centre midfield", "Centre back", ] ,
Now, given a list of a hundred weekends and the list of the presence/absence of players, I want to know the resulting situation.
I would prefer a formula-based solution, since the worksheet will be uploaded and used in Google Drive.
You can represent sets as bit vectors and then use the bit operators "equal" or "AND" to test which sets get matched. Using bit vectors as the set representation solves the problems of ordering and duplicate values automatically, as the position of each value in the bit vector is fixed and each bit will be "set" only once, regardless of how many times the value appears in the column that defines the set.
A simple-to-use bit vector representation for Excel, including the operators OR, AND and NOT, is described here: http://chandoo.org/wp/2011/07/29/bitwise-operations-in-excel/#comment-207723
For example, the following function
=POWER(10;0)*MIN(COUNTIF($B$3:$B$12;"T1");1)+POWER(10;1)*MIN(COUNTIF($B$3:$B$12;"T2");1)+POWER(10;2)*MIN(COUNTIF($B$3:$B$12;"S");1)+POWER(10;3)*MIN(COUNTIF($B$3:$B$12;"PL");1)+POWER(10;4)*MIN(COUNTIF($B$3:$B$12;"CC");1)+POWER(10;5)*MIN(COUNTIF($B$3:$B$12;"GC");1)
converts the values in the range $B$3->$B$12 into a bit vector representation having bits 0..5, defined so that a bit is set if the value in any cell of the range is equal to:
bit 0 = T1
bit 1 = T2
bit 2 = S
bit 3 = PL
bit 4 = CC
bit 5 = GC
You can add more bits with other values easily by following the same copy/paste pattern.
So to check if certain column matches certain scenario, just compare the bit vectors. Use expression like IF(x=y;"warn2";IF(..)) and substitute bit vector of the column for x and bit vector of the warn2 scenario for y.
If partial matching is needed, you can use the bitwise AND operator as defined in the above article.
Unlike a VBA-based solution, this solution will require some copy/paste discipline: when a new trial column or a new scenario is added, a few expressions will have to be copy/pasted and a few will have to be updated.
A VBA-based solution might solve this maintenance problem automatically for you by using auto-detected CurrentRegions, with all the necessary logic hidden behind one macro click.
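The same idea can be sketched in Python, for example in a data-generation script. This is an illustration under assumptions: it uses real binary bits rather than the decimal digits of the POWER(10;n) formula, the value names are the six listed above, and to_mask and is_superset are made-up helper names.

```python
VALUES = ["T1", "T2", "S", "PL", "CC", "GC"]  # bit 0 .. bit 5


def to_mask(cells):
    """Encode a column of cell values as a bit vector; repeats and order do not matter."""
    mask = 0
    for cell in cells:
        if cell in VALUES:
            mask |= 1 << VALUES.index(cell)
    return mask


def is_superset(column, scenario):
    """True if every value of the scenario also occurs in the column."""
    wanted = to_mask(scenario)
    return (to_mask(column) & wanted) == wanted


print(is_superset(["T1", "S", "S", "GC"], ["S", "T1"]))
print(is_superset(["T1"], ["T1", "T2"]))
```

Comparing masks with == gives the exact match, and the AND test gives the superset (partial) match.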
EDIT: The bit vectors concept applied to the new soccer teams dataset
Worksheet: https://docs.google.com/spreadsheet/ccc?key=0AtZPnBk7a3pvdHcyWDV6ZFFoUTNyWWF0bjl3VFpaRkE&usp=drive_web#gid=0
Because one player may be assigned different positions, the exact team setup on a given day is ambiguous. I have therefore simplified the problem: instead of "present" or "absent", I expect the table to contain the player's position. This should not be hard to achieve: if you know which positions a player can play, then instead of absent/present you can define the set of valid values as, for example, (empty or anything else), Midfielder, Centre midfield, Wide for players 14, 17 and 18. The list of valid values can be configured for each cell using the "Data validation" rules. The abstract role Midfielder stands for "this player can play as a midfielder; the exact position is not known yet".
To represent positions I use bit vector calculated with this formula
=POWER(10;6)*MIN(COUNTIF(D2:ZZ2;"Goalkeeper");1)+POWER(10;5)*MIN(COUNTIF(D2:ZZ2;"Centre back");1)+POWER(10;4)*MIN(COUNTIF(D2:ZZ2;"Wing");1)+POWER(10;3)*MIN(COUNTIF(D2:ZZ2;"Centre midfield");1)+POWER(10;2)*MIN(COUNTIF(D2:ZZ2;"Wide");1)+POWER(10;1)*MIN(COUNTIF(D2:ZZ2;"Centre forward");1)+POWER(10;0)*MIN(COUNTIF(D2:ZZ2;"Winger");1)
The formula calculates the bit vector from the range D2:ZZ2 so that each position in the range is counted only once and each position has a fixed place in the final vector. It is useful to set the number format of the vector to the custom numeric format 0000000. For example, a row containing Wide,Winger,Goalkeeper in any order, with any number of repeats, evaluates to the vector 1000101, where the leftmost digit (bit 6) stands for Goalkeeper and the third digit from the right (bit 2) stands for Wide. The most comfortable situation is the one where the bit vector evaluates to 1111111. The only purpose of this bit vector is to detect the comfortable situation.
For matching scenarios to team setups I use another vector composed of 4 digits with this meaning:
leftmost digit 3 - number of goalies (at most 1 counts)
digit 2 - number of defenders (at most 2 counts)
digit 1 - number of midfielders (at most 2 counts)
rightmost digit 0 - number of forwards (at most 2 counts)
The formula to calculate this vector for range D2:ZZ2 looks like this
=POWER(10;3)*MIN(COUNTIF(D2:ZZ2;"Goalkeeper");1)+POWER(10;2)*MIN(COUNTIF(D2:ZZ2;"Defender")+COUNTIF(D2:ZZ2;"Centre back")+COUNTIF(D2:ZZ2;"Wing");2)+POWER(10;1)*MIN(COUNTIF(D2:ZZ2;"Midfielder")+COUNTIF(D2:ZZ2;"Centre midfield")+COUNTIF(D2:ZZ2;"Wide");2)+POWER(10;0)*MIN(COUNTIF(D2:ZZ2;"Forward")+COUNTIF(D2:ZZ2;"Centre forward")+COUNTIF(D2:ZZ2;"Winger");2)
It is useful to set number format of the vector to custom numeric format 0000. This same formula can calculate the 4-digit vector for team setup and for scenario.
Besides position names it can count also abstract position names like Defender.
For example in a row containing Centre back,Centre back,Goalkeeper,Goalkeeper,Goalkeeper,Defender,Defender,Midfielder,Midfielder,Winger the vector looks like 1221.
There are (1+1)*(2+1)*(2+1)*(2+1) = 54 different possible scenarios. I assume each of them is listed in the constraints sheet. You should be able to generate them all in python quite easily.
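A possible way to generate those vectors in Python, as a sketch only. The names CAPS, ZONE_OF, all_vectors and team_vector are invented for this example; the position names and the "at most" limits come from the formula above.

```python
from itertools import product

# Maximum count per digit: goalies, defenders, midfielders, forwards.
CAPS = (1, 2, 2, 2)

# Digit index (0 = leftmost) for each concrete or abstract position name.
ZONE_OF = {
    "Goalkeeper": 0,
    "Defender": 1, "Centre back": 1, "Wing": 1,
    "Midfielder": 2, "Centre midfield": 2, "Wide": 2,
    "Forward": 3, "Centre forward": 3, "Winger": 3,
}


def all_vectors():
    """Every possible 4-digit vector, one per scenario."""
    ranges = [range(cap + 1) for cap in CAPS]
    return [g * 1000 + d * 100 + m * 10 + f for g, d, m, f in product(*ranges)]


def team_vector(row):
    """4-digit vector for one row of position names; unknown cells are ignored."""
    counts = [0, 0, 0, 0]
    for name in row:
        zone = ZONE_OF.get(name)
        if zone is not None:
            counts[zone] += 1
    g, d, m, f = (min(c, cap) for c, cap in zip(counts, CAPS))
    return g * 1000 + d * 100 + m * 10 + f


row = ["Centre back", "Centre back", "Goalkeeper", "Goalkeeper", "Goalkeeper",
       "Defender", "Defender", "Midfielder", "Midfielder", "Winger"]
print(len(all_vectors()), team_vector(row))
```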
There are 2 sheets: constraints, with the scenarios, and events, with the days and team setups. The lookup formula that takes the vector calculated for the team setup in row #2, searches the constraints sheet for a row with exactly the same vector, and returns the value from the value column looks like this
=IFERROR(VLOOKUP($A2;constraints!$A:$B;2;FALSE);"?")
$A2 - contains the 4-digit vector formula for the team setup
constraints!$A - column in the sheet with scenarios containing the 4-digit vector formula for the scenario
constraints!$B - column in the sheet with scenarios containing the scenario name - the thing you are looking for
2 - index of column constraints!$B
FALSE - requests an exact match, so the lookup column does not have to be sorted
? - fallback value if no matching scenario was found (should not occur)
The Google docs link above contains the formulas, example 3 days and example 11 scenarios.
If there's something unclear let me know and I'll improve the answer as the Google docs link will vanish some day

App Engine - Efficient queries over user profiles containing many multiple-choice responses

I'm building an application where users are able to create profiles for themselves by answering a bunch of multiple-choice questions. Users are also able to search for other users by specifying criteria for answers to these questions.
Let's say we have 9 questions q1 .. q9, each with 6 possible answers (0 through 5). This could be represented in a user profile as something like:
class UserProfile(db.Model):
    user = db.StringProperty(required=True)
    q1 = db.IntegerProperty()
    ...
    q9 = db.IntegerProperty()
Now, consider that a user wants to run a query for users that answered:
0, 1 or 2 for q1
1, 2 or 5 for q2
...
3, 4, or 5 for q9
We could write a query such as:
q = UserProfile.all()
q.filter("q1 IN", [0, 1, 2])
q.filter("q2 IN", [1, 2, 5])
...
q.filter("q9 IN", [3, 4, 5])
Unfortunately, this would generate close to 20,000 sub-queries (3^9 = 19,683, assuming that the user specified 3 possible answers for each filter), which is significantly greater than the 30 allowed, not to mention horribly inefficient.
Is there a design pattern to do this efficiently?
I can envision a way to turn each of these filters into single equality filters by representing each filter as an integer using binary encoding (e.g., [1, 2, 5] -> b100110 = 38) and storing each user answer in the datastore as a list of queries it would match (e.g., 1 -> bxxxx1x -> [2, 3, 6, 7, 10, 11, .. , 62, 63]). However, this seems a bit kludgy.
I would appreciate if anyone has a more efficient suggestion for an implementation.
UPDATE (on proposed binary encoding):
Nick Johnson raised some interesting concerns about the binary encoding proposed above, so I would like to clarify the proposed encoding in more detail to allow him and others to provide a clear evaluation of its merits and challenges.
I think a concrete example will work best. Also, I think that starting with the query mechanism is also more intuitive.
Continuing with the example from above, let's assume that there are 9 questions with 6 possible answers each (0 through 5). Let's also define that each query will be in the form of a filter on a number of these questions for matching against multiple possible answers (as described above). I propose to convert each query of the form "q2 IN [1, 2, 5]" to an equality filter using binary encoding, where each bit position is 1 if it's one of the queried responses and 0 otherwise. For example, "q2 IN [1, 2, 5]" would translate to "q2 == b100110" or "q2 == 38". Applying this further, the composite query described above would be translated into the following multiple equality filters:
0, 1 or 2 for q1 -> q1 == b000111 -> q1 == 7
1, 2 or 5 for q2 -> q2 == b100110 -> q2 == 38
...
3, 4, or 5 for q9 -> q9 == b111000 -> q9 == 56
To enable turning the "IN" filters into "==" filters, we need to determine in advance which queries (in their binary-encoded form) a profile response will match. For example, if a user selects 2 (among 0 through 5) as the answer, then that response will match any query whose binary encoding has a 1 in the 2-position, i.e. of the form bxxx1xx, where x could be 0 or 1. The set of integers defined by bxxx1xx are [b000100, b000101, b000110, b000111, b001100, b001101, ... , b111100, b111101, b111110, b111111] or in decimal form: [4, 5, 6, 7, 12, 13, ..., 60, 61, 62, 63], which is a list of 32 integers. In general, this "query match set" will have 2^(n-1) elements for a response to a question with n possible answers, because 1 of the n bits in the binary encoding will be fixed to 1, while the others could be 0 or 1.
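The encoding described above can be checked with a short Python sketch. The helper names query_mask and match_set are made up for illustration, and nothing here touches the datastore; it only reproduces the arithmetic.

```python
def query_mask(allowed_answers):
    """Binary-encode a filter such as "q2 IN [1, 2, 5]" as one integer."""
    mask = 0
    for answer in allowed_answers:
        mask |= 1 << answer
    return mask


def match_set(answer, n_answers):
    """All encoded queries that a given response matches (the "query match set")."""
    return [m for m in range(1 << n_answers) if m & (1 << answer)]


print(query_mask([0, 1, 2]), query_mask([1, 2, 5]), query_mask([3, 4, 5]))
print(match_set(2, 6)[:6], len(match_set(2, 6)))
```

A profile whose q2 answer is 2 would store match_set(2, 6) in a list property, so the filter "q2 == 38" matches it because 38 is in that list.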
Therefore, if we had m questions with n possible answers each, then the number of index entries for each entity storing these "query match sets" for each question would be m x (2 ^ (n-1)). If I have:
9 questions with 6 possible answers each, this would require 9 x 2^5 = 288 index entries
10 questions with 8 possible answers each, this would require 10 x 2^7 = 1280 index entries
15 questions with 9 possible answers each, this would require 15 x 2^8 = 3840 index entries
20 questions with 10 possible answers each, this would require 20 x 2^9 = 10240 index entries (which is above the 5000/entity limit imposed by App Engine)
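The counts above follow directly from m x 2^(n-1). A small Python sketch makes it easy to try other combinations; the function names are invented here, and the mixed list of answer counts at the end is a hypothetical profile, not data from the application.

```python
def index_entries(m, n):
    """Index entries per entity for m questions with n possible answers each."""
    return m * 2 ** (n - 1)


def index_entries_mixed(answer_counts):
    """Same count when questions have different numbers of possible answers."""
    return sum(2 ** (n - 1) for n in answer_counts)


for m, n in [(9, 6), (10, 8), (15, 9), (20, 10)]:
    print(m, n, index_entries(m, n))

# Hypothetical: 10 questions, mostly 3-5 answers, a few with 6-7.
print(index_entries_mixed([3, 3, 4, 4, 5, 5, 5, 6, 6, 7]))
```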
Therefore, I agree that this is not a suitable approach for an arbitrarily large number of questions, especially if the possible number of answers to questions is large also. However, it appears feasible if the number of questions to be indexed is 10-15 and the possible answers don't number more than 6, at least for a majority of the questions.
I will have no more than 10 questions that need to be indexed. Most of them have 3-5 possible answers. Some have 6-7 possible answers, so I'm expecting less than 300 index entries per entity (unless I'm wrong about how I'm calculating the index requirements above).
I don't really view this as a very elegant solution, but:
It appears that indexing overhead could be manageable (i.e. well below the 5000 index rows limit)
It will return exactly what I'm filtering for (rather than getting a partially filtered list of entities, which all need to be transported over the network, only to be filtered further by the application)
I had gathered that the built-in merge-join would be fast enough for this to be effective.
I would still appreciate perspectives on the following questions:
Based on this more detailed explanation, do you think that the indexing requirements could be reasonable? If you think that this still bumps up against limitations, I really would appreciate your insights on this.
Even if the indexing requirements could be reasonable, do you think that writing a query planner would yield a more efficient solution? If so, I would be grateful for (a) a brief explanation of why this would be more efficient and (b) a pointer to how to go about doing this. I'm not sure about how to even get started with a query planner.
There's simply no efficient way to structure the data for queries as you describe them. The only way to do this is to query on the criteria you think will be most restrictive, then filter manually in memory for the remaining criteria.
If you tell us more about the specific sorts of queries people might execute and why, we may be able to provide concrete suggestions for something more efficient.
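The "query on the most restrictive criterion, then filter in memory" approach can be sketched as follows. This is plain Python over a list of dicts standing in for datastore entities. Treating the criterion with the fewest allowed answers as the most restrictive one is an assumption of this sketch, and the name search is made up.

```python
def search(profiles, criteria):
    """criteria maps a question name to the list of accepted answers."""
    # Assumption: the filter with the fewest accepted answers is the most restrictive.
    first = min(criteria, key=lambda q: len(criteria[q]))
    # This step stands in for the single datastore query (one IN filter).
    candidates = [p for p in profiles if p[first] in criteria[first]]
    # Remaining criteria are checked in application memory.
    rest = {q: allowed for q, allowed in criteria.items() if q != first}
    return [p for p in candidates
            if all(p[q] in allowed for q, allowed in rest.items())]


profiles = [
    {"user": "a", "q1": 0, "q2": 1, "q9": 3},
    {"user": "b", "q1": 4, "q2": 1, "q9": 3},
    {"user": "c", "q1": 2, "q2": 5, "q9": 5},
    {"user": "d", "q1": 1, "q2": 0, "q9": 4},
]
criteria = {"q1": [0, 1, 2], "q2": [1, 5], "q9": [3, 4, 5]}
print([p["user"] for p in search(profiles, criteria)])
```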

Artificial neural networks

I want to know whether Artificial Neural Networks can be applied to discrete-valued inputs. I know they can be applied to continuous-valued inputs, but can they be applied to discrete-valued ones? Also, will they perform well for discrete-valued inputs?
Yes, artificial neural networks may be applied to data featuring discrete-value input variables. In the most commonly used neural network architectures (which are numeric), discrete inputs are typically represented by a series of dummy variables, just as in statistical regression. Also, as with regression, the number of dummy variables needed is one less than the number of distinct values. There are other methods, but this is the most straightforward.
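As a small illustration of dummy variables, here is a Python sketch. The function name dummy_encode and the colour categories are made up, and the first category is assumed to be the reference level that gets all zeros.

```python
def dummy_encode(value, categories):
    """Encode one discrete value as len(categories) - 1 dummy variables.

    categories[0] is the reference level and is encoded as all zeros.
    """
    return [1 if value == category else 0 for category in categories[1:]]


colours = ["red", "green", "blue"]  # hypothetical discrete input
for colour in colours:
    print(colour, dummy_encode(colour, colours))
```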
Good question!
First of all, the direct answer to your question is yes.
That answer does depend on a few aspects of how the network is used and implemented.
Let me explain why:
The easiest way is to normalize the input as usual (this is the first rule of thumb with NNs), then let the neural network compute the task. Once you have your output, invert the normalization to get the output back in the original range, though still continuous. To get back discrete values, just take the integer part of your output. It is easy and it works. A good result just depends on the topology you design for your network.
As a plus, you could consider using a "step" transfer function instead of "tan-sigmoid" between layers, to mimic a sort of digitization by forcing the output to be just 0 or 1. But you should then also reconsider the starting normalization, as well as the use of well-tuned thresholds.
NB: this latter trick is not really necessary but could give some secondary benefits; maybe test it in a second stage of your development and look at the differences.
PS: One more suggestion that may apply to your issue: consider using some fuzzy logic in your learning algorithm ;-)
Cheers!
I'm late on this question, but this may help someone.
Say you have a categorical output variable, for example 3 different categories (0, 1 and 2),
outputs
0
2
1
2
1
0
then becomes
1, 0, 0
0, 0, 1
0, 1, 0
0, 0, 1
0, 1, 0
1, 0, 0
A possible NN output result is
0.2, 0.3, 0.5 (winner is categ 2)
0.05, 0.9, 0.05 (winner is categ 1)
...
Your NN will then have 3 output nodes in this case, so take the max value.
To improve this, use cross-entropy as the error measure and a softmax activation on the output layer, so that the outputs sum to 1.
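The encoding and the "take the max" step can be sketched in plain Python. The function names are invented for this example, and the raw scores passed to softmax are arbitrary numbers, not the output of a trained network.

```python
import math


def one_hot(label, n_classes):
    """Turn a category index into a target vector, e.g. 2 -> [0, 0, 1]."""
    return [1 if i == label else 0 for i in range(n_classes)]


def softmax(scores):
    """Rescale raw output scores so they are positive and sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(outputs):
    """Winning category: the index of the largest output."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


print([one_hot(label, 3) for label in [0, 2, 1]])
print(predict([0.2, 0.3, 0.5]), predict([0.05, 0.9, 0.05]))
print(softmax([1.0, 2.0, 3.0]))
```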
The purpose of a neural network is to approximate complicated functions by interpolating samples. As such, they tend to be a poor fit for discrete data, unless that data can be expressed by thresholding a continuous function. Depending on your problem, there are likely to be much more effective learning methods.
