Google App Engine query optimization - google-app-engine

I have a Google App Engine datastore that could have several million records in it, and I'm trying to figure out the best way to do a query where I need to get records back that match several strings.
For example, say I have the following Model:
String name
String level
Int score
I need to return all the records for a given "level" that also match a list of "names". There might be only 1 or 2 names in the name list, but there could be 100.
It's basically a list of high scores ("score") for players ("name") for a given level ("level"). I want to find all the scores for a given "level" for a list of players by "name" to build a high score list that includes just your friends.
I could just loop over the list of "names" and run one query per name to get that player's high score for the level, but I don't know if this is the best way. In SQL I could construct a single (complex) query to do this.
Given the size of the datastore, I want to make sure I'm not wasting time running Python code that should be done by the query, or vice versa.
The "level" needs to be a String, not an Int, since the levels are not numbered but have names. I don't know if that matters.

You can use the IN filter operator to match a property against a list of values (user names):
scores = Scores.all().filter('level ==', level).filter('user IN', user_list)
Note that under the hood this performs as many queries as there are users in user_list.
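Since the name list in the question can reach 100 entries, it may help to split it into batches and merge the results afterwards. If I remember correctly, the old db API also caps a single IN query at 30 sub-queries, so the batch size below defaults to 30; treat that number as an assumption and check it against your SDK. The run_query callable is hypothetical: it stands in for whatever wraps the IN query above and returns rows with a "score" field.

```python
def chunked(values, size=30):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def scores_for_friends(run_query, level, names, batch_size=30):
    """Run one IN query per batch of names and merge the results.

    run_query(level, batch) is a hypothetical callable that wraps the
    datastore query and returns a list of dict-like rows.
    """
    results = []
    for batch in chunked(list(names), batch_size):
        results.extend(run_query(level, batch))
    # Highest score first, as you would want for a high score list.
    results.sort(key=lambda row: row["score"], reverse=True)
    return results
```

The merge and sort happen in memory, which is cheap for a friends list of this size.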

players = Player.all().filter('level =', level).order('score')
names = [name1, name2, name3, ...]
# Filter in memory: keep only the players whose name is in the friends list.
players = [p for p in players if p.name in names]
for player in players:
    print player.name, player.score
Is this what you want?
...or am I simplifying too much?

No, you cannot do that in one pass.
You will have to either query the friends for the level one by one
or
make a friends score entity for each level. Each time a player's score changes, check which friends lists that player belongs to and update all of them. Then it's just a matter of retrieving that list.
The first option will be slow, and the second will be costly unless optimized.

Related

How to get the count of property in a Kind?

I have a Kind Student which stores the favorite color of each student. Students are allowed to pick their favorite color from a set of three colors: {Red, Blue, Green}.
Let us assume there are 100 students. My code looks like this for every student:
Entity arya = new Entity("Student","Arya");
arya.setProperty("Color","Red");
Entity robb = new Entity("Student","Robb");
robb.setProperty("Color","Green");
..
..
Entity jon = new Entity("Student","Jon");
jon.setProperty("Color","Blue");
How can I find out how many students picked a particular color (say Red) in this Student Kind? What query should I write to fetch the count?
Thanks in advance
The number you seek would be the number of items in the result of a query with an equality filter on the Color property.
You could use a keys-only query (a special kind of projection query) for this purpose, faster and less expensive:
Keys-only queries
A keys-only query (which is a type of projection query) returns just
the keys of the result entities instead of the entities themselves, at
lower latency and cost than retrieving entire entities.
...
A keys-only query is a small operation and counts as only a single
entity read for the query itself.
Something along these lines (note that I'm not a Java user; the snippet is based only on the documentation examples):
Query<Key> query = Query.newKeyQueryBuilder()
    .setKind("Student")
    .setFilter(PropertyFilter.eq("Color", "Red"))
    .build();
I agree with Dan Cornilescu's answer. Here is a direct Datastore API usage. I have prepared the request body for your use case. You can run it by just adding your Project Id. This will return the entities that match the filter; you can then count them.

How to query for multiple vertices and counts of their relationships in Gremlin/Tinkerpop 3?

I am using Gremlin/Tinkerpop 3 to query a graph stored in TitanDB.
The graph contains user vertices with properties, for example, "description", and edges denoting relationships between users.
I want to use Gremlin to obtain 1) users by properties and 2) the number of relationships (in this case of any kind) to some other user (e.g., with id = 123). To realize this, I make use of the match operation in Gremlin 3 like so:
g.V().match('user',__.as('user').has('description',new P(CONTAINS,'developer')),
__.as('user').out().hasId(123).values('name').groupCount('a').cap('a').as('relationships'))
.select()
This query works fine, unless there are multiple user vertices returned, for example, because multiple users have the word "developer" in their description. In this case, the count in relationships is the sum of all relationships between all returned users and the user with id 123, and not, as desired, the individual count for every returned user.
Am I doing something wrong or is this maybe an error?
PS: This question is related to one I posted some time ago about a similar query in Tinkerpop 2, where I had another issue: How to select optional graph structures with Gremlin?
Here's the sample data I used:
graph = TinkerGraph.open()
g = graph.traversal()
v123=graph.addVertex(id,123,"description","developer","name","bob")
v124=graph.addVertex(id,124,"description","developer","name","bill")
v125=graph.addVertex(id,125,"description","developer","name","brandy")
v126=graph.addVertex(id,126,"description","developer","name","beatrice")
v124.addEdge('follows',v125)
v124.addEdge('follows',v123)
v124.addEdge('likes',v126)
v125.addEdge('follows',v123)
v125.addEdge('likes',v123)
v126.addEdge('follows',v123)
v126.addEdge('follows',v124)
My first thought was: "Do we really need the match step?" Secondly, of course, I wanted to write this in TP3 fashion and not use a lambda/closure. I tried all manner of things in the first iteration, and the closest I got was stuff like this from Daniel Kuppitz:
gremlin> g.V().as('user').local(out().hasId(123).values('name')
.groupCount()).as('relationships').select()
==>[relationships:[:]]
==>[relationships:[bob:1]]
==>[relationships:[bob:2]]
==>[relationships:[bob:1]]
So here we used the local step to restrict the traversal inside local to the current element. This works, but we lost the "user" tag in the select. Why? groupCount is a ReducingBarrierStep, and paths are lost after those steps.
Well, let's go back to match. I figured I could try to make the match step traverse using local:
gremlin> g.V().match('user',__.as('user').has('description','developer'),
__.as('user').local(out().hasId(123).values('name').groupCount()).as('relationships')).select()
==>[relationships:[:], user:v[123]]
==>[relationships:[bob:1], user:v[124]]
==>[relationships:[bob:2], user:v[125]]
==>[relationships:[bob:1], user:v[126]]
Ok - success - that's what we wanted: no lambdas and local counts. But, it still left me feeling like: "Do we really need match step"? That's when Mr. Kuppitz closed in on the final answer which makes copious use of the by step:
gremlin> g.V().has('description','developer').as("user","relationships").select().by()
.by(out().hasId(123).values("name").groupCount())
==>[user:v[123], relationships:[:]]
==>[user:v[124], relationships:[bob:1]]
==>[user:v[125], relationships:[bob:2]]
==>[user:v[126], relationships:[bob:1]]
As you can see, by can be chained (on some steps). The first by groups by vertex and the second by processes the grouped elements with a "local" groupCount.
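For readers who don't have a Gremlin console at hand, the difference between the global count from the original match query and the per-user count from the final query can be reproduced in plain Python over the same sample data. This is only an illustration of the counting logic, not of how TinkerPop evaluates traversals; the function names are made up for this sketch.

```python
from collections import Counter

# Sample graph from the question: (source id, target id), edge labels dropped.
EDGES = [(124, 125), (124, 123), (124, 126), (125, 123),
         (125, 123), (126, 123), (126, 124)]
NAMES = {123: "bob", 124: "bill", 125: "brandy", 126: "beatrice"}


def global_count(target):
    """One counter shared by all users: the behaviour the asker did not want."""
    return Counter(NAMES[dst] for _, dst in EDGES if dst == target)


def per_user_counts(target):
    """A separate counter per user, like the by(...groupCount()) modulator."""
    return {
        user: Counter(NAMES[dst] for src, dst in EDGES
                      if src == user and dst == target)
        for user in NAMES
    }


print(global_count(123))     # bob counted once across all users
print(per_user_counts(123))  # one counter per user
```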

About indexes of GAE datastore

I have the following model in a GAE app:
class User(db.Model):
    school_name = db.StringProperty(indexed=True)
    country = db.StringProperty(indexed=True)
    city = db.StringProperty(indexed=True)
    sex = db.StringProperty(indexed=True)
    profession = db.StringProperty(indexed=True)
    joined_date = db.DateTimeProperty(indexed=True)
I want to filter the users by combinations of these fields. The results should show the most recently joined users first, which I suppose means every query ends with an order operation, like this:
User.all().filter('country =','US').filter('profession =','SE').order('-joined_date')
User.all().filter('school_name =','AAA').filter('profession =','SE').order('-joined_date')
....
User.all().filter('sex =','Female').filter('profession =','HR').order('-joined_date')
The number of field combinations would be C(5,1)+C(5,2)+...+C(5,5) = 31.
My question is: to implement this, do I need to create indexes for all 31 cases in Google App Engine? Or can you suggest another way to implement it?
Note: C(n,k) is combination formula, see more on http://en.wikipedia.org/wiki/Combination
Thanks in advance!
You have several options:
Create all 31 indexes, as you suggest.
Do the sorting in memory. Without a sort order, all your queries can be executed with the built-in merge-join strategy, and so you won't need any indexes at all.
Restrict queries to those that are more likely, or those that eliminate most of the non-matching results, and perform additional filtering in memory.
Put all your data in a ListProperty for indexing as "key:value" strings, and filter only on that. You will need to create multiple indexes with different occurrence counts on that field (e.g., indexing it once, twice, etc.), and it will result in the same number of index entries, but fewer custom indexes used.
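Two of the options above are easy to sketch in plain Python: sorting the fetched results in memory, and flattening the filterable fields into "key:value" strings for a list property. The dict-based users and the helper names below are assumptions for illustration; with real entities you would read attributes instead of dict keys.

```python
from datetime import datetime

FILTER_FIELDS = ("school_name", "country", "city", "sex", "profession")


def newest_first(users):
    """Sort fetched users in memory, most recently joined first."""
    return sorted(users, key=lambda u: u["joined_date"], reverse=True)


def to_tags(user, fields=FILTER_FIELDS):
    """Build the "key:value" strings to store in a single list property."""
    return ["%s:%s" % (f, user[f]) for f in fields if user.get(f) is not None]


users = [
    {"country": "US", "profession": "SE", "joined_date": datetime(2012, 1, 1)},
    {"country": "US", "profession": "SE", "joined_date": datetime(2013, 5, 1)},
]
print([u["joined_date"].year for u in newest_first(users)])
print(to_tags(users[0]))
```

In-memory sorting only makes sense when the filtered result set is small enough to fetch completely.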

Datastore fetch on two filters alternative?

I have a datastore entity called Game and two fields in it called playerOne and playerTwo. Either of these fields stores a username.
I need to search on the Game entity and return a MAX of 30 games where the username can be either playerOne OR playerTwo...
So in a relational database you would go:
SELECT * FROM Game WHERE playerOne='username' OR playerTwo='username' LIMIT 30
But in Bigtable you can't filter on more than one field! I can't fetch 10 from one and 10 from the other, as the number from each can vary and the results need to be in createdDate order.
How would you do this in your datastore?
The quick answer is to create a StringListProperty that contains [player_a, player_b] and then simply use the multi-valued index built from it:
games = Game.all().filter("players =", player_find)
You cannot do an OR query on the datastore using different fields. If you have to keep your current entity model, then you have to do two queries:
1) filtering on playerOne and limiting to 30
2) filtering on playerTwo and limiting to (30 - result size of query one)
Then merge the results in memory to produce the final set of 30.
Now if you also want some ordering by date, it gets trickier. However, the SQL query you wrote doesn't have any ordering, so I omitted it as well.
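For the ordered case, one possible approach is to fetch up to 30 games from each query, both already sorted by createdDate descending, and merge them in memory. The 30 newest games overall are then guaranteed to be somewhere in the union. A minimal sketch, assuming each game is a dict with "id" and "createdDate" keys (hypothetical names standing in for the entity key and date property):

```python
def merge_games(as_player_one, as_player_two, limit=30):
    """Merge two query results, drop duplicates, keep the newest `limit`."""
    seen = set()
    merged = []
    for game in list(as_player_one) + list(as_player_two):
        # A game can show up in both lists if a user plays against themselves.
        if game["id"] not in seen:
            seen.add(game["id"])
            merged.append(game)
    merged.sort(key=lambda g: g["createdDate"], reverse=True)
    return merged[:limit]
```

This reads up to 60 entities to return 30, which is the price of keeping the two-field model.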
However, if you can change the entity model, then a good way to achieve what you want is to have a single field containing a list of both usernames.
Then you can do a simple query in the style of:
SELECT * FROM Game WHERE playerBoth = 'username'

Searching for and matching elements across arrays

I have two tables.
In one table there are two columns, one has the ID and the other the abstracts of a document about 300-500 words long. There are about 500 rows.
The other table has only one column and >18000 rows. Each cell of that column contains a distinct acronym such as NGF, EPO, TPO etc.
I am interested in a script that will scan each abstract in table 1 and identify any acronyms in it that are also present in table 2.
Finally, the program should create a separate table where the first column contains the ID from table 1 and the second contains the acronyms found in the document associated with that ID.
Can someone with expertise in Python, Perl or any other scripting language help?
It seems to me that you are trying to join the two tables where the acronym appears in the abstract, i.e. (pseudo SQL):
SELECT acronym.id, document.id
FROM acronym, document
WHERE acronym.value IN explode(document.abstract)
Given the desired semantics, you can use the most straightforward approach:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            index = acronyms.index(word)
            joins.append((id, index))
        except ValueError:
            pass # word not an acronym
This is a straightforward implementation; however, its running time is roughly cubic (documents × words per abstract × acronyms), because acronyms.index performs a linear search (of our largest array, no less). We can improve the algorithm by first building a hash index of the acronyms:
acronyms = ['ABC', ...]
documents = [(0, "Document zeros discusses the value of ABC in the context of..."), ...]
# Map each acronym to its position so lookups take constant time.
index = dict((acronym, idx) for idx, acronym in enumerate(acronyms))
joins = []
for id, abstract in documents:
    for word in abstract.split():
        try:
            joins.append((id, index[word]))
        except KeyError:
            pass # word not an acronym
Of course, you might want to consider using an actual database. That way you won't have to implement your joins by hand.
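If you stay in Python, the hash-index idea can be wrapped into a small function that produces the table the question asks for: one row per document ID with the acronyms found in it. This is a sketch under a few assumptions: matching is case sensitive, words are runs of letters and digits (so punctuation next to an acronym doesn't block a match, but a hyphenated acronym would be split), and find_acronyms is a name I made up.

```python
import re


def find_acronyms(documents, acronyms):
    """Return [(doc_id, [acronyms found in that abstract]), ...]."""
    known = set(acronyms)  # set membership tests take constant time
    table = []
    for doc_id, abstract in documents:
        words = re.findall(r"[A-Za-z0-9]+", abstract)
        found = sorted(set(w for w in words if w in known))
        table.append((doc_id, found))
    return table


documents = [(1, "NGF and EPO (EPO) levels rise."), (2, "Nothing relevant here.")]
print(find_acronyms(documents, ["NGF", "EPO", "TPO"]))
```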
Thanks a lot for the quick response.
I assume the pseudo SQL solution is for MySQL etc. However, it did not work in Microsoft Access.
The second and the third are for Python, I assume. Can I feed the acronyms and documents in as input files?
babru
It didn't work in Access because tables are accessed differently (e.g. acronym.[id])
