So I have a User class:
class User(db.Model):
    points = db.IntegerProperty()
So I created 1000 dummy entities on the development server with points ranging from 1 to 1000, and ran this query:
query = db.GqlQuery("SELECT * FROM User WHERE points >= 300 "
                    "AND points <= 700 "
                    "ORDER BY points DESC "
                    "LIMIT 20")
I only want 20 results per query (enough to fill a page). I don't need any pagination of the results.
Everything looks OK; it worked on the development server.
Questions:
1. Will it work on a production server with 100,000-500,000 User entities? Will I experience serious lag? I hope not, because I heard that App Engine indexes the points column automatically.
2. Any other optimization techniques you can recommend?
I think it is difficult to say what kind of performance issues you will have with such a large number of entities. This one particular query will probably be fine, but you should be aware that no datastore query can ever return more than 1000 entities, so if you need to operate on numbers larger than 1000, you will need to do it in batches, and you may want to partition them into separate entity groups.
As far as optimization goes, you may want to consider caching the results of this query and only running it when you know the information has changed, or at specific intervals. If the query is for some purpose where exactly correct results are not critical -- say, displaying a leaderboard or a high-score list -- you might choose to update and cache the result once every hour or so.
The only other optimization I can think of is that you can save the cycles associated with parsing that GQL statement by doing it once and saving the resulting query object, either in memcache or a global variable.
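For example, a minimal sketch of that idea, reusing the User model from the question (the module-level variable holds the parsed query, and bind() swaps in new parameter values without re-parsing the GQL):

from google.appengine.ext import db

# Parsed once at module load time; reused by every request this instance serves.
TOP_USERS_GQL = db.GqlQuery(
    "SELECT * FROM User WHERE points >= :1 AND points <= :2 "
    "ORDER BY points DESC LIMIT 20", 300, 700)

def top_users(low=300, high=700):
    TOP_USERS_GQL.bind(low, high)  # re-binds parameters; no GQL parsing happens here
    return TOP_USERS_GQL.fetch(20)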
Your code seems fine for getting the top users, but more complex queries, like finding the rank of a specific user, will be hard. If you need that kind of functionality too, have a look at google-app-engine-ranklist:
Ranklist is a python library for Google App Engine that implements a data structure for storing integer scores and quickly retrieving their relative ranks.
I am using GAE NDB with Python.
Approach 1:
# both models below have similar properties (same number and type)
class X1(ndb.Model):
    p1 = ndb.StringProperty()
    ::

class X2(ndb.Model):
    p1 = ndb.StringProperty()
    ::

def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    record_list = []
    if q == "a":
        qry = X1.query(X1.p1 == w)
        record_list = qry.fetch()
    elif q == "b":
        qry = X2.query(X2.p1 == w)
        record_list = qry.fetch()
Approach 2:
class X1(ndb.Model):
    p1 = ndb.StringProperty()
    ::

def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    if q == "a":
        k = ndb.Key("type_1", "k1")
    elif q == "b":
        k = ndb.Key("type_2", "k1")
    # filter arguments must come before the ancestor keyword argument
    qry = X1.query(X1.p1 == w, ancestor=k)
    record_list = qry.fetch()
My Questions:
Which approach is better in terms of query performance as I scale up the number of entities?
Would there be a significant impact on query performance if I scale up the number of ancestors (at the same hierarchy level, horizontally) to 10,000 or 100,000 in approach 2?
Is this application the correct use case for ancestors?
Context:
This project is for understanding GAE better, and the goal is to create an ecommerce website like amazon.com where I need to query based on many (~10) filter conditions (price range, brand, screen size, and so on). Each filter condition has a few ranges (e.g., there could be five price bands); multiple ranges of a filter condition can be selected simultaneously, and multiple filter conditions can be selected, just like in amazon.com's left pane.
If I put all the filter conditions into the query as AND/OR-connected expressions, it would take a huge amount of time for scaled data sets, even if I use a query cursor and fetch page by page.
To overcome this, I thought I would store the data in entities whose parent is a string: a concatenation of the different filter options which the product matches. There would be a lot of redundancy, as I would store the same data in several entities for all the combinations of filter values it satisfies. The disadvantage of this approach is that each product is stored multiple times in different entities (much more storage); but I was hoping to get much better query performance (<2 seconds), since now my query string would contain only one or two AND/OR-connected elements apart from the ancestor. The ancestor would be the concatenation of the filter conditions which the user has selected to search for a product.
Please let me know if I am not being clear. This is just an experimental approach that I am trying. Another approach would have been to cache the results through a cron job periodically.
Any other suggestion to achieve good query performance for such a website would be highly appreciated.
UPDATE (NEW STRATEGY):
I have decided to go with a model with some boolean properties (flags) for each range of each category (~14 such properties per entity). For one category, which had two possible values, I have three models (one holding all entities with either of the two values, and the other two holding the entities with each value), so there is duplication (the same data could be stored twice, in two entities).
Also, my complete product data model is a separate one; the above model just contains a key to this complete model.
I could not do away with the Query class and write my own filtering (I actually did that with good success initially). The reason is that I need to fetch results page by page (~15 results), and I need to sort them too; if I fetch all results and apply my own filtering, the fetch takes a huge amount of time on a large data set because of the size of the results returned.
The initial development server results look good: query execution time is <3 seconds for ~6000 matched entities (though I wished it to be ~1 second). I need to scale up the production datastore to test there.
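A rough sketch of the flag model I mean (property and kind names are just illustrative):

from google.appengine.ext import ndb

class ProductIndex(ndb.Model):
    # boolean flags, one per range of each filter category (~14 in total)
    price_band_1 = ndb.BooleanProperty(default=False)
    price_band_2 = ndb.BooleanProperty(default=False)
    brand_a = ndb.BooleanProperty(default=False)
    # ... more flags ...
    rating = ndb.IntegerProperty()    # sort field
    product = ndb.KeyProperty()       # key to the complete product model

# Equality-only filters plus one sort order (needs a composite index),
# fetched page by page.
qry = (ProductIndex.query(ProductIndex.price_band_1 == True,
                          ProductIndex.brand_a == True)
       .order(-ProductIndex.rating))
results, cursor, more = qry.fetch_page(15)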
EDIT after context definition:
Tough subject there. You have plenty of datastore limitations that can get in your way:
Write throughput (1 write/sec per entity group)
The limit on query inequality filters
Cross-entity-group transactions at write time (duplicating your product in each "query filter"-specific entity group)
Max entity size (1 MB) if you duplicate whole products for every "query filter" entity
I don't have any "ready made" answer, just some humble advice based on common sense.
In my opinion your first solution will get overly complex as you add new filtering criteria, types of products, etc.
The problem with the datastore, and most "NoSQL" solutions, is that they tend to have few analytic/query features out of the box (they are not at the maturity level of RDBMSs, which have evolved for years), forcing you to compute results "by hand".
For your case, I don't see anything out of the box, and the "datastore query engine" is clearly not enough for such queries.
Keep your data quite simple, though: just store your products as entities with properties.
If you have clearly different product categories, you may store them as different entity kinds -> I highly doubt people will run a "brand" query for both "shoes" and "food".
You will have to run a datastore query within the limitations to quickly get a gross result set, and refine it by hand (map reduce job, async task..) ... and then cache the result for as long as you can.
-> your aggressive cache solution looks far better from a performance, cost and maintainability standpoint.
You won't be able to cache your whole product base, and some queries for rare items will take longer... like I said, I don't see any perfect answer here, just different performance trade-offs.
Just my 2 cents :) I'll be curious to see what solution you end up adopting.
You typically use ancestors for data that is owned by an entity.
For example :
A Book is your root entity, and it "owns" Page entities.
A Page without a Book is meaningless.
Book is the ancestor of Page.
A User is your root entity, and it "owns" BlogPost entities.
A BlogPost without its writer is quite meaningless.
User is the ancestor of BlogPost.
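A minimal NDB sketch of the Book/Page case (kind and property names are just for illustration):

from google.appengine.ext import ndb

class Book(ndb.Model):
    title = ndb.StringProperty()

class Page(ndb.Model):
    number = ndb.IntegerProperty()

book_key = ndb.Key(Book, "moby-dick")
Page(parent=book_key, number=1).put()   # every Page is created under its Book

# The ancestor query returns only that book's pages, with strong consistency.
pages = Page.query(ancestor=book_key).fetch()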
If your two entities X1 and X2 share the same attributes, I'd say they are the same X entity, with just an additional "type" attribute to determine whether you're talking about X type 1 or X type 2.
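For instance (a sketch, using NDB as in the question; the type property name is made up):

from google.appengine.ext import ndb

class X(ndb.Model):
    p1 = ndb.StringProperty()
    x_type = ndb.StringProperty(choices=["type_1", "type_2"])

w = "some-value"   # would come from self.request.get("w") as in the question
# One kind and one extra equality filter instead of two model classes.
record_list = X.query(X.x_type == "type_1", X.p1 == w).fetch()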
I've been trying to solve this problem for a week and couldn't come up with any solutions in all my research, so I thought I'd ask you all.
I have a "Product" table and a "productSent" table; here's a quick schema to help explain:
class Product(ndb.Model):
    name = ndb.StringProperty()
    rating = ndb.IntegerProperty()

class productSent(ndb.Model):   # the key name here is md5(Product Key + UUID)
    pId = ndb.KeyProperty(kind=Product)
    uuId = ndb.KeyProperty(kind=userData)
    action = ndb.StringProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest-rated product that they've never seen before -- fast. So to keep track of the products users have seen, I use the productSent table. I created this table instead of using cursors because every time the rating order changes, there's a possibility that the cursor skips the new, higher-ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users liked product #25, making it the #10 product in the database -- I'm worried that the product will never be shown again to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that, once the user has blown past the first 1,000 products, query performance really starts to slow down, because I'm literally pulling 1,000+ results, checking whether they've been sent by querying against the productSent table (doing a key-name lookup to speed things up), and going through the loop until 15 new ones have been found.
One solution I thought of was to add a repeated property (listProperty) to the Product table of all the users who have seen a product. Or if I don't want to have inequality filters I could put a repeated property of all the users who haven't seen a product. That way when I query I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The index size will increase storage costs.
Has anyone dealt with this problem before? (I'm sure someone has!) Any tips on the best way to structure it?
update
Okay, so I had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column that only has 3 possible values: positive, neutral, negative, and sort by that. Of course, items that have a rating of 0 and get a 'like' (making them positive) would still have a chance of being out of order or skipped by the cursor, but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent, would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest-rated product they have not seen, you will only have to query over the productNotSent entities that match that user. If you put the rating directly on the productNotSent you could speed the query up even more, since you will only have to query against one model.
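A sketch of what that could look like (built on the models from the question; the denormalized rating is the speed-up mentioned above, and the query needs a composite index on uuId + rating):

from google.appengine.ext import ndb

class productNotSent(ndb.Model):
    uuId = ndb.KeyProperty(kind="userData")   # user who has not seen it yet
    pId = ndb.KeyProperty(kind="Product")
    rating = ndb.IntegerProperty()            # denormalized copy of Product.rating

def unseen_products(user_key):
    # Highest-rated unseen products for one user, in a single query.
    qry = (productNotSent.query(productNotSent.uuId == user_key)
           .order(-productNotSent.rating))
    return qry.fetch(15)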
Another idea would be to limit the number of productNotSent entities per user, so each user only has ~100 of these entities at a time. This would mean your query would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though. You'd have to have a cron job or something that "tops up" a user's collection of productNotSent entities when they use some up. You may also want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and will require some design trade-offs.
Hope this helps!
I do not know your expected volumes and exact issues (I only did a quick perusal of your question), but you may consider using JSON TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dump()ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly with indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer record updates, etc.
One downside is higher bandwidth going out of your app, but I think this cost will be minimal given the potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures. This approach will be far cheaper, though, than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits, so be sure that your JSON objects are fairly simple and lightweight (e.g. do not include product descriptions, sub-objects, etc. in the JSON item, just the basics such as the product number). HTH, -stevep
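A bare-bones sketch of that pattern (the kind and field names here are hypothetical):

import json
from google.appengine.ext import ndb

class CustomerRec(ndb.Model):
    payload = ndb.TextProperty()    # TextProperty is unindexed, so no index cost

def save_listing(user_id, products):
    # products: a list of small dicts, e.g. {"id": 7, "name": "...", "rating": 3}
    CustomerRec(id=user_id, payload=json.dumps(products)).put()

def load_listing(user_id):
    rec = CustomerRec.get_by_id(user_id)    # keyed get, no query involved
    return json.loads(rec.payload) if rec else []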
I am working on my first GAE project using Java and the datastore, and this is my first try with a NoSQL database. Like a lot of people, I have problems understanding the right model to use. So far I've figured out two models, and I need help choosing the right one.
All the data is represented in two classes, User.class and Word.class.
User: a couple of strings with user data (username, email, ...)
Word: two strings
Which is better:
1. Search 10,000,000 Word entities for the 100 I need. For instance, every Word entity has a string property owner, and I query (owner = 'John').
2. In User.class I add a property List<Word> and a method getWords() that returns the list of words. So I query 1000 users for the one I need and then call getWords(), which returns the List<Word> with the 100 I need.
Which one uses less resources ? Or am i going the wrong way with this ?
The answer is to use Appstats, and you can find out:
AppStats
To keep your application fast, you need to know: Is your application making unnecessary RPC calls? Should it be caching data instead of making repeated RPC calls to get the same data? Will your application perform better if multiple requests are executed in parallel rather than serially?
Run some tests, try it both ways and see what appstats says.
But I'd say that your option 2 is better, simply because you don't need to search millions of entities. But who knows for sure? The trouble is that "resources" are a dozen different things in App Engine: CPU, datastore reads, datastore writes, etc.
For your User class, set a unique ID for each user (such as a username or email address). For the Word class, set the parent of each Word entity to a specific User.
So, if you wanted to look up words from a specific user, you would do an ancestor query for all words belonging to that specific user.
By setting an ID for each user, you can get that user by ID as opposed to doing an additional query.
More info on ancestor queries:
https://developers.google.com/appengine/docs/java/datastore/queries#Ancestor_Queries
More info on IDs:
https://developers.google.com/appengine/docs/java/datastore/entities#Kinds_and_Identifiers
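The question is Java, but the layout is easiest to see in a short Python NDB sketch (names hypothetical; the Java datastore API has the same parent/ancestor concepts):

from google.appengine.ext import ndb

class User(ndb.Model):      # key id = username or email address
    email = ndb.StringProperty()

class Word(ndb.Model):      # created with its owning User as parent
    text = ndb.StringProperty()

user_key = ndb.Key(User, "john@example.com")
user = user_key.get()                              # lookup by ID, no query
words = Word.query(ancestor=user_key).fetch(100)   # ancestor query for the words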
It really depends on the queries you're using. I assume that you want to find all the words given a certain owner.
Most likely, 2 would be cheaper, since you'll need to fetch the user entity instead of running a query.
2 will be a bit more work on your part, since you'll need to manually keep the list synchronized with the instances of Word.
Off the top of my head I can think of 2 problems with #2, which may or may not apply to you:
A. If you want to find all the owners given a certain word, you'll need to keep that list of words indexed. This affects your costs. If you mostly find words by owner, and rarely find owners by words, it'll still make sense to do it this way. However, if your search pattern flips around and you're searching for owners by words a lot, this may be the wrong design. As you see, you need to design the models based on the queries you will be using.
B. Entities are limited to 1MB, and there's a limit on the number of indexed properties (5000, I think?). Those two will limit the number of words you can store in your list. Make sure that you won't need more than that limit of words per user. Method 1 allows you unlimited words per user.
I'm considering creating my own GAE app to track players' highscores in my videogames. I've already created a simple app that allows me to send and recover Top 10 highscores (that is, just 10 scores are stored per game), but now I'm considering costs if things grow.
Say a game has thousands or millions of players (hehe, not mine). I've seen how applications like OpenFeint are able to sort your score and tell your exact rank in a highscore table with thousands of entries. You may be #19623, for example.
In order to keep things simple, I would create Top 100 score tables. But what if I truly wanted to store all scores and keep things sorted? Does it make sense to simply sort scores as they are queried from the database? I don't think so...
How are such applications implemented?
On GAE it's easy to return sorted queries as long as you index your fields. If your goal is just to find the top 100 scores, you can do an ordered query by score for 100 entities - you will get them in order.
https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_order
The harder part is assigning the number to the query. For the top 100, you'd basically go through the returned list of 100 entities, and print a number beside each of them.
If you need to find a user at a particular rank, you can use a cursor to narrow your search to, say, whoever is at rank #19623.
What you won't be able to do efficiently with this is figure out the rank of a single entity. In order to figure out rankings using the built-in index, you'd have to query for all entities and find where that individual entity is in the list.
The laziest way to do the ranking would be something like: search for the top 100; if the user is in there, show their ranking; if not, tell them their rank is > 100. Another possibility is to occasionally do large queries to get score ranges, store those, and then give the user a less accurate placement (you are in the top 500, top 1000, etc.) without having the exact place.
Standard database indexing - both on App Engine and elsewhere - doesn't provide an efficient way to find the rank of a row/entity. One option is to go through the database at regular intervals and update the current rank. If you want ranks to be updated immediately, however, a tree-based solution is better. One is provided for App Engine in the app-engine-ranklist project.
We had the same problem with TyprX typing races (GWT + App Engine). The way we did it without going through millions of rows is to store high scores like this:
class User {
    Integer day, month, year;
    Integer highscoreOfTheDay;
    Integer highscoreOfMonth;
    Integer highscoreOfTheYear;
}
Doing so, you can get a sorted list of daily, monthly, and yearly high scores with one query. The key is to update the users' records with their own best score for each period as they finish their games.
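Roughly, the update step looks like this (a Python sketch of the idea; the month/year rollover is simplified away):

def record_score(user, score, today):
    # New day: reset the daily best before comparing.
    if (user.day, user.month, user.year) != (today.day, today.month, today.year):
        user.day, user.month, user.year = today.day, today.month, today.year
        user.highscoreOfTheDay = 0
    user.highscoreOfTheDay = max(user.highscoreOfTheDay, score)
    user.highscoreOfMonth = max(user.highscoreOfMonth, score)
    user.highscoreOfTheYear = max(user.highscoreOfTheYear, score)
    user.put()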
Then we save the result to memcache, and voilà.
Daniel
I'd think about using exception processing. How many of the thousands of results each day/hour will be a top-100 score? Keep a min/max top-100 range entity (memcached, of course). Each score that comes in goes one direction if it is within the range, and another direction (task queue?) if not. Why not shunt the 99% of non-relevant work to another process, and only deal with 100+1 records in whatever your final setup might be for changing the rankings?
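A sketch of that shunt (the memcache key, task URL, and record_plain_score helper are all made up):

from google.appengine.api import memcache, taskqueue

def submit_score(user_id, score):
    # Lowest score currently in the top 100, cached in memcache.
    floor = memcache.get("top100_min") or 0
    if score > floor:
        # Rare path: may change the rankings; hand it off to a worker task.
        taskqueue.add(url="/tasks/rerank",
                      params={"user": user_id, "score": score})
    else:
        # Common path (~99%): just persist the score, no ranking work needed.
        record_plain_score(user_id, score)   # hypothetical storage helper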
What I'm looking for essentially is this SQL translated into Google AppEngine (for Java) terms:
select count(*) from Customers
Seems simple enough, but from reading the documentation, it seems like I would have to run a query that matches all Customers, loop through it and count the results, taking paging into account. I do not want to retrieve each and every element; I just want to count them.
Or, another way: there was an API to loop over all entries of a given type (I can't find the exact API at the moment). This seems quite inefficient, not to mention that datastore calls come with a limited quota as well.
Any hints would be appreciated.
Thanks, Mark
As wooble says, Bigtable doesn't support row counts as a fundamental concept -- you can write a wrapper function, as mcotton says, but, as he quotes from the docs, that will still be limited to 1000 at most.
To overcome these limits you'll need to keep, for each kind of entity you want to count, a counter that gets incremented every time a new entity of that kind is put, and decremented when an entity of that kind is deleted.
To keep your app highly scalable you'll probably want to shard such counters, see http://code.google.com/appengine/articles/sharding_counters.html (unfortunately I'm not aware of a translation of that recipe to Java, but the concepts should be the same).
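A condensed Python sketch of that recipe (the linked article has the full version):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20

class CounterShard(ndb.Model):   # key id = "<kind_name>-<shard_number>"
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(kind_name):
    # Pick a random shard so concurrent writers rarely contend.
    shard_id = "%s-%d" % (kind_name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id) or CounterShard(id=shard_id)
    shard.count += 1
    shard.put()

def get_count(kind_name):
    # Sum all shards: one batched get instead of a query.
    keys = [ndb.Key(CounterShard, "%s-%d" % (kind_name, i))
            for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s is not None)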
As mcotton said, it appears that count() on a "SELECT __key__" query with no limit may do what you want.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count
This is a relatively new feature in Google Datastore though. They used to have a required limit of 1000 on this. They only recently removed that limit. The only limit now is whether your query executes quickly enough to not time out.
There's also the new Google Mapper API you could consider if this is a truly huge amount of data and you do hit timeouts. To read more on that, do a Google search for [appengine mapreduce].
I agree that it is pretty amazing that GQL doesn't support "SELECT COUNT(*)"; that seems like a bit of an oversight. But doing a select on the key only, and then using count() so those keys aren't sent all the way back to the app, should behave similarly.
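In Python that looks like the line below (I believe the Java low-level API's PreparedQuery.countEntities() is the analogue):

from google.appengine.ext import db

# Keys-only count: only keys are scanned; no entities are returned to the app.
num_customers = db.GqlQuery("SELECT __key__ FROM Customers").count()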
Unfortunately, it's impossible for BigTable to count entities without running queries to match all of them. Keeping in mind that applications like Google Search and Google Reader won't even give you exact counts for results when you have more than 1000, if you absolutely, positively, think you need to count all of your entities, you could do a series of keys_only queries limited to 1000 entities each and add up the counts for all of them.
This is just speculation, but I think they will implement a count() method in Java similar to their Python implementation. Here is the count() method for Python:
count(limit)
Returns the number of results this query fetches.
count() is somewhat faster than retrieving all of the data by a constant factor, but the running time still grows with the size of the result set. It's best to only use count() in cases where the count is expected to be small, or specify a limit.
Note: count() returns a maximum of 1000. If the actual number of entities that match the query criteria exceeds the maximum, count() returns a count of 1000.
Arguments:
limit: The maximum number of results to count.
This is a very old thread, but just in case it helps other people looking at it, there are 3 ways to accomplish this:
Accessing the Datastore statistics
Keeping a counter in the datastore
Sharding counters
Each one of these methods is explained in this link.
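For instance, the first option (datastore statistics) can be a single read, sketched here in Python; note the statistics lag hours behind the live data:

from google.appengine.ext.db import stats

kind_stat = stats.KindStat.all().filter("kind_name =", "Customers").get()
if kind_stat:
    print kind_stat.count    # approximate number of Customers entities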