Datastore and MemCache Entry Read and Pricing - google-app-engine

Datastore charges by number of entities read.
If an entity is read from memcache, does it count as an entity read in datastore pricing and got billed?
If an entity is read multiple times in the same batch, does it count as one read or multiple reads?
For example, the classic Post and Tag problem, I want to look up names of tags for a list of posts,
class Post(ndb.Model):
title = ndb.StringProperty()
tag_ids = ndb.KeyProperty(repeated=True)
class Tag(ndb.Model):
name = ndb.StringProperty()
#ndb.tasklet
def callback(post):
tags = yield Tag.get_multi(tag_id for tag_id in post.tag_ids)
raise ndb.Return(tags)
qry = Post.query()
output = qry.map(callback, limit=20)
post01 has tag01, tag02, and post02 has tag02, tag03. In this case, tag02 was queried twice in the same batch, does tag02 count as 2 read or 1 read?
Is there any profile library to get number of read counted for billing, so I can figure out above questions myself?
Thanks in advance.

You will not get billed for reading entities from the memcache but you have take
care on your own for putting the entities in the cache and reading from the cache.
I think this can become quite challenging when you are using a complex data model.
I am running a small application with just a few hundreds entities and when the first
user reads the data I put all the entities in the cache and all other users will get the
data from the cache.
For my other applications I am using Objectify (I am using Java) which supports putting the entities
in the Memcache (https://github.com/objectify/objectify). Maybe something similar for Python exists.

Related

Google App Engine - JSONProperty vs Separate Model

Say I have a blog app with blog posts and comments. Lets say for the sake of argument that there can be a very large number of comments, big enough that a simple comments = StringProperty(repeated=True) would be insufficient.
Should I store the comments as a JSONProperty (serialized from python list):
class BlogPost(ndb.Model):
title = ndb.StringProperty()
description = ndb.TextProperty()
comments = ndb.JSONProperty()
Or should I create a separate Comment model altogether and store the corresponding blogpost's ID as a property:
class Comment(ndb.Model):
text = ndb.TextProperty()
blog_id = ndb.IntegerProperty()
created = ndb.DateTimeProperty(auto_now_add=True)
And I can query for all the comments of a specific blogpost as follows: query = Comment.query(Comment.blog_id==blog_id).order(-Comment.created)?
Is one approach preferable? Especially if comments could get very large > 1000.
You definitely want a separate model for comments.
One reason is that entities are limited to 1MB in size. If one post gets a huge number of comments, then you are in danger of exceeding the limit and your code would crash.
Another reason is that you want to consider read/write rates for entities and scalability. If you use JSON, then you need update the BlogPost entity every time a comment is made. If a lot of people are writing comments at the same time, then you will need transactions and have contention issues. If you have a separate model for comments, then you can easily scale to a million comments per second!

Comparison between using two Models and using one Model with entities with two ancestors in GAE NDB Python(design for amazon.com like website)

I use GAE NDB Python
Approach 1:
# both models below have similar properties (same number and type)
class X1(ndb.Model):
p1 = ndb.StringProperty()
::
class X2(ndb.Model):
p1 = ndb.StringProperty()
::
def get(self):
q = self.request.get("q")
w = self.request.get("w")
record_list = []
if (q=="a"):
qry = X1.query(X1.p1==w)
record_list = qry.fetch()
elif (q=="b"):
qry = X2.query(X2.p1==w)
record_list = qry.fetch()
Approach 2:
class X1(ndb.Model):
p1 = ndb.StringProperty()
::
def get(self):
q = self.request.get("q")
w = self.request.get("w")
if (q=="a"):
k = ndb.Key("type_1", "k1")
elif (q=="b"):
k = ndb.Key("type_2", "k1")
qry = X1.query(ancestor=k, X1.p1==w)
record_list = qry.fetch()
My Questions:
Which approach is better in terms of query performance when I scale up the entities
Would there be significant impact on query performance if I scale up the ancestors (in the same hierarchy level horizontally) to 10,000 or 1,00,000 in approach 2
Is this application the correct use case for ancestor
Context:
This project is for understanding GAE better and the goal is to create an ecommerce website like amazon.com where I need to query based on a lot many(10) filter conditions(like, price range, brand, screen size, and so on). Each filter condition has few ranges(like, there could be five price bands); multiple ranges of a filter condition could be selected simultaneously. Multiple filter conditions could be selected just like on amazon.com left pane.
If I put all the filter conditions in the query in the form of AND, OR connected expression, it would take huge amount of time for scaled data sets even if I use query cursor and fetch by page.
To overcome this, I thought I would store the data in entities with parent as a string. The parent would be a cancatenation of the the different filters options which the product matches. There would be a lot of redundancy as I would store the same data in several entities for all the combinations of filter values which it satisfies. The disadvantage of this approach is that each product data is being stored multiple times in different entities(much more storage); but I was hoping to get a much better query performance(<2 seconds) since now my query string would contain only one or two AND or OR connected elements apart from ancestor. The ancestor would be the concatenation of the filter conditions which the user has selected to search for a product
Please let me know if I am not clear.. This is just an experimental approach that I am trying.. Another approach would have been to cache the results through a cron job periodically..
Any other suggestion to achieve a good query performance for such a website would be highly appreciated..
UPDATE(NEW STRATEGY):
i have decided to go with a model with some boolean properties(flags) for each range of each category(total such property per entity is ~14).. for one category, which had two possible values, I have three models(one having all entities of with either of the two values, and the other two for entites with each value).. so there is duplication(same data could be store twice in two entities)..
also my complete product data model is a separate one.. the above model contains a key to this complete model..
i could not do away with Query class and write my own filtering(i actually did that with good success initially).. the reason is that i need to fetch results page by page(~15 results).. and i need to sort them too.. if i fetch all results and apply my own filtering, with large data set the fetching of all results takes a huge amount of time because of the large size of the results returned..
the initial development server results look good.. query execution time is <3 seconds for ~6000 matched entities.. (though i wished it to be ~1 second).. need to scale up the production datastore to test there..
EDIT after context definition:
Tough subject there. You have plenty of datastore limitations that can get in your way :
Write throughput (1 write/sec per Entity Group)
Query inequality filters limit
Cross entity group transactions at write time (duplicating your product in each
"query filter" specific entity group )
Max entity size (1MB) if you duplicate whole products for every "query filter" entity
I don't have any "ready made" answer, just some humble advice based on common sense.
In my opinion your first solution will get overly complex as you add new filtering criterias, type of products, etc.
The problem with the datastore, and most "NoSQL" solutions, is that they tend to have few analytic/query features out of the box (they are not at the maturity level of RDBMS that have evolved for years), forcing you to compute results "by hand".
For your case, I don't see anything out of the box, and the "datastore query engine" is clearly not enough for such queries.
Keep your data quite simple though, just store your products as entities with properties.
If you have clearly different product categories, you may store them as different entity kinds -> I highly doubt people will run a "brand" query for both "shoes" and "food".
You will have to run a datastore query within the limitations to quickly get a gross result set, and refine it by hand (map reduce job, async task..) ... and then cache the result for as long as you can.
-> your aggressive cache solutions looks far better from a performance, cost and maintainability standpoint.
You won't be able to cache your whole product base, and some queries for rarities will take longer... like I said, I don't see any perfect answers here, just different tradeoffs for performance.
Just my 2 cents :) I'll be curious in what solution you end up adopting.
You typically use ancestors for data that is own by an entity.
For example :
A Book is your root entity, and it "owns" Page entities.
A Page without a Book is meaningless.
Book is the ancestor of Page.
A User is your root entity, and it "owns" BlogPost entities.
A BlogPost without its Writter is quite meaningless.
User is the ancestor of BlogPost.
If your two entities X1 and X2 share the same attributes, I'd say they are the same X entity, with just an additonal "type" attribute to determine if your talking about X Type1 or X type2.

google app engine query opimization

I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things and right now whenever I want to show all posts by that user I do a query for all posts with that user's user ID and then I display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would that extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
Thanks!
The main reason you would want to store the list of IDs would be so that you can get each entity separately for better consistency - entity gets by id are consistent with the latest version in the datastore, while queries are eventually consistent.
Check datastore costs and optimize for cost:
https://developers.google.com/appengine/docs/billing
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
There is several cases.
First, if you keep track for all ids of user's posts. You must use entity group for consistency. Thats means speed of write to datastore would be ~1 entity per second. And cost is 1 read for object with ids and 1 read per entity.
Second, if you just use query. This is not need consistency. Cost is 1 read + 1 read per entity retrieved.
Third, if you quering only keys and after fetching. Cost is 1 read + 1 small per key retrieved. Watch this: Keys-Only Queries. This equals to projection quering for cost.
And if you have many result, and use pagination then you need use Query Cursors. That prevent useless usage of datastore.
The most economical solution is third case. Watch this: Batch Operations.
In case you have a list of id's because they are stored with your entity, a call to ndb.get_multi (in case you are using NDB, but it would be similar with any other framework using the memcache to cache single entities) would save you further datastore calls if all (or most) of the entities correpsonding to the keys are already in the datastore.
So in the best possible case (everything is in the memcache), the datastore wouldn't be touched at all, while using a query would.
See this issue for a discussion and caveats: http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118.

Improve App Engine performance by reducing entity size

The objective is to reduce the CPU cost and response time for a piece of code that runs very often and must db.get() several hundred keys each time.
Does this even work?
Can I expect the API time of a db.get() with several hundred keys
to reduce roughly linearly as I reduce the size of the entity?
Currently the entity has the following data attached: 9 String, 9
Boolean, 8 Integer, 1 GeoPt, 2 DateTime, 1 Text (avg size ~100 bytes
FWIW), 1 Reference, 1 StringList (avg size 500 bytes). The goal is to
move the vast majority of this data to related classes so that the
core fetch of the main model will be quick.
If it does work, how is it implemented?
After a refactor, will I still incur the same
high cost fetching existing entities? The documentation says that all
properties of a model are fetched simultaneously. Will the old
unneeded properties still transfer over RPC on my dime and while users
wait? In other words: if I want to reduce the load time of my entities, is
it necessary to migrate the old entities to ones with the new
definition? If so, is it sufficient to re-put() the entity, or must I
save under a wholly new key?
Example
Consider:
class Thing(db.Model):
text = db.TextProperty()
strings = db.StringListProperty()
num = db.IntegerProperty()
thing = Thing(key_name='thing1', text='x' * 10240,
strings = ['y'*500 for i in range(10)], num=23)
thing.put()
Let's say I re-define Thing to be streamlined and push up a new version:
class Thing(db.Model):
num = db.IntegerProperty()
And I fetch it again:
thing_again = Thing.get_by_key_name('thing1')
Have I reduced the fetch time for this entity?
To answer your questions in order:
Yes, splitting up your model will reduce the fetch time, though probably not linearly. For a relatively small model like yours, the differences may not be huge. Large list properties are the leading cause of increased fetch time.
Old properties will still be transferred when you fetch an entity after the change to the model, because the datastore has no knowledge of models.
Also, however, deleted properties will still be stored even once you call .put(). Currently, there's two ways to eliminate the old properties: Replace all the existing entities with new ones, or use the lower-level api.datastore interface, which is dict-like and makes it easy to delete keys.
To remove properties from an entity, you can change your Model to an Expando, and then use delattr. It's documented in the App Engine docs here:
http://code.google.com/intl/fr/appengine/articles/update_schema.html
Under the heading "Removing Deleted Properties from the Datastore"
if I want to reduce the size of my
entities, is it necessary to migrate
the old entities to ones with the new
definition?
Yes. The GAE data store is just a big key-value store, that doesn't know anything about your model definitions. So the old values will be the old values until you put new values in!

App Engine Simple Game Model Experiment ( Scalable )

So I've read all the RMDB vs BigTable debates
I tried to model a simple game class using BigTable concepts.
Goals : Provide very fast reads and considerably easy writes
Scenario: I have 500,000 user entities in my User model. My user sees a user statistics at the top of his/her game page (think of a status bar like in Mafia Wars), so everywhere he/she goes in the game, the stats get refreshed.
Since it gets called so frequently, why don't I model my User around that fact?
Code:
# simple User class for a game
class User(db.Model):
username = db.StringProperty()
total_attack = db.IntegerProperty()
unit_1_amount = db.IntegerProperty()
unit_1_attack = db.IntegerProperty(default=10)
unit_2_amount = db.IntegerProperty()
unit_2_attack = db.IntegerProperty(default=20)
unit_3_amount = db.IntegerProperty()
unit_3_attack = db.IntegerProperty(default=50)
def calculate_total_attack(self):
self.total_attack = self.unit_1_attack * self.unit_1_amount + \
self.unit_2_attack * self.unit_2_amount + \
self.unit_3_attack * self.unit_3_amount + \
here's how I'm approaching it ( feel free to comment/critique)
Advantages:
1. Everything is in one big table
2. No need to use ReferenceProperty, no MANY-TO-MANY relationships
3. Updates are very easily done : Just get the user entity by keyname
4. It's easy to transfer queried entity to the templating engine.
Disadvantages:
1. If I have 100 different units with different capabilities (attack,defense,dexterity,magic,etc), then i'll have a very HUGE table.
2. If I have to change a value of a certain attack unit, then I'm going to have to go through all 500,000 user entities to change every one of them. ( maybe a cron job/task queue will help)
Each entity will have a size of 5-10 kb ( btw how do I check how large is an entity once I've uploaded them to the production server? ).
So I'm counting on the fact that disk space at App Engine is cheap, and I need to minimize the amount of datastore API calls. And I'll try to memcache the entity for a period of time.
In essence, everything here goes against RMDB
Would love to hear your thoughts/ideas/experiences.
First a simple answer to "how do I know how big an entity is?": Once you've got some data in your app on the app engine servers, you can go to your app's console and click the 'Datastore statistics' link. That will give you some basic stats on your entities, like how much space each Kind is using, what property types are using the most disk space, etc. I don't think you can drill down to the level of one particular User however.
Now here are some thoughts on your design. It is worth it to create a separate table for your Units. Even if you end up with a few hundred units, it will be easy to keep them all in memcache, so looking up the details of each unit will be negligible. It will cost you a few extra API calls to initially populate memcache with a unit's info the first time it is used, but after that you will be saving a good amount of CPU cycles by not having to fetch the details of each unit from the database,and saving huge amounts of API calls when you need to update a unit (which you have already realized will be very expensive) In addition, each User object will use less disk space if it only needs a reference to a Unit entity rather than holding all the details itself. (Of course this depends on the amount of info you need to store about each unit, but you did mention that eventually you will be storing lots of stats for each unit)
If you do have a separate table for Units, it will also allow you to keep your User object more flexible. Instead of needing a specific field for each unit, you could just have a list of refernces to units. That way, if you add a unit type, you would not have to modify your User kind.
You should create independent models for your units. "While a single entity or entity group has a limit on how quickly it can be updated, App Engine excels at handling many parallel requests distributed across distinct entities, and we can take advantage of this by using sharding." Have a look at this article. It may be useful.
based on Peter's thoughts, I came up with the following revised User model. What do you people think?
class Unit(db.Model):
name = db.StringProperty()
attack = db.IntegerProperty()
#initialize 4 different types of units
Unit(key_name="infantry",name="Infantry",attack=10).put()
Unit(key_name="rocketmen",name="Rocketmen",attack=20).put()
Unit(key_name="grenadiers",name="Grenadiers",attack=30).put()
Unit(key_name="engineers",name="Engineers",attack=40).put()
class User(db.Model):
username = db.StringProperty()
# eg: [10,50,100,200] -> this represents 10 infantry, 50 rocketmen, 100 grenadiers and 200 engineers
unit_list_count = db.ListProperty(item_type=int)
# this holds the list of key names of each unit type: ["infantry","rocketmen","grenadiers","engineers"]
unit_list_type = db.StringListProperty()
# total attack is not calculated inside the model. Instead, I will use a
# controller file ( a py file ) to call the contents of unit_list_count and
# unit_list_type of a certain user entity, and make simple multiplications and additions to get total attack
and yes, all the unit_types will be memcached so they can be retrieved for the fast calculation of total attack points.
Would like to hear everyone's thoughts on this.

Resources