Improve App Engine performance by reducing entity size - google-app-engine

The objective is to reduce the CPU cost and response time for a piece of code that runs very often and must db.get() several hundred keys each time.
Does this even work?
Can I expect the API time of a db.get() with several hundred keys to decrease roughly linearly as I reduce the size of the entity?
Currently the entity has the following data attached: 9 String, 9 Boolean, 8 Integer, 1 GeoPt, 2 DateTime, 1 Text (avg size ~100 bytes FWIW), 1 Reference, 1 StringList (avg size 500 bytes). The goal is to move the vast majority of this data to related classes so that the core fetch of the main model will be quick.
If it does work, how is it implemented?
After a refactor, will I still incur the same high cost fetching existing entities? The documentation says that all properties of a model are fetched simultaneously. Will the old unneeded properties still transfer over RPC on my dime and while users wait? In other words: if I want to reduce the load time of my entities, is it necessary to migrate the old entities to ones with the new definition? If so, is it sufficient to re-put() the entity, or must I save under a wholly new key?
Example
Consider:
class Thing(db.Model):
    text = db.TextProperty()
    strings = db.StringListProperty()
    num = db.IntegerProperty()

thing = Thing(key_name='thing1', text='x' * 10240,
              strings=['y' * 500 for i in range(10)], num=23)
thing.put()
Let's say I re-define Thing to be streamlined and push up a new version:
class Thing(db.Model):
    num = db.IntegerProperty()
And I fetch it again:
thing_again = Thing.get_by_key_name('thing1')
Have I reduced the fetch time for this entity?

To answer your questions in order:
Yes, splitting up your model will reduce the fetch time, though probably not linearly. For a relatively small model like yours, the differences may not be huge. Large list properties are the leading cause of increased fetch time.
Old properties will still be transferred when you fetch an entity after the change to the model, because the datastore has no knowledge of models.
Deleted properties will also still be stored even after you call .put(). Currently, there are two ways to eliminate the old properties: replace all the existing entities with new ones, or use the lower-level api.datastore interface, which is dict-like and makes it easy to delete keys.
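For illustration, a minimal sketch of the second option, applied to the Thing example from the question; the entity returned by the low-level API is dict-like, so obsolete properties can simply be deleted:

from google.appengine.api import datastore

# fetch the raw, dict-like entity, drop the obsolete properties, re-save
entity = datastore.Get(datastore.Key.from_path('Thing', 'thing1'))
for prop in ('text', 'strings'):
    if prop in entity:
        del entity[prop]
datastore.Put(entity)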

To remove properties from an entity, you can change your Model to an Expando, and then use delattr. It's documented in the App Engine docs here:
http://code.google.com/intl/fr/appengine/articles/update_schema.html
Under the heading "Removing Deleted Properties from the Datastore"
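As a sketch of the approach described in that article, applied to the Thing example from the question: redefine the model as an Expando, delete the old attributes, and re-put():

from google.appengine.ext import db

# temporarily declare Thing as an Expando so undeclared properties
# loaded from the datastore can be removed with delattr
class Thing(db.Expando):
    num = db.IntegerProperty()

thing = Thing.get_by_key_name('thing1')
for prop in ('text', 'strings'):
    if hasattr(thing, prop):
        delattr(thing, prop)
thing.put()  # this put() actually drops the deleted properties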

if I want to reduce the size of my entities, is it necessary to migrate the old entities to ones with the new definition?
Yes. The GAE datastore is just a big key-value store that doesn't know anything about your model definitions. So the old values will remain the old values until you put new values in!

Related

On Google App Engine, how are StructuredProperties updated?

I am considering ways of organizing data for my application.
One data model I am considering would entail having entities where each entity could contain up to roughly 100 repeated StructuredProperties. The StructuredProperties would be mostly read and updated only very infrequently. My question is - if I update any of those StructuredProperties, will the entire entity get deleted from Memcache and will the entire entity be reread from the ndb? Or is it just the single StructuredProperty that will get reread? Is this any different with LocalStructuredProperty?
More generally, how are StructuredProperties organized internally? In situations where I could use multiple Float or Int properties - and I am using a StructuredProperty instead just to make my model more readable - is this a bad idea? If I am reading an entity with 100 StructuredProperties will I have to make 100 rpc calls or are the properties retrieved in bulk as part of the original entity?
StructuredPropertys belong to the entity that contains them, so your assumption that updating a single StructuredProperty will invalidate the memcache is correct. LocalStructuredProperty has the same behavior; the advantage, however, is that a LocalStructuredProperty is stored as an opaque binary blob, so the datastore has no idea about its structure. (There is probably a deserialization cost attributed to these properties, but that depends a lot on the amount of data they contain, I imagine.)
To contrast, StructuredProperty actually makes its child properties available for query indexing in most cases, allowing you to perform complicated lookups.
Keep in mind that you should be calling put() on the containing entity, not on each StructuredProperty or LocalStructuredProperty, so you should see a single RPC call for updating that parent entity, regardless of how many repeated properties exist.
I would advise using a StructuredProperty that contains ndb.IntegerProperty(repeated=True) rather than making 'parallel lists' of integers and floats; that adds more complexity to your Python model, and is exactly the behavior that ndb.StructuredProperty strives to replace.
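As a hedged illustration of that advice (the Stats and Record model names are hypothetical), a repeated StructuredProperty keeps related values together and is still written with the parent entity in one RPC:

from google.appengine.ext import ndb

class Stats(ndb.Model):
    # hypothetical child model holding values that belong together
    score = ndb.IntegerProperty()
    weight = ndb.FloatProperty()

class Record(ndb.Model):
    # one repeated StructuredProperty instead of parallel int/float lists
    stats = ndb.StructuredProperty(Stats, repeated=True)

record = Record(stats=[Stats(score=1, weight=0.5),
                       Stats(score=2, weight=1.5)])
record.put()  # a single RPC writes the entity and all repeated values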

GAE ndb best practice to store large one to many relations

I'm searching for the best practice for storing a large number of Comment entities that have a one-to-many relationship to another entity.
I read a lot about the limitations about the datastore and don't know how to solve this.
I can't store them as structured properties due to the 1MB Entity Limitation.
Also, Guido van Rossum answered a question about repeated properties by advising against them "if you have more than 100-1000 values".
So repeated properties are no solution for my comments either.
Final question: What is the best practice for solving this problem? Are ancestors an option?
Edit: In this question about ancestor or reference properties Nick Johnson mentioned that "Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down if you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern."
What exactly does "writing multiple entities concurrently" mean? When different users comment on that entity at the same time?
It depends on how much you read and write per billing cycle.
You can store references to more than 1,000 comments (up to an amount depending on the key size and how you reference them) as JSON-compressed unindexed properties. But then take care with referencing and dereferencing that amount: the overhead and the amount of data you transfer on each request will be big, and you don't want to be doing operations on 1,000,000 compressed entity keys on the server for a simple request. If you take this route, do the optimization on the client as smartly as you can.
Or go for ancestors, relax your consistency requirements where you can (e.g. it doesn't matter if a comment is not shown immediately), and page through results with query cursors, as sketched below.
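For illustration, a minimal ndb sketch of the ancestor approach with cursor-based paging (the Comment model and helper names are hypothetical):

from google.appengine.ext import ndb

class Comment(ndb.Model):
    # each comment is its own entity, parented to the post it belongs to
    text = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

def add_comment(post_key, text):
    # writes within one entity group are serialized (roughly 1/sec
    # sustained), which is usually fine for comments on a single post
    Comment(parent=post_key, text=text).put()

def comments_page(post_key, cursor=None):
    # strongly consistent ancestor query, paged 20 at a time with a cursor
    query = Comment.query(ancestor=post_key).order(-Comment.created)
    return query.fetch_page(20, start_cursor=cursor)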

List of keys or separate model?

I'm building an app with users and their activities. Now I'm thinking of the best way of setting up the datastore models. Which one is fastest/preferred, and why?
A
class User(db.Model):
    activities = db.ListProperty(db.Key)
    ...

class Activity(db.Model):
    ...

activities = db.get(user.activities)
or
B
class User(db.Model):
    ...

class Activity(db.Model):
    owner = db.ReferenceProperty(reference_class=User)
    ...

activities = Activity.all().filter('owner =', user)
If a given activity can only have a single owner, definitely use a ReferenceProperty.
It's what ReferencePropertys are designed for
It'll automatically set up back-references for you, which can be handy since it gives you a bi-directional link (unlike the ListProperty which is a uni-directional link)
It enforces that the thing being linked to is the proper type/class
It enforces that only a single user is linked to a given activity
It lets you automatically fetch the linked objects without having to write an explicit query, if you so desire
I'm guessing the difference is going to be marginal and will likely depend more on your application than some concrete difference in read/write times based on your models.
I would say use the first option if you're going to use info from every activity a user has done each time you fetch a user. In other words, if almost everything a user does on your application coincides with a large subset of their activities, then it makes sense to always have the activities available.
Use option B if you don't need the activities all of the time. This will result in a separate request on the data store whenever you need to use the activity, but it will also make the requests smaller. Making an extra request likely adds more overhead than making bigger requests.
All of that being said, I would be surprised if you had a noticeable difference between these two approaches. The area where you're going to get much more noticeable performance improvements is by using memcache.
I don't know about the performance difference, I suspect it'll be similar. When it comes to perf, things are hard to control with the GAE datastore. If all your queries happen to hit the same tablet (bigtable server), that could limit your perf more than the query itself.
The big difference is that A would be cheaper than B. Since you have a list of activities you want, you don't need to write an index for every activity object you write. If activities are written a lot, your savings add up.
Since you have the activity key, you also have the ability to do a highly-consistent get() rather than an eventually consistent filter()
On the flip side, you won't be able to do backwards references, like look up an owner given an activity. Your ListProperty can also cause you to hit your maximum entity size - there will eventually be a hard limit on the number of activities per user. If you went with B, you can have a huge number of activities per user.
Edit: I forgot, you can have backwards references if you index your ListProperty, but then writing your User object gets expensive, and the limit on the number of indexed properties limits the size of your list. So even though it's possible, B is still preferable if you need backwards references.
A will be a good deal faster because it is working purely with keys. Looking up objects with just keys goes straight to the data node in BigTable, whereas B requires a lookup on the indices first which is slower (and costs will go up with the number of Activity entities).
If you never need to test for ownership, you can modify A to not index the key list, as shown below. This is definitely the cheapest and most efficient route. However, as I understand it, App Engine cannot retroactively update the indexes on the key list if you later need them, so only disable the index if you're certain you'll never need it.
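A minimal sketch of that variant, reusing the User model from option A:

class User(db.Model):
    # unindexed key list: cheaper writes, but it can never be queried on
    activities = db.ListProperty(db.Key, indexed=False)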
How about C: setting each Activity's parent to the user's key? Then you can fetch a user's activities with Activity.query(ancestor=user.key).
That way you don't need additional keys/properties, and it is a good way to group your entities for the HR datastore.
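For illustration, a hedged ndb sketch of option C (the id value is hypothetical):

from google.appengine.ext import ndb

class User(ndb.Model):
    pass

class Activity(ndb.Model):
    pass

# create each activity inside the user's entity group
user_key = User(id='alice').put()
Activity(parent=user_key).put()

# strongly consistent ancestor query for this user's activities
activities = Activity.query(ancestor=user_key).fetch()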

app engine's back referencing is too slow. How can I make it faster?

Google App Engine has a smart feature named back-references, and I usually iterate over them where a traditional SQL computed column would be used.
Just imagine needing to accumulate a specific force's total hp.
class Force(db.Model):
    hp = db.IntegerProperty()

class UnitGroup(db.Model):
    force = db.ReferenceProperty(reference_class=Force, collection_name="groups")
    hp = db.IntegerProperty()

class Unit(db.Model):
    group = db.ReferenceProperty(reference_class=UnitGroup, collection_name="units")
    hp = db.IntegerProperty()
When I coded it as follows, it was horribly slow (almost 3s) with 20 forces, each having a single group with a single unit. (I guess the back-references force sub-entities to be reloaded. Am I right?)
def get_hp(self):
    hp = 0
    for group in self.groups:
        group_hp = 0
        for unit in group.units:
            group_hp += unit.hp
        hp += group_hp
    return hp
How can I optimize this code? Please consider that there are more properties that should be computed for each force/unit-group, and I don't want to save these collective properties to each entity. :)
It appears that Force, UnitGroup and Unit would fit well into an Entity Group. Is it true that each Unit belongs to one and only one UnitGroup and that each UnitGroup belongs to one and only one Force? If that is true, then you can store these entities as an entity group and use an ancestor query to reduce the number of datastore queries.
Putting entities in a entity group is accomplished by setting its parent property when it is created. Once that is done, you can get all of the Units that belongs to a force with a single ancestor query.
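For illustration, a hedged sketch of that layout using the models from the question (groups parented under their force, units under their group):

# build the entity group
force = Force(hp=0)
force.put()
group = UnitGroup(parent=force, force=force, hp=0)
group.put()
Unit(parent=group, group=group, hp=10).put()

# one ancestor query fetches every Unit in the force's entity group
units = Unit.all().ancestor(force).fetch(1000)
total_hp = sum(unit.hp for unit in units)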
That being said, the right thing to do is to compute the values when you write the entities to the database, not when you read them. This simple fact has been noted over and over again here and elsewhere, and it is the best way to get good performance out of your AppEngine application. It might seem like a lot of work up front, and it is very counter-intuitive for anyone who has a background with traditional SQL databases, but it is what you want to do, plain and simple.
Storage is cheap on App Engine, and CPU and request times less so. The predominant thing you should remember about the datastore is that you should optimize for reads, not writes -- you will end up reading your data far more often than you write it.
If you think you may ever need to know a Force's total hp, you should store it in a field called total_hp, and update it with the new value each time you update/add/remove its UnitGroups and Units. You should probably do this in a transaction, too, which means they'll need to be in the same Entity Group.
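A minimal sketch of that pattern, assuming Force gains a total_hp property and all three kinds share one entity group (add_unit is a hypothetical helper):

class Force(db.Model):
    hp = db.IntegerProperty()
    total_hp = db.IntegerProperty(default=0)  # denormalized running total

def add_unit(force, group, unit_hp):
    # the new Unit and the updated total are written in one transaction,
    # which works because they live in the same entity group
    def txn():
        f = db.get(force.key())
        Unit(parent=group, group=group, hp=unit_hp).put()
        f.total_hp += unit_hp
        f.put()
    db.run_in_transaction(txn)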

App Engine Simple Game Model Experiment ( Scalable )

So I've read all the RDBMS vs BigTable debates.
I tried to model a simple game class using BigTable concepts.
Goals : Provide very fast reads and considerably easy writes
Scenario: I have 500,000 user entities in my User model. My user sees a user statistics at the top of his/her game page (think of a status bar like in Mafia Wars), so everywhere he/she goes in the game, the stats get refreshed.
Since it gets called so frequently, why don't I model my User around that fact?
Code:
# simple User class for a game
class User(db.Model):
username = db.StringProperty()
total_attack = db.IntegerProperty()
unit_1_amount = db.IntegerProperty()
unit_1_attack = db.IntegerProperty(default=10)
unit_2_amount = db.IntegerProperty()
unit_2_attack = db.IntegerProperty(default=20)
unit_3_amount = db.IntegerProperty()
unit_3_attack = db.IntegerProperty(default=50)
def calculate_total_attack(self):
self.total_attack = self.unit_1_attack * self.unit_1_amount + \
self.unit_2_attack * self.unit_2_amount + \
self.unit_3_attack * self.unit_3_amount + \
Here's how I'm approaching it (feel free to comment/critique):
Advantages:
1. Everything is in one big table
2. No need to use ReferenceProperty, no MANY-TO-MANY relationships
3. Updates are very easily done : Just get the user entity by keyname
4. It's easy to transfer the queried entity to the templating engine.
Disadvantages:
1. If I have 100 different units with different capabilities (attack, defense, dexterity, magic, etc.), then I'll have a very HUGE table.
2. If I have to change the value of a certain unit's attack, then I'm going to have to go through all 500,000 user entities to change every one of them. (Maybe a cron job/task queue will help.)
Each entity will have a size of 5-10 KB (by the way, how do I check how large an entity is once I've uploaded it to the production server?).
So I'm counting on the fact that disk space at App Engine is cheap, and I need to minimize the amount of datastore API calls. And I'll try to memcache the entity for a period of time.
In essence, everything here goes against RDBMS principles.
Would love to hear your thoughts/ideas/experiences.
First a simple answer to "how do I know how big an entity is?": Once you've got some data in your app on the app engine servers, you can go to your app's console and click the 'Datastore statistics' link. That will give you some basic stats on your entities, like how much space each Kind is using, what property types are using the most disk space, etc. I don't think you can drill down to the level of one particular User however.
Now here are some thoughts on your design. It is worth it to create a separate table for your Units. Even if you end up with a few hundred units, it will be easy to keep them all in memcache, so looking up the details of each unit will be negligible. It will cost you a few extra API calls to initially populate memcache with a unit's info the first time it is used, but after that you will save a good amount of CPU cycles by not having to fetch the details of each unit from the database, and save huge amounts of API calls when you need to update a unit (which you have already realized would be very expensive). In addition, each User object will use less disk space if it only needs a reference to a Unit entity rather than holding all the details itself. (Of course this depends on the amount of info you need to store about each unit, but you did mention that eventually you will be storing lots of stats for each unit.)
If you do have a separate table for Units, it will also allow you to keep your User object more flexible. Instead of needing a specific field for each unit, you could just have a list of references to units. That way, if you add a unit type, you would not have to modify your User kind.
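A hedged sketch of that caching pattern (get_unit is a hypothetical helper):

from google.appengine.api import memcache

def get_unit(key_name):
    # check memcache first; fall back to the datastore and cache the result
    unit = memcache.get('unit:' + key_name)
    if unit is None:
        unit = Unit.get_by_key_name(key_name)
        memcache.set('unit:' + key_name, unit)
    return unit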
You should create independent models for your units. "While a single entity or entity group has a limit on how quickly it can be updated, App Engine excels at handling many parallel requests distributed across distinct entities, and we can take advantage of this by using sharding." Have a look at this article. It may be useful.
Based on Peter's thoughts, I came up with the following revised User model. What do you people think?
class Unit(db.Model):
    name = db.StringProperty()
    attack = db.IntegerProperty()

# initialize 4 different types of units
Unit(key_name="infantry", name="Infantry", attack=10).put()
Unit(key_name="rocketmen", name="Rocketmen", attack=20).put()
Unit(key_name="grenadiers", name="Grenadiers", attack=30).put()
Unit(key_name="engineers", name="Engineers", attack=40).put()

class User(db.Model):
    username = db.StringProperty()
    # e.g. [10, 50, 100, 200] -> 10 infantry, 50 rocketmen,
    # 100 grenadiers and 200 engineers
    unit_list_count = db.ListProperty(item_type=int)
    # the key names of each unit type:
    # ["infantry", "rocketmen", "grenadiers", "engineers"]
    unit_list_type = db.StringListProperty()

# total attack is not calculated inside the model. Instead, a controller
# (a .py file) reads unit_list_count and unit_list_type from a user entity
# and does simple multiplications and additions to get the total attack.
and yes, all the unit_types will be memcached so they can be retrieved for the fast calculation of total attack points.
Would like to hear everyone's thoughts on this.
