So I've read all the RDBMS vs BigTable debates.
I tried to model a simple game class using BigTable concepts.
Goals: Provide very fast reads and reasonably easy writes.
Scenario: I have 500,000 user entities in my User model. Each user sees his/her statistics at the top of the game page (think of a status bar like in Mafia Wars), so everywhere he/she goes in the game, the stats get refreshed.
Since it gets called so frequently, why don't I model my User around that fact?
Code:
from google.appengine.ext import db

# simple User class for a game
class User(db.Model):
    username = db.StringProperty()
    total_attack = db.IntegerProperty()
    unit_1_amount = db.IntegerProperty()
    unit_1_attack = db.IntegerProperty(default=10)
    unit_2_amount = db.IntegerProperty()
    unit_2_attack = db.IntegerProperty(default=20)
    unit_3_amount = db.IntegerProperty()
    unit_3_attack = db.IntegerProperty(default=50)

    def calculate_total_attack(self):
        # Recompute the stored total from each unit's amount and attack value.
        self.total_attack = (self.unit_1_attack * self.unit_1_amount +
                             self.unit_2_attack * self.unit_2_amount +
                             self.unit_3_attack * self.unit_3_amount)
Here's how I'm approaching it (feel free to comment/critique).
Advantages:
1. Everything is in one big table
2. No need to use ReferenceProperty, no MANY-TO-MANY relationships
3. Updates are very easily done : Just get the user entity by keyname
4. It's easy to transfer queried entity to the templating engine.
Disadvantages:
1. If I have 100 different units with different capabilities (attack, defense, dexterity, magic, etc.), then I'll have a very HUGE table.
2. If I have to change a value of a certain attack unit, then I'm going to have to go through all 500,000 user entities to change every one of them (maybe a cron job/task queue will help).
Each entity will have a size of 5-10 KB (btw, how do I check how large an entity is once I've uploaded it to the production server?).
So I'm counting on the fact that disk space at App Engine is cheap, and I need to minimize the amount of datastore API calls. And I'll try to memcache the entity for a period of time.
In essence, everything here goes against RDBMS practice.
Would love to hear your thoughts/ideas/experiences.
First a simple answer to "how do I know how big an entity is?": Once you've got some data in your app on the app engine servers, you can go to your app's console and click the 'Datastore statistics' link. That will give you some basic stats on your entities, like how much space each Kind is using, what property types are using the most disk space, etc. I don't think you can drill down to the level of one particular User however.
Now here are some thoughts on your design. It is worth creating a separate table for your Units. Even if you end up with a few hundred units, it will be easy to keep them all in memcache, so the cost of looking up the details of each unit will be negligible. It will cost you a few extra API calls to populate memcache with a unit's info the first time it is used, but after that you will save a good amount of CPU by not having to fetch the details of each unit from the database, and you will save a huge number of API calls when you need to update a unit (which you have already realized will be very expensive). In addition, each User object will use less disk space if it only needs a reference to a Unit entity rather than holding all the details itself. (Of course this depends on the amount of info you need to store about each unit, but you did mention that eventually you will be storing lots of stats for each unit.)
If you do have a separate table for Units, it will also allow you to keep your User object more flexible. Instead of needing a specific field for each unit, you could just have a list of references to units. That way, if you add a unit type, you would not have to modify your User kind.
You should create independent models for your units. "While a single entity or entity group has a limit on how quickly it can be updated, App Engine excels at handling many parallel requests distributed across distinct entities, and we can take advantage of this by using sharding." Have a look at this article. It may be useful.
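To make the sharding idea in that quote a bit more concrete, here is a minimal plain-Python sketch of a sharded counter. It is only an illustration: a dict stands in for the datastore, and the names `ShardedCounter`, `num_shards`, `increment` and `total` are made up for this example. A real App Engine version would store each shard as its own entity and update it inside a transaction.

```python
import random


class ShardedCounter(object):
    """Spreads increments over several shards so no single record is hot."""

    def __init__(self, name, num_shards=20, rng=None):
        self.name = name
        self.num_shards = num_shards
        self._rng = rng or random.Random()
        # A dict stands in for the datastore: one entry per shard "entity".
        self._shards = {}

    def _shard_key(self, index):
        return "%s-shard-%d" % (self.name, index)

    def increment(self, amount=1):
        # Pick a shard at random so concurrent writers rarely collide.
        key = self._shard_key(self._rng.randrange(self.num_shards))
        self._shards[key] = self._shards.get(key, 0) + amount

    def total(self):
        # Reading sums every shard; the result is a good candidate for caching.
        return sum(self._shards.values())


counter = ShardedCounter("total_attack_updates", num_shards=5)
for _ in range(100):
    counter.increment()
print(counter.total())
```

The trade-off is that writes get cheap and parallel while reads have to touch every shard, which is why the summed value is usually cached.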
Based on Peter's thoughts, I came up with the following revised User model. What do you people think?
class Unit(db.Model):
    name = db.StringProperty()
    attack = db.IntegerProperty()

# initialize 4 different types of units
Unit(key_name="infantry", name="Infantry", attack=10).put()
Unit(key_name="rocketmen", name="Rocketmen", attack=20).put()
Unit(key_name="grenadiers", name="Grenadiers", attack=30).put()
Unit(key_name="engineers", name="Engineers", attack=40).put()

class User(db.Model):
    username = db.StringProperty()
    # eg: [10,50,100,200] -> this represents 10 infantry, 50 rocketmen, 100 grenadiers and 200 engineers
    unit_list_count = db.ListProperty(item_type=int)
    # this holds the list of key names of each unit type: ["infantry","rocketmen","grenadiers","engineers"]
    unit_list_type = db.StringListProperty()
    # total attack is not calculated inside the model. Instead, I will use a
    # controller file ( a py file ) to call the contents of unit_list_count and
    # unit_list_type of a certain user entity, and make simple multiplications and additions to get total attack
And yes, all the unit types will be memcached so they can be retrieved quickly for the calculation of total attack points.
Would like to hear everyone's thoughts on this.
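For what it's worth, the controller-side calculation described in the comments above could look roughly like the sketch below. It is plain Python with a dict standing in for the memcached Unit entities; the names `UNIT_ATTACK` and `calculate_total_attack` are illustrative and not part of the model above.

```python
# Stand-in for the memcached Unit entities: key name -> attack value.
UNIT_ATTACK = {
    "infantry": 10,
    "rocketmen": 20,
    "grenadiers": 30,
    "engineers": 40,
}


def calculate_total_attack(unit_types, unit_counts, unit_attack=UNIT_ATTACK):
    """Sum count * attack over the two parallel lists stored on a user."""
    if len(unit_types) != len(unit_counts):
        raise ValueError("unit_list_type and unit_list_count must line up")
    return sum(unit_attack[name] * count
               for name, count in zip(unit_types, unit_counts))


total = calculate_total_attack(
    ["infantry", "rocketmen", "grenadiers", "engineers"],
    [10, 50, 100, 200],
)
print(total)  # 10*10 + 50*20 + 100*30 + 200*40 = 12100
```

The length check matters because the two lists are only linked by position; if they ever drift apart, the total would silently be wrong.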
Related
I use GAE NDB Python
Approach 1:
# both models below have similar properties (same number and type)
class X1(ndb.Model):
    p1 = ndb.StringProperty()
    # :: (further properties elided)

class X2(ndb.Model):
    p1 = ndb.StringProperty()
    # :: (further properties elided)

# request handler method (enclosing handler class not shown)
def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    record_list = []
    if q == "a":
        qry = X1.query(X1.p1 == w)
        record_list = qry.fetch()
    elif q == "b":
        qry = X2.query(X2.p1 == w)
        record_list = qry.fetch()
Approach 2:
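class X1(ndb.Model):
    p1 = ndb.StringProperty()
    # :: (further properties elided)

# request handler method (enclosing handler class not shown)
def get(self):
    q = self.request.get("q")
    w = self.request.get("w")
    if q == "a":
        k = ndb.Key("type_1", "k1")
    elif q == "b":
        k = ndb.Key("type_2", "k1")
    # filter expressions are positional, so they must come before ancestor=
    qry = X1.query(X1.p1 == w, ancestor=k)
    record_list = qry.fetch()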
My Questions:
Which approach is better in terms of query performance when I scale up the entities?
Would there be a significant impact on query performance if I scale up the ancestors (horizontally, at the same hierarchy level) to 10,000 or 100,000 in approach 2?
Is this application the correct use case for ancestors?
Context:
This project is for understanding GAE better, and the goal is to create an ecommerce website like amazon.com where I need to query on many (about 10) filter conditions (price range, brand, screen size, and so on). Each filter condition has a few ranges (for example, there could be five price bands), and multiple ranges of one filter condition can be selected simultaneously. Multiple filter conditions can also be selected, just like in the left pane on amazon.com.
If I put all the filter conditions in the query as a single expression connected with AND and OR, it would take a huge amount of time on large data sets, even if I use a query cursor and fetch page by page.
To overcome this, I thought I would store the data in entities whose parent is a string. The parent would be a concatenation of the different filter options that the product matches. There would be a lot of redundancy, as I would store the same data in several entities, one for each combination of filter values that the product satisfies. The disadvantage of this approach is that each product's data is stored multiple times in different entities (much more storage), but I was hoping to get much better query performance (<2 seconds), since my query would then contain only one or two AND- or OR-connected elements apart from the ancestor. The ancestor would be the concatenation of the filter conditions that the user has selected to search for a product.
Please let me know if I am not clear. This is just an experimental approach that I am trying. Another approach would be to cache the results periodically through a cron job.
Any other suggestions for achieving good query performance for such a website would be highly appreciated.
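To get a feel for how much redundancy the concatenated-parent idea implies, here is a small plain-Python sketch. It assumes one particular reading of the scheme (one stored copy per non-empty combination of filters, with each product matching exactly one option per filter); the function name `ancestor_names` and the `filter=option` string format are made up for illustration.

```python
from itertools import combinations


def ancestor_names(matches):
    """Return every concatenated filter combination a product would be stored under.

    `matches` maps a filter name to the single option the product falls into.
    """
    names = []
    filters = sorted(matches)  # fixed order so equal selections give equal strings
    for size in range(1, len(filters) + 1):
        for subset in combinations(filters, size):
            names.append("|".join("%s=%s" % (f, matches[f]) for f in subset))
    return names


product = {"brand": "acme", "price": "100-200", "screen": "15in"}
names = ancestor_names(product)
print(len(names))   # 2**3 - 1 = 7 copies for three filters
print(names[-1])    # brand=acme|price=100-200|screen=15in
```

Under these assumptions, ten filters would mean 2**10 - 1 = 1023 copies of every product, and that is before accounting for several ranges of one filter being selected at once.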
UPDATE(NEW STRATEGY):
I have decided to go with a model that has boolean properties (flags) for each range of each category (about 14 such properties per entity). For one category, which has two possible values, I have three models (one holding all entities with either of the two values, and the other two holding the entities with each value), so there is duplication (the same data can be stored twice, in two entities).
Also, my complete product data model is a separate one. The above model contains a key to this complete model.
I could not do away with the Query class and write my own filtering (I actually did that with good success initially). The reason is that I need to fetch results page by page (~15 results), and I need to sort them too. If I fetch all results and apply my own filtering, then with a large data set fetching all results takes a huge amount of time because of the size of the result set.
The initial development server results look good: query execution time is <3 seconds for ~6000 matched entities (though I wished for ~1 second). I need to scale up the production datastore to test there.
EDIT after context definition:
Tough subject there. You have plenty of datastore limitations that can get in your way:
Write throughput (1 write/sec per entity group)
Query inequality filter limits (inequality filters on at most one property per query)
Cross-entity-group transactions at write time (duplicating your product in each "query filter" specific entity group)
Max entity size (1 MB) if you duplicate whole products for every "query filter" entity
I don't have any "ready made" answer, just some humble advice based on common sense.
In my opinion your first solution will get overly complex as you add new filtering criteria, types of products, etc.
The problem with the datastore, and most "NoSQL" solutions, is that they tend to have few analytic/query features out of the box (they are not at the maturity level of RDBMS that have evolved for years), forcing you to compute results "by hand".
For your case, I don't see anything out of the box, and the "datastore query engine" is clearly not enough for such queries.
Keep your data quite simple though, just store your products as entities with properties.
If you have clearly different product categories, you may store them as different entity kinds -> I highly doubt people will run a "brand" query for both "shoes" and "food".
You will have to run a datastore query within the limitations to quickly get a rough result set, and refine it by hand (map reduce job, async task..) ... and then cache the result for as long as you can.
-> your aggressive cache solution looks far better from a performance, cost and maintainability standpoint.
You won't be able to cache your whole product base, and some queries for rarities will take longer... like I said, I don't see any perfect answers here, just different tradeoffs for performance.
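As a rough illustration of "refine by hand, then cache", here is a plain-Python sketch. It is not datastore code: a list of dicts stands in for the rough result set, a dict stands in for memcache, and the names `matches`, `cache_key` and `search` are invented for this example.

```python
def matches(product, selection):
    """AND across filters, OR across the options selected within one filter."""
    return all(product.get(field) in allowed
               for field, allowed in selection.items() if allowed)


def cache_key(selection, sort_field, page):
    # Normalise the selection so equivalent requests share one cache entry.
    parts = ["%s=%s" % (f, ",".join(sorted(selection[f])))
             for f in sorted(selection)]
    return "search:%s:%s:%d" % ("&".join(parts), sort_field, page)


def search(products, selection, sort_field, page, page_size, cache):
    key = cache_key(selection, sort_field, page)
    if key in cache:
        return cache[key]
    hits = sorted((p for p in products if matches(p, selection)),
                  key=lambda p: p[sort_field])
    result = hits[page * page_size:(page + 1) * page_size]
    cache[key] = result
    return result


products = [
    {"name": "A", "brand": "acme", "price_band": "low", "price": 90},
    {"name": "B", "brand": "acme", "price_band": "mid", "price": 150},
    {"name": "C", "brand": "zeta", "price_band": "low", "price": 80},
    {"name": "D", "brand": "zeta", "price_band": "high", "price": 400},
]
selection = {"brand": {"acme", "zeta"}, "price_band": {"low", "mid"}}
cache = {}
first_page = search(products, selection, "price", 0, 2, cache)
print([p["name"] for p in first_page])  # ['C', 'A']
```

Popular filter combinations would be served from the cache almost every time, while rare ones pay the full filtering cost on first use.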
Just my 2 cents :) I'll be curious to see what solution you end up adopting.
You typically use ancestors for data that is owned by an entity.
For example :
A Book is your root entity, and it "owns" Page entities.
A Page without a Book is meaningless.
Book is the ancestor of Page.
A User is your root entity, and it "owns" BlogPost entities.
A BlogPost without its writer is quite meaningless.
User is the ancestor of BlogPost.
If your two entities X1 and X2 share the same attributes, I'd say they are the same X entity, with just an additional "type" attribute to determine whether you're talking about X type 1 or X type 2.
I'm building an app with users and their activities. Now I'm thinking of the best way of setting up the datastore models. Which one is fastest/preferred, and why?
A
class User(db.Model):
    activities = db.ListProperty(db.Key)
    ...

class Activity(db.Model):
    ...

activities = db.get(user.activities)
or
B
class User(db.Model):
    ...

class Activity(db.Model):
    owner = db.ReferenceProperty(reference_class=User)
    ...

# filter() is a Query method, so start from Activity.all()
activities = Activity.all().filter('owner =', user)
If a given activity can only have a single owner, definitely use a ReferenceProperty.
It's what ReferenceProperty is designed for.
It automatically sets up back-references for you, which can be handy since it gives you a bi-directional link (unlike the ListProperty, which is a uni-directional link).
It enforces that the thing being linked to is the proper type/class.
It enforces that only a single user is linked to a given activity.
It lets you automatically fetch the linked objects without having to write an explicit query, if you so desire.
I'm guessing the difference is going to be marginal and will likely depend more on your application than some concrete difference in read/write times based on your models.
I would say use the first option if you're going to use info from every activity a user has done each time you fetch a user. In other words, if almost everything a user does on your application coincides with a large subset of their activities, then it makes sense to always have the activities available.
Use option B if you don't need the activities all of the time. This will result in a separate request on the data store whenever you need to use the activity, but it will also make the requests smaller. Making an extra request likely adds more overhead than making bigger requests.
All of that being said, I would be surprised if you had a noticeable difference between these two approaches. The area where you're going to get much more noticeable performance improvements is by using memcache.
I don't know about the performance difference, I suspect it'll be similar. When it comes to perf, things are hard to control with the GAE datastore. If all your queries happen to hit the same tablet (bigtable server), that could limit your perf more than the query itself.
The big difference is that A would be cheaper than B. Since you have a list of activities you want, you don't need to write an index for every activity object you write. If activities are written a lot, your savings add up.
Since you have the activity keys, you can also do a strongly consistent get() rather than an eventually consistent query with filter().
On the flip side, you won't be able to do backwards references, like look up an owner given an activity. Your ListProperty can also cause you to hit your maximum entity size - there will eventually be a hard limit on the number of activities per user. If you went with B, you can have a huge number of activities per user.
Edit: I forgot, you can have backwards reference if you index your ListProperty, but then that way, writing your User object would get expensive, and the limit on the number of indexed properties would limit the size of your list. So even though it's possible, B is still preferable if you need backwards references.
A will be a good deal faster because it is working purely with keys. Looking up objects with just keys goes straight to the data node in BigTable, whereas B requires a lookup on the indices first which is slower (and costs will go up with the number of Activity entities).
If you never need to test for ownership, you can modify A to not index the key list. This is definitely the cheapest and most efficient route. However, as I understand it, if you later need to index them, App Engine cannot retroactively update indices on the key list. So only disable the index if you're certain you'll never need it.
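The cost argument in the answers above (A reads straight by key, B needs an index row per Activity) can be pictured with a toy model. This is plain Python, not the datastore: `TinyStore` and its methods are invented for illustration, and real index costs depend on which properties are indexed.

```python
class TinyStore(object):
    """Toy key-value store with one optional secondary index on 'owner'."""

    def __init__(self):
        self.entities = {}      # key -> entity dict
        self.owner_index = {}   # owner key -> list of activity keys
        self.index_writes = 0

    def put(self, key, entity, index_owner=False):
        self.entities[key] = entity
        if index_owner:
            # Option B pays for an index row on every activity write.
            self.owner_index.setdefault(entity["owner"], []).append(key)
            self.index_writes += 1

    def get_multi(self, keys):
        # Option A: direct lookups by key, no index involved.
        return [self.entities[k] for k in keys]

    def query_by_owner(self, owner):
        # Option B: consult the index first, then load the entities.
        return [self.entities[k] for k in self.owner_index.get(owner, [])]


# Option A: the user carries the list of activity keys.
store_a = TinyStore()
user_a = {"activities": []}
for i in range(3):
    key = "activity-%d" % i
    store_a.put(key, {"title": "run %d" % i})
    user_a["activities"].append(key)

# Option B: each activity points back at its owner and is indexed on it.
store_b = TinyStore()
for i in range(3):
    store_b.put("activity-%d" % i,
                {"title": "run %d" % i, "owner": "user-1"},
                index_owner=True)

print(store_a.index_writes, store_b.index_writes)  # 0 3
```

Both options return the same three activities; the difference is where the bookkeeping lives, on the user's key list in A or in the per-activity index in B.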
How about C: setting Activity's parent to the user's key? Then you can fetch a user's activities with Activity.query(ancestor=user.key).
That way you don't need additional keys/properties, and it is a good way to group your entities on the HR datastore.
Google App Engine has a handy feature called back-references, and I usually iterate over them where a computed column would be used in traditional SQL.
Suppose I need to accumulate the total hp of a specific force.
class Force(db.Model):
    hp = db.IntegerProperty()

class UnitGroup(db.Model):
    force = db.ReferenceProperty(reference_class=Force, collection_name="groups")
    hp = db.IntegerProperty()

class Unit(db.Model):
    group = db.ReferenceProperty(reference_class=UnitGroup, collection_name="units")
    hp = db.IntegerProperty()
When I coded it as follows, it was horribly slow (almost 3 s) with 20 forces, each with a single group containing a single unit. (I guess each back-reference on the force reloads the sub-entities. Am I right?)
# method on Force
def get_hp(self):
    hp = 0
    for group in self.groups:
        group_hp = 0
        for unit in group.units:
            group_hp += unit.hp
        hp += group_hp
    return hp
How can I optimize this code? Please bear in mind that more properties need to be computed for each force/unit group, and I don't want to save these collective properties on each entity. :)
It appears that Force, UnitGroup and Unit would fit well into an Entity Group. Is it true that each Unit belongs to one and only one UnitGroup and that each UnitGroup belongs to one and only one Force? If that is true, then you can store these entities as an entity group and use an ancestor query to reduce the number of datastore queries.
Putting an entity in an entity group is accomplished by setting its parent when it is created. Once that is done, you can get all of the Units that belong to a Force with a single ancestor query.
That being said, the right thing to do is to compute the values when you write the entities to the database, not when you read them. This simple fact has been noted over and over again here and elsewhere, and it is the best way to get good performance out of your AppEngine application. It might seem like a lot of work up front, and it is very counter-intuitive for anyone who has a background with traditional SQL databases, but it is what you want to do, plain and simple.
Storage is cheap on App Engine, and CPU and request times less so. The predominant thing you should remember about the datastore is that you should optimize for reads, not writes -- you will end up reading your data far more often than you write it.
If you think you may ever need to know a Force's total hp, you should store it in a field called total_hp, and update it with the new value each time you update/add/remove its UnitGroups and Units. You should probably do this in a transaction, too, which means they'll need to be in the same Entity Group.
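A minimal plain-Python sketch of the "compute on write" idea follows. The `Force` class here is a stand-in, not the db.Model from the question, and `put_unit` and `remove_unit` are hypothetical helper names; in the datastore, each of these updates would run inside a transaction on the entity group.

```python
class Force(object):
    """Keeps total_hp up to date on every write so reads need no loops."""

    def __init__(self):
        self.total_hp = 0
        self.units = {}  # unit id -> hp

    def put_unit(self, unit_id, hp):
        # Adjust the stored total by the difference between old and new hp.
        old_hp = self.units.get(unit_id, 0)
        self.units[unit_id] = hp
        self.total_hp += hp - old_hp

    def remove_unit(self, unit_id):
        # Removing an unknown unit leaves the total unchanged.
        self.total_hp -= self.units.pop(unit_id, 0)


force = Force()
force.put_unit("u1", 30)
force.put_unit("u2", 50)
force.put_unit("u1", 10)   # u1 took damage
force.remove_unit("u2")
print(force.total_hp)  # 10
```

Reading the total is then a single property access, however many groups and units the force has.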
So I have a User class:
class User(db.Model):
    points = db.IntegerProperty()
I created 1000 dummy entities on the development server with points ranging from 1 to 1000.
# note the trailing spaces between the concatenated pieces, and that GQL expects ORDER BY before LIMIT
query = db.GqlQuery("SELECT * FROM User WHERE points >= 300 "
                    "AND points <= 700 "
                    "ORDER BY points DESC "
                    "LIMIT 20")
I only want 20 results per query (enough to fill a page). I don't need any pagination of the results.
Everything looks OK; it worked on the development server.
Question:
1. Will it work on a production server with 100,000 - 500,000 user entities? Will I experience great lag? I hope not, because I heard that App Engine indexes the points column automatically.
2. Are there any other optimization techniques that you can recommend?
I think that it is difficult to say what kind of performance issues you will have with such a large number of entities. This one particular query will probably be fine, but you should be aware that no datastore query can ever return more than 1000 entities, so if you need to operate on numbers larger than 1000, you will need to do it in batches, and you may want to partition them into separate entity groups.
As far as optimization goes, you may want to consider caching the results of this query and only running it when you know the information has changed or at specific intervals. If the query is for some purpose where exactly correct results are not totally critical -- say, displaying a leader board or a high score list -- you might choose to update and cache the result once every hour or something like that.
The only other optimization that I can think of is that you can save the cycles associated with parsing that GQL statement by doing it once and saving the resulting object, either in memcache or a global variable.
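The "cache it and refresh at intervals" suggestion can be sketched like this. It is a plain-Python illustration, not the memcache API: `TimedCache`, `run_query` and the injectable `clock` argument are made up so the example can be run and tested without waiting an hour.

```python
import time


class TimedCache(object):
    """Caches one computed value and recomputes it after max_age seconds."""

    def __init__(self, compute, max_age, clock=time.time):
        self._compute = compute
        self._max_age = max_age
        self._clock = clock
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._max_age:
            self._value = self._compute()
            self._stamp = now
        return self._value


calls = []


def run_query():
    # Stand-in for the expensive datastore query.
    calls.append(1)
    return ["top", "users"]


fake_now = [0]
leaderboard = TimedCache(run_query, max_age=3600, clock=lambda: fake_now[0])
leaderboard.get()      # first call runs the query
leaderboard.get()      # served from the cache
fake_now[0] = 3600     # an hour later
leaderboard.get()      # stale, so the query runs again
print(len(calls))  # 2
```

With memcache itself you would get similar behaviour by setting an expiry time when storing the value and rerunning the query on a miss.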
Your code seems fine to get the top users, but more complex queries, like finding out what's the rank of any specific user will be hard. If you need this kind of functionality too, have a look at google-app-engine-ranklist.
Ranklist is a python library for Google App Engine that implements a data structure for storing integer scores and quickly retrieving their relative ranks.
The objective is to reduce the CPU cost and response time for a piece of code that runs very often and must db.get() several hundred keys each time.
Does this even work?
Can I expect the API time of a db.get() with several hundred keys to reduce roughly linearly as I reduce the size of the entity?
Currently the entity has the following data attached: 9 String, 9 Boolean, 8 Integer, 1 GeoPt, 2 DateTime, 1 Text (avg size ~100 bytes FWIW), 1 Reference, 1 StringList (avg size 500 bytes). The goal is to move the vast majority of this data to related classes so that the core fetch of the main model will be quick.
If it does work, how is it implemented?
After a refactor, will I still incur the same high cost fetching existing entities? The documentation says that all properties of a model are fetched simultaneously. Will the old unneeded properties still transfer over RPC on my dime and while users wait? In other words: if I want to reduce the load time of my entities, is it necessary to migrate the old entities to ones with the new definition? If so, is it sufficient to re-put() the entity, or must I save under a wholly new key?
Example
Consider:
class Thing(db.Model):
    text = db.TextProperty()
    strings = db.StringListProperty()
    num = db.IntegerProperty()

thing = Thing(key_name='thing1', text='x' * 10240,
              strings=['y' * 500 for i in range(10)], num=23)
thing.put()
Let's say I re-define Thing to be streamlined and push up a new version:
class Thing(db.Model):
    num = db.IntegerProperty()
And I fetch it again:
thing_again = Thing.get_by_key_name('thing1')
Have I reduced the fetch time for this entity?
To answer your questions in order:
Yes, splitting up your model will reduce the fetch time, though probably not linearly. For a relatively small model like yours, the differences may not be huge. Large list properties are the leading cause of increased fetch time.
Old properties will still be transferred when you fetch an entity after the change to the model, because the datastore has no knowledge of models.
However, deleted properties will still be stored even after you call .put(). Currently, there are two ways to eliminate the old properties: replace all the existing entities with new ones, or use the lower-level api.datastore interface, which is dict-like and makes it easy to delete keys.
To remove properties from an entity, you can change your Model to an Expando, and then use delattr. It's documented in the App Engine docs here:
http://code.google.com/intl/fr/appengine/articles/update_schema.html
Under the heading "Removing Deleted Properties from the Datastore"
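To picture what "removing deleted properties" amounts to, here is a plain-Python sketch in which an ordinary dict stands in for a dict-like low-level entity. The helper name `strip_stale_properties` is made up; with a real entity you would also have to write it back to the datastore afterwards.

```python
def strip_stale_properties(entity, current_properties):
    """Delete every stored property that the current model no longer defines."""
    stale = [name for name in entity if name not in current_properties]
    for name in stale:
        del entity[name]
    return stale


# What the datastore still holds for 'thing1' after the model was slimmed down.
stored = {"text": "x" * 10240, "strings": ["y" * 500] * 10, "num": 23}
removed = strip_stale_properties(stored, current_properties={"num"})
print(sorted(removed))  # ['strings', 'text']
print(stored)           # {'num': 23}
```

Only once the stale properties are gone from the stored entity does a fetch stop paying for them.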
if I want to reduce the size of my entities, is it necessary to migrate the old entities to ones with the new definition?
Yes. The GAE data store is just a big key-value store, that doesn't know anything about your model definitions. So the old values will be the old values until you put new values in!