Say I have a blog app with blog posts and comments. Lets say for the sake of argument that there can be a very large number of comments, big enough that a simple comments = StringProperty(repeated=True) would be insufficient.
Should I store the comments as a JSONProperty (serialized from python list):
class BlogPost(ndb.Model):
title = ndb.StringProperty()
description = ndb.TextProperty()
comments = ndb.JSONProperty()
Or should I create a separate Comment model altogether and store the corresponding blogpost's ID as a property:
class Comment(ndb.Model):
text = ndb.TextProperty()
blog_id = ndb.IntegerProperty()
created = ndb.DateTimeProperty(auto_now_add=True)
And I can query for all the comments of a specific blogpost as follows: query = Comment.query(Comment.blog_id==blog_id).order(-Comment.created)?
Is one approach preferable? Especially if comments could get very large > 1000.
You definitely want a separate model for comments.
One reason is that entities are limited to 1MB in size. If one post gets a huge number of comments, then you are in danger of exceeding the limit and your code would crash.
Another reason is that you want to consider read/write rates for entities and scalability. If you use JSON, then you need update the BlogPost entity every time a comment is made. If a lot of people are writing comments at the same time, then you will need transactions and have contention issues. If you have a separate model for comments, then you can easily scale to a million comments per second!
Related
Datastore charges by number of entities read.
If an entity is read from memcache, does it count as an entity read in datastore pricing and got billed?
If an entity is read multiple times in the same batch, does it count as one read or multiple reads?
For example, the classic Post and Tag problem, I want to look up names of tags for a list of posts,
class Post(ndb.Model):
title = ndb.StringProperty()
tag_ids = ndb.KeyProperty(repeated=True)
class Tag(ndb.Model):
name = ndb.StringProperty()
#ndb.tasklet
def callback(post):
tags = yield Tag.get_multi(tag_id for tag_id in post.tag_ids)
raise ndb.Return(tags)
qry = Post.query()
output = qry.map(callback, limit=20)
post01 has tag01, tag02, and post02 has tag02, tag03. In this case, tag02 was queried twice in the same batch, does tag02 count as 2 read or 1 read?
Is there any profile library to get number of read counted for billing, so I can figure out above questions myself?
Thanks in advance.
You will not get billed for reading entities from the memcache but you have take
care on your own for putting the entities in the cache and reading from the cache.
I think this can become quite challenging when you are using a complex data model.
I am running a small application with just a few hundreds entities and when the first
user reads the data I put all the entities in the cache and all other users will get the
data from the cache.
For my other applications I am using Objectify (I am using Java) which supports putting the entities
in the Memcache (https://github.com/objectify/objectify). Maybe something similar for Python exists.
I use GAE NDB Python
Approach 1:
# both models below have similar properties (same number and type)
class X1(ndb.Model):
p1 = ndb.StringProperty()
::
class X2(ndb.Model):
p1 = ndb.StringProperty()
::
def get(self):
q = self.request.get("q")
w = self.request.get("w")
record_list = []
if (q=="a"):
qry = X1.query(X1.p1==w)
record_list = qry.fetch()
elif (q=="b"):
qry = X2.query(X2.p1==w)
record_list = qry.fetch()
Approach 2:
class X1(ndb.Model):
p1 = ndb.StringProperty()
::
def get(self):
q = self.request.get("q")
w = self.request.get("w")
if (q=="a"):
k = ndb.Key("type_1", "k1")
elif (q=="b"):
k = ndb.Key("type_2", "k1")
qry = X1.query(ancestor=k, X1.p1==w)
record_list = qry.fetch()
My Questions:
Which approach is better in terms of query performance when I scale up the entities
Would there be significant impact on query performance if I scale up the ancestors (in the same hierarchy level horizontally) to 10,000 or 1,00,000 in approach 2
Is this application the correct use case for ancestor
Context:
This project is for understanding GAE better and the goal is to create an ecommerce website like amazon.com where I need to query based on a lot many(10) filter conditions(like, price range, brand, screen size, and so on). Each filter condition has few ranges(like, there could be five price bands); multiple ranges of a filter condition could be selected simultaneously. Multiple filter conditions could be selected just like on amazon.com left pane.
If I put all the filter conditions in the query in the form of AND, OR connected expression, it would take huge amount of time for scaled data sets even if I use query cursor and fetch by page.
To overcome this, I thought I would store the data in entities with parent as a string. The parent would be a cancatenation of the the different filters options which the product matches. There would be a lot of redundancy as I would store the same data in several entities for all the combinations of filter values which it satisfies. The disadvantage of this approach is that each product data is being stored multiple times in different entities(much more storage); but I was hoping to get a much better query performance(<2 seconds) since now my query string would contain only one or two AND or OR connected elements apart from ancestor. The ancestor would be the concatenation of the filter conditions which the user has selected to search for a product
Please let me know if I am not clear.. This is just an experimental approach that I am trying.. Another approach would have been to cache the results through a cron job periodically..
Any other suggestion to achieve a good query performance for such a website would be highly appreciated..
UPDATE(NEW STRATEGY):
i have decided to go with a model with some boolean properties(flags) for each range of each category(total such property per entity is ~14).. for one category, which had two possible values, I have three models(one having all entities of with either of the two values, and the other two for entites with each value).. so there is duplication(same data could be store twice in two entities)..
also my complete product data model is a separate one.. the above model contains a key to this complete model..
i could not do away with Query class and write my own filtering(i actually did that with good success initially).. the reason is that i need to fetch results page by page(~15 results).. and i need to sort them too.. if i fetch all results and apply my own filtering, with large data set the fetching of all results takes a huge amount of time because of the large size of the results returned..
the initial development server results look good.. query execution time is <3 seconds for ~6000 matched entities.. (though i wished it to be ~1 second).. need to scale up the production datastore to test there..
EDIT after context definition:
Tough subject there. You have plenty of datastore limitations that can get in your way :
Write throughput (1 write/sec per Entity Group)
Query inequality filters limit
Cross entity group transactions at write time (duplicating your product in each
"query filter" specific entity group )
Max entity size (1MB) if you duplicate whole products for every "query filter" entity
I don't have any "ready made" answer, just some humble advice based on common sense.
In my opinion your first solution will get overly complex as you add new filtering criterias, type of products, etc.
The problem with the datastore, and most "NoSQL" solutions, is that they tend to have few analytic/query features out of the box (they are not at the maturity level of RDBMS that have evolved for years), forcing you to compute results "by hand".
For your case, I don't see anything out of the box, and the "datastore query engine" is clearly not enough for such queries.
Keep your data quite simple though, just store your products as entities with properties.
If you have clearly different product categories, you may store them as different entity kinds -> I highly doubt people will run a "brand" query for both "shoes" and "food".
You will have to run a datastore query within the limitations to quickly get a gross result set, and refine it by hand (map reduce job, async task..) ... and then cache the result for as long as you can.
-> your aggressive cache solutions looks far better from a performance, cost and maintainability standpoint.
You won't be able to cache your whole product base, and some queries for rarities will take longer... like I said, I don't see any perfect answers here, just different tradeoffs for performance.
Just my 2 cents :) I'll be curious in what solution you end up adopting.
You typically use ancestors for data that is own by an entity.
For example :
A Book is your root entity, and it "owns" Page entities.
A Page without a Book is meaningless.
Book is the ancestor of Page.
A User is your root entity, and it "owns" BlogPost entities.
A BlogPost without its Writter is quite meaningless.
User is the ancestor of BlogPost.
If your two entities X1 and X2 share the same attributes, I'd say they are the same X entity, with just an additonal "type" attribute to determine if your talking about X Type1 or X type2.
Sorry if this question is too simple; I'm only entering 9th grade.
I'm trying to learn about NoSQL database design. I want to design a Google Datastore model that minimizes the number of read/writes.
Here is a toy example for a blog post and comments in a one-to-many relationship. Which is more efficient - storing all of the comments in a StructuredProperty or using a KeyProperty in the Comment model?
Again, the objective is to minimize the number of read/writes to the datastore. You may make the following assumptions:
Comments will not be retrieved independently of their respective blog post. (I suspect that this makes the StructuredProperty most preferable.)
Comments will need to be sortable by date, rating, author, etc. (Subproperties in the datastore cannot be indexed, so perhaps this could affect performance?)
Both blog posts and comments may be edited (or even deleted) after they are created.
Using StructuredProperty:
from google.appengine.ext import ndb
class Comment(ndb.Model):
various properties...
class BlogPost(ndb.Model):
comments = ndb.StructuredProperty(Comment, repeated=True)
various other properties...
Using KeyProperty:
from google.appengine.ext import ndb
class BlogPost(ndb.Model):
various properties...
class Comment(ndb.Model):
blogPost = ndb.KeyProperty(kind=BlogPost)
various other properties...
Feel free to bring up any other considerations that relate to efficiently representing a one-to-many relationship with regards to minimizing the number of read/writes to the datastore.
Thanks.
I could be wrong, but from what I understand, a StructuredProperty is just a property within an entity, but with sub-properties.
This means reading a BlogPost and all its comments would only cost one read. So when you render your page, you only need one read op for your entire page.
Writes would be cheaper each too. You'll need one read op to get the BlogPost, and as long as you don't update any indexed properties, it'll just be one write op.
You can handle the comment sorting on your own after you read the entity out of the datastore.
You'll have to synchronize your comment updates/edits with transactions, to make sure one comment doesn't overwrite another, since they are both modifying the same entity. You may run into unsolveable problems if everyone is commenting and editing the same blog post at the same time.
In optimizing for cost though, you'll hit a wall with the maximum entity size of 1MB. This will limit the number of comments you can store per blog post.
Going with the KeyProperty would be quite a bit more expensive.
You'll need one read to get the blog post, plus 1 query plus 1 small read op for each comment.
Every comment is a new entity, so it'll be at least 4 write ops. You may want to index for sort order, so that'll end up costing even more write ops.
On the plus side, you'll have unlimited comments per blog post, you don't have to worry about synchronizing new comments. You might need to worry about synchronization for editing comments, but if you limit the edit to the creator, that shouldn't really be a problem. You don't have to do sorting yourself either.
It's a cost vs features tradeoff.
What about:
from google.appengine.ext import ndb
class Comment(ndb.Model):
various properties...
class BlogPost(ndb.Model):
comments = ndb.KeyProperty(Comment, repeated=True)
various other properties...
This way, you can store up to 5000 comments per blog post (the maximum number of repeated properties) independent of the size of each blog post. You won't need a query to fetch the blogs for a comment, you can just do ndb.get_multi(blog_post.comments). And for this operation, you can try to rely on ndb's memcache. Of course, it depends on your use case whether this is a good assumption or not.
Be aware of this caveat when using a repeated StructuredProperty:
Do not use repeated properties if you have more than 100-1000 values. (1000 is probably already pushing it.) They weren't designed for such use.
See Guido's answer in GAE ndb design, performance and use of repeated properties.
So while you may not hit the 1 MB entity limit with StructuredProperty, you may easily hit the 100-1000 suggested max.
Let's say I want to display a list of books and their authors. In traditional database design, I would issue a single query to retrieve rows from the Book table as well as the related Author table, a step known as eager fetching. This is done to avoid the dreaded N+1 select problem: If the Author records were retrieved lazily, my program would have to issue a separate query for each author, possibly as many queries as there are books in the list.
Does Google App Engine Datastore provide a similar mechanism, or is the N+1 select problem something that is no longer relevant on this platform?
I think you are implicitly asking if Google App Engine supports JOIN to avoid the N+1 select problem.
Google App Engine does not support JOIN directly but lets you define a one to many relationship using ReferenceProperty.
class Author(db.Model):
name = db.StringProperty()
class Book(db.Model):
title = db.StringProperty()
author= db.ReferenceProperty(Author)
In you specific scenario, with two query calls, the first one to get the author:
author = Author.all.filter('name =' , 'fooauthor').get()
and the second one to find all the books of a given author:
books = Book.all().filter('author=', author).fetch(...)
you can get the same result of a common SQL Query that uses JOIN.
The N+1 problem could for example appear when we want to get 100 books, each with its author name:
books = Book.all().fetch(100)
for book in books:
print book.author.name
In this case, we need to execute 1+100 queries, one to get the books list and 100 to dereference all the authors objects to get the author's name (this step is implicitly done on book.author.name statement).
One common technique to workaround this problem is by using get_value_for_datastore method that retrieves the referenced author's key of a given book without dereferencing it (ie, a datastore fetch):
author_key = Book.author.get_value_for_datastore(book)
There's a brilliant blog post on this topic that you might want to read.
This method, starting from the author_key list, prefetches the authors objects from datastore setting each one to the proper entity book.
Using this approach saves a lot of calls to datastore and practically * avoids the N+1 problem.
* theoretically, on a bookshelf with 100 books written by 100 different authors, we still have to call the datastore 100+1 times
Answering your question:
Google App Engine does not support
eager fetching
There are techniques (not out of the box) that
helps to avoid the dreaded N+1
problem
So I've read all the RMDB vs BigTable debates
I tried to model a simple game class using BigTable concepts.
Goals : Provide very fast reads and considerably easy writes
Scenario: I have 500,000 user entities in my User model. My user sees a user statistics at the top of his/her game page (think of a status bar like in Mafia Wars), so everywhere he/she goes in the game, the stats get refreshed.
Since it gets called so frequently, why don't I model my User around that fact?
Code:
# simple User class for a game
class User(db.Model):
username = db.StringProperty()
total_attack = db.IntegerProperty()
unit_1_amount = db.IntegerProperty()
unit_1_attack = db.IntegerProperty(default=10)
unit_2_amount = db.IntegerProperty()
unit_2_attack = db.IntegerProperty(default=20)
unit_3_amount = db.IntegerProperty()
unit_3_attack = db.IntegerProperty(default=50)
def calculate_total_attack(self):
self.total_attack = self.unit_1_attack * self.unit_1_amount + \
self.unit_2_attack * self.unit_2_amount + \
self.unit_3_attack * self.unit_3_amount + \
here's how I'm approaching it ( feel free to comment/critique)
Advantages:
1. Everything is in one big table
2. No need to use ReferenceProperty, no MANY-TO-MANY relationships
3. Updates are very easily done : Just get the user entity by keyname
4. It's easy to transfer queried entity to the templating engine.
Disadvantages:
1. If I have 100 different units with different capabilities (attack,defense,dexterity,magic,etc), then i'll have a very HUGE table.
2. If I have to change a value of a certain attack unit, then I'm going to have to go through all 500,000 user entities to change every one of them. ( maybe a cron job/task queue will help)
Each entity will have a size of 5-10 kb ( btw how do I check how large is an entity once I've uploaded them to the production server? ).
So I'm counting on the fact that disk space at App Engine is cheap, and I need to minimize the amount of datastore API calls. And I'll try to memcache the entity for a period of time.
In essence, everything here goes against RMDB
Would love to hear your thoughts/ideas/experiences.
First a simple answer to "how do I know how big an entity is?": Once you've got some data in your app on the app engine servers, you can go to your app's console and click the 'Datastore statistics' link. That will give you some basic stats on your entities, like how much space each Kind is using, what property types are using the most disk space, etc. I don't think you can drill down to the level of one particular User however.
Now here are some thoughts on your design. It is worth it to create a separate table for your Units. Even if you end up with a few hundred units, it will be easy to keep them all in memcache, so looking up the details of each unit will be negligible. It will cost you a few extra API calls to initially populate memcache with a unit's info the first time it is used, but after that you will be saving a good amount of CPU cycles by not having to fetch the details of each unit from the database,and saving huge amounts of API calls when you need to update a unit (which you have already realized will be very expensive) In addition, each User object will use less disk space if it only needs a reference to a Unit entity rather than holding all the details itself. (Of course this depends on the amount of info you need to store about each unit, but you did mention that eventually you will be storing lots of stats for each unit)
If you do have a separate table for Units, it will also allow you to keep your User object more flexible. Instead of needing a specific field for each unit, you could just have a list of refernces to units. That way, if you add a unit type, you would not have to modify your User kind.
You should create independent models for your units. "While a single entity or entity group has a limit on how quickly it can be updated, App Engine excels at handling many parallel requests distributed across distinct entities, and we can take advantage of this by using sharding." Have a look at this article. It may be useful.
based on Peter's thoughts, I came up with the following revised User model. What do you people think?
class Unit(db.Model):
name = db.StringProperty()
attack = db.IntegerProperty()
#initialize 4 different types of units
Unit(key_name="infantry",name="Infantry",attack=10).put()
Unit(key_name="rocketmen",name="Rocketmen",attack=20).put()
Unit(key_name="grenadiers",name="Grenadiers",attack=30).put()
Unit(key_name="engineers",name="Engineers",attack=40).put()
class User(db.Model):
username = db.StringProperty()
# eg: [10,50,100,200] -> this represents 10 infantry, 50 rocketmen, 100 grenadiers and 200 engineers
unit_list_count = db.ListProperty(item_type=int)
# this holds the list of key names of each unit type: ["infantry","rocketmen","grenadiers","engineers"]
unit_list_type = db.StringListProperty()
# total attack is not calculated inside the model. Instead, I will use a
# controller file ( a py file ) to call the contents of unit_list_count and
# unit_list_type of a certain user entity, and make simple multiplications and additions to get total attack
and yes, all the unit_types will be memcached so they can be retrieved for the fast calculation of total attack points.
Would like to hear everyone's thoughts on this.