Sorry if this question is too simple; I'm only entering 9th grade.
I'm trying to learn about NoSQL database design. I want to design a Google Datastore model that minimizes the number of read/writes.
Here is a toy example for a blog post and comments in a one-to-many relationship. Which is more efficient - storing all of the comments in a StructuredProperty or using a KeyProperty in the Comment model?
Again, the objective is to minimize the number of read/writes to the datastore. You may make the following assumptions:
Comments will not be retrieved independently of their respective blog post. (I suspect that this makes the StructuredProperty most preferable.)
Comments will need to be sortable by date, rating, author, etc. (Subproperties in the datastore cannot be indexed, so perhaps this could affect performance?)
Both blog posts and comments may be edited (or even deleted) after they are created.
Using StructuredProperty:
from google.appengine.ext import ndb

class Comment(ndb.Model):
    pass  # various properties...

class BlogPost(ndb.Model):
    comments = ndb.StructuredProperty(Comment, repeated=True)
    # various other properties...
Using KeyProperty:
from google.appengine.ext import ndb

class BlogPost(ndb.Model):
    pass  # various properties...

class Comment(ndb.Model):
    blogPost = ndb.KeyProperty(kind=BlogPost)
    # various other properties...
Feel free to bring up any other considerations that relate to efficiently representing a one-to-many relationship with regards to minimizing the number of read/writes to the datastore.
Thanks.
I could be wrong, but from what I understand, a StructuredProperty is just a property within an entity, but with sub-properties.
This means reading a BlogPost and all its comments would only cost one read. So when you render your page, you only need one read op for your entire page.
Each write would be cheaper too. You'll need one read op to get the BlogPost, and as long as you don't update any indexed properties, it'll be just one write op.
You can handle the comment sorting on your own after you read the entity out of the datastore.
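As a rough sketch of that in-memory sorting: the snippet below uses plain Python objects in place of the ndb sub-entities you would get back from the BlogPost. The `Comment` fields (`author`, `rating`, `created`) and the `sort_comments` helper are made up for illustration, since the question only says "various properties".

```python
from dataclasses import dataclass
from datetime import datetime
from operator import attrgetter


@dataclass
class Comment:
    # Hypothetical fields standing in for the question's "various properties".
    author: str
    rating: int
    created: datetime


def sort_comments(comments, field, descending=False):
    """Return a new list of comments ordered by the named attribute."""
    return sorted(comments, key=attrgetter(field), reverse=descending)


comments = [
    Comment("carol", 3, datetime(2013, 5, 2)),
    Comment("alice", 5, datetime(2013, 5, 3)),
    Comment("bob", 1, datetime(2013, 5, 1)),
]

by_rating = sort_comments(comments, "rating", descending=True)
by_date = sort_comments(comments, "created")
by_author = sort_comments(comments, "author")
```

Since everything is already in memory after the single read, switching the sort order costs no extra datastore ops.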
You'll have to synchronize your comment updates/edits with transactions, to make sure one comment doesn't overwrite another, since they are both modifying the same entity. You may run into unsolvable contention problems if everyone is commenting on and editing the same blog post at the same time.
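To see why the synchronization matters, here is a small in-memory simulation. It is not ndb code: `FakePostStore` is a hypothetical stand-in for one BlogPost entity, and a simple version check plays the role that a transaction would play in the real datastore.

```python
class FakePostStore:
    """Hypothetical stand-in for a single BlogPost entity holding its comments."""

    def __init__(self):
        self._comments = []
        self._version = 0

    def read(self):
        # Return a snapshot plus the version it was read at.
        return list(self._comments), self._version

    def blind_write(self, comments):
        # Overwrites whatever is stored, like an unsynchronized put().
        self._comments = list(comments)
        self._version += 1

    def checked_write(self, comments, expected_version):
        # Rejects the write if someone else committed since our read.
        if expected_version != self._version:
            return False
        self.blind_write(comments)
        return True


def add_comment(store, text, max_attempts=5):
    """Read-modify-write with retry, the pattern a transaction gives you."""
    for _ in range(max_attempts):
        comments, version = store.read()
        comments.append(text)
        if store.checked_write(comments, version):
            return True
    return False


# Unsynchronized: both writers read the same empty snapshot, the last write wins.
unsafe = FakePostStore()
a, _ = unsafe.read()
b, _ = unsafe.read()
unsafe.blind_write(a + ["first"])
unsafe.blind_write(b + ["second"])
lost_update_result = unsafe.read()[0]

# Synchronized: the stale write is rejected and has to be retried.
safe = FakePostStore()
stale, stale_version = safe.read()
add_comment(safe, "second")
stale_write_accepted = safe.checked_write(stale + ["first"], stale_version)
add_comment(safe, "first")
safe_result = safe.read()[0]
```

In the unsynchronized run the comment "first" silently disappears. The retries in the second run are also where the contention comes from when many people comment at once.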
In optimizing for cost though, you'll hit a wall with the maximum entity size of 1MB. This will limit the number of comments you can store per blog post.
Going with the KeyProperty would be quite a bit more expensive.
You'll need one read to get the blog post, plus 1 query plus 1 small read op for each comment.
Every comment is a new entity, so it'll be at least 4 write ops. You may want to index for sort order, so that'll end up costing even more write ops.
On the plus side, you'll have unlimited comments per blog post, and you don't have to worry about synchronizing new comments. You might need to worry about synchronization for editing comments, but if you limit editing to the creator, that shouldn't really be a problem. You don't have to do the sorting yourself either.
It's a cost vs features tradeoff.
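To put the numbers from this answer side by side, here is a back-of-the-envelope tally. The op counts are taken straight from the paragraphs above (they are this answer's estimates, not official pricing), and the function names and the "structured"/"key" labels are just for illustration.

```python
def ops_to_render_page(design, num_comments):
    """Ops needed to show one blog post with all its comments."""
    if design == "structured":
        # Post and comments live in one entity: a single read.
        return {"read": 1, "query": 0, "small": 0}
    if design == "key":
        # One read for the post, one query, one small op per comment.
        return {"read": 1, "query": 1, "small": num_comments}
    raise ValueError("unknown design: %r" % design)


def ops_to_add_comment(design):
    """Ops needed to add one comment (lower bound for the 'key' design)."""
    if design == "structured":
        # Read the BlogPost, then write it back (no indexed properties touched).
        return {"read": 1, "write": 1}
    if design == "key":
        # A new entity costs at least 4 write ops, more with extra indexes.
        return {"read": 0, "write": 4}
    raise ValueError("unknown design: %r" % design)


structured_page = ops_to_render_page("structured", 50)
key_page = ops_to_render_page("key", 50)
```

For a post with 50 comments that is 1 op versus 52 ops per page view, which is the cost side of the tradeoff.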
What about:
from google.appengine.ext import ndb

class Comment(ndb.Model):
    pass  # various properties...

class BlogPost(ndb.Model):
    comments = ndb.KeyProperty(Comment, repeated=True)
    # various other properties...
This way, you can store up to 5000 comments per blog post (the maximum number of repeated properties) independent of the size of each comment. You won't need a query to fetch the comments for a blog post; you can just do ndb.get_multi(blog_post.comments). And for this operation, you can try to rely on ndb's memcache. Of course, it depends on your use case whether this is a good assumption or not.
Be aware of this caveat when using a repeated StructuredProperty:
Do not use repeated properties if you have more than 100-1000 values. (1000 is probably already pushing it.) They weren't designed for such use.
See Guido's answer in GAE ndb design, performance and use of repeated properties.
So while you may not hit the 1 MB entity limit with StructuredProperty, you may easily hit the 100-1000 suggested max.
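A quick way to see which limit bites first is to estimate both. This is only a sketch: the 1 MB figure is taken as 1024 * 1024 bytes, the cap of 1000 is the upper end of the quoted guidance, and the average comment size is something you would have to measure yourself.

```python
MAX_ENTITY_BYTES = 1024 * 1024   # assumed value for the 1 MB entity limit
SUGGESTED_MAX_REPEATED = 1000    # upper end of the quoted 100-1000 guidance


def max_embedded_comments(avg_comment_bytes, other_post_bytes=0):
    """Rough cap on comments stored inside one BlogPost entity."""
    by_size = (MAX_ENTITY_BYTES - other_post_bytes) // avg_comment_bytes
    return min(by_size, SUGGESTED_MAX_REPEATED)


short_comments = max_embedded_comments(500)    # size allows ~2000, guidance caps it
long_comments = max_embedded_comments(4096)    # size limit is the tighter bound
```

With short comments the suggested maximum is reached well before the entity size limit; with long comments it is the other way around.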
I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user (grandparent) > post (parent) > comment (child)
I'm a little bit confused about ancestors. From the documentation and from searching, I understand that ancestors are used for transactions, that all entities sharing an ancestor are in the same entity group, and that an entity group is stored on the same datastore node, which makes it less scalable. Is this right?
Is creating user as parent of posts and post as parent of comments a good thing?
Alternatively, we could add an extra property such as user_id to the post entity, as shown in the example, and filter by it.
Which is better/more scalable: filter posts by ancestors or add an extra property user_id in the post Entity and filter by it?
I know both approaches can get the same results but I want to know which one is better in performance and scalability?
Sorry, I'm new in datastore.
Update 11/4/2017
A large number of users are using this app, so it's quite possible that more than one post is created per second overall. A single user cannot create more than one post per second, but multiple users together may. As described in the documentation, the maximum entity group write rate is 1/s. Is it still possible to use ancestors?
The same goes for comments. Multiple users can add comments in the same entity group, so it's quite possible to get more than one comment per second.
Are ancestor queries faster?
I read in many places that ancestor queries are much faster than other queries.
As far as I know, the reason they are fast is that an ancestor relation creates an entity group and stores related data on the same node, so it takes less time to get data from a single node than from multiple nodes.
For example: if a post is stored on an Asia node and a comment is stored on a Europe node, and I want to get posts and comments, then the datastore API needs to fetch from two nodes to complete the request, which makes it slow. If instead I create an ancestor relation and make an entity group, I get better performance.
But what if I don't need to get post and comment data at the same time, e.g. if I show posts on one web page and comments on a separate page? In this scenario the datastore API needs to fetch from only one node at a time, so it doesn't matter whether the data is saved on a single node or on multiple nodes. Can ancestors still make queries faster in this case?
Yes, you are correct: all ancestry-related entities are in the same entity group, which raises 2 scalability issues: data contention and a maximum entity group write rate of 1/s. See the somewhat related question "Is there an Entity Group Max Size?"
There are advantages of using ancestries and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical to see every new user/post/comment in random searches immediately after it is created (i.e. strong consistency) - the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.
I think the question to ask is: Are you expecting:
A user to create posts more than once per second (I doubt it :)
People to comment on a post more than once per second (could happen)
If not, then ancestor queries will be faster than normal queries. So it depends on your use case. I'd go for query speed unless you know you will have thousands of comments on posts.
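If you have some idea of your traffic, you can sanity-check it against the 1 write/s per entity group figure. The helper below is a rough sketch with made-up names: it takes (entity group id, timestamp) pairs and flags groups that get more than one write within the same clock second. Bucketing by whole seconds is crude (two writes at 10.9 and 11.0 are not flagged), so treat it as a first estimate only.

```python
from collections import Counter


def groups_over_write_limit(writes, limit_per_second=1):
    """writes: iterable of (entity_group_id, unix_timestamp) pairs.

    Returns the set of entity groups that received more than
    limit_per_second writes within any single clock second.
    """
    per_bucket = Counter((group, int(ts)) for group, ts in writes)
    return {group for (group, _), n in per_bucket.items() if n > limit_per_second}


# Hypothetical traffic where each post is the root of its own entity group.
writes = [
    ("post-1", 10.1),
    ("post-1", 10.6),   # second comment on post-1 within the same second
    ("post-2", 10.2),
    ("post-2", 11.4),
]
hot_groups = groups_over_write_limit(writes)
```

If you make the user the root instead, every post and comment below that user counts against the same group, so the group ids in `writes` would all collapse into the user's id.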
Say I have a blog app with blog posts and comments. Let's say for the sake of argument that there can be a very large number of comments, big enough that a simple comments = StringProperty(repeated=True) would be insufficient.
Should I store the comments as a JSONProperty (serialized from python list):
class BlogPost(ndb.Model):
    title = ndb.StringProperty()
    description = ndb.TextProperty()
    comments = ndb.JSONProperty()
Or should I create a separate Comment model altogether and store the corresponding blogpost's ID as a property:
class Comment(ndb.Model):
    text = ndb.TextProperty()
    blog_id = ndb.IntegerProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)
And I can query for all the comments of a specific blogpost as follows: query = Comment.query(Comment.blog_id==blog_id).order(-Comment.created)?
Is one approach preferable? Especially if comments could get very large > 1000.
You definitely want a separate model for comments.
One reason is that entities are limited to 1MB in size. If one post gets a huge number of comments, then you are in danger of exceeding the limit and your code would crash.
Another reason is that you want to consider read/write rates for entities and scalability. If you use JSON, then you need to update the BlogPost entity every time a comment is made. If a lot of people are writing comments at the same time, then you will need transactions and will have contention issues. If you have a separate model for comments, then you can easily scale to a million comments per second!
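One more way to look at the JSON approach: because the whole BlogPost is rewritten on every new comment, the amount of data you push to the datastore grows with the square of the comment count. The sketch below only measures serialized JSON with the standard library; the comment text and both function names are made up, and real entity sizes will differ somewhat.

```python
import json


def bytes_written_json_blob(comments):
    """Total bytes serialized if every new comment rewrites the whole list."""
    total = 0
    blob = []
    for comment in comments:
        blob.append(comment)
        total += len(json.dumps(blob).encode("utf-8"))
    return total


def bytes_written_separate(comments):
    """Total bytes serialized if each comment is written once on its own."""
    return sum(len(json.dumps(c).encode("utf-8")) for c in comments)


comments = ["x" * 100] * 50   # fifty 100-character comments
blob_total = bytes_written_json_blob(comments)
separate_total = bytes_written_separate(comments)
```

For these fifty small comments the blob approach serializes about 26 times as much data, and the gap keeps widening as more comments arrive.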
I am implementing a Comments section for my current application. The Comments section can be thought of as a series of user posts on a given page. I am wondering which design would be most effective in a non-relational database (Google App Engine).
Design 1:
Group the comments by a groupId and filter on those results
Comment Entity >> [id, groupId, otherData...]
Queries for all comments pertaining to a page would look like:
Select from Comments filter by groupId
Design 2:
Store a single key for all comments within a group and use a Self Expanding List if the number of entries exceeds 5000 entries.
Comment Entity >> [id, SELid]
Queries would simply perform an id/key lookup.
I understand that Indexes can be expensive, but the first design proposal will only index the groupId field and will only require a single write to post a comment (well more writes if you include the index).
The second design will avoid costly indexing, but each posted comment will require a read and a write operation. Furthermore, I'm worried about contention issues. These comments should not be experiencing extremely high throughput, but the second design seems to create a bottleneck.
As I am new to non-relational DB's, I would appreciate any input on these proposed designs and their associated tradeoffs.
In case of App Engine and Datastore, the approach you will take depends mainly on the consistency model (strong vs eventual) you require for your entities. In Google Cloud Datastore, there is a concept of an entity group. The entity group (an entity and its descendants) is a unit with strong consistency, transactionality, and locality but also imposes some restrictions (1 write per second).
Considerations
Do you require strong consistent results?
How often will comments be posted per page?
How many comments per page do you expect?
Do you have a use case requiring transactional behaviour?
Since neither of your design options uses an entity group (page -> posts), I suppose you decided not to go this way.
Design 1
Eventually consistent lookup by groupId
Easier to maintain (you do not have to deal with the 5000-entry limit)
Design 2
Strongly consistent lookup by entityGroupId
Harder to maintain (you HAVE to deal with the 5000-entry limit)
As mentioned, one entity representing all posts for a page can be a bottleneck (this can be reduced by means of Memcache)
I would probably go with the first approach even though it can resemble relational data model.
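For what it's worth, the bookkeeping that Design 2 forces on you looks roughly like this. It is a plain-Python sketch of the "self expanding list" idea, not datastore code: the class name and methods are hypothetical, and each inner list stands for one stored entity that is allowed to hold at most 5000 entries.

```python
CHUNK_LIMIT = 5000  # the per-entity entry limit mentioned in the question


class SelfExpandingList:
    """Keeps comment ids in fixed-size chunks, adding a new chunk when one is full."""

    def __init__(self, chunk_limit=CHUNK_LIMIT):
        self.chunk_limit = chunk_limit
        self.chunks = [[]]  # each inner list stands for one stored entity

    def append(self, comment_id):
        if len(self.chunks[-1]) >= self.chunk_limit:
            self.chunks.append([])  # spill over into a fresh chunk
        self.chunks[-1].append(comment_id)

    def __iter__(self):
        for chunk in self.chunks:
            for comment_id in chunk:
                yield comment_id

    def __len__(self):
        return sum(len(chunk) for chunk in self.chunks)


# Tiny chunk size so the spill-over is easy to see.
sel = SelfExpandingList(chunk_limit=3)
for comment_id in range(7):
    sel.append(comment_id)
```

Every append touches the last chunk, which is exactly the single hot entity the question is worried about; Design 1 has no such shared entity.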
I'm new to GAE, would appreciate your advice on GAE-app data storage approaches.
Simple example: there are Author and Document entities, and each Author may be the creator of several Documents. So we have two options:
1) Add all Documents as children to the corresponding Author entities (owned relationship)
2) Add a field to each Document which will identify the Author (unowned link or something)
What are pros and cons of every approach?
P.S. I know about groups and strong consistency. What else? By the way, eventual consistency: what is it in reality, minutes, hours, ...?
Thanks
The general guideline with most NoSQL stores is to structure your data so that it is optimal for your primary use case and denormalise as you need to to satisfy other needs.
If your most common operation is to read all documents for an author, then putting documents under an author makes sense. If it's to fetch by document, then referencing the author may be more practical.
How the datastore is priced (in terms of cost of reads vs writes) will help guide you - cheapest usually is also the most effective design. For example, if documents are write heavy and have many indexes, option 1 could be expensive when you want to update a single document.
W.R.T. eventual consistency, it usually won't be longer than seconds in the worst case; however, there are no guarantees. You should not rely on it being good enough in a situation where it must be accurate (for example an author editing a document and then previewing it before publishing). Remember that a get by id is a strongly consistent read, so generally you can mitigate this as needed.
Searching for answers, I've gone through a number of articles and also encountered this and this post, which are helpful.
So I formed my opinion and hope it will help someone:
Entity groups advantages:
+ Intrinsic strong consistency (see also about transactions)
+ Ancestor calls may serve like "namespaces in miniature". This may be used to separate data while still making it possible to share it.
Entity groups disadvantages due to limits on writes per second (see here in the end):
- may hurt scalability
- may slow concurrent access
- shouldn't be large anyway since access to groups is serialized
So the use of entity groups IMHO is limited to:
- cases where strong consistency is demanded. Still, to avoid contention, groups should be kept as small as possible
- single user data storage
In all other cases I will avoid them.
I'm searching for the best practice to store a large amount of Comment Entities which have a one to many relationship to another entity.
I read a lot about the limitations of the datastore and don't know how to solve this.
I can't store them as structured properties due to the 1MB Entity Limitation.
Also, Guido van Rossum answered the question about repeated properties with: do not use repeated properties "if you have more than 100-1000 values".
So repeated properties are no solution for my comments, too.
Final Question: What is the best practice to solve this problem? Are ancestors an option?
Edit: In this question about ancestor or reference properties Nick Johnson mentioned that "Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down if you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern."
What exactly does "writing multiple entities concurrently" mean? When different users comment on the same entity at the same time?
It depends on how much you read and write, and on what that ends up costing you.
You can store references to more than 1000 entities (up to an amount that depends on the key size and how you reference them) as compressed, unindexed JSON properties. But take care when referencing and dereferencing that many keys: the overhead and the amount of data you transfer on each request will be big. You don't want to be doing operations on 1,000,000 compressed entity keys on the server for just a simple request. If you go this way, do as much of the work as you can on the client, as smartly as you can.
Go for ancestors and/or design your logic so that it doesn't need to be consistent (e.g. it doesn't matter if a comment is not shown immediately), and use iterators, pointers, or seeks (whatever they're called).
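The "pointer" idea is what ndb calls a cursor: Query.fetch_page(page_size, start_cursor=...) returns a (results, cursor, more) tuple, so you only ever load one page of comments at a time. Below is a plain-Python imitation of that loop over an in-memory list, just to show the shape; the integer cursor and both helper functions are assumptions made for the sketch, not the real ndb cursor type.

```python
def fetch_page(items, page_size, start_cursor=0):
    """Return (page, next_cursor, more), mimicking the shape of ndb's fetch_page."""
    page = items[start_cursor:start_cursor + page_size]
    next_cursor = start_cursor + len(page)
    more = next_cursor < len(items)
    return page, next_cursor, more


def iter_pages(items, page_size):
    """Yield one page at a time until there is nothing left."""
    cursor, more = 0, True
    while more:
        page, cursor, more = fetch_page(items, page_size, cursor)
        if page:
            yield page


comment_ids = list(range(7))
pages = list(iter_pages(comment_ids, 3))
first_page = fetch_page(comment_ids, 3)
```

In a web app you would hand the cursor to the client and fetch the next page only when the user asks for more comments.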