Ancestor relation in datastore - google-app-engine

I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user(Grand Parent) post(parent) comment(child)
I'm little bit confused about ancestors. I read from documention and searches that ancestors are used for transactions, every ancestors are in same entity group and entity groups are stored in same datastore node which makes it less scaleable. Is this right?
Is creating user as parent of posts and post as parent of comments a good thing?
Rather than this we can add one extra property in the post entity like user_id as shown in example and filter by it.
Which is better/more scalable: filter posts by ancestors or add an extra property user_id in the post Entity and filter by it?
I know both approaches can get the same results but I want to know which one is better in performance and scalability?
Sorry, I'm new in datastore.
Update 11/4/2017
A large number of users is using this App. It's is quite possible there are more
than one posts per sec. But A single user can not create posts more than one per sec. But multiple user may be. As described in documentations maximum entity group write rate of 1/s. Is it still possible to use Ancestor ?
Same for comments. Multiple user can add comment in a same entity group. It's is
quite possible more than one comment in one sec.
Ancestor Queries are faster ?
I read in many places that ancestors queries are much faster than others.
As I know the reason why they are fast is that because it create entity group and store related data in same node. So, it require less time to get data from single node as compare to multiple nodes.
For Example: If post is store in Asia node and comment is store in Europe node and I want to get posts and comments then datastore API need to fetch two nodes to complete request. Which make it slow. Rather than if I create ancestor relation and make entity group which create a better performance.
But what if I don't need to get post and comment data at same time. If I need post in separate web page and comment in separate page.In this scenario datastore api need to fetch only one node at a time.It is not matter data save in single node or save in multiple node. What about query performance can ancestor make it fast in this case ?

Yes, you are correct: all ancestry-related entities are in the same entity group, which raises 2 scalability issues: data contention and maximum entity group write rate of 1/s. See somehow related Is there an Entity Group Max Size?
There are advantages of using ancestries and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical to see every new user/post/comment in random searches immediately after it is created (i.e. strong consistency) - the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.

I think the question to ask is: Are you expecting:
User to create Posts more than once per seconds (I doubt :)
People to comment on a Post more than once per second (could happen)
It not, then having ancestors queries will be faster than normal queries. So it depends of your usecase. I'd go for query speed unless you know you will have thousands of comments on posts.

Related

GAE ndb best practice to store large one to many relations

I'm searching for the best practice to store a large amount of Comment Entities which have a one to many relationship to another entity.
I read a lot about the limitations about the datastore and don't know how to solve this.
I can't store them as structured properties due to the 1MB Entity Limitation.
Also Guido van Rossum answered the question about repeated properties with "if you have more than 100-1000 values" do not use repeated properties.
So repeated properties are no solution for my comments, too.
Final Question: What is the best practice to solve this problem? Are ancestors an opportunity?
Edit: In this question about ancestor or reference properties Nick Johnson mentioned that "Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down if you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern."
What exactly does " writing multiple entities concurrently mean" ? When different user comment at the same time to that entity?
Depends on the amount you read / write per bill.
You can store references for more than 1000 (until an amount depending by the key size and how you reference them) as json compressed unindexed properties. But take care then with referencing and dereferecing that amount. Plus your overhead and data amount that you will transfer on each request will be big. You don't want though to be doing ops on 1000000 compressed entity keys on the server for just a simple request. If you take this way trying to optimize this approach do it on the client as smart as you can.
Go for ancestors and/or optimize your logic not to be consistent (eg it doesn't matter if a comment is not shown immediately) and use iterators or pointer or seeks (whatever it's called)

GAE Datastore: Normalization?

Normalization not in a general relational database sense, in this context.
I have received reports from a User. The data in these reports was generated roughly at the same time, making the timestamp the same for all reports gathered in one request.
I'm still pretty new to datastore, and I know you can query on properties, you have to grab the ancestors' entity's key to traverse down... so I'm wondering which one is better performance and "write/read/etc" wise.
Should I do:
Option 1:
User (Entity, ancestor of ReportBundle): general user information properties
ReportBundle (Entity, ancestor of Report): timestamp
Report (Entity): general data properties
Option 2:
User (Entity, ancestor of Report): insert general user information properties
Report (Entity): timestamp property AND general data properties
Do option 2:
Because, you save time for reading and writing an additional Entity.
You also save database operations (which in the end will save money).
As I see from your options, you need to check the timestamp property anyhow so putting it inside the report object would be fine,
also your code is less complex and better maintainable.
As mentioned from Chris and in comments, using datastore means thinking denormalized.
It's better to store the data twice then doing complex queries, goal for your data design should be to get the entities by ID.
Doing so will also save on the amount of indexes you may need. This is important to know.
The reason why the amount of indexes is limited, is because of denormalization.
For each index you create, datastore creates a new table in behind, which holds the data in the right order based on your index. So when you use indexes your data is already stored more then one time. The good thing about this behavior is that writes are faster, because you can write to all the index tables in parallel. Also reads, because you read data already in right order based on your index.
Knowing this, and if only these 2 options are available, option 2 would be the better one.
We have lots of very denormalized models because of the inability to do JOINs.
You should think about how you are going to process the data, if you might expect request timeouts.

"Connected" vs "disconnected" data model at GAE: how should I proceed in this case?

It's better to start with an example to illustrate this case. Let's say we have an User class and it should have an list of Post.
The first thought is to create this list inside the User class, but analyzing the use cases we find out that most of the times we want to retrieve the user without its posts and retrieve the posts without the user. However we need the user ID to retrieve posts. So the other way to create the data model is to not have the associations but create Post indexed by User ID.
In terms of cost, what are the pros and cons of both implementations?
See the billing page, in particular the section on the datastore operations:
https://developers.google.com/appengine/docs/billing
Datastore read costs grow per entity.
Datastore write costs grow per indexed property.
The first method will be much cheaper since it only operates on one User entity, and there's no indexing required.
However, cost probably isn't your sole deciding factor. Entities are limited to 1MB each, so if you're storing your posts within your User entity, you'll likely run into a wall. Time to read/write entities also depend on size, so large entities will take longer to read/write.
My previous answer was assuming you were actually storing a list of Post objects within your User entity. It sounds like you're asking if the User and Post are both entities, and the User stores a list of keys to the Posts.
The main benefit to the first case (User with a List of keys to Post entities) is that it enables you to fetch Posts in a consistent manner. After getting the User object, you can read the list of POSTS and fetch them individually. Datastore get-by-key operations are consistent. Depending on how you issue the get operations, this may be slower than a query.(ie, if you just use a for loop).
There's a possible very minor other benefit is as long as you don't index your Post List in your User, you can update your User relatively inexpensively this way. As an extreme example, if your User adds 5 Posts at once, you can add them all to the list, and then write the User once with one write operation. This isn't really all that great, since you probably have to write your Post entity anyways, But it's one less index write op per entity.
There is still the limit on the size of the User entity, so your List will have a maximum limit. There's also a maximum on the number of index entries per entity, so if you index the List, that could be a limit (but that would make the User entity more expensive to write too).
From a read perspective, the first case is non optimal.
The second case works better from a read perspective, it makes it easier to get Posts if you have the User id, but you have the index write ops when you write your Post. If you don't write Posts often, this is better. Note that queries are eevntually consistent.

google app engine query opimization

I am trying to do my reads and writes for GAE as efficiently as possible and I was wondering which is the best of the following two options.
I have a website where users are able to post different things and right now whenever I want to show all posts by that user I do a query for all posts with that user's user ID and then I display them. Would it be better to store all of the post IDs in the user entity and do a get_by_id(post_ID_list) to return all of the posts? Or would that extra space being used up not be worth it?
Is there anywhere I can find more information like this to optimize my web app?
Thanks!
The main reason you would want to store the list of IDs would be so that you can get each entity separately for better consistency - entity gets by id are consistent with the latest version in the datastore, while queries are eventually consistent.
Check datastore costs and optimize for cost:
https://developers.google.com/appengine/docs/billing
Getting entities by key wouldn't be any cheaper than querying all the posts. The query makes use of an index.
If you use projection queries, you can reduce your costs quite a bit.
There is several cases.
First, if you keep track for all ids of user's posts. You must use entity group for consistency. Thats means speed of write to datastore would be ~1 entity per second. And cost is 1 read for object with ids and 1 read per entity.
Second, if you just use query. This is not need consistency. Cost is 1 read + 1 read per entity retrieved.
Third, if you quering only keys and after fetching. Cost is 1 read + 1 small per key retrieved. Watch this: Keys-Only Queries. This equals to projection quering for cost.
And if you have many result, and use pagination then you need use Query Cursors. That prevent useless usage of datastore.
The most economical solution is third case. Watch this: Batch Operations.
In case you have a list of id's because they are stored with your entity, a call to ndb.get_multi (in case you are using NDB, but it would be similar with any other framework using the memcache to cache single entities) would save you further datastore calls if all (or most) of the entities correpsonding to the keys are already in the datastore.
So in the best possible case (everything is in the memcache), the datastore wouldn't be touched at all, while using a query would.
See this issue for a discussion and caveats: http://code.google.com/p/appengine-ndb-experiment/issues/detail?id=118.

Clarification: can I put all of a user's data in a single entity group by making up an ancestor key?

I want to do several operations on a user's data in a single transaction, but won't need to update multiple users' data in a single transaction. I see from http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html#Entity_Groups_Ancestors_and_Paths that "A good rule of thumb for entity groups is that [entity groups] should be about the size of a single user's worth of data or smaller," so I think the correct choice is to use a single parent key when building the keys for the other entities related to a user.
Does this seem like a good idea?
Is it easy to code? Something like KeyBuilder.setParent(theKeyOfMyUserEntity)?
1) It is hard to comment without some addition details about the data. There are several things you should be aware of with entity groups; the biggest is that the group will be stored together. That means if you are trying to do many (separate) updates you could face contention, limiting your app's performance.
2) yes it is easy to code. The syntax is pretty close to what you posted.
There are other options for transactions. Check out Nick Johnson's article on distributed transactions. If you are wanting transactions for aggregates you should also check out Brett Slatkin's IO talk on high-throughput data pipelines.
Yes, it seems reasonable to store some user data as child entities of a User entity.
Why do you need to manually create keys ? The db.Model() constructor already has a convenient "parent" argument which will automatically put both the parent entity and the child entity in the same entity group.

Resources