GAE ndb best practice to store large one to many relations - google-app-engine

I'm searching for the best practice to store a large amount of Comment Entities which have a one to many relationship to another entity.
I have read a lot about the limitations of the datastore and don't know how to solve this.
I can't store them as structured properties because of the 1 MB entity size limit.
Also, Guido van Rossum answered a question about repeated properties by advising against them "if you have more than 100-1000 values".
So repeated properties are no solution for my comments either.
Final question: what is the best practice to solve this problem? Are ancestors an option?
Edit: In this question about ancestor or reference properties Nick Johnson mentioned that "Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down if you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern."
What exactly does "writing multiple entities concurrently" mean? Is it when different users comment on that entity at the same time?

It depends on how much you read and write, since that is what you are billed for.
You can store references to more than 1000 entities (up to a limit that depends on the key size and how you reference them) as compressed JSON in an unindexed property. But take care when referencing and dereferencing that many keys: the overhead and the amount of data transferred on each request will be large. You don't want to be processing 1,000,000 compressed entity keys on the server for a simple request. If you take this route, do as much of the work as you sensibly can on the client.
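To make the cost of this approach concrete, here is a minimal sketch in plain Python (standard library only, not the ndb API). The function names and the sample IDs are made up for illustration; in a real app the resulting bytes would go into an unindexed blob-like property. Note how appending a single reference means unpacking and repacking the whole list.

```python
import json
import zlib

ENTITY_LIMIT_BYTES = 1024 * 1024  # the 1 MB entity limit mentioned in the question


def pack_ids(ids):
    """Serialize a list of integer IDs to zlib-compressed JSON bytes."""
    raw = json.dumps(ids, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def unpack_ids(blob):
    """Inverse of pack_ids: decompress and parse back into a list."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def append_id(blob, new_id):
    """Adding one reference means unpacking and repacking the whole list."""
    ids = unpack_ids(blob)
    ids.append(new_id)
    return pack_ids(ids)


# Hypothetical data: 50,000 sequential comment IDs. Real IDs are usually
# scattered and will compress worse than this.
comment_ids = list(range(5000000000, 5000050000))
blob = pack_ids(comment_ids)
blob = append_id(blob, 5000050000)

print(len(blob) < ENTITY_LIMIT_BYTES)  # whether the packed list fits in one entity
print(len(unpack_ids(blob)))           # number of references stored
```

Every new comment turns into a read-modify-write of one shared blob, which is why this only pays off when the list is read far more often than it is changed.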
Go for ancestors, and/or relax your logic so that it does not need strong consistency (e.g. it doesn't matter if a comment is not shown immediately), and page through the results with query cursors.

Related

Ancestor relation in datastore

I have three entities: user, post and comment. A user may have multiple posts and a post may have multiple comments.
I know I can add ancestor relations like this:
user(Grand Parent) post(parent) comment(child)
I'm a little bit confused about ancestors. I read in the documentation and in searches that ancestors are used for transactions, that all ancestors and descendants are in the same entity group, and that an entity group is stored on the same datastore node, which makes it less scalable. Is this right?
Is making user the parent of posts and post the parent of comments a good thing?
Instead, we could add an extra property such as user_id to the post entity, as shown in the example, and filter by it.
Which is better/more scalable: filter posts by ancestors or add an extra property user_id in the post Entity and filter by it?
I know both approaches can get the same results but I want to know which one is better in performance and scalability?
Sorry, I'm new in datastore.
Update 11/4/2017
A large number of users use this app. It is quite possible that more than one post is created per second. A single user cannot create more than one post per second, but multiple users together may. The documentation describes a maximum entity group write rate of 1/s. Is it still possible to use ancestors?
The same goes for comments. Multiple users can add comments in the same entity group, so more than one comment per second is quite possible.
Are ancestor queries faster?
I read in many places that ancestor queries are much faster than other queries.
As far as I know, the reason is that ancestry creates an entity group and stores related data on the same node, so fetching data from a single node takes less time than fetching it from multiple nodes.
For example: if a post is stored on a node in Asia and a comment on a node in Europe, and I want to get posts and comments, the datastore API needs to contact two nodes to complete the request, which makes it slow. If I create an ancestor relation and thereby an entity group, performance is better.
But what if I don't need post and comment data at the same time, because posts are shown on one web page and comments on another? In this scenario the datastore API only needs to contact one node at a time, so it doesn't matter whether the data is saved on a single node or on multiple nodes. Can ancestors still make queries faster in this case?
Yes, you are correct: all ancestry-related entities are in the same entity group, which raises 2 scalability issues: data contention and the maximum entity group write rate of 1/s. See the somewhat related question "Is there an Entity Group Max Size?"
There are advantages of using ancestries and some may be willing to sacrifice scalability for them (see What would be the purpose of putting all datastore entities in a single group?), but IMHO not for your kind of app: I think you'll agree that it's not really critical to see every new user/post/comment in random searches immediately after it is created (i.e. strong consistency) - the fact that it eventually appears is IMHO good enough.
Simply having no ancestry at all and adding additional model properties (entity keys or even just entity key IDs for entities which never have ancestors) to allow cross-referencing entities is the more scalable approach and IMHO fits well with your app.
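The difference between the two layouts can be sketched in plain Python. This is only a model of key paths, not the Datastore API: the tuples, IDs and the helper name are made up for illustration. The point is that the root of a key path determines the entity group, so the ancestry layout funnels all of one user's posts and comments into a single group, while the flat layout gives every entity its own group and links them through ordinary properties.

```python
def entity_group_root(key_path):
    """The first (kind, id) pair of a key path identifies the entity group."""
    return key_path[0]


# Layout 1: ancestry (user -> post -> comment).
comment_a = (("User", 1), ("Post", 10), ("Comment", 100))
comment_b = (("User", 1), ("Post", 11), ("Comment", 101))

# Layout 2: no ancestry; cross-references are plain properties.
flat_comment_a = {"key": (("Comment", 100),), "post_id": 10, "user_id": 1}
flat_comment_b = {"key": (("Comment", 101),), "post_id": 11, "user_id": 1}

# Both comments share the root ("User", 1), so writes to them are
# subject to the same per-group write limit.
same_group_with_ancestry = (
    entity_group_root(comment_a) == entity_group_root(comment_b)
)

# Each flat comment is its own root, so the writes do not contend.
same_group_flat = (
    entity_group_root(flat_comment_a["key"])
    == entity_group_root(flat_comment_b["key"])
)

print(same_group_with_ancestry, same_group_flat)
```

In the flat layout, finding the comments of a post becomes a filter on post_id, which is eventually consistent rather than strongly consistent.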
I think the question to ask is: are you expecting:
Users to create posts more than once per second (I doubt it :)
People to comment on a post more than once per second (could happen)
If not, then ancestor queries will be faster than normal queries. So it depends on your use case. I'd go for query speed unless you know you will have thousands of comments on posts.

App Engine Data Modeling for Comments

I am implementing a Comments section for my current application. The Comments section can be thought of as a series of user posts on a given page. I am wondering which design would be most effective in a non-relational database (Google App Engine).
Design 1:
Group the comments by a groupId and filter on those results
Comment Entity >> [id, groupId, otherData...]
Queries for all comments pertaining to a page would look like:
Select from Comments filter by groupId
Design 2:
Store a single key for all comments within a group and use a Self Expanding List if the number of entries exceeds 5000 entries.
Comment Entity >> [id, SELid]
Queries would simply perform an id/key lookup.
I understand that Indexes can be expensive, but the first design proposal will only index the groupId field and will only require a single write to post a comment (well more writes if you include the index).
The second design will avoid costly indexing, but each posted comment will require a read and a write operation. Furthermore, I'm worried about contention issues. These comments should not be experiencing extremely high throughput, but the second design seems to create a bottleneck.
As I am new to non-relational DB's, I would appreciate any input on these proposed designs and their associated tradeoffs.
In case of App Engine and Datastore, the approach you will take depends mainly on the consistency model (strong vs eventual) you require for your entities. In Google Cloud Datastore, there is a concept of an entity group. The entity group (an entity and its descendants) is a unit with strong consistency, transactionality, and locality but also imposes some restrictions (1 write per second).
Considerations
Do you require strongly consistent results?
How often will comments be posted per page?
How many comments per page do you expect?
Do you have a use case requiring transactional behaviour?
Since neither of your design options uses entity group (page -> posts), I suppose you decided not to go this way.
Design 1
Eventually consistent lookup by groupId
Easier to maintain (you do not have to deal with 5000 entities limit)
Design 2
Strongly consistent lookup by entityGroupId
Harder to maintain (you HAVE to deal with 5000 entities limit)
As mentioned, one entity representing all posts for a page can be a bottleneck (this can be reduced by means of Memcache)
I would probably go with the first approach even though it can resemble relational data model.

Google AppEngine: approaches to store data in DataStore

I'm new to GAE, would appreciate your advice on GAE-app data storage approaches.
Simple example: there are Author and Document entities, and each Author may be the creator of several Documents. So we have two options:
1) Add all Documents as children of the corresponding Author entities (owned relationship)
2) Add a field to each Document which identifies the Author (unowned link or something)
What are pros and cons of every approach?
P.S. I know about groups and strong consistency. What else? By the way, eventual consistency: what is it in reality? Minutes, hours, ...?
Thanks
The general guideline with most NoSQL stores is to structure your data so that it is optimal for your primary use case and denormalise as you need to to satisfy other needs.
If your most common operation is to read all documents for an author, then putting documents under an author makes sense. If it's a fetch by document, then referencing the author may be more practical.
How the datastore is priced (in terms of cost of reads vs writes) will help guide you - cheapest usually is also the most effective design. For example, if documents are write heavy and have many indexes, option 1 could be expensive when you want to update a single document.
W.R.T. eventual consistency: it usually won't take longer than a few seconds in the worst case, but there are no guarantees. You should not rely on it being good enough in a situation where it must be accurate (for example an author editing a document then previewing it before publishing). Remember that a get by id is a strongly consistent read, so generally you can mitigate this as needed.
Searching for answers, I've run through a number of articles and also encountered this and this post, which are helpful.
So I formed my opinion and hope it will help someone:
Entity groups advantages:
+ Intrinsic strong consistency (see also about transactions)
+ Ancestor calls may serve as something like "namespaces in miniature". This can be used to separate data while still keeping the possibility of sharing it.
Entity groups disadvantages due to limits on writes per second (see here in the end):
- may hurt scalability
- may slow concurrent access
- shouldn't be large anyway since access to groups is serialized
So the use of entity groups IMHO is limited to:
- cases where strong consistency is demanded. Even then, groups should be kept as small as possible to avoid contention
- single user data storage
In all other cases I will avoid them.

Database storage design of large amounts of heterogeneous data

Here is something I've wondered for quite some time, and have not seen a real (good) solution for yet. It's a problem I imagine many games having, and that I can't easily think of how to solve (well). Ideas are welcome, but since this is not a concrete problem, don't bother asking for more details - just make them up! (and explain what you made up).
Ok, so, many games have the concept of (inventory) items, and often, there are hundreds of different kinds of items, all with often very varying data structures - some items are very simple ("a rock"), others can have insane complexity or data behind them ("a book", "a programmed computer chip", "a container with more items"), etc.
Now, programming something like that is easy - just have everything implement an interface, or maybe extend an abstract root item. Since objects in the programming world don't have to look the same on the inside as on the outside, there is really no issue with how much and what kind of private fields any type of item has.
But when it comes to database serialization (binary serialization is of course no problem), you are facing a dilemma: how would you represent that in, say, a typical SQL database ?
Some attempts at a solution that I have seen, none of which I find satisfying:
Binary serialization of the items, the database just holds an ID and a blob.
Pro's: takes like 10 seconds to implement.
Con's: Basically sacrifices every database feature, hard to maintain, near impossible to refactor.
A table per item type.
Pro's: Clean, flexible.
Con's: With a wide variety come hundreds of tables, and every search for an item has to query them all since SQL doesn't have the concept of table/type 'reference'.
One table with a lot of fields that aren't used by every item.
Pro's: takes like 10 seconds to implement, still searchable.
Con's: Waste of space, performance, confusing from the database to tell what fields are in use.
A few tables with a few 'base profiles' for storage where similar items get thrown together and use the same fields for different data.
Pro's: I've got nothing.
Con's: Waste of space, performance, confusing from the database to tell what fields are in use.
What ideas do you have? Have you seen another design that works better or worse?
It depends on whether you need to sort, filter, count, or analyze those attributes.
If you use EAV, then you will screw yourself nicely. Try doing reports on an EAV schema.
The best option is to use Table Inheritance:
PRODUCT
id pk
type
att1
PRODUCT_X
id pk fk PRODUCT
att2
att3
PRODUCT_Y
id pk fk PRODUCT
att4
att5
For attributes that you don't need to search/sort/analyze, use a blob or XML.
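As a rough illustration of the table-inheritance layout above, here is a sketch using Python's built-in sqlite3 module. The column types and sample values are assumptions; only the table and column names come from the outline. Each specialized row shares its primary key with the base PRODUCT row, so a join recovers the full item.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE product (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,
    att1 TEXT
);
CREATE TABLE product_x (
    id   INTEGER PRIMARY KEY REFERENCES product(id),
    att2 TEXT,
    att3 TEXT
);
CREATE TABLE product_y (
    id   INTEGER PRIMARY KEY REFERENCES product(id),
    att4 TEXT,
    att5 TEXT
);
""")

# Every item gets a base row; the subtype row reuses the same id.
conn.execute("INSERT INTO product VALUES (1, 'x', 'shared-1')")
conn.execute("INSERT INTO product_x VALUES (1, 'a', 'b')")
conn.execute("INSERT INTO product VALUES (2, 'y', 'shared-2')")
conn.execute("INSERT INTO product_y VALUES (2, 'c', 'd')")

# Searching across all items only touches the base table.
all_items = conn.execute("SELECT id, type FROM product ORDER BY id").fetchall()

# Loading one subtype joins the base table to its extension table.
x_items = conn.execute("""
    SELECT p.id, p.att1, x.att2, x.att3
    FROM product AS p
    JOIN product_x AS x ON x.id = p.id
""").fetchall()

print(all_items)
print(x_items)
```

The type column tells the application which extension table to join, which avoids querying every subtype table when looking up a single item.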
I have two alternatives for you:
One table for the base type and supplemental tables for each “class” of specialized types.
In this schema, properties common to all “objects” are stored in one table, so you have a unique record for every object in the game. For special types like books, containers, usable items, etc, you have another table for each unique set of properties or relationships those items need. Every special type will therefore be represented by two records: the base object record and the supplemental record in a particular special type table.
PROS: You can use column-based features of your database like custom domains, checks, and xml processing; you can have simpler triggers on certain types; your queries differ exactly at the point of diverging concerns.
CONS: You need two inserts for many objects.
Use a “kind” enum field and a JSONB-like field for the special type data.
This is kind of like your #1 or #3, except with some database help. Postgres added JSONB, giving you an improvement over the old EAV pattern. Other databases have a similar complex field type. In this strategy you roll your own mini schema that you stash in the JSONB field. The kind field declares what you expect to find in that JSONB field.
PROS: You can extract special type data in your queries; can add check constraints and have a simple schema to deal with; you can benefit from indexing even though your data is heterogenous; your queries and inserts are simple.
CONS: Your data types within JSONB-like fields are pretty limited and you have to roll your own validation.
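The "roll your own validation" part of the second alternative might look like the following sketch. It uses plain Python and JSON text rather than a real JSONB column, and the kinds, field names and helper functions are all hypothetical. The kind value selects a mini-schema that the payload is checked against before it is stored and after it is loaded.

```python
import json

# Hypothetical mini-schemas: required field name -> expected Python type.
KIND_SCHEMAS = {
    "rock": {},
    "book": {"title": str, "pages": int},
    "container": {"capacity": int, "item_ids": list},
}


def validate(kind, data):
    """Raise ValueError if data does not match the mini-schema for kind.

    The check is deliberately shallow: required fields and top-level types.
    """
    if kind not in KIND_SCHEMAS:
        raise ValueError("unknown kind: %s" % kind)
    for field, expected in KIND_SCHEMAS[kind].items():
        if field not in data:
            raise ValueError("%s is missing field %s" % (kind, field))
        if not isinstance(data[field], expected):
            raise ValueError("%s.%s has the wrong type" % (kind, field))


def to_row(item_id, kind, data):
    """Build the (id, kind, payload) tuple that would be stored."""
    validate(kind, data)
    return (item_id, kind, json.dumps(data, sort_keys=True))


def from_row(row):
    """Parse a stored row back into (id, kind, data), re-validating it."""
    item_id, kind, payload = row
    data = json.loads(payload)
    validate(kind, data)
    return item_id, kind, data


row = to_row(1, "book", {"title": "Dune", "pages": 412})
print(from_row(row))
```

With a database that has a native JSON type, part of this check could instead be expressed as a check constraint, as the PROS above suggest.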
Yes, it is a pain to design database formats like this. I'm designing a notification system and ran into the same problem. My notification system is however less complex than yours: the data it holds is at most ids and usernames. My current solution is a mix of 1 and 3: I serialize the data that differs between notifications, and use columns for the 2 usernames (a notification may have 1 or 2). I shy away from method 2 because I hate that design, but it's probably just me.
However, if you can afford it, I would suggest thinking outside the realm of RDBMS. It sounds like a non-relational store (especially a key/value one) may be a better fit for this data, especially if the items differ a lot from one another.
I'm sure this has been asked here a million times before, but in addition to the options which you have discussed in your question, you can look at EAV schema which is very flexible, but which has its own sets of cons.
Another alternative is database systems which are not relational. There are object databases as well as various key/value stores and document databases.
Typically all these things break down to some extent when you need to query against the flexible attributes. This is kind of an intrinsic problem, however. Conceptually, what does it really mean to query things accurately which are unstructured?
First of all, do you actually need the concurrency, scalability and ACID transactions of a real database? Unless you are building a MMO, your game structures will likely fit in memory anyway, so you can search and otherwise manipulate them there directly. In a scenario like this, the "database" is just a store for serialized objects, and you can replace it with the file system.
If you conclude that you do (need a database), then the key is in figuring out what "atomicity" means from the perspective of the data management.
For example, if a game item has a bunch of attributes, but none of these attributes are manipulated individually at the database level (even though they could well be at the application level), then it can be considered as "atomic" from the data management perspective. OTOH, if the item needs to be searched on some of these attributes, then you'll need a good way to index them in the database, which typically means they'll have to be separate fields.
Once you have identified attributes that should be "visible" versus the attributes that should be "invisible" from the database perspective, serialize the latter to BLOBs (or whatever), then forget about them and concentrate on structuring the former.
That's where the fun starts and you'll probably need to use "all of the above" strategy for reasonable results.
BTW, some databases support "deep" indexes that can go into heterogeneous data structures. For example, take a look at Oracle's XMLIndex, though I doubt you'll use Oracle for a game.
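The split between "visible" and "invisible" attributes described in this answer could be sketched as follows, using Python's built-in sqlite3 and json modules. The table layout, column names and sample items are assumptions for illustration. The attributes the database has to search on are real, indexed columns, and everything else is serialized into one opaque value.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE item (
    id        INTEGER PRIMARY KEY,
    owner_id  INTEGER NOT NULL,   -- "visible": searched on, so a real column
    item_type TEXT NOT NULL,      -- "visible"
    state     BLOB NOT NULL       -- "invisible": opaque serialized attributes
)""")
conn.execute("CREATE INDEX item_owner_type ON item (owner_id, item_type)")


def save_item(item_id, owner_id, item_type, private_state):
    """Store searchable fields as columns and the rest as one serialized blob."""
    blob = json.dumps(private_state).encode("utf-8")
    conn.execute(
        "INSERT INTO item VALUES (?, ?, ?, ?)",
        (item_id, owner_id, item_type, blob),
    )


def load_items(owner_id, item_type):
    """Find items via the indexed columns, then deserialize their state."""
    rows = conn.execute(
        "SELECT id, state FROM item "
        "WHERE owner_id = ? AND item_type = ? ORDER BY id",
        (owner_id, item_type),
    )
    return [(item_id, json.loads(state.decode("utf-8"))) for item_id, state in rows]


save_item(1, 42, "book", {"title": "Field Guide", "pages": ["...", "..."]})
save_item(2, 42, "rock", {})
save_item(3, 7, "book", {"title": "Atlas", "pages": []})

print(load_items(42, "book"))
```

If an "invisible" attribute later needs to be searched, it has to be promoted to a column and backfilled, which is the refactoring cost the question warns about for pure blob storage.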
You seem to be trying to solve this for a gaming context, so maybe you could consider a component-based approach.
I have to say that I personally haven't tried this yet, but I've been looking into it for a while and it seems to me something similar could be applied.
The idea would be that all the entities in your game would basically be a bag of components. These components can be Position, Energy or for your inventory case, Collectable, for example. Then, for this Collectable component you can add custom fields such as category, numItems, etc.
When you're going to render the inventory, you can simply query your entity system for items that have the Collectable component.
How can you save this into a DB? You can define the components independently in their own table and then for the entities (each in their own table as well) you would add a "Components" column which would hold an array of IDs referencing these components. These IDs would effectively be like foreign keys, though I'm aware that this is not exactly how you can model things in relational databases, but you get the idea.
Then, when you load the entities and their components at runtime, based on the component being loaded you can set the corresponding flag in their bag of components so that you know which components this entity has, and they'll then become queryable.
Here's an interesting read about component-based entity systems.
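A minimal in-memory sketch of the "bag of components" idea, in Python. The Position and Collectable components follow the names used in the answer, while the Entity class, its methods and the sample entities are assumptions for illustration. Persistence is left out; this only shows how querying by component works once the entities are loaded.

```python
from dataclasses import dataclass, field


@dataclass
class Position:
    x: float
    y: float


@dataclass
class Collectable:
    category: str
    num_items: int = 1


@dataclass
class Entity:
    entity_id: int
    # The bag of components, keyed by component type.
    components: dict = field(default_factory=dict)

    def add(self, component):
        """Attach a component; returns self so calls can be chained."""
        self.components[type(component)] = component
        return self

    def get(self, component_type):
        """Return the component of the given type, or None if absent."""
        return self.components.get(component_type)


def query(entities, component_type):
    """Return the entities that carry the given component type."""
    return [e for e in entities if component_type in e.components]


rock = Entity(1).add(Position(0.0, 0.0)).add(Collectable("mineral"))
chip = Entity(2).add(Collectable("electronics", 3))
tree = Entity(3).add(Position(5.0, 2.0))
entities = [rock, chip, tree]

# Rendering the inventory only needs the entities that are Collectable.
collectable_ids = [e.entity_id for e in query(entities, Collectable)]
print(collectable_ids)
```

In a relational store, each component type would typically get its own table keyed by entity id, which brings the design close to the table-per-type options discussed in the other answers.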

Clarification: can I put all of a user's data in a single entity group by making up an ancestor key?

I want to do several operations on a user's data in a single transaction, but won't need to update multiple users' data in a single transaction. I see from http://code.google.com/appengine/docs/python/datastore/keysandentitygroups.html#Entity_Groups_Ancestors_and_Paths that "A good rule of thumb for entity groups is that [entity groups] should be about the size of a single user's worth of data or smaller," so I think the correct choice is to use a single parent key when building the keys for the other entities related to a user.
Does this seem like a good idea?
Is it easy to code? Something like KeyBuilder.setParent(theKeyOfMyUserEntity)?
1) It is hard to comment without some additional details about the data. There are several things you should be aware of with entity groups; the biggest is that the group will be stored together. That means if you are trying to do many (separate) updates you could face contention, limiting your app's performance.
2) yes it is easy to code. The syntax is pretty close to what you posted.
There are other options for transactions. Check out Nick Johnson's article on distributed transactions. If you are wanting transactions for aggregates you should also check out Brett Slatkin's IO talk on high-throughput data pipelines.
Yes, it seems reasonable to store some user data as child entities of a User entity.
Why do you need to manually create keys? The db.Model() constructor already has a convenient "parent" argument which will automatically put both the parent entity and the child entity in the same entity group.

Resources