Google AppEngine: approaches to store data in DataStore

I'm new to GAE and would appreciate your advice on GAE-app data storage approaches.
A simple example: there are Author and Document entities, and each Author may be the creator of several Documents. So we have two options:
1) Add all Documents as children of the corresponding Author entity (an owned relationship)
2) Add a field to each Document that identifies the Author (an unowned link, or something similar)
What are the pros and cons of each approach?
P.S. I know about entity groups and strong consistency. What else? By the way, what does eventual consistency mean in reality - minutes, hours, ...?
Thanks

The general guideline with most NoSQL stores is to structure your data so that it is optimal for your primary use case, and to denormalise as needed to satisfy other needs.
If your most common operation is reading all documents for an author, then putting documents under the author makes sense. If it's fetching by document, then referencing the author may be more practical.
How the datastore is priced (in terms of the cost of reads vs writes) will also help guide you - the cheapest design is usually also the most effective one. For example, if documents are write-heavy and have many indexes, option 1 could be expensive when you want to update a single document.
W.r.t. eventual consistency, it usually won't be longer than seconds in the worst case, but there are no guarantees. You should not rely on it being good enough in a situation where the result must be accurate (for example, an author editing a document and then previewing it before publishing). Remember that a get by id is a strongly consistent read, so you can generally mitigate this as needed.
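For concreteness, here is a minimal ndb sketch of the two options (assuming the Python runtime; the model and property names are mine, not from the question):

from google.appengine.ext import ndb

class Author(ndb.Model):
    name = ndb.StringProperty()

class Document(ndb.Model):
    title = ndb.StringProperty()
    # Option 2 only: an unowned link back to the author.
    author_key = ndb.KeyProperty(kind=Author)

# Option 1: Document is a child of Author (same entity group).
author_key = Author(name='Alice').put()
Document(parent=author_key, title='Owned doc').put()
# Ancestor queries are strongly consistent.
owned = Document.query(ancestor=author_key).fetch()

# Option 2: Document stores a reference; no shared entity group.
Document(title='Unowned doc', author_key=author_key).put()
# A plain query on an indexed property is eventually consistent.
linked = Document.query(Document.author_key == author_key).fetch()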

Searching for answers I've gone through a number of articles and also encountered this and this post, which are helpful.
So I've formed my opinion and hope it helps someone:
Entity group advantages:
+ Intrinsic strong consistency (see also transactions)
+ Ancestor paths may serve as "namespaces in miniature". This can be used to partition data while still keeping it possible to share it.
Entity group disadvantages, due to the limit on writes per second per group (see here, at the end):
- they may hurt scalability
- they may slow concurrent access
- groups shouldn't be large anyway, since access to a group is serialized
So the use of entity groups is IMHO limited to:
- cases where strong consistency is required. Still, to avoid contention, groups should be kept as small as possible
- single-user data storage
In all other cases I would avoid them.
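As a small illustration of the strong-consistency advantage mentioned above (an ndb sketch; the model and function are invented for the example): writes within one group can be wrapped in a transaction, and ancestor queries see them immediately.

from google.appengine.ext import ndb

class Document(ndb.Model):
    title = ndb.StringProperty()

@ndb.transactional
def add_document(author_key, title):
    # Everything here lives in author_key's entity group, so the
    # commit is atomic, at the cost of the group's 1 write/sec budget.
    doc = Document(parent=author_key, title=title)
    doc.put()
    return doc.key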

Related

App Engine Data Modeling for Comments

I am implementing a Comments section for my current application. The Comments section can be thought of as a series of user posts on a given page. I am wondering which design would be most effective in a non-relational database (Google App Engine).
Design 1:
Group the comments by a groupId and filter on those results
Comment Entity >> [id, groupId, otherData...]
Queries for all comments pertaining to a page would look like:
Select from Comments filter by groupId
Design 2:
Store a single key for all comments within a group, and use a Self Expanding List if the number of entries exceeds 5000.
Comment Entity >> [id, SELid]
Queries would simply perform an id/key lookup.
I understand that indexes can be expensive, but the first design proposal will only index the groupId field and will only require a single write to post a comment (well, more writes if you include the index).
The second design avoids costly indexing, but each posted comment requires a read and a write operation. Furthermore, I'm worried about contention issues. These comments should not experience extremely high throughput, but the second design seems to create a bottleneck.
As I am new to non-relational DB's, I would appreciate any input on these proposed designs and their associated tradeoffs.
In the case of App Engine and the Datastore, the approach you take depends mainly on the consistency model (strong vs eventual) you require for your entities. In Google Cloud Datastore there is the concept of an entity group. The entity group (an entity and its descendants) is a unit with strong consistency, transactionality, and locality, but it also imposes some restrictions (1 write per second).
Considerations
Do you require strong consistent results?
How often will comments be posted per page?
How many comments per page do you expect?
Do you have a use case requiring transactional behaviour?
Since neither of your design options uses an entity group (page -> posts), I suppose you decided not to go that way.
Design 1
- Eventually consistent lookup by groupId
- Easier to maintain (you do not have to deal with the 5000-entity limit)
Design 2
- Strongly consistent lookup by entityGroupId
- Harder to maintain (you HAVE to deal with the 5000-entity limit)
- As mentioned, one entity representing all posts for a page can be a bottleneck (this can be reduced by means of Memcache)
I would probably go with the first approach, even though it can resemble a relational data model.
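A rough ndb sketch of both designs may make the tradeoff clearer (the names are illustrative, and the Self Expanding List overflow handling of Design 2 is omitted):

from google.appengine.ext import ndb

# Design 1: one entity per comment, filtered on an indexed groupId.
class Comment(ndb.Model):
    group_id = ndb.StringProperty()                    # the page/group
    body = ndb.TextProperty()                          # unindexed, cheap to write
    created = ndb.DateTimeProperty(auto_now_add=True)

def comments_for_page(group_id, limit=50):
    # Eventually consistent; the equality filter plus the sort order
    # requires a composite index in index.yaml.
    return (Comment.query(Comment.group_id == group_id)
                   .order(-Comment.created)
                   .fetch(limit))

# Design 2: one entity per page holding all of its comments.
class CommentList(ndb.Model):
    comments = ndb.StringProperty(repeated=True, indexed=False)

@ndb.transactional
def post_comment(page_id, text):
    # Strongly consistent get by key, but every post is a
    # read-modify-write of the same entity - the contention bottleneck
    # the question worries about.
    key = ndb.Key(CommentList, page_id)
    lst = key.get() or CommentList(key=key)
    lst.comments.append(text)
    lst.put()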

GAE ndb best practice to store large one to many relations

I'm searching for the best practice for storing a large number of Comment entities which have a one-to-many relationship to another entity.
I have read a lot about the limitations of the datastore and don't know how to solve this.
I can't store them as structured properties due to the 1 MB entity size limit.
Also, Guido van Rossum answered the question about repeated properties with "if you have more than 100-1000 values" do not use repeated properties.
So repeated properties are not a solution for my comments either.
Final question: what is the best practice for solving this problem? Are ancestors an option?
Edit: In this question about ancestor versus reference properties, Nick Johnson mentioned that "Every entity with the same parent will be in the same entity group, and writes to entity groups are serialized, so using ancestors here will slow things down if you're writing multiple entities concurrently. Since all the entities in a group are 'owned' by the user that forms the root of the group in your instance, though, this shouldn't be a problem - and in fact, what you're doing is actually a recommended design pattern."
What exactly does "writing multiple entities concurrently" mean? When different users comment at the same time on that entity?
It depends on how much you read / write per billing cycle.
You can store references to more than 1000 entities (up to an amount that depends on the key size and on how you reference them) as compressed, unindexed JSON properties. But then take care with referencing and dereferencing that amount: the overhead and the amount of data you transfer on each request will be big. You don't want to be doing operations on 1,000,000 compressed entity keys on the server for just a simple request. If you take this route, try to optimize the approach on the client as smartly as you can.
Alternatively, go for ancestors and/or adjust your logic so it does not need to be strongly consistent (e.g. it doesn't matter if a comment is not shown immediately), and use iterators or query cursors to page through the results.
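A sketch of the compressed-references idea from this answer (ndb; the kind names and the paging helper are made up for illustration):

from google.appengine.ext import ndb

class CommentIndex(ndb.Model):
    # Plain integer ids of Comment entities, stored as one compressed
    # blob; JsonProperty is always unindexed.
    comment_ids = ndb.JsonProperty(compressed=True)

def fetch_page(index_key, offset, count):
    idx = index_key.get()            # get by key: strongly consistent
    ids = idx.comment_ids[offset:offset + count]
    return ndb.get_multi([ndb.Key('Comment', i) for i in ids])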

drawbacks of storing all 'things' in a central table

I am not sure if there is a term to describe this, but I have observed that content management systems store all kinds of data in a single table with their bare-minimum properties, while the metadata is stored in another table in the form of key-value pairs.
For example, everything (blog posts, pages, images, events, etc.) is stored in one table and considered a post.
I understand that this allows for abstraction and easy extensibility.
We are considering designing our new project this way. It is not exactly a CMS, but we plan to keep adding modules to it in stages. Let's say initially there will be only posts and images on which comments can be posted. Later on we might add videos, which will also have the commenting feature.
What are the drawbacks of this approach, and will it work for a requirement like ours?
Thanks
The drawback is that the main table will get zillions of reads (and plenty of writes, too).
This means that there will be lots of lock contentions, heavy reindexing etc.
In order to mitigate this a bit, you may consider splitting the "main table" into a series of not-so-main tables.
Say you have one main table for "Posts" (possibly refined through metadata or subtables for specific types of posts, like Sticky, Announcement, Shoutbox, Private...)
One main table for Images (possibly refined for gifs, jpegs etc.)
One main table for Videos...
If this is a custom application (and not intended to be something that has to be "infinitely tweakable" like a CMS or a Portal framework) I think this kind of split is acceptable, and may provide some better performance (if you expect to have large amounts of data).
Regarding your comments example... first of all, if you again keep all comments in a single gigantic table, you may have problems similar to keeping all types of items in one.
Assuming this is not a problem, you can obviously add a sort of reference key (you can't use normal foreign keys, of course) that links comments to their original item.
This works fine when you go from an item to its comments, a bit less well when you have to go from a comment to the originating item. So the tradeoff is about which kind of operation will be more frequent for your problem.
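A sketch of that reference-key idea (sqlite3 for brevity; the table and column names are invented): item_type plus item_id point at a row in some table, which is exactly why the database cannot enforce it as a normal foreign key.

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE posts  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE images (id INTEGER PRIMARY KEY, path  TEXT);
    CREATE TABLE comments (
        id        INTEGER PRIMARY KEY,
        item_type TEXT NOT NULL,      -- 'post', 'image', ...
        item_id   INTEGER NOT NULL,
        body      TEXT
    );
    CREATE INDEX ix_comments_item ON comments (item_type, item_id);
''')

# Item -> comments is a cheap indexed lookup; comment -> item needs
# application code to pick the right table from item_type.
rows = db.execute("SELECT body FROM comments WHERE item_type=? AND item_id=?",
                  ('post', 1)).fetchall()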
Simplicity and extensibility are indeed often attractive aspects of attribute-value and (as you say) "single table of things" approaches.
There's no 100% right answer here -- depending on your performance/throughput goals and extensibility needs, this approach might work for you too.
In most cases, however, where you know what kinds of data you will store, it's usually in your interest to model distinct entities into their own tables and relate the data accordingly. RDBMSes have been architected and refined over decades to cater to this use case, and simply using tables as generic dumping grounds doesn't typically buy you any distinct advantage, except delaying the inevitable need to model your data properly. Furthermore, when you boil everything into one table, you force users outside your app itself (if you have any, for example report writers) to struggle with your "model within a model", which can frustrate folks when they write queries, etc. And you will sink to your lowest common denominator -- if you want to optimize queries about type X and you have types Y and Z in that same table in droves, they will impact the performance of querying X.
Again, to be clear, there is a distinct benefit to the "all things in one table" name/value style of metadata approach. I have used it myself, and turned against full modeling, for similar reasons. However, my advice is to limit yourself to times when you really need it (i.e., when you need to implement something before you can correctly model the space of things you will need). Most typically, I find myself doing that when I'm prototyping complex systems and need to get something going sooner rather than later.
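To make the contrast concrete, here is what the two shapes look like side by side (sqlite3; illustrative names only):

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    -- Attribute-value style: one generic row per field value.
    CREATE TABLE things      (id INTEGER PRIMARY KEY, kind TEXT);
    CREATE TABLE thing_attrs (thing_id INTEGER, name TEXT, value TEXT);

    -- Modelled style: one typed table per entity.
    CREATE TABLE events (
        id        INTEGER PRIMARY KEY,
        title     TEXT,
        starts_at TEXT    -- a real column: typed, indexable, constrainable
    );
''')

# "All events next week" is one indexed range scan on events.starts_at;
# in the attribute-value shape it needs a self-join on thing_attrs and
# string comparison of the value column.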

When do you have too many tables?

Two of my colleagues and I are building a system to do all sorts of hydrology and related stuff. It has a lot of requirements and a good number of tables.
We are handling all sorts of sampling that is done within this scope (hydrology), and we are trying to figure out a way to do it in a less painful way.
Sometimes we need to pull all that sampling together, and I'm starting to think we are over-complicating our database design.
How or when do you know that you are over-designing a database? Of course we are following a lot of normal-form rules and other good practices, but when is it OK to drop one of those rules, e.g. not normalize something?
What are your opinions on this?
Short Answer
You can't; worry about something else.
Long Answer
This sounds like yet another form of premature optimization. (YAFPO?)
You should design your schema using third normal form (3NF). Once designed, you should populate your tables with data and begin profiling.
If a particular query is deemed too costly, you should look into denormalization on a case-by-case basis.
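A toy example of that case-by-case step (sqlite3; invented names): if profiling shows a join is too hot, copy just the column that query needs.

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id            INTEGER PRIMARY KEY,
        customer_id   INTEGER REFERENCES customers(id),
        -- Denormalised copy, added only after the join proved costly;
        -- it now has to be kept in sync when a customer is renamed.
        customer_name TEXT
    );
''')

# The hot query no longer joins:
db.execute("SELECT id, customer_name FROM orders WHERE id = ?", (1,))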
Technical Answer (for the nitpickers who will inevitably object to: "you can't")
You will reach a limit at some point based on your choice of RDBMS and/or storage engine. Likely ceilings will be memory consumption or open file descriptors.
"When do you have too many tables?"
At the level of logical design, the correct answer is "never".
At the level of physical design (insofar as "having a table" really refers to a concept that pertains to the physical design), the correct answer is "if and when the queries you need to run, given the restrictions of the DBMS you are using, cause performance to be unacceptably low".
We have a system with literally hundreds of tables - it's no big deal; it's just that a lot of different things are stored in the database.
We have a ton of tables in our system as well. What we did was normalize the database to a good point, then create a few views that encompass the most common table-usage needs of our system. Something like that could help you as well.
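For illustration, the normalize-then-add-views approach might look like this (sqlite3, with made-up hydrology-flavoured names):

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE sites    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE samples  (id INTEGER PRIMARY KEY, site_id INTEGER,
                           taken_at TEXT);
    CREATE TABLE readings (sample_id INTEGER, quantity TEXT, value REAL);

    -- One view hides the three-way join that most queries need.
    CREATE VIEW sampling AS
    SELECT s.name, sa.taken_at, r.quantity, r.value
    FROM sites s
    JOIN samples  sa ON sa.site_id  = s.id
    JOIN readings r  ON r.sample_id = sa.id;
''')

rows = db.execute("SELECT * FROM sampling WHERE quantity = 'pH'").fetchall()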

schema design

Let's say you are a GM DBA and you have to design around the GM models.
Is it better to do this?
table_model
type {cadillac, saturn, chevrolet}
Or this?
table_cadillac_model
table_saturn_model
table_chevrolet_model
Let's say that the business lines have the same columns for a model and that there are over a million records for each subtype.
EDIT:
there is a lot of CRUD
there are a lot of very processor intensive reports
in either schema, there is a model_detail table that contains 3-5 records for each model and the details for each model differ (you can't add a cadillac detail to a saturn model)
the dev team doesn't have any issues with db complexity
I'm not really sure that this is a normalization question. Even though the structures are the same, they might be thought of as different entities.
EDIT:
Reasons for partitioning the structure into multiple tables
- business lines may have different business rules regarding parts
- addModelDetail() could be different for each business line (even though the data format is the same)
- high add/update activity - better performance with a partitioned structure instead of a single structure (I'm guessing and not sure here)?
I think this is a variation of the EAV problem. When posed as an EAV design, the single-table structure generally gets voted a bad idea. When posed in this manner, the single-table structure generally gets voted a good idea. Interesting...
I think the most interesting answer is having two different structures - one for CRUD and one for reporting. I'll try a concatenated/flattened view for reporting and multiple tables for CRUD and see how that works.
Definitely the former example. Do you want to be adding tables to your database whenever you add a new model to your product range?
On data with a lot of writes (e.g. an OLTP application), it is better to have more, narrower tables (i.e. tables with fewer fields). There will be less lock contention because you're only writing small amounts of data into different tables.
So, based on the criteria you have described, the table structure I would have is:
Vehicle
- VehicleType
- Other common fields
CadillacVehicle
- Fields specific to a Caddy
SaturnVehicle
- Fields specific to a Saturn
For reporting, I'd have an entirely different database on an entirely different server that does not have the normalized structure (e.g. it just has CadillacVehicle and SaturnVehicle tables, with all of the fields from the Vehicle table duplicated into them).
With proper indexes, even the OLTP database could be performant in your SELECTs, regardless of the fact that there are tens of millions of rows. However, since you mentioned that there are processor-intensive reports, that's why I would have a completely separate reporting database.
One last comment, about the business rules: the data store does not care about the business rules. If the business rules differ between models, that really shouldn't factor into your design decisions about the database schema (other than to help dictate which fields are nullable and what their data types are).
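In DDL terms, that supertype/subtype split might look like this (sqlite3; the field names are placeholders):

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE vehicle (
        id           INTEGER PRIMARY KEY,
        vehicle_type TEXT NOT NULL      -- 'cadillac', 'saturn', ...
        -- other common fields
    );
    -- One narrow table per brand for brand-specific fields, sharing
    -- the supertype's key.
    CREATE TABLE cadillac_vehicle (
        vehicle_id INTEGER PRIMARY KEY REFERENCES vehicle(id)
        -- fields specific to a Caddy
    );
    CREATE TABLE saturn_vehicle (
        vehicle_id INTEGER PRIMARY KEY REFERENCES vehicle(id)
        -- fields specific to a Saturn
    );
''')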
Use the former. Setting up separate tables for the specialisations will complicate your code and doesn't bring any advantages that can't be achieved in other ways. It will also massively simplify your reports.
If the tables really do have the same columns, then the former is the best way to do it. Even if they had different columns, you'd probably still want to have the common columns be in their own table, and store a type designator.
You could try having two separate databases.
One is an OLTP (OnLine Transaction Processing) system which should be highly normalized so that the data model is highly correct. Report performance must not be an issue, and you would deal with non-reporting query performance with indexes/denormalization etc. on a case-by-case basis. The data model should try to match up very closely with the conceptual model.
The other is a Reports system which should pull data from the OLTP system periodically, and massage and rearrange that data in a way that makes report-generation easier and more performant. The data model should not try to match up too closely with the conceptual model. You should be able to regenerate all the data in the reporting database at any time from the data currently in the main database.
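A sketch of the "regenerate at any time" idea (sqlite3 stands in for the two servers; the names are invented): the reporting shape is disposable and re-derived wholesale from the OLTP tables.

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE vehicle (id INTEGER PRIMARY KEY, vehicle_type TEXT);
    CREATE TABLE sale    (id INTEGER PRIMARY KEY, vehicle_id INTEGER,
                          price REAL);
    CREATE TABLE report_sales (vehicle_type TEXT, price REAL);
''')

def rebuild_reports(db):
    # Throw away and re-derive; no state in the reporting table is precious.
    db.execute("DELETE FROM report_sales")
    db.execute('''
        INSERT INTO report_sales (vehicle_type, price)
        SELECT v.vehicle_type, s.price
        FROM sale s JOIN vehicle v ON v.id = s.vehicle_id
    ''')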
I would say the first way looks better.
Are there reasons you would want to do it the second way?
The first way follows normalization better and is closer to how most relational database schemas are developed.
The second way seems to be harder to maintain.
Unless there is a really good reason for doing it the second way I would go with the first method.
Given the description that you have given us, the answer is either.
In other words you haven't given us enough information to give a decent answer. Please describe what kind of queries you expect to perform on the data.
[Having said that, I think the answer is going to be the first one ;-)
As I imagine, even though they are different models, the data for each model is probably going to be quite similar.
But this is a complete guess at the moment.]
Edit:
Given your updated edit, I'd say the first one, definitely. As they all have the same data, they should go into the same table.
Another thing to consider in defining "better"--will end users be querying this data directly? Highly normalized data is difficult for end-users to work with. Of course this can be overcome with views but it's still something to think about as you're finalizing your design.
I do agree with the other two folks who answered: which form is "better" is subjective and dependent on what you're hoping to achieve. If you're hoping to achieve very quick queries that's one thing. If you're hoping to achieve high programmer productivity--that's a different goal again and possibly conflicts with quick queries.
The choice depends on the required performance.
The best database is a normalized database, but there can be performance issues in a normalized database, and then you have to denormalize it.
The principle "normalize first, denormalize for performance" works well.
It depends on the data model and the use case. If you ever need to report on a query that wants data out of all the "models", then the former is preferable, because otherwise (with the latter) you'd have to change the query (to include the new table) every time you added a new model.
Oh and by "former" we mean this option:
table_model
* type {cadillac, saturn, chevrolet}
#mson has asked the question "What do you do when a question is not satisfactorily answered on SO?", which is a direct reference to the existing answers to this question.
I contributed the following answer to that discussion, primarily critiquing the way the question was asked.
Quote (verbatim):
I looked at the original question yesterday, and decided not to contribute an answer.
One problem was the use of the term 'model' as in 'GM models' - which cited 'Chevrolet, Saturn, Cadillac' as 'models'. To my understanding, these are not models at all; they are 'brands', though there might also be an industry-insider term for them that I'm not familiar with, such as 'division'. A model would be a 'Saturn Vue' or 'Chevrolet Impala' or 'Cadillac Escalade'. Indeed, there could well be models at a more detailed level than that - different variants of the Saturn Vue, for example.
So, I didn't think that the starting point was well framed. I didn't critique it; it wasn't quite compelling enough, and there were answers coming in, so I let other people try it.
The next problem is that it is not clear what your DBMS is going to be storing as data. If you're storing a million records per 'model' ('brand'), then what sorts of data are you dealing with? Lurking in the background is a different scenario - the real scenario - and your question has used an analogy that failed to be sufficiently realistic. That means that the 'it depends' parts of the answer are far more voluminous than the 'this is how to do it' ones. There is just woefully too little background information on the data to be modelled to allow us to guess what might be best.
Ultimately, it will depend on what uses people have for the data. If the information is going to go flying off in all different directions (different data structures in different brands; different data structures at the car model levels; different structures for the different dealerships - the Chevrolet dealers are handled differently from the Saturn dealers and the Cadillac dealers), then the integrated structure provides limited benefit. If everything is the same all the way down, then the integrated structure provides a lot of benefit.
Are there legal reasons (or benefits) to segregating the data? To what extent are the different brands separate legal entities where shared records could be a liability? Are there privacy issues, such that it will be easier to control access to the data if the data for the separate brands is stored separately?
Without a lot more detail about the scenario being modelled, no-one can give a reliable general answer - at least, not more than the top-voted one already gives (or doesn't give).
Data modelling is not easy.
Data modelling without sufficient information is impossible to do reliably.
I have copied the material here since it is more directly relevant. I do think that to answer this question satisfactorily, a lot more context should be given. And it is possible that there needs to be enough extra context to make SO the wrong place to ask it. SO has its limitations, and one of those is that it cannot deal with questions which require long explanations.
From the SO FAQs page:
What kind of questions can I ask here?
Programming questions, of course! As long as your question is:
detailed and specific
written clearly and simply
of interest to at least one other programmer somewhere
...
What kind of questions should I not ask here?
Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!
This question is, IMO, close to the 'require extended discussion' limit.
