Using GAE datastore relationship vs flat kind design - google-app-engine

We have a requirement to implement in GAE datastore. There are set of documents (in millions) and each document has a owner, some comments and revisions associated with it.
If the owner of document is leaving the organization, then we need to change the ownership of the document to the person who did last revision. Also we need to retain the revisions and comments for each document. This ownership change is to be implemented by a job which will process each and every document one by one.
Is it the right approach to have Parent-Child relationships between the entities Document,Comment and Revision like Document is the parent with Comment and Revision as its child? OR in typical NoSql way we need to flatten the table and make a single entity?
The typical NoSQL implementation needs only insert and read but no updates. Is this the way the Google datastore works? Please clarify.
Our research says that we can have relationship but that will look more like RDBMS.

To choose proper schema design, you should clarify how you plan to work with data and keep in mind datastore limitations. In brief:
NoSql approach (single entity)
one update per second per entity group
you read and write the whole entity every time (except for projection queries)
Parent-child relations (ancestor relationships)
cannot be changed in future
form single entity-group so you achieve strong consistency while reading the query
one update per second per entity group! (So if you have a case with lots of live comments this wont work for you)
RDBMS approach (tables and relations)
datastore has no joins on multiple tables (so only split data in tables where you are not intending to query together)
eventually consistent reads

Related

AWS DynamoDB and Storing User Data with Transactional Data in one table

I am looking at ways to store user data alongside transactional data like orders and invoices. Normally, I would use a relational database like postgresql, but I wanted to know if it would be a good idea to store the user data along with their transactional data in one noSQL table like DynamoDB?
I would assume if you did that you would structure your data to either use objects or arrays to store the orders or invoices but I'm not sure if that is the best was to go about it.
EDIT
So after doing some more research and trying understand how to fit everything into a single table design I found this article in the AWS documentation. I decided to organise my data into collections using a combinaton of the primary key and the sort key. The sort key is used to determine collections (i.e., orders, customer-data, etc). This solution is perfect for my use case because I can keep all the user data (including transactions like orders) in one dynamodb table.
In short, don't do that. DynamoDB is a great tool, but you need to understand it first. It's not just a no-sql, it's also a distributed one. It gives great performance, scalability and pricing. But modeling is trickier. You can not build requests as you please, those has to be taken into consideration when you design your model. Read about queries vs scans and global vs local indexes. When you get that you might try reading about Single Table Design. It should give you an idea about the limitations of the DynamoDB.

Relational model of database management system

Instead of having separate relation what if the problem of having one big relation that will store all the data related to a system?
A single database may not fulfill all the needs of a complete system. For example : Lets take a ecommerce website. Here you have following important things :
Customer Information
Order information
Inventory Information etc.
Now if you see these three are not related to each other other than that they are all a part of ecommerce system. You don't need to store the user information along with inventory information in a single table as they are not related.
After this part is answered, Now, you can have doubts so as to why we need normalization. The objectives of normalization as mentioned here are:
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from a RDBMS background and trying to wrap my head around ElasticSearch data storage patterns...
Currently in SQL Server, we have a star schema data mart, RecordData. Rows are organized by user ID, geographic location that pertains to the rest of the searchable record, title and description (which are free text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is, how would you organize multiple web applications on the ES server? You wouldn't want to have all those user indices all over the place?
Is it so bad to have one index per application, and type per SQL Server table?
Since in SQL Server, we have other tables for user configuration, based on user ID's, I take it that I could then create new ES types in user indices for configuration. Is this a recommended pattern? I would rather not have two data base systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data where the totality of the information resides in each document unlike with a star schema. If you can live with denormalized, that is fine but I assume that since you already have star schema, denormalized data is not an option because you don't want to go and update millions of documents each time the location name change for example(if i understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think of how to put star schema like data in a system like Elasticsearch. There are a few options in the documentation, the main ones i focused were
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html . In nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may make it not optimal becasue the document will be huge and again, a change in the location name will require to update the entire document. So this is better but still not optimal.
Parent - Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html . In this case the location and the User records would be kepts in separate indices similarly to a relational database. This seems to be the right modeling for what we need. The only major issue with this option is the fact that Kibana 4 does not provide ways to manipulate/aggregate documents based on parent/child relationship as of this writing. So if you main driver for using Elasticsearch is Kibana(this was mine), that kind of eliminates the option. If you want to benefit from the elasticsearch speed as an engine this seems to be the desired option for your use case.
In my opinion once you got right the data modeling all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize that is by having a separate cluster of 3 elasticsearch nodes behind a Load Balancer(all of that is hosted on a cloud) and then have all your Web Applications connect to that cluster using the Elasticsearch API.
Hope that helps.

When to use entity groups in GAE's Datastore

Following up on my earlier question regarding GAE Datastore entity hierarchies, I'm still confused about when to use entity groups.
Take this simple example:
Every Company has one or more Employee entities
An Employee cannot be moved to another Company, and users that deal with one Company can never see the Employees of another Company
This looks like a case where I could make Employee a child entity of Company, but what are the practical consequences? Does this improve scalability, hurt scalability, or have no impact?
What are other advantages/disadvantages of using or not using an entity hierarchy?
(Entity groups enable transactions, but assume for this example that I do not need transactions).
If you don't need transactions, don't use entity groups. They slow things down in some cases, and never speed anything up. Their only benefit is that they enable transactions.
As far as I can tell, the best place to use entity groups is on data that isn't likely to be accessed by many users at the same time, and that you'll frequently want to include in a transaction. So, if you stored the contents of a shopping cart, which probably only the owner of that cart will deal with frequently, those contents might be good for an entity group - it'll be nice to be able to use a transaction for that data when you're adding or updating an entity, and you're not locking anyone else out of anything when you do so.
Nick stated clearly that you should not make the groups larger than necessary, the Best practices for writing scalable applications has some discussion one why.
Use entity groups when you need transactions. In the example you gave, a ReferenceProperty on employee will achieve a similar result.
Aside from transactions, entity groups can be helpful because key-fetches and queries can be keyed off of a parent entity. However, you might want to consider multitenancy for these types of use-cases.
Ultimately large entity groups might hurt scalability, entities within an entity group are stored in the same tablet. The more stuff you cram into one entity group, the more you reduce the amount of work that can be done in parallel -- it needs done serially instead.

Referential Integrity and HBase

One of the first sample schemas you read about in the HBase FAQ is the Student-Course example for a many-many relationship. The schema has a Courses column in the Student table and a Students column in the Course table.
But I don't understand how in HBase you guarantee integrity between these two objects. If something were to crash between updating one table and before another, we'd have a problem.
I see there is a transaction facility, but what is the cost of using this on what might be every Put? Or are there other ways to think about the problem?
We hit the same issue.
I have developed a commercial plugin for hbase that handles transactions and the relationship issues that you mention. Specifically, we utilize DataNucleus for a JDO Compliant environment. Our plugin is listed on this page http://www.datanucleus.org/products/accessplatform_3_0/datastores.html or you can go directly to our small blog http://www.inciteretail.com/?page_id=236.
We utilize JTA for our transaction service. So in your case, we would handle the relationship issue and also any inserts for index tables (Hard to have an app without index lookup and sorting!).
Without an additional log you won't be able to guarantee integrity between these two objects. HBase only has atomic updates at the row level. You could probably use that property though to create a Tx log that could recover after a failure.
If you have to perform two INSERTs as a single unit of work, that means you have to use a transaction manager to preserve ACID properties. There's no other way to think about the problem that I know of.
The cost is less of a concern that referential integrity. Code it properly and don't worry about performance. Your code will be the first place to look for performance problems, not the transaction manager.
Logical relational models use two main varieties of relationships: one-to-many and
many-to-many. Relational databases model the former directly as foreign keys (whether
explicitly enforced by the database as constraints, or implicitly referenced by your
application as join columns in queries) and the latter as junction tables (additional
tables where each row represents one instance of a relationship between the two main
tables). There is no direct mapping of these in HBase, and often it comes down to de-
normalizing the data.
The first thing to note is that HBase, not having any built-in joins or constraints,
has little use for explicit relationships. You can just as easily place data that is one-to-
many in nature into HBase tables:. But
this is only a relationship in that some parts of the row in the former table happen to
correspond to parts of rowkeys in the latter table. HBase knows nothing of this rela-
tionship, so it’s up to your application to do things with it (if anything).

Resources