How does data denormalization work with the Microservice Pattern?

I just read an article on Microservices and PaaS Architecture. In that article, about a third of the way down, the author states (under Denormalize like Crazy):
Refactor database schemas, and de-normalize everything, to allow complete separation and partitioning of data. That is, do not use underlying tables that serve multiple microservices. There should be no sharing of underlying tables that span multiple microservices, and no sharing of data. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or a message service interface).
While this sounds great in theory, in practicality it has some serious hurdles to overcome. The biggest of which is that, often, databases are tightly coupled and every table has some foreign key relationship with at least one other table. Because of this it could be impossible to partition a database into n sub-databases controlled by n microservices.
So I ask: Given a database that consists entirely of related tables, how does one denormalize this into smaller fragments (groups of tables) so that the fragments can be controlled by separate microservices?
For instance, given the following (rather small, but exemplar) database:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
user_id
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
product_id
order_id
quantity_ordered
Don't spend too much time critiquing my design, I did this on the fly. The point is that, to me, it makes logical sense to split this database into 3 microservices:
UserService - for CRUDding users in the system; should ultimately manage the [users] table; and
ProductService - for CRUDding products in the system; should ultimately manage the [products] table; and
OrderService - for CRUDding orders in the system; should ultimately manage the [orders] and [products_x_orders] tables
However all of these tables have foreign key relationships with each other. If we denormalize them and treat them as monoliths, they lose all their semantic meaning:
[users] table
=============
user_id
user_first_name
user_last_name
user_email
[products] table
================
product_id
product_name
product_description
product_unit_price
[orders] table
==============
order_id
order_datetime
[products_x_orders] table (for line items in the order)
=======================================================
products_x_orders_id
quantity_ordered
Now there's no way to know who ordered what, in which quantity, or when.
So is this article typical academic hullabaloo, or is there a real world practicality to this denormalization approach, and if so, what does it look like (bonus points for using my example in the answer)?

This is subjective but the following solution worked for me, my team, and our DB team.
At the application layer, microservices are decomposed by semantic function.
e.g. a Contact service might CRUD contacts (metadata about contacts: names, phone numbers, contact info, etc.)
e.g. a User service might CRUD users with login credentials, authorization roles, etc.
e.g. a Payment service might CRUD payments and work under the hood with a 3rd party PCI compliant service like Stripe, etc.
At the DB layer, the tables can be organized however the devs/DBAs/devops people want them organized.
The problem is with cascading and service boundaries: Payments might need a User to know who is making a payment. Instead of modeling your services like this:
interface PaymentService {
    PaymentInfo makePayment(User user, Payment payment);
}
Model it like so:
interface PaymentService {
    PaymentInfo makePayment(Long userId, Payment payment);
}
This way, entities that belong to other microservices are referenced inside a particular service only by ID, not by object reference. This allows DB tables to have foreign keys all over the place, but at the app layer "foreign" entities (that is, entities living in other services) are available via ID. This stops object cascading from growing out of control and cleanly delineates service boundaries.
The problem it does incur is that it requires more network calls. For instance, if I gave each Payment entity a User reference, I could get the user for a particular payment with a single call:
User user = paymentService.getUserForPayment(payment);
But using what I'm suggesting here, you'll need two calls:
Long userId = paymentService.getPayment(payment).getUserId();
User user = userService.getUserById(userId);
This may be a deal breaker. But if you're smart, implement caching, and build well-engineered microservices that respond in 50-100 ms per call, I have no doubt that these extra network calls can be crafted so that they add no noticeable latency to the application.
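For illustration, here is a minimal sketch of that caching idea, reusing the UserService and User types from the example above; the wrapper class is hypothetical, and a real implementation would add expiry/invalidation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical wrapper around the remote UserService from the example above.
// Entries never expire here; a production cache would need a TTL or explicit
// invalidation when users change.
class CachingUserClient {
    private final UserService userService;                  // remote client
    private final Map<Long, User> cache = new ConcurrentHashMap<>();

    CachingUserClient(UserService userService) {
        this.userService = userService;
    }

    User getUserById(Long userId) {
        // Only goes over the network on a cache miss
        return cache.computeIfAbsent(userId, userService::getUserById);
    }
}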

This is indeed one of the key problems in microservices, and one that is quite conveniently omitted in most articles. Fortunately there are solutions. As a basis for discussion, let's use the tables you provided in the question.
In a monolith, this schema is just a few tables connected by joins.
To refactor this into microservices we can use a few strategies:
API Join
In this strategy, foreign keys between microservices are broken and each microservice exposes an endpoint that mimics the key. For example, the Product microservice will expose a findProductById endpoint, and the Order microservice can call this endpoint instead of performing a join.
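As a rough sketch (interface names are illustrative, following the example schema in the question), the two sides of such an API join might look like this:

// Product microservice: exposes the lookup that used to be satisfied by the
// products/products_x_orders foreign key join (e.g. behind GET /products/{id}).
interface ProductEndpoint {
    Product findProductById(Long productId);
}

// Order microservice: a client for that endpoint, called once per line item
// instead of joining on product_id.
interface ProductClient {
    Product findProductById(Long productId);
}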
It has an obvious downside. It is slower.
Read-only views
In the second solution you create a copy of the table in the other service's database. The copy is read-only. Each microservice can perform mutable operations on its own read/write tables; on the read-only tables copied from other databases it can (obviously) only read.
High performance read
High-performance reads can be achieved by introducing a store such as Redis or memcached on top of the read-only-view solution. Both sides of the join are copied into a flat structure optimized for reading, and a completely new stateless microservice can be introduced just for reading from this storage. While it seems like a lot of hassle, it is worth noting that it will have higher read performance than a monolithic solution on top of a relational database.
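A sketch of what such a flattened, read-optimized record could look like for the question's schema (field names copied from the example tables; the key format is illustrative):

// One denormalized value per order line, suitable for storing in Redis or
// memcached under a key such as "order-line:{products_x_orders_id}".
class OrderLineReadModel {
    long productsXOrdersId;
    long orderId;
    String orderDatetime;                    // from [orders]
    String userFirstName;                    // from [users]
    String userLastName;
    String productName;                      // from [products]
    java.math.BigDecimal productUnitPrice;
    int quantityOrdered;                     // from [products_x_orders]
}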
There are a few possible solutions. The ones that are simplest to implement have the lowest performance; the high-performance solutions will take a few weeks to implement.

I realise this is possibly not a good answer but what the heck. Your question was:
Given a database that consists entirely of related tables, how does one denormalize this into smaller fragments (groups of tables)
WRT the database design I'd say "you can't without removing foreign keys".
That is, people pushing Microservices with the strict no shared DB rule are asking database designers to give up foreign keys (and they are doing that implicitly or explicitly). When they don't explicitly state the loss of FK's it makes you wonder if they actually know and recognise the value of foreign keys (because it is frequently not mentioned at all).
I have seen big systems broken into groups of tables. In these cases there can be either A) no FK's allowed between the groups or B) one special group that holds "core" tables that can be referenced by FK's to tables in other groups.
... but in these systems "groups of tables" is often 50+ tables so not small enough for strict compliance with microservices.
To me the other related issue to consider with the microservice approach to splitting the DB is the impact this has on reporting: the question of how all the data is brought together for reporting and/or loading into a data warehouse.
Somewhat related is also the tendency to ignore built-in DB replication features in favor of messaging (and how DB-based replication of the core tables / DDD shared kernel impacts the design).
EDIT: (the cost of JOIN via REST calls)
When we split up the DB as suggested by microservices and remove FK's we not only lose the enforced declarative business rule (of the FK) but we also lose the ability for the DB to perform the join(s) across those boundaries.
In OLTP FK values are generally not "UX Friendly" and we often want to join on them.
In the example, if we fetch the last 100 orders we probably don't want to show the customer id values in the UX. Instead we need to make a second call to the customer service to get their names. However, if we also want the order lines, we need to make another call to the products service to show the product name, SKU etc. rather than the product id.
In general we can find that when we break up the DB design in this way we need to do a lot of "JOIN via REST" calls. So what is the relative cost of doing this?
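To make that cost concrete, here is a sketch of such a "JOIN via REST" against the example schema; the client interfaces and getters are assumptions, not part of any particular framework:

import java.util.List;

// Each nested call below replaces what a single SQL join across
// users/orders/products_x_orders/products would have returned in one round trip.
class OrderReport {
    OrderClient orderClient;        // assumed REST clients for the three services
    UserClient userClient;
    ProductClient productClient;

    void printLastOrders() {
        List<Order> orders = orderClient.getLastOrders(100);                    // 1 call
        for (Order order : orders) {
            User user = userClient.getUserById(order.getUserId());              // +1 call per order
            for (OrderLine line : order.getLines()) {
                Product product =
                        productClient.findProductById(line.getProductId());     // +1 call per line
                System.out.printf("%s %s ordered %d x %s%n",
                        user.getUserFirstName(), user.getUserLastName(),
                        line.getQuantityOrdered(), product.getProductName());
            }
        }
    }
}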
Actual Story: Example costs for 'JOIN via REST' vs DB Joins
There are 4 microservices and they involve a lot of "JOIN via REST". A benchmark load for these 4 services comes to ~15 minutes. Those 4 microservices converted into 1 service with 4 modules against a shared DB (that allows joins) executes the same load in ~20 seconds.
This unfortunately is not a direct apples to apples comparison for DB joins vs "JOIN via REST" as in this case we also changed from a NoSQL DB to Postgres.
Is it a surprise that "JOIN via REST" performs relatively poorly when compared to a DB that has a cost-based optimiser etc.?
To some extent when we break up the DB like this we are also walking away from the 'cost based optimiser' and all that it does with query execution planning for us, in favor of writing our own join logic (we are somewhat writing our own relatively unsophisticated query execution plan).

I would see each microservice as an object, and just as with any ORM you use those objects to pull the data and then create joins within your code and query collections; microservices should be handled in a similar manner. The only difference here is that each microservice represents one object at a time rather than a complete object tree. An API layer should consume these services and model the data in whatever way it has to be presented or stored.
Making several calls back to services for each transaction will not have much of an impact, as each service runs in a separate container and all these calls can be executed in parallel.
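A sketch of that parallelism using standard CompletableFuture, assuming the kind of client interfaces used earlier in this thread (names are illustrative):

import java.util.concurrent.CompletableFuture;

class OrderViewAssembler {
    UserClient userClient;            // assumed REST clients
    ProductClient productClient;

    OrderView assemble(Long userId, Long productId) {
        CompletableFuture<User> user =
                CompletableFuture.supplyAsync(() -> userClient.getUserById(userId));
        CompletableFuture<Product> product =
                CompletableFuture.supplyAsync(() -> productClient.findProductById(productId));
        // Both remote calls are in flight at the same time, so the latency is
        // roughly the slower of the two rather than their sum.
        return new OrderView(user.join(), product.join());
    }
}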
@ccit-spence, I liked the approach of intersection services, but how can it be designed and consumed by other services? I believe it will create a kind of dependency for other services.
Any comments please?

Related

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know about some best practices for keeping a single data source responsive or, in case you have a copy of the data on every server, how best to synchronise those databases to keep them updated?
There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMS, and normally relies on features provided by the RDBMS. Replication has a latency, which means that although servers can handle load independently, they may not necessarily be reading the latest data. This may or may not be a problem for a particular system. When replication is bidirectional then simultaneous changes on two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (i.e. audit log => append both) or difficult (i.e. hotel room booking - cancel one? select alternative hotel?). You also have to consider what to do in the event that the replication network link is down (i.e. do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later). This is all dependent on the exact type of data you have. One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading, and send all write operations to the source database. This is always a trade-off between Availability and Consistency (see CAP Theorem). The advantage of RDBMS and replication is that you can easily query your entire dataset in complex ways and have greater opportunity to remove duplication by using relational links to data items.
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or as Martin Fowler calls them, 'Aggregate-Oriented Databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also harder to manage the data schema as it lives independently in multiple databases, which is why most NoSQL databases are schema-less.
Database-per-service
In the microservice approach, it is advised that each microservice should have its own dedicated database that is not allowed to be accessed by any other microservice (of a different type). Hence, a microservice that manages customer contact information stores the data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique ids, or URIs (especially if the microservices are RESTful), etc. The downside, again, is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice API, not directly to the databases).
Polyglot storage
So many of my projects in the past have involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model, much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, html templates in a NoSQL database. The trend with micro-services is to move towards a model where different parts of your dataset are placed in storage providers that are chosen according to the need.
If you are thinking of keeping a different copy of the database for each microservice and you want to achieve eventual consistency, then you can use Kafka Connect. Briefly, Kafka Connect will watch your DBs and, whenever there are any changes, it will read the log file and add the logged events as messages to a queue; the other databases that subscribe to this queue can then execute the same statement on their side.
Kafka Connect isn't the only framework; you can search for and find other frameworks or applications for the same implementation.
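As an illustration of the subscribing side only, here is a plain Kafka consumer sketch; the topic name, group id and applyChange handler are made up for the example, and the actual change events would come from whatever connector you configure:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Each microservice keeps its local copy up to date by applying the change
// events published for the source table.
class UserChangeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "orders-service-user-replica");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("users-changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    applyChange(record.value());   // run the equivalent insert/update locally
                }
            }
        }
    }

    static void applyChange(String changeEvent) {
        // parse the event and execute the corresponding statement against the local copy
    }
}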

Splitting monolith to microservices database issues

I am splitting a monolith application into microservices and I was able to split it into three microservices; for easier explanation suppose these are:
Users (CRUD)
Messages (CRUD)
Other things (CRUD)
All of these are distinct bounded contexts and I'm using a database table per microservice. So in the DB I have:
USERS table
id
surname
lastname
...
OTHER_THINGS table
id
col1
col2
...
MESSAGES table
id
title
created_time
USER_ID
OTHER_THING_ID
...
Now my web page needs searching/filtering of messages by all of the specified columns of all of these tables. For example:
Web page user can enter:
surname of USER,
col2 of OTHER_THINGS
title of messages
And I should return only filtered rows.
With the monolith I used simple database JOINs, but in this situation I can't find the best option. Can you suggest possible options and which ones are better?
"suppose I have Orders and Customers tables, where ORDER has FK to CUSTOMER. For me these seems to be in different microservices. "
Still nope to the foreign key. The Orders microservice has a data store with its own Customers table. The Customer Update microservice has a data store with its own Customers table. The Customer Orders search would be a feature of the Orders microservice and so will search its data store not the Customer Update data store.
The whole point about microservices is the absence of dependencies. They are entire, discrete systems in their own right. This makes them easy to build and easy to deploy. The snag is the issue you are butting up against: data management. Most enterprises aspire to a single source of truth regarding their data. Which usually means a central database, which imposes constraints on applications because everything has to share the same data model and changes to common entities such as Customer cause major upheaval.
Microservices appear to offer a solution to this by spinning out subsets of functionality which own their own data model. This inevitably means data integrity across the enterprise is looser, because it is handled asynchronously. There is no longer a single source of truth.
So the Customer Update microservice will publish updates about Customers as messages which the Orders microservice will consume and apply. Likewise, if the Orders microservice can create new Customers then it will publish a similar stream of messages which the Customer Update microservice will consume and apply. What happens if the two microservices create records for the same new Customer in the same window between refreshes? Well, yes, a good question.
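A sketch of that publishing side, using Kafka purely as an example broker (the topic name and payload shape are illustrative; any messaging system with similar delivery guarantees would do):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// The Customer Update microservice publishes an event after every change;
// the Orders microservice consumes the topic and applies the update locally.
class CustomerUpdatePublisher {
    private final KafkaProducer<String, String> producer;

    CustomerUpdatePublisher() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    void publishCustomerUpdated(long customerId, String customerJson) {
        // Keyed by customer id so all updates for one customer stay in order
        producer.send(new ProducerRecord<>("customer-updates",
                Long.toString(customerId), customerJson));
    }
}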
The upshot is, the microservice approach will work in some scenarios and be absolutely disastrous in others. Certainly most enterprise applications will remain largely monolithic, not just through inertia but because the benefits of centrally shared data outweigh the agility of microservices in many instances.

Database table creation for large data

I am making a client management application in which I am storing data for employees, admins and companies. In the future the database will have hundreds of companies registered. I am trying to choose the best approach to database design.
I can think of 2 approaches:
Making all of the app's tables separately for each company
Storing all data in one shared app database
Can you suggest the best way to do that?
Please note that all 3 tables are linked by ids, there will be hundreds of companies, each company will have many admins, and each admin will have hundreds of employees. What would be the best approach with respect to security and query performance?
With the partial information you provided, it looks like 3 normalized tables are what you need, plus auxiliary data like lookups and other stuff.
But when you design a database you need to consider many more points, like security, visibility, client access methods, etc.
For example, if you want to ensure isolation and not allow users any visibility into others' data, you could dynamically create a schema per company and create users and access rights for each schema dynamically. Then you'll need to support this in the DAL, which will in fact be quite fat.
Another approach for the DAL could be exposing views that always return subsets for one company.
A big reason that I would suggest going for the normalized approach is that maintenance will be much easier this way.
From a SQL point of view I don't see any performance advantage in having many tables versus just 3; the efficiency of the indexes and a smart DAL will make the difference.
The performance of a query doesn't depend much on the size of the table; it depends more on the indexes you have on that table. So you need to add clustered and non-clustered indexes as per your requirements, and I can guarantee that up to 10 GB of data you will not face any problems.
This is a classic problem shared by most web business services: for discussions of the factors involved, Google "multi-tenant architecture."
You almost certainly want to put all companies into a common set of tables: each data table should reference the company key, and all queries should join on that key, among their other criteria. This allows the best overall performance, and saves you the potential maintenance nightmare of duplicating views, stored procedures and so on hundreds of times, or of having to apply the same structural changes to hundreds of tables should you wish to add a field or a table.
To help assure that you don't inadvertently intermingle data from different customers, it might be useful to do all data access through a validated set of stored procedures (all of which take the company ID as a parameter).
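As a sketch of that rule at the data-access layer, here is a plain JDBC version (table and column names are assumed for illustration; the same idea applies if the query is wrapped in a stored procedure that takes the company ID):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class EmployeeDao {
    private final Connection connection;

    EmployeeDao(Connection connection) {
        this.connection = connection;
    }

    List<String> findEmployeeNames(long companyId) throws SQLException {
        // The company id is always a bound parameter, so one tenant's query
        // cannot drift into another tenant's rows.
        String sql = "SELECT e.name FROM employee e WHERE e.company_id = ?";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setLong(1, companyId);
            try (ResultSet rs = stmt.executeQuery()) {
                List<String> names = new ArrayList<>();
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
                return names;
            }
        }
    }
}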
Hundreds of parallel databases will not scale very well: the DB server will constantly be pushing tables and indexes out of memory to accommodate the next query, resulting in disk thrashing and poor performance, as well. There is only pain down that path.
Depending on the use cases of your application, there is no single "best" way.
Please explain the operations your application will provide so we can get further insight into your problem.
The data to be stored seems to be structured, so at first glance a relational database would work out well, but stick to the point I marked above.
You have not said how this data links at all or if there are even any links between them. However, at a guess, you need 3 tables.
EmployeeTable
AdminTable
CompanyTable
Each with the required properties in there; without additional information I'm not able to provide any more guidance.

How to handle different organizations with a single DB?

Background
I am building an online information system which users can access through any computer. I don't want to replicate the DB and code for every university or organization.
I just want a user to hit a domain like www.example.com, sign in, and use it.
A second user will also hit the same domain www.example.com, sign in, and use it, but the data for them is different.
Scenario
Suppose one university has 200 employees, a 2nd university has 150, and so on.
Question
Do I need to have a separate employee table for each university, or is it OK to have a single table with a column that holds the university ID?
I assume the 2nd is best, but suppose I have 20 universities or organizations and a total of thousands of employees.
What is the best approach?
The same question applies to all tables? This is just to give you an example.
Thanks
The approach will depend upon the data, usage, and client requirements/restrictions.
Use an integrated model, as suggested by duffymo. This may be appropriate if each organization is part of a larger whole (i.e. all colleges are part of a state college board) and security concerns about cross-query access are minimal[2]. This approach has a minimal amount of separation between organizations, as the same schema[1] and relations are "openly" shared. It leads to a very simple model initially, but it can become very complicated (with compound FKs and correct usage of such) if you need relations for organization-specific values, because that adds another dimension of data.
Implement multi-tenancy. This can be achieved with implicit filters on the relations (perhaps hidden behind views and stored procedures), different schemas, or other database-specific support. Depending upon implementation this may or may not share schema or relations even though all data may reside in the same database. With implicit isolation, some complicated keys or relationships can be hidden/eliminated. Multi-tenancy isolation also generally makes it harder/impossible to cross-query.
Silo the databases entirely. Each customer or "organization" has a separate database. This implies separate relations and schema groups. I have found this approach to be relatively simple with automated tooling, but it does require managing multiple databases. Direct cross-querying is impossible, although "linked databases" can be used if there is a need.
Even though it's not "a single DB", in our case we had the following restrictions: 1) we were not allowed to ever share/expose data between organizations, and 2) each organization wanted its own local database. Thus, our product ended up using the silo approach. Make sure that the approach chosen meets customer requirements.
None of these approaches will have any issue with "thousands", "hundreds of thousands", or even "millions" of records as long as the indices and queries are correctly planned. However, switching from one to another can violate many assumed constraints, so the decision should be made early on.
[1] In this response I am using "schema" to refer to the security grouping of database objects (e.g. tables, views) and not the database model itself. The actual database model used can be common/shared, as we do even when using separate databases.
[2] An integrated approach is not necessarily insecure - but it doesn't inherently have some of the built-in isolation of other designs.
I would normalize it to have UNIVERSITY and EMPLOYEE tables, with a one-to-many relationship between them.
You'll have to take care to make sure that only people associated with a given university can see their data. Role based access will be important.
This is called a multi-tenant architecture. You should read this:
http://msdn.microsoft.com/en-us/library/aa479086.aspx
I would go with tenant-per-schema, which means copying the structure across different schemas; however, as you should keep all your SQL DDL in source control, this is very easy to script.
It's easy to screw up and "leak" information between tenants if doing it all in the same table.

Few database design questions relating to user content site

I am designing a user content website (kind of similar to Yelp but for a different market and with photo sharing) and I had a few database questions:
1. Does each user get their own set of tables, or are we storing multiple users' data in common tables? Since this is essentially a social network, databases are usually partitioned off for scalability when the user base grows, with different sets of users served separately, so what is the best approach? I guess some data like user accounts can be in common tables, but for wall posts, photos etc. each user will get their own table? If so, then with 10 million users that means 10 million x however many tables per user? This is currently being designed in MySQL.
2. How do the user tables know what to create each time a user joins the site? I am assuming there may be a system table template from which the fields are pulled?
3. In addition to the above question: if tomorrow we modify tables or add/remove features, how do we roll the changes down to all the live user accounts/tables? I know from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will each table keep checking (say every 24 hrs) against the system tables for updates to its structure?
4. If the above is all true, that means we are maintaining 1 master set of tables with system default values, and each user gets the same values copied to their tables? Take a field like maximum failed login attempts before the system locks the account: say we have a system default of 5 login attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security, so does that mean they can overwrite the system default in their own table?
Thanks.
1. Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
2. You could have default values specified on the table for things that are optional.
3. With difficulty. With one set of tables it will be a lot easier, and probably faster.
4. That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.
Generally, creating a separate set of tables for each entity (in this case each user) is not a good idea. If each table is separate, querying may be cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting up the tables by user, and instead using indexes and partitions, would deal with scalability while maintaining performance. If you don't split up the tables manually, this also makes points 2, 3, and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case it might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySql, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS Sql Server you can create a partition function on fields on a table; MySql 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to spread the data over e.g. 20 shards, which means migrating data between shards.
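A minimal sketch of step 1 using a naive hash-modulo rule (the JDBC-URL list is illustrative); the fact that the shard count is baked into the routing is exactly why growing from 8 to 20 shards forces the data migration mentioned above:

import java.util.List;

class ShardRouter {
    private final List<String> shardJdbcUrls;    // one entry per shard / database instance

    ShardRouter(List<String> shardJdbcUrls) {
        this.shardJdbcUrls = shardJdbcUrls;
    }

    // Step 1: work out which shard holds the data for this user
    String shardFor(long userId) {
        int index = (int) Math.floorMod(userId, (long) shardJdbcUrls.size());
        return shardJdbcUrls.get(index);
    }
}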
