Microservices and database joins

For people who are splitting up monolithic applications into microservices, how are you handling the conundrum of breaking apart the database? Typical applications that I've worked on do a lot of database integration for performance and simplicity reasons.
If you have two tables that are logically distinct (bounded contexts if you will) but you often do aggregate processing on large volumes of that data, then in the monolith you're more than likely to eschew object orientation and instead use your database's standard JOIN feature to process the data on the database prior to returning the aggregated view to your app tier.
How do you justify splitting up such data into microservices where presumably you will be required to 'join' the data through an API rather than in the database?
I've read Sam Newman's Microservices book and in the chapter on splitting the Monolith he gives an example of "Breaking Foreign Key Relationships" where he acknowledges that doing a join across an API is going to be slower - but he goes on to say if your application is fast enough anyway, does it matter that it is slower than before?
This seems a bit glib? What are people's experiences? What techniques did you use to make the API joins perform acceptably?

When performance or latency doesn't matter too much (yes, we don't always need them), it's perfectly fine to just use simple RESTful APIs for querying the additional data you need. If you need to make multiple calls to different microservices and return one result, you can use the API Gateway pattern.
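For illustration, here is a minimal sketch of that gateway-style aggregation in Python, assuming the requests library and two hypothetical services exposing /customers/{id} and /orders endpoints:

```python
# Sketch of an API-gateway "join": fetch from each service, merge in code.
# Service URLs and response shapes are assumptions, not a prescribed API.
import requests

CUSTOMER_SVC = "http://customer-service:8080"
ORDER_SVC = "http://order-service:8080"

def customer_with_orders(customer_id: int) -> dict:
    customer = requests.get(f"{CUSTOMER_SVC}/customers/{customer_id}", timeout=2).json()
    orders = requests.get(f"{ORDER_SVC}/orders",
                          params={"customer_id": customer_id}, timeout=2).json()
    # The "join" happens here in the gateway, not in a database.
    return {**customer, "orders": orders}
```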
It's perfectly fine to have redundancy in polyglot persistence environments. For example, you can use a message queue for your microservices and send "update" events every time you change something. Other microservices will listen for the events they need and save the data locally. So instead of querying, you keep all required data in the appropriate storage for each specific microservice.
Also, don't forget about caching :) You can use tools like Redis or Memcached to avoid querying other databases too often.
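As a rough cache-aside sketch with the redis-py client, assuming a local Redis instance and a hypothetical customer-service endpoint:

```python
import json

import redis
import requests

r = redis.Redis()  # assumes Redis on localhost:6379

def get_customer(customer_id: int) -> dict:
    key = f"customer:{customer_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: ask the owning service, then cache the answer for 60 seconds.
    data = requests.get(f"http://customer-service:8080/customers/{customer_id}",
                        timeout=2).json()
    r.setex(key, 60, json.dumps(data))
    return data
```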

It's OK for services to have read-only replicated copies of certain reference data from other services.
Given that, when trying to refactor a monolithic database into microservices (as opposed to rewriting it), I would:
create a db schema for the service
create versioned* views** in that schema to expose data from that schema to other services
do joins against these read-only views
This will let you independently modify table data/structure without breaking other applications.
Rather than use views, I might also consider using triggers to replicate data from one schema to another.
This would be incremental progress in the right direction, establishing the seams of your components, and a move to REST can be done later.
*the views can be extended. If a breaking change is required, create a v2 of the same view and remove the old version when it is no longer required.
**or Table-Valued-Functions, or Sprocs.
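A small self-contained sketch of the versioned-view idea (sqlite3 is used here only to keep the example runnable, so the per-service schemas are elided; the customers/orders tables and the orders_v1 view are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);

    -- The owning service exposes a versioned view rather than its base table.
    CREATE VIEW orders_v1 AS SELECT id, customer_id, total FROM orders;

    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO orders VALUES (10, 1, 99.50), (11, 1, 25.00);
""")

# Consumers join only against the view; the base table can be reshaped freely
# as long as orders_v1 keeps its contract (or an orders_v2 is introduced).
rows = con.execute("""
    SELECT c.name, SUM(o.total) AS lifetime_spend
    FROM customers c
    JOIN orders_v1 o ON o.customer_id = c.id
    GROUP BY c.name
""").fetchall()
print(rows)  # [('Alice', 124.5)]
```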

CQRS (Command Query Responsibility Segregation) is the answer to this, as per Chris Richardson.
Let each microservice update its own data model and generate the events which will update a materialized view holding the required join data from the other microservices. This MV could be any NoSQL DB, Redis, or Elasticsearch, whichever is query optimized. This technique leads to eventual consistency, which is definitely not bad, and avoids real-time application-side joins.
Hope this answers.
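A bare-bones, in-memory sketch of such an event-driven read model (the event names and fields are invented; a real system would consume events from a broker and write the view to a query-optimized store such as Elasticsearch or Redis):

```python
from dataclasses import dataclass, field

@dataclass
class CustomerOrdersView:
    """Denormalized 'join' of customer and order data, kept eventually consistent."""
    rows: dict = field(default_factory=dict)

    def on_customer_updated(self, event: dict) -> None:
        row = self.rows.setdefault(event["customer_id"], {"orders": []})
        row["name"] = event["name"]

    def on_order_placed(self, event: dict) -> None:
        row = self.rows.setdefault(event["customer_id"], {"orders": []})
        row["orders"].append(event["order_id"])

view = CustomerOrdersView()
view.on_customer_updated({"customer_id": 1, "name": "Alice"})
view.on_order_placed({"customer_id": 1, "order_id": 10})
print(view.rows[1])  # {'orders': [10], 'name': 'Alice'}
```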

I would separate the solutions by area of use, let's say operational and reporting.
For the microservices that provide data for single forms and need data from other microservices (this is the operational case), I think using API joins is the way to go. You will not be going after big amounts of data; you can do the data integration in the service.
The other case is when you need to do big queries on large amounts of data for aggregations etc. (the reporting case). For this need I would think about maintaining a shared database – similar to your original scheme – and updating it with events from your microservice databases. On this shared database you could continue to use your stored procedures, which would save your effort and support the database optimizations.

In microservices you create different read models, so for example: if you have two different bounded contexts and somebody wants to search across both sets of data, then something needs to listen to events from both bounded contexts and create a view specific to the application.
In this case more space will be needed, but no joins will be required.

Related

Best approach to interact with the same database table from more than one microservice

I have a situation where I need to add/update/retrieve records in the same database table from more than one microservice. I can think of the three approaches below; please help me pick the most suitable one.
Having a dedicated microservice, say database-data-manager, which will interact with the database to add/update/retrieve data; all the other microservices will call the endpoints of database-data-manager when they need data.
Having a Maven library called database-data-manager; all the other microservices will use this library for their db interactions.
Having the same code (copy-paste) in all the applications to take care of db interactions.
Approach 1 seems expensive, as we need to host a dedicated application for a basic piece of functionality.
Approach 2 would reduce boilerplate code but makes it difficult to manage library versions.
Approach 3 would cause a lot of boilerplate code and maintenance effort to keep similar code in all the microservices.
Please suggest. Thanks in advance.
A strict definition of "microservice" would include the fact that it's essentially self-contained... that would include any data storage it might need. So you really have a collection of services talking to a common database. Semantics aside...
Option 1 sounds like it's on the right track: you need to have something sitting between the microservices and the database. This could be a cache or a dedicated proxy service. Say you have an old legacy system which is really fragile: controlling data in/out through a more capable service acting as a proxy is a well-proven pattern.
Such a proxy might do a bulk read of the database, hold the data in memory to service high-volumes of reads, and handle updates.
Updating is non-trivial and there are various options:
The service's cached data becomes the pseudo master - updates are applied to the cached data first, then go into a queue to be applied to the underlying database.
The service's data is used only for reads; updates are applied to the database first, and if the update is successful it is then applied to the cached data.
Option 1 is great for performance, on the assumption that the proxy service is really good at managing the data and satisfying service requests. But, depending on how you implement it, it might be vulnerable to outages, in which case you might lose any data that has made it into the cache but not into the pipeline that gets it into the database.
Option 2 is good for ensuring a solid master set of data, but there's the risk that consuming services might read cached data that is now out of date because it is just being updated in the database.
In terms of implementation, a queue of some sort to handle getting updates to the database might be something you want to consider, as it would give you a place to control how updates (and which updates) get to the database.
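To make the second option concrete, here is a rough sketch of such a proxy that serves reads from an in-memory copy and applies writes to the database first (sqlite3 stands in for the shared database; the customers table is hypothetical):

```python
import sqlite3

class CustomerProxy:
    """Serves reads from an in-memory copy; writes hit the database first."""

    def __init__(self, db_path: str = "shared.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")
        self.cache = {}
        self._bulk_load()

    def _bulk_load(self) -> None:
        # Bulk read of the database into memory to serve high-volume reads.
        self.cache = dict(self.db.execute("SELECT id, name FROM customers"))

    def get(self, customer_id: int):
        return self.cache.get(customer_id)

    def update(self, customer_id: int, name: str) -> None:
        # Database first; the cache is refreshed only after the commit succeeds.
        with self.db:
            self.db.execute("INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                            (customer_id, name))
        self.cache[customer_id] = name
```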

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know some best practices for keeping a single data source responsive, or, in case you have a copy of the data on every server, how best to synchronise those databases to keep them updated?
There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMS, and it normally relies on features provided by the RDBMS. Replication has a latency, which means that although servers can handle load independently, they may not necessarily be reading the latest data. This may or may not be a problem for a particular system. When replication is bidirectional, simultaneous changes on two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (e.g. audit log => append both) or difficult (e.g. hotel room booking - cancel one? select an alternative hotel?). You also have to consider what to do in the event that the replication network link is down (i.e. do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later). This is all dependent on the exact type of data you have. One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading, and send all write operations to the source database. This is always a trade-off between availability and consistency (see the CAP theorem). The advantage of RDBMS and replication is that you can easily query your entire dataset in complex ways and have greater opportunity to remove duplication by using relational links to data items.
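As a sketch of that read-heavy compromise (reads spread across replicas, all writes to the source database); the DSNs are placeholders and the SQL classification is deliberately naive:

```python
import itertools

class ReadWriteRouter:
    """Routes writes to the primary and spreads reads over read-only replicas."""

    def __init__(self, primary_dsn: str, replica_dsns: list):
        self.primary_dsn = primary_dsn
        self._replicas = itertools.cycle(replica_dsns)

    def dsn_for(self, sql: str) -> str:
        # Anything that isn't a SELECT goes to the primary. Remember that
        # replicated reads may lag behind the primary (replication latency).
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._replicas)
        return self.primary_dsn

# Hypothetical DSNs; in practice these come from configuration.
router = ReadWriteRouter("postgresql://primary/app",
                         ["postgresql://replica1/app", "postgresql://replica2/app"])
print(router.dsn_for("SELECT * FROM rates"))        # one of the replicas
print(router.dsn_for("UPDATE rates SET rate = 2"))  # the primary
```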
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or as Martin Fowler calls them, 'Aggregate-Oriented Databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also hard to manage the data schema, as it lives independently on multiple databases, which is why most NoSQL databases are schema-less.
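A toy sketch of sharding by customer, plus the "query every shard and combine the results" step mentioned above (plain dicts stand in for separate databases; the order data is made up):

```python
# Each dict stands in for a separate shard database.
SHARDS = [dict(), dict(), dict()]

def shard_for(customer_id: int) -> dict:
    # All of a customer's data lands on one shard, keeping customer->orders local.
    return SHARDS[hash(customer_id) % len(SHARDS)]

def save_order(customer_id: int, order_id: int, total: float) -> None:
    shard_for(customer_id).setdefault(customer_id, []).append((order_id, total))

def total_revenue() -> float:
    # Cross-shard query: every shard must be consulted and the results combined.
    return sum(total
               for shard in SHARDS
               for orders in shard.values()
               for _, total in orders)

save_order(1, 10, 99.5)
save_order(2, 11, 25.0)
print(total_revenue())  # 124.5
```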
Database-per-service
In the microservice approach, it is advised that each microservice should have its own dedicated database, that is not allowed to be accessed by any other microservice (of a different type). Hence, a microservice that manages customer contact information stores the data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique ids, or URIs (especially if the microservices are RESTful) etc. The downside again from this is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice API not direct to the databases).
Polyglot storage
So many of my projects in the past have involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model, much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, html templates in a NoSQL database. The trend with micro-services is to move towards a model where different parts of your dataset are placed in storage providers that are chosen according to the need.
If you are thinking of keeping a different copy of the database for each microservice and you want to achieve eventual consistency, then you can use Kafka Connect. Briefly, Kafka Connect will watch your DBs and whenever there are changes it will read the log file and add the logged change events as messages to a queue; the other databases that subscribe to this queue can then apply the same changes on their side.
Kafka Connect isn't the only framework; you can search for other frameworks or applications that provide the same kind of implementation.
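A rough sketch of the consuming side using the kafka-python client; the topic name and event shape are assumptions, and producing the change events is left to whatever CDC tooling (Kafka Connect, Debezium, etc.) you choose:

```python
import json
import sqlite3

from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "customers.changes",                 # hypothetical change-event topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

local = sqlite3.connect("local_copy.db")  # this service's own copy of the data
local.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)")

for message in consumer:
    change = message.value               # e.g. {"id": 1, "name": "Alice"}
    with local:
        local.execute("INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)",
                      (change["id"], change["name"]))
```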

Data retrieval and search across multiple services

I'm building a system that comprises multiple heterogeneous services that talk to each other over a network, although in the standard deployment model they are all on the same machine. The UI client for managing the entities within that complex system should be able to display aggregated data from all the constituent services while enabling search across that aggregated data.
I'm wondering how to design the data retrieval within this system so that it scales, given that the amount of data to be searched is already high and keeps increasing?
I'm thinking about two approaches:
The client queries data from all services on demand and aggregates the results in its layer. In many cases it will have to do joins between data coming from multiple services, so I'm concerned about performance here.
Denormalize the services' data in a way that is convenient for the client queries, and even store aggregations across the multiple services' data so that the client doesn't have to do joins on demand. It would probably be better to store each service's denormalized data in its own database or cache, as that would make it easier to keep all the denormalized data up to date. However, I'll need to put the aggregated views across multiple services' data in some other place, and I'm concerned about the overhead of keeping this remote cache up to date.
Any examples or references to existing architectures that solve similar problems would be highly appreciated. Thanks!
Having an aggregated cache can surely give better performance, but think carefully about the cost - the synchronization. It will end up that your client (or some remote service doing this job for the clients) has its own database that synchronizes with the service data (something like implementing your own asynchronous pull replication). Check how the data retrieved from the services can change. The best case for you would be if data is never deleted or modified and can only be added. It would also be easier if the data does not have to be consistent. Choosing an appropriate synchronization mechanism depends on the existing architecture and requirements.

In Memory Database

I'm using SQL Server to drive a WPF application. I'm currently using NHibernate and pre-reading all the data so it's cached for performance reasons. That works for a single client app, but I was wondering if there's an in-memory database that I could use so I can share the information across multiple apps on the same machine. Ideally this would sit below my NHibernate stack, so my code wouldn't have to change. Effectively I'm looking to move my DB from its traditional format on the server to be an in-memory DB on the client.
Note I only need select functionality.
I would be incredibly surprised if you even need to load all your information in memory. I say this because, just as one example, I'm working on a Web app at the moment that (for various reasons) loads thousands of records on many pages. This is PHP + MySQL. And even so it can do it and render a page in well under 100ms.
Before you go down this route, make sure that you have to. First make your database as performant as possible. Now obviously this includes things like having appropriate indexes and tuning your database, but even that is putting the cart before the horse.
First and foremost you need to make sure you have a good relational data model: one that lends itself to performant queries. This is as much art as it is science.
Also, you may like NHibernate but ORMs are not always the best choice. There are some corner cases, for example, that hand-coded SQL will be vastly superior in.
Now assuming you have a good data model and assuming you've then optimized your indexes and database parameters and then you've properly configured NHibernate, then and only then should you consider storing data in memory if and only if performance is still an issue.
To put this in perspective, the only times I've needed to do this are on systems that need to perform millions of transactions per day.
One reason to avoid in-memory caching is because it adds a lot of complexity. You have to deal with issues like cache expiry, independent updates to the underlying data store, whether you use synchronous or asynchronous updates, how you give the client a consistent (if not up-to-date) view of your data, how you deal with failover and replication and so on. There is a huge complexity cost to be paid.
Assuming you've done all the above and you still need it, it sounds to me like what you need is a cache or grid solution. Here is an overview of Java grid/cluster solutions but many of them (eg Coherence, memcached) apply to .Net as well. Another choice for .Net is Velocity.
It needs to be pointed out and stressed that something like NHibernate is only consistent so long as nothing externally updates the database and there is exactly one NHibernate-enabled process (barring clustered solutions). If two desktop apps on two different PCs are both updating the same database with NHibernate, the caching simply won't work because the persistence units won't be aware of the changes the other is making.
http://www.db4o.com/ can be your friend!
Velocity is an out of process object caching server designed by Microsoft to do pretty much what you want although it's only in CTP form at the moment.
I believe there are also wrappers for memcached, which can also be used to cache objects.
You can use HANA, express edition. You can download it for free; it's in-memory, columnar, and allows for further analytics capabilities such as text analytics, geospatial or predictive. You can also access it with ODBC, JDBC, the node.js hdb library, and REST APIs, among others.

What is the best approach for decoupled database design in terms of data sharing?

I have a series of Oracle databases that need to access each other's data. The most efficient way to do this is to use database links - by setting up a few database links I can get data from A to B with the minimum of fuss. The problem for me is that you end up with a tightly-coupled design, and if one database goes down it can bring the coupled databases with it (or perhaps part of an application on those databases).
What alternative approaches have you tried for sharing data between Oracle databases?
Update after a couple of responses...
I wasn't thinking so much about replication, more about accessing "master data". For example, if I have a central database with currency conversion rates and I want to pull a rate into a separate database (application). For such a small dataset, igor-db's suggestion of materialized views over DB links would work beautifully. However, when you are dynamically sampling from a very large dataset, the option of caching locally starts to become trickier. What options would you go for in these circumstances? I wondered about an XML service, but tuinstoel (in a comment to le dorfier's reply) rightly questioned the overhead involved.
Summary of responses...
On the whole I think igor-db is closest, which is why I've accepted that answer, but I thought I'd add a little to bring out some of the other answers.
For my purposes, where I'm looking at data replication only, it looks like Oracle BASIC replication (as opposed to ADVANCED replication) is the one for me. Using materialized view logs on the master site and materialized views on the snapshot site looks like an excellent way forward.
Where this isn't an option, perhaps where the data volumes make full table replication an issue, then a messaging solution seems the most appropriate Oracle solution. Oracle Advanced Queueing seems the quickest and easiest way to set up a messaging solution.
The least preferable approach seems to be roll-your-own XML web services but only where the relative ease of Advanced Queueing isn't an option.
Streams is the Oracle replication technology.
You can use MVs over database links (so database 'A' has a materialized view of the data from database 'B'. If 'B' goes down, the MV can't be refreshed but the data is still in 'A').
Mileage may depend on DB volumes, change volumes...
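For reference, a sketch of that MV-over-database-link setup driven from python-oracledb; the connection details, the rates table, the master_link database link, and the refresh options are all placeholders that depend on your environment:

```python
import oracledb

# Placeholder credentials and DSN.
conn = oracledb.connect(user="app", password="secret", dsn="localhost/XEPDB1")
cur = conn.cursor()

# On the master site (run there, not here): CREATE MATERIALIZED VIEW LOG ON rates;
# On this snapshot site, keep a local, periodically refreshed copy over the link.
cur.execute("""
    CREATE MATERIALIZED VIEW rates_mv
    REFRESH FAST ON DEMAND
    AS SELECT * FROM rates@master_link
""")

# Local queries join against rates_mv; if the master goes down the refresh
# fails, but the last-refreshed data is still available locally.
```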
It looks to me like it's by definition tightly coupled if you need simultaneous synchronous access to multiple databases.
If this is about transferring data, for instance, and it can be asynchronous, you can install a message queue between the two and have two processes, with one reading from the source and the other writing to the sink.
The OP has provided more information. He states that the dataset is very large. Well how large is large? And how often are the master tables changed?
With the use of materialized view logs Oracle will only propagate the changes made in the master table. A complete refresh of the data isn't necessary. Oracle streams also only communicate the modifications to the other side.
Buying storage is cheap, so why not local caching? Much cheaper than programming your own solutions.
An XML service doesn't help you when its database is not available, so I don't understand how it would help. Oracle has many options for replication; explore them.
edit
I've built XML services. They provide interoperability between different systems with a clear interface (contract). You can build an XML service in C# and consume the service from Java. However, XML services are not fast.
Why not use Advanced Queuing? Why roll your own XML service to move messages (DML) between Oracle instances when it's already there? You can have propagation move messages from one instance to another when they are both up. You can process them as needed in the destination servers. AQ is really rather simple to set up and use.
Why do they need to be separate databases?
Having a single database/instance with multiple schemas might be easier.
Keeping one database up (with appropriate standby databases etc) will be easier than keeping N up.
What kind of immediacy do you need and how much bi-directionality? If the data can be a little older and can be pulled from one "master source", create a series of simple ETL scripts run on a schedule to pull the data from the "source" database into the others.
You can then tailor the structure of the data to feed the needs of the client database(s) more precisely and you can change the structure of the source data until you're blue in the face.
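A minimal sketch of such a scheduled pull (sqlite3 stands in for both databases; the rates table and its updated_at column are assumptions):

```python
import sqlite3

def pull_rates(source_db: str, target_db: str, since: str) -> str:
    """Copy rows changed since the last run from the source into the client database."""
    src = sqlite3.connect(source_db)
    dst = sqlite3.connect(target_db)
    dst.execute("CREATE TABLE IF NOT EXISTS rates "
                "(currency TEXT PRIMARY KEY, rate REAL, updated_at TEXT)")

    rows = src.execute(
        "SELECT currency, rate, updated_at FROM rates WHERE updated_at > ?",
        (since,)).fetchall()
    with dst:
        dst.executemany("INSERT OR REPLACE INTO rates VALUES (?, ?, ?)", rows)

    # Return the new high-water mark so the next scheduled run only pulls the delta.
    return max((r[2] for r in rows), default=since)
```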
