Data retrieval and search across multiple services

I'm building a system that comprises multiple heterogeneous services that talk to each other over a network, although in the standard deployment model they all run on the same machine. The UI client for managing the entities within that system should be able to display aggregated data from all of the services and support search across that aggregated data.
I'm wondering how to design data retrieval within this system so that it scales, as the amount of data to be searched is already large and growing.
I'm thinking about two approaches:
The client queries data from all services on demand and aggregates the results in its layer. In many cases it will have to do joins between data coming from multiple services, so I'm concerned about performance here.
Denormalize the services' data so that it is convenient for the client queries, and even store aggregations across the data of multiple services so that the client doesn't have to do joins on demand. It would probably be better to store each service's denormalized data in its own database or cache, since that would make it easier to keep all denormalized data up to date. However, I'll need to put the aggregated views across multiple services' data somewhere else, and I'm concerned about the overhead of keeping this remote cache up to date.
Any examples or references to existing architectures that solve similar problems would be highly appreciated. Thanks!

An aggregated cache will surely perform better, but think carefully about the cost: synchronization. Your client (or some remote service that does this job for the clients) will end up with its own database that synchronizes with the service data, which amounts to implementing your own asynchronous pull replication. Check how the data retrieved from the services can change. The best case for you is data that is never deleted or modified, so that only new records are added. It is also easier if the data does not have to be strictly consistent. The appropriate synchronization mechanism depends on the existing architecture and requirements.
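To make the "own pull replication" idea concrete, here is a minimal sketch of the easy case described above, where records are only ever appended. The names SourceService, AggregatedCache and fetch_since are made up for illustration, and the in-memory lists stand in for real remote services.

```python
from dataclasses import dataclass, field


@dataclass
class SourceService:
    """Hypothetical stand-in for one remote service with an append-only record log."""
    records: list = field(default_factory=list)

    def append(self, payload: dict) -> None:
        # Sequence numbers are assigned in insertion order, starting at 1.
        self.records.append({"seq": len(self.records) + 1, **payload})

    def fetch_since(self, seq: int) -> list:
        # Return only records newer than the caller's high-water mark.
        return [r for r in self.records if r["seq"] > seq]


@dataclass
class AggregatedCache:
    """Local copy kept up to date by periodic pulls."""
    rows: dict = field(default_factory=dict)      # (service name, seq) -> record
    cursors: dict = field(default_factory=dict)   # service name -> last seq seen

    def pull(self, name: str, service: SourceService) -> int:
        last = self.cursors.get(name, 0)
        new = service.fetch_since(last)
        for rec in new:
            self.rows[(name, rec["seq"])] = rec
            last = max(last, rec["seq"])
        self.cursors[name] = last
        return len(new)


orders = SourceService()
orders.append({"order_id": 1, "total": 30})
orders.append({"order_id": 2, "total": 45})

cache = AggregatedCache()
first_pull = cache.pull("orders", orders)    # copies both existing records
orders.append({"order_id": 3, "total": 12})
second_pull = cache.pull("orders", orders)   # copies only the new record
```

Once records can be modified or deleted, a plain sequence cursor is no longer enough, which is exactly why the answer suggests checking how the data can change first.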

Related

Best approach to interact with the same database table from more than one microservice

I have a situation where I need to add, update, and retrieve records in the same database table from more than one microservice. I can think of the three approaches below; please help me pick the most suitable one.
1. Have a dedicated microservice, say database-data-manager, that interacts with the database to add/update/retrieve data; all the other microservices call the endpoints of database-data-manager when required.
2. Have a Maven library called database-data-manager that all the other microservices use for their database interactions.
3. Have the same (copy-pasted) code in all the applications to take care of database interactions.
Approach 1 seems expensive, as we need to host a dedicated application for basic functionality.
Approach 2 would reduce boilerplate code, but the library version would be difficult to manage.
Approach 3 would cause a lot of boilerplate code and maintenance effort to keep similar code in all the microservices.
Please suggest. Thanks in advance.

A strict definition of "microservice" would include the fact that it's essentially self-contained, and that would include any data storage it might need. So what you really have is a collection of services talking to a common database. Semantics aside...
Option 1 sounds like it's on the right track: you need something sitting between the microservices and the database. This could be a cache or a dedicated proxy service. If, say, you have an old legacy system that is really fragile, controlling data in and out through a more capable service acting as a proxy is a well-proven pattern.
Such a proxy might do a bulk read of the database, hold the data in memory to service high-volumes of reads, and handle updates.
Updating is non-trivial and there are various options:
The service's cached data becomes the pseudo master: updates are applied to the cached data first, then go into a queue to be applied to the underlying database.
The service's cached data is used only for reads: updates are applied to the database first, and if the update succeeds it is then applied to the cached data.
The first update option is great for performance, on the assumption that the proxy service is really good at managing the data and satisfying service requests. But, depending on how you implement it, it might be vulnerable to outages, in which case you might lose any data that has made it into the cache but not into the pipeline that gets it into the database.
The second update option is good for ensuring a solid master set of data, but there's a risk that consuming services might read cached data that is out of date because it is just being updated in the database.
In terms of implementation, a queue of some sort to handle getting updates to the database might be something you want to consider, as it would give you a place to control how updates (and which updates) get to the database.
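The two update options can be sketched side by side. This is only an illustration under simplifying assumptions: the class names are invented, a plain dict stands in for the database, and there is no concurrency or error handling.

```python
from collections import deque


class WriteBehindProxy:
    """Cache is the pseudo master; database writes are queued and applied later."""

    def __init__(self, database: dict):
        self.database = database
        self.cache = dict(database)   # bulk read at startup
        self.pending = deque()        # updates waiting to reach the database

    def read(self, key):
        return self.cache.get(key)

    def write(self, key, value):
        self.cache[key] = value            # visible to readers immediately
        self.pending.append((key, value))  # lost if the process dies before flush()

    def flush(self):
        while self.pending:
            key, value = self.pending.popleft()
            self.database[key] = value


class WriteThroughProxy:
    """Database is the master; the cache is updated only after the database write."""

    def __init__(self, database: dict):
        self.database = database
        self.cache = dict(database)

    def read(self, key):
        return self.cache.get(key)

    def write(self, key, value):
        self.database[key] = value  # a real database call could fail and raise here
        self.cache[key] = value     # reached only if the database write succeeded


db_a = {"stock:42": 10}
behind = WriteBehindProxy(db_a)
behind.write("stock:42", 9)
before_flush = db_a["stock:42"]   # database still holds the old value
behind.flush()
after_flush = db_a["stock:42"]

db_b = {"stock:42": 10}
through = WriteThroughProxy(db_b)
through.write("stock:42", 9)
```

The window between write() and flush() in the first class is where the outage risk lives; in the second class readers can briefly see the old cached value while the database write is in flight.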

keeping databases in sync (after write/update) across regions/zones

I have to write a web service in PHP that serves three different zones (cities or countries). Each zone will have its own machine running an instance of the web service. Behind every instance is a database that is an exact clone in each region, and the web service serves clients with data from that database. The main reason for running multiple instances of the web service is to distribute client load.
The clients can make read and write calls via the web service APIs.
Write calls will modify the database for that instance, but since the databases in all zones are exact copies, the change has to be applied as soon as possible to the databases in the other zones as well.
I presume the write calls must go to some kind of master server that coordinates among all the web services. But I am sure this pattern is quite common and some solution is already out there.
Please advise whether there is any database-level or application-level technique that would keep the databases in sync on write calls, so that a modification or addition is reflected in all instances of the database. I can choose the database; my first choice would be MySQL or PostgreSQL, but I can switch to another database that solves this issue.

You're right, this pattern is quite common and there is a name for it - Synchronous Master-Master replication. Most modern RDBMS support it:
PostgreSQL supports it through PgCluster: https://wiki.postgresql.org/wiki/PgCluster
MySQL: https://www.howtoforge.com/mysql_master_master_replication
But before implementing it straight away I'd recommend reading more about different types of replication, their pros and cons:
https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
https://dev.mysql.com/doc/refman/8.0/en/replication.html
Synchronous Master-Master replication will be quite slow, especially in a multi-zone scenario, so you might consider other techniques:
Asynchronous replication
Sharding/Partitioning
A mix of sharding and replication
There is a very good book on different distributed techniques (including sharding and replication): "Designing Data-Intensive Applications" by Martin Kleppmann.

Replication techniques are definitely worth looking at, but replication can carry a certain amount of technical overhead and cost. I work for a company called Redactics (https://www.redactics.com), and we came up with a simpler solution: a sort of near-realtime replication based on delta updates, using a pure SQL approach.
There are certainly pros and cons to both approaches, and I'm not trying to push Redactics hard if it is not the most appropriate solution for your needs. Redactics simply tracks the most recent primary keys and uses modification timestamps to find new and changed records, and then copies them over. You can run the sync pretty often without a lot of load, since it is just a delta update. Any workflow can break, of course, but repairing broken replication can be tricky, so we like this approach and like running these sync workflows within your own infrastructure.
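For readers who want to see what a timestamp-based delta sync looks like in general, here is a generic sketch using SQLite from Python. It is not Redactics' actual implementation; the items table, the updated_at column and the sync_delta function are assumptions made up for this example.

```python
import sqlite3

SCHEMA = "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"


def sync_delta(source, target, last_sync):
    """Copy rows changed after last_sync and return the new high-water mark."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM items WHERE updated_at > ? ORDER BY updated_at",
        (last_sync,),
    ).fetchall()
    # INSERT OR REPLACE covers both new rows and modified rows (same primary key).
    target.executemany(
        "INSERT OR REPLACE INTO items (id, name, updated_at) VALUES (?, ?, ?)", rows
    )
    target.commit()
    return max((r[2] for r in rows), default=last_sync)


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(SCHEMA)
target.execute(SCHEMA)

source.executemany("INSERT INTO items VALUES (?, ?, ?)", [(1, "a", 100), (2, "b", 110)])
mark = sync_delta(source, target, 0)

# One modified row and one new row since the last sync.
source.execute("UPDATE items SET name = 'b2', updated_at = 120 WHERE id = 2")
source.execute("INSERT INTO items VALUES (3, 'c', 130)")
mark = sync_delta(source, target, mark)

synced = target.execute("SELECT id, name FROM items ORDER BY id").fetchall()
```

Note that a sync like this never sees deleted rows; those would need soft deletes or a separate mechanism.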

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know some best practices for keeping a single data source responsive or, if you keep a copy of the data on every server, how best to synchronise those databases to keep them up to date.

There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMS, and it normally relies on features provided by the RDBMS. Replication has latency, which means that although servers can handle load independently, they may not be reading the latest data. This may or may not be a problem for a particular system. When replication is bidirectional, simultaneous changes on two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (e.g. audit log => append both) or difficult (e.g. hotel room booking: cancel one? select an alternative hotel?). You also have to consider what to do when the replication network link is down (e.g. do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later?). This all depends on the exact type of data you have. One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading and send all write operations to the source database. This is always a trade-off between Availability and Consistency (see CAP Theorem). The advantage of an RDBMS with replication is that you can easily query your entire dataset in complex ways and have greater opportunity to remove duplication by using relational links between data items.
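The read-heavy compromise (all writes to the source, reads spread over replicas) can be sketched as a small router. ReplicatedStore is a made-up name, and the explicit replicate() call stands in for whatever the RDBMS does in the background.

```python
import itertools


class ReplicatedStore:
    """Writes go to the primary; reads rotate over replicas that may lag behind."""

    def __init__(self, replica_count: int):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next = itertools.cycle(range(replica_count))

    def write(self, key, value):
        self.primary[key] = value

    def replicate(self):
        # Stand-in for the RDBMS shipping changes to each replica.
        for replica in self.replicas:
            replica.update(self.primary)

    def read(self, key):
        # Round-robin over replicas to spread the read load.
        return self.replicas[next(self._next)].get(key)


store = ReplicatedStore(replica_count=2)
store.write("room:101", "booked")
stale = store.read("room:101")   # replication has not run yet, so the replica has nothing
store.replicate()
fresh = store.read("room:101")
```

The gap between stale and fresh is the replication latency the answer describes; whether that is acceptable depends on the data.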
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or as Martin Fowler calls them, 'Aggregate-Oriented Databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also hard to manage the data schema, as it lives independently in multiple databases, which is why most NoSQL databases are schema-less.
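As a rough illustration of both sides of that trade-off, the sketch below routes each customer to one shard by hashing the customer id, and answers a whole-dataset question by visiting every shard. ShardedOrders and its methods are invented for this example, and lists stand in for separate databases.

```python
import zlib


class ShardedOrders:
    def __init__(self, shard_count: int):
        self.shards = [[] for _ in range(shard_count)]

    def _shard_for(self, customer_id: str) -> list:
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return self.shards[zlib.crc32(customer_id.encode()) % len(self.shards)]

    def add_order(self, customer_id: str, amount: int) -> None:
        self._shard_for(customer_id).append((customer_id, amount))

    def orders_for(self, customer_id: str) -> list:
        # A single-customer query touches exactly one shard.
        return [o for o in self._shard_for(customer_id) if o[0] == customer_id]

    def total_revenue(self) -> int:
        # A whole-dataset query must visit every shard and combine partial results.
        partials = [sum(amount for _, amount in shard) for shard in self.shards]
        return sum(partials)


store = ShardedOrders(shard_count=4)
store.add_order("alice", 10)
store.add_order("bob", 20)
store.add_order("alice", 5)

alice_orders = store.orders_for("alice")
revenue = store.total_revenue()
```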
Database-per-service
In the microservice approach, it is advised that each microservice have its own dedicated database that no other microservice (of a different type) is allowed to access. Hence, a microservice that manages customer contact information stores its data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique ids, or URIs (especially if the microservices are RESTful) etc. The downside, again, is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice API, not directly to the databases).
Polyglot storage
Many of my past projects involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model; much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, and HTML templates in a NoSQL database. The trend with microservices is to move towards a model where different parts of your dataset are placed in storage providers chosen according to need.

If you are thinking of keeping a separate copy of the database for each microservice and you want to achieve eventual consistency, then you can use Kafka Connect. Briefly: Kafka Connect watches your databases, and whenever there is a change it reads the log file and adds the logged events as messages to a queue. Other databases that subscribe to this queue can then execute the same statements on their side.
Kafka Connect isn't the only framework; you can search for other frameworks or applications that implement the same idea.
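The underlying idea (a log of change events that each subscriber replays in order) can be shown without any Kafka at all. The sketch below is plain Python and does not use the Kafka Connect API; the event shape with "op", "key" and "row" fields is an assumption made up for illustration.

```python
def apply_change(replica: dict, event: dict) -> None:
    """Apply one change event to a replica keyed by primary key."""
    if event["op"] in ("insert", "update"):
        replica[event["key"]] = event["row"]
    elif event["op"] == "delete":
        replica.pop(event["key"], None)
    else:
        raise ValueError(f"unknown operation: {event['op']}")


class Subscriber:
    """A downstream database that replays the change log from its own offset."""

    def __init__(self):
        self.replica = {}
        self.offset = 0  # index of the next unread event in the log

    def catch_up(self, log: list) -> None:
        for event in log[self.offset:]:
            apply_change(self.replica, event)
        self.offset = len(log)


log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada"}},
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "insert", "key": 2, "row": {"name": "Lin"}},
]
sub = Subscriber()
sub.catch_up(log)
after_first = dict(sub.replica)

log.append({"op": "delete", "key": 2, "row": None})
sub.catch_up(log)   # only the new event is applied
```

Because each subscriber keeps its own offset, a subscriber that was offline simply resumes where it stopped, which is where the eventual consistency comes from.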

Microservices and database joins

For people who are splitting up monolithic applications into microservices: how are you handling the conundrum of breaking apart the database? Typical applications that I've worked on do a lot of database integration for performance and simplicity reasons.
If you have two tables that are logically distinct (bounded contexts, if you will) but you often do aggregate processing on large volumes of that data, then in the monolith you're more than likely to eschew object orientation and instead use your database's standard JOIN feature to process the data on the database before returning the aggregated view to your app tier.
How do you justify splitting up such data into microservices, where presumably you will be required to 'join' the data through an API rather than at the database?
I've read Sam Newman's Microservices book and in the chapter on splitting the Monolith he gives an example of "Breaking Foreign Key Relationships" where he acknowledges that doing a join across an API is going to be slower - but he goes on to say if your application is fast enough anyway, does it matter that it is slower than before?
This seems a bit glib? What are people's experiences? What techniques did you use to make the API joins perform acceptably?

When performance or latency doesn't matter too much (yes, we don't always need them), it's perfectly fine to just use simple RESTful APIs to query the additional data you need. If you need to make multiple calls to different microservices and return one result, you can use the API Gateway pattern.
It's perfectly fine to have redundancy in polyglot persistence environments. For example, you can use a message queue for your microservices and send "update" events every time you change something. Other microservices listen for the events they need and save the data locally. So instead of querying, you keep all required data in storage appropriate for the specific microservice.
Also, don't forget about caching :) You can use tools like Redis or Memcached to avoid querying other databases too often.
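A small sketch of an API-side join with caching follows. The two "services" are just in-process functions standing in for HTTP calls, functools.lru_cache stands in for Redis or Memcached, and all names and data are invented for illustration.

```python
from functools import lru_cache

CUSTOMERS = {1: {"id": 1, "name": "Ada"}, 2: {"id": 2, "name": "Lin"}}
ORDERS = [
    {"order_id": 10, "customer_id": 1, "total": 30},
    {"order_id": 11, "customer_id": 1, "total": 12},
    {"order_id": 12, "customer_id": 2, "total": 45},
]
customer_calls = 0  # counts how often the customer service is actually hit


@lru_cache(maxsize=1024)
def get_customer(customer_id: int) -> tuple:
    """Stand-in for a call to the customer service; results are cached."""
    global customer_calls
    customer_calls += 1
    customer = CUSTOMERS[customer_id]
    return (customer["id"], customer["name"])


def list_orders() -> list:
    """Stand-in for a call to the order service."""
    return ORDERS


def orders_with_customer_names() -> list:
    # The gateway performs the join in memory, one customer lookup per order.
    return [
        {
            "order_id": order["order_id"],
            "customer": get_customer(order["customer_id"])[1],
            "total": order["total"],
        }
        for order in list_orders()
    ]


report = orders_with_customer_names()
```

Three orders produce only two customer lookups here. A real cache also needs an expiry or invalidation policy, which lru_cache does not provide.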

It's OK for services to have read-only replicated copies of certain reference data from other services.
Given that, when trying to refactor a monolithic database into microservices (as opposed to rewriting it), I would:
create a db schema for the service
create versioned* views** in that schema to expose data from that schema to other services
do joins against these read-only views
This will let you independently modify table data/structure without breaking other applications.
Rather than use views, I might also consider using triggers to replicate data from one schema to another.
This would be incremental progress in the right direction, establishing the seams of your components, and a move to REST can be done later.
*the views can be extended. If a breaking change is required, create a v2 of the same view and remove the old version when it is no longer required.
**or Table-Valued-Functions, or Sprocs.
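Here is a minimal sketch of the versioned-view idea, run against an in-memory SQLite database from Python. SQLite has no per-service schemas, so the separation is only hinted at by naming; the table, column and view names (customer_data, customer_v1) are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Private table owned by the customer service; free to change internally.
    CREATE TABLE customer_data (id INTEGER PRIMARY KEY, full_name TEXT, internal_score INTEGER);

    -- Table owned by the order service.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);

    -- Published contract: only the columns other services may rely on.
    CREATE VIEW customer_v1 AS SELECT id, full_name AS name FROM customer_data;

    INSERT INTO customer_data VALUES (1, 'Ada', 7), (2, 'Lin', 3);
    INSERT INTO orders VALUES (10, 1, 30), (11, 2, 45);
""")

# The order service joins against the view, never against customer_data itself.
rows = db.execute("""
    SELECT o.id, c.name, o.total
    FROM orders AS o JOIN customer_v1 AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()
```

If full_name is later split or renamed, customer_v1 can be redefined to keep returning name, and a customer_v2 view can be added for any breaking change.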

CQRS (Command Query Responsibility Segregation) is the answer to this, as per Chris Richardson.
Let each microservice update its own data model and generate events that update a materialized view holding the required join data from those microservices. This materialized view could live in any NoSQL database, Redis, or Elasticsearch, whichever is optimized for the query. This technique leads to eventual consistency, which is definitely not bad, and avoids real-time application-side joins.
Hope this answers.
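A toy version of such an event-fed materialized view is sketched below. The event names (CustomerRenamed, OrderPlaced) and the OrderSummaryView class are hypothetical, and a dict stands in for the query-optimized store.

```python
class OrderSummaryView:
    """Query-side view that stores customer and order data already joined."""

    def __init__(self):
        self.customers = {}   # customer_id -> name
        self.summaries = {}   # order_id -> denormalized row

    def handle(self, event: dict) -> None:
        kind = event["type"]
        if kind == "CustomerRenamed":
            self.customers[event["customer_id"]] = event["name"]
            # Keep already stored rows in step with the new name.
            for row in self.summaries.values():
                if row["customer_id"] == event["customer_id"]:
                    row["customer_name"] = event["name"]
        elif kind == "OrderPlaced":
            self.summaries[event["order_id"]] = {
                "customer_id": event["customer_id"],
                "customer_name": self.customers.get(event["customer_id"]),
                "total": event["total"],
            }


view = OrderSummaryView()
events = [
    {"type": "CustomerRenamed", "customer_id": 1, "name": "Ada"},
    {"type": "OrderPlaced", "order_id": 10, "customer_id": 1, "total": 30},
    {"type": "CustomerRenamed", "customer_id": 1, "name": "Ada L."},
]
for event in events:
    view.handle(event)

summary = view.summaries[10]   # read with a single lookup, no join at query time
```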

I would separate the solutions by area of use into, let’s say, operational and reporting.
For microservices that provide data for single forms needing data from other microservices (the operational case), I think API joins are the way to go. You will not be dealing with large amounts of data, so you can do the data integration in the service.
The other case is when you need to run big queries over large amounts of data to do aggregations etc. (the reporting case). For this I would think about maintaining a shared database, similar to your original scheme, and updating it with events from your microservice databases. On this shared database you could continue to use your stored procedures, which would save you effort and keep the database optimizations.

In microservices you create different read models. For example, if you have two different bounded contexts and somebody wants to search across data from both, then something needs to listen to events from both bounded contexts and build a view specific to the application.
In this case more storage space will be needed, but no joins will be needed.

How to get reports from web services in an efficient manner

We have a distributed system with 3 sites. Each site has its own services that encapsulate both logic and data. All services use a MySQL database for persistence and are exposed as SOAP services. But we have trouble with database reports, since maintaining service encapsulation prevents accessing the database directly. So how can we get reports from the web services without breaking the encapsulation they provide, while still maintaining efficiency?

Share a common data structure known by the services and the clients.
I'd implement a very simple serializable data structure and have these entities exchanged between, and known to, the client and the server(s). And of course all services would output the same data structures.
If you already have a persistence layer (if not, build one) with DAO/DAL entities, make them responsible for querying the data and transforming the original data into these new common data structures. A helper class could do that automatically.
What I think this data structure could be is an entity based on a set of rows and columns (an array of object instances), plus an array of column identifiers known to both the client and the server, so that your model knows which columns the client is requesting.
This way one client could request 3 columns of the report, while a different client requests many others from the same report.
Additionally, I would of course not include any HTML in the data, just the raw data, and leave your clients responsible for how to present it.
The above is a little abstract, but I hope it helps anyway.
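To make it less abstract, here is one possible shape for such a structure, sketched in Python rather than as a SOAP type. ReportData, project and the sample site/month/revenue values are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ReportData:
    columns: list   # column identifiers known to both client and server
    rows: list      # one list of raw values per row, in column order

    def project(self, wanted: list) -> "ReportData":
        """Return a copy containing only the requested columns, in the requested order."""
        indexes = [self.columns.index(name) for name in wanted]
        return ReportData(list(wanted), [[row[i] for i in indexes] for row in self.rows])


# Illustrative data only; a DAO would normally fill this from a query result.
full = ReportData(
    columns=["site", "month", "revenue"],
    rows=[["north", "2024-01", 1200], ["south", "2024-01", 900]],
)

slim = full.project(["site", "revenue"])   # a client that asked for two columns
payload = json.dumps(asdict(slim))         # raw data only, no presentation markup
```

Every service returning the same columns-plus-rows shape means a reporting client can merge results from the three sites without knowing anything about each site's tables.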
