I don't understand SOA (Service-oriented Architecture) and databases. While I'm attracted by the SOA concept (encapsulating reusable business logic into services) I can't figure out how it's supposed to work if data tables encapsulated in a service are required by other services/systems---or is SOA suitable at all in this scenario?
To be more concrete, suppose I have two services:
CustomerService: contains my Customers database table and associated business logic.
OrderService: contains my Orders table and logic.
Now what if I need to JOIN the Customers and Orders tables with an SQL statement? If the tables contain millions of entries, unacceptable performance would result if I have to send the data over the network using SOAP/XML. And how to perform the JOIN?
Doing a little research, I have found some proposed solutions:
Use replication to make a local copy of the required data where needed. But then there's no encapsulation, so what's the point of using SOA? This is discussed on StackOverflow but there's no clear consensus.
Set up a Master Data Service which encapsulates all database data. I guess it would get monster sized (with essentially one API call for each stored procedure) and require updates all the time. To me this seems related to the enterprise data bus concept.
If you have any input on this, please let me know.
One of the defining principles of a "service" in this context is that it owns, absolutely, the data in the area it is responsible for, as well as the operations on that data.
Copying data, through replication or any other mechanism, ditches that responsibility. Either you replicate the business rules too, or you eventually wind up in a situation where the other service needs to be updated whenever you change your internal rules.
Using a single data service is just "don't do SOA"; if you have one single place that manages all data, you don't have independent services, you just have one service.
I would suggest, instead, the third option: use composition to put that data together, avoiding the database level JOIN operation entirely.
Instead of thinking about needing to join those two values together in the database, think about how to compose them together at the edges:
When you render an HTML page for a customer, you can supply HTML from multiple services and compose them next to each other visually: the customer details come from the customer service, and the order details from the order service.
Likewise an invoice email: compose data supplied from multiple services visually, without needing the in-database join.
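As a rough sketch of what such edge composition can look like in code (the client interfaces and type names here are hypothetical stand-ins for the remote calls to each service):

```java
import java.util.List;

// Hypothetical sketch of edge composition: each fragment comes from the service
// that owns it, and they are put next to each other purely for presentation.
public class CustomerPage {

    // Each interface is backed by a remote call to the owning service (HTTP, SOAP, ...).
    interface CustomerClient { Customer getCustomer(long id); }
    interface OrderClient { List<Order> getOrdersForCustomer(long customerId); }

    record Customer(long id, String name, String email) {}
    record Order(long id, String status, double total) {}
    record CustomerPageView(Customer customer, List<Order> orders) {}

    private final CustomerClient customers;
    private final OrderClient orders;

    CustomerPage(CustomerClient customers, OrderClient orders) {
        this.customers = customers;
        this.orders = orders;
    }

    // Compose the two fragments side by side; no database-level JOIN anywhere.
    CustomerPageView render(long customerId) {
        return new CustomerPageView(
                customers.getCustomer(customerId),
                orders.getOrdersForCustomer(customerId));
    }
}
```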
This composition approach has two advantages: one, you do away with the need to join in the database, and even the need to have the data stored in the same type of database. Now each service can use whatever data store is most appropriate for its needs.
Two, you can more easily change the outside of your application. If you have small, composable parts, you can easily add and rearrange the parts in new ways.
The guiding principle is that it is OK to cache immutable data.
This means that simple immutable data from the customer entity can live in the order service, and there's no need to go to the customer service every time you need that info. Breaking everything into isolated services and then always making these remote procedure calls ignores the fallacies of distributed computing.
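As a small illustration (the field names are hypothetical), an order can carry an immutable snapshot of the customer data it needs, taken when the order was placed, so it never calls back to the customer service for it:

```java
// Hypothetical sketch: the order stores an immutable snapshot of the customer
// data it needs, frozen at the moment the order was placed. Later changes to
// the customer are owned by the customer service and do not touch this copy.
record CustomerSnapshot(long customerId, String name, String billingAddress) {}

public record Order(long orderId, CustomerSnapshot customer, double total) {}
```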
If you have extensive reporting needs, you need to create an additional service. I call that an Aggregated Reporting service, which, again, gets read-only data for reporting purposes. You can see an article I wrote about that for InfoQ a few years ago.
In the SO question you quoted, various people state that it is OK for a service to access another service's data, so the Order service could have a GetAllWithCustomer functionality, which would return all the orders along with the customer details for that order.
Also, this question of mine may be helpful:
https://softwareengineering.stackexchange.com/questions/115958/is-it-bad-practice-for-services-to-share-a-database-in-soa
Related
I'm about to propose some fundamental changes to my employers and would like the opinion of the community (opinion because I know that getting a solid answer to something like this is a bit far-fetched).
Context:
The platform I'm working on was built by a tech consultancy before I joined. While I was being onboarded they explained that they used DDD to build it: there are two domains, the client side and the admin side, each with its own database, its own GraphQL server, and its own back-end and front-end frameworks. The data between the tables is synchronized through an HTTP service that's triggered by the GraphQL server on row insertions, updates, and deletes.
Problem:
All of the data present in the client domain is also found in the admin domain; there's no domain-specific data there. Synchronization is a mess and is buggy. The team isn't large enough to manage all the resources and keep track of the different schemas.
Proposal:
Remove the client database and GraphQL servers, and have a single source-of-truth database for all current and potentially future applications. Rethink the schema: split the tables that need to be split, consolidate the ones that should be joined, and create new tables according to the actual current business flow.
Am I justified in my proposal, or was the tech consultancy doing the right thing and I'm sending us backwards?
Normally you have a database, or schema, for each separate bounded context. That means the initial idea of the consultancy company was correct.
What's not correct is the way the consistency between the two is managed. You don't do it on table changes, but with services inside one (or both) of the domains listening to events and taking the update actions. It's still a lot of work, because you have to update the event handlers on every change (to the events or the table structure).
This code is what's called an anti-corruption layer, and that's exactly what it does: it prevents the model of one domain from corrupting the copy kept in the other domain.
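A rough sketch of such a listener, with hypothetical names (the event transport, whether a queue or something else, is left out):

```java
// Hypothetical anti-corruption layer sketch: the admin domain listens to events
// published by the client domain and translates them into its own model, so the
// external model never leaks into the internal one.
public class ClientCustomerChangedHandler {

    // Event shape owned by the other (client) domain.
    public record ClientCustomerChanged(String externalId, String fullName, String mail) {}

    // Internal representation owned by this (admin) domain.
    public record AdminCustomer(String id, String displayName, String email) {}

    public interface AdminCustomerRepository { void save(AdminCustomer customer); }

    private final AdminCustomerRepository repository;

    public ClientCustomerChangedHandler(AdminCustomerRepository repository) {
        this.repository = repository;
    }

    // The translation step is the whole point: map the external contract into
    // local terms instead of mirroring table changes row by row.
    public void on(ClientCustomerChanged event) {
        repository.save(new AdminCustomer(event.externalId(), event.fullName(), event.mail()));
    }
}
```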
That said, as you pointed out, your team is small, and maintaining such a layer (and its code) could cost a lot of energy. But remember that once it's done, you only have to update it when needed.
Anyway, back to the proposal: you could also take this route. What you should (must, I would say) ensure is that in each domain the external tables are accessed only by a few dedicated services or queries, and that this code never, ever modifies the content it accesses. Never. But I suppose you already know this.
Nothing is written in stone; the rules should always be adapted to the real context. Two separate databases mean more work, but also much better separation of the domains: no one can accidentally modify the content of the other domain's tables. On the other hand, one database means less work, but also much more care about what the code does.
For people who are splitting up monolithic applications into microservices, how are you handling the conundrum of breaking apart the database? Typical applications that I've worked on do a lot of database integration for performance and simplicity reasons.
If you have two tables that are logically distinct (bounded contexts, if you will) but you often do aggregate processing on large volumes of that data, then in the monolith you're more than likely to eschew object orientation and instead use your database's standard JOIN feature to process the data in the database before returning the aggregated view to your app tier.
How do you justify splitting up such data into microservices, where presumably you will be required to 'join' the data through an API rather than in the database?
I've read Sam Newman's Microservices book and in the chapter on splitting the Monolith he gives an example of "Breaking Foreign Key Relationships" where he acknowledges that doing a join across an API is going to be slower - but he goes on to say if your application is fast enough anyway, does it matter that it is slower than before?
This seems a bit glib. What are people's experiences? What techniques did you use to make the API joins perform acceptably?
When performance or latency doesn't matter too much (yes, we don't always need them), it's perfectly fine to just use simple RESTful APIs to query the additional data you need. If you need to make multiple calls to different microservices and return one result, you can use the API Gateway pattern.
It's perfectly fine to have redundancy in polyglot-persistence environments. For example, you can use a messaging queue for your microservices and send "update" events every time you change something. Other microservices will listen for the events they need and save the data locally. So instead of querying, you keep all the required data in storage appropriate to the specific microservice.
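A minimal sketch of that idea, with hypothetical event and type names and the messaging plumbing left out (a plain map stands in for whatever local storage the service uses):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the order service consumes "customer updated" events from
// a message queue and keeps its own local, read-only copy of the fields it needs,
// so it never queries the customer service at request time.
public class CustomerReplicaUpdater {

    public record CustomerUpdated(long customerId, String name, String email) {}

    // Stands in for whatever local storage suits this service (table, document, cache).
    private final Map<Long, CustomerUpdated> localCopy = new ConcurrentHashMap<>();

    // Called by the queue consumer for every CustomerUpdated event.
    public void onCustomerUpdated(CustomerUpdated event) {
        localCopy.put(event.customerId(), event);
    }

    public CustomerUpdated findLocal(long customerId) {
        return localCopy.get(customerId);
    }
}
```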
Also, don't forget about caching :) You can use tools like Redis or Memcached to avoid querying other databases too often.
It's OK for services to have read-only replicated copies of certain reference data from other services.
Given that, when trying to refactor a monolithic database into microservices (as opposed to rewriting it), I would:
create a db schema for the service
create versioned* views** in that schema to expose data from that schema to other services
do joins against these readonly views
This will let you independently modify table data/structure without breaking other applications (a rough sketch of such a join appears at the end of this answer).
Rather than use views, I might also consider using triggers to replicate data from one schema to another.
This would be incremental progress in the right direction, establishing the seams of your components, and a move to REST can be done later.
*the views can be extended. If a breaking change is required, create a v2 of the same view and remove the old version when it is no longer required.
**or Table-Valued-Functions, or Sprocs.
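As a rough illustration of a join against such a view (assuming a hypothetical read-only view customer.customers_v1 published by the customer schema, and plain JDBC for brevity):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

// Hypothetical sketch: the order side joins against a versioned, read-only view
// (customer.customers_v1) published by the customer schema, never against the
// customer tables themselves.
public class OrdersWithCustomerQuery {

    private final DataSource dataSource;

    public OrdersWithCustomerQuery(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void printOrdersWithCustomer(long customerId) throws SQLException {
        String sql =
            "SELECT o.order_id, o.total, c.name "
          + "FROM orders.orders o "
          + "JOIN customer.customers_v1 c ON c.customer_id = o.customer_id "   // the versioned view
          + "WHERE c.customer_id = ?";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%d %.2f %s%n",
                        rs.getLong("order_id"), rs.getDouble("total"), rs.getString("name"));
                }
            }
        }
    }
}
```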
CQRS (Command Query Responsibility Segregation) is the answer to this, as per Chris Richardson.
Let each microservice update its own data model and generate events which update a materialized view holding the pre-joined data from the earlier microservices. This MV could be any NoSQL DB, or Redis, or Elasticsearch, whichever is query-optimized. This technique leads to eventual consistency, which is definitely not bad, and avoids real-time application-side joins.
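A hedged sketch of such a projection, with hypothetical event names and an in-memory map standing in for the query-optimized store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the query side: events from the order and customer
// services are folded into one denormalized view, so reads need no join at all.
// A map stands in for the query-optimized store (Elasticsearch, Redis, NoSQL, ...).
public class OrderWithCustomerProjection {

    public record OrderPlaced(long orderId, long customerId, double total) {}
    public record CustomerRenamed(long customerId, String newName) {}

    public static final class OrderView {
        long orderId;
        long customerId;
        String customerName = "";
        double total;
    }

    private final Map<Long, OrderView> views = new ConcurrentHashMap<>();
    private final Map<Long, String> customerNames = new ConcurrentHashMap<>();

    public void on(OrderPlaced e) {
        OrderView v = new OrderView();
        v.orderId = e.orderId();
        v.customerId = e.customerId();
        v.total = e.total();
        v.customerName = customerNames.getOrDefault(e.customerId(), "");
        views.put(e.orderId(), v);
    }

    public void on(CustomerRenamed e) {
        customerNames.put(e.customerId(), e.newName());
        views.values().stream()
             .filter(v -> v.customerId == e.customerId())
             .forEach(v -> v.customerName = e.newName());
    }

    // Eventually consistent read; no join needed.
    public OrderView byOrderId(long orderId) {
        return views.get(orderId);
    }
}
```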
Hope this answers.
I would separate the solutions by area of use, let's say operational and reporting.
For the microservices that provide data for single forms and need data from other microservices (this is the operational case), I think using API joins is the way to go. You won't be going after big amounts of data, and you can do the data integration in the service.
The other case is when you need to run big queries on large amounts of data for aggregations etc. (the reporting case). For this need I would think about maintaining a shared database, similar to your original scheme, and updating it with events from your microservice databases. On this shared database you could continue to use your stored procedures, which would save you effort and keep the database optimizations.
In microservices you create different read models. For example, if you have two different bounded contexts and somebody wants to search across both sets of data, then something needs to listen to events from both bounded contexts and create a view specific to that application.
In this case more space will be needed, but no joins will be required.
I want to provide controlled access to data which is stored in multiple tables. The access is decided based on certain run-time attributes associated with the user. I am looking for a solution which is extensible, performant, and highly secure.
ILLUSTRATION:
There is a framework-level module which stores authorization/access-related data for multiple other modules. Then there are n modules which manage their own lifecycle objects; e.g. module Test1 has 1000 instances which are created and stored in its base table. As the framework solution I want to protect user access to this data, so I created a notion of privileges and stored their mapping to users in my own table. Now, to provide controlled access to the data, my aim is that a user is shown only the objects to which he/she has access.
Approaches in my mind:
I use an Oracle database and currently we are using VPD (Virtual Private Database): we add a policy on each base table of the above-mentioned modules which first evaluates the access of the currently logged-in user from the privileges given to him, and that predicate is then appended (by the database itself, by default) to every query against the base tables of the other modules.
PROS: a very efficient and highly secure solution.
CONS: Cannot work if the base tables and our current table are in two different schemas. Maybe two different schemas in the same database instance can be overcome, but some of my integrator systems might be in separate databases altogether.
Design at the Java layer:
We connect to our DBs through JPA data sources. So I can write a thin layer, basically a wrapper of sorts over EntityManager, and then replicate what VPD does for me: first get the access-related data from my tables, then use a monitored query on my integrator's table, and then maybe cache the data in a caching server (optimization).
CONS: I want to use it in a production system, so I want to get it right in the first shot. I'd like to know about any patterns which are already implemented in the industry.
I do not think your solutions are flexible enough to work well in a complex scenario like yours. If you have very simple queries, then yes, you can design something like an SQL screener at the database or Java level and then just pass all your queries through it.
But this is not flexible. As soon as your queries start to grow complex, improving this query screener will become tremendously difficult, since it is not part of the business logic and cannot know the details of your permission system.
I suggest you implement access checks in your service layer. The service must know for which user it generates or processes data. Move query-generation logic into repositories and, for example, have your services call different repository methods depending on user permissions, or simply parameterize the repository calls based on those permissions.
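A minimal sketch of that service-layer approach (all names here are illustrative, not from your system):

```java
import java.util.List;

// Hypothetical sketch: the permission check lives in the service layer, and the
// repository call is chosen/parameterized by what the user is allowed to see.
public class ObjectAccessService {

    public record User(long id, boolean admin, List<Long> allowedObjectIds) {}
    public record BusinessObject(long id, String name) {}

    public interface ObjectRepository {
        List<BusinessObject> findAll();
        List<BusinessObject> findByIdIn(List<Long> ids);
    }

    private final ObjectRepository repository;

    public ObjectAccessService(ObjectRepository repository) {
        this.repository = repository;
    }

    public List<BusinessObject> visibleObjectsFor(User user) {
        if (user.admin()) {
            return repository.findAll();                         // unrestricted access
        }
        return repository.findByIdIn(user.allowedObjectIds());   // filtered by privileges
    }
}
```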
I have a Spring application which supports a single customer.
I would like to extend this application to support multiple customers, where each customer's data is stored in a separate database. The database schema is the same for each customer, and the same DAOs and business logic should remain the same.
How would I accomplish this with Spring/JPA? Would I need to have multiple persistence contexts and wire in an appropriate entity manager factory based upon the currently logged in user? Are there any examples of implementing something similar to this?
I would advise against running separate databases under a single application. If a redesign of the data model to incorporate multiple customers is not an option, why don't you run multiple instances of your application server/web container, one for each customer? Otherwise you'll have to deal with the drawbacks of having a shared platform and isolated databases.
With multiple customer databases and a single application your code will become more complex, you can't guarantee that customer data is fully isolated (e.g. due to a bug in the application a customer is shown the wrong data, so there's not much benefit in isolating each customer) and you'll have the nightmare of maintaining each customer database. Also, by having different databases you can virtually guarantee that someone pointy-haired is going to ask for some bespoke functionality for customer A while leaving customer B's functionality untouched, because "... it will be easy, as we've got different databases...", forgetting that the application is shared.
If you really, really want to have separate databases for particular customers, this would be the way to go — define separate persistence units with the same entity definitions, but different entity manager factory configurations.
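A hedged sketch of looking up the right factory per customer using the plain JPA bootstrap API (persistence-unit names like "app-customerA" are hypothetical; on newer stacks the imports would be jakarta.persistence):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

// Hypothetical sketch: one persistence unit per customer (same entities, different
// connection settings), with the factory for the logged-in customer looked up at
// runtime and cached.
public class CustomerEntityManagers {

    private final Map<String, EntityManagerFactory> factories = new ConcurrentHashMap<>();

    public EntityManager entityManagerFor(String customerId) {
        EntityManagerFactory emf = factories.computeIfAbsent(
                "app-" + customerId,                      // persistence-unit name per customer
                Persistence::createEntityManagerFactory); // reads it from persistence.xml
        return emf.createEntityManager();
    }
}
```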
To me it sounds more like a need to redesign the database structure. I'm guessing that the application was written with only one client in mind, and it turned out that more appeared on the horizon, so, hey, let's do something about it, and fast! Aren't you trying to copy-paste, just on a bigger scale? You're going to have a lot of redundancy with JPA if you want to have a few databases with the same structure: for example, everything that's defined inside the mapping file (queries, entity relationship mappings, etc.) is defined per persistence unit, so you'll have to repeat these definitions and keep them all synchronized.
I'll stop here, as it is merely guesswork, for the lack of broader description.
I'm working on an application for one of our departments that contains medical data. It interfaces with a third party system that we have here.
The object itself (a claim) isn't terribly complex, but due to the nature of the data and the organization of the database, retrieving the claim data is very complex. I cannot simply join all the tables together and get the data. I need to do a "base" query to get the basics of the claim, and then piece together supplemental data about the claim based on various issues.
When working with this data, would it be better to:
Generate the object in a stored procedure, where all of the relevant data is readily available, and iterate through a table variable (using SQL Server 2005) to piece together all the supplemental information.
Generate the object in the data access layer, where I have a little more powerful data manipulation at my disposal, and make a bunch of quick and simple calls to retrieve the lookup data.
Use an OR/M tool and map out all the complex situations to generate the object.
Something else.
EDIT: Just to clarify some of the issues listed below. The complexity really isn't a business issue. If a claim has a type code of "UB", then I have to pull some of the supplemental data from Table X. If the claim has a type code of "HCFA", then I have to pull some of the data from Table Y. It is those types of things. I hope this helps.
One more vote for stored procedures in this case.
What you are trying to model is a very specific piece of business logic ("what is a claim") that needs to be consistent across any application that deals with the concept of a claim.
If you only ever have one application, or multiple applications using the same middleware, you can put that in the client code; however, practice shows that databases tend to outlive software that accesses them.
You do not want to wind up in a situation where subtle bugs and corner cases in redundant implementations make different applications see the data in slightly different ways. DRY, and all that.
I would use a stored procedure for security reasons. You don't have to give SELECT privileges to the claims tables that you are using, which sound somewhat important. You only have to give the user access to that stored procedure. If the database user already has SELECT privileges on the tables, I don't see anything wrong with generating the object in the data access layer either. Just be consistent with whatever option you choose. If you are using stored procedures elsewhere, continue to use them here. The same applies to generating the objects in the data access layer.
Push decisions/business logic as high up in your application's code hierarchy as possible. ORMs/stored procedures are fine but cannot be as efficient as hand-written queries. The higher up in your code you go, the more you know what the data will be used for, and the more information you have to retrieve it intelligently.
I'm not a fan of pushing business logic down to the persistence layer, so I wouldn't recommend option 1. The approach I'd take involves having a well-defined program object that models the underlying database entity, so it is ORM-oriented, but your option 3 sounds like you're thinking of it as an onerous mapping task, which I really don't consider it to be. I'd just have the logic necessary for loading up whatever you're concerned about set up in methods on the program object that models it.
As a general rule, I use a data access layer just to retrieve data (possibly from different sources) and return it in a meaningful manner.
Anything that requires business rules or logic (decisions) goes in my business layer.
I do not deviate from that choice lightly*.
It sounds like the claim you are generating is really a view of data stored in various places, without any decisions or business logic. If that's the case, I would tend to manage access to that data in the data layer.
*I'll never forget one huge system I worked on that got very over-complicated because the only person available to work on a central piece was an expert at stored procedures... so lots of the business logic ended up there.*
Think of the different ways you're planning to consume the data. The whole purpose of an application layer is to make your life easier. If it doesn't, I agree with #hoffmandirt that it's safer in the database.
Stored procedures are bad, m'kay?
It sounds like views would be better than stored procedures in this case.
If you are using .NET, I would highly recommend going with an ORM to get support for Linq.
In general, spreading business logic between the database and application code is not a good idea.
In the end, any solution will likely work. You aren't facing a make or break type decision. Just get moving, don't get hung up on this kind of issue.