Data integrity across the databases of different microservices

I am using relational databases for my microservices. I have CustomersMService which has its own database with table Customer, then I have OrdersMService which also has its own database but with table Order, and that table has column CustomerId. How can I ensure data integrity between databases, so Orders won't point to non-existent Customers?

How can I ensure data integrity between databases, so Orders won't point to non-existent Customers?
There is an important dimension missing, which is that of the span of time over which you wish to establish the referential integrity.
If you ask, "How can I ensure that all my data is 100% consistent at all times?" - you can't. If you want that you will need to enforce it, either via foreign key constraints (which are unavailable across databases), or by never writing to more than one database outside of a distributed transaction (which would defeat the purpose of using service orientation).
If you ask, "How can I ensure that all my data is 100% consistent after a reasonable span of time?", then there are things you can do. A common approach is to implement durable, asynchronous eventing between your services. This ensures that changes can be written locally and then dispatched remotely in a reliable, but offline, manner. A further approach is a caretaker process which periodically remediates inconsistencies in your data.
However, outside of a transaction, even over a reasonable span of time, consistency is impossible to guarantee. If absolute consistency is a requirement for your application then microservices may not be the right approach.
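The caretaker process mentioned above can be sketched very simply. This is a minimal, hypothetical illustration (table and function names invented) using two in-memory SQLite databases as stand-ins for the services' private stores:

```python
import sqlite3

# Two in-memory databases stand in for the two services' private stores.
customers_db = sqlite3.connect(":memory:")
orders_db = sqlite3.connect(":memory:")

customers_db.execute("CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT)")
orders_db.execute("CREATE TABLE OrderTbl (Id INTEGER PRIMARY KEY, CustomerId INTEGER)")

customers_db.execute("INSERT INTO Customer VALUES (1, 'Alice')")
orders_db.execute("INSERT INTO OrderTbl VALUES (10, 1)")   # valid reference
orders_db.execute("INSERT INTO OrderTbl VALUES (11, 99)")  # dangling reference

def find_orphaned_orders():
    """One caretaker pass: list orders whose CustomerId has no matching Customer."""
    known = {row[0] for row in customers_db.execute("SELECT Id FROM Customer")}
    return [oid for oid, cid in
            orders_db.execute("SELECT Id, CustomerId FROM OrderTbl ORDER BY Id")
            if cid not in known]
```

A real caretaker would run a pass like this periodically and then remediate the orphans it finds (delete them, flag them, or re-fetch the missing customer from the owning service).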

Database Bottleneck In Distributed Application

I hear about SOA and distributed applications everywhere now. I would like to know some best practices for keeping a single data source responsive, or, if you keep a copy of the data on every server, how best to synchronise those databases to keep them updated?
There are many answers to this question and in order to choose the most appropriate solution, you need to carefully consider what kind of data you are storing and what you want to do with it.
Replication
This is the traditional mechanism for many RDBMSs, and it normally relies on features provided by the RDBMS. Replication has a latency, which means that although servers can handle load independently, they may not necessarily be reading the latest data. This may or may not be a problem for a particular system.

When replication is bidirectional, simultaneous changes on two databases can lead to conflicts that need resolving somehow. Depending on your data, the choice might be easy (e.g. an audit log: append both) or difficult (e.g. a hotel room booking: cancel one? select an alternative hotel?). You also have to consider what to do in the event that the replication network link is down (i.e. do you deny updates on both databases, on one database, or allow the databases to diverge and sort out the conflicts later?). This all depends on the exact type of data you have.

One possible compromise, for read-heavy systems, is to use unidirectional replication to many databases for reading, and send all write operations to the source database. This is always a trade-off between Availability and Consistency (see the CAP theorem). The advantage of an RDBMS with replication is that you can easily query your entire dataset in complex ways and have a greater opportunity to remove duplication by using relational links between data items.
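The read-heavy compromise described above (reads from replicas, writes to the source) is essentially a routing decision. A minimal sketch, with hypothetical connection names and a deliberately naive write-detection rule:

```python
import random

class ReadWriteRouter:
    """Route writes to the source database and reads to any replica."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def connection_for(self, statement):
        # Writes must go to the source; reads may be served by a replica,
        # accepting that replication latency can make replicas slightly stale.
        is_write = statement.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
        return self.primary if is_write else random.choice(self.replicas)

router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
```

Real routers (e.g. in connection-pooling middleware) make the same decision, just with more robust statement classification and session stickiness.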
Sharding
If your data can be cleanly partitioned into disjoint subsets (e.g. different customers), such that all possible relational links between data items are contained within each subset (e.g. customers -> orders), then you can put each subset in a separate database. This is the principle behind NoSQL databases, or, as Martin Fowler calls them, 'aggregate-oriented databases'. The downside of this approach is that it requires more work to run queries over your entire dataset, as you have to query all your databases and then combine the results (e.g. map-reduce). Another disadvantage is that in separating your data you may need to duplicate some of it (e.g. sharding by customers -> orders might mean product data is duplicated). It is also hard to manage the data schema, as it lives independently in multiple databases, which is why most NoSQL databases are schema-less.
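The routing rule for such a partitioning can be as small as a stable hash of the shard key. A hypothetical sketch (shard names invented); the point is that a customer, and everything linked to that customer, deterministically lands on the same shard:

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # one database per disjoint subset

def shard_for_customer(customer_id: str) -> str:
    """Stable hash routing: the same customer_id always maps to the
    same shard, so customers -> orders stay together in one database."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return SHARDS[digest[0] % len(SHARDS)]
```

A query over all customers then has to fan out to every entry in SHARDS and merge the results, which is exactly the extra work described above.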
Database-per-service
In the microservice approach, it is advised that each microservice should have its own dedicated database, which no other microservice (of a different type) is allowed to access. Hence, a microservice that manages customer contact information stores its data in a separate database from the microservice that manages customer orders. Links can be made between the databases using globally unique IDs, or URIs (especially if the microservices are RESTful), etc. The downside, again, is that it is even harder to perform complex queries on the entire dataset (especially since all access should go via the microservice APIs, not directly to the databases).
Polyglot storage
So many of my projects in the past have involved a single RDBMS in which all data was placed. Some of this data was well suited to the relational model, much of it was not. For example, hierarchical data might be better stored in a graph database, stock ticks in a column-oriented database, html templates in a NoSQL database. The trend with micro-services is to move towards a model where different parts of your dataset are placed in storage providers that are chosen according to the need.
If you are thinking of keeping a separate copy of the database for each microservice, and you want to achieve eventual consistency, then you can use Kafka Connect. Briefly: Kafka Connect will watch your databases, and whenever there are changes it will read the log file and add the logged events as messages to a queue; the other databases that subscribe to that queue can then execute the same statements on their side.
Kafka Connect isn't the only framework; you can search for and find other frameworks or applications for the same kind of implementation.
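To make the mechanism concrete without depending on Kafka itself, here is an in-process stand-in for that pipeline (this is NOT Kafka Connect, just an illustration of the pattern): the source side captures each change as a message on a queue, and a subscribing database replays the same statements to converge on the same state:

```python
import sqlite3
from queue import Queue

# Stand-in for a change-data-capture pipeline: the source appends each
# change as a message on a queue; a subscriber replays the statements.
change_log = Queue()

def capture(sql, params):
    change_log.put((sql, params))

source = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
schema = "CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT)"
source.execute(schema)
replica.execute(schema)

# The source service writes locally, then captures the change for dispatch.
stmt = "INSERT INTO Customer VALUES (?, ?)"
source.execute(stmt, (1, "Alice"))
capture(stmt, (1, "Alice"))

# The subscriber drains the queue and applies the same changes,
# eventually converging on the source's state.
while not change_log.empty():
    sql, params = change_log.get()
    replica.execute(sql, params)
```

The real systems add the parts that make this durable: the queue survives restarts, and change capture reads the database's own write-ahead log rather than trusting the application to call `capture` itself.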

Pivoting EAV data from the client to a Relational Model on the server

I am building a software platform for mobile electronic data collection. It should support any type of data. For example, a government might use it for a population survey; a manufacturing company might use it to evaluate plant condition at their factories; a research organization might use it for clinical trials, etc.
As such, the software is powered by a database, with standard relational design for the metadata and entity attribute value for the actual data. Client software then reads the metadata and renders the appropriate user interface, complete with rules, validations, skip logic and so on. I believe the choice of EAV is a good one owing to the diversity of data that might be collected, but ...
Once the data is submitted from the mobile clients to the customer's server, the EAV model is no longer useful because the customer expects just his set of (usually very few) tables, for visualization and processing.
I have considered two options for pivoting the data.
1) Pivot the data immediately after it is submitted to the server (via a JSON web service) and save it straight away into a relational model.
2) Save the data in a similar schema on the server but have a background process that pivots it periodically and saves it in a relational model.
The first alternative seems more efficient, as pivoting one record at a time is obviously quicker and less CPU-intensive. The disadvantage is that if the metadata changes, this process needs to adapt immediately by changing the relational model for the data accordingly. Depending on the extent of the changes, this can take some time. Worse, if it fails for any reason, upload requests might start being declined. With the second approach, such a failure would not "break" anything as urgently.
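Whichever option is chosen, the pivot step itself is the same: collapse the EAV triples for one entity into a flat row matching the customer's target table. A minimal sketch (attribute and column names invented); attributes missing from the submission become NULLs:

```python
def pivot_eav_record(entity_id, eav_rows, columns):
    """Collapse EAV triples (entity, attribute, value) for one entity
    into a flat dict keyed by the target table's columns."""
    values = {attr: val for ent, attr, val in eav_rows if ent == entity_id}
    return {col: values.get(col) for col in columns}

# One submitted record, as (entity, attribute, value) triples.
eav = [
    (7, "age", 34),
    (7, "gender", "F"),
    (7, "region", "North"),
]
row = pivot_eav_record(7, eav, ["age", "gender", "region", "income"])
```

Under option 1 this runs inside the upload handler per record; under option 2 a background job runs it in batches over the staged EAV tables.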
Are there other potential pitfalls I might be missing or design considerations I should make? What are some good reasons to do it one way or the other? Are there other alternatives I should be exploring to solve this problem?
Just define a straightforward relational schema of tables for their data using DDL. EAV is just an encoding of a proper schema and its metadata, which, of course, the DBMS can't understand, so you lose practically all the benefits of a DBMS. The only possible reason to use EAV is when tables are not known at compile time and DDL isn't fast enough or can't hold enough tables.
The EAV requests are just textual rearrangements of the DDL requests. (An EAV configuration is typically a table for multiple entity-attribute-value requests, given a table and the key column(s) of the entities having virtual tables.) Moreover, one only has to write a single, easily implemented interface to map EAV configuration-then-updates to whichever of the two implementations one chooses. (It would be better to use a pure relational interface and hide the chosen implementation, but the nature of interfaces to SQL DBMSes, namely SQL, makes that difficult. I.e. it would be easy if one were using a relational API rather than SQL.)
The EAV configuration without such an interface is only simpler if you don't declare the appropriate constraints or transactions on the virtual per-entity tables. Also, every EAV version of an update or query must reconstruct the virtual tables and then embed those expressions in the DDL version's update or query. (Only in the case of simply inserting, deleting or retrieving a single triple is the EAV DML as simple.)
Only if you showed that creating and deleting new tables was infeasible, and that the corresponding horrible, integrity-and-concurrency-challenged, mega-joining, table-and-metadata-encoded-in-tables, EAV information-equivalent design was feasible, should you even think of using EAV.
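To see why every query must reconstruct the virtual tables, consider a hypothetical two-column "Person" entity stored as EAV triples; the pivoting subquery below has to be embedded in every real query against it before the actual work can start:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE eav (entity INTEGER, attr TEXT, value TEXT)")
db.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "name", "Alice"), (1, "city", "Oslo"),
    (2, "name", "Bob"),   (2, "city", "Lima"),
])

# The "virtual" Person table, rebuilt from triples on every query.
virtual_person = """
    SELECT entity,
           MAX(CASE WHEN attr = 'name' THEN value END) AS name,
           MAX(CASE WHEN attr = 'city' THEN value END) AS city
    FROM eav
    GROUP BY entity
    ORDER BY entity
"""
rows = db.execute(virtual_person).fetchall()
```

With a real `Person(entity, name, city)` table declared via DDL, the same query is a plain `SELECT`, and the DBMS can type-check, constrain, and index the columns directly.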

Handle database relations on the server side or in the program

After a few discussions with a colleague, we still don't agree on this topic.
In my opinion it makes more sense to create a properly designed database with all relations included.
I'm not really experienced in this area, which is why I'm asking you.
Advantages, in my opinion:
- No "wrong" inserts, because relation conflicts are rejected by the database
- Database and program are strictly separated
- Several programs using the same data source require less customization work
- Makes the use of LINQ much easier
- and many more... ?
What are the possible disadvantages of this approach?
What are the advantages of tables without relations?
Transactional systems should "always" have the referential integrity enforced as close to the database as possible. Most people would agree that this is best done right inside the database itself. You have correctly recognized many of the advantages of letting the DBMS enforce referential integrity.
I said "always" above because I believe in common sense and deliberate decisions not rules of thumb.
One reason why someone may not want to enforce referential integrity within the database is that you have a cyclical relationship where the parent and the child need to point to each other and it is not possible to insert one record because the other isn't there yet. This leaves you with a so-called catch-22. In this case, you may need to enforce the referential integrity in program logic. Still, the best place for this is in the data layer, not in the application layer.
Another reason why some people don't worry about referential integrity is when the data is read-only. This can happen in a reporting database or data warehouse. Referential integrity in the database creates indexes which are used to enforce the relationships. This can sometimes be a space issue, but more often it is just a problem with making the data warehouse load harder because of the order of operations required.
One more reason why referential integrity is sometimes not used is that archiving old transactional data can get tricky because of complex interrelationships between master tables and transaction tables. You can easily find yourself in a position where it's impossible to delete any data, no matter how old it is, because it is somehow related to something that is related to another thing that is needed by something current.
Having said all of this you should definitely start from the position of using referential integrity features of your database and only back away from this if you have a really good, well considered reason.
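As a concrete illustration of letting the DBMS enforce the relationship, here is a sketch using SQLite via Python (table names invented; note SQLite only enforces foreign keys when the pragma is switched on per connection). The insert with a dangling CustomerId is rejected by the database itself, regardless of what the application believed:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

db.execute("CREATE TABLE Customer (Id INTEGER PRIMARY KEY)")
db.execute("""CREATE TABLE OrderTbl (
    Id INTEGER PRIMARY KEY,
    CustomerId INTEGER NOT NULL REFERENCES Customer(Id)
)""")

db.execute("INSERT INTO Customer VALUES (1)")
db.execute("INSERT INTO OrderTbl VALUES (10, 1)")  # accepted: the customer exists

rejected = False
try:
    db.execute("INSERT INTO OrderTbl VALUES (11, 99)")  # there is no Customer 99
except sqlite3.IntegrityError:
    rejected = True  # the DBMS itself refused the dangling reference
```

This is the "no wrong inserts" advantage from the question: the rule holds for every program that touches the database, not just the one that remembered to check.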
Of course!!! You must enforce referential integrity within your database model! It is safer, more efficient, guarantees data integrity, and you do not rely on the programmer. No discussion here.
Tables without relations are ONLY usable if you are just building a "reporting DB" that downloads nightly data from various systems, for example.

Data integrity with DBMS or application stack

My question is very simple, and it's more advice that I'm seeking. What is better when it comes to maintaining data integrity: the DBMS or application code?
Example: with a DBMS we can use things like triggers, transactions, procedures, etc. to do ALMOST proper data management and make sure things go into the right place. The same can be achieved with application code.
Which one would you prefer in particular?
Application code, or a combination of both?
Generally you would want to bake any data-at-rest integrity into the DBMS. Referential requirements, data limitations (length, etc.), things like that. The idea is that the DBMS should, in and of itself (disconnected from and not reliant on the application[s] which use[s] it), maintain the integrity and rules of the data it contains.
The key thing to note there is that multiple applications may use this database. Even if that's not the case for a particular database, or not even likely to be the case in the foreseeable future, it's still good design. Not all of the applications necessarily have the same business logic or the same integrity checks. The DBMS should never assume that what it's getting has been checked for integrity.
The application(s) should apply the business logic and maintain integrity for data-in-motion. They should do their best to prevent even trying to persist invalid data to the database. But at any given point in the application, it may very well not be reasonable to assume that it "knows" all of the other data in the database. The application can apply logic to the small piece of data it's currently holding, then try to interact with the database to persist it.
But the job of the database is to know and maintain all of the data, not just what's currently being used by the application. So where the application may believe that it has a perfectly good piece of data, the database may disagree based on the state of some other data which the application wasn't using. It's perfectly acceptable for the database to then return an error to the application to tell it that there's a problem with the data being sent. The application should be able to handle such errors.
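That division of labour might look like this sketch (schema, rules, and function names invented for illustration): the data-at-rest rules live in the schema itself, and the application is written so that it can handle the database saying no:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Data-at-rest rules live in the schema, independent of any application.
db.execute("""CREATE TABLE Account (
    Id INTEGER PRIMARY KEY,
    Email TEXT NOT NULL CHECK (length(Email) <= 254 AND Email LIKE '%@%'),
    Balance NUMERIC NOT NULL CHECK (Balance >= 0)
)""")

def save_account(account_id, email, balance):
    """Application-side write: apply business logic first, but still be
    prepared for the database to reject the data."""
    try:
        db.execute("INSERT INTO Account VALUES (?, ?, ?)",
                   (account_id, email, balance))
        return True
    except sqlite3.IntegrityError:
        return False  # the DBMS disagreed with the application's view

ok = save_account(1, "alice@example.com", 100)
bad = save_account(2, "not-an-email", -5)
```

The `save_account` return value is where "the application should be able to handle such errors" becomes real code: the rejection path is an expected outcome, not an exception that crashes the request.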
To sum up...
From the point of view of the application, active data is a small subset of all data and the application is responsible only for the active data. Also, data is in motion and very fluid and part of a richer logical model of business logic.
From the point of view of the DBMS, "active data" is all data and the DBMS is responsible for maintaining the integrity of all data. Generally, data is at rest and static and should at any given snapshot of the tables be "good data."

Should referential integrity be enforced?

One of the reasons given for not enforcing referential integrity is performance: because the DB has to validate all updates against relationships, it makes things slower. But what are the other pros and cons of enforcing versus not enforcing?
Since relationships are maintained in the business logic layer anyway, it seems redundant for the DB to do it as well. What are your thoughts on this?
The database is responsible for data. That's it. Period.
If referential integrity is not done in the database, then it's not integrity. It's just trusting people not to do bad things, in which case you probably shouldn't even worry about password-protecting your data either :-)
Who's to say you won't get someone writing their own JDBC-connected client to totally screw up the data, despite your perfectly crafted and bug-free business layer (the fact that it probably won't be bug-free is another issue entirely, mandating that the DB should protect itself).
First of all, it's almost impossible to make it really work correctly. To have any chance of working right, you need to wrap a lot of the cascading modifications as transactions, so you don't have things out of sync while you've changed one part of the database, but are still updating others that depend on the first. This means code that should be simple and aware only of business logic suddenly needs to know about all sorts of concurrency issues.
Second, keeping it working is almost impossible to hope for -- every time anybody touches the business logic, they need to deal with those concurrency issues again.
Third, this makes the referential integrity difficult to understand -- in the future, when somebody wants to learn about your database structure, they'll have to reverse engineer it out of your business logic. With it in the database, it's separate, so what you have to look at only deals with referential integrity, not all sorts of unrelated issues. You have (for example) direct chains of logic showing what a modification to a particular field will trigger. At least for quite a few databases, that logic can be automatically extracted and turned into fairly useful documentation (e.g., tree diagrams showing dependencies). Extracting the same kind of information from the BLL is more likely to be a fairly serious project.
There are certainly some points in the other direction, and reasons to craft all of this by hand -- scalability and performance being the most obvious. When/if you go that route, however, you should be aware of what you're giving up to get that performance. In some cases, it's a worthwhile tradeoff -- but in other cases it's not, and you need information to make a reasoned decision.
Relationships may be maintained in a business logic layer. Unless you can guarantee 100% beyond any doubt that your BLL is and always will be bug-free, then you don't have data integrity. And you can't make that guarantee.
Also, if another app will ever touch your database, it isn't required to follow (read: reimplement, maybe in a subtlely wrong way) the rules in your BLL. It could corrupt the data, even if you somehow managed to be one of the 3 programmers on Earth to write bug-free code.
The database, meanwhile, enforces the same rules for everybody -- and rules enforced by the database are far less likely to be overlooked when you're updating, since the DB won't allow it.
Have a listen to Dan Pritchett, Technical Fellow at eBay on why certain database constructs such as transactions and referential integrity are not the mandates that textbooks might indicate they should be... It comes down to the types of data, the volume of queries and business requirements. Balance those and it will lead you to pragmatic solutions, not dogmatic answers...
However, do not assume that keeping relationships in the BLL will protect your data. You cannot guarantee that future developers won't expose new APIs that bypass the BLL for "performance" reasons, or simple lack of understanding of your architecture...
The performance assumption on which the question is based is incorrect as a general rule. Usually if you require RI to be enforced then the database is the most efficient place to do it, NOT the application - otherwise the application has to requery more data in order to be able to validate RI outside the database.
Also, RI constraints in the database are useful for the query optimiser for making other queries more efficient. Integrity constraints in the application can't achieve that.
Lastly, the cost of maintaining integrity constraints in every application is generally more expensive and complex than doing it once in one place.
But Colonel Ingus, if you've got the customer with an ID in the session, you've already probed the database! The problem is when you then write your sales order away but don't attach it to a product, because you didn't probe for a product. One way or another you'll end up with orphaned records, just like the very large company I'm currently working for has. We have customers with no history and history with no customers; customers with outstanding balances who've never bought anything, and goods sold to customers who don't exist - interesting business concepts - and it keeps a team of very frustrated support staff in full-time employment trying to sort it out. It would have been far less expensive to put RI on everything and buy a bigger box to sort out any perceived performance problems.
A lot has already been said about the fact that the DB should be the final place to validate/control your constraints (and I couldn't agree more)
If the data is important, then your application won't be the last to access the database and it won't be the only one.
But there is another very important fact about referential integrity (and other constraints): it documents your datamodel and makes the dependencies between the tables explicit.
As far as performance is concerned, defining FKs (or other constraints) in the database can make things even faster in certain cases, because the DBMS can rely on the constraints and make appropriate optimizations.
It depends on the data. If it's highly transactional data, such as business transactions, where frequent updates are happening, then enforcing the business rules in the database is extremely important. But for everything else the performance impact may not be worth it.
What paxdiablo and dportas said. And my two cents. There are two other considerations.
In order to validate referential integrity for a new insert, you have to probe the database to verify that the reference is valid. You have just nullified the performance gain that led you to want to enforce integrity in the application. It's actually faster to let the DBMS enforce referential integrity.
Beyond that, consider the case where you have more than one application reading and writing data in a single database. If you enforce referential integrity in the business application layer, you have to make sure that all of the applications do things right. Otherwise, some aberrant application could store invalid references, and the problem could surface when a different application went to use the data. That's a real mess.
Better to have the DBMS enforce the data rules for all the applications.
If you maintain the relationships in the business layer, you can guarantee that a few years down the pike you will have bad data in the database. The business layer is the worst possible place to do that.
Further, when you replace the business layer with something else, you will have to redefine all these things. Databases often outlast the original application they were written for by many years; put the correct relationships and constraints in the database, where they belong.
What happens when you try to insert a record into the database and it fails referential integrity? You get an error from the database. Then you have to change your code so that it doesn't try to insert invalid data. To avoid ref integrity errors your code MUST know which data is which. Therefore, referential integrity is useless.
Walter Mitty said "In order to validate referential integrity for a new insert, you have to do a probe into the database to verify that the reference is valid." Sigh... this is complete nonsense. If I have a Customer object in the session (that's memory, aka RAM for some of you fellas), I know the Customer's ID and can use it to insert a SalesOrder object. There is no need to look up the Customer.
I am on a system now with tight referential integrity and Hibernate wrapped around it with its gross tentacles. It's the slowest system I have ever seen. I did not design it, and if I had, it would be many times faster AND easier to maintain. Hibernate sucks.
