Idioms or algorithms for distributed transactions?

Imagine you have two entities on different systems and need to perform some sort of transaction that alters one or both of them, based on information associated with one or both, with the requirement that either the changes to both entities complete or neither does.
A simple example, which essentially has to run these two lines on two separate pieces of hardware:
my_bank.my_account -= payment
their_bank.their_account += payment
Presumably there are algorithms or idioms that exist specifically for this sort of situation, working correctly (for some predictable definition of correct) in the presence of other attempted access to the same values. The two-phase commit protocol seems to be one such approach. Are there any simpler alternatives, perhaps with more limitations? (E.g. perhaps they require that no system can shut down entirely or fail to respond.) Or are there more complex ones that are better in some way? And is there a standard or well-regarded text on the matter?

There's also 3PC, the three-phase commit protocol. 3PC solves some of the issues of 2PC by adding an extra phase called pre-commit. A participant in the transaction receives a pre-commit message so it knows that all the other participants have agreed to commit but have not yet done so. This phase removes the uncertainty in 2PC when all participants are waiting for either a commit or an abort message from the coordinator.
AFAIK, most databases work just fine with the 2PC protocol, because in the unlikely event that it fails, they always have the transaction logs to undo/redo operations and leave the data in a consistent state.
Most of this stuff is discussed very well in "Database Solutions" (second edition) and "Database Systems: The Complete Book".
Moving further into the distributed world, you might want to check the current state of Web Service technology for distributed transactions and workflows. Not my cup of tea, to be honest. There are frameworks for Python, Java and .NET to run this kind of service (an example).
As my final-year project, some years ago, I implemented a distributed 2PC protocol on top of Web Services, and I was able to run transactions on two separate databases, just like the example you gave. However, I am sure that today people implement this in a more RESTful fashion; for instance, see here. Even though some other protocols are mentioned in these links, in the end they all end up implementing 2PC.
In summary, a 2PC protocol implementation with proper operation logs to undo/redo in case of a crash is one of the most sensible options to go for.


When exactly do you use a consensus algorithm in a distributed system?

As I understand it, in a distributed system we are supposed to handle network partition failures, which is solved by using multiple copies of the same data.
Is this the only place where we use a consensus algorithm?
What is the difference between 2PC/3PC/Paxos? (Is Paxos a modified version of 3PC? If so, are 2PC and 3PC also kinds of consensus algorithms?)
Network partition is not "solved" by having many copies of the same data, although redundancy is of course essential to deal with any sort of failure :)
There are many other problems regarding network partition. Generally, to increase tolerance for network partitions you use an algorithm that relies on a quorum rather than total communication; in the quorum approach you can still make progress on one side of the partition as long as f+1 nodes out of 2f+1 are reachable. Paxos, for example, uses the quorum approach. It is quite clear that a protocol like 2PC cannot make progress in the case of any type of network partition, since it requires "votes" from all of the nodes.
What is the difference between 2PC/3PC/Paxos? (Is Paxos a modified version of 3PC? If so, are 2PC and 3PC also kinds of consensus algorithms?)
2PC/3PC/Paxos are all variants of consensus protocols, although 2PC and 3PC are often described as handling the more specific scenario of "atomic commit in a distributed system", which is essentially a consensus problem. 2PC, 3PC and Paxos are similar but different. You can easily find detailed information about each algorithm on the web.
Is this the only place where we use a consensus algorithm?
Consensus protocols have many, many use cases in distributed systems, for example: atomic commit, atomic broadcast, leader election, or basically any algorithm that requires a set of processes to agree upon some value or action.
Caveat: consensus protocols and the related problems of distributed systems are not trivial, and you'll need to do some reading to get a deep understanding. If you are comfortable reading academic papers, you can find most of the famous ones available online, for example "Paxos Made Simple" by Leslie Lamport, or you can find good blog posts by googling. The wiki article on Paxos is also of very good quality, in my opinion!
Hope that answers some of your questions, although I probably introduced you to even more! (If you're interested, you have some research to do.)

Two-phase commit: availability, scalability and performance issues

I have read a number of articles and got confused.
Opinion 1:
2PC is very efficient; a minimal number of messages are exchanged and latency is low.
Source:
http://highscalability.com/paper-consensus-protocols-two-phase-commit
Opinion 2:
It is very hard to scale distributed transactions to high volumes; moreover, they reduce throughput. Because 2PC guarantees ACID, it puts a great burden on the system due to its complex coordination algorithm.
Source: http://ivoroshilin.com/2014/03/18/distributed-transactions-and-scalability-issues-in-large-scale-distributed-systems/
Opinion 3:
“some authors have claimed that two-phase commit is too expensive to support, because of the performance or availability problems that it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.”
Source: http://courses.cs.washington.edu/courses/csep552/13sp/lectures/6/spanner.pdf
Opinion 4:
The 2PC coordinator also represents a single point of failure, which is unacceptable for critical systems (I believe the coordinator is the single point of failure here).
Source: http://www.addsimplicity.com/adding_simplicity_an_engi/2006/12/2pc_or_not_2pc_.html
The first three opinions contradict each other. The fourth one, I think, is correct. Please clarify what is wrong and what is correct, and it would be great to give the facts behind it too.
The 4th statement is correct, but maybe not in the way you are reading it. In 2PC, if the coordinator fails, the system cannot make progress. It is therefore often desirable to use a fault-tolerant protocol like Paxos (see Gray and Lamport, for example), which will allow the system to make progress safely despite failures.
Opinion 3 should be read in the context of the rest of the Spanner paper. The authors are saying that they have developed a system which allows efficient transactions in a distributed database, and that they think it is the right default tradeoff for users of the system. The way Spanner does that is well detailed in the paper, and it is worth reading. Note that Spanner is simply a way (a clever way, granted) of organizing the coordination which is inherently required to implement serializable transactions. See Gilbert and Lynch for one way to look at the limits on coordination.
Opinion 2 is a common belief, and there are indeed tradeoffs between availability and richness of transaction semantics in real-world distributed systems. Current research, however, is making it clear that these tradeoffs are not as dire as they have been portrayed in the past. See this talk by Peter Bailis for one of the research directions. If you want true serializability or linearizability in the strictest sense, you need to obey certain lower bounds on coordination in order to achieve them.
Opinion 1 is technically true, but not very helpful in the way you quoted it. 2PC is optimal in some sense (for N participants it needs only a prepare/vote round-trip followed by a decision message, so the message count grows linearly in N), but it is seldom implemented naively because of the availability tradeoffs. Many ad hoc attempts to address those tradeoffs lead to incorrect protocols. Others, like Paxos and Raft, successfully address them at the cost of some complexity.

SQL Server Bi-Directional Transactional Replication - Is it a good use-case?

We're having a problem scaling out with SQL Server, largely for a couple of reasons: 1) poorly designed data structures, and 2) heavy lifting and business/processing logic all done in T-SQL. This was verified by a Microsoft SQL guy from Redmond whom we hired to perform an analysis of our server. We're literally solving issues by continually increasing the command timeout, which is ridiculous and not a good long-term solution. We have since put together the following strategy and set of phases:
Phase 1: Throw hardware/software at the issue to stop the bleeding.
This includes a few different things like a caching server, but what I'd like to ask everyone here about is specifically related to implementing bi-directional transactional replication on a new SQL server. We have two use-cases for wanting to implement this:
We were thinking of running the long-running (and table/row-locking) SELECTs on this new SQL "processing box", throwing the results into a caching layer, and having the UI read them from the cache. These SELECTs generate reports and also return results on the web.
Most of the business logic is in SQL. We have some long-running queries for SELECTs, INSERTs, UPDATEs, and DELETEs which perform processing logic. The end result is really just a handful of INSERTs, UPDATEs, and DELETEs after the processing is complete (lots of cursors). The thought would be to balance the load between these two servers.
I have some questions:
Are these good use-cases for bi-directional transactional replication?
I need to ensure that this solution is going to "just work" and that I won't have to worry about conflicts. Where would conflicts arise within this solution? I have read a few articles about offsetting the seed and increment of your identity columns in order to prevent collisions, which makes sense (see the sketch after this list of questions), but how does it handle UPDATEs/DELETEs or other places where conflicts might occur?
What other issues might I run into and we need to watch out for?
Is there a better solution to this problem?
Phase 2: Rewrite the logic in .NET, where it belongs, and optimize the SQL stored procedures to perform only set-based operations, as they should.
This will obviously take a while, which is why we wanted to see if there were some preliminary steps we could take to stop the pain our users are experiencing.
Thanks.
IMHO, bidirectional replication is very, very far from "it will just work". Preventing update conflicts requires exquisite planning, ensuring that all that "processing" is carefully orchestrated so it never works on overlapping data. Master-master replication is one of the most difficult solutions to pull off.
Consider this: you envision a solution that provides a cheap 2x scale-out with nearly no code modification. Such a solution would be quite useful; one would expect to see it deployed everywhere. Yet it is nowhere to be seen.
I recommend you search for the many blogs and articles describing the gotchas and warnings around (the much more popular) MySQL master-master deployments (e.g. If You Must Deploy Multi-Master Replication, Read This First), and judge for yourself whether the trouble is worth it.
I don't have all the details you do, but I would focus on the application. If you want to just throw money at the problem short term, I would make sure the cheap scale-up options are exhausted before considering scale-out (SSDs/Fusion drives, more RAM). Also investigate the snapshot isolation level / read committed snapshot first, if locking is the main issue.

What are the pros and cons of using database for IPC to share data instead of message passing?

To be more specific for my application: the shared data is mostly persistent data such as monitoring status and configuration (not more than a few hundred items), and it is updated and read frequently but at no more than 1 or 2 Hz. The processes are local to each other on the same machine.
EDIT 1: more info
Processes are expected to poll on a set of data they are interested in (i.e. monitoring).
Most of the data is persistent during the lifetime of the program, but some (e.g. configs) must be restored after a software restart.
Data is updated by its owner only (assume one owner for each piece of data).
The number of processes is small too (not more than 10).
Although using a database is notably more reliable and scalable, it has always seemed to me like overkill, or too heavyweight, when all I am doing is sharing data within an application. Message passing, e.g. with JMS, also has a middleware part, but it is more lightweight and has a more natural and flexible communication API. Implementing event notification and the command pattern is also easier with messaging, I think.
It would be of great help if anyone could give me an example of the circumstances under which one is preferable to the other.
For example, I know we can more readily share persistent data between processes using a database, although it is also doable with messaging, by distributing the data across processes and/or storing it in XML files.
According to http://en.wikipedia.org/wiki/Database-as-IPC and http://tripatlas.com/Database_as_an_IPC, using a database in place of message passing is an anti-pattern, but neither elaborates on, for example, how bad the performance hit of a database can be compared to message passing.
I have gone through several previous posts that asked a similar question, but I am hoping to find an answer that focuses on design justification. From the questions I have read so far, I can see that a lot of people do use a database for IPC (or implement a message queue with a database).
Thanks!
I once wrote a data acquisition system that ran on about 20 lab computers. All the communication between the machines took place through small tables on a MySQL database. This scheme worked great, and as the years went by and the system expanded everything scaled well and remained very responsive. It was easy to add features and fix bugs. You could debug it easily because all the message passing was easy to tap into.
What the database does for you is provide a fast, well-debugged way of maintaining concurrency while the network traffic gets very busy. Someone at MySQL spent a lot of time making the network stuff work well under high load, and if you write your own system using TCP/IP sockets you're going to be re-inventing those wheels at great expense of time and effort.
Another advantage is that if you're using the database anyway for actual data storage then you don't have to add anything else to your system. You keep the possible points of failure to a minimum.
So all these people who say IPC through databases is bad aren't always right.
Taking into account that a DBMS is a way to store information and messages are a way to transport information, your decision should be based on the answer to the question: "do I need the data to persist over time, or is it consumed by the recipient?"

How to share data across an organization

What are some good ways for an organization to share key data across many departments and applications?
To give an example, let's say there is one primary application and database to manage customer data. There are ten other applications and databases in the organization that read that data and relate it to their own data. Currently this data sharing is done through a mixture of database (DB) links, materialized views, triggers, staging tables, re-keying information, web services, etc.
Are there any other good approaches for sharing data? And, how do your approaches compare to the ones above with respect to concerns like:
duplicate data
error prone data synchronization processes
tight vs. loose coupling (reducing dependencies/fragility/test coordination)
architectural simplification
security
performance
well-defined interfaces
other relevant concerns?
Keep in mind that the shared customer data is used in many ways, from simple, single record queries to complex, multi-predicate, multi-sort, joins with other organization data stored in different databases.
Thanks for your suggestions and advice...
I'm sure you saw this coming: "It depends".
It depends on everything. And the solution for sharing Customer data with department A may be completely different from the one for sharing Customer data with department B.
My favorite concept that has risen up over the years is "Eventual Consistency". The term came from Amazon's work on distributed systems.
The premise is that while the state of data across a distributed enterprise may not be perfectly consistent now, it "eventually" will be.
For example, when a customer record gets updated on system A, system B's customer data is now stale and not matching. But, "eventually", the record from A will be sent to B through some process. So, eventually, the two instances will match.
When you work with a single system, you don't have "EC"; rather, you have instant updates, a single "source of truth", and, typically, a locking mechanism to handle race conditions and conflicts.
The more your operations are able to work with "EC" data, the easier it is to separate these systems. A simple example is a data warehouse used by sales. They use the DW to run their daily reports, but they don't run their reports until the early morning, and they always look at "yesterday's" (or earlier) data. So there's no real-time need for the DW to be perfectly consistent with the daily operations system. It's perfectly acceptable for a process to run at, say, close of business and move over the day's transactions and activities en masse in one large, single update operation.
You can see how this requirement can solve a lot of issues. There's no contention for the transactional data, no worry that some report's data is going to change in the middle of accumulating a statistic because the report made two separate queries to the live database, and no need for the high-detail chatter to suck up network and CPU processing during the day.
Now, that's an extreme, simplified, and very coarse example of EC.
But consider a large system like Google. As consumers of Search, we have no idea when or how long it takes for a search result that Google harvests to show up on a search page. 1 ms? 1 s? 10 s? 10 hrs? It's easy to imagine how, if you're hitting Google's West Coast servers, you may very well get a different search result than if you hit their East Coast servers. At no point are these two instances completely consistent. But by and large they are mostly consistent, and for their use case their consumers aren't really affected by the lag and delay.
Consider email. A wants to send a message to B, but in the process the message is routed through systems C, D, and E. Each system accepts the message, assumes complete responsibility for it, and then hands it off to the next. The sender sees the email go on its way. The receiver doesn't really miss it, because they don't necessarily know it's coming. So there is a big window of time during which that message can move through the system without anyone concerned knowing or caring how fast it is.
On the other hand, A could have been on the phone with B. "I just sent it, did you get it yet? Now? Now? Get it now?"
Thus, there is some kind of underlying, implied level of performance and response. In the end, "eventually", A's outbox matches B's inbox.
These delays, and the acceptance of stale data, whether it's a day old or 1-5 s old, are what control the ultimate coupling of your systems. The looser this requirement, the looser the coupling, and the more flexibility you have at your disposal in terms of design.
This is true right down to the cores in your CPU. Modern multi-core, multi-threaded applications running on the same system can have different views of the "same" data, only microseconds out of date. If your code can work correctly with data that is potentially inconsistent, then happy days: it zips along. If not, you need to pay special attention to ensuring your data is completely consistent, using techniques like volatile memory qualifiers or locking constructs, all of which, in their way, cost performance.
So, this is the base consideration. All of the other decisions start here. Answering this can tell you how to partition applications across machines, what resources are shared, and how they are shared. What protocols and techniques are available to move the data, and how much it will cost in terms of processing to perform the transfer. Replication, load balancing, data shares, etc. etc. All based on this concept.
Edit, in response to first comment.
Correct, exactly. The game here, for example: if B can't change customer data, then what is the harm in B holding stale customer data? Can you "risk" it being out of date for a short time? Perhaps your customer data changes slowly enough that you can replicate it from A to B almost immediately. Say the change is put on a queue that, because of low volume, gets picked up readily (< 1 s); even so, it would be "out of transaction" with the original change, and so there's a small window where A has data that B does not.
Now the mind really starts spinning. What happens during that 1 s of "lag"? What's the worst possible scenario? And can you engineer around it? If you can engineer around a 1 s lag, you may be able to engineer around a 5 s, 1 m, or even longer lag. How much of the customer data do you actually use on B? Maybe B is a system designed to facilitate order picking from inventory. It's hard to imagine anything more being necessary than a customer ID and perhaps a name, just something to grossly identify who the order is for while it's being assembled.
The picking system doesn't necessarily need to print out all of the customer information until the very end of the picking process, and by then the order may have moved on to another system, one that is perhaps more current with, especially, shipping information, so in the end the picking system hardly needs any customer data at all. In fact, you could EMBED and denormalize the customer information within the picking order, so there's no need or expectation of synchronizing later. As long as the customer ID is correct (and that will never change anyway) and the name (which changes so rarely it isn't worth discussing), that's the only real reference you need, and all of your pick slips are perfectly accurate at the time of creation.
The trick is the mindset: breaking the systems up and focusing on the essential data that is necessary for the task. Data you don't need doesn't have to be replicated or synchronized. Folks chafe at things like denormalization and data reduction, especially when they come from the relational data modeling world, and with good reason: it should be approached with caution. But once you go distributed, you have implicitly denormalized. Heck, you're copying it wholesale now. So you may as well be smarter about it.
All of this can be mitigated through solid procedures and a thorough understanding of workflow. Identify the risks and work up policies and procedures to handle them.
But the hard part is breaking the chain to the central DB at the beginning, and instructing folks that they can't "have it all" the way they may expect when there is a single, central, perfect store of information.
This is definitely not a comprehensive reply. Sorry for the long post; I hope it adds to the thoughts presented here.
I have a few observations on some of the aspects that you mentioned.
duplicate data
It has been my experience that this is usually a side effect of departmentalization or specialization. A department pioneers the collection of certain data that other specialized groups then see as useful. Since those groups don't have unique access to this data, which is intermingled with other data they don't collect, in order to utilize it they too start collecting and storing it, inherently duplicating it. This issue never goes away, and just as there is a continuous effort to refactor code and remove duplication, there is a need to continuously bring duplicate data together for centralized access, storage and modification.
well-defined interfaces
Most interfaces are defined with good intentions, keeping other constraints in mind. However, we simply have a habit of growing out of the constraints placed by previously defined interfaces. Again, a case for continuous refactoring.
tight coupling vs loose coupling
If anything, most software is plagued by this issue. Tight coupling is usually the result of an expedient solution built under the time constraints we face. Loose coupling incurs a certain degree of complexity, which we dislike when we want to get things done. The web services mantra has been doing the rounds for a number of years, and I have yet to see a good example of a solution that completely alleviates the problem.
architectural simplification
To me this is the key to fighting all the issues you have mentioned in your question. The SIP vs. H.323 VoIP story comes to mind. SIP is very simple and easy to build with, while H.323, like a typical telecom standard, tried to envisage every issue on the planet around VoIP and provide a solution for it. The end result: SIP grew much more quickly, while being an H.323-compliant solution is a pain. In fact, H.323 compliance is a mega-buck industry.
On a few architectural fads that I have grown to like over the years:
I have come to like the REST architecture for its simplicity. It provides simple, uniform access to data and makes it easy to build applications around it. I have seen enterprise solutions suffer more from the duplication, isolation and accessibility of data than from any other issue, such as performance. REST, to me, provides a panacea for some of those ills.
To solve a number of those issues, I like the concept of central "Data Hubs". A Data Hub represents the "single source of truth" for a particular entity, but only stores IDs, no information like names. In fact, it only stores ID maps: for example, these map the Customer ID in system A to the Client Number from system B and to the Customer Number in system C. Interfaces between the systems use the hub to know how to relate information in one system to the other.
It's like a central translation service: instead of having to write specific mapping code for A->B, A->C, and B->C, with the attendant quadratic blow-up in mappings as you add more systems, you only need to convert to/from the hub: A->Hub, B->Hub, C->Hub, D->Hub, etc.
