Best practices of distributed transactions (Java) [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
Please share your experience with distributed transactions. What Java frameworks would you advise using?

The practice of distributed transactions is far behind the theory. Only one approach to distributed transactions (2PC) has a large variety of libraries and frameworks to choose from; the others you have to implement on your own. Unfortunately, 2PC is also the least advanced algorithm, so by choosing it you sacrifice partition tolerance and performance for convenience and speed of development.
Let's take a look at the major algorithms in the area of distributed transactions. All of them allow you to run transactions that span multiple data sources.
Two-phase commit algorithm (2PC)
2PC is the most developed algorithm. It's the heart of the X/Open XA standard, which models generic 2PC-based distributed transaction processing and formalizes the interaction between clients, a coordinator and resources. XA means vendors don't have to integrate their solution with every other solution individually; they just follow the standard and get the integration for free. JTA is the Java interface to the X/Open XA model.
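As a rough illustration, here is a minimal sketch of driving a distributed transaction through JTA. The UserTransaction lookup is the standard JTA API; the two XA-capable DataSources (ordersDs, paymentsDs) and the SQL are made-up placeholders, and in practice they would be configured in your container or standalone transaction manager (e.g. Narayana, Atomikos, Bitronix).

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

public class JtaSketch {
    // ordersDs and paymentsDs are hypothetical XA-aware DataSources obtained
    // from the container; the transaction manager enlists both automatically.
    void transfer(DataSource ordersDs, DataSource paymentsDs) throws Exception {
        UserTransaction utx =
                (UserTransaction) new InitialContext().lookup("java:comp/UserTransaction");
        utx.begin();
        try (Connection orders = ordersDs.getConnection();
             Connection payments = paymentsDs.getConnection()) {
            orders.createStatement()
                  .executeUpdate("INSERT INTO orders(id, amount) VALUES (42, 100)");
            payments.createStatement()
                  .executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 7");
            utx.commit();   // the coordinator runs 2PC across both resources
        } catch (Exception e) {
            utx.rollback(); // either both branches commit or both roll back
            throw e;
        }
    }
}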
Some problems of 2PC come from the fact that the coordinator is a single point of failure. If it is down, the system is unavailable; if there is a network partition and the coordinator happens to be in a different partition than the clients and resources, the system is also unavailable.
Another problem of the algorithm is its blocking nature: once a resource has sent an agreement message to the coordinator, it will block until a commit or rollback is received. As a result, the system can't use the full potential of its hardware.
Percolator's transactions
Percolator's transactions are distributed serializable optimistic transactions. They were introduced in the Large-scale Incremental Processing Using Distributed Transactions and Notifications paper by Google and later were implemented in Amazon's transaction library for DynamoDB and in the CockroachDB database.
Unlike 2PC, Percolator's transactions:
don't require a coordinator, so they keep working during a network partition as long as the client executing a transaction and the resources it touches end up in the same partition
use a technique similar to lock-free algorithms, so they are non-blocking and make better use of the cluster's hardware
It's very handy that Percolator's transactions can be implemented on the client side. The only requirement is that the data sources must be linearizable and provide a compare-and-set operation. The downside is that under contention concurrent transactions can abort each other.
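To make the compare-and-set requirement concrete, here is a small sketch of the kind of primitive each data source has to offer for a client-side, Percolator-style commit to be layered on top; the KeyValueStore interface and its methods are invented for the example.

// Hypothetical minimal store interface: a linearizable read plus an atomic CAS
// is all a Percolator-style client-side transaction needs from each data source.
interface KeyValueStore {
    String get(String key);
    boolean compareAndSet(String key, String expectedValue, String newValue);
}

class PercolatorStyleWrite {
    // Try to place a write intent: succeeds only if the cell is unchanged since
    // the snapshot read. If the CAS fails, a concurrent transaction won the race
    // and this transaction has to abort (the caller may retry it).
    static boolean tryWriteIntent(KeyValueStore store, String key,
                                  String snapshotValue, String intentValue) {
        return store.compareAndSet(key, snapshotValue, intentValue);
    }
}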
You can take a look at the visualization of Percolator's transactions to understand how they work.
RAMP transactions
RAMP transactions are distributed transactions with Read Committed isolation. They were introduced in the Scalable Atomic Visibility with RAMP Transactions paper by Peter Bailis. They are pretty new, so they haven't made it into any database yet, but there are rumors that Cassandra may support them. Facebook also reported that they are working on the Apollo database, which uses Paxos for replication and CRDTs & RAMP for cross-shard transactions.
Like 2PC, RAMP transactions require coordinator-like servers, but unlike 2PC there can be any number of such servers, so there is no availability impact.
Just like Percolator's transactions, RAMP uses a non-blocking approach, and the relaxed isolation level helps it avoid contention issues and achieve impressive performance; see the paper for the details.
RAMP also has the same requirements on the storage as Percolator's transactions: linearizability and a compare-and-set operation.
You can take a look at the visualization of RAMP transactions to understand how they work.

I would start, for example, here:
https://en.wikipedia.org/wiki/Java_Transaction_API#Open_source_JTA_implementations
Working as QA on the Narayana project, I would personally recommend http://narayana.io

Related

Isolation Level vs Optimistic Locking - Hibernate, JPA

I have a web application where I want to ensure concurrency with a DB-level lock on the object I am trying to update. I want to make sure that a batch change, another user or another process does not end up introducing inconsistency in the DB.
I see that isolation levels ensure read consistency, and an optimistic lock with a @Version field can ensure data is written in a consistent state.
My question is: can't we ensure consistency with the isolation level only? By making any transaction that updates the record Serializable (not considering performance), won't I ensure that a proper lock is taken by the transaction, and that any other transaction trying to update or acquire a lock on this record will fail?
Do I really need version or timestamp management for this?
Depending on the isolation level you've chosen, a specific resource is going to be locked until the given transaction commits or rolls back - it can be a lock on a whole table, a row, or a block of SQL. That's pessimistic locking, and it's enforced at the database level when running a transaction.
Optimistic locking, on the other hand, assumes that multiple transactions rarely interfere with each other, so no locks are required in this approach. It is an application-side check that uses the @Version attribute in order to establish whether the version of a record has changed between fetching it and attempting to update it.
It is reasonable to use the optimistic locking approach in web applications, as most operations span multiple HTTP requests. Usually you fetch some information from the database in one request and update it in another. It would be very expensive and unwise to keep transactions open, with locks on database resources, for that long. That's why we assume that nobody is going to touch the set of data we're working on - it's cheaper. If the assumption turns out to be wrong and the version has been changed between requests by someone else, Hibernate won't update the row and an OptimisticLockException will be thrown. As a developer, you are responsible for handling this situation.
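For reference, a minimal sketch of what that looks like with JPA/Hibernate; the Item entity, its fields and the service code are invented for the example, and the update is assumed to run inside an active transaction.

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.OptimisticLockException;
import javax.persistence.Version;

@Entity
public class Item {
    @Id
    private Long id;

    private String description;

    @Version               // bumped on every successful update and checked in the
    private long version;  // WHERE clause of the generated UPDATE statement

    public void setDescription(String description) { this.description = description; }
}

class ItemService {
    // detachedItem was loaded in an earlier request; this runs in a later one.
    void updateDescription(EntityManager em, Item detachedItem, String newText) {
        try {
            detachedItem.setDescription(newText);
            em.merge(detachedItem);
            em.flush();    // UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?
        } catch (OptimisticLockException e) {
            // the row changed since it was fetched: reload and inform the user
        }
    }
}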
Simple example: an online auction service - you're viewing an item page. You read its description and specification. All of it takes, let's say, 5 minutes. With pessimistic locking and some isolation levels you'd block other users from this particular item page (or even from all items!). With optimistic locking everybody can access it. After reading about the item you're willing to bid on it, so you click the proper button. If any other user watching this item changed its state in the meantime (the owner edited the description, someone else bid on it), you will probably (depending on the app's implementation) be informed about the changes before the application accepts your bid, because the version you've got is no longer the same as the version persisted in the database.
Hope that clarifies a few things for you.
Unless we are talking about some small, isolated web application (the only app working on the database), making all of your transactions Serializable would mean having a lot of confidence in your design, and it ignores the fact that yours may not be the only application hitting that particular database.
In my opinion, adopting the Serializable isolation level - in other words, a pessimistic lock - should be a very well thought-out decision, applied for:
Large databases and short transactions that update only a few rows
Where the chance that two concurrent transactions will modify the same rows is relatively low.
Where relatively long-running transactions are primarily read-only.
Based on my experience, in most cases just using optimistic locking is the most beneficial decision, as frequent concurrent modifications of the same rows happen in only a small percentage of cases.
Optimistic locking definitely also helps other applications run faster (don't think only of yourself!).
So on the pessimistic-to-optimistic locking spectrum, in my opinion the truth lies somewhere closer to optimistic locking, with a flavor of Serializable here and there.
I really cannot reference anything here, as the answer is based on my personal experience with many complex web projects and on my notes from when I was preparing for my JPA certificate.
Hope that helps.

Which built-in Postgres replication fits my Django-based use-case best?

I've noticed that Postgres now has built-in replication, including synchronous replication, streaming replication and some other variants. It even provides the ability to control synchrony for specific operations at the application level (e.g., use synchronous replication for important stuff like money transfers, but maybe don't for less critical things like user comments, etc.).
I'm working on software using Django 1.5 (i.e., the development version) and will possibly need synchronous replication (commerce-related transactions will be going on).
Do you think the built-in tools are best for the job in most cases, and do you have any thoughts on one variant of the built-in replication versus another, regarding ease of use, quality, etc.?
One final thing: Slony and PGPool II seem to be pretty popular (Slony, in particular) for replication. Is there A) a particular, technical reason for their popularity over built-in replication, or B) is it just because a lot of people are using versions that don't have built-in replication, or C) have I been under a rock and PG built-in replication is already quite popular?
Update (more details)
I have only 2 physical servers, and they're located in the same rack. My intention is to provide a slave which can automatically turn into the master should something go catastrophically wrong on one machine (or even something simple like a double power supply failure). I don't mind if my clients experience downtime during an automatic failover, as long as the downtime is a few minutes or so, not an hour.
I would like zero data loss, and am willing to spend more time in the failover process for that. Is there a way to make this trade-off without going for synchronous replication (e.g., streaming logs without write-back confirmation or something)?
What strategy or variant of replication would you recommend?
I think you misunderstand the benefits and cost of synchronous commits on replication. In PostgreSQL, replication works by bringing slaves up to date with the master, using the standard crash recovery features of PostgreSQL. In the event that, for example, the power goes out, you can be sure that the write-ahead log segments will be replayed on both master and slave. With asynchronous commit, the commit is written to the WAL, the application is notified and the slave is notified, all more or less at the same time depending on network characteristics, etc.
Where synchronous commit comes in handy is where two things are true:
You have more than one slave (this is critical!) and
You need durability guarantees that asynchronous commits can't offer you.
With synchronous commit, the database waits until it hears back from a configurable number of slaves to tell the application that the commit has happened. This offers durability guarantees in a few cases where asynchronous commits are unable to work.
For example, suppose your master server takes a bullet through a RAID array and immediately crashes (sorry, I couldn't think of a better example with good hardware). Or suppose someone trips on a power cord and not only powers off the server but corrupts the software RAID device. In this case it is possible that a couple of transactions have not been replicated and your WAL is unrecoverable, so those transactions are lost. With synchronous commit, the application would have waited until the durability guarantees were met.
One thing this means is that if you do synchronous commit with only one slave, your availability cannot outlast a crash of either master or slave, so your availability will be worse than it would have been with just one server. It also means that if your slave is geographically removed, you introduce significant latency in your application's requests to commit transactions.
Switching from async to sync commit is not a big change, but generally I think you get the most out of sync commit when you have already done as much as you can, assurance- and availability-wise, with your hardware. Start with async and move up when you can't optimize your setup any further as async.
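Relevant to the per-operation control mentioned in the question: synchronous_commit can be toggled per transaction in Postgres, so the trade-off can be made call by call from the application. A sketch via JDBC follows; the connection string and tables are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PerTransactionSyncCommit {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret")) {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                // Important write: wait for the standby to confirm before returning.
                st.execute("SET LOCAL synchronous_commit = on");
                st.executeUpdate("INSERT INTO payments(order_id, amount) VALUES (42, 100)");
                conn.commit();

                // Less critical write: don't wait for replication confirmation.
                st.execute("SET LOCAL synchronous_commit = off");
                st.executeUpdate("INSERT INTO comments(user_id, body) VALUES (7, 'nice!')");
                conn.commit();
            }
        }
    }
}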
Re: "Slony and PGPool II seem to be pretty popular (Slony, in particular) for replication. Is there A) a particular, technical reason for their popularity over built-in replication or B) is it just because a lot of people are using versions that don't have built-in replication, or C) have I been under a rock and PG built-in replication is already quite popular?"
Slony is popular because it has been around for quite a long time, and the built-in PostgreSQL replication is relatively new. Cascading replication built in to PostgreSQL is even newer, and is something that Slony-I was built with.
There are two main advantages to Slony-I. First, you can replicate between differing versions of PostgreSQL, whereas with the built-in replication system the servers not only must use the same version but must also be binary compatible. The other advantage is that with Slony-I you can replicate only certain tables instead of the whole database cluster. The disadvantages of Slony-I are numerous and include poor user-friendliness, no synchronous commits, and difficult DDL (schema) changes. I believe that use of the built-in replication in Postgres will quickly exceed the Slony-I user base, if it hasn't already done so.
As far as I remember, PGPool II uses statement-based replication (like what MySQL has had built-in), and is definitely the least desirable, in my opinion.
I would use the built-in hot standby/streaming replication in PostgreSQL. You can start with synchronous commit turned on and turn it off if you don't need it or the penalty is too high, or vice versa. Over a LAN, asynchronous mode seems to reach the slave in the order of a hundred milliseconds or so (from my own experience).

Distributed store with transactions

I currently develop an application hosted on Google App Engine. However, GAE has many disadvantages: it's expensive and very hard to debug, since we can't attach to real instances.
I am considering moving from GAE to an open source alternative. Unfortunately, none of the existing NoSQL solutions that otherwise satisfy me support transactions similar to GAE's (GAE supports transactions inside entity groups).
What do you think about solving this problem? I am currently considering a store like Apache Cassandra plus some locking service (Hazelcast) for transactions. Does anyone have experience in this area? What can you recommend?
There are plans to support entity groups in Cassandra in the future; see CASSANDRA-1684.
If your data can't easily be modelled without transactions, is it worth using a non-transactional database? Do you need the scalability?
The standard way to do transaction-like things in Cassandra is described in this presentation, starting at slide 24. Basically you write something similar to a WAL log entry to one row, then perform the actual writes on multiple rows, then delete the WAL log row. On failure, simply read the WAL log and replay its actions. Since all Cassandra writes have a user-supplied timestamp, all writes can be made idempotent: just store the timestamp of your write with the WAL log entry.
This strategy gives you the Atomicity and Durability of ACID, but you do not get Consistency or Isolation. If you are working at a scale that requires something like Cassandra, you probably need to give up full ACID transactions anyway.
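Here is a rough sketch of that WAL-row pattern, written against the DataStax Java driver purely for illustration; the keyspace, the tx_log table and its layout are made up, and a real implementation would also need a recovery job that scans tx_log and replays unfinished entries.

import com.datastax.oss.driver.api.core.CqlSession;

public class WalRowPattern {
    public static void main(String[] args) {
        // User-supplied write timestamp (microseconds) makes replays idempotent.
        long ts = System.currentTimeMillis() * 1000;

        try (CqlSession session = CqlSession.builder().withKeyspace("shop").build()) {
            // 1. Record the intended writes in a single log row first.
            session.execute("INSERT INTO tx_log (tx_id, payload, ts) "
                    + "VALUES (42, 'debit 7 / credit 9', " + ts + ")");

            // 2. Perform the actual writes with the same explicit timestamp.
            session.execute("UPDATE accounts USING TIMESTAMP " + ts
                    + " SET balance = 100 WHERE id = 7");
            session.execute("UPDATE accounts USING TIMESTAMP " + ts
                    + " SET balance = 200 WHERE id = 9");

            // 3. Delete the log row; on a crash before this point, recovery
            //    re-reads tx_log and replays step 2 (safe thanks to the timestamp).
            session.execute("DELETE FROM tx_log WHERE tx_id = 42");
        }
    }
}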
You may want to try AppScale or TyphoonAE for hosting applications built for App Engine on your own hardware.
If you are developing under Python, you have very interesting debugging options with the Werkzeug debugger.

Queues against Tables in messaging systems [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
I've experienced the good and the bad sides of messaging systems in real production environments, and I must admit that a well-organized table or schema of tables simply beats any other form of messaging queue every time, because:
Data is permanently stored in a table. I've seen so many Java (JMS) applications that lose messages along the way due to uncaught exceptions or other bugs.
Queues tend to fill up. DB storage, by contrast, is virtually infinite.
Tables are easily accessible, while you have to use esoteric tools to read from a queue.
What's your opinion on each approach?
The phrase "beats every time" totally depends on what your requirements were to begin with. It certainly isn't going to beat every time for everyone.
If you are building a single system which is already using a database, you don't have very high throughput requirements, and you don't have to communicate with any other teams or systems, then you're probably right.
For simple, low-throughput, mostly single-threaded stuff, databases are a totally fine alternative to message queues.
Where a message queue shines is when
you want a high-performance, highly concurrent and scalable load balancer, so you can process tens of thousands of messages per second concurrently across many servers/processes (using a database table you'd be lucky to process a few hundred per second, and processing with multiple threads is pretty hard, as one process will tend to lock the message queue table)
you need to communicate between different systems using different databases (so you don't have to hand out write access to your system's database to folks in other teams, etc.)
For simple systems with a single database, team and fairly modest performance requirements - sure use a database. Use the right tool for the job etc.
However where message queues shine is in large organisations where there are lots of systems that need to communicate with each other (and so you don't want a business database to be a central point of failure or place of version hell) or when you have high performance requirements.
In terms of performance, a message queue will always beat a database table, as message queues are specifically designed for the job and don't rely on pessimistic table locks (which a database implementation of a queue requires to do the load balancing - see the polling sketch below), and good message queues perform eager loading of messages into queues to avoid the network overhead of a database.
Similarly, you'd never use a database to do load balancing of HTTP requests across your web servers - it'd be too slow. If you have high performance requirements for your load balancer, you wouldn't use a database either.
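To make the table-lock point above concrete, this is roughly what a table-backed queue consumer ends up doing with plain JDBC; the jobs table and its columns are invented for the example, and the FOR UPDATE row lock is exactly the pessimistic lock referred to above.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TableBackedQueueConsumer {
    // Poll one "message" from a hypothetical jobs table. The SELECT ... FOR UPDATE
    // lock prevents two consumers from grabbing the same row - and is also what
    // limits throughput compared to a real message broker.
    static void pollOnce(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, payload FROM jobs WHERE status = 'NEW' "
                     + "ORDER BY id LIMIT 1 FOR UPDATE")) {
            if (rs.next()) {
                long id = rs.getLong("id");
                process(rs.getString("payload"));
                try (PreparedStatement done = conn.prepareStatement(
                        "UPDATE jobs SET status = 'DONE' WHERE id = ?")) {
                    done.setLong(1, id);
                    done.executeUpdate();
                }
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }

    static void process(String payload) { /* handle the message here */ }
}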
I've used tables first, then refactored to a full-fledged message queue when (and if) there was a reason to - which is trivial if your design is reasonable.
The biggest benefits are: a) it's easier, b) it's a better audit trail because you have the other tables to join to, c) if you know the database tools really well, they are easier to use than the message queue tools, d) it's generally a bit easier to set up a test/dev environment in a context that already exists for your app (if the same familiarity applies).
Oh, and e) for perhaps you and others, it's not another product to learn, install, configure, administer, and support.
In my personal experience, it's just as reliable and disconnectable, and you can convert it later if it needs to be more scalable.
Data is permanently stored in a table. I've seen so many Java (JMS) applications that lose messages along the way due to uncaught exceptions or other bugs.
Which JMS implementation? Sun sells a reliable queue which can't lose messages. Perhaps you just purchased a cheesy JMS-compliant product. IBM's MQ is extremely reliable, and there are JMS libraries to access it.
Queues tend to fill up. DB storage, by contrast, is virtually infinite.
Ummm... if your queue fills up, it sounds like something is broken. If your apps crash, that's not a good thing, and queues have little to do with that. If you've purchased a really poor JMS implementation, I can see why you might be unhappy with it. It's a competitive marketplace. Find a better queue manager. Sun's JCAPS has a really good queue manager, formerly the SeeBeyond message queue.
Tables are easily accessible, while you have to use esoteric tools to read from a queue.
That doesn't fit with my experience. Tables are accessed through this peculiar "other language" (SQL), and require that I be aware of structure mappings from tables to objects and data-type mappings from VARCHAR2 to String. Further, I have to use some kind of access layer (JDBC or an ORM which uses JDBC). That seems very, very complex. A queue is accessed through MessageConsumers and MessageProducers using simple sends and receives.
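For comparison, the queue side really is just sends and receives. A minimal JMS sketch follows; the queue name is a placeholder, and the ConnectionFactory would come from your provider (ActiveMQ, IBM MQ, etc.).

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class JmsSketch {
    static void sendAndReceive(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("ORDERS.IN"); // placeholder queue name

            // Producer side: no schema, no mapping layer - just a message.
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage("order-42"));

            // Consumer side: blocks until a message arrives or the timeout expires.
            MessageConsumer consumer = session.createConsumer(queue);
            TextMessage received = (TextMessage) consumer.receive(5000);
            if (received != null) {
                System.out.println("got " + received.getText());
            }
        } finally {
            connection.close();
        }
    }
}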
It sounds as though the problems you've experienced are not inherent to messaging, but rather are artifacts of poorly-implemented messaging systems. Is building messaging systems harder than building database systems? Yes, if all you ever do is build database systems.
Losing messages to uncaught exceptions? That's hardly the fault of the message queue. The applications you're using are poorly engineered. They're removing messages from the queue before processing completes. They're not using transactions, or journalling.
Message queues fill up while DB storage is "virtually infinite"? You talk as though managing disk space were something that databases didn't require. Message queue servers require administration, just like database servers do.
Esoteric instruments to read from a queue? Maybe if you find asynchronous methods esoteric. Maybe if you find serialization and deserialization esoteric. (At least, those are the things I found esoteric when I was learning messaging. Like many seemingly-esoteric technologies, they're actually quite mundane once you understand them, and understanding them is an important part of the seasoned developer's education.)
Aspects of messaging that make it superior to databases:
Asynchronous processing. Message queues notify waiting processes when new messages arrive. To accomplish this functionality in a database, the waiting processes have to poll the database.
Separation of concerns. The communications channel is decoupled from the implementation details of the message content. Only the sender and the receiver need to know anything about the format of the data stream within a given message.
Fault tolerance. Messaging can function when connections between servers are intermittent. Message queues can store messages locally and only forward them to remote servers when the connection is live.
Systems integration. In the Windows world, at least, messaging is built into the operating system. It uses the OS's security model, it's managed through the OS's tools, etc.
If you don't need these things, you probably don't need messaging.
Here's a simple example of an application for messaging: I'm building a system right now where users, distributed across multiple networks, are entering fairly intricate sets of transactions that are used to produce printed output. Output generation is computationally expensive and not part of their workflow; i.e. the users don't care when the output gets generated, just that it does.
So we serialize the transactions into a message and drop it in a queue. A process running on a server grabs messages from the queue, produces the output, and stores the output in an imaging system.
If we used a database as our message store, we'd have to come up with a schema to store a transaction format that right now only the sender and receiver care about, we'd need to make sure every workstation on the network had permanent persistent connections to the database server, we'd have no capacity to distribute this transaction load across multiple servers, and our output server would have to query the database thousands of times a day waiting to see if there were new jobs to process.
Queues provide reliable messaging. The store-and-forward, disconnected nature of queueing make it much more scalable than databases, not to mention more robust.
And queues shouldn't really be used for permanent storage of information - it is best to think of them as temporary inboxes, unlike databases.

Transactions best practices [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
How much do you rely on database transactions?
Do you prefer small or large transaction scopes?
Do you prefer client-side transaction handling (e.g. TransactionScope in .NET) over server-side transactions, or vice versa?
What about nested transactions?
Do you have any tips & tricks related to transactions?
Any gotchas you encountered working with transactions?
All sort of answers are welcome.
I always wrap a transaction in a using statement.
// connection is an open IDbConnection
using (IDbTransaction transaction = connection.BeginTransaction())
{
    // logic goes here.
    transaction.Commit();
}
Once the transaction moves out of scope, it is disposed. If the transaction is still active, it is rolled back. This behaviour protects you from accidentally locking up the database: even if an unhandled exception is thrown, the transaction will still roll back.
In my code I actually omit explicit rollbacks and rely on the using statement to do the work for me. I only explicitly perform commits.
I've found this pattern has drastically reduced record locking issues.
Personally, developing a high-traffic, performance-focused website, I stay away from database transactions whenever possible. Obviously they are sometimes necessary, so I use an ORM and page-level object variables to minimize the number of server-side calls I have to make.
Nested transactions are an awesome way to minimize your calls; I steer in that direction whenever I can, as long as they are quick queries that won't cause locking. NHibernate has been a savior in these cases.
I use transactions on every write operation to the database.
So there are quite a few small "transactions" wrapped in a larger transaction, and basically there is an outstanding transaction count in the nesting code. If there are any outstanding children when you end the parent, it's all rolled back.
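That counting idea looks roughly like the sketch below. This is a made-up illustration of the pattern, not a real library; a production version would also have to deal with thread affinity and with the actual database transaction underneath.

// Hypothetical sketch: only the outermost begin/end pair touches the real
// database transaction; inner "transactions" just adjust the count.
public class NestedTransactionScope {
    private int depth = 0;
    private boolean rollbackOnly = false;

    public void begin() {
        depth++;
        // depth == 1 would be the place to start the real database transaction
    }

    public void failed() {
        rollbackOnly = true;   // a child reported a problem
    }

    public void end() {
        depth--;
        if (depth == 0) {
            if (rollbackOnly) {
                // roll back the real database transaction
            } else {
                // commit the real database transaction
            }
        }
    }
}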
I prefer client-side transaction handling where available. If you are relegated to doing stored procedures or other server-side logical units of work, server-side transactions are fine.
Wow! Lots of questions!
Until a year ago I relied 100% on transactions. Now it's only 98%. In special cases - high-traffic websites (like Sara mentioned) and highly partitioned data that would otherwise force the need for distributed transactions - a transactionless architecture can be adopted. Then you'll have to code referential integrity in the application.
Also, I like to manage transactions declaratively using annotations (I'm a Java guy) and aspects. That's a very clean way to determine transaction boundaries and it includes transaction propagation functionality.
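In Java that declarative style typically looks like Spring's @Transactional (or the EJB/JTA equivalent); the service below is a made-up example showing transaction boundaries and propagation.

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    // The transaction boundary is the method: the aspect opens the transaction,
    // commits on success and rolls back on a runtime exception.
    @Transactional
    public void placeOrder(long customerId, long itemId) {
        // repository calls here all take part in the same transaction
    }

    // Propagation controls how a method joins (or suspends) a transaction
    // started by its caller.
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void writeAuditEntry(String message) {
        // committed independently of the caller's transaction
    }
}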
Just as an FYI... nested transactions can be dangerous. They simply increase the chances of getting a deadlock. So, though they are good and necessary, the way they are implemented is important in higher-volume situations.
Server side transactions, 35,000 transactions per second, SQL Server: 10 lessons from 35K tps
We only use server side transactions:
can start later and finish sooner
not distributed
can do work before and after
SET XACT_ABORT ON means immediate rollback on error
client/OS/driver agnostic
Other:
we nest calls but use @@TRANCOUNT to detect already-started transactions
each DB call is always atomic
We deal with millions of INSERT rows per day (some batched via staging tables), full OLTP, no problems. Not 35k tps though.
As Sara Chipps said, transactions are overkill for high-traffic applications, so we should avoid them as much as possible. In other words, we use a BASE architecture rather than ACID. eBay is a typical case: distributed transactions are not used at all in eBay's architecture. But for eventual consistency, you have to do some sort of trick on your own.

Resources