How to handle a rollback situation in Cassandra

Cassandra doesn't provide rollbacks or the full ACID transactions of a traditional RDBMS; instead, it provides durable, eventually consistent writes.
This is fine if you're OK with eventual consistency, and you can even tune the consistency level if you want stronger guarantees.
My situation is that during a long transaction there might be exceptions due to application-level logic; in an RDBMS I could simply abort the transaction and the database would do the rollback for me.
If I do want the rollback feature, do I have to write the rollback code in my application on my own? Is there any tunable setting in Cassandra to achieve the same result?

You are correct. There is no rollback feature in Cassandra and you will need to manage it in your code.
A straight DELETE would be easy to implement client-side, but if you want to roll back to a previous version of the data, that will be challenging: you'll need to implement a read-before-write, i.e. read the existing version so you know what to roll back to before writing.
But in a distributed architecture with no locks or isolation, it will be difficult to know whether another client updated the data after you read it. Cheers!
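To make the read-before-write idea concrete, here is a minimal sketch of an application-side undo log in plain Python, with the datastore stubbed out as a dict (the `UndoLog` class and its method names are my own invention, not a Cassandra API): before each write you capture the current value, and on failure you replay the captured values in reverse.

```python
class UndoLog:
    """Application-side rollback for a store with no transactions.

    Before each write, record the previous value (read-before-write);
    on failure, restore the recorded values in reverse order.
    Note: with no isolation, a concurrent writer can still slip in
    between your read and your restore -- this is best-effort only.
    """

    def __init__(self, store):
        self.store = store          # any dict-like key/value store
        self._undo = []             # stack of (key, previous value)

    def write(self, key, value):
        # Read-before-write: capture what we would need to restore.
        previous = self.store.get(key)   # None means "key did not exist"
        self._undo.append((key, previous))
        self.store[key] = value

    def rollback(self):
        # Replay captured values in reverse (last write undone first).
        while self._undo:
            key, previous = self._undo.pop()
            if previous is None:
                self.store.pop(key, None)   # undo an insert with a delete
            else:
                self.store[key] = previous  # undo an update with the old value

    def commit(self):
        self._undo.clear()


# Usage: abort a multi-step update part-way through.
store = {"balance:alice": 100}
log = UndoLog(store)
try:
    log.write("balance:alice", 70)
    log.write("balance:bob", 30)
    raise RuntimeError("application-level failure")
except RuntimeError:
    log.rollback()
```

In a real Cassandra client the dict operations would become SELECT/INSERT/DELETE statements, and the caveat in the docstring is exactly the concurrency problem described above.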

Related

Changing a column to nullable in a live database

We have a requirement to change a column in a table, from not nullable to nullable. The table crosses service domains and is being split up in line with our SOA needs. Seems simple enough, but there are potentially huge consequences and impacts to our customers.
What possible ways can we rollback if there are any problems after we have run the scripts to make these changes?
If we were to rollback and there were null values how would you suggest to get things back into a decent state?
Given that we will be processing high volumes of transactions what strategies might be worth considering?
Test, test again, test some more, and only let it hit production when you are absolutely sure.
In terms of rollback
are there any defaults that could sensibly be used if you had to revert to not null?
would you be able to restore a backup of the table and replay actions?
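On the question of defaults: one common pattern is to backfill the NULLs with a sensible default before reinstating the constraint, ideally in a maintenance window so live traffic can't insert new NULLs between the two steps. A rough sketch that just builds the T-SQL (the table, column, type, and default here are hypothetical placeholders):

```python
def revert_to_not_null(table, column, column_type, default_literal):
    """Build T-SQL to backfill NULLs and reinstate NOT NULL.

    The ALTER in step 2 will fail if any NULLs remain, so step 1
    must complete (and no new NULLs arrive) before it runs.
    """
    return [
        # Step 1: replace every NULL with the chosen default.
        f"UPDATE {table} SET {column} = {default_literal} "
        f"WHERE {column} IS NULL;",
        # Step 2: reinstate the constraint.
        f"ALTER TABLE {table} ALTER COLUMN {column} {column_type} NOT NULL;",
    ]


# Example: revert Orders.Region with 'UNKNOWN' as the stand-in default.
statements = revert_to_not_null("Orders", "Region", "VARCHAR(50)", "'UNKNOWN'")
```

Whether a stand-in default is "decent state" is a business question, which is why the restore-and-replay option also deserves consideration.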

SQL Server Bi-Directional Transactional Replication - Is it a good use-case?

We're having a problem with scaling out with SQL server. This is largely because of a few reasons: 1) poorly designed data structures, 2) heavy lifting and business/processing logic is all done in T-SQL. This was verified by a Microsoft SQL guy from Redmond we hired to perform an analysis on our server. We're literally solving issues by continually increasing the command timeout, which is ridiculous, and not a good long term solution. We have since put together the following strategy and set of phases:
Phase 1: Throw hardware/software at the issue to stop the bleeding.
This includes a few different things like a caching server, but what I'd like to ask everyone here about is specifically related to implementing bi-directional transactional replication on a new SQL server. We have two use-cases for wanting to implement this:
We were thinking of running the long running (and table/row locking) SELECTs on this new SQL "processing box" and throwing them into a caching layer and having the UI read them from the cache. These SELECTs are generating reports and also returning results on the web.
Most of the business logic is in SQL. We have some LONG running queries for SELECTs, INSERTs, UPDATEs, and DELETEs which perform processing logic. The end result is really just a handful of INSERTs, UPDATEs, and DELETEs after the processing is complete (lots of cursors). The thought would be to balance the load between these two servers.
I have some questions:
Are these good use-cases for bi-directional transactional replication?
I need to ensure that this solution is going to "just work" and not have to worry about conflicts. Where would conflicts arise within this solution? I have read a few articles about resetting the increment on your identity seed in order to prevent collisions, which makes sense, but how does it handle UPDATEs/DELETEs or other places where conflicts might occur?
What other issues might I run into and we need to watch out for?
Is there a better solution to this problem?
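The identity-seed trick mentioned in the questions above can be illustrated numerically: give each server a different seed and an increment equal to the number of servers, so generated keys can never collide. A quick sketch (the two-server setup is illustrative):

```python
def identity_sequence(seed, increment, count):
    """Simulate SQL Server IDENTITY(seed, increment) key generation."""
    return [seed + i * increment for i in range(count)]


# Two-server setup: IDENTITY(1,2) on server A, IDENTITY(2,2) on server B.
# Server A only ever produces odd keys, server B only even keys.
server_a = identity_sequence(1, 2, 5)   # 1, 3, 5, 7, 9
server_b = identity_sequence(2, 2, 5)   # 2, 4, 6, 8, 10
```

Note that this only prevents INSERT key collisions; it does nothing for conflicting UPDATEs or DELETEs against the same row on both servers, which is exactly where the planning burden lands.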
Phase 2: Rewrite the logic into .NET, where it should be, and optimize SQL stored procedures to perform only set-based operations, as it should also be.
This will obviously take a while, which is why we wanted to see if there were some preliminary steps we could take to stop the pain our users are experiencing.
Thanks.
IMHO bidirectional replication is very, very far from 'it will just work'. Preventing update conflicts requires exquisite planning, ensuring that all that 'processing' is carefully orchestrated so it never works on overlapping data. Master-master replication is one of the most difficult solutions to pull off.
Consider this: you envision a solution that provides a cheap 2x scale-out with nearly no code modification. Such a solution would be quite useful; one would expect to see it deployed everywhere. Yet it is nowhere to be seen.
I recommend you search for the many blogs and articles describing gotchas and warnings about (the much more popular) MySQL master-master deployments (eg. If You Must Deploy Multi-Master Replication, Read This First), and judge for yourself whether the trouble is worth it.
I don't have all the details you do, but I would focus on the application. If you want to just throw money at the problem short term I would make sure that the cheap scale-up is exhausted before considering scale-out (SSD/Fusion drives, more RAM). Also investigate snapshot isolation level/read committed snapshot first, if locking is the main issue.

entity framework and dirty reads

I have Entity Framework (.NET 4.0) going against SQL Server 2008. The database is (theoretically) getting updated during business hours -- delete, then insert, all through a transaction. Practically, it's not going to happen that often. But, I need to make sure I can always read data in the database. The application I'm writing will never do any types of writes to the data -- read-only.
If I do a dirty read, I can always access the data; the worst that happens is I get old data (which is acceptable). However, can I tell Entity Framework to always use dirty reads? Are there performance or data integrity issues I need to worry about if I set up EF this way? Or should I take a step back and see about rewriting the process that's doing the delete/insert process?
TransactionScope is your friend:
Entity Framework with NOLOCK
Don't use dirty reads. "The worst" isn't that you see old data. The worst is that you see uncommitted data. Stack Overflow uses snapshots rather than dirty reads to solve this problem. That's what I'd do, too.
From the previous link, I found this, which also answers the question.
http://www.hanselman.com/blog/GettingLINQToSQLAndLINQToEntitiesToUseNOLOCK.aspx

On the google app engine, how do I implement database Transactions?

I know that the way to handle DB transactionality on the app engine is to give different entities the same Parent(Entity Group) and to use db.run_in_transaction.
However, assume that I am not able to give two entities the same parent. How do I ensure that my DB updates occur in a transaction?
Is there a technical solution? If not, is there a pattern that I can apply?
Note: I am using Python.
As long as the entities belong to the same Group, this is not an issue. From the docs:
All datastore operations in a transaction must operate on entities in the same entity group. This includes querying for entities by ancestor, retrieving entities by key, updating entities, and deleting entities. Notice that each root entity belongs to a separate entity group, so a single transaction cannot create or operate on more than one root entity. For an explanation of entity groups, see Keys and Entity Groups.
There is also a nice article about Transaction Isolation in App Engine.
EDIT: If you need to update entities with different parents in the same transaction, you will need to implement a way to serialize the changes yourself and roll back manually if an exception is raised.
If you want cross-entity-group transactions, you'll have to implement them yourself, or wait for a library to do them. I wrote an article a while ago about how to implement cross-entity-group transactions in the 'bank transfer' case; it may apply to your use-case too.
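The 'bank transfer' approach referenced above boils down to recording the intended change as its own durable record, then applying it to each entity group idempotently, so a retry after a crash completes the transfer rather than double-applying it. A much-simplified in-memory sketch of that idea (plain dicts standing in for the datastore; in the real pattern each step would run in its own `db.run_in_transaction` with the applied-marker stored on the entity itself):

```python
import uuid

accounts = {"alice": 100, "bob": 50}
applied = set()   # (transfer_id, account) pairs already applied


def apply_step(transfer_id, account, delta):
    """Apply one leg of a transfer exactly once (idempotent on retry).

    The check-and-update must commit atomically per entity group,
    which is what the per-group transaction gives you.
    """
    key = (transfer_id, account)
    if key in applied:          # already done: a retry is a no-op
        return
    accounts[account] += delta
    applied.add(key)


def transfer(src, dst, amount):
    transfer_id = str(uuid.uuid4())   # a durable transfer record in real life
    apply_step(transfer_id, src, -amount)
    # A crash here is safe: re-running the transfer record skips the
    # already-applied debit and performs only the missing credit.
    apply_step(transfer_id, dst, +amount)
    return transfer_id


tid = transfer("alice", "bob", 30)
```

This is only the shape of the technique; the linked article and tapioca-orm below handle the hard parts (durably persisting the transfer record and driving retries).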
Transactions in the AppEngine datastore act differently from the transactions you might be used to in an SQL database. For one thing, the transaction doesn't actually lock the entities it's operating on.
The Transaction Isolation in App Engine article explains this in more detail.
Because of this, you'll want to think differently about transactions - you'll probably find that in most of the cases where you're wanting to use a transaction it's either unnecessary - or it wouldn't achieve what you want.
For more information about entity groups and the data store model, see How Entities and Indexes are Stored.
Handling Datastore Errors talks about things that could cause a transaction to not be committed and how to handle the problems.
One possibility is to implement your own transaction handling as you have mentioned. If you are thinking about doing this, it would be worth your time to explore the previous work on this problem.
http://danielwilkerson.com/dist-trans-gae.html
Dan Wilkerson also gave a talk on it at Google IO. You should be able to find a video of the talk.
Erick Armbrust has implemented Daniel Wilkerson's distributed transaction design mentioned earlier, in Java: http://code.google.com/p/tapioca-orm/

Why is READ_COMMITTED_SNAPSHOT not on by default?

Simple question?
Why is READ_COMMITTED_SNAPSHOT not on by default?
I'm guessing either backwards compatibility, performance, or both?
[Edit] Note that I'm interested in the effect relating to the READ_COMMITTED isolation level, and not the snapshot isolation level.
Why would this be a breaking change, since it holds fewer locks and still doesn't read non-committed rows?
Turning snapshot on by default would break the vast majority of applications
It is unclear to me if it will break the "vast majority" of applications. Or, if it will break many applications in ways that are hard to identify and/or hard to work around. The SQL Server documentation states that READ COMMITTED and READ COMMITTED SNAPSHOT both satisfy the ANSI definition of READ COMMITTED. (Stated here: http://msdn.microsoft.com/en-us/library/ms189122.aspx) So, as long as your code does not rely on anything beyond the literal ANSI-required behavior, in theory, you will be okay.
A complication is that the ANSI specification doesn't capture everything that people commonly take terms like dirty read and fuzzy/non-repeatable read to mean in practice. And there are anomalies (permitted by the ANSI definitions) that can occur under READ COMMITTED SNAPSHOT but cannot occur under READ COMMITTED. For an example, see http://www.jimmcleod.net/blog/index.php/2009/08/27/the-potential-dangers-of-the-read-committed-snapshot-isolation-level/.
Also see http://social.msdn.microsoft.com/Forums/en-US/sqldatabaseengine/thread/d1b3d46e-2642-4bc7-a68a-0e4b8da1ca1b.
For deep information on the differences between the isolation levels, start with http://www.cs.umb.edu/cs734/CritiqueANSI_Iso.pdf (READ_COMMITTED_SNAPSHOT was not around when this paper was written, but the other levels are covered by it).
Both. Mostly compatibility.
Turning snapshot on by default would break the vast majority of applications that expect the old, blocking, behavior. Snapshot makes heavy use of tempdb for the version store and its impact on performance is quite measurable.
It changes the default locking strategy from how the Sybase/SQL Server family has worked forever. It'd break all my applications, all the applications I know of at my shop, and corrupt a lot of important data.
Read the Wikipedia article completely: do you want the code behind your banking app to use this isolation model?
In general, therefore, snapshot isolation puts some of the problem of maintaining non-trivial constraints onto the user, who may not appreciate either the potential pitfalls or the possible solutions. The upside to this transfer is better performance.
It's a compromise like most database designs. In my case, I can deal with the locking waits/deadlocks (rare) as a price for easier and more "out of the box" data integrity. I've yet to come across a problem or issue where I see snapshot isolation as a solution.
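For anyone who does decide to opt in, the switch is per-database rather than a server default. A sketch of the T-SQL involved, built as strings (the database name is a placeholder; the SINGLE_USER bracket is one common way to keep the ALTER from waiting forever on open connections):

```python
def enable_rcsi_statements(db):
    """Build the T-SQL to switch a database to READ_COMMITTED_SNAPSHOT.

    The setting change needs exclusive access to the database, so
    run this in a maintenance window.
    """
    return [
        # Kick out other connections so the setting change can proceed.
        f"ALTER DATABASE [{db}] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;",
        # The actual opt-in: READ COMMITTED reads now use row versions.
        f"ALTER DATABASE [{db}] SET READ_COMMITTED_SNAPSHOT ON;",
        # Reopen the database to everyone.
        f"ALTER DATABASE [{db}] SET MULTI_USER;",
    ]
```

Remember the version store lives in tempdb, so size it accordingly before flipping the switch.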

Resources