Relying on db transaction rollback in sunshine scenario - database

In a financial system I am currently maintaining, we are relying on the rollback mechanism of our database to simulate the results of running some large batch jobs - rolling back the transaction or committing it at the end, depending on whether we were doing a test run.
I really cannot decide what my opinion is. In a way I think this is great, because then there is no difference between the simulation and a live run - on the other hand it just feels kind of icky, like e.g. depending on code throwing exceptions to carry out your business logic.
What is your opinion on relying on database transactions to carry out your business logic?
EXAMPLE
Consider an administration system having 1000 mortgage deeds in a database. Now the user wants to run a batch job that creates the next term invoice on each deed, using a couple of advanced search criteria that decides which deeds are to be invoiced.
Before she actually does this, she does a test run (implemented by doing the actualy run but ending in a transaction rollback), creating a report on which deeds will be invoiced. If it looks satisfactory, she can choose to do the actual run, which will end in a transaction commit.
Each invoice will be stamped with a batch number, allowing us to revert the changes later if it is needed, so it's not "dangereous" to do the batch run. The users just feel that it's better UX to be able to simulate the results first.
CLARIFICATION
It is NOT about testing. We do have test and staging environments for that. It's about a regular user using our system wanting to simulate the results of a large operation, that may seem "uncontrollable" or "irreversible" even though it isn't.
CONCLUSION
Doesn't seem like anyone has any real good arguments against our solution. As always, context means everything, so in the context of complex functional requirements exceeding performance requirements, using db rollback to implement batch job simulations seems a viable solution.
As there is no real answer to this question, I am not choosing an answer - instead I upvoted those who actually put forth an argument.

I think it's an acceptable approach, as long as it doesn't interfere with regular processing.
The alternative would be to build a query that displays the consequences for review, but we all have had the experience of taking such an approach and not quite getting it right; or finding that the context changed between query and execution.
At the scale of 1000 rows, it's unlikely the system load is burdensome.

Before she actually does this, she does a test run (implemented by doing the actualy run but ending in a transaction rollback), creating a report on which deeds will be invoiced. If it looks satisfactory, she can choose to do the actual run, which will end in a transaction commit.
That's wrong, prone to failure, and must be hell on your database logs. Unless you wrap your simulation and the actual run in a single transaction (which, judging by the timeline necessary to inspect 1000 deeds, would lead to a lot of blocked users) then there's no guaranteed consistency between test run and real run. If somebody changed data, added rows, etc. then you could end up with a different result - defeating the entire purpose of the test run.
A better pattern to do this would be for the test run to tag the records, and the real run to pick up the tagged records and process them. Or, if you have a thick client app, you can pull down the records to the client, show the report, and - if approved - push them back up.

We can see what the user needs to do, quite a reasonable thing. I mean how often do we get a regexp right first time? Refining a query till it does exactly what you want is not unusual.
The business consequences of not catching errors may be quite high, so doing a trial run makes sense.
Given a blank sheet of paper I'm sure we can devise an clean implementation expressed in formal behaviours of the system rather than this somewhat back-door appraoch.
How much effort would I put into fixing that now? Depends on whether the current approach is actually hurting. We can imagine that in a heaviliy used system it could lead to contention in the database.

What I wrote about the PRO FORMA environment in that bank I worked in was also entirely a user thing.

I'm not sure exactly what you're asking here. Taking you literally
What is your opinion on relying on
database transactions to carry out
your business logic?
Well, that's why we have transactions. We do rely on them. We hit an error and abort a transaction and rely on work done in that transaction scope to be rolled-back. So exploiting the transactional beahviours of our systems is a jolly good thing, and we'd need to hand-roll the same thing ourselves if we didn't.
But I think your question is about testing in a live system and relying on not commiting in order to do no damage. In an ideal world we have a live system and a test system and we don't mess with live systems. Such ideals are rarely seen. Far more common is "patch the live system. testing? what do you mean testing?" So in fact you're ahead of the game compared with some.
An alternative is to have dummy data in the live system, so that some actions can actually run all the way through. Again, error prone.
A surprisingly high proportion of systems outage are due to finger trouble, it's the humans who foul up.

It works - as you say. I'd worry about the concurrency of the system since the transaction will hold locks, possibly many locks. It means that your tests will hold up any live action on the system (and any live action operations will hold up your tests). It is generally better to use a test system for testing. I don't like it much, but if the risks from running a test but forgetting to roll it back are not a problem, and neither is the interference, then it is an effective way of attaining a 'what if' type calculation. But still not very clean.

When I was working in a company that was part of the "financial system", there was a project team that had decided to use the production environment to test their conversion procedure (and just rollback instead of commit at the end).
They almost got shot for it. With afterthought, it's a pity they weren't.
Your test environments that you claim you have are for the IT people's use. Get a similar "PRO-FORMA" environment of which you can guarantee your users that it is for THEIR use exclusively.
When I worked in that bank, creating such a PRO-FORMA environment was standard procedure at every year closure.

"But it's not about testing, it's about a user wanting to simulate the results of a large job."
Paraphrased : "it's not about testing, it's about simulation".
Sometimes I wish Fabian Pascal was still in business.
(Oh, and in case anyone doesn't understand : both are about "not perpetuating the results".)

Related

To NOLOCK or not to NOLOCK?

I've worked in a very large organization where they had to use NOLOCK on most queries - because data was commonly updated via ETL processes during the day and locking the application for 40% of the working day was certainly not an option.
Out of habit, in my next place I went on to automatically use NOLOCK everywhere. But since reading warnings and risks, I've been gradually undoing this, to have no table hints specified and let SQL Server do it's thing.
However, I'm still not comfortable I'm doing the right thing. In the place where we used NOLOCK, I never once saw data get doubled up, or corrupt data.. I was there for many years. Ever since removing NOLOCK I am running into the obvious obstacle of rowlocks slowing / pausing queries which gives the illusion that my DB is slow or flakey. When in actuality it's just somebody running a lengthy save somewhere (the ability for them to do so is a requirement of the application).
I would be interested to hear from anyone who has actually experienced data corruption or data duplication from NOLOCK in practise, rather than people who are going by what they've read about it on the internet. I would be especially appreciative if anybody can provide replication steps to see this happen. I am trying to gauge just how risky is it, and do the risks outweigh the obvious benefit of being able to run reports parallel to updates?
I've seen corruptions, duplication and bad results with NOLOCK. Not once, not seldom. Every deployment that relied on NOLOCK and I had a chance to look at had correctness issues. True that many (most?) were not aware and did not realize, but the problems were there, always.
You have to realize that NOLOCK problems do no manifest as hard corruptions (the kind DBCC CHECKDB would report), but 'soft' corruption. And the problems are obvious only on certain kind of workloads, mostly on analytic types (aggregates). They would manifest as an incorrect value in a report, a balance mismatch in a ledger, a wrong department headcount and similar. These problems are visible as problems only when carefully inspected by a qualified human. And they may well vanish mysteriously on a simple Refresh of the page. So you may well have all these problems and not be aware of them. Your users might accept that 'sometimes the balance is wrong, just ask for the report again and will be OK' and never report you the issue.
And there are some workloads that are not very sensitive to NOLOCK issues. If you display 'posts' and 'comments' you won't see much of NOLOCK issues. Maybe the 'unanswered count' is off by 2, but who will notice?
Ever since removing NOLOCK I am running into the obvious obstacle of rowlocks slowing / pausing queries
I would recommend evaluating SNPASHOT isolation models (including READ_COMMITTED_SNAPSHOT). You may get a free lunch.
I see you've read a lot about it, but allow me to point you to a very good explanation on the dangers of using NOLOCK (that's it READ UNCOMMITTED isolation level): SQL Server NOLOCK Hint & other poor ideas.
Apart from this, I'll make some citations and comments. The worst part of NOLOCK is this:
It creates “incredibly hard to reproduce” bugs.
The problem is that when you read uncommited data, most of the time is commited, so everything is alright. But it will randomly fail if the transaction is not comitted. And that doesn't usually happen. Right? Nope: first, a single error is a very bad thing (your customer don't like it). And second, things can get much worse, LO:
The issue is that transactions do more than just update the row. Often they require an index to be updated OR they run out of space on the data page. This may require new pages to be allocated & existing rows on that page to be moved, called a PageSplit. It is possible for your select to completely miss a number of rows &/or count other rows twice. More info on this in the linked article
So, that means that even if the uncommited transaction you've read is committed, you can still read bad data. And, this will happen at random times. That's ugly, very ugly!
What about corruption?
As Remus Rusanu said, it's not "hard" but "soft" corruption. And it affects specially aggregates, because you're reading what you shouldn't when updating them. This can lead for example to a wrong account balance.
Haven't you heard of big LOB apps that have procedures to rebuild account balances? Why? They should be correctly updated inside transactions! (That can be acceptable if the balances are rebuilt at critical moments, for example while calcultaing taxes).
What can I do without corrupting data (and thus is relatively safe)?
Let's say it's "quite safe" to read uncommited data when you're not using it to update other existing data on the DB. I.e. if you use NOLOCK only for reporting purposes (without write-back) you're on the "quite safe" side. The only "tiny trouble" is that the report can show the wrong data, but, at least, the data in the DB will keep consistent.
To consider this safe depends on the prupose of what you're reading. If it's something informational, which is not going to be used to make decissions, that's quite safe (for example it's not very bad to have some errors on a report of the best customers, or the most sold products). But if you're getting this information to make decissions, things can be much worse (you can make a decission on a wrong basis!)
A particular experience
I worked on the development of a 'crowded' application, with some 1,500 users which used NOLOCK for reading data, modifying it an updating it on the DB (a HHRR/TEA company). And (apparently) there were no problems. The trick was that each employee read "atomic data" (an employee's data) to modify it, and it was nearly impossible that two people read and modified the same data at the same time. Besides this "atomic data" didn't influence any aggregate data. So everything was fine. But from time to time there were problems on the reporting area, which read "aggregated data" with NOLOCK. So, the critical reports had to be scheduled for moments where noone was working in the DB. The small deviations on non-critical reports was overlooked and admittable.
Now you know it. You have no excuses. You decide, to NOLOCK or not to NOLOCK

Testing database transactions

How can I write and run automated tests that check that my database transaction strategy is removing race conditions? At the moment all I do is test it in development by putting a breakpoint in the code and sending two requests, I can then see in slow motion what happens. This is not something I can automate though, it's not even testing really, just part of development.
Your test can spawn threads and run two or more threads making the same request isolated by the transaction.
Perform a load test with a realistic work-load. Unfortunately, this is not easy to do. Race conditions are hard to discover on any platform. I know of no systematic way to find such bugs.
Sometimes you can exclude the possibility of inconsistencies by construction. For example:
A transaction running under SERIALIZABLE behaves as if it was the only transaction in the system. Therefore, there are never data races.
A read-only transaction under SNAPSHOT behaves the same way. Total data consistency.
A UNIQUE INDEX will never violate its integrity guarantees.
As you can see you can sometimes make your code safe by construction so that there is minimal need to test.

Distributed transactions - why do we save tranlogs to file system?

All transaction managers (Atomikos, Bitronix, IBM WebSphere TM etc) save some "transaction logs" into 'tranlogs' folder to file system.
When something terrible happens and server gets down sometimes tranlogs become broken.
They require some manual recovery procedure.
I've been told that by simply clearing broken tranlogs folder I risk to have an inconsistent state of resources that participated in transactions.
As a "dumb" developer I feel more comfortable with simple concepts. I want to think that distributed transaction management should be alike the regular transaction management:
If something went wrong at any party (network, app error, timeout) - I expect the whole multi-resource transaction not to be committed in any part of it. All leftovers should be cleaned up sooner or later automatically.
If transaction managers fails (file system fault, power supply fault) - I expect all the transactions under this TM to be rollbacked (apparently, at DB timeout level).
File storage for tranlogs is optional if I don't want to have any automatic TX recovery (whatever it would mean).
Questions
Why can't I think like this? What's so complicated about 2PC?
What are the exact risks when I clear broken tranlogs?
If I am wrong and I really need all the mess with 2PC file system state. Don't you feel sick about the fact that TX manager can actually break storage state in an easy and ugly manner?
When I was first confronted with 2 phase commit in real life in 1994 (initially on a larger Oracle7 environment), I had a similar initial reaction. What a bloody shame that it is not generally possible to make it simple. But looking back at algorithm books of university, it become clear that there is no general solution for 2PC.
See for instance how to come to consensus in a distributed environment
Of course, there are many specific cases where a 2PC commit of a transaction can be resolved more easy to either complete or roll back completely and with less impact. But the general problem stays and can not be solved.
In this case, a transaction manager has to decide at some time what to do; a transaction can not remain open forever. Therefor, as an ultimate solution they will always need to have go back to their own transaction logs, since one or more of the other parties may not be able to reliably communicate status now and in the near future. Some transaction managers might be more advanced and know how to resolve some cases more easily, but the need for an ultimate fallback stays.
I am sorry for you. Fixing it generally seems to be identical to "Falsity implies anything" in binary logic.
Summarizing
On Why can't I think like this? and What's so complicated about 2PC: See above. This algorithmetic problem can't be solved universally.
On What are the exact risks when I clear broken tranlogs?: the transaction manager has some database backing it. Deleting translogs is the same problem in general relational database software; you loose information on the transactions in process. Some db platforms can still have somewhat or largely integer files. For background and some database theory, see Wikipedia.
On Don't you feel sick about the fact that TX manager can actually break storage state in an easy and ugly manner?: yes, sometimes when I have to get a lot of work done by the team, I really hate it. But well, it keeps me having a job :-)
Addition: to 2PC or not
From your addition I understand that you are thinking whether or not to include 2PC in your projects.
In my opinion, your mileage may vary. Our company has as policy for 2PC: avoid it whenever possible. However, in some environments and especially with legacy systems and complex environments such a found in banking you can not work around it. The customer requires it and they may be not willing to allow you to perform a major change in other infrastructural components.
When you must do 2PC: do it well. I like a clean architecture of the software and infrastructure, and something that is so simple that even 5 years from now it is clear how it works.
For all other cases, we stay away from two phase commit. We have our own framework (Invantive Producer) from client, to application server to database backend. In this framework we have chosen to sacrifice elements of ACID when normally working in a distributed environment. The application developer must take care himself of for instance atomicity. Often that is possible with little effort or even doesn't require thinking about. For instance, all software must be safe for restart. Even with atomicity of transactions this requires some thinking to do it well in a massive multi user environment (for instance locking issues).
In general this stupid approach is very easy to understand and maintain. In cases where we have been required to do two phase commit, we have been able to just replace some plug-ins on the framework and make some changes to client-side code.
So my advice would be:
Try to avoid 2PC.
But encapsulate your transaction logic nicely.
Allowing to do 2PC without a complete rebuild, but only changing things where needed.
I hope this helps you. If you can tell me more about your typical environments (size in #tables, size in GB persistent data, size in #concurrent users, typical transaction mgmt software and platform) may be i can make some additions or improvements.
Addition: Email and avoiding message loss in 2PC
Regarding whether suggesting DB combining with JMS: No, combining DB with JMS is normally of little use; it will itself already have some db, therefor the original question on transaction logs.
Regarding your business case: I understand that per event an email is sent from a template and that the outgoing mail is registered as an event in the database.
This is a hard nut to crack; I've been enjoying doing security audits and one of the easiest security issues to score was checking use of email.
Email - besides not being confidential and tampersafe in most situations like a postcard - has no guarantees for delivery and/or reading without additional measures. For instance, even when email is delivered directly between your mail transfer agent and the recipient, data loss can occur without the transaction monitor being informed. That even gets worse when multiple hops are involved. For instance, each MTA has it's own queueing mechanism on which a "bomb can be dropped" leading to data loss. But you can also think of spam measures, bad configuration, mail loops, pressing delete file by accident, etc. Even when you can register the sending of the email without any loss of transaction information using 2PC, this gives absolutely no clue on whether the email will arrive at all or even make it across the first hop.
The company I work for sells a large software package for project-driven businesses. This package has an integrated queueing mechanism, which also handles email events. Typically combined in most implementation with Exchange nowadays. A few months we've had a nice problem: transaction started, opened mail channel, mail delivered to Exchange as MTA, register that mail was handled... transaction aborted, since Oracle tablespace full. On the next run, the mail was delivered again to Exchange, again abort, etc. The algorithm has been enhanced now, but from this simple example you can see that you need all endpoints to cooperate in your 2PC, even when some of the endpoints are far away in an organisation receiving and displaying your email.
If you need measures to ensure that an email is delivered or read, you will need to supplement it by additional measures. Please pick one of application controls, user controls and process controls from literature.

Should you test an external system prior to using it?

Note: This is not for unit testing or integration testing. This is for when the application is running.
I am working on a system which communicates to multiple back end systems, which can be grouped into three types
Relational database
SOAP or WCF service
File system (network share)
Due to the environment this will run in, there are no guarantees that any of those will be available at run time. In fact some of them seem pretty brittle and go down multiple times a day :(
The thinking is to have a small bit of test code which runs before the actual code. If there is a problem then persist the request and poll until the target system until it is available. Tests could possibly be rerun within the code to check it is still available at logical points. The ultimate goal is to have a very stable system, regardless of the stability (or lack thereof) of the systems it communicates to.
My questions around this design are:
Are there major issues with it? (small things like the fact it may fail between the test completing and the code running are understandable)
Are there better ways to implement this sort of design?
Would using traditional exception handling and/or transactions be better?
Updates
The system needs to talk to the back end systems in a coordinated way.
The system is very async in nature so using things like queuing technologies is fine.
The system must run even if one or more backend systems are down as others may be up and processing of some information is possible.
You will be needing that traditional exception handling no matter what, since as you point out there's always the chance that things'll fail between your last check and the actual request. So I really think any solution you find should try to interact smoothly with this.
You are not stating if these flaky resources need to interact in some kind of coordinated manner, which would indicate that you should probably be using a transaction manager of some sort to do this. I do not believe you want to get into the footwork of transaction management in application code for most needs.
Sometimes I have also seen people use AOP to encapsulate retry logic to back-end systems that fail (for instance due to time-out issues). Used sparingly this may be a decent solution.
In some cases you can also use message queuing technology to alleviate unstable back-ends. You could for instance commit to a message queue as part of a transaction, and only pop off the queue when successful. But this design is normally only possible when you're able to live with an asynchronous process.
And as always, real stability can only be achieved by attacking the root cause of the problem. I had a 25-year old bug fixed in a mainframe TCP/IP stack fixed because we were overrunning it, so it is possible.
The Microsoft Smartclient framework provides a ConnectionMonitor class. Should be easy to use or duplicate.
Our approach to this kind of issue was to run a really basic 'sanity tester' prior to bringing up our main application. This was thick client so we could run the test every time the app started. This sanity test would go out and check things like database availability, and external network (extranet) access, and it could have been extended to do webservices as well.
If there was a failure, the user was informed, and crucially an email was also sent to the support/dev team. These emails soon became unweildy as so many were being created, but we then setup filters, so we knew when somethings really bad was happening. Overall the approach worked pretty well, our biggest win was being able to tell users that the system was down, before they had entered data, and got part way through a long winded process. They absolutely loved it.
At a technica level the sanity was written in C#, it used exception handling in a conventional way not to find the problems it was looking for. The sanity program became a mini app in its own right, and it was standalone from the main app. If I were doing it again I'd using a logging framework to capture issues, which is more flexible then our hard coded approach.

If you were asked if a system could sustain double growth, what 3 things would you do to answer?

Let's say at your job your boss says,
That system over there, which has lost all institutional knowledge but seems to run pretty good right now, could we dump double the data in it and survive?
You're completely unfamiliar with the system.
It's in SQL Server 2000 (primarily a database app).
There's no test environment.
You might be able to hijack it on the weekends if you needed to run a benchmark.
What would be the 3 things you'd do to convince yourself and then your manager that you could take on that extra load. And if you couldn't do it, on the same hardware... the extra hardware (measured in dollars) it would take to satisfy that request.
To address the response from doofledorfer, you assumptions are almost all 180 degrees off. But that's my fault for an ambiguous question.
One of the main servers runs 7x24 at 70% base and spikes from there and no one knows what it is doing.
This isn't an issue of buy-in or whining... Our company may not have much of a choice in the matter.
Because this is being externally mandated, delays in implementation could result in huge fines. So large meeting to assess risk are almost impossible. There is one risk, that dumping double the data would take the system down for the existing customers.
I was hoping someone would say something like, see if you take the system off line Sunday night at midnight and run SQLIO tests to see how close the storage subsystem is to saturation. Things like that.
Set up a test environment, even if I have to do it on my laptop.
Enable some kind of logging on the production system to get an idea of the volume of transactions in addition to the volume of data.
Read the source code as I run stress tests on my laptop with increasing amounts of data.
Having said that, I sympathize with this assignment, because it's unfair. It's like asking someone in a boat if the boat can float with twice the cargo -- but you can't get out of the boat or take it out of its regular service.
You've just described a typical Agile project. Your answer should be:
I don't know, and I won't be able to tell without testing.
In addition to data volume, there might be issues with usage patterns, application interactions, database and server tuning, etc.
So let's work through a basic list of risk factors, and how we might resolve them.
Once we've done that, let's work through them in inverse order of risk; and make a stop/continue decision as we develop the results.
etc.
Without management buy-in and participation at least at that level, any other answer you might give is high-risk wishing, and "3 most important" is a non sequitur.
I'd be optimistic unless your current system is substantially loaded already. Most servers should run at less than 50% capacity on all resources, or else be on life-support.
And I expect you wouldn't be having the conversation if the existing server were already dealing with load issues; although "seems to run pretty good right now" is imprecise enough to be worrisome.
it mostly depends on its current level. If doubling is going from 2GB to 4GB just do it. If it's going from 1TB to 2TB you've got some planning to do.
I'd collect some info using Performance Monitor and provide it to help make an educated decision.
It depends what you mean by "double the data".
If that is going to affect one table only (say product table) then you are probably safe as most queries that are referring to that one table are most likely to double the time of execution (that assumes that you do not reference the same time twice in a query).
The problem will arise if you double the amount of data in all the tables as the execution time may grow in exponential fashion then and it can lead to some serious issues.
But in general I would support the answer by doofledorfer

Resources