After having some trouble with my SQL Server 2012 instance, I could only fix the data inconsistencies using DBCC CHECKDB (xxx, REPAIR_ALLOW_DATA_LOSS). The option's name implies that there will (possibly) be data loss when the database is repaired.
What data is lost, and how harmful is the loss?
For example, take a look at this log message:
The off-row data node at page (1:705), slot 0, text ID 328867287793664 is not referenced.
If that node is not referenced and it is the node that causes the inconsistency, deleting it shouldn't hurt anyone. Is that the kind of data loss MS is talking about?
Best regards,
Sascha
Check Paul Randal's blog post for some additional insight into the implications of running DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS:
REPAIR_ALLOW_DATA_LOSS is the repair level that DBCC CHECKDB recommends when it finds corruptions. This is because fixing nearly
anything that’s not a minor non-clustered index issue requires
deleting something to repair it. So, REPAIR_ALLOW_DATA_LOSS will
delete things. This means it will probably delete some of your data as
well. If, for instance it finds a corrupt record on a data page, it
may end up having to delete the entire data page, including all the
other records on the page, to fix the corruption. That could be a lot
of data. For this reason, the repair level name was carefully chosen.
You can’t type in REPAIR_ALLOW_DATA_LOSS without realizing that you’re
probably going to lose some data as part of the operation.
I’ve been asked why this is. The purpose of repair is not to save user
data. The purpose of repair is to make the database structurally
consistent as fast as possible (to limit downtime) and correctly (to
avoid making things worse). This means that repairs have to be
engineered to be fast and reliable operations that will work in every
circumstance. The simple way to do this is to delete what’s broken and
fix up everything that linked to (or was linked from) the thing being
deleted – whether a record or page. Trying to do anything more
complicated would increase the chances of the repair not working, or
even making things worse.
The ramifications of this are that running REPAIR_ALLOW_DATA_LOSS can
lead to the same effect on your data as rebuilding a transaction log
with in-flight transactions altering user data – your business logic,
inherent and constraint-enforced relationships between tables, and the
basic logical integrity of your user data could all be broken. BUT,
the database is now structurally consistent and SQL Server can run on
it without fear of hitting a corruption that could cause a crash.
To continue the contrived example from above, imagine your bank
checking and savings accounts just happen to be stored on the same
data page in the bank’s SQL Server database. The new DBA doesn’t
realize that backups are necessary for disaster recovery and data
preservation and so isn’t taking any. Disaster strikes again in the
form of the work crew outside the data-center accidentally cutting the
power and the machine hosting SQL Server powers down. This time, one
of the drives has a problem while powering down and a page write
doesn’t complete – causing a torn page. Unfortunately, it’s the page
holding your bank accounts. As the DBA doesn’t have any backups, the
only alternative to fix the torn-page is to run
REPAIR_ALLOW_DATA_LOSS. For this error, it will delete the page, and
does so. In the process, everything else on the page is also lost,
including your bank accounts!!
Source: Corruption: Last resorts that people try first…
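For completeness, if repair really is the only option left (no clean backup to restore), the usual sequence looks roughly like the sketch below. This is a hedged outline, not an official procedure; YourDb is a placeholder name, and taking a full backup of the damaged database first is strongly advised.

-- Hedged sketch only; YourDb is a placeholder. Back up the damaged database first.
ALTER DATABASE YourDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

DBCC CHECKDB (YourDb, REPAIR_ALLOW_DATA_LOSS);

ALTER DATABASE YourDb SET MULTI_USER;

-- Re-check afterwards and verify your data at the business level
-- (row counts, foreign keys, balances), since repair may have deleted pages.
DBCC CHECKDB (YourDb) WITH NO_INFOMSGS;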
I recently came across a case that makes me wonder if I'm a newbie or if something trivial has escaped me.
Suppose I have software run by many users that uses a table. When a user logs in to the app, a set of information from the table appears, and he just has to add or correct some information and save it. Now, if the software is run by many people, how can I guarantee he is the only one working with that particular record? I mean, how can I know the record is not selected and being worked on by two or more users at the same time? And please, I wouldn't like the answer to be “use SELECT FOR UPDATE…”,
because from what I've read it has too negative an impact on the database. Thanks to all of you. Keep up the good work.
This is something that is not solved primarily by the database. The database manages isolation and locking of concurrent transactions. But by the time the records are sent to the client, you usually (and hopefully) have closed the transaction, and you start a new one when the data comes back.
So you have to take care of it yourself.
There are different approaches; the ones that come to mind are:
optimistic locking strategies (first wins)
pessimistic locking strategies
last wins
Optimistic locking: when storing, you check whether the record has been changed in the meantime. Usually this is done with a version counter or timestamp column (a minimal T-SQL sketch follows below). Some ORMs and frameworks may help a little to implement this.
Pessimistic locking: build a mechanism that records that someone has started to edit something and does not allow anyone else to edit the same record. Especially in web projects it needs a timeout after which the lock is released anyway.
Last wins: the second person storing the record simply overwrites the first person's changes.
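To make the optimistic variant concrete, here is a minimal T-SQL sketch using a ROWVERSION column. All table and column names are invented, and the @-parameters are assumed to be supplied by the application:

-- Minimal optimistic-locking sketch; names are invented for illustration.
CREATE TABLE dbo.Record
(
    RecordId INT IDENTITY PRIMARY KEY,
    Payload  NVARCHAR(200) NOT NULL,
    RowVer   ROWVERSION            -- changes automatically on every update
);

-- 1) Read the row and keep RowVer on the client together with the data.
-- 2) On save, update only if RowVer is still the value that was read:
UPDATE dbo.Record
SET    Payload = @NewPayload
WHERE  RecordId = @RecordId
  AND  RowVer   = @RowVerReadEarlier;

IF @@ROWCOUNT = 0
    RAISERROR('Record was changed by someone else; reload and retry.', 16, 1);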
... makes me wonder if I'm a newbie ...
That's what always happens when we discover that very common problems are still not solved by the tools and frameworks we use, and we have to solve them over and over again.
Now, if the software he uses is run by many people, how can I guarantee he is the only one working with that particular record?
Ah...
And please, I wouldn't like the answer to be “use SELECT FOR UPDATE…”, because from what I've read it has too negative an impact on the database.
Who cares? I mean, it is the only way (keep a lock on a row) to guarantee you are the only one who can change it. Yes, this limits throughput, but then this is WHAT YOU WANT.
It is called programming - choosing the right tools for the job. In this case the impact is required by the requirements.
The alternative - not a guarantee at the database level but at the application-server level - is an in-memory or in-database locking mechanism (like a table indicating which objects belong to which user).
But if you need to guarantee at the database level that one record is only used by one person, then you MUST keep a lock around and deal with the impact.
But seriously, most programs avoid this. They deal with it either with optimistic locking (the second user submitting changes gets an error) or other programmer-level decisions, BECAUSE the cost of such guarantees is ridiculously high.
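For completeness, the closest SQL Server equivalent of SELECT ... FOR UPDATE is a locking hint inside an explicit transaction. A hedged sketch with invented names; the row lock is held until COMMIT or ROLLBACK, which is exactly the throughput cost discussed above:

-- Hedged sketch; table/column names are invented.
BEGIN TRAN;

SELECT Payload
FROM   dbo.Record WITH (UPDLOCK, ROWLOCK)   -- take and hold an update lock on the row
WHERE  RecordId = @RecordId;

-- ... the user edits while the row stays locked against other writers ...

UPDATE dbo.Record
SET    Payload = @NewPayload
WHERE  RecordId = @RecordId;

COMMIT;   -- releases the lock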
Oracle is different from SQL Server.
In Oracle, when you update a record, the old version of the data remains available (via undo) until you commit.
Therefore anyone reading the same record will still see the old value.
If the access to that record is a write access, though, it will be blocked until the first transaction commits; only then can you write the same record.
If two sessions end up waiting on each other's locks, a deadlock will pop up.
SQL Server, under its default isolation level, cannot read a record that is locked for writing, so depending on which query you're running you might end up blocking on, or even locking, an entire table.
First, you could separate reporting queries from inserts/updates by using a separate data-warehouse database. That way, slow updates that cause locks no longer block your reports.
The next step is to identify what is causing the locks and work out each case separately.
Rebuilding indexes during working hours can cause very nasty locks. Push those jobs to after hours.
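If some index maintenance simply has to run during business hours, an online rebuild keeps the table readable while it runs. A hedged example with invented index/table names; ONLINE = ON is only available in editions that support online index operations:

-- Hedged example; index and table names are invented.
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
REBUILD WITH (ONLINE = ON);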
For learning purposes, I want to write my own database that is able to replicate itself. I have made some progress, but now I am facing a problem that I can not solve. Suppose I have a database (let's call it the source) that I would like to replicate to another database (let's call it the target).
The basic principle is easy: in the source you don't store actual tables, but instead a log of transactions. It's easy to send the transaction log over to the target, where the database then rebuilds itself. If you want to update the target, you simply request the part of the transaction log that has been added since. Basically this is what almost every database does.
While this works, it has one major drawback: If a table already exists for a long time, the transaction log is very long, and hence replicating the table requires lots of time…
To avoid this you can store the current state as well. This means you have an up-to-date snapshot that you can copy fast. Additionally, the target has to subscribe to the transaction log of the source. Once it contains additional entries, the target applies them to its copied table. This works well, too, and it's way better in terms of performance and transferred volume.
But now I am facing a problem: suppose the snapshot is large; then changes may be made to it while it is being delivered. That means the copied snapshot contains some old and some new data. Now, how do I get the target database into a consistent state? Even if I know where to start in the transaction log, I either re-apply a change that has already been applied to some of the records, or I leave it out, and then the change is never applied to the other records.
Of course I could use the serializable isolation level, but then performance drops. Of course I could do what e.g. CouchDB does and remember the current table revision in every record, and keep a copy of every record for every revision. But then the required space grows enormously.
So, what shall I do?
Everything that I was able to find on the web either relies on the idea of replaying the entire transaction log, or on using a process like CouchDB's, which takes up huge amounts of space.
Any ideas?
Your snapshot needs to be consistent, and you need to know at what point (with regard to the transaction log) it is consistent. You then apply any transactions that have been committed since that point.
Obtaining a consistent snapshot can be done with exclusive locking, which may delay other transactions from committing, or by using row versions (MVCC).
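Expressed in T-SQL purely as an illustration of that idea (the change_log and source_table names are invented, and ALLOW_SNAPSHOT_ISOLATION is assumed to be enabled on the database), the "consistent snapshot plus known log position" step could look like this:

-- Conceptual sketch only; object names are invented.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;

    -- Remember where the replication log stood at the moment of the snapshot.
    DECLARE @snapshot_seq BIGINT = (SELECT ISNULL(MAX(seq), 0) FROM dbo.change_log);

    -- The snapshot itself: a view consistent with @snapshot_seq.
    SELECT * FROM dbo.source_table;       -- ship this to the target

COMMIT;

-- The target afterwards only needs the tail of the log:
SELECT * FROM dbo.change_log
WHERE  seq > @snapshot_seq
ORDER BY seq;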
Good luck with your project.
I've worked in a very large organization where they had to use NOLOCK on most queries - because data was commonly updated via ETL processes during the day and locking the application for 40% of the working day was certainly not an option.
Out of habit, in my next place I went on to automatically use NOLOCK everywhere. But since reading about the warnings and risks, I've been gradually undoing this, leaving no table hints specified and letting SQL Server do its thing.
However, I'm still not comfortable that I'm doing the right thing. In the place where we used NOLOCK, I never once saw data get doubled up or corrupted, and I was there for many years. Ever since removing NOLOCK I am running into the obvious obstacle of row locks slowing or pausing queries, which gives the illusion that my DB is slow or flaky. When in actuality it's just somebody running a lengthy save somewhere (the ability for them to do so is a requirement of the application).
I would be interested to hear from anyone who has actually experienced data corruption or data duplication from NOLOCK in practice, rather than people who are going by what they've read about it on the internet. I would be especially appreciative if anybody can provide replication steps to see this happen. I am trying to gauge just how risky it is, and whether the risks outweigh the obvious benefit of being able to run reports in parallel with updates.
I've seen corruption, duplication and bad results with NOLOCK. Not once, not seldom. Every deployment that relied on NOLOCK that I had a chance to look at had correctness issues. True that many (most?) were not aware and did not realize, but the problems were there, always.
You have to realize that NOLOCK problems do not manifest as hard corruption (the kind DBCC CHECKDB would report), but as 'soft' corruption. And the problems are obvious only on certain kinds of workloads, mostly analytic ones (aggregates). They manifest as an incorrect value in a report, a balance mismatch in a ledger, a wrong department headcount and similar. These problems are visible as problems only when carefully inspected by a qualified human. And they may well vanish mysteriously on a simple refresh of the page. So you may well have all these problems and not be aware of them. Your users might accept that 'sometimes the balance is wrong, just ask for the report again and it will be OK' and never report the issue to you.
And there are some workloads that are not very sensitive to NOLOCK issues. If you display 'posts' and 'comments' you won't see much of NOLOCK issues. Maybe the 'unanswered count' is off by 2, but who will notice?
Ever since removing NOLOCK I am running into the obvious obstacle of rowlocks slowing / pausing queries
I would recommend evaluating the SNAPSHOT isolation models (including READ_COMMITTED_SNAPSHOT). You may get a free lunch.
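Turning on row versioning is a one-off database setting; a hedged example with a placeholder database name (test the extra tempdb/version-store load before doing this in production):

-- Placeholder database name; evaluate tempdb/version-store impact first.
-- Readers then see the last committed version instead of blocking on writers.
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- Optionally also allow explicit SNAPSHOT transactions:
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;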
I see you've read a lot about it, but allow me to point you to a very good explanation of the dangers of using NOLOCK (that is, the READ UNCOMMITTED isolation level): SQL Server NOLOCK Hint & other poor ideas.
Apart from this, I'll make some citations and comments. The worst part of NOLOCK is this:
It creates “incredibly hard to reproduce” bugs.
The problem is that when you read uncommitted data, most of the time it ends up committed, so everything is all right. But it will randomly fail when the transaction is rolled back. And that doesn't usually happen. Right? Nope: first, a single error is a very bad thing (your customers don't like it). And second, things can get much worse:
The issue is that transactions do more than just update the row. Often they require an index to be updated OR they run out of space on the data page. This may require new pages to be allocated & existing rows on that page to be moved, called a PageSplit. It is possible for your select to completely miss a number of rows &/or count other rows twice. More info on this in the linked article
So, that means that even if the uncommitted transaction you've read ends up committed, you can still read bad data. And this will happen at random times. That's ugly, very ugly!
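Since the original question asked for replication steps: a commonly used way to provoke this is to keep rows moving inside a clustered index in one session while counting them with NOLOCK in another. A hedged sketch with invented names; the true row count never changes, so any NOLOCK count other than 100000 is a wrong read (results vary by run and hardware):

-- Setup (run once):
CREATE TABLE dbo.NoLockDemo
(
    Id     UNIQUEIDENTIFIER NOT NULL PRIMARY KEY CLUSTERED,
    Filler CHAR(800)        NOT NULL DEFAULT 'x'
);

INSERT INTO dbo.NoLockDemo (Id)
SELECT TOP (100000) NEWID()
FROM sys.all_columns a CROSS JOIN sys.all_columns b;

-- Session 1: move rows around inside the clustered index (page splits and
-- row relocation), without ever changing the number of rows.
WHILE 1 = 1
    UPDATE TOP (100) dbo.NoLockDemo SET Id = NEWID();

-- Session 2: with NOLOCK the count occasionally comes back higher or lower
-- than 100000, because the scan misses moved rows or reads them twice.
WHILE 1 = 1
BEGIN
    SELECT COUNT(*) AS NoLockCount FROM dbo.NoLockDemo WITH (NOLOCK);
    WAITFOR DELAY '00:00:01';
END;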
What about corruption?
As Remus Rusanu said, it's not "hard" but "soft" corruption. And it especially affects aggregates, because you're reading what you shouldn't when computing them. This can lead, for example, to a wrong account balance.
Haven't you heard of big LOB apps that have procedures to rebuild account balances? Why? They should be correctly updated inside transactions! (That can be acceptable if the balances are rebuilt at critical moments, for example while calculating taxes.)
What can I do without corrupting data (and thus is relatively safe)?
Let's say it's "quite safe" to read uncommitted data when you're not using it to update other existing data in the DB. I.e., if you use NOLOCK only for reporting purposes (without write-back) you're on the "quite safe" side. The only "tiny trouble" is that the report can show wrong data, but, at least, the data in the DB stays consistent.
Whether to consider this safe depends on the purpose of what you're reading. If it's something informational, which is not going to be used to make decisions, that's quite safe (for example, it's not very bad to have some errors on a report of the best customers or the best-selling products). But if you're getting this information to make decisions, things can be much worse (you can make a decision on a wrong basis!).
A particular experience
I worked on the development of a 'crowded' application, with some 1,500 users, which used NOLOCK for reading data, modifying it and updating it in the DB (an HR/TEA company). And (apparently) there were no problems. The trick was that each employee read "atomic data" (a single employee's data) to modify it, and it was nearly impossible for two people to read and modify the same data at the same time. Besides, this "atomic data" didn't influence any aggregate data. So everything was fine. But from time to time there were problems in the reporting area, which read "aggregated data" with NOLOCK. So the critical reports had to be scheduled for moments when no one was working in the DB. The small deviations in non-critical reports were overlooked and acceptable.
Now you know it. You have no excuses. You decide: to NOLOCK or not to NOLOCK.
I'm quite new to SQL Server and was wondering what the difference is between the SQL Server log and a custom log (in my case, using log4net)? I guess there's more choice about what to log using log4net, but what things are automatically logged by the database? For example, if a user signs up to my site, would I have to log that transaction manually, or would it be recorded in the database's log automatically? I'm currently starting a project and would like to figure out exactly what I should bother logging.
Thanks
Apples and Oranges.
Log4net and other custom 'logging' is just a way to capture events an application is reporting. 'Log' in this context refers to whatever store is used by this infrastructure to persist information about these events.
The database log, on the other hand, is something completely different. In order to maintain consistency and atomicity, databases use a so-called write-ahead logging (WAL) protocol. With WAL, all changes are first durably written into a journal, or log, before being applied to the data. This allows recovery to replay the log (the journal) and get the data back into a consistent state, rolling back any uncommitted work.
Database logs have absolutely nothing to do with your application code. Any database update will be automatically logged by the engine, simply because this is how any data is updated in a database. You cannot modify that, nor do you have any access to what's written in the log (strictly speaking you can look into the log, but you won't find any useful information for your application).
The SQL Server log handles transaction logging for rolling back or committing data. It is usually only dealt with by someone who knows what they are doing: restoring backups or shipping the logs to use for backups.
log4net and other logging frameworks handle in-code logging of exceptions, warnings, or debug-level info that you would like to output for your own purposes. They can write to a table in a database, a command window, a flat file or a web service. Common logging scenarios are catching unhandled exceptions at the application level to help track down bugs, or writing out the stack trace in any try/catch statements.
The transaction log keeps track of transactions so the engine can roll them back or replay them in case of a crash. Quite a bit more involved than simple logging.
The two are almost completely unrelated.
A database log is used to rollback transactions, recover from crashes, etc. All good things to ensure database consistency. It has updates/inserts/deletes in it--not really anything about intent or what your app is trying to do unless it directly affects data in the database.
The application log on the other hand (with Log4Net) can be extremely useful when building and debugging your application. It is driven by you and should contain information that traces what your app is doing. This is something that can safely be turned off or reduced (by toggling the log level) when you no longer need it.
The SQL Server log file is actually used for maintaining its own stability, but it's not terribly useful for normal developers. It's not what you think (and what I thought), a list of SQL statements that have been run. It's a proprietary format designed to help SQL Server recover from a crash or roll back transactions.
If you need to track what's going on in the system, the SQL transaction log won't be helpful, and it would be very difficult to get that information back out. Instead, I would suggest adding triggers on your tables that write information off to another table, or adding some code in your data layer that saves a log of what's going on. It could be as simple as wrapping the SQL command object with your own implementation, which saves SQL statements off to log4net in addition to executing them normally.
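As an illustration of the trigger approach (everything here is invented: the Users table, its UserId column, and the audit table), a minimal sketch could look like this:

-- Minimal audit-trail sketch; all object names are invented.
CREATE TABLE dbo.UserAudit
(
    AuditId   INT IDENTITY PRIMARY KEY,
    UserId    INT         NOT NULL,
    Action    VARCHAR(10) NOT NULL,
    ChangedBy SYSNAME     NOT NULL DEFAULT SUSER_SNAME(),
    ChangedAt DATETIME2   NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER dbo.trg_Users_Audit
ON dbo.Users
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Inserted or updated rows
    INSERT INTO dbo.UserAudit (UserId, Action)
    SELECT i.UserId,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'UPDATE' ELSE 'INSERT' END
    FROM inserted i;

    -- Pure deletes (nothing in the inserted pseudo-table)
    IF NOT EXISTS (SELECT 1 FROM inserted)
        INSERT INTO dbo.UserAudit (UserId, Action)
        SELECT d.UserId, 'DELETE'
        FROM deleted d;
END;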
It is the mechanism by which the RDBMS can ensure atomicity and durability; see ACID.
Closed. This question is opinion-based. It is not currently accepting answers. Closed 9 years ago.
For example: Updating all rows of the customer table because you forgot to add the where clause.
What was it like, realizing it and reporting it to your coworkers or customers?
What were the lessons learned?
I think my worst mistake was
truncate table Customers
truncate table Transactions
I didn't see which MSSQL server I was logged into; I wanted to clear out my local copy... The familiar "OH s**t" came when it was taking significantly longer than about half a second to delete. My boss noticed I went visibly white and asked what I had just done. About half a minute later, our site monitor went nuts and started emailing us saying the site was down.
Lesson learned? Never keep a connection open to a live DB longer than absolutely needed.
I was only up till 4 am restoring the data from the backups too! My boss felt sorry for me and bought me dinner...
I work for a small e-commerce company; there are two developers and a DBA, me being one of the developers. I'm normally not in the habit of updating production data on the fly; if we have stored procedures we've changed, we put them through source control and have an official deployment routine set up.
Well, anyway, a user came to me needing an update done to our contact database, batch-updating a bunch of facilities. So I wrote out the query in our test environment, something like
update facilities set address1 = '123 Fake Street'
where facilityid in (1, 2, 3)
Something like that. Ran it in test, 3 rows updated. Copied it to the clipboard, pasted it into Terminal Services on our production SQL box, ran it, and watched in horror as it took 5 seconds to execute and updated 100,000 rows. Somehow I had copied the first line and not the second, and wasn't paying attention as I Ctrl+V, Ctrl+E'd.
My DBA, an older Greek gentleman, probably the grumpiest person I've met, was not thrilled. Luckily we had a backup and it didn't break any pages; luckily that field is really only used for display purposes (and billing/shipping).
Lesson learned was pay attention to what you're copying and pasting, probably some others too.
A junior DBA meant to do:
delete from [table] where [condition]
Instead they typed:
delete [table] where [condition]
Which is valid T-SQL but basically ignored the where [condition] bit completely (at least it did back then on MSSQL 2000/97 - I forget which) and wiped the entire table.
That was fun :-/
About 7 years ago, I was generating a change script for a client's DB after working late. I had only changed stored procedures but when I generated the SQL I had "script dependent objects" checked. I ran it on my local machine and all appeared to work well. I ran it on the client's server and the script succeeded.
Then I loaded the web site and the site was empty. To my horror, the "script dependent objects" setting did a DROP TABLE for every table that my stored procedures touched.
I immediately called the lead dev and boss letting them know what happened and asking where the latest backup of the DB could be located. 2 other devs were conferenced in and the conclusion we came to was that no backup system was even in place and no data could be restored. The client lost their entire website's content and I was the root cause. The result was a $5000 credit given to our client.
For me it was a great lesson, and now I am super-cautious about running any change scripts, and backing up DBs first. I'm still with the same company today, and whenever the jokes come up about backups or database scripts someone always brings up the famous "DROP TABLE" incident.
Something to the effect of:
update email set processedTime=null,sentTime=null
on a production newsletter database, resending every email in the database.
I once managed to write an updating cursor that never exited. On a 2M+ row table. The locks just escalated and escalated until this 16-core, 8GB RAM (in 2002!) box actually ground to a halt (of the blue screen variety).
update Customers set ModifyUser = 'Terrapin'
I forgot the where clause - pretty innocent, but on a table with 5000+ customers, my name will be on every record for a while...
Lesson learned: use transaction commit and rollback!
We were trying to fix a busted node on an Oracle cluster.
The storage management module was having problems, so we clicked the un-install button with the intention of re-installing and copying the configuration over from another node.
Hmm, it turns out the un-install button applied to the entire cluster, so it cheerfully removed the storage management module from all the nodes in the system.
Causing every node in the production cluster to crash. And since none of the nodes had a storage manager, they wouldn't come up!
Here's an interesting fact about backups... the oldest backups get rotated off-site, and you know what your oldest files on a database are? The configuration files that got set up when the system was installed.
So we had to have the offsite people send a courier with that tape, and a couple of hours later we had everything reinstalled and running. Now we keep local copies of the installation and configuration files!
I thought I was working in the testing DB (which apparently wasn't the case), so when I finished 'testing' I ran a script to reset all data back to the standard test data we use... ouch!
Luckily this happened on a database that had backups in place, so after figuring out that I had done something wrong we could easily bring back the original database.
However, this incident did teach the company I worked for to really separate the production and the test environments.
I don't remember all the SQL statements that ran out of control, but I have one lesson learned - do it in a transaction if you can (beware of big log files!).
In production, if you can, proceed the old fashioned way:
Use a maintenance window
Backup
Perform your change
Verify
Restore if something went wrong
Pretty uncool, but it generally works, and you can even hand this procedure to somebody else to run during their night shift while you're getting your well-deserved sleep :-)
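In T-SQL the backup/change/verify/restore idea can also be applied in miniature to a single change: wrap it in a transaction, look at what actually happened, and only then commit. A hedged sketch; the table, predicate, and expected row count are placeholders:

-- Hedged sketch; table, predicate and expected row count are placeholders.
BEGIN TRAN;

UPDATE dbo.Customers
SET    ModifyUser = 'Terrapin'
WHERE  CustomerId = 5;

-- Verify immediately: does the number of affected rows match what you expected?
IF @@ROWCOUNT = 1
    COMMIT;
ELSE
BEGIN
    ROLLBACK;
    PRINT 'Unexpected number of rows touched - rolled back.';
END;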
I did exactly what you suggested. I updated all the rows in a table that held customer documents because I forgot to add the "where ID = 5" at the end. That was a mistake.
But I was smart and paranoid. I knew I would screw up one day. I had issued a "start transaction". I issued a rollback and then checked the table was OK.
It wasn't.
Lesson learned in production: despite the fact that we like to use InnoDB tables in MySQL for many MANY reasons... be SURE you haven't managed to find one of the few MyISAM tables that don't respect transactions and that you can't roll back on. Don't trust MySQL under any circumstances, and habitually issuing a "start transaction" is a good thing. Even in the worst-case scenario (what happened here) it didn't hurt anything, and it would have protected me on the InnoDB tables.
I had to restore the table from a backup. Luckily we have nightly backups, the data almost never changes, and the table is a few dozen rows so it was near instantaneous. For reference, no one knew that we still had non-InnoDB tables around, we thought we converted them all long ago. No one told me to look out for this gotcha, no one knew it was there. My boss would have done the same exact thing (if he had hit enter too early before typing the where clause too).
I discovered I didn't understand Oracle redo log files (terminology? it was a long time ago) and lost a week's trade data, which had to be manually re-keyed from paper tickets.
There was a silver lining - during the weekend I spent inputting it, I learned a lot about the usability of my trade input screen, which improved dramatically thereafter.
Worst case scenario for most people is production data loss, but if they're not running nightly backups or replicating data to a DR site, then they deserve everything they get!
@Keith: in T-SQL, isn't the FROM keyword optional for a DELETE? Both of those statements do exactly the same thing...
The worst thing that happened to me was a production server consuming all the space on the HD. I was using SQL Server, so I looked at the database files and saw that the log was about 10 GB, so I decided to do what I always did when I wanted to truncate a log file: detach, delete the log file, and attach again. Well, I learned that if the log file was not closed properly this procedure does not work, so I ended up with an .mdf file and no log file. Thankfully, from the Microsoft site I found a way to recover the database and move the data to another database.
Updating all rows of the customer table because you forgot to add the where clause.
That is exactly what I did :|. I had updated the password column for all users to a sample string I had typed into the console. The worst part of it was that I was on the production server, just checking out some queries, when I did this. My seniors then had to restore an old backup and field some calls from some really disgruntled customers. Of course there was another time when I did use the delete statement, which I don't even want to talk about ;-)
I dropped the live database and deleted it.
Lesson learned: ensure you know your SQL - and make sure that you back up before you touch stuff.
This didn't happen to me, just a customer of ours whose mess I had to clean up.
They had a SQL server running on a RAID5 disk array - nice hotswap drives complete with lighted disk status indicators. Green = Good, Red = Bad.
One of their drives turned from green to red, and the genius who was told to pull and replace the (red) bad drive took a (green) good one out instead. Well, this didn't quite manage to bring down the RAID set completely - it opted for the somewhat-readable (red) drive over the unavailable (green) one for several minutes. After the mistake was realized and the drives swapped back, any data blocks written during that time became gibberish, as disk synchronization was lost... 24 straight hours later, after writing meta-programs to recover readable data and reconstruct a medium-sized schema, they were back up and running.
Morals of this story include: never use RAID 5, always maintain backups, and be careful who you hire.
I made a major mistake on a customer's production system once - luckily, while wondering why the command was taking so long to execute, I realized what I had done and canceled it before the world came to an end.
Morals of this story include... always start a new transaction before changing ANYTHING, test that the results are what you expect, and then and only then commit the transaction.
As a general observation, many classes of rm -rf / type errors can be prevented by properly defining foreign key constraints on your schema and staying far away from any command labeled 'CASCADE'.
Truncate table T_DAT_STORE
T_DAT_STORE was the fact table of the department I work in. I thought I was connected to the development database. Fortunately, we have a daily backup, which hadn't been needed until that day, and the data was restored in six hours.
Since then I double-check everything before a truncate, and periodically I ask for a backup restoration of minor tables just to check that the backups are doing well (backups aren't done by my department).