Delete 200G (465,025,579 records) of data from SQL Server

I have a database in which one of the tables has become over-populated (465,025,579 records). What is the best way to delete the records and keep only the last 3 months' worth, without the server hanging?

Delete them in batches based on date, earliest first. Sure, it'll take some time, but it's safer (since you define exactly which rows to delete) and not so resource intensive. It also means you can shrink the database in batches too, instead of in one big hit (which is quite resource intensive).
Yeah, it might fragment the database a little, but until you've got the actual data down to a manageable level, there isn't that much you can do.
To be fair, 200G of data isn't that much on a decent machine these days.
All this said, I'm presuming you want the database to remain 'online'.
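For illustration, a minimal sketch of that batched approach; the table name, date column, and batch size are assumptions you would adjust for your schema:

DECLARE @cutoff datetime;
SET @cutoff = DATEADD(MONTH, -3, GETDATE());

WHILE 1 = 1
BEGIN
    -- delete one small chunk of rows older than the cutoff
    DELETE TOP (10000) FROM dbo.BigTable
    WHERE CreatedDate < @cutoff;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing older than the cutoff remains

    -- in SIMPLE recovery a checkpoint lets the log space be reused between chunks
    CHECKPOINT;
END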

If you don't need the database to be available whilst you're doing this, the easiest thing to do is usually to select the rows that you want to keep into a different table, run a TRUNCATE on the original table, and then copy the saved rows back in.
From TRUNCATE:
TRUNCATE TABLE is similar to the DELETE statement with no WHERE clause; however, TRUNCATE TABLE is faster and uses fewer system and transaction log resources.
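Roughly, and assuming made-up table and column names (and that nothing else writes to the table while this runs), that looks like:

-- copy the rows worth keeping (last 3 months) to a holding table
SELECT *
INTO dbo.BigTable_keep
FROM dbo.BigTable
WHERE CreatedDate >= DATEADD(MONTH, -3, GETDATE());

TRUNCATE TABLE dbo.BigTable;

-- copy them back; an IDENTITY column would need SET IDENTITY_INSERT ON and an explicit column list
INSERT INTO dbo.BigTable
SELECT * FROM dbo.BigTable_keep;

DROP TABLE dbo.BigTable_keep;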

Related

Faster SQL Performance

I have to insert one record per table across 30 tables. The data comes from another system. I have to insert the data into the tables for the first time; then, if any update happens, I need to update the tables in SQL Server. I have two options:
a) I can check the timestamp for individual table rows and update if the incoming timestamp is greater.
b) Every time, I can simply delete the records and re-insert the data.
Which one will be faster in a SQL Server database? Is there any other option to address the situation?
If you are not changing the indexed fields of the record, the strategy of trying to update first and then insert is usually faster than drop/insert, as you don't force the database into updating a bunch of index info.
If you're using SQL Server 2008+ you should be using the MERGE command, as it explicitly handles the update/insert condition cleanly and clearly.
ADDED
I should also add that if you know the usage pattern rarely updates (i.e., 90% insert), you may have a case where drop/insert is faster than update/insert -- it depends on lots of details. Regardless, MERGE is the clear winner if using 2008+.
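As a sketch of the update-first strategy for a single record in one of the tables (the table, columns, and parameters here are invented; under heavy concurrency, MERGE or an explicit transaction with UPDLOCK is safer):

UPDATE dbo.SomeTable
SET    Col1 = @Col1,
       Col2 = @Col2
WHERE  Id = @Id;

IF @@ROWCOUNT = 0
    -- the row did not exist, so insert it
    INSERT INTO dbo.SomeTable (Id, Col1, Col2)
    VALUES (@Id, @Col1, @Col2);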
I generally like drop and re-insert. I find it to be cleaner and easier to code. However, if this is happening very frequently and you're worried about concurrency issues, you're probably better off with option (a).
Also, another thing to factor in is how often the timestamp check fails (where you don't have to insert or update). If 99% of the data is redundant/outdated, you're probably better off with option (a) regardless.

How to update and add new records at the same time

I have a table that contains more than a million records (products).
Now, daily, I need to either update existing records and/or add new ones.
Instead of doing it one by one (which takes a couple of hours), I managed to use SqlBulkCopy to work with batches of records and got my inserts done in a matter of seconds, but it can handle only new inserts. So I am thinking about creating a new table that contains the new records and the old records, and then using that temporary table (on the SQL end) to update/add to the main table.
Any advice on how I can perform that update?
One of the better ways to handle this is with the MERGE command in SQL. Mssqltips has a good tutorial on it; it can be a bit trickier to use than some of the other commands.
Also, due to locking you may want to break this up into multiple smaller transactions, unless you know you can tolerate blocking during the update.
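A hedged sketch of that MERGE, assuming the SqlBulkCopy rows land in a staging table first (all object and column names below are placeholders; MERGE needs SQL Server 2008 or later):

MERGE dbo.Products AS target
USING dbo.Products_Staging AS source
    ON target.ProductId = source.ProductId
WHEN MATCHED THEN
    UPDATE SET target.Name  = source.Name,
               target.Price = source.Price
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ProductId, Name, Price)
    VALUES (source.ProductId, source.Name, source.Price);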
We handle this situation in our code in the way you described; we have a temp table, then run an update where the ID in the temp table matches the table to be updated, then run an insert where the ID in the table to be updated is null. We normally do this for updates to library/program settings, though, so it is only run infrequently, on smaller tables. Performance may not be up to par for that many records, or daily runs.
The main "gotcha" I've encountered with this method is that for the update, we did a comparison to make sure at least one of several fields changed before actually running the update. (Our initial reason for this was to avoid overwriting some defaults, which could affect server behavior. Your reason for this might be performance, if your temp table could contain records that haven't actually changed). We encountered a case where we did actually want to update one of the defaults, but our old script didn't catch that. So if you do any comparisons to determine which products you want to update, make sure it is either complete from the start, or document well any fields you don't compare, and why.

SQL Server Optimize Large Changing Table

I have reports that perform some time consuming data calculations for each user in my database, and the result is 10 to 20 calculated new records for each user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method currently used is to delete all previous records for a given user and then insert the new set. There is no PK on the table; only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is that I cannot do this in one batch, because of the way the script works: it examines one user at a time, deletes that user's previous 10-20 records, and inserts a new set of 10-20 records, over and over. I am worried that the transaction log will run out of space or that other performance issues could occur. I would like to configure the table not to worry about data preservation or other things that could slow it down. I cannot drop the indexes and so on because people are accessing the table concurrently while it is being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to be able to locate the affected rows in the first place, and without appropriate indexes it will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
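As a rough sketch of the suggested supporting index (the table and column names are assumptions):

CREATE NONCLUSTERED INDEX IX_Snapshot_UserId
    ON dbo.ReportSnapshot (UserId)
    INCLUDE (BusinessKey);   -- covers the per-user lookups for the delete/merge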
This is called a bulk insert: you have to drop all indexes on the destination table and send the insert commands in large batches (hundreds of insert statements) separated by semicolons.
Another way is to use the BULK INSERT statement (http://msdn.microsoft.com/en-us/library/ms188365.aspx), but it involves dumping the data to a file first.
See also: Bulk Insert Sql Server millions of record
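A hedged example of the BULK INSERT route; the table name, file path, and delimiters are placeholders, and the file must be readable by the SQL Server service account:

BULK INSERT dbo.ReportSnapshot
FROM 'C:\loads\snapshot.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK   -- a table lock allows minimal logging when loading into a heap
);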
It really depends upon many things: the speed of your machine, the size of the records being processed, network speed, etc.
Generally it is quicker to add records to a "heap" or an un-indexed table. So dropping all of your indexes and re-creating them after the load may improve your performance.
Partitioning the table may yield performance benefits if you partition by active and inactive users (although the data set may be a little small for this).
Ensure you test how much time each tweak adds to or removes from your load, and work from there.
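If you go the drop-the-indexes route mentioned above, disabling and rebuilding is the usual pattern; a sketch with assumed names (do not disable the clustered index, as that makes the table unreadable):

ALTER INDEX IX_Snapshot_UserId ON dbo.ReportSnapshot DISABLE;

-- ... run the nightly load here ...

ALTER INDEX IX_Snapshot_UserId ON dbo.ReportSnapshot REBUILD;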

Is deleting all records in a table a bad practice in SQL Server?

I am moving a system from a VB/Access app to SQL Server. One common pattern in the Access database is the use of tables to hold data that is being calculated and then using that data for a report.
eg.
delete from treporttable
insert into treporttable (.... this thing and that thing)
update treporttable set x = x * price where (...etc)
and then the report runs from treporttable
I have heard that SQL Server does not like it when all records are deleted from a table, as it creates huge logs, etc. I tried temporary SQL tables, but they don't persist long enough for the report, which runs in a different process, to report off of.
There are a number of places where this is done to different report tables in the application. The reports can be run many times a day and have a large number of records created in the report tables.
Can anyone tell me if there is a best practice for this, or if my information about the logs is incorrect and this code will be fine in SQL Server?
If you do not need to log the deletion activity you can use the truncate table command.
From Books Online:
TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE.
http://msdn.microsoft.com/en-us/library/aa260621(SQL.80).aspx
delete from sometable
Is going to allow you to roll back the change. So if your table is very large, then this can cause a lot of memory usage and time.
However, if you have no fear of failure then:
truncate table sometable
Will perform nearly instantly, and with minimal memory requirements. There is no rollback though.
To Nathan Feger:
You can rollback from TRUNCATE. See for yourself:
CREATE TABLE dbo.Test(i INT);
GO
INSERT dbo.Test(i) SELECT 1;
GO
BEGIN TRAN
TRUNCATE TABLE dbo.Test;
SELECT i FROM dbo.Test;   -- inside the open transaction: no rows
ROLLBACK
GO
SELECT i FROM dbo.Test;   -- after ROLLBACK: the row is back
GO
i
(0 row(s) affected)
i
1
(1 row(s) affected)
You could also DROP the table, and recreate it...if there are no relationships.
The [DROP table] statement is transactionally safe whereas [TRUNCATE] is not.
So it depends on your schema which direction you want to go!!
Also, use SQL Profiler to analyze your execution times. Test it out and see which is best!!
The answer depends on the recovery model of your database. If you are in full recovery mode, then you have transaction logs that could become very large when you delete a lot of data. However, if you're backing up transaction logs on a regular basis to free the space, this might not be a concern for you.
Generally speaking, if the transaction logging doesn't matter to you at all, you should TRUNCATE the table instead. Be mindful, though, of any key seeds, because TRUNCATE will reseed the table.
EDIT: Note that even if the recovery model is set to Simple, your transaction logs will grow during a mass delete. The transaction logs will just be cleared afterward (without releasing the space). The point is that a DELETE still logs every deleted row, even if only temporarily.
Consider using temporary tables. Their names start with # and they are dropped automatically once nothing refers to them. Example:
create table #myreport (
    id int identity,   -- column data types here are placeholders
    col1 int,
    ...
)
Temporary tables are made to be thrown away, and that happens very efficiently.
Another option is using TRUNCATE TABLE instead of DELETE. TRUNCATE is minimally logged, so it will not grow the log file the way a large DELETE does.
I think your example has a possible concurrency issue. What if multiple processes are using the table at the same time? Adding a JOB_ID column or something like that would allow you to clear the relevant entries in this table without clobbering the data being used by another process.
Actually, tables such as treporttable do not need to be recovered to a point in time. As such, they can live in a separate database with simple recovery mode. That eases the burden of logging.
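For example (the database name is made up):

-- one-off setup for a scratch reporting database that never needs point-in-time recovery
CREATE DATABASE ReportingScratch;
ALTER DATABASE ReportingScratch SET RECOVERY SIMPLE;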
There are a number of ways to handle this. First, you can move the creation of the data into the running of the report itself. I feel this is the best way to handle it; you can then use temp tables to stage your data temporarily, and no one will have concurrency issues if multiple people try to run the report at the same time. Depending on how many reports we are talking about, it could take some time to do this, so you may need another short-term solution as well.
Second, you could move all your reporting tables to a different database that is set to simple recovery mode and truncate them before running your queries to populate them. This is closest to your current process, but could be an issue if multiple users try to run the same report.
Third, you could set up a job to populate the tables (still in a separate database set to simple recovery) once a day (truncating at that time). Then anyone running a report that day will see the same data and there will be no concurrency issues. However, the data will not be up-to-the-minute. You could also set up a reporting data warehouse, but that is probably overkill in your case.

How to copy a large set of data in a SQL Server db

I have a requirement to take a "snapshot" of a current database and clone it into the same database, with new Primary Keys.
The schema in question consists of about 10 tables, but a few of the tables will potentially contain hundreds of thousands to 1 million records that need to be duplicated.
What are my options here?
I'm afraid that writing a SPROC will require locking the database rows in question (for concurrency) for the entire duration of the operation, which is quite annoying to other users. How long would such an operation take, assuming we can optimize it to the full extent SQL Server allows? Is it going to be 30 seconds to 1 minute to perform this many inserts? I'm not able to lock the whole table(s) and do a bulk insert, because there are other users under other accounts using the same tables independently.
Depending on performance expectations, an alternative would be to dump the current db into an xml file and then asynchronously clone the db from this xml file at leisure in the background. The obvious advantage of this is that the db is only locked for the time it takes to do the xml dump, and the inserts can run in the background.
If a good DBA can get the "clone" operation to execute start to finish in under 10 seconds, then it's probably not worth the complexity of the xmldump/webservice solution. But if it's a lost cause, and inserting potentially millions of rows is likely to balloon out in time, then I'd rather start out with the xml approach right away.
Or maybe there's an entirely better approach altogether??
Thanks a lot for any insights you can provide.
I would suggest backing up the database, and then restoring it as a new DB on your server. You can use that new DB as your source.
I would definitely recommend against the XML dump idea.
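A sketch of that backup/restore route; the database names, paths, and logical file names are assumptions (get the real logical names from RESTORE FILELISTONLY):

BACKUP DATABASE SourceDb
TO DISK = 'C:\backups\SourceDb.bak'
WITH COPY_ONLY;

RESTORE DATABASE SourceDb_Clone
FROM DISK = 'C:\backups\SourceDb.bak'
WITH MOVE 'SourceDb'     TO 'C:\data\SourceDb_Clone.mdf',
     MOVE 'SourceDb_log' TO 'C:\data\SourceDb_Clone_log.ldf';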
Does it need to be in the exact same tables? You could make a set of "snapshot" tables where all these records go; you would only need a single insert + select, like
insert into snapshots_source1 (user,col1, col2, ..., colN)
select 'john', col1, col2, ..., colN from source1
and so on.
You can give the snapshots_* tables an IDENTITY column that will create the 'new PK', and they can also preserve the old one if you so wish.
This has (almost) no locking issues and looks a lot saner.
It does require a change in the code, but it shouldn't be too hard to make the app point to the snapshot tables when appropriate.
This also eases cleanup and maintenance.
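A possible shape for such a snapshot table (everything beyond the column names used in the insert above is invented):

CREATE TABLE dbo.snapshots_source1 (
    new_id int IDENTITY(1,1) PRIMARY KEY,   -- generates the 'new PK'
    old_id int NULL,                        -- optionally preserves the original key
    [user] varchar(50) NOT NULL,
    col1   int NULL,
    col2   int NULL
    -- ..., colN as in the insert above
);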
---8<------8<------8<---outdated answer---8<---8<------8<------8<------8<---
Why don't you just take a live backup and do the data manipulation (key changing) on the destination clone?
Now, in general, this snapshot with new primary keys idea sounds suspect. If you want a replica, you have log shipping and cluster service, if you want a copy of the data to generate a 'new app instance' a backup/restore/manipulate process should be enough.
You don't say how much space your DB occupies, but you can certainly back up 20 million rows (800MB?) in about 10 seconds, depending on how fast your disk subsystem is...
