Updating a large (280M rows) table with anonymised data

Updating a large (280M rows) table with anonymised data - sql-server

We have a production database which is being migrated to a new environment, and the customer requires data in certain columns and certain tables to be anonymised while the project is in the development phase.
The supplier has provided a script which replaces the data - for example:
UPDATE ThisTable SET Description = 'Anonymised ' + TableKey
Now the problem is that several of the tables have millions of rows. The biggest is 284,000,000 rows.
The above statement will, of course, never work for such a table due to Locks, TempDb and row versions, log files, etc. etc.
I have a script which I've used before which in essence does the following:
Current version of how i am doing it:
1. Creates a temp table of the source table's PK (and creates an index on the PK).
2. Selects top n PKs from the temp table and processes the appropriate rows in source table.
3. Deletes the top n PKs from the temp table
4. Repeats from step 2
This works well - it gives reasonable performance (and does some metrics to be able to predict end time). However, running it on the large table gives a predicted run time of 4 days!
Other measures I've taken are to put the database in simple recovery mode.
We have exclusive access to the server, and can 'do what we want' with it.
The core problem is that we're talking large numbers of rows. One thought is BCP OUT to text file(s), process offline, and BCP in. However, then we're still into processing a text file with 284,000,000 lines!
ASK:
So - any other thoughts on how to achieve the above? Am I missing a 'simple' way to do this?

Step 1 Crate same table structure in with name ie tablename+temp
Step 2 Now do insert into tablename+temp from select from tablename.
ie insert into tablenametemp
select colunns 'Anonymised ' + TableKey as Description from tablename.
Step 3 Rename tablename to tablename1 and tablename+temp to tablename
Step 4 drop tablename1 (after verification)
Note if you have constrain create do rename them too.

Related

MS SQL Trigger for ETL vs Performance

I would need information what might be the impact for production DB of creating triggers for ~30 Production tables that capture any Update,Delete and Insert statement and put following information "PK", "Table Name", "Time of modification" to separate table.
I have limited ability to test it as I have read only permissions to both Prod and Test environment (and I can get one work day for 10 end users to test it).
I have estimated that number of records inserted from those triggers will be around ~150-200k daily.
Background:
I have project to deploy Data Warehouse for database that is very customized + there are jobs running every day that manipulate the data. Updated on Date column is not being maintain (customization) + there are hard deletes occurring on tables. We decided to ask DEV team to add triggers like:
CREATE TRIGGER [dbo].[triggerName] ON [dbo].[ProductionTable]
FOR INSERT, UPDATE, DELETE
AS
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM inserted
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM deleted
on core ~30 production tables.
Based on this table we will pull delta from last 24 hours and push it to Data Warehouse staging tables.
If anyone had similar issue and can help me estimate how it can impact performance on production database I will really appreciate. (if it works - I am saved, if not I need to propose other solution. Currently mirroring or replication might be hard to get as local DEVs have no idea how to set it up...)
Other ideas how to handle this situation or perform tests are welcome (My deadline is Friday 26-01).

First of all I would suggest you code your table name into a smaller variable and not a character one (30 tables => tinyint).
Second of all you need to understand how big is the payload you are going to write and how:
if you chose a correct clustered index (date column) then the server will just need to out data row by row in a sequence. That is a silly easy job even if you put all 200k rows at once.
if you code the table name as a tinyint, then basically it has to write:
1byte (table name) + PK size (hopefully numeric so <= 8bytes) + 8bytes datetime - so aprox 17bytes on the datapage + indexes if any + log file . This is very lightweight and again will put no "real" pressure on sql sever.
The trigger itself will add a small overhead, but with the amount of rows you are talking about, it is negligible.
I saw systems that do similar stuff on a way larger scale with close to 0 effect on the work process, so I would say that it's a safe bet. The only problem with this approach is that it will not work in some cases (ex: outputs to temp tables from DML statements). But if you do not have these kind of blockers then go for it.
I hope it helps.

Fastest way to compare multiple column values in sql server?

I have a Table in sql server consisting of 200 million records in two different servers. I need to move this table from Server 1 to Server 2.
Table in server 1 can be a subset or a superset of the table in server 2. Some of the records(around 1 million) in server 1 are updated which I need to update in server 2. So currently I am following this approach :-
1) Use SSIS to move data from server 1 to staging database in server 2.
2) Then compare data in staging with the table in server 2 column by column. If any of the column is different, I update the whole row.
This is taking a lot of time. I tried using hashbytes inorder to compare rows like this:-
HASHBYTES('sha',CONCAT(a.[account_no],a.[transaction_id], ...))
<>
HASHBYTES('sha',CONCAT(b.[account_no],b.[transaction_id], ...))
But this is taking even more time.
Any other approach which can be faster and can save time?

This is a problem that's pretty common.
First - do not try and do the updates directly in SQL - the performance will be terrible, and will bring the database server to its knees.
In context, TS1 will be the table on Server 1, TS2 will be the table on Server 2
Using SSIS - create two steps within the job:
First, find the deleted - scan TS2 by ID, and any TS2 ID that does not exist in TS1, delete it.
Second, scan TS1, and if the ID exists in TS2, you will need to update that record. If memory serves, SSIS can inspect for differences and only update if needed, otherwise, just execute the update statement.
While scanning TS1, if the ID does not exist in TS2, then insert the record.
I can't speak to performance on this due to variations in schemas as servers, but it will be compute intensive to analyze the 200mm records. It WILL take a long time.
For on-going execution, you will need to add a "last modified date" timestamp to each record and a trigger to update the field on any legitimate change. Then use that to filter out your problem space. The first scan will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last modified date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index to that column to aid in the filtering.
The other option is to perform a burn and load each time - disable any constraints around TS2, truncate TS2 and copy the data into TS2 from TS1, finally reenabling the constraints and rebuild any indexes.
Best of luck to you.

Efficient DELETE TOP?

Is it more efficient and ultimately FASTER to delete rows from a DB in blocks of 1000 or 10000? I am having to remove approx 3 million rows from many tables. I first did the deletes in blocks of 100K rows but the performance wasn't looking good. I changed to 10000 and seem to be removing faster. Wondering if even smaller like 1K per DELETE statement is even better.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'

Yes, it is. It all depends on your server though. I mean, last time I did that i was using this approeach to delete things in 64 million increments (on a table that had at that point around 14 billion rows, 80% Of which got ultimately deleted). I got a delete through every 10 seconds or so.
It really depends on your hardware. Going moreg granular is more work but it means less waiting for tx logs for other things operating on the table. You have to try out and find where you are comfortable - there is no ultimate answer because it is totally dependend on usage of the table and hardware.

We used Table Partitioning to remove 5 million rows in less than a sec but this was from just one table. It took some work up-front but ultimately was the best way. This may not be the best way for you.
From our document about partitioning:
Let’s say you want to add 5 million rows to a table but don’t want to lock the table up while you do it. I ran into a case in an ordering system where I couldn’t insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don’t overlap current data.
WHAT TO WATCH OUT FOR:
Data CANNOT overlap current data. You have to partition the data on a value. The new data cannot be intertwined within the currently partitioned data. If removing data, you have to remove an entire partition or partitions. You will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
OVERVIEW OF STEPS:
FOR ADDING RECORDS
Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
Create new table with the exact same structure (keys, data types, etc.).
Add a constraint to the new table to limit that data so that it would fit into the blank partition in the old table.
Insert new rows into new table.
Add indexes to match old table.
Swap the new table with the blank partition of the old table.
Un-partition the old table if you wish.
FOR DELETING RECORDS
Partition the table into sets so that the data you want to delete is all on partitions by itself (this could be many different partitions).
Create a new table with the same partitions.
Swap the partitions with the data you want to delete to the new table.
Un-partition the old table if you wish.

Yes, no, it depends on the usage of table due to locking. I would try to delete the records in a slower pace. So the opposite of the op's question.
set rowcount 10000
while ##rowcount > 0
begin
waitfor delay '0:0:1'
delete
from table
where date < convert(datetime, '20120101', 112)
end
set rowcount 0

Copy data from one column to another in oracle table

My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seems to run for 4 hours but never seems to complete.
I spoke with a former developer that was more familiar with Oracle than I, and they told me you would normally create a procedure that breaks this down into chunks of data to be commited (roughly 1000 records or so). This procedure would iterate over the 13 million records and commit 1000 records, then commit the next 1000...normally breaking the data up based on the primary key.
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table, or linking to another database. I'm simply copying data from one column to another. I don't consider 13 million records to be large considering there are systems out there in the orders of billions of records. I can't imagine it takes a computer hours and hours (only to fail) at copying a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE run for 4 hours and fail (which simply copies one column into another column), but the create table which copies the entire table takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!

It does seem strange to me. However, this comes to mind:
When you are updating the table, transaction logs must be created in case a rollback is needed. Creating a table, that isn't necessary.

SQL Query to copy tables between servers with auto insert / update depending on Identity column

Let's say I have 2 servers, and one identical table per server. In each tables, I have identity increment on (by 1 if u ask), and there is 'time' column to note when was the record updated/inserted.
so kinda like this:
ID Content Time
1 banana 2011-01-01 09:59:23.000
2 apple 2011-01-02 12:41:01.000
3 pear 2011-04-05 04:05:44.000
I want to copy (insert/update) all the contents from one table to another periodically with this requirements:
a. copy (insert/update) only before certain MONTH. i.e before August 2011. this is easy though.
b. insert only if records is really new (maybe if the ID isn't exist?)
c. update if you find the 'Time' column is newer (basically that means there is an update at that record) than the last performed copy (I save the date/time of last copy too)
I could do all that by building a program and check it record by record, but with hundred thousands records, it would be pretty slow.
Could it be done using just query?
Btw I'm using this query to copy between servers and I'm using SQL Server 2005
INSERT OPENQUERY(TESTSERVER, 'SELECT * FROM Table1')
SELECT * FROM Table1
thx for da help :)

My strategy in a situation like this is:
first do an outer join and determine the state of the data into a temp table
e.g.
#temp
id state
1 update
2 copy
3 copy
4 insert
Then run n statements joining the three tables together, one for each of the states. Sometimes you can eliminate the multiple statements by entering empty rows in the target with the correct keys. Then you only need to do a more complex update/copy.
However since these are cross server - I'd suggest the following strategy - copy the source table over entirely to the other server, and then do the above on the other server.
Only after this do I do performance analysis to optimise if needed.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight