I'm currently changing the Id field of a table to be an IDENTITY field. In principle this is simple: create a temp table, copy all the data to the temp table, adjust all the references from and to the table to point from and to the new temp table, drop the old table, and rename the temp table to the original name.
Now I've got the problem that the copy step is taking too long. Actually the table doesn't have too many entries (~7.5 million rows), but it still takes multiple hours to do this.
I'm currently moving the data with a query like this:
SET IDENTITY_INSERT MyTable_Temp ON
INSERT INTO MyTable_Temp ([Fields]) SELECT [Fields] FROM MyTable
SET IDENTITY_INSERT MyTable_Temp OFF
I've had a look at bcp in combination with cmdshell and a following BULK INSERT, but I don't like the solution of first writing the data to a temp-file and afterwards dumping it back into the new table.
Is there a more efficient way to copy or move the data from the old to the new table? And can this be done in "pure" T-SQL?
Keep in mind, the data is correct (no external sources involved) and no changes are being made to the data during transfer.
Your approach seems fair, but the transaction generated by the insert command is too large and that is why it takes so long.
My approach when dealing with this in the past was to use a cursor and a batching mechanism.
Perform the operation for only 100000 rows at a time, and you will see major improvements.
After the copy is made you can rebuild your references and eventually remove the old table... and so on. Be careful to reseed your new table accordingly after the data is copied.
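A minimal sketch of the batching idea, for illustration only. It uses a WHILE loop keyed on Id instead of a cursor, and it assumes that Id is a positive integer key, that MyTable_Temp starts out empty, and that Col1 stands in for the real column list (Col1 is a hypothetical name):

```sql
SET NOCOUNT ON;

DECLARE @BatchSize INT = 100000;
DECLARE @LastId    INT = 0;   -- assumption: every Id is greater than 0
DECLARE @Rows      INT = 1;

SET IDENTITY_INSERT dbo.MyTable_Temp ON;

WHILE @Rows > 0
BEGIN
    -- Each INSERT commits on its own, so no single transaction grows large
    INSERT INTO dbo.MyTable_Temp (Id, Col1)
    SELECT TOP (@BatchSize) Id, Col1
    FROM dbo.MyTable
    WHERE Id > @LastId
    ORDER BY Id;

    SET @Rows = @@ROWCOUNT;

    -- Remember the highest key copied so far
    IF @Rows > 0
        SELECT @LastId = MAX(Id) FROM dbo.MyTable_Temp;
END;

SET IDENTITY_INSERT dbo.MyTable_Temp OFF;

-- Bring the identity value in line with the highest Id in the column
DBCC CHECKIDENT ('dbo.MyTable_Temp', RESEED);
```

The batch size is the one suggested above; it is worth trying a few values, since the best size depends on row width and log throughput.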
I am testing different strategies for an incoming breaking change. The problem is that each experiment would carry some costs in Azure.
The data is huge, and can have some inconsistencies due to many years with fixes and transactions before I even knew the company.
I need to change a column in a table with millions of records and dozens of indexes. This will cause a long downtime.
ALTER TABLE X ALTER COLUMN A1 decimal(15, 4) --The original column is int
One of the initial ideas (Now I know this is not possible) is to have a secondary replica, do the changes there, and, when changes finish, swap primary with secondary... zero or almost zero downtime. I am referring to a "live", redundant replica, not just a "copy"
EDIT:
Throwing new ideas:
1. Variations on what has been mentioned in one of the answers: create a replica of the table (not the whole DB, just the table), apply an INSERT INTO ... SELECT and swap the tables at the end of the process. Or do the swap early to minimize downtime, in exchange for a delay while all records from the source are added afterwards.
I have tried this, but it takes AGES to complete. Also, some NULL and FK violations make the process fail after running for several hours.
"Resuming" could be an option, but it makes the process slower with each execution. Without some kind of "resume", each failed run has to be repeated from scratch.
An acceptable improvement could be to IGNORE the errors (but create logs, of course) and apply fixes after the migration. But as far as I know, neither Azure SQL nor SQL Server offers an "ignore" option.
2. Drop all indexes, constraints and dependencies on the column that needs to be modified, modify the column, and apply all indexes, constraints and dependencies again.
I also tried this one. Some indexes take AGES to rebuild. But for now it seems to be the best bet.
There is a possible variation that applies ROW COMPRESSION before the datatype change, but I think it will not improve the real bottleneck: index re-creation.
3. Create a new column with the target datatype, copy the data from the source column, drop the old column and rename the new one.
This strategy also requires dropping and regenerating indexes, so it will not offer much gain (if any) over #2.
A friend thought of a variation on this, which is to create the needed indexes ONLINE on the column copy. In the meantime, a trigger propagates all changes on the source column to the column copy.
For any of the mentioned strategies, some gain can be obtained by increasing the processing power. But we are considering scaling up with any of the approaches anyway, so this applies equally to all solutions.
When you need to update A LOT of rows as a one-time event, it may be more effective to use the following migration technique:
create a new target table
use INSERT INTO SELECT to fill the new table with correct / updated values
rename the old and new table
create indexes for the new table
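The four steps above could be sketched roughly as follows. This is an illustration under assumptions, not a tested migration script: dbo.X and A1 come from the question, while dbo.X_New, X_Old, the Id column and the index name are hypothetical.

```sql
-- 1. New target table with the desired column type (no indexes yet)
CREATE TABLE dbo.X_New
(
    Id INT            NOT NULL,
    A1 DECIMAL(15, 4) NULL
);

-- 2. Fill it with the converted values
INSERT INTO dbo.X_New (Id, A1)
SELECT Id, CAST(A1 AS DECIMAL(15, 4))
FROM dbo.X;

-- 3. Swap the names; both renames succeed or fail together
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.X', 'X_Old';
    EXEC sp_rename 'dbo.X_New', 'X';
COMMIT TRANSACTION;

-- 4. Build the indexes on the renamed table
CREATE UNIQUE CLUSTERED INDEX IX_X_Id ON dbo.X (Id);
```

Building the indexes after the load avoids maintaining them row by row during the INSERT, at the cost of the table being unindexed right after the swap.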
After many tests and backups, we finally used the following approach:
1. Create a new column [columnName_NEW] with the desired format change. Allow NULLs.
2. Create a trigger for INSERTs that updates the new column with the value of the column to be replaced.
3. Copy the old column value to the new column in batches.
This operation is very time consuming. We ran a batch every day in a maintenance window (2h a day for 4 days). Our batch filled the values taking the oldest rows first; we counted on the trigger to fill the new ones.
4. Once #3 is complete, stop allowing NULLs on the new column, but set a default value to keep the INSERT trigger from failing.
5. Create all the needed indexes and views on the new column. This is very time consuming but can be done ONLINE.
6. Allow NULLs on the old column.
7. Remove the insert trigger. Downtime starts now!
8. Rename the old column to [columnName_OLD] and the new one to [columnName]. This requires only a few seconds of downtime!
--> At this point you can consider it done!
9. After a safe period, you can back up the result and remove [columnName_OLD] with all of its dependencies.
I selected the other answer because I think it could also be useful in most situations. This one has more steps, but it has very little downtime and is reversible at any step but the last.
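For readers who want to see the core of this approach in T-SQL, here is a rough sketch. It assumes a table dbo.X with an integer key Id and an int column A1 being widened to decimal(15, 4); the trigger name and the batch size are made up for illustration. GO is the batch separator used by SSMS and sqlcmd, not a T-SQL statement. The NOT NULL change, default and index steps are left out.

```sql
-- New nullable column with the target type
ALTER TABLE dbo.X ADD A1_NEW DECIMAL(15, 4) NULL;
GO

-- Trigger that fills the new column for freshly inserted rows
CREATE TRIGGER dbo.trg_X_CopyA1 ON dbo.X
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE x
    SET    x.A1_NEW = i.A1
    FROM   dbo.X AS x
    JOIN   inserted AS i ON i.Id = x.Id;
END;
GO

-- Backfill existing rows in batches (run inside a maintenance window)
DECLARE @Rows INT = 1;
WHILE @Rows > 0
BEGIN
    UPDATE TOP (50000) dbo.X
    SET    A1_NEW = A1
    WHERE  A1_NEW IS NULL AND A1 IS NOT NULL;

    SET @Rows = @@ROWCOUNT;
END;
GO

-- Short downtime: drop the trigger and swap the column names
DROP TRIGGER dbo.trg_X_CopyA1;
EXEC sp_rename 'dbo.X.A1', 'A1_OLD', 'COLUMN';
EXEC sp_rename 'dbo.X.A1_NEW', 'A1', 'COLUMN';
```

The backfill here simply picks any rows that are still empty rather than taking the oldest rows first, which is a simplification of what we actually ran.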
Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still do from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some sort of lookup table, from which we get several possibilities for each row, and then selecting the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (updating a couple of million rows within a couple of billion rows table)
Should I write an UPDATE statement with a join?
Would it be better to DELETE this million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance in Snowflake behaves quite differently from that of a traditional RDBMS. All your tables persist in S3, and S3 does not let you rewrite only select bytes of an existing object; the entire file object must be uploaded and replaced. So, whereas in, say, SQL Server data and indexes are modified in place, with new pages created as necessary, an UPDATE/DELETE in Snowflake is a full sequential scan on the table file, creating an immutable copy of the original with applicable rows filtered out (deleted) or modified (update), which then replaces the file just scanned.
So, whether updating 1 row, or 1M rows, at minimum the entirety of the micro-partitions that the modified data exists in will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B). Among other things, it should keep your Time-Travel costs down vs constantly wiping and rewriting tables. Another consideration is that since Snowflake is column oriented, a column update in theory should only require operations on the S3 files for that column, whereas an insert/delete would replace all S3 files for all columns, which would lower performance.
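As a rough illustration of the MERGE idea, assuming the re-assessed rows have first been written to a staging table. All table and column names here (extended_fact, recomputed_props, fact_id, clean_group, clean_label) are hypothetical:

```sql
-- recomputed_props holds one row per fact row whose derived properties changed
MERGE INTO extended_fact
USING recomputed_props
    ON extended_fact.fact_id = recomputed_props.fact_id
WHEN MATCHED THEN
    UPDATE SET extended_fact.clean_group = recomputed_props.clean_group,
               extended_fact.clean_label = recomputed_props.clean_label
WHEN NOT MATCHED THEN
    INSERT (fact_id, clean_group, clean_label)
    VALUES (recomputed_props.fact_id,
            recomputed_props.clean_group,
            recomputed_props.clean_label);
```

The source should contain at most one row per fact_id, which matches the "one per initial row" selection described in the question.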
I'm using tracing to log all delete or update queries run through the system. The problem is, if I run a query like DELETE FROM [dbo].[Artist] WHERE ArtistId>280, I know how many rows were deleted but I'm unable to find out which rows were deleted (the data they had).
I'm thinking of doing this as a logging system so it would be useful to see which rows were affected and what data they had if at all possible. I don't really want to use triggers for this job but I will if I have to (and if it's feasible).
If you need the original data and are planning on storing all the deleted data in a separate table why not just logically delete the original data rather than physically delete it? i.e.
UPDATE dbo.Artist SET Artist_deleted = 1 WHERE ArtistId>280
Then you only need to add one column to your current table rather than creating new tables and scripts to support them. You could then partition the current table based on the deleted flag if you are worried about disk space/performance etc.
I have a running system where data is inserted periodically into an MS SQL database, and a web application is used to display this data to users.
During the data insert, users should be able to continue using the DB; unfortunately, I can't redesign the whole system right now. Every 2 hours, 40k-80k records are inserted.
Right now the process looks like this:
Temp table is created
Data is inserted into it using plain INSERT statements (parameterized queries or stored procedures should improve the speed).
Data is pumped from temp table to destination table using INSERT INTO MyTable(...) SELECT ... FROM #TempTable
I think that such an approach is very inefficient. I can see that the insert phase can be improved (bulk insert?), but what about transferring data from the temp table to the destination?
This is what we did a few times. Rename your table to TableName_A. Create a view that selects from that table. Create a second table exactly like the first one (TableName_B). Populate it with the data from the first one. Now set up your import process to populate the table that is not being referenced by the view. Then change the view to reference that table instead. Total downtime for users: a few seconds. Then repopulate the first table. It is actually easier if you can truncate and populate the table, because then you don't need that last step, but that may not be possible if your input data is not a complete refresh.
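A sketch of the view swap, with hypothetical column names (Id, Col1) standing in for the real ones. GO is the SSMS/sqlcmd batch separator; CREATE VIEW and ALTER VIEW each have to be the only statement in their batch.

```sql
-- One-time setup: the table the application knows becomes TableName_A
EXEC sp_rename 'dbo.TableName', 'TableName_A';
GO

-- The application now reads through a view with the original name
CREATE VIEW dbo.TableName AS
    SELECT Id, Col1 FROM dbo.TableName_A;
GO

-- Second table with the same columns and data
-- (SELECT INTO does not copy indexes or constraints; add those separately)
SELECT * INTO dbo.TableName_B FROM dbo.TableName_A;
GO

-- ... the import process loads the new rows into TableName_B here ...

-- Flip the view to the freshly loaded table
ALTER VIEW dbo.TableName AS
    SELECT Id, Col1 FROM dbo.TableName_B;
GO
```

On the next import cycle the roles are reversed: load TableName_A, then alter the view back.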
You cannot avoid locking when inserting into the table. Even with BULK INSERT this is not possible.
But clients that want to access this table during the concurrent INSERT operations can do so by changing the transaction isolation level to READ UNCOMMITTED or by executing the SELECT command with the WITH (NOLOCK) table hint.
The INSERT command will still lock the table/rows but the SELECT command will then ignore these locks and also read uncommitted entries.
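For illustration, both variants look like this (dbo.MyTable, Id and Col1 are placeholder names):

```sql
-- Variant 1: applies to every statement that follows in this session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT Id, Col1 FROM dbo.MyTable;

-- Variant 2: a hint on a single table reference
SELECT Id, Col1 FROM dbo.MyTable WITH (NOLOCK);
```

Keep in mind that either variant can return rows that are later rolled back.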
I have a table that has 40 million records.
Which is better (faster): creating the column directly in that table, or creating another table with an identity column and inserting the data from the first one?
If I create an identity column in the table that has 40 million records, is it possible to estimate how long it will take to create?
This kind of depends. Creating an identity column won't take that long (well ok this is relative to the size of the table), assuming you appended it to the end of the table. If you didn't, the server has to create a new table with the identity column at the desired position, export all the rows to the new table, and then change the table name. I am guessing that is what is taking so long.
I'm guessing it's blocked - did you use the GUI or a query window (do you know the SPID it's running under?)
Try these - let us know if they give results and you're not sure what to do:
USE master
SELECT * FROM sysprocesses WHERE blocked <> 0
SELECT * FROM sysprocesses WHERE status = 'runnable' AND spid <> @@SPID
If you used ALTER TABLE [...] ADD ... in a query window, it is pretty fast; in fact, it would have finished long ago. If you used the Management Studio table designer, it is copying the table into a new table, dropping the old one, then renaming the new table to the old name. That will take a while, especially if you did not pre-grow the database and the log to accommodate the extra space needed. Because it is all one single transaction, it would take about another 16 hours to roll back if you stop it now.
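For reference, the query-window form would look something like this. The table and column names are placeholders, and the seed and increment of 1 are assumptions:

```sql
-- Appends an identity column at the end of the existing table
ALTER TABLE dbo.MyBigTable
    ADD RowId INT IDENTITY(1, 1) NOT NULL;
```

Existing rows get their values assigned as part of the statement, so try it on a copy of the table first to get a feel for the duration.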
Isn't it something you'll only have to do once, and therefore isn't really a problem how long it takes? (Assuming it doesn't take days...)
Can you not create a test copy of the database and create the column on that to see how long it takes?
I think a lot depends upon the hardware and which DBMS you are in. In my environment, creating a new table and copying the old data into it would take about 3 or 4 hours. I would expect the addition of an identity column to take around the same amount of time, just based on other experiences. I'm on Oracle with multiple servers on a SAN, so things can run faster than in a single server environment. You may just have to sit back and wait.