I am testing different strategies for a incoming breaking change. The problem is that each experiment would carry some costs in Azure.
The data is huge, and can have some inconsistencies due to many years with fixes and transactions before I even knew the company.
I need to change a column in a table with million of records and dozens of indexes. This will have a big downtime.
ALTER TABLE X ALTER COLUMN A1 decimal(15, 4) --The original column is int
One of the initial ideas (Now I know this is not possible) is to have a secondary replica, do the changes there, and, when changes finish, swap primary with secondary... zero or almost zero downtime. I am referring to a "live", redundant replica, not just a "copy"
EDIT:
Throwing new ideas:
Variations to what have been mentioned in one of the answers: Create a table replica (not the whole DB, just the table), apply a INSERT INTO... SELECT and swap the tables at the end of the process. Or... do the swap early to minimize downtime in trade of a delay during the post-addition of all records from the source
I have tried this, but takes AGES to complete. Also, some null and FK violations make the process to fail after processing for several hours.
"Resuming" could be an option but it makes the process slower with each execution. Without some kind of "Resume", each failure have to be repeated from scratch
An acceptable improvement could be to IGNORE the errors (but create logs, of course) and apply fixes after migration. But afaik, AzureSql (nor SqlServer) doesn't offer an "ignore" option
Drop all indexes, constraints and dependencies to the column that needs to be modified, modify the column and apply all indexes, constraints and dependencies again.
Also tried this one. Some indexes take AGES to complete. But for now seems to be the best bet.
There is a possible variation by applying ROW COMPRESSION before the datatype change, but I think it will not improve the real deal: index re-creation
Create a new column with the target datatype, copy the data from the source column, drop the old column and rename the new one.
This strategy also requires to drop and regenerate indexes, so it will not offer lot of gain (if any) with regards #2.
A friend thought of a variation on this, which is to duplicate the needed indexes ONLINE for the column copy. In the meanwhile, trigger all changes on source column to the column copy.
For any of the mentioned strategies, some gain can be obtained by increasing the processing power. But, anyway, we consider to increase the power with any of the approaches, therefore this is common for all solutions
When you need to update A LOT of rows as a one-time event, maybe it's more effective to use the following migration technique :
create a new target table
use INSERT INTO SELECT to fill the new table with correct / updated values
rename the old and new table
create indexes for the new table
After many tests and backups, we finally used the following aproach:
Create a new column [columnName_NEW] with the desired format change. Allow NULLS
Create a trigger for INSERTS to update the new column with the value in the column to be replaced
Copy the old column value to the new column by batches
This operation is very time consuming. We ran a batch every day in a maintenance window (2h during 4 days). Our batch filled the values taking oldest rows first, we counted on the trigger filling the new ones
Once #3 is complete, don't allow NULLS anymore on the new column, but set a default value to avoid the INSERT trigger to crash
Create all the needed indexes and views on the new column. This is very time consuming but can be done ONLINE
Allow NULLS on the old column
Remove the insert trigger - start downtime now!
Rename the old column to [columnName_OLD], the new to [columnName]. This requires few downtime seconds!
--> You can consider it is finally done!
After some safe time, you can backup the result and remove [columnName_OLD] with all of its dependencies
I selected the other answer, because I think it could be also useful in most situations. This one has more steps but has a very little downtime and is reversible at any step but the last.
Related
Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still do from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some fort of lookup table, from which we get several possibilities for each row, and then select the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (updating a couple of million rows within a couple of billion rows table)
Should I write an UPDATE statement with a join?
Would it be better to DELETE this million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance for Snowflake vs. traditional RDBS behaves quite differently. All your tables persist in S3, and S3 does not let you rewrite only select bytes of an existing object; the entire file object must be uploaded and replaced. So, while in say SQL server where data and indexes are modified in place, creating new pages as necessary, an UPDATE/DELETE in snowflake is a full sequential scan on the table file, creating an immutable copy of the original with applicable rows filtered out (deleted) or modified (update), which then replaces the file just scanned.
So, whether updating 1 row, or 1M rows, at minimum the entirety of the micro-partitions that the modified data exists in will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B. Among other things, it should keep your Time-Travel costs down vs constantly wiping and rewriting tables. Another consideration is that since snowflake is column oriented, a column update in theory should only require operations on the S3 files for that column, whereas an insert/delete would replace all S3 files for all columns, which would lower performance.
I am faced with the dilemma of writing two lines of code that empties my table or 50 lines of code that check for duplicates every time I run my spiders. I am very tempted to do the former but I remember reading somewhere that it was bad practice. I was wondering what other more experienced programmers thought of this dilemma.
i think it depends on how many columns for each time your record is inserted when comparing with the columns you need to update by performing diff check update.
for example:
if your temp table contains 20 columns and you can perform diff check update by just updating 1 column & the table contains a good index for you to find the column to update, then you should go for diff check update.
however, if it required to update as many columns as insertion or it required very complicated condition to find the particular row to update, i think deleting + insert new row is not a bad choice.
also, for deleting the entire table,e you better go for some implemented method instead of delete (for example in sql-server, there is truncate table, which is relatively faster, for sql-lite, you may try VACUUM, but i dont use sqllite often so i cannot sure)
I have to insert one record per tables across 30 tables. The data coming from some other System. I have to insert data in the tables for the first time, then if any update happened, then I need to update tables in the SQL Server. I have two options:
a) I can check timestamp for individual table rows and update if the timestamp is greater.
b) Everytime I can stateway delete records and insert data.
Which one will be faster in SQL Server Database? Is there any other option to address the situatation?
If you are not changing the index fields of the record, the stategy of trying to update first and then insert is usually faster than drop/insert as you don't force the database into updating a bunch of index info.
If using Sql2008+ you should be using the merge command, as it explictly handles the update/insert condition cleanly and clearly
ADDED
I should also add that is you know the usage pattern in rarely update (i.e., 90% insert), you may have a case when drop/insert in faster than update/insert -- depends on lots of details. Regardless, merge is the clear winner if using 2008+
I generally like drop and re-insert. I find it to be cleaner and easier to code. However, if this is happening very frequently and you're worried about concurrency issues, you're probably better off with option 1.
Also, another thing to factor in is how often does the timestamp check fail (where you don't have to insert nor update). If 99% of data is redundant/outdated data, you're probably better off with option 1 regardless.
I have a table that contains over than a million records (products).
Now, daily, I need to either update existing records, and/or add new ones.
Instead of doing it one-by-one (takes couple of hours), I managed to use SqlBulkCopy to work with bunch of records and managed to do my inserts in the matter of seconds, but it can handle only new inserts. So I am thinking about creating a new table that contains new records and old records; and then use that temporary table (on the SQL end) to update/add to the main table.
Any advice how can I perform that update?
One of the better ways to handle this is with the MERGE command in SQL. Mssqltips has a good tutorial on it, it can be a bit trickier to use than some of the other commands.
Also, due to locking you may want to break this up into multiple smaller transactions, unless you know you can tolerate blocking during the update.
We handle this situation in our code in the way you described; we have a temp table, then run an update where the ID in the temp table matches the table to be updated, then run an insert where the ID in the table to be updated is null. We normally do this for updates to library/program settings, though, so it is only run infrequently, on smaller tables. Performance may not be up to par for that many records, or daily runs.
The main "gotcha" I've encountered with this method is that for the update, we did a comparison to make sure at least one of several fields changed before actually running the update. (Our initial reason for this was to avoid overwriting some defaults, which could affect server behavior. Your reason for this might be performance, if your temp table could contain records that haven't actually changed). We encountered a case where we did actually want to update one of the defaults, but our old script didn't catch that. So if you do any comparisons to determine which products you want to update, make sure it is either complete from the start, or document well any fields you don't compare, and why.
In my current project, some tables have a column named "changed", which indicates if the the current line had been changed since the last check. All the insert and update statements includes this column.
Every hour, I run a schedulated task that queries all changed rows, do some stuff with those rows and then sets null to it's "changed" column.
This is potentially a performance issue, since I'm going to do lot's of write and read operations in this column, the index will be constantly being rebuild.
What's the best option for this scenario(rather than not using this kind of mechanism)?
if your table is huge, drop the column and make a dedicated table (with primary key info only) and have triggers insert into this table. you then need to just process this small table, and clear it as you finish rows. you would need to do this for each table you are tracking.
if your tables are small, the column may not be a bad idea, but you may see blocking/locking if you have lots of selects and updates on these tables, and if your scheduled processing loops or is real slow.
if you go with a column, it might be better to have a LastChgDate column, then you just process all rows within a range (you'll need to track the range to process each time), but you won't need to change the LastChgDate to show it is "done". This might be moot if your scheduled process is updating the actual row, but you don't say.
Since the column likely only has two values (null and 1 for changed), an index is probably useless anyway.