Sorry for the longish description... but here we go...
We have a fact table somewhat flattened with a few properties that you might have put in a dimension in a more "classic" data warehouse.
I expect to have billions of rows in that table.
We want to enrich these properties with some cleansing/grouping that would not change often, but would still do from time to time.
We are thinking of keeping this initial fact table as the "master" that we never update or delete from, and making an "extended fact" table copy of it where we just add the new derived properties.
The process of generating these extended property values requires mapping to some fort of lookup table, from which we get several possibilities for each row, and then select the best one (one per initial row).
This is likely to be processor intensive.
QUESTION (at last!):
Imagine my lookup table is modified and I want to re-assess the extended properties for only a subset of my initial fact table.
I would end up with a few million rows I want to MODIFY in the target extended fact table.
What would be the best way to achieve this update? (updating a couple of million rows within a couple of billion rows table)
Should I write an UPDATE statement with a join?
Would it be better to DELETE this million rows and INSERT the new ones?
Any other way, like creating a new extended fact table with only the appropriate INSERTs?
Thanks
Eric
PS: I come from a SQL Server background where DELETE can be slow
PPS: I still love SQL Server too! :-)
Write performance for Snowflake vs. traditional RDBS behaves quite differently. All your tables persist in S3, and S3 does not let you rewrite only select bytes of an existing object; the entire file object must be uploaded and replaced. So, while in say SQL server where data and indexes are modified in place, creating new pages as necessary, an UPDATE/DELETE in snowflake is a full sequential scan on the table file, creating an immutable copy of the original with applicable rows filtered out (deleted) or modified (update), which then replaces the file just scanned.
So, whether updating 1 row, or 1M rows, at minimum the entirety of the micro-partitions that the modified data exists in will have to be rewritten.
I would take a look at the MERGE command, which allows you to insert, update, and delete all in one command (effectively applying the differential from table A into table B. Among other things, it should keep your Time-Travel costs down vs constantly wiping and rewriting tables. Another consideration is that since snowflake is column oriented, a column update in theory should only require operations on the S3 files for that column, whereas an insert/delete would replace all S3 files for all columns, which would lower performance.
Related
I am testing different strategies for a incoming breaking change. The problem is that each experiment would carry some costs in Azure.
The data is huge, and can have some inconsistencies due to many years with fixes and transactions before I even knew the company.
I need to change a column in a table with million of records and dozens of indexes. This will have a big downtime.
ALTER TABLE X ALTER COLUMN A1 decimal(15, 4) --The original column is int
One of the initial ideas (Now I know this is not possible) is to have a secondary replica, do the changes there, and, when changes finish, swap primary with secondary... zero or almost zero downtime. I am referring to a "live", redundant replica, not just a "copy"
EDIT:
Throwing new ideas:
Variations to what have been mentioned in one of the answers: Create a table replica (not the whole DB, just the table), apply a INSERT INTO... SELECT and swap the tables at the end of the process. Or... do the swap early to minimize downtime in trade of a delay during the post-addition of all records from the source
I have tried this, but takes AGES to complete. Also, some null and FK violations make the process to fail after processing for several hours.
"Resuming" could be an option but it makes the process slower with each execution. Without some kind of "Resume", each failure have to be repeated from scratch
An acceptable improvement could be to IGNORE the errors (but create logs, of course) and apply fixes after migration. But afaik, AzureSql (nor SqlServer) doesn't offer an "ignore" option
Drop all indexes, constraints and dependencies to the column that needs to be modified, modify the column and apply all indexes, constraints and dependencies again.
Also tried this one. Some indexes take AGES to complete. But for now seems to be the best bet.
There is a possible variation by applying ROW COMPRESSION before the datatype change, but I think it will not improve the real deal: index re-creation
Create a new column with the target datatype, copy the data from the source column, drop the old column and rename the new one.
This strategy also requires to drop and regenerate indexes, so it will not offer lot of gain (if any) with regards #2.
A friend thought of a variation on this, which is to duplicate the needed indexes ONLINE for the column copy. In the meanwhile, trigger all changes on source column to the column copy.
For any of the mentioned strategies, some gain can be obtained by increasing the processing power. But, anyway, we consider to increase the power with any of the approaches, therefore this is common for all solutions
When you need to update A LOT of rows as a one-time event, maybe it's more effective to use the following migration technique :
create a new target table
use INSERT INTO SELECT to fill the new table with correct / updated values
rename the old and new table
create indexes for the new table
After many tests and backups, we finally used the following aproach:
Create a new column [columnName_NEW] with the desired format change. Allow NULLS
Create a trigger for INSERTS to update the new column with the value in the column to be replaced
Copy the old column value to the new column by batches
This operation is very time consuming. We ran a batch every day in a maintenance window (2h during 4 days). Our batch filled the values taking oldest rows first, we counted on the trigger filling the new ones
Once #3 is complete, don't allow NULLS anymore on the new column, but set a default value to avoid the INSERT trigger to crash
Create all the needed indexes and views on the new column. This is very time consuming but can be done ONLINE
Allow NULLS on the old column
Remove the insert trigger - start downtime now!
Rename the old column to [columnName_OLD], the new to [columnName]. This requires few downtime seconds!
--> You can consider it is finally done!
After some safe time, you can backup the result and remove [columnName_OLD] with all of its dependencies
I selected the other answer, because I think it could be also useful in most situations. This one has more steps but has a very little downtime and is reversible at any step but the last.
I just started my new job and after looking at the DBM I was shocked. Its a huge mess.
Now first thing I wanted to do is get some consistency in the order of table columns. We publish new database versions over a .dacpac. My co-worker told me that changing the order of a column would force MSSQL to create a temporary table which stores all the data. MSSQL then creates a new table and inserts all the data into that table.
So lets say my server only runs 2GB of RAM and has 500MB storage left on the harddrive. The whole database weights 20GB. Is it possible that changing the order of columns will cause trouble (memory related)? Is the statement of my co-worker correct?
I couldnt find any good source for my question.
Thanks!
You should NOT "go one table by one".
You should leave your tables as they are, if you don't like the order of columns of some table just create a view reordering your columns as you want.
Not only changing order of columns will cause your tables to be recreated, all the indexes will be recreated, you'll get problems with FK constraints.
And after all, you'll gain absolutely nothig but do damage only. You'll waste server resources, make your tables temporarily inaccessible and the columns will not be stored as you defind anyway, internally they will be stored in "var-fix" format (divided into fixed-length and variable-length)
I have a database table which have more than 1 million records uniquely identified by a GUID column. I want to find out which of these record or rows was selected or retrieved in the last 5 years. The select query can happen from multiple places. Sometimes the row will be returned as a single row. Sometimes it will be part of a set of rows. there is select query that does the fetching from a jdbc connection from a java code. Also a SQL procedure also fetches data from the table.
My intention is to clean up a database table.I want to delete all rows which was never used( retrieved via select query) in last 5 years.
Does oracle DB have any inbuild meta data which can give me this information.
My alternative solution was to add a column LAST_ACCESSED and update this column whenever I select a row from this table. But this operation is a costly operation for me based on time taken for the whole process. Atleast 1000 - 10000 records will be selected from the table for a single operation. Is there any efficient way to do this rather than updating table after reading it. Mine is a multi threaded application. so update such large data set may result in deadlocks or large waiting period for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Careful, the feature is currently to be licensed under the Advanced Compression Option or In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. It does not track select operations per individual row, neither per individual block level because the overhead would be too heavy (data is generally often and concurrently read, having to keep a counter for each row would quickly become a very costly operation). However, if you have you data partitioned by date, e.g. create a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Also Partitioning is an option that needs to be licensed.
Once you have reached that conclusion you can then either use In-Database Archiving to mark rows as archived or just go ahead and purge the rows. If you happen to have the data partitioned you can do easy DROP PARTITION operations to purge one or many partitions rather than having to do conventional DELETE statements.
I couldn't use any inbuild solutions. i tried below solutions
1)DB audit feature for select statements.
2)adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded. Audit uses up a lot of space and have performance hit. Similary trigger also had performance hit.
Finally i resolved the issue by maintaining a separate table were entries older than 5 years that are still used or selected in a query are inserted. While deleting I cross check this table and avoid deleting entries present in this table.
I am looking for much more better way to update tables using SSIS. Specifically, i wanted to optimize the updates on tables (around 10 tables uses same logic).
The logic is,
Select the source data from staging then inserts into physical temp table in the DW (i.e TMP_Tbl)
Update all data matching by customerId column from TMP_Tbl to MyTbl.
Inserts all non-existing customerId column from TMP_Tbl1 to MyTbl.
Using the above steps, this takes some time populating TMP_Tbl. Hence, i planned to change the logic to delete-insert but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? this would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contains 5M rows, some contains 2k rows
each table update took up to 2-3 minutes, which took for about (15 to 20 minutes) all in all
these updates we're in separate sequence container simultaneously runs
Anyone knows what's the best way to use, seems like using physical temp table needs to be remove, is this normal?
With SSIS you usually BULK INSERT, not INSERT. So if you do not mind DELETE - reinserting the rows should in general outperform UPDATE.
Considering this the faster approach will be:
[Execute SQL Task] Delete all records which you need to update. (Depending on your DB design and queries, some index may help here).
[Data Flow Task] Fast load (using OLE DB Destination, Data access mode: Table of fiew - fast load) both updated and new records from source into MyTbl. No need for temp tables here.
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help). 2-3 minutes per every record updated is way too long.
If it is 2-3 minutes for updating millions of records though - then it's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE to seek instead of scanning a big table - it is usually well worth it.
I have reports that perform some time consuming data calculations for each user in my database, and the result is 10 to 20 calculated new records for each user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method it currently uses is it deletes all previous records for a given user, then inserts the new set. There is no PK on the table, only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is I cannot do this in batch because the way the script works, it's examining one user at a time, so it looks at a user, deletes the previous 10-20 records, and inserts a new set of 10-20 records. It does this over and over. I am worried that the transaction log will run out of space or other performance issues could occur. I would like to configure the table to now worry about data preservation or other items that could slow it down. I cannot drop the indexes and all that because people are accessing the table concurrently to it being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to be able to locate the affected rows in the first place, and without appropriate indexes it will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
This is called Bulk Insert, you have to drop all indexes in destination table and send insert commands in large packs (hundreds of insert statements) separated by ;
Another way is to use BULK INSERT statement http://msdn.microsoft.com/en-us/library/ms188365.aspx
but it involves dumping data to file.
See also: Bulk Insert Sql Server millions of record
It really depends upon many things
speed of your machine
size of the records being processed
network speed
etc.
Generally it is quicker to add records to a "heap" or an un-indexed table. So dropping all of your indexes and re-creating them after the load may improve your performance.
Partitioning the table may see performance benefits if you partition by active and inactive users (although the data set may be a little small for this)
Ensure you test how long each tweak adds or reduces your load and work from there.