I need to execute the following SQL (SQL Server 2008) periodically in a scheduled job. The query plan shows that 53% of the cost is a sort after the data is pulled from the Oracle server, even though I've already ordered the data in the openquery. How can I force the query not to sort before the merge join?
merge target as t
using (select * from openquery(oracle, '
    select * from t1 where UpdateTime > ''....'' order by k1, k2')
) as s on s.k1 = t.k1 and s.k2 = t.k2 -- the clustered PK of "target" is (k1, k2)
when matched then ......
when not matched then ......
Is there something like BULK INSERT's "WITH (ORDER ( { column [ ASC | DESC ] } [ ,...n ] ))" hint? If such an option exists, would it help improve the query plan of the MERGE statement?
If the Oracle table already has a PK on K1, K2, would it be better to just reference oracle.db.owner.tablename directly in the MERGE? (Will SQL Server figure out the index from the Oracle metadata?)
Or is the best I can do to store the Oracle data in a local temp table and create a clustered primary key on K1, K2? I am trying to avoid creating a temp table because the returned openquery data set can sometimes be large.
I think a table is the best way to go because then you can create whatever indexes you need, but there's no reason why it should be temporary; why not create a permanent staging table? A local join using local indexes will probably be much more efficient than a join on the results of a remote query, although the only way to know for sure is to test it and see.
If you're worried about the large number of rows, you can look into only copying over new or changed rows. If the Oracle table already has columns for row creation and update times, that would be quite easy.
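A minimal sketch of that approach, assuming a permanent staging table clustered on the same key as "target" (the staging table name, column list, and types are assumptions):

CREATE TABLE dbo.OracleStaging (
    K1 int NOT NULL,
    K2 int NOT NULL,
    UpdateTime datetime NOT NULL,
    -- ... remaining columns from t1 ...
    CONSTRAINT PK_OracleStaging PRIMARY KEY CLUSTERED (K1, K2)
);

-- Each run: pull only the changed rows across the link, then merge locally.
TRUNCATE TABLE dbo.OracleStaging;

INSERT INTO dbo.OracleStaging (K1, K2, UpdateTime)
SELECT k1, k2, UpdateTime
FROM OPENQUERY(oracle, 'select k1, k2, UpdateTime from t1 where UpdateTime > ''....''');

-- The merge join can now use the local clustered index instead of sorting the remote result.
MERGE target AS t
USING dbo.OracleStaging AS s
    ON s.K1 = t.K1 AND s.K2 = t.K2
WHEN MATCHED THEN
    UPDATE SET t.UpdateTime = s.UpdateTime
WHEN NOT MATCHED THEN
    INSERT (K1, K2, UpdateTime)
    VALUES (s.K1, s.K2, s.UpdateTime);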
Alternatively, you could consider using SSIS instead of a scheduled job. I understand that if you're not already using SSIS you may not want to invest time in learning it, but it's a very powerful tool and it's designed for moving large amounts of data into MSSQL. You would create a package with the following workflow:
Delete existing rows from the staging table (only if you can't populate it incrementally)
Copy the data from Oracle
Execute the MERGE statement
I was facing issues with a MERGE statement over large tables.
The source table for the merge is basically a clone of the target table after some DML has been applied.
For example, in the snippet below PUBLIC.customer is the target and STAGING.customer is the source.
CREATE OR REPLACE TABLE STAGING.customer CLONE PUBLIC.customer;
MERGE INTO STAGING.customer TARGET
USING (SELECT * FROM NEW_CUSTOMER) AS SOURCE
ON TARGET.ID = SOURCE.ID
WHEN MATCHED AND SOURCE.DELETEFLAG = TRUE THEN DELETE
WHEN MATCHED AND TARGET.ROWMODIFIED < SOURCE.ROWMODIFIED THEN UPDATE SET TARGET.AGE = SOURCE.AGE, ...
WHEN NOT MATCHED THEN INSERT (AGE, DELETEFLAG, ID, ...) VALUES (SOURCE.AGE, SOURCE.DELETEFLAG, SOURCE.ID, ...);
Currently, we are simply merging the STAGING.customer back to PUBLIC.customer at the end.
This final merge statement is very costly for some of the large tables.
While looking for a solution to reduce the cost, I discovered the Snowflake "CHANGES" mechanism. As per the documentation:
Currently, at least one of the following must be true before change tracking metadata is recorded for a table:
Change tracking is enabled on the table (using ALTER TABLE … CHANGE_TRACKING = TRUE).
A stream is created for the table (using CREATE STREAM).
Both options add hidden columns to the table which store change tracking metadata. The columns consume a small amount of storage.
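For reference, a minimal sketch of what enabling that looks like (the stream name here is just an example):

-- Either of these causes Snowflake to start recording change tracking metadata:
ALTER TABLE PUBLIC.CUSTOMER SET CHANGE_TRACKING = TRUE;
-- or
CREATE STREAM CUSTOMER_STREAM ON TABLE PUBLIC.CUSTOMER;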
I assumed that the metadata added to the table is equivalent to the result set of a SELECT statement using the "changes" clause, but that doesn't seem to be the case.
INSERT INTO PUBLIC.CUSTOMER (AGE, ...)
SELECT AGE, ...
FROM STAGING.CUSTOMER
CHANGES (INFORMATION => DEFAULT)
AT (TIMESTAMP => 1675772176::timestamp)
WHERE "METADATA$ACTION" = 'INSERT';
The SELECT statement using the "changes" clause is much slower than the MERGE statement I am currently using.
I checked the execution plan and found that Snowflake performs a self-join (of sorts) on the table at two different timestamps.
Should that really be the behaviour, or am I missing something here? I was hoping for better performance, assuming it would scan the table once and simply insert the new records, which should be faster than the MERGE statement.
Also, even if it does a self-join, why does the MERGE query perform better than this? The MERGE query is also doing a join on similar volumes.
I was also hoping to use the same mechanism for deletes/updates on the source table.
I have a local SQL Server DB table with about 5 million records.
I have a Snowflake server with a similar table that is updated daily.
I need to update my local table with the new records that are added on the Snowflake table.
This code works, but it takes about an hour to retrieve about 200,000 records. I insert the records into a local temp table and then insert them into my SQL Server DB.
Is there a faster way to retrieve the records from Snowflake and get them into SQL Server?
SELECT A.*
INTO #Sale2020New
FROM OPENQUERY(SNOW, 'SELECT * FROM "DATA"."PUBLIC"."Sales" WHERE "Sales"."Date" >= ''1/1/2020'' AND "Sales"."Date" <= ''12/31/2020''') A
LEFT JOIN [SnowFlake].[dbo].Sale2020 B
    ON B.PrimaryKey = A.PrimaryKey
WHERE B.PrimaryKey IS NULL;
Does it take an hour just to retrieve the data from Snowflake, or is that the whole process?
To speed up data retrieval from Snowflake, implement clustering on the DATE column in the Snowflake table. This will prune micro-partitions and avoid a full table scan. You can get more information on clustering here.
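A minimal sketch, assuming the table and column names from the question:

-- Define a clustering key on the date column so date-filtered queries
-- can prune micro-partitions instead of scanning the whole table.
ALTER TABLE "DATA"."PUBLIC"."Sales" CLUSTER BY ("Date");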
As for the delta load, instead of a join you can apply a filter on the DATE column for the current date. This avoids a costly join operation and filters the data at the start.
SELECT * FROM "SALES"
where "Sales"."Date" = '2020-04-07'
I have an ETL process (CSV to SQL database) that runs daily, but the data in the source sometimes changes, so I want to have it run again the next day with an updated file.
How do I write a SQL statement to find all the differences?
For example, let's say Table_1 has a composite PRIMARY KEY consisting of FK_1, FK_2 and FK_3.
Do I do this in SQL or in the ETL process?
Edit
I realize now this question is too broad. Disregard.
You can use EXCEPT to find which IDs are missing. For example:
SELECT FK_1, FK_2, FK_3
FROM new_data_table
EXCEPT
SELECT FK_1, FK_2, FK_3
FROM current_data_table;
It will be better (from a performance perspective) to materialize these IDs and then join the resulting table to new_data_table in order to insert all of the columns.
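A minimal sketch of that, reusing the table and column names from the example (#missing_keys is a hypothetical temp table):

-- Materialize the missing keys once...
SELECT FK_1, FK_2, FK_3
INTO #missing_keys
FROM new_data_table
EXCEPT
SELECT FK_1, FK_2, FK_3
FROM current_data_table;

-- ...then join back to the new data to copy over the full rows.
INSERT INTO current_data_table
SELECT n.*
FROM new_data_table n
INNER JOIN #missing_keys k
    ON n.FK_1 = k.FK_1
   AND n.FK_2 = k.FK_2
   AND n.FK_3 = k.FK_3;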
If you need to do this in one query, you can use simple LEFT JOIN. For example:
INSERT INTO current_data_table
SELECT A.*
FROM new_data_table A
LEFT JOIN current_data_table B
ON A.FK_1 = B.FK_1
AND A.FK_2 = B.FK_2
AND A.FK_3 = B.FK_3
WHERE B.[FK_1] IS NULL;
The idea is to get all records in new_data_table for which there is no match in current_data_table (WHERE B.[FK_1] IS NULL).
I have a production database and an archive database in a second SQL Server instance.
When I insert or update (NOT DELETE) data in the production database, I need to insert or update the same data in the archive database.
What is a good way to do that?
If they are in the same db instance, a trigger would be trivial assuming it's not a lot of tables.
If the size of this grows, you'll probably want to look into SQL Server replication. Microsoft has spent a lot of time and money to do it right.
If you are considering using triggers for this, you may want to take into account the load on your production database. If it is a very busy database, consider using a high availability solution such as replication, mirroring, or log shipping. Depending on your needs, any of these could serve you well.
At the same time, you should also consider your "cold" recovery solutions, which will need to be changed in accordance with whatever you implement.
Replication will replicate your deletions as well. However, not applying the deletions to your archive database may cause problems down the line with unique indexes, where a value is valid in the production database but not in the archive database because the value already exists there. If your design means this is not an issue, then a simple trigger on the production table will do this for you:
CREATE TRIGGER TR_MyTable_ToArchive ON MyTable FOR INSERT, UPDATE AS
BEGIN
SET NOCOUNT ON
-- First inserts
SET IDENTITY_INSERT ArchiveDB..MyTable ON -- Only if identity column is used
INSERT INTO ArchiveDB..MyTable(MyTableKey, Col1, Col2, Col3, ...)
SELECT MyTableKey, Col1, Col2, Col3, ...
FROM inserted i LEFT JOIN deleted d ON i.MyTableKey = d.MyTableKey
WHERE d.MyTableKey IS NULL
SET IDENTITY_INSERT ArchiveDB..MyTable OFF -- Only if identity column is used
-- then updates
UPDATE t SET Col1 = i.col1, col2 = i.col2, col3 = i.col3, ...
FROM ArchiveDB..MyTable t INNER JOIN inserted i ON t.MyTableKey = i.MyTableKey
INNER JOIN deleted d ON i.MyTableKey = d.MyTableKey
END
This assumes that your archive database resides on the same server as your production database. If this is not the case, you'll need to create a linked server entry, and then replace ArchiveDB..MyTable with ArchiveServer.ArchiveDB..MyTable, where ArchiveServer is the name of the linked server.
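A minimal sketch of creating that linked server entry (the server name, provider, and host name are placeholders):

EXEC sp_addlinkedserver
    @server = N'ArchiveServer',       -- name used in the four-part reference
    @srvproduct = N'',
    @provider = N'SQLNCLI',           -- SQL Server Native Client
    @datasrc = N'ArchiveHostName';    -- network name of the archive instance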
If there is already a lot of load on your production database, however, bear in mind that this will double it. To avoid that, you can add an update flag field to each of your tables and run a scheduled task at a time when the database load is at a minimum, like 1am. Your trigger would then set the field to I for an insert or U for an update in the production database, and the scheduled task would perform the update or insert in the archive database, depending on the value of this field, and then reset the field to NULL once it has finished.
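A minimal sketch of that flag-field approach (the ArchiveFlag column name and the column list are assumptions):

-- One-off: add the flag column that the trigger will set.
ALTER TABLE MyTable ADD ArchiveFlag char(1) NULL;

-- Scheduled task (e.g. at 1am): apply the flagged changes, then clear the flags.
INSERT INTO ArchiveDB..MyTable (MyTableKey, Col1, Col2, Col3)
SELECT MyTableKey, Col1, Col2, Col3
FROM MyTable
WHERE ArchiveFlag = 'I';

UPDATE t
SET Col1 = s.Col1, Col2 = s.Col2, Col3 = s.Col3
FROM ArchiveDB..MyTable t
INNER JOIN MyTable s ON t.MyTableKey = s.MyTableKey
WHERE s.ArchiveFlag = 'U';

UPDATE MyTable
SET ArchiveFlag = NULL
WHERE ArchiveFlag IS NOT NULL;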
The task is to have SQL Server read an Excel spreadsheet, and import only the new entries into a table. The entity is called Provider.
Consider an Excel spreadsheet like this:
Its target table is like this:
The task is to:
use the 2008 Express toolset
import into an existing table in SQL Server 2000
preserve the existing data in the table! An identity column (with increment) is the PK; it is used as an FK in another table, with references already made
import only the new rows from the spreadsheet!
ignore rows that don't exist in the spreadsheet
Question:
How can I use the SQL 2008 toolset (most likely the Import and Export Wizard) to achieve this goal? I suspect I'll need to "Write a query to specify the data to transfer".
The problem is that I cannot see the query the tool generates in order to make fine adjustments to it.
What I'd probably do is bulk load the excel data into a separate staging table within the database and then run an INSERT on the main table to copy over the records that don't exist.
e.g.
INSERT MyRealTable (ID, FirstName, LastName,.....)
SELECT ID, FirstName, LastName,.....
FROM StagingTable s
LEFT JOIN MyRealTable r ON s.ID = r.ID
WHERE r.ID IS NULL
Then drop the staging table when you're done.
You can also run some updates on that staging table before you load it, to clean it up if you need to. For example:
UPDATE StagingTable
SET Name = RTRIM(LTRIM(Name));