Best method to get today's data through a view in Snowflake

My warehouse details:
warehouse - XS
reading data from external tables on S3 into Snowflake
Refresh structure: SNS
I have the S3 folder structure below:
S3://eveningdtaa/2022-06-07/files -- contains Parquet files
S3://eveningdtaa/2022-06-08/files -- contains Parquet files
S3://eveningdtaa/2022-06-09/files -- contains Parquet files
I am using external tables in Snowflake to read this data.
So tables - hold the historical data
views - hold the daily data
My view definition is as below:
create view result_view as (
  select * from table1 where date_part = (select max(date_part) from table1)
);
My question: our daily views are running slowly even though they only have 70k rows. Is there a way to rewrite my view to pick only the latest data instead of computing the max date, or to make this view run faster through something like an index?
Thanks,
Xi

It may be rewritten using QUALIFY:
create view result_view
as
select *
from table1
qualify date_part=max(date_part) over();
It is also worth adding a partition on the date column: see Partitioning Parameters in the external table documentation.
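For example, a partitioned external table definition might look like the following. This is a minimal sketch, not your actual DDL: the stage name @evening_stage and the position of the date folder within metadata$filename are assumptions, so adjust them to your bucket layout.

create or replace external table table1 (
  -- assumes the stage points at the bucket root, so the first path segment is the date folder
  date_part date as to_date(split_part(metadata$filename, '/', 1), 'YYYY-MM-DD')
)
partition by (date_part)
with location = @evening_stage   -- placeholder stage name
file_format = (type = parquet)
auto_refresh = true;

With the partition column defined this way, a filter such as date_part = (select max(date_part) from table1), or the QUALIFY version above, only scans the files under the latest date folder instead of the whole bucket.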

Related

Updating local SQL Server database from cloud Snowflake server: fastest way to get new data

I have a local SQL Server DB table with about 5 million records.
I have a Snowflake server with a similar table that is updated daily.
I need to update my local table with the new records that are added to the Snowflake table.
This code works, but it takes about an hour to retrieve about 200,000 records. I insert the records into a local temp table and then insert them into my SQL Server DB.
Is there a faster way to retrieve the records from Snowflake and get them into SQL Server?
TIA
JohnB
SELECT A.*
INTO #Sale2020New
FROM OPENQUERY(SNOW, 'SELECT * FROM "DATA"."PUBLIC"."Sales" where "Sales"."Date" >= ''1/1/2020'' and "Sales"."Date" <= ''12/31/2020'' ') A
LEFT JOIN [SnowFlake].[dbo].Sale2020 B
    ON B.PrimaryKey = A.PrimaryKey
WHERE B.PrimaryKey IS NULL;
Does it take 1 hour just retrieving data from Snowflake or the whole process?
To speed up data retrieval from Snowflake, implement clustering on the DATE column in the Snowflake table. This prunes micro-partitions and avoids a full table scan. You can get more information on clustering here.
As for the delta load, instead of a join you can filter the DATE column on the current date; this avoids a costly join operation and filters the data at the start.
SELECT * FROM "SALES"
where "Sales"."Date" = '2020-04-07'

Import data from Oracle to SQL server using SSIS

We want to import data from Oracle to SQL Server using SSIS.
I was able to transfer data from Oracle to one table (Staging) in SQL Server. Then I need to transform the data, and I found that I need to run a stored procedure to transform the data from Staging to the actual production data. But I wonder how we can do it.
EDIT #1
The source table has four columns, with one field containing a date, but its datatype is string.
The destination table also has four columns, but two columns will not be stored as-is; there is a mapping between the source columns and the destination columns.
This mapping is stored in two tables, one per mapped column. For example, one table stores SourceFeatureID, DestincationFeatureID and the second table stores SourcePID, DestincationPID.
Data is updated periodically, so we need to check when the destination data was last updated and get the remaining rows where SourceDate > LastUpdated_destination_date.
Update 1: Components that you can use to achieve your goal within a Data Flow Task
Source and Destination
OLEDB Source: Read from the staging table; you can use a SQL command to return only data with SourceDate > destination date:
SELECT * FROM StagingTable T1 WHERE CAST(SourceDate AS DATETIME) > (SELECT MAX(DestDate) FROM DestinationTable)
OLEDB Destination: Insert data into the production database
Join with other tables
Lookup transformation: The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. You use the lookup to access additional information in a related table that is based on values in common columns.
Merge Join: The Merge Join transformation provides an output that is generated by joining two sorted datasets using a FULL, LEFT, or INNER join
Convert column data types
Data Conversion transformation: The Data Conversion transformation converts the data in an input column to a different data type and then copies it to a new output column
Derived Column transformation: The Derived Column transformation creates new column values by applying expressions to transformation input columns. An expression can contain any combination of variables, functions, operators, and columns from the transformation input. The result can be added as a new column or inserted into an existing column as a replacement value. The Derived Column transformation can define multiple derived columns, and any variable or input columns can appear in multiple expressions.
References
Lookup Transformation
Merge Join Transformation
Data Conversion Transformation
Derived Column Transformation
Initial answer
I found that I need to run stored procedure to transform the data from Staging to Actual production data
This is not true: you can perform the data transfer using a Data Flow Task.
There are many links where you can find detailed solutions:
SSIS. How to copy data of one table into different tables?
Create a Project and Basic Package with SSIS
Fill SQL database from a CSV File (even though the source is a CSV, it is very helpful)
Executing stored procedure using SSIS
Anyway, to execute a stored procedure from SSIS you can use an Execute SQL Task.
Additional information:
Execute SQL Task
How to Execute Stored Procedure in SSIS Execute SQL Task in SSIS
I'm not going to go through your comments. I'm just going to post an example of loading StagingTable into TargetTable, with an example of a date conversion, and an example of using a mapping table.
This code creates the stored proc
CREATE PROC MyProc
AS
BEGIN
    -- First delete any data that exists
    -- in the target table that is already there
    DELETE TargetTable
    WHERE EXISTS (
        SELECT * FROM StagingTable
        WHERE StagingTable.SomeKeyColumn = TargetTable.SomeKeyColumn
    );

    -- Insert some data into the target table
    INSERT INTO TargetTable (Col1, Col2, Col3)
    -- This is the data we are inserting
    SELECT
        ST.SourceCol1, -- This goes into Col1
        -- This column is converted to a date then loaded into Col2
        TRY_CONVERT(DATE, ST.SourceCol2, 112),
        -- This is a column that has been mapped from another table
        -- that will be loaded into Col3
        MT.MappedColumn
    FROM StagingTable ST
    -- This might need to be an outer join. Who knows
    INNER JOIN MappingTable MT
        -- We are mapping using a column called MapCol
        ON ST.MapCol = MT.MapCol
END
This code runs the stored proc that you just created. Put this into an Execute SQL Task after your data flow in the SSIS package:
EXEC MyProc
With regards to date conversion, see here for the numbered styles:
CAST and CONVERT (Transact-SQL)
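As a quick illustration of the style numbers (the input values here are made up for the example): style 112 parses ISO basic yyyymmdd strings, and TRY_CONVERT returns NULL instead of raising an error when the string does not parse:

SELECT TRY_CONVERT(DATE, '20220607', 112);   -- returns 2022-06-07
SELECT TRY_CONVERT(DATE, 'not a date', 112); -- returns NULL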

is "INSERT INTO SELECT" free from race conditions in redshift

We have a data warehouse system in which we need to load data present on S3 in CSV format into Redshift tables. The only constraint is that only unique records be inserted into Redshift.
To implement this we are using a staging table in the following manner:
CREATE A TEMPORARY TABLE.
COPY THE S3 FILE INTO THE TEMPORARY TABLE.
BEGIN TRANSACTION
INSERT INTO {main redshift table} select from {join between staging table and main redshift table on a column which should be unique for a record to be unique}
END TRANSACTION
The join used in the select subquery returns those records that are present in the staging table but not in the main Redshift table.
Is the above mechanism free from race conditions?
For example, consider:
The main Redshift table has no rows and an S3 file contains two records.
When the same S3 file is loaded by two different processes/requests, the select query for each request reads the main Redshift table as empty, the join returns both rows present in the staging table, and the two rows are inserted twice, resulting in duplicate rows.
Move the processed file to a different S3 location, i.e.:
1. Suppose your app is pushing the file to destination s1.
2. Move the file from s1 to your staging location s2 (from this place you populate the Redshift temporary table).
3. Move the file from s2 to s3.
4. Now do:
BEGIN TRANSACTION
INSERT INTO {main redshift table} select from {join between staging table and main redshift table on a column which should be unique for a record to be unique}
END TRANSACTION
Sounds like a potential phantom read scenario. You could avoid this by setting the highest transaction isolation level, SERIALIZABLE.
But that can potentially be quite expensive and lead to deadlocks, so perhaps you would prefer changing your loading pipeline to execute load tasks one by one rather than having multiple load tasks execute on one table in parallel.
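A minimal sketch of serializing the load step with an explicit table lock; the names main_table, staging_table, and unique_col are placeholders, not from the original question:

BEGIN;
LOCK main_table;  -- blocks concurrent loaders until this transaction commits
INSERT INTO main_table
SELECT s.*
FROM staging_table s
LEFT JOIN main_table m ON m.unique_col = s.unique_col
WHERE m.unique_col IS NULL;
END;

The lock forces concurrent load transactions to run one after another, so the second loader sees the rows the first one inserted and its anti-join returns nothing.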

SQL Update master table with new table data hourly based on no match on Composite PK

Using SQL Server 2008
I have an SSIS task that downloads a CSV file from FTP and renames the file every hour. After that I'm doing a bulk insert of the data into a new table called NEWFTPDATA.
The data in this file is for the current day up to the current hour. The table has a composite primary key consisting of 4 different columns.
The next step I need to complete, using T-SQL, is to compare this new table to my existing master archive table and insert any rows that do not already exist, based on matching (or rather not matching) on those 4 columns.
Since I'll be downloading this file hourly (for real-time reporting), each subsequent run will contain duplicate data which I will not want to insert into the master table.
I've found ways to do this based on the existence of one particular column, but I can't figure out how to do it based on 4 columns needing to match.
The workflow should be as follows
Update MASTERTABLE from NEWFTPDATA where newftpdata.column1, newftpdata.column2, newftpdata.column3, newftpdata.column4 do not exist in MASTERTABLE
Hopefully I've supplied substantial information for this question. If any further details are required please let me know. Thank you.
You can use MERGE:
MERGE MasterTable AS dest
USING newftpdata AS src
    ON dest.column1 = src.column1
    AND dest.column2 = src.column2
    AND dest.column3 = src.column3
    AND dest.column4 = src.column4
WHEN NOT MATCHED THEN
    INSERT (column1, column2, ...)
    VALUES (src.column1, src.column2, ...);
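If you prefer a plain INSERT over MERGE, a NOT EXISTS anti-join on all four key columns does the same thing. This is a sketch using the generic column1..column4 names from the question; list the remaining columns explicitly in the INSERT and SELECT:

INSERT INTO MasterTable (column1, column2, column3, column4)
SELECT n.column1, n.column2, n.column3, n.column4
FROM newftpdata n
WHERE NOT EXISTS (
    SELECT 1
    FROM MasterTable m
    WHERE m.column1 = n.column1
      AND m.column2 = n.column2
      AND m.column3 = n.column3
      AND m.column4 = n.column4
);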

Order hint for OPENQUERY?

I need to execute the following SQL (SQL Server 2008) in a scheduled job periodically. The query plan shows that 53% of the cost is a sort after the data is pulled from the Oracle server. However, I've already ordered the data in the OPENQUERY. How can I force the query not to sort when merge joining?
merge target as t
using (select * from openquery(oracle, '
select * from t1 where UpdateTime > ''....'' order by k1, k2')
) as s on s.k1=t.k1 and s.k2=t.K2 -- the clustered PK of "target" is K1,k2
when matched then ......
when not matched then ......
Is there something like BULK INSERT's "with (order( { column [ ASC | DESC ] } [ ,...n ] ))" hint? Would it help improve the query plan of the MERGE statement if something similar existed?
If the Oracle table already has a PK on (K1, K2), would just using oracle.db.owner.tablename (four-part naming) be better? (Will SQL Server figure out the index from the Oracle metadata?)
Or is the best I can do to store the Oracle data in a local temp table and create a clustered primary key on (K1, K2)? I am trying to avoid creating a temp table because sometimes the returned OPENQUERY data set can be large.
I think a table is the best way to go because then you can create whatever indexes you need, but there's no reason why it should be temporary; why not create a permanent staging table? A local join using local indexes will probably be much more efficient than a join on the results of a remote query, although the only way to know for sure is to test it and see.
If you're worried about the large number of rows, you can look into only copying over new or changed rows. If the Oracle table already has columns for row creation and update times, that would be quite easy.
Alternatively, you could consider using SSIS instead of a scheduled job. I understand that if you're not already using SSIS you may not want to invest time in learning it, but it's a very powerful tool and it's designed for moving large amounts of data into MSSQL. You would create a package with the following workflow:
Delete existing rows from the staging table (only if you can't populate it incrementally)
Copy the data from Oracle
Execute the MERGE statement
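A rough sketch of that workflow in T-SQL; the staging table name dbo.OracleStage and its column types are assumptions, and the ''....'' filter and the matched/not-matched actions stay elided as in the original query:

-- One-time setup: permanent staging table keyed the same way as the target
CREATE TABLE dbo.OracleStage (
    K1 INT NOT NULL,          -- placeholder types, match the Oracle columns
    K2 INT NOT NULL,
    UpdateTime DATETIME NOT NULL,
    -- remaining columns from t1 go here
    CONSTRAINT PK_OracleStage PRIMARY KEY CLUSTERED (K1, K2)
);

-- Each run: reload the staging table, then merge locally on the clustered keys
TRUNCATE TABLE dbo.OracleStage;

INSERT INTO dbo.OracleStage (K1, K2, UpdateTime /* , ... */)
SELECT K1, K2, UpdateTime /* , ... */
FROM OPENQUERY(oracle, 'select * from t1 where UpdateTime > ''....''');

MERGE target AS t
USING dbo.OracleStage AS s
    ON s.K1 = t.K1 AND s.K2 = t.K2
WHEN MATCHED THEN
    UPDATE SET t.UpdateTime = s.UpdateTime /* , ... */
WHEN NOT MATCHED THEN
    INSERT (K1, K2, UpdateTime /* , ... */)
    VALUES (s.K1, s.K2, s.UpdateTime /* , ... */);

Because both sides of the MERGE now have a clustered index on (K1, K2), the optimizer should be able to pick a merge join without the expensive sort step.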
