is "INSERT INTO SELECT" free from race conditions in redshift - database

We have a data warehouse system in which we need to load data that is present on S3 in CSV format into Redshift tables. The only constraint is that only unique records may be inserted into Redshift.
To implement this we are using a staging table in the following manner:
Create a temporary table.
COPY the S3 file into the temporary table.
BEGIN TRANSACTION
INSERT INTO {main Redshift table} SELECT ... FROM {join between the staging table and the main Redshift table on a column that should be unique for a record to be unique}
END TRANSACTION
The join used in the SELECT subquery returns those records that are present in the staging table but not in the main Redshift table.
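For concreteness, the insert inside that transaction is roughly of this shape (table and column names here are placeholders; unique_col stands for the column assumed to identify a unique record):
INSERT INTO main_table
SELECT s.*
FROM staging_table s
LEFT JOIN main_table m ON s.unique_col = m.unique_col
WHERE m.unique_col IS NULL;   -- keep only rows not already present in the main table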
Is the above mechanism free from race conditions?
For example, consider this:
The main Redshift table has no rows and an S3 file contains two records.
When the same S3 file is loaded by two different processes/requests, the SELECT query for each request reads the main Redshift table as empty, the join returns both rows present in the staging table, and the two rows are inserted twice, resulting in duplicate rows.

Move the processed file to a different S3 location,
i.e.:
1. Suppose your app is pushing files to destination s1.
2. Move the file from s1 to your staging location s2 (from this place you populate the Redshift temporary table).
3. Move the file from s2 to s3.
4. Now do:
BEGIN TRANSACTION
INSERT INTO {main Redshift table} SELECT ... FROM {join between the staging table and the main Redshift table on a column that should be unique for a record to be unique}
END TRANSACTION

Sounds like a potential phantom read scenario. You could avoid this by setting the highest transaction isolation level, SERIALIZABLE.
But that can potentially be quite expensive and lead to deadlocks, so perhaps you would prefer changing your loading pipeline to execute load tasks one by one rather than having multiple load tasks execute on one table in parallel.
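If the loads must stay concurrent, one way to force the check-and-insert steps to run one at a time inside the database (a sketch, not part of the original answer; table and column names are placeholders) is Redshift's explicit LOCK command, which makes other sessions wait until the locking transaction commits:
BEGIN;
LOCK main_table;   -- a second loader waits here until the first one commits
INSERT INTO main_table
SELECT s.*
FROM staging_table s
LEFT JOIN main_table m ON s.unique_col = m.unique_col
WHERE m.unique_col IS NULL;
END;
With the lock in place, the second loader's join only runs after the first loader's rows are committed and visible, so the duplicates from the example above cannot occur.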

Related

Snowflake CHANGES | Why does it need to perform a self join? Why is it slower than join using other unique column?

I was facing issues with a MERGE statement over large tables.
The source table for the merge is basically a clone of the target table after applying some DML.
e.g. In the below example PUBLIC.customer is target and STAGING.customer is the source.
CREATE OR REPLACE TABLE STAGING.customer CLONE PUBLIC.customer;
MERGE INTO STAGING.customer TARGET USING (SELECT * FROM NEW_CUSTOMER) AS SOURCE ON TARGET.ID = SOURCE.ID
WHEN MATCHED AND SOURCE.DELETEFLAG=TRUE THEN DELETE
WHEN MATCHED AND TARGET.ROWMODIFIED < SOURCE.ROWMODIFIED THEN UPDATE SET TARGET.AGE = SOURCE.AGE, ...
WHEN NOT MATCHED THEN INSERT (AGE, DELETEFLAG, ID, ...) VALUES (SOURCE.AGE, SOURCE.DELETEFLAG, SOURCE.ID, ...);
Currently, we are simply merging the STAGING.customer back to PUBLIC.customer at the end.
This final merge statement is very costly for some of the large tables.
While looking for a solution to reduce the cost, I discovered Snowflake's "CHANGES" mechanism. As per the documentation:
Currently, at least one of the following must be true before change tracking metadata is recorded for a table:
Change tracking is enabled on the table (using ALTER TABLE … CHANGE_TRACKING = TRUE).
A stream is created for the table (using CREATE STREAM).
Both options add hidden columns to the table which store change tracking metadata. The columns consume a small amount of storage.
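For reference, a minimal sketch of the two options the documentation describes, applied to the table in the question (the stream name is made up):
-- Option 1: enable change tracking on the table directly
ALTER TABLE PUBLIC.customer SET CHANGE_TRACKING = TRUE;
-- Option 2: create a stream, which also enables change tracking on the table
CREATE OR REPLACE STREAM customer_stream ON TABLE PUBLIC.customer;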
I assumed that the metadata added to the table is equivalent to the result set of a SELECT statement using the "changes" clause, which doesn't seem to be the case.
INSERT INTO PUBLIC.CUSTOMER(AGE,...) (SELECT AGE,... FROM STAGING.CUSTOMER CHANGES (information => default) at(timestamp => 1675772176::timestamp) where "METADATA$ACTION" = 'INSERT' );
The SELECT statement using the "changes" clause is way slower than the MERGE statement I am using currently.
I checked the execution plan and found that Snowflake performs a self-join (of sorts) on the table at two different timestamps.
Should this really be the behaviour, or am I missing something here? I was hoping for better performance, assuming Snowflake would scan the table once and then simply insert the new records, which should be faster than the MERGE statement.
Also, even if it does a self-join, why does the MERGE query perform better than this? The MERGE query is also doing a join on similar volumes.
I was also hoping to use the same mechanism for deletes/updates on the source table.

Best method to get today's data through a view in Snowflake

My warehouse details:
Warehouse: XS
Reading data from S3 into Snowflake via external tables
Refresh structure: SNS
I have the S3 folder structure as below
S3://eveningdtaa/2022-06-07/files -- contains parquet format
S3://eveningdtaa/2022-06-08/files -- contains parquet format
S3://eveningdtaa/2022-06-09/files -- contains parquet format
I am using external tables to read this data in Snowflake.
So: tables - hold the historical information
views - hold the daily data
My view definition is as below:
create view result_view as (
select * from table1 where date_part=(select max(date_part) from table1)
)
My question: our daily views are running slow even though the table has only 70k rows. Is there a way to rewrite my view to pick only the latest data instead of taking the max of the date, or to make this view run faster through some indexes?
Thanks,
Xi
It may be rewritten using QUALIFY:
create view result_view
as
select *
from table1
qualify date_part=max(date_part) over();
It is also worth adding partitioning on the date column: Partitioning Parameters
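A rough sketch of what a partitioned external table could look like for the folder layout above (the stage name, the example column, and the position of the date inside METADATA$FILENAME are assumptions):
CREATE OR REPLACE EXTERNAL TABLE table1 (
    date_part DATE AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 1), 'YYYY-MM-DD'),
    customer_id NUMBER AS (VALUE:customer_id::NUMBER)   -- example column, adjust to your schema
)
PARTITION BY (date_part)
LOCATION = @my_s3_stage
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = PARQUET);
Declaring date_part as a partition column lets Snowflake prune the files it has to read when the view filters on that column.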

Optimizer stats on a busy table with large inserts and deletes

Environment: Oracle database 19C
The table in question has a few number data type columns and one column of CLOB data type. The table is properly indexed and there is a nightly gather stats job as well.
Below are the operations on the table:
A PL/SQL batch procedure inserts 4 to 5 million records from a flat file presented as an external table
After the insert operation, another batch process reads the rows and updates some of the columns
A daily purge process deletes rows that are no longer needed
My question is: should gather stats be triggered immediately after the insert and/or delete operations on the table?
Per this Oracle doc, Online Statistics Gathering for Bulk Loads, bulk loads gather online statistics automatically only when the object is empty. My process will not benefit from it as the table is not empty when I load data.
But online statistics gathering works for INSERT INTO ... SELECT operations on empty segments using direct path, so next I am going to try the APPEND hint. Any thoughts?
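For reference, the direct-path load being considered would look roughly like this (table names are placeholders); as the answer below notes, online statistics gathering still only kicks in when the segment is empty:
INSERT /*+ APPEND */ INTO batch_table      -- direct-path insert into the (non-empty) target
SELECT * FROM flat_file_ext_table;         -- external table over the flat file
COMMIT;                                    -- required before the table can be queried again in this session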
Before Oracle 12c, it was best practice to gather statistics immediately after a bulk load. However, according to Oracle's SQL Tuning Guide, many applications failed to do so, so Oracle automated this for certain operations.
I would recommend to have a look at the dictionary views DBA_TAB_STATISTICS, DBA_IND_STATISTICS and DBA_TAB_MODIFICATIONS and see how your table behaves:
CREATE TABLE t AS SELECT * FROM all_objects;
CREATE INDEX i ON t(object_name);
SELECT table_name, num_rows, stale_stats
FROM DBA_TAB_STATISTICS WHERE table_name='T'
UNION ALL
SELECT index_name, num_rows, stale_stats
FROM DBA_IND_STATISTICS WHERE table_name='T';
TABLE_NAME NUM_ROWS STALE_STATS
T 67135 NO
I 67135 NO
If you insert data, the statistics are marked as stale:
INSERT INTO t SELECT * FROM all_objects;
TABLE_NAME NUM_ROWS STALE_STATS
T 67138 YES
I 67138 YES
SELECT inserts, updates, deletes
FROM DBA_TAB_MODIFICATIONS
WHERE table_name='T';
INSERTS UPDATES DELETES
67140 0 0
Likewise for updates and deletes:
UPDATE t SET object_id = - object_id WHERE object_type='TABLE';
4,449 rows updated.
DELETE FROM t WHERE object_type = 'SYNONYM';
23,120 rows deleted.
INSERTS UPDATES DELETES
67140 4449 23120
When you gather statistics, STALE_STATS becomes 'NO' again, and DBA_TAB_MODIFICATIONS goes back to zero (or an empty row):
EXEC DBMS_STATS.GATHER_TABLE_STATS(NULL, 'T');
TABLE_NAME NUM_ROWS STALE_STATS
T 111158 YES
I 111158 YES
Please note that INSERT /*+ APPEND */ gathers statistics only if the table (or partition) is empty. The restriction is documented here.
So I would recommend that in your code, after the inserts, updates and deletes are done, you check whether the table(s) appear in USER_TAB_MODIFICATIONS. If the statistics are stale, I'd gather statistics.
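A minimal sketch of that check (the table name MY_TABLE is a placeholder):
DECLARE
  v_stale user_tab_statistics.stale_stats%TYPE;
BEGIN
  -- flush recent DML counters so the staleness flag reflects the batch run
  DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;
  SELECT stale_stats
  INTO   v_stale
  FROM   user_tab_statistics
  WHERE  table_name = 'MY_TABLE'
  AND    object_type = 'TABLE';
  IF v_stale = 'YES' THEN
    DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MY_TABLE');
  END IF;
END;
/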
I would also look into partitioning. Check if you can insert, update and gather stats in a fresh new partition, which would be a bit faster. And check if you can purge your data by dropping a whole partition, which would be a lot faster.

Import data from Oracle to SQL server using SSIS

We want to import data from Oracle to SQL Server using SSIS.
I was able to transfer data from Oracle to one table (Staging) in SQL Server. Then I need to transform the data, and I found that I need to run a stored procedure to transform the data from Staging to the actual production data. But I wonder how we can do it.
EDIT #1
The source table has four columns, one of which contains a date but is stored as a string.
The destination table also has four columns, but two of them will not be stored as-is; there is a mapping between the source columns and the destination columns.
This mapping is stored in two tables, one per mapped column: the first table stores SourceFeatureID, DestinationFeatureID; similarly, the second table stores SourcePID, DestinationPID.
Data is updated periodically, so we need to know from the destination data when it was last updated and fetch the remaining rows where SourceDate > LastUpdated_destination_date.
Update 1: Components that you can use to achieve your goal within a Data Flow Task
Source and Destination
OLEDB Source: Read from the staging table. You can use a SQL command to return only the data with SourceDate greater than the latest destination date:
SELECT * FROM StagingTable T1 WHERE CAST(SourceDate AS DATETIME) > (SELECT MAX(DestDate) FROM DestinationTable)
OLEDB Destination: Insert data to production database
Join with other table
Lookup transformation: The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. You use the lookup to access additional information in a related table that is based on values in common columns.
Merge Join: The Merge Join transformation provides an output that is generated by joining two sorted datasets using a FULL, LEFT, or INNER join
Convert columns data types
Data Conversion transformation: The Data Conversion transformation converts the data in an input column to a different data type and then copies it to a new output column
Derived Column transformation: The Derived Column transformation creates new column values by applying expressions to transformation input columns. An expression can contain any combination of variables, functions, operators, and columns from the transformation input. The result can be added as a new column or inserted into an existing column as a replacement value. The Derived Column transformation can define multiple derived columns, and any variable or input columns can appear in multiple expressions.
References
Lookup Transformation
Merge Join Transformation
Data Conversion Transformation
Derived Column Transformation
Initial answer
I found that I need to run stored procedure to transform the data from Staging to Actual production data
This is not true; you can perform the data transfer using a Data Flow Task.
There are many links where you can find detailed solutions:
SSIS. How to copy data of one table into different tables?
Create a Project and Basic Package with SSIS
Fill SQL database from a CSV File (even though the source there is CSV, it is very helpful)
Executing stored procedure using SSIS
Anyway, to execute a stored procedure from SSIS you can use an Execute SQL Task
Additional information:
Execute SQL Task
How to Execute Stored Procedure in SSIS Execute SQL Task in SSIS
I'm not going to go through your comments. I'm just going to post an example of loading StagingTable into TargetTable, with an example of a date conversion, and an example of using a mapping table.
This code creates the stored proc
CREATE PROC MyProc
AS
BEGIN
-- First delete any data that exists
-- in the target table that is already there
DELETE TargetTable
WHERE EXISTS (
SELECT * FROM StagingTable
WHERE StagingTable.SomeKeyColumn = TargetTable.SomeKeyColumn
)
-- Insert some data into the target table
INSERT INTO TargetTable(Col1,Col2,Col3)
-- This is the data we are inserting
SELECT
ST.SourceCol1, -- This goes into Col1
-- This column is converted to a date then loaded into Col2
TRY_CONVERT(DATE, ST.SourceCol2,112),
-- This is a column that has been mapped from another table
-- That will be loaded into Col3
MT.MappedColumn
FROM StagingTable ST
-- This might need to be an outer join. Who knows
INNER JOIN MappingTable MT
-- we are mapping using a column called MapCol
ON ST.MapCol = MT.MapCol
END
This code runs the stored proc that you just created. You put this into an execute SQL task after your data flow in the SSIS package:
EXEC MyProc
With regards to date conversion, see here for the numbered styles:
CAST and CONVERT (Transact-SQL)

SQL Update master table with new table data hourly based on no match on Composite PK

Using SQL Server 2008
I have an SSIS task that downloads a CSV file from FTP and renames the file every hour. After that I'm doing a bulk insert of the data into a new table called NEWFTPDATA.
The data in this file is for the current day up to the current hour. The table has a composite primary key consisting of 4 different columns.
The next step I need to complete, using T-SQL, is to compare this new table to my existing master archive table and insert any rows that do not already exist, based on matching (or rather not matching) on those 4 columns.
Since I'll be downloading this file hourly (for real-time reporting) for each subsequent run there will be duplicate data which I will not want to insert into the master table to avoid duplicating data.
I've found ways to do this based off of the existence of one particular column, but I can't seem to figure out how to do it based off of 4 columns needing to match.
The workflow should be as follows
Update MASTERTABLE from NEWFTPDATA where newftpdata.column1, newftpdata.column2, newftpdata.column3, newftpdata.column4 do not exist in MASTERTABLE
Hopefully I've supplied substantial information for this question. If any further details are required please let me know. Thank you.
You can use MERGE:
MERGE MasterTable AS dest
USING newftpdata AS src
   ON dest.column1 = src.column1
  AND dest.column2 = src.column2
  AND dest.column3 = src.column3
  AND dest.column4 = src.column4
WHEN NOT MATCHED THEN
  INSERT (column1, column2, ...)
  VALUES (src.column1, src.column2, ...);
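If you would rather avoid MERGE, the same "insert only what is missing" check can be written as a plain INSERT ... SELECT with NOT EXISTS on the four key columns (column names are placeholders, as in the MERGE above; add any non-key columns to both lists):
INSERT INTO MasterTable (column1, column2, column3, column4)
SELECT src.column1, src.column2, src.column3, src.column4
FROM NEWFTPDATA AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM MasterTable AS dest
    WHERE dest.column1 = src.column1
      AND dest.column2 = src.column2
      AND dest.column3 = src.column3
      AND dest.column4 = src.column4
);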
