Work table, error table and log table in Snowflake? - snowflake-cloud-data-platform

What are the equivalents of the following Teradata tables in Snowflake:
Work tables (WT)
Error tables (ET)
UV tables (another error table that stores rows with uniqueness violations)
Log tables (LT)
These tables get populated by Teradata TPT; is there an equivalent in Snowflake?

Snowflake does not create separate work, error, or log tables as part of a load. Instead, you can validate the load with COPY INTO ... VALIDATION_MODE and then, to facilitate analysis of the errors, use a second COPY INTO statement to unload the problematic records into a text file so they can be analyzed and fixed in the original data files.
That second statement queries the RESULT_SCAN table function to retrieve the records. Note that the statements below must be run in succession so that the LAST_QUERY_ID function picks up the applicable query.
copy into mytable
  from @mystage/myfile.csv.gz
  validation_mode = return_all_errors;   -- validate only; no data is loaded

set qid = last_query_id();               -- capture the query ID of the validation run

copy into @mystage/errors/load_errors.txt
  from (select rejected_record from table(result_scan($qid)));   -- unload the rejected records
Documentation reference: https://docs.snowflake.com/en/user-guide/data-load-bulk-ts.html#step-2-validating-the-data-load

Related

Can we use the ADF Lookup activity to perform an INSERT operation on a Snowflake table?

I have created a new dataset using the Snowflake connector and used it as the source dataset in a Lookup activity.
Then I am trying to INSERT the record into Snowflake using the following query:
INSERT INTO SAMPLE_TABLE VALUES('TEST',1,1,CURRENT_TIMESTAMP,'TEST'); -- (all values are passed)
Result: the row gets inserted into Snowflake, but my pipeline fails with the error below.
Failure happened on 'Source' side. ErrorCode=UserErrorOdbcInvalidQueryString,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The following ODBC Query is not valid: 'INSERT INTO SAMPLE_TABLE VALUES('TEST',1,1,CURRENT_TIMESTAMP,'TEST');'
Could you please share your advice or any lead to solve this problem?
Thanks.
Rajesh
Lookup, as the name suggests, is for searching and retrieving data, not for inserting. However, you can enclose your INSERT code in a procedure and execute it using the Lookup activity.
However, I strongly advise against doing this. Remember that when you insert data into Snowflake you create at least one micro-partition (roughly 16 MB); if you insert one row at a time, performance will be terrible and the data will take up a disproportionate amount of space. Remember that Snowflake is not a transactional (OLTP) database.
Instead, it's better to save all the records to an intermediate file and then import the entire file in one operation.
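As a sketch of the procedure approach (the procedure name and parameter list here are hypothetical; the table is the SAMPLE_TABLE from the question), you could wrap the INSERT in a Snowflake Scripting procedure and have the Lookup activity run a CALL statement, which returns the procedure's result as a row:
-- Hypothetical procedure wrapping the INSERT so the Lookup activity gets a result set back
CREATE OR REPLACE PROCEDURE INSERT_SAMPLE_ROW(P1 STRING, P2 NUMBER, P3 NUMBER, P5 STRING)
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
    INSERT INTO SAMPLE_TABLE VALUES (:P1, :P2, :P3, CURRENT_TIMESTAMP, :P5);
    RETURN 'inserted';
END;
$$;

-- Query configured in the ADF Lookup activity
CALL INSERT_SAMPLE_ROW('TEST', 1, 1, 'TEST');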
You can use the Lookup activity to perform operations other than SELECTs; the query just HAS to produce an output. I've gotten around this with a Postgres database, doing CREATE TABLEs, TRUNCATEs, and one-off INSERTs by concatenating a
select current_date;
after the main query (see the sketch below).
Note that the Script activity will definitely be better for this; we are waiting on Postgres support in it, though.
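A minimal sketch of that concatenation trick using the table from the question; whether the Snowflake connector accepts a multi-statement script like this depends on the connector and driver configuration, so treat it as an assumption to verify:
-- The statement you actually want to run
INSERT INTO SAMPLE_TABLE VALUES ('TEST', 1, 1, CURRENT_TIMESTAMP, 'TEST');
-- A trivial trailing SELECT so the Lookup activity has a row to return
SELECT CURRENT_DATE;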

Oracle stored procedure - tracking validation errors

I'm in the process of converting an MsSQL stored procedure into Oracle, but I've run into an issue that's making me question both implementations.
The Sql Server version creates a temp table inside the stored procedure to track validation errors on what can potentially be a somewhat large dataset (hundreds of thousands of records). Each validation query selects the invalid IDs into the temp table with an appropriate error message (specific to the query). Once all validation is done, the errors are inserted into a real table (which doesn't have a column to store the IDs). I can then easily insert the valid rows by filtering out the IDs from the temp error table.
I hope that makes sense. And just to reiterate, the reason I don't simply use the "real" error table is that it doesn't contain a column for me to store the IDs of the invalid rows (I can't change this).
I know that I can use a normal/global temporary table in Oracle, but the more I read into it, the more it sounds like this is bad practice. What's a good alternative to do this in Oracle? Collections?
Thanks.
Why not just insert the errors into the real target table when the error occurs, instead of into a #TEMP table and moving them later? You did not list your Oracle version, but you can create a global temporary table once on the system; the data your stored procedure inserts into it is then private to your session. However, temp tables are normally not needed in Oracle, and the entire process design sounds suspect. It is possible in Oracle to bulk insert the data while capturing individual row errors from the bulk operation. If you or some members of your team have a decent understanding of how Oracle works, then this process seems like a candidate for refactoring (redesign).
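One way to do that bulk insert with row-level error capture is Oracle's DML error logging; the sketch below assumes hypothetical target_table and staging_table names:
-- Create the error-logging table once (defaults to ERR$_TARGET_TABLE)
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'TARGET_TABLE');
END;
/

-- Bulk insert; invalid rows are diverted to the error table instead of failing the statement
INSERT INTO target_table (id, col1, col2)
SELECT id, col1, col2
FROM   staging_table
LOG ERRORS INTO err$_target_table ('validation batch 1')
REJECT LIMIT UNLIMITED;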

Easy way of overwriting old rows in SSIS Package

I've created an SSIS package with a Script Component that calls data from a JSON API and inserts it into a table in SQL Server. I've set up the logic to add new rows; however, I want to find the most appropriate way to delete/overwrite old rows. The data is fetched every 4 hours, so there's an overlap of approximately 1000 rows each time the package is run.
My first thought was to simply add a SQL Task after the Data Flow Task that deletes the duplicate rows (with the smallest ID number). However, I was wondering how to do this inside the Data Flow Task? The API call fetches no more than 5000 rows each time, the destination table has around 1m rows, and the entire project runs in approx. 10 seconds.
My simple Data Flow Task looks like this:
There are two main approaches you can try:
Run a Lookup on the row ID. If matched, run an OLE DB Command transformation for each row with an UPDATE statement. If not matched, direct the rows to the OLE DB Destination.
Easy to implement with straightforward logic, but the multitude of UPDATE statements will create performance problems.
Create an intermediate table in the database, clean it before running the Data Flow Task, and store all rows from your Data Flow into this intermediate table. Then, in the next task, do either of the following:
MERGE the intermediate table with the main table. More info on MERGE.
In a transaction, delete the rows from the main table which exist in the intermediate table, then do INSERT INTO <main table> SELECT ... FROM <intermediate table>.
I usually prefer the intermediate table approach with MERGE: performant, simple, and flexible. The MERGE statement can have downsides when run in concurrent sessions or on clustered columnstore tables; in that case I use the intermediate table with a DELETE...INSERT command instead. A sketch of both options follows.
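A rough sketch of both options; the table, key, and column names below are assumptions, not taken from the question:
-- Option A: MERGE the staged rows into the main table
MERGE dbo.MainTable AS tgt
USING dbo.IntermediateTable AS src
    ON tgt.RowId = src.RowId
WHEN MATCHED THEN
    UPDATE SET tgt.Value1 = src.Value1,
               tgt.Value2 = src.Value2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (RowId, Value1, Value2)
    VALUES (src.RowId, src.Value1, src.Value2);

-- Option B: DELETE then INSERT inside one transaction
BEGIN TRANSACTION;

DELETE m
FROM dbo.MainTable AS m
WHERE EXISTS (SELECT 1 FROM dbo.IntermediateTable AS i WHERE i.RowId = m.RowId);

INSERT INTO dbo.MainTable (RowId, Value1, Value2)
SELECT RowId, Value1, Value2
FROM dbo.IntermediateTable;

COMMIT TRANSACTION;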
So I figured out that the easiest solution in my case (where there are only relatively few rows to update) was to use the OLE DB Command component, as can be seen below.
In the component I added an UPDATE SQL statement with logic such as the following:
UPDATE [dbo].[table]
SET [value1] = ?,
    [value2] = ?,
    [value3] = ?
WHERE [value1] = ?
Then I mapped the parameters to their corresponding columns, and made sure that my where clause used the lookup match output to update the correct rows. The component makes sure that the "Lookup Match Output" is updated using the columns I use in the Lookup component.

How to load and filter data efficiently in SSIS

I need to load a flat file with an SSIS package executed in a scheduled job in SQL Server 2016, but it's taking TOO MUCH TIME (2-3 hours) just to load the data from the source, then another 2-3 hours to sort and filter, and a similar amount of time to load the data into the target. The file only has about a million rows and is roughly 3 GB. This is driving me crazy, because it is affecting the performance of my server.
SSIS package:
My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, that's all.
The Data Access Mode is set to FAST LOAD.
There is just one index on the table, and my destination table has 32 columns.
Input file:
The input text file has more than 32 columns; the surrogate key data may not be unique and the referenced date columns may not be unique, so these rows need to be filtered.
I face two problems: the SSIS Flat File Source takes a huge amount of time to load the data, and the sort and filter take a similar amount of time. What should I do?
If you want it to run fast use this pattern:
Load the data exactly as-is into a staging table
Optionally add indexes to the staging table afterwards
Use SQL to perform whatever processing you need (e.g. SELECT DISTINCT or GROUP BY into the final table)
You can do this kind of thing in SSIS, but you need to tune it properly; it's just easier to do it inside a database, which is already well optimised for this. A rough example of the last step is shown below.
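A minimal sketch of that last step, assuming hypothetical StagingTable and TargetTable names and three of the 32 columns:
-- Dedupe in the database after the fast, as-is load into staging
INSERT INTO dbo.TargetTable (col1, col2, col3)
SELECT DISTINCT col1, col2, col3
FROM dbo.StagingTable;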
Some Suggestions
1. DROP and Recreate indexes
Add two Execute SQL Tasks, one before and one after the Data Flow Task: the first drops the index, and the second recreates it after the Data Flow Task has executed successfully.
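For example (the index and table names here are assumed, not from the question):
-- Execute SQL Task before the Data Flow Task
DROP INDEX IX_Destination_Col1 ON dbo.DestinationTable;

-- Execute SQL Task after the Data Flow Task completes
CREATE INDEX IX_Destination_Col1 ON dbo.DestinationTable (col1);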
2. Adjust the buffer size
You can read more about buffer sizing in the following Technet article
3. Remove duplicates in SQL Server instead of Sort components
Try to remove the Sort components, and add an Execute SQL Task after the Data Flow Task which runs a query similar to this:
;WITH x AS
(
    SELECT col1, col2, col3,
           rn = ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY id)
    FROM dbo.tbl
)
DELETE x WHERE rn > 1;
4. Use a script component instead of Sort
You have to implement logic similar to the following answer:
SSIS: Flat File Source to SQL without Duplicate Rows
Some helpful Links
Integration Services: Performance Tuning Techniques
Data Flow Performance Features
Can I delete database duplicates based on multiple columns?
Removing duplicate rows (based on values from multiple columns) from SQL table
Deleting duplicates based on multiple columns
Remove duplicates based on two columns

SQL Server statement with bulk insert

I am working on a project where I have used the bulk insert statement to import a batch .csv file into a table.
The problem I have is that some of the records are duplicates of what is currently in the table I am looking to import data into. Is there a way to run a statement with the bulk insert to check for specific rows that match the file rows based on certain criteria?
I am sure there is a way to make this work, just nothing I have in mind.
No, the BULK INSERT statement is optimized for raw speed - it just inserts that data as quickly as possible - but it does not allow for inspection or decisions to be made while importing.
The usual approach in such a case is to bulk insert your data into a staging table, and then after that's done, copy only those rows that are not duplicates into the actual data table and discard everything else.
But that's a separate step; it cannot be done while bulk inserting. A sketch of the approach is shown below.
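A sketch of that staging-table pattern; the file path, table names, and key column are assumptions, not taken from the question:
-- 1. Bulk insert the raw file into a staging table
BULK INSERT dbo.StagingTable
FROM 'C:\import\batch.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- 2. Copy only the rows that do not already exist in the real table
INSERT INTO dbo.TargetTable (KeyCol, Col1, Col2)
SELECT s.KeyCol, s.Col1, s.Col2
FROM dbo.StagingTable AS s
WHERE NOT EXISTS
      (SELECT 1 FROM dbo.TargetTable AS t WHERE t.KeyCol = s.KeyCol);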
