I am trying to set up an ELT pipeline into Snowflake, and it involves a transformation after loading.
This transformation currently creates or replaces a table using data queried from a source table in Snowflake, after performing some manipulation of JSON data.
My question is: is create or replace table every time the transformation runs the proper way of doing this, or is there a way to update the data in the transformed table incrementally?
Any advice would be greatly appreciated!
Thanks!
You can insert into the load (source) table and put a stream on it; the stream then tells you which rows, or ranges of rows, need to be "reviewed", and you can upsert those into the output transform table.
That is, if you are doing something like "daily aggregates" and this batch contains data for the last 4 days, you read only the "last four days" of data from the source (instead of a full read), then aggregate and upsert via the MERGE command. With this model you save on reads/aggregation/writes.
We have also used high-water-mark tables to track the last seen data and/or the lowest value in the current batch.
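A minimal sketch of that pattern, assuming a raw LOAD_TABLE with an event_ts timestamp and an amount column, and a DAILY_AGG target keyed by day (all names here are illustrative, not from the post):

-- A stream records which rows have been appended to the load table since it was last consumed
CREATE OR REPLACE STREAM LOAD_TABLE_STREAM ON TABLE LOAD_TABLE;

-- Re-aggregate only the days touched by the new rows and upsert them into the target.
-- Reading the stream inside a DML statement advances its offset when the transaction commits.
MERGE INTO DAILY_AGG t
USING (
    SELECT TO_DATE(event_ts) AS agg_date, SUM(amount) AS total
    FROM LOAD_TABLE
    WHERE TO_DATE(event_ts) IN (SELECT DISTINCT TO_DATE(event_ts) FROM LOAD_TABLE_STREAM)
    GROUP BY TO_DATE(event_ts)
) s
ON t.agg_date = s.agg_date
WHEN MATCHED THEN UPDATE SET t.total = s.total
WHEN NOT MATCHED THEN INSERT (agg_date, total) VALUES (s.agg_date, s.total);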
I have been searching the internet for a solution to my problem but I cannot seem to find any info. I have a large single text file (10 million rows), and I need to create an SSIS package to load these records into different tables based on the transaction group assigned to each record. That is, Tx_Grp1 records would go into the Tx_Grp1 table, Tx_Grp2 records into the Tx_Grp2 table, and so forth. There are 37 different transaction groups in the single delimited text file, and records are written to the file in the order in which they actually occurred (by time). Also, each transaction group has a different number of fields.
Sample data file
date|tx_grp1|field1|field2|field3
date|tx_grp2|field1|field2|field3|field4
date|tx_grp10|field1|field2
.......
Any suggestion on how to proceed would be greatly appreciated.
This task can be solved with SSIS; it just takes some experience. Here are the main steps and points to consider:
Define a Flat File data source for your file, describing all columns. A possible problem here: the data types of the fields differ based on the tx_group value. If this is the case, I would declare all fields as strings long enough to hold the data, and convert their types later in the data flow.
Create an OLE DB connection manager for the DB you will use to store the results.
Create a main data flow where you will process the file, and add a Flat File Source.
Add a Conditional Split to the output of the Flat File Source, and define as many filters and outputs as you have transaction groups.
For each transaction group output, add a Data Conversion for fields if necessary. Note that you cannot change the data type of an existing column; if you need to cast a string to an int, create a new column.
Add an OLE DB Destination for each destination table. Connect it to the proper transaction group output, and map the fields.
Basically, you are done. Test the package thoroughly on a test DB before using it on a production DB.
I have an SSIS package in which I'm reading the records from a Flat File and storing them in a recordset. Is it possible to compare the values in the recordset with the values in a database table and update the table?
I'm using SQL Server 2008 R2 and the same version of SSIS.
Leran2002's answer is in general right; the most straightforward way is to set up a Lookup component with Redirect rows to no match output, and use a destination and an OLE DB Command afterwards.
However, depending on the size of the result sets, this might be slow, since the Lookup component checks each row one by one, and if your destination table has lots of records, this will take some time. Furthermore, depending on your cache settings in the Lookup component, it can use a lot of memory.
There are two more ways to achieve this:
Merge Join
Using your file source and your destination table as sources, you can use a Merge Join. The logic in the DFT is a bit more complex, but this is more of a set-based approach and it performs better with large result sets.
You'll have to implement the logic that decides which records from the file have to be updated, inserted, deleted or discarded, using a Conditional Split component.
I highly recommend this question (not exactly your problem, but a good comparison in my opinion): What are the differences between Merge Join and Lookup transformations in SSIS?
Staging table
Another way is to use a staging table to temporarily store the records from the file. In this case, your DFT just loads the records from the file into the staging table, and then with one or more Execute SQL Tasks you can merge the two data sets (UPDATE, INSERT, DELETE, MERGE; use whatever fits your needs).
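A minimal sketch of that staging-table merge, assuming a hypothetical dbo.StagingCustomers staging table and a dbo.Customers target keyed by CustomerId (all names are made up for illustration); MERGE is available from SQL Server 2008 onwards:

MERGE dbo.Customers AS tgt
USING dbo.StagingCustomers AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name, tgt.Email = src.Email
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, Name, Email) VALUES (src.CustomerId, src.Name, src.Email);
-- Add WHEN NOT MATCHED BY SOURCE THEN DELETE if rows missing from the file should be removed.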
Usually I use a Lookup component with the Redirect rows to no match output option.
After that you have two outputs to work with, named Lookup No Match Output and Lookup Match Output.
PS. I have three articles about SSIS, but they are in Russian (there are a lot of SQL scripts and pictures, though).
If it interests you, you can look at the following link: https://habrahabr.ru/post/330618/
We have a large production MSSQL database (the mdf is approx. 400 GB) and I have a test database. All the tables, indexes, views, etc. are the same in both. I need to make sure that the data in the tables of these two databases stays consistent, so every night I need to insert all the new rows and update all the changed rows from production into the test DB.
I came up with the idea of using SSIS packages to keep the data consistent by checking for updated rows and new rows in all the tables. My SSIS flow is as follows.
I have a separate SSIS package for each table, because of the differences noted in the PS below.
In order:
1. I get the timestamp value in the table so that I only pull the last day's rows instead of the whole table.
2. I get those rows from the production table.
3. Then I use the Lookup tool to compare this data with the data in the test database table.
4. Then I use a Conditional Split to work out whether a row is new or updated.
5. If the row is new, I insert it into the destination; if the row is updated, I update it in the destination table.
The data flow is shown in the MTRule and STBranch package in the picture.
The problem is that I am repeating this whole flow for every table, and I have more than 300 tables like this. It takes hours and hours :(
What I am asking is:
Is there any way in SSIS to do this dynamically?
PS: Every single table has its own columns and PK values, but my data flow schema is always the same (below).
You can look into BimlScript, which lets you create packages dynamically based on metadata.
I believe the best way to achieve this is to use Expressions. They let you set the source and destination dynamically.
One possible solution might be as follows:
Create a table which stores all your table names and PK columns.
Define a package which loops through this table and composes a SQL statement for each entry.
Call your main package and pass the statement to it.
Use the statement as the data source for your data flow.
If applicable, pass the destination table as a parameter as well (another column in your config table).
This is how I processed several really huge tables: the data had to be fetched from 20 tables and moved to one single table.
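A minimal sketch of such a config table and the kind of statement it could produce (the table and column names here are hypothetical, not from the answer):

-- Metadata table driving the loop: one row per table to process
CREATE TABLE dbo.EtlTableConfig (
    SourceTable sysname NOT NULL,
    TargetTable sysname NOT NULL,
    PkColumn    sysname NOT NULL
);

-- The looping package reads these rows and composes a per-table statement, for example:
-- SELECT * FROM <SourceTable> WHERE <PkColumn> > ?
-- which is then passed to the main package as a variable.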
Why do you need to use SSIS?
You are better off writing a stored procedure that takes the table name as a parameter and does your CRUD there. Then call the stored procedure from a Foreach Loop container in SSIS.
In fact you might be able to do everything using a Stored Procedure and scheduling it in a SQL Agent Job.
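A sketch of what such a procedure could look like, using dynamic SQL driven by the table name. The database names TestDb/ProdDb, the parameters, and the delete-and-reinsert strategy are all assumptions for illustration; the real CRUD would depend on your schemas:

CREATE PROCEDURE dbo.SyncTable
    @TableName       sysname,
    @PkColumn        sysname,
    @TimestampColumn sysname
AS
BEGIN
    SET NOCOUNT ON;
    -- Delete and re-insert the rows that changed in production during the last day.
    -- Assumes identical schemas; identity columns would need SET IDENTITY_INSERT handling.
    DECLARE @sql nvarchar(max) =
          N'DELETE t FROM TestDb.dbo.' + QUOTENAME(@TableName) + N' AS t'
        + N' WHERE t.' + QUOTENAME(@PkColumn) + N' IN'
        + N' (SELECT ' + QUOTENAME(@PkColumn) + N' FROM ProdDb.dbo.' + QUOTENAME(@TableName)
        + N'  WHERE ' + QUOTENAME(@TimestampColumn) + N' >= DATEADD(day, -1, GETDATE()));'
        + N' INSERT INTO TestDb.dbo.' + QUOTENAME(@TableName)
        + N' SELECT * FROM ProdDb.dbo.' + QUOTENAME(@TableName)
        + N' WHERE ' + QUOTENAME(@TimestampColumn) + N' >= DATEADD(day, -1, GETDATE());';
    EXEC sys.sp_executesql @sql;
END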
Do you know how to transfer only new records between two different databases (e.g. Oracle and MSSQL) using SSIS? There is no problem transferring new data between two tables in the same database and server, but is it possible to do such an operation between completely different servers and databases?
PS: I know about the solution using a Lookup, but it is not very efficient if you need to check and add a lot of records (50k and more) several times per day. I would like to work with the new data only.
You have several options:
Timestamp based solution
If you have a column which stores the insertion time in the source system, you can select only the records created since the last load. With the same logic, you can transfer modified records too; just stamp the records with the timestamp value when they change.
Sequence based solution
If there is a sequence in the source table, you can load the new records based on that sequence. Query the last value from the destination system, then load everything which is larger than that value.
CDC based solution
If you have CDC (Change Data Capture) in your source system, you can track the changes and you can load them based on the CDC entries.
Full load
This is the most resource hungry solution: you have to copy all data from the source to the destination. If you do not have any column which marks the new records, you should use this solution.
You have several options to achieve this:
TRUNCATE the destination table and reload it from source
Use a Lookup component to determine which records are missing
Load all data from source to a temporary table and write a query which retrieves the new/changed records.
Summary
If you have at least one column which marks the new/modified records, you can use it to implement a differential/incremental load with SSIS. If you have no clue which columns/rows have changed, you have to load (or at least query) all of them.
There is no solution which enables a single-query (INSERT ... SELECT) approach across multiple servers without transferring all the data. (Please note that a multi-server query using linked servers still transfers the data from the source system.)
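As an illustration of the timestamp/sequence-based options above, here is a sketch of the two queries an SSIS package could run; the table and column names are made up, and the last ID would be carried between the servers in an SSIS variable:

-- On the destination (e.g. MSSQL): find the high-water mark of what has already been loaded
SELECT MAX(Id) AS LastLoadedId FROM dbo.TargetTable;

-- On the source (e.g. Oracle): fetch only the rows above that mark
SELECT Id, CreatedAt, Payload
FROM SourceTable
WHERE Id > :LastLoadedId;  -- Oracle-style bind shown; with an OLE DB source you would use ? instead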
What about variables? Is it possible to use the same variable between different databases and servers in SSIS?
I would like to take the last ID number from a destination table and pass it to the source table (on a different server!).
I can set a variable within a single database like this:
DECLARE @Last int
SET @Last = (SELECT TOP 1 Id FROM dbo.Table_1 ORDER BY Id DESC)
SELECT *
FROM dbo.Table_2
WHERE ID > @Last;
However, this works only between two tables in the same database (as a single SQL command). I can create a variable for the entire SSIS package via Variables --> Add variable, but I don't know whether it is possible to use that variable in a similar way to the above, i.e. to hold the last ID from the destination table and pass it to the source server as a lower bound.
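For reference, a sketch of how that variable could be wired up (the variable name User::LastId is invented here): an Execute SQL Task with a single-row result set can populate a package variable from the destination server, and an OLE DB Source on the source server can then use it in a parameterized command:

-- Execute SQL Task against the destination; result set mapped to User::LastId
SELECT MAX(Id) AS LastId FROM dbo.Table_1;

-- OLE DB Source against the source server; the ? parameter is mapped to User::LastId
SELECT * FROM dbo.Table_2 WHERE ID > ?;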
I receive new data files every day. Right now, I'm building the database with all the required tables to import the data and perform the required calculations.
Should I just append each new day's data to my current tables? Each file contains a date column, which would allow for a "WHERE" query in the future if I need to analyze data for one particular day. Or should I be creating a new set of tables for every day?
I'm new to database design (coming from Excel). I will be using SQL Server for this.
Assuming that the structure of the data being received is the same, you should only need one set of tables rather than creating new tables each day.
I'd recommend storing the value of the date column from your incoming data in your database, and also having a CreateDate column in your tables with a default value of GETDATE(), so that it automatically gets populated with the current date when the row is inserted.
You may also want another column to store the name of the data file that the row was imported from, but if you're already storing the value of the date column and the date the row was inserted, this shouldn't really be necessary.
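A minimal sketch of that table design (the table and column names are invented for illustration):

CREATE TABLE dbo.DailyImport (
    ImportId   int IDENTITY(1,1) PRIMARY KEY,
    DataDate   date          NOT NULL,                     -- the date column from the incoming file
    SourceFile nvarchar(260) NULL,                         -- optional: which file the row came from
    CreateDate datetime      NOT NULL DEFAULT (GETDATE())  -- populated automatically on insert
    -- the actual data columns from the file would follow here
);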
In the past, when doing this type of activity using a custom data loader application, I've also found it useful to create log files to log success/error/warning messages, including some type of unique key of the source data and target database - ie. if coming from an Excel file and going into a database column, you could store the row index from Excel and the primary key of the inserted row. This helps tracking down any problems later on.
You might want to consider having a look at SSIS (SqlServer Integration Services). It's the SqlServer tool for doing ETL activities.
Yes, append each day's data to the tables; one set of tables for all data.
Yes, use a date column to identify the day the data was loaded.
Maybe have another table with a date column and a CLOB column: the date to contain the load date and the CLOB to contain the file that you imported.
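A sketch of that audit table, assuming SQL Server where nvarchar(max) plays the role of a CLOB (names invented):

CREATE TABLE dbo.ImportedFiles (
    LoadDate     date          NOT NULL,  -- when the file was loaded
    FileContents nvarchar(max) NULL       -- the raw file that was imported
);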
Good question. You most definitely should have a single set of tables and append the data daily. Consider this: if you create a new set of tables each day, what would, say, a monthly report query look like? A quarterly report query? It would be a mess, with UNIONs and JOINs all over the place.
A single set of tables with a WHERE clause makes the querying and reporting manageable.
You might do a little reading on relational database theory. Wikipedia is a good place to start. The basics are pretty straightforward if you have the knack for it.
I would have the data load into a staging table regardless and append to the main tables afterwards. Once a week I would then refresh all the data in the main table to ensure that the data remains correct as per the source.
Marcus