Import data from Oracle to SQL Server using SSIS - sql-server

We want to import data from Oracle to SQL Server using SSIS.
I was able to transfer data from Oracle to one table (Staging) in SQL Server. Then I need to transform the data, and I found that I need to run a stored procedure to transform the data from Staging to the actual production data. But I wonder how we can do that.
EDIT #1
The source table has four columns, one of which contains a date but is stored as a string.
The destination table also has four columns, but two of them will not be stored as-is: there is a mapping between those source columns and the destination columns.
This mapping is stored in two tables, one for each of the two columns. The first table stores SourceFeatureID, DestinationFeatureID; similarly, the second table stores SourcePID, DestinationPID.
Data is updated periodically, so we need to know when the destination data was last updated and load only the remaining rows where SourceDate > LastUpdated_destination_date.

Update 1: Components that you can use to achieve your goal within a Data Flow Task
Source and Destination
OLEDB Source: Read from the staging table; you can use an SQL command to return only the rows with SourceDate > destination date, for example:
SELECT * FROM StagingTable T1 WHERE CAST(SourceDate as Datetime) > (SELECT MAX(DestDate) FROM DestinationTable)
OLEDB Destination: Insert data to production database
Join with other tables
Lookup transformation: The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. You use the lookup to access additional information in a related table that is based on values in common columns.
Merge Join: The Merge Join transformation provides an output that is generated by joining two sorted datasets using a FULL, LEFT, or INNER join
Convert column data types
Data Conversion transformation: The Data Conversion transformation converts the data in an input column to a different data type and then copies it to a new output column
Derived Column transformation: The Derived Column transformation creates new column values by applying expressions to transformation input columns. An expression can contain any combination of variables, functions, operators, and columns from the transformation input. The result can be added as a new column or inserted into an existing column as a replacement value. The Derived Column transformation can define multiple derived columns, and any variable or input columns can appear in multiple expressions.
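For example, assuming the string date column is named SourceDate and holds values in yyyymmdd form (both assumptions, since the question doesn't say), a Derived Column expression to turn it into a database date could look like this sketch:
(DT_DBDATE)(SUBSTRING(SourceDate,1,4) + "-" + SUBSTRING(SourceDate,5,2) + "-" + SUBSTRING(SourceDate,7,2))
If the string is already in an ISO format such as yyyy-mm-dd, a plain cast like (DT_DBDATE)SourceDate is usually enough.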
References
Lookup Transformation
Merge Join Transformation
Data Conversion Transformation
Derived Column Transformation
Initial answer
I found that I need to run stored procedure to transform the data from Staging to Actual production data
This is not necessarily true; you can perform the data transfer using a Data Flow Task.
There are many links where you can find detailed solutions:
SSIS. How to copy data of one table into different tables?
Create a Project and Basic Package with SSIS
Fill SQL database from a CSV File (even though the source there is a CSV file, it is very helpful)
Executing stored procedure using SSIS
Anyway, to execute a stored procedure from SSIS you can use an Execute SQL Task
Additional information:
Execute SQL Task
How to Execute Stored Procedure in SSIS Execute SQL Task in SSIS

I'm not going to go through your comments. I'm just going to post an example of loading StagingTable into TargetTable, with an example of a date conversion, and an example of using a mapping table.
This code creates the stored proc
CREATE PROC MyProc
AS
BEGIN
-- First delete any data that exists
-- in the target table that is already there
DELETE TargetTable
WHERE EXISTS (
SELECT * FROM StagingTable
WHERE StagingTable.SomeKeyColumn = TargetTable.SomeKeyColumn
)
-- Insert some data into the target table
INSERT INTO TargetTable(Col1,Col2,Col3)
-- This is the data we are inserting
SELECT
ST.SourceCol1, -- This goes into Col1
-- This column is converted to a date then loaded into Col2
TRY_CONVERT(DATE, ST.SourceCol2,112),
-- This is a column that has been mapped from another table
-- That will be loaded into Col3
MT.MappedColumn
FROM StagingTable ST
-- This might need to be an outer join. Who knows
INNER JOIN MappingTable MT
-- we are mapping using a column called MapCol
ON ST.MapCol = MT.MapCol
END
This code runs the stored proc that you just created. You put it into an Execute SQL Task after your data flow in the SSIS package:
EXEC MyProc
With regards to date conversion, see here for the numbered styles:
CAST and CONVERT (Transact-SQL)
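For example, style 112 expects strings in yyyymmdd form. A quick illustration (made-up values; note that TRY_CONVERT requires SQL Server 2012 or later):
SELECT TRY_CONVERT(DATE, '20230115', 112)   -- returns 2023-01-15
SELECT TRY_CONVERT(DATE, 'not a date', 112) -- returns NULL instead of raising an error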

Related

Creating PL/SQL procedure to fill intermediary table with random data

As part of my classes on relational databases, I have to create procedures as part of a package to fill some of the tables of an Oracle database I created with random data, more specifically the tables community, community_account and community_login_info (see the ERD linked below). I succeeded in doing this for the tables community and community_account, however I'm having some problems generating data for the table community_login_info. It serves as an intermediary table for the many-to-many relationship between community and community_account, linking the IDs of both tables.
My latest approach was to create an associative array with the structure of the target table community_login_info. I then do a cross join of community and community_account (there's already random data in there) along with random timestamps, bulk collect that result into the associative array variable, and then insert its contents into the target table community_login_info. But it seems I'm doing something wrong, since Oracle returns error ORA-00947 'not enough values'. To me it seems all columns of the target table get a value in the insert; what am I missing here? I added the code from my package body below.
ERD snapshot
PROCEDURE mass_add_rij_koppeling_community_login_info
IS
TYPE type_rec_communties_accounts IS RECORD
(type_community_id community.community_id%type,
type_account_id community_account.account_id%type,
type_start_timestamp_login community_account.start_timestamp_login%type,
type_eind_timestamp_login community_account.eind_timestamp_login%type);
TYPE type_tab_communities_accounts
IS TABLE of type_rec_communties_accounts
INDEX BY pls_integer;
t_communities_accounts type_tab_communities_accounts;
BEGIN
SELECT community_id,account_id,to_timestamp(start_datum_account) as start_timestamp_login, to_timestamp(eind_datum_account) as eind_timestamp_login
BULK COLLECT INTO t_communities_accounts
FROM community
CROSS JOIN community_account
FETCH FIRST 50 ROWS ONLY;
FORALL i_index IN t_communities_accounts.first .. t_communities_accounts.last
SAVE EXCEPTIONS
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
values (t_communities_accounts(i_index));
END mass_add_rij_koppeling_community_login_info;
Your error refers to the part:
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
values (t_communities_accounts(i_index));
(By the way, the complete error message gives you the line number where the error is located; that can help narrow down the problem.)
When you specify the columns to insert, then you need to specify the columns in the VALUES part too:
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
VALUES (t_communities_accounts(i_index).community_id,
t_communities_accounts(i_index).account_id,
t_communities_accounts(i_index).start_timestamp_login,
t_communities_accounts(i_index).eind_timestamp_login);
If the table COMMUNITY_LOGIN_INFO doesn't have any more columns, you could use this syntax:
INSERT INTO community_login_info
VALUES (t_communities_accounts(i_index));
But I don't like performing inserts without specifying the columns: I could end up inserting the start time into the end time (and vice versa) if the record's fields are not defined in exactly the same order as the table definition, and if the table definition changes over time and new columns are added, you have to modify your procedure to include the new column, even if that column is only ever filled with NULL because this procedure doesn't populate it.
PROCEDURE mass_add_rij_koppeling_community_login_info
IS
TYPE type_rec_communties_accounts IS RECORD
(type_community_id community.community_id%type,
type_account_id community_account.account_id%type,
type_start_timestamp_login community_account.start_timestamp_login%type,
type_eind_timestamp_login community_account.eind_timestamp_login%type);
TYPE type_tab_communities_accounts
IS TABLE of type_rec_communties_accounts
INDEX BY pls_integer;
t_communities_accounts type_tab_communities_accounts;
BEGIN
SELECT community_id,account_id,to_timestamp(start_datum_account) as start_timestamp_login, to_timestamp(eind_datum_account) as eind_timestamp_login
BULK COLLECT INTO t_communities_accounts
FROM community
CROSS JOIN community_account
FETCH FIRST 50 ROWS ONLY;
FORALL i_index IN t_communities_accounts.first .. t_communities_accounts.last
SAVE EXCEPTIONS
INSERT INTO community_login_info (community_id,account_id,start_timestamp_login,eind_timestamp_login)
values (select community_id,account_id,start_timestamp_login,eind_timestamp_login
from table(cast(t_communities_accounts as type_tab_communities_accounts)) a);
END mass_add_rij_koppeling_community_login_info;

SSIS data flow - copy new data or update existing

I query some data from table A (Source) based on a certain condition and insert it into a temp table (Destination) before upserting into CRM.
If the data already exists in CRM, I don't want to query it from table A and insert it into the temp table (I want this table to stay empty) unless that data has been updated or new data was created. So basically I want to query only new data, or modified data from table A that already exists in CRM. At the moment my data flow is like this:
clear temp table - delete sql statement
Query from source table A and insert into temp table.
From temp table insert into CRM using script component.
In source table A I have audit columns: createdOn and modifiedOn.
I found one way to do this, SSIS DataFlow - copy only changed and new records, but I'm not really clear on how to do it.
What is the best and simplest way to achieve this?
The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).
The only way I can really think of to make your process significantly quicker by partially selecting from table A would be to add a "last updated" timestamp to table A and enforce that it is always kept up to date.
One way to do this is with a trigger; see here for an example.
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn column in your final destination table.
You can then build a dynamic query for your data flow source in an SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + @[User::MaxModifiedOnDate] + "')"
@[User::MaxModifiedOnDate] (a string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
@DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT @DataLoadID = SCOPE_IDENTITY()
You would store the returned DataLoadID in an SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
@DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = @DataLoadID
When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries in the history table, so you'd need to insert a dummy record, or find some other way around it.
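For example, you could seed the history table once with a dummy successful load dated far enough in the past to pick up all existing data (the dates here are just placeholders):
INSERT INTO DataLoadHistory (DataLoadStart, DataLoadEnd, Success)
VALUES ('19000101', '19000101', 1)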
THREE
I've seen it done a lot where, say your data load is running every day, you would just stage the last 7 days, or something like that, some margin of safety that you're pretty sure will never be passed (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.
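As a rough sketch of that approach, assuming the same modifiedOn column and a 7-day window, the source query would simply be:
SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -7, CAST(GETDATE() AS date))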

SQL Update master table with new table data hourly based on no match on Composite PK

Using SQL Server 2008
I have an SSIS task that downloads a CSV file from FTP and renames the file every hour. After that I'm doing a bulk insert of the data into a new table called NEWFTPDATA.
The data in this file is for the current day up to the current hour. The table has a composite primary key consisting of 4 different columns.
The next step I need to complete is, using T-SQL, to compare this new table to my existing master archive table and insert any rows that do not already exist, based on matching (or rather not matching) on those 4 columns.
Since I'll be downloading this file hourly (for real-time reporting) for each subsequent run there will be duplicate data which I will not want to insert into the master table to avoid duplicating data.
I've found ways to do this based off of the existence of one particular column, but I can't seem to figure out how to do it based off of 4 columns needing to match.
The workflow should be as follows
Update MASTERTABLE from NEWFTPDATA where newftpdata.column1, newftpdata.column2, newftpdata.column3, newftpdata.column4 do not exist in MASTERTABLE
Hopefully I've supplied substantial information for this question. If any further details are required please let me know. Thank you.
You can use MERGE:
MERGE MasterTable AS dest
USING newftpdata AS src
    ON dest.column1 = src.column1
   AND dest.column2 = src.column2
   AND dest.column3 = src.column3
   AND dest.column4 = src.column4
WHEN NOT MATCHED THEN
    INSERT (column1, column2, ...)
    VALUES (src.column1, src.column2, ...);
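If you'd rather avoid MERGE, a plain INSERT with NOT EXISTS over the same four key columns works too. This is only a sketch using the same placeholder column names as above; list all of the columns you actually need in the INSERT and SELECT:
INSERT INTO MasterTable (column1, column2, column3, column4)
SELECT src.column1, src.column2, src.column3, src.column4
FROM newftpdata src
WHERE NOT EXISTS (
    SELECT 1 FROM MasterTable dest
    WHERE dest.column1 = src.column1
      AND dest.column2 = src.column2
      AND dest.column3 = src.column3
      AND dest.column4 = src.column4
)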

Bad int8 external representation "6*725" in Netezza

I am getting an error like "Bad int8 external representation "6*725"" in Netezza while executing a stored procedure. The stored procedure takes data from a table, does some transformations, and loads it into another table.
Can anyone please help me?
Thanks,
Brajendra
FYI: there may be multiple answers to this question, because we do not have the query that you ran to get the error.
If you did a direct INSERT command like the one below, the order of the columns of the table in the SELECT clause does not match the order of the columns of the table in the INSERT clause. Most database management systems don't care what the order is, but Netezza does. The fact that it threw "Bad int8" just means the first column it couldn't match in the SELECT clause has that data type, while the corresponding column in the INSERT clause has a different data type.
INSERT INTO DB1..TABLE1
SELECT * FROM DB1..TABLE2;
You can fix it with one of two methods: either change the order of the columns by dropping and recreating the table, or use explicit column names in the INSERT INTO ... SELECT command.
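A quick sketch of the second option (the column names here are made up; list them in the order of the target table's definition):
INSERT INTO DB1..TABLE1 (customer_id, order_total, order_date)
SELECT customer_id, order_total, order_date
FROM DB1..TABLE2;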

SSIS : delete rows after an update or insert

Here is the following situation:
I have a table of StudentsA which needs to be synchronized with another table, on a different server, StudentsB. It's a one-way sync from A to B.
Since the table StudentsA can hold a large number of rows, we have a table called StudentsSync (on the input server) containing the ID of StudentsA which have been modified since the last copy from StudentsA to StudentsB.
I made the following SSIS Data Flow task:
The only problem is that I need to DELETE the row from StudentsSync after a successful copy or update. Something like this:
Any idea how this can be achieved?
It can be achieved using three methods:
1. If your target table in OutputDB has timestamp columns such as CreateTimeStamp and ModifiedTimeStamp, then the rows which have been updated or inserted can be obtained with a simple query. You need to run the below query in an Execute SQL Task in the Control Flow to delete those rows from the Sync table.
Delete from SyncTable
where keyColumn in (Select primary_key from target
where ModifiedTimeStamp >= GETDATE() or (ModifiedTimeStamp is null
and CreateTimeStamp>=GETDATE()))
I assume StudentsA's primary key is present in the Sync table along with the primary key of the target table. The above condition basically checks: if a new row is added, the CreateTimeStamp column will have the current date and ModifiedTimeStamp will be null; if the values are updated, ModifiedTimeStamp will have the current date.
The above query will work if you have timestamp columns in your target table, which I feel should be there if you're loading data into a data warehouse.
2. You can use MERGE syntax to perform the update and insert in the Control Flow with an Execute SQL Task; there is no need for a Data Flow Task. The below query can be used even if you don't have any timestamp columns.
DECLARE @Output TABLE (ActionType VARCHAR(20), SourcePrimaryKey INT)

MERGE StudentsB AS TARGET
USING StudentsA AS SOURCE
    ON (TARGET.CommonColumn = SOURCE.CommonColumn)
WHEN MATCHED THEN
    UPDATE SET TARGET.column = SOURCE.Column, TARGET.ModifiedTimeStamp = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (col1, col2, Col3)
    VALUES (SOURCE.col1, SOURCE.col2, SOURCE.Col3)
OUTPUT $action, INSERTED.PrimaryKey AS SourcePrimaryKey INTO @Output;

DELETE FROM SyncTable
WHERE PrimaryKey IN (SELECT SourcePrimaryKey FROM @Output
                     WHERE ActionType = 'INSERT' OR ActionType = 'UPDATE')
The code is not tested as I'm running out of time, but at least it should give you a fair idea of how to proceed. For further detail on the MERGE syntax, read this and this.
3. Use a Multicast component to duplicate the dataset for the insert and the update: connect one Multicast to the Lookup Match output and another Multicast to the Lookup No Match output.
Add a task after "Update existing entry" and after "Insert new entry" to add the student ID to a variable which will contain the list of IDs to delete.
Enclose all of the tasks in a sequence container.
After the sequence container executes add a task to delete all the records from the sync table that are in that variable you've been populating.
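As a rough sketch of that last step: if the collected IDs end up in a comma-separated string variable (say @[User::ProcessedIDs], a name made up here) and the sync table's key column is StudentID (also assumed), the final Execute SQL Task's statement could be built from an expression like:
"DELETE FROM StudentsSync WHERE StudentID IN (" + @[User::ProcessedIDs] + ")"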
