Fill SQL database from a CSV File - sql-server

I need to create a database using a CSV file with SSIS. The CSV file includes four columns: Employee Number, Employee Name, LoginName, and Group Name.
I need to use the information from that file to populate the three tables I created in SQL below.
I have realized that what I need is to use one column from the Employee table, EmployeeNumber, and one from the Group table, GroupID, to populate the EmployeeGroup table. For that, I thought a Merge Join was what I needed, but I created the Data Flow Task in SSIS, and the results are the same: no data displayed.
The middle table is the one used to relate the other tables.
I created the package in SSIS, and the Employee and Group tables are populated, but the EmployeeGroup table is not: it only shows the EmployeeNumber and GroupID columns with no data.
I am new to SSIS, and I really do not know what else to do. I would really appreciate your help.

Overview
Solutions Using SSIS
Using 3 Data Flow Tasks
Using 2 Data Flow Tasks
Solutions Using T-SQL
Using Microsoft.ACE.OLEDB
Using Microsoft Text Driver
Solutions Using PowerShell
1st Solution - SSIS
Using 3 Data Flow Tasks
This can be done using only 2 Data Flow Tasks, but since the OP mentioned in the question I am new using SSIS, and I really do not know what else to do, I will provide the easiest solution, which uses 3 Data Flow Tasks, to avoid additional components like Multicast.
Solution Overview
Because you want to build a relational database and extract the relations from the csv, you have to read the csv 3 times (consider it as 3 separate files).
First you have to import the Employees and Groups data, then you have to import the relation table between them.
Each import step can be done in a separate Data Flow Task.
Detailed Solution
Add a Flat File Connection Manager (csv file)
Add an OLE DB Connection Manager (SQL destination)
Add 3 Data Flow Tasks like the image below
First Data Flow Task
Add a Flat File Source, a Script Component, and an OLE DB Destination, as shown in the image below
In the Script Component, choose the Group Name column as Input
Select the Output Buffer, change the SynchronousInputID property to None, and add an output column OutGroupName of type DT_STR
In the Script section, write the following code:
Imports System.Collections.Generic

' Tracks the group names already emitted, so each distinct
' group name is written to the output only once.
Private m_List As New List(Of String)

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    If Not Row.GroupName_IsNull AndAlso
       Not String.IsNullOrEmpty(Row.GroupName.Trim) Then
        If Not m_List.Contains(Row.GroupName.Trim) Then
            m_List.Add(Row.GroupName.Trim)
            CreateOutputRows(Row.GroupName.Trim)
        End If
    End If
End Sub

Public Sub CreateOutputRows(ByVal strValue As String)
    ' Emit one output row per distinct group name.
    Output0Buffer.AddRow()
    Output0Buffer.OutGroupName = strValue
End Sub
On the OLE DB Destination, map OutGroupName to the GroupName column
Second Data Flow Task: Import Employees Data
Repeat the same steps done for the Group Name column, with a single difference: choose the EmployeeID, Employee Name, and LoginName columns as Input in the Script Component, and use the ID column instead of the Group Name column in the comparison
Third Data Flow Task: Import Employees_Group Data
You have to add a Flat File Source, a Lookup transformation, and an OLE DB Destination
In the Lookup Transformation component, select the Groups table as the lookup table
Map the GroupName columns and get GroupID as output
Choose Ignore Failure in the Error Output configuration
In the OLE DB Destination, map the columns as follows
Note: GroupID must be an Identity column (set it in SQL Server)
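For reference, a minimal sketch of the three target tables, using the names that appear in the question and in the T-SQL section below; the data types are assumptions, so adjust them to your data:

-- Hypothetical target schema; table and column names from the question, types assumed.
CREATE TABLE [Group]
(
    GroupID INT IDENTITY(1,1) PRIMARY KEY, -- identity, as noted above
    [Group Name] NVARCHAR(100) NOT NULL
)

CREATE TABLE Employee
(
    [Employee Number] INT PRIMARY KEY,
    [Employee Name] NVARCHAR(100) NULL,
    LoginName NVARCHAR(100) NULL
)

CREATE TABLE EmployeeGroup
(
    [Employee Number] INT NOT NULL REFERENCES Employee ([Employee Number]),
    GroupID INT NOT NULL REFERENCES [Group] (GroupID)
)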
Using 2 Data Flow Tasks
Follow the same steps as in the 3 Data Flow Tasks solution, but instead of adding 2 Data Flow Tasks for Group and Employee, add a single Data Flow Task, and after the Flat File Source add a Multicast component to duplicate the flow. Then, for the first flow, use the same Script Component and OLE DB Destination used in the Employee Data Flow Task; for the second flow, use the Script Component and OLE DB Destination related to Group.
2nd Solution - Using T-SQL
There are many methods to import a flat file into SQL Server via T-SQL commands
OPENROWSET with Microsoft ACE OLEDB provider
Assuming that the installed version of the Microsoft ACE OLEDB provider is Microsoft.ACE.OLEDB.12.0 and that the csv file location is C:\abc.csv
First, import data into the Employee and Group tables
INSERT INTO [GROUP] ([Group Name])
SELECT DISTINCT -- DISTINCT so each group name is inserted only once
    [Group Name]
FROM OPENROWSET
(
    'Microsoft.ACE.OLEDB.12.0','Text;Database=C:\;IMEX=1;','SELECT * FROM abc.csv'
) t

INSERT INTO [Employee] ([Employee Number],[Employee Name],[LoginName])
SELECT DISTINCT
    [Employee Number],[Employee Name],[LoginName]
FROM OPENROWSET
(
    'Microsoft.ACE.OLEDB.12.0','Text;Database=C:\;IMEX=1;','SELECT * FROM abc.csv'
) t
Import the EmployeeGroup data
INSERT INTO [EmployeeGroup] ([Employee Number],[GroupID])
SELECT DISTINCT
    t1.[Employee Number], t2.[GroupID]
FROM OPENROWSET
(
    'Microsoft.ACE.OLEDB.12.0','Text;Database=C:\;IMEX=1;','SELECT * FROM abc.csv'
) t1
INNER JOIN [GROUP] t2 ON t1.[Group Name] = t2.[Group Name] -- GROUP is a reserved word, so bracket it
OPENROWSET with Microsoft Text Driver
First, import data into the Employee and Group tables
INSERT INTO [GROUP] ([Group Name])
SELECT DISTINCT
    [Group Name]
FROM OPENROWSET
(
    'MSDASQL',
    'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\;',
    'SELECT * FROM abc.csv'
) t

INSERT INTO [Employee] ([Employee Number],[Employee Name],[LoginName])
SELECT DISTINCT
    [Employee Number],[Employee Name],[LoginName]
FROM OPENROWSET
(
    'MSDASQL',
    'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\;',
    'SELECT * FROM abc.csv'
) t
Import the EmployeeGroup data
INSERT INTO [EmployeeGroup] ([Employee Number],[GroupID])
SELECT DISTINCT
    t1.[Employee Number], t2.[GroupID]
FROM OPENROWSET
(
    'MSDASQL',
    'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=C:\;',
    'SELECT * FROM abc.csv'
) t1
INNER JOIN [GROUP] t2 ON t1.[Group Name] = t2.[Group Name]
Note: You can import the data into a staging table once, then query that table, to avoid connecting to the csv file many times.
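A minimal sketch of that staging approach, assuming the ACE provider and file path used above (the table name #CsvStage is hypothetical):

-- Load the csv once into a temporary staging table.
SELECT *
INTO #CsvStage
FROM OPENROWSET
(
    'Microsoft.ACE.OLEDB.12.0','Text;Database=C:\;IMEX=1;','SELECT * FROM abc.csv'
) t

-- The three INSERT statements above can then select FROM #CsvStage
-- instead of calling OPENROWSET against the file each time.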
Solutions Using PowerShell
There are many methods to import csv files to SQL Server; you can check the following links for additional information.
Four Easy Ways to Import CSV Files to SQL Server with PowerShell
How to import data from .csv in SQL Server using PowerShell?
References
OPENROWSET (Transact-SQL)
T-SQL – Read CSV files using OpenRowSet
Import error using Openrowset

I think the easiest solution would be to import the csv to a flat staging table and then use some insert into...select statements to populate the target tables.
Assuming you know how to import to a flat table, the rest is quite simple:
INSERT INTO Employee (EmployeeNumber, EmployeeName, LoginName)
SELECT DISTINCT EmployeeNumber, EmployeeName, LoginName
FROM Stage

INSERT INTO [Group] (GroupName)
SELECT DISTINCT GroupName
FROM Stage

INSERT INTO EmployeeGroup (EmployeeNumber, GroupId)
SELECT DISTINCT s.EmployeeNumber, g.GroupId
FROM Stage s
INNER JOIN [Group] g ON s.GroupName = g.GroupName
You can see a live demo on rextester.
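If you have not built the flat import yet, here is a minimal sketch of the staging table and one way to load it, assuming the four csv columns from the question and the C:\abc.csv path used in the other answer (column types are assumptions):

CREATE TABLE Stage
(
    EmployeeNumber INT,
    EmployeeName NVARCHAR(100),
    LoginName NVARCHAR(100),
    GroupName NVARCHAR(100)
)

-- Assumes a comma-delimited file with a header row.
BULK INSERT Stage
FROM 'C:\abc.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)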

Since you already know how to import the csv and extract two tables (Employee and Group), I suggest you just populate EmployeeGroup in the same way, and stop using a group_id. If you do that, you'll get SQL statements like:
select [Employee Number], [Employee Name], LoginName from Employee
select distinct [Group Name] from Employee
select distinct [Employee Number], [Group Name] from Employee
Most likely, you'll have similar statements already working for Employee and Group, so you can make EmployeeGroup work in the same way, without using a Merge Join. The Merge Join is a useful component, but clearly somewhere in that component something goes wrong.

Related

How to add Temporary Table inside SSIS OLE DB Source

I want to get data from a source using OLE DB, and I used a SQL command to do that.
I tried it with a WITH clause. It worked, but it took more time to produce the output.
WITH Temp AS
(
    SELECT C.*
    FROM DimCUSTOMER C
    INNER JOIN DimSHOP S
        ON C.CUST_ID = S.CUST_ID
)
SELECT *
FROM Temp
WHERE ADDRESS IS NOT NULL
Then I tried it with a # temporary table in SSMS.
It took less time compared to the WITH clause.
The SQL code is below.
SELECT C.*
INTO #Temp
FROM DimCUSTOMER C
INNER JOIN DimSHOP S
    ON C.CUST_ID = S.CUST_ID

SELECT *
FROM #Temp
WHERE ADDRESS IS NOT NULL
Then I put this code inside the SSIS package's OLE DB Source,
but I'm getting an error when setting the SQL code inside it.
First, drag an Execute SQL Task into the design view and rename it Create Temp Table:
IF OBJECT_ID('tempdb..##tmp') IS NOT NULL
    DROP TABLE ##tmp

CREATE TABLE ##tmp
(
    -- your columns
)

INSERT INTO ##tmp
SELECT C.*
FROM DimCUSTOMER C
INNER JOIN DimSHOP S
    ON C.CUST_ID = S.CUST_ID
Once the table has been created in the SSIS package, right click the OLE DB Source and choose Edit. Choose your data source and choose SQL command for the Data access mode dropdown. (Also set RetainSameConnection=True on the connection manager, so the global temp table created by the Execute SQL Task is still there when the Data Flow runs.) In the SQL command text you will need to enter your SQL statement:
SELECT *
FROM ##tmp
WHERE ADDRESS IS NOT NULL

MS Access VBA/SQL: Import CSV and Compare to Existing Records

I need to import sales data from an external source into an Access database. The system that generates the sales reports allows me to export data within a specified date range, but that data may change due to updates, late reported data, etc. I want to loop through each line of the CSV and see if that row already exists. If it does, ignore it; if it doesn't, add a new record to the sales table.
Unless I'm misunderstanding it, I don't believe I can use DoCmd.TransferText as the data structure does not match the table I'm importing it to - I am only looking at importing several of the columns in the file.
What are my best options to (1) access the data within my file to loop through, and (2) compare the contents of a given row against a given table to see if it already exists?
Consider directly querying the csv file with Access SQL, selecting the needed columns, and running one of the NOT IN / NOT EXISTS / LEFT JOIN ... NULL query forms to avoid duplicates.
INSERT INTO [myTable] (Col1, Col2, Col3)
SELECT t.Col1, t.Col2, t.Col3
FROM [text;HDR=Yes;FMT=Delimited(,);Database=C:\Path\To\Folder].myFile.csv t
WHERE NOT EXISTS
(SELECT 1 FROM [myTable] m
WHERE t.Col1 = m.Col1); -- ADD COMPARISON FIELD(S) IN WHERE CLAUSE

SSIS Stage data from values in a local table

I am staging data from two separate data servers onto a staging server - one server contains the order information and the second server contains shipping information for that order. I am pulling across all orders modified within the last 5 days. I then want to load all the associated shipping information for those modified orders. So there are 3 servers in this scenario - the Order server, the Shipping server and the Staging server.
I tried to do a for-each container based on the list of staged orders, but with a result set in the thousands it is a VERY slow solution.
I can't make changes to the Order or Shipping server - but I can do whatever I need on the Staging server.
What I really want to do is create a Variable object with a list of order numbers and then, in my Shipping server select statement, have a filter like this:
select [shipping fields] from [shipping_table] where order_nbr in [variable_list_of_order_nbrs]
If your shipping information is well-indexed, one thing that can be very quick is a Lookup transformation with 'No Cache' selected.
If you need to pull back more than one row, that can be trickier, but it is doable with an asynchronous Script transformation to create placeholder row numbers, and then a modified Lookup SQL statement (with a ROW_NUMBER() field) to ensure you pull back all required rows and each row only once.
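A hedged sketch of what that modified Lookup query could look like; shipping_table and order_nbr come from the question, while the other column names are assumptions:

SELECT
    order_nbr,
    -- Placeholder row number per order, matched against the numbers
    -- generated by the asynchronous Script transformation.
    ROW_NUMBER() OVER (PARTITION BY order_nbr ORDER BY ship_date) AS row_nbr,
    ship_date,
    tracking_nbr
FROM shipping_table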
I'm not sure how efficient this solution is; it just came out of my mind with my SQL skills:
Note: This requires a change in the Order server and Shipping server fetching logic.
Fetch all the record data from the source (Order server) as XML data:
DECLARE @Product TABLE (
    ID INT)

INSERT INTO @Product
SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY a.name) AS Rcount
FROM sys.all_objects a -- used here for demo to generate a series of numbers

-- Return as XML
SELECT T.ID
FROM @Product T
FOR XML PATH('')
Then pass this XML data as an input parameter to the Shipping server, and fetch the records by joining against the output of the XML parameter:
DECLARE @XMLVal XML

SET @XMLVal = (SELECT T.ID
               FROM @Product T
               FOR XML PATH(''))

-- Read from XML
SELECT
    prod.value('(.)[1]', 'int') AS 'ID'
FROM
    @XMLVal.nodes('/ID') AS temp(prod)

Save output of multiple queries into file?

I have a SQL query that I have to run against multiple (15) tables in SQL Server Management Studio.
Is it possible to save the result of multiple queries into a file? (.txt, excel sheet?)
Using UNION is not possible because not all tables have the same number of columns.
The queries look somewhat like this:
select *
from tableA
where main_id in (select id from maintable where date is null and status ='new')
select *
from tableB
where main_id in (select id from maintable where date is null and status ='new')
select *
from tableC
where main_id in (select id from maintable where date is null and status ='new')
select *
from tableD
where main_id in (select id from maintable where date is null and status ='new')
select *
from tableE
where main_id in (select id from maintable where date is null and status ='new')
Try the following:
Open SQL Server Management Studio.
Go to Tools > Options > Query Results > SQL Server > Results To Text.
Then, on the right-hand side, change the output format to Comma delimited.
Run your query, then right click on the results and click Save Results to File.
Once done, rename the file from .rpt to .csv.
Go to the Query menu > Results To... and pick To File, or whichever you want.
Change the .rpt extension to .csv.
Be sure to re-run your queries.
If you want multiple query results in a single file, you can follow the steps below:
Create a view for every SQL query you have (see the sketch after this list). This is just for loading purposes.
Use the Import-Export wizard in SQL Server Management Studio: right click the database > Tasks > Export Data.
In the wizard, choose SQL Server database as the source and Excel file as the destination.
Choose to export multiple tables and select the views as the source; each view will get a separate sheet as its destination in the Excel file.
Go through the next steps in the wizard and finish it.
Now the views' data will be loaded into separate sheets in the target Excel file.
Now you can remove the views, as you don't need them any more.
All of the above can be done inside SSMS.
There are many other options; see Multiple ways to export data from SSMS into separate files.
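As a sketch of step 1, reusing the first query from the question (the view name is hypothetical; repeat for tableB through tableE):

-- Wraps one of the queries so the Export wizard can pick it up as a source.
CREATE VIEW dbo.vw_ExportTableA
AS
SELECT *
FROM tableA
WHERE main_id IN (SELECT id FROM maintable WHERE date IS NULL AND status = 'new')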

UPSERT in SSIS

I am writing an SSIS package to run on SQL Server 2008. How do you do an UPSERT in SSIS?
IF KEY NOT EXISTS
    INSERT
ELSE
    IF DATA CHANGED
        UPDATE
    ENDIF
ENDIF
See SQL Server 2008 - Using Merge From SSIS. I've implemented something like this, and it was very easy. Just using the BOL page Inserting, Updating, and Deleting Data using MERGE was enough to get me going.
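For illustration, a minimal MERGE sketch, assuming a staging table loaded by the SSIS package (table and column names are hypothetical):

MERGE dbo.FinalTable AS target
USING dbo.StagingTable AS source
    ON target.KeyColumn = source.KeyColumn
WHEN MATCHED AND target.DataColumn <> source.DataColumn THEN -- only update changed rows
    UPDATE SET DataColumn = source.DataColumn
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyColumn, DataColumn)
    VALUES (source.KeyColumn, source.DataColumn);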
Apart from T-SQL based solutions (and this question is not even tagged sql/tsql), you can use an SSIS Data Flow Task with a Merge Join as described here (and elsewhere).
The crucial part is the Full Outer Join in the Merge Join (if you only want to insert/update and not delete, a Left Outer Join works as well) of your sorted sources,
followed by a Conditional Split to decide what to do next: insert into the destination (which is also my source here), update it (via SQL Command), or delete from it (again via SQL Command).
INSERT: if the gid is found only in the source (left)
UPDATE: if the gid exists in both the source and the destination
DELETE: if the gid is not found in the source but exists in the destination (right)
I would suggest you have a look at Mat Stephen's weblog on SQL Server's upsert.
SQL 2005 - UPSERT: In nature but not by name; but at last!
Another way to create an upsert in SQL (if you have pre-stage or stage tables):

--Insert Portion
INSERT INTO FinalTable
    ( Columns )
SELECT T.TempColumns
FROM TempTable T
WHERE
    (
        SELECT 'Bam'
        FROM FinalTable F
        WHERE F.Key(s) = T.Key(s)
    ) IS NULL

--Update Portion
UPDATE FinalTable
SET NonKeyColumn(s) = T.TempNonKeyColumn(s)
FROM TempTable T
WHERE FinalTable.Key(s) = T.Key(s)
    AND CHECKSUM(FinalTable.NonKeyColumn(s)) <> CHECKSUM(T.NonKeyColumn(s))
The basic Data Manipulation Language (DML) commands that have been in use over the years are Update, Insert and Delete. They do exactly what you expect: Insert adds new records, Update modifies existing records and Delete removes records.
An UPSERT statement modifies existing records and, if a record is not present, inserts a new one.
The functionality of an UPSERT statement can be achieved with two newer T-SQL set operators:
EXCEPT
INTERSECT
Except:-
Returns any distinct values from the query to the left of the EXCEPT operand that are not also returned from the right query
Intersect:-
Returns any distinct values that are returned by both the query on the left and right sides of the INTERSECT operand.
Example: let's say we have two tables, Table_1 and Table_2.
Table_1 (column Number, datatype int):
1
2
3
4
5
Table_2 (column Number, datatype int):
1
2
5
SELECT * FROM TABLE_1 EXCEPT SELECT * FROM TABLE_2
will return 3, 4, as they are present in Table_1 but not in Table_2.
SELECT * FROM TABLE_1 INTERSECT SELECT * FROM TABLE_2
will return 1, 2, 5, as they are present in both Table_1 and Table_2.
All the pains of complex joins are now eliminated :-)
To use this functionality in SSIS, all you need to do is add an 'Execute SQL' task and put the code in there.
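For instance, a hedged sketch of the insert half of an upsert using EXCEPT, reusing the demo tables above (only the Numbers missing from Table_2 are inserted):

-- EXCEPT filters the source down to rows not already in the target.
INSERT INTO TABLE_2 (Number)
SELECT Number FROM TABLE_1
EXCEPT
SELECT Number FROM TABLE_2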
I usually prefer to let the SSIS engine manage the delta merge: only new items are inserted and changed ones are updated.
If your destination server does not have enough resources to manage a heavy query, this method allows you to use the resources of your SSIS server.
We can use the Slowly Changing Dimension component in SSIS to upsert.
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/transformations/configure-outputs-using-the-slowly-changing-dimension-wizard?view=sql-server-ver15
I would use the 'Slowly Changing Dimension' task.
