I am working with SSIS and I need to load multiple files with the following (yellow) format into SQL using SSIS.
The problem, as you can see, is that the files have a horrible format: I should only process records where column A is populated (e.g. ignoring rows 14 to X), and I need to insert the value in D1 into the Date column.
Any suggestions?
Regards!
Let's divide this problem into 3 sub-problems:
Get the date value from D1
Start Reading from Row number 4
Ignore all Rows where Column1 is NULL
Solution
1. Get the date value from D1
Create 2 SSIS variables: @[User::FilePath] (of type String), which contains the Excel file path, and @[User::FileDate] (of type String), which will store the date value
Add a Script Task and choose Visual Basic as the script language
Select @[User::FilePath] as a ReadOnly variable and @[User::FileDate] as a ReadWrite variable
Open the Script Editor and use the following code to retrieve the date value and store it in @[User::FileDate]
This will search for the sheet named Refunds, extract the date value from it, and store that value in @[User::FileDate]
    ' Requires Imports System.Data and Imports System.Data.OleDb at the top of the script.
    ' m_strExcelPath, m_strExcelConnectionString and m_dtschemaTable are class-level fields;
    ' BuildConnectionString() is a helper defined elsewhere in the script (not shown here).
    ' Reading Rows(0) from A1:D1 assumes the connection string uses HDR=NO.
    Public Sub Main()
        m_strExcelPath = Dts.Variables.Item("FilePath").Value.ToString
        Dim strSheetname As String = String.Empty
        Dim strDate As String = String.Empty
        m_strExcelConnectionString = Me.BuildConnectionString()
        Try
            Using OleDBCon As New OleDbConnection(m_strExcelConnectionString)
                If OleDBCon.State <> ConnectionState.Open Then
                    OleDBCon.Open()
                End If
                'Get all worksheets
                m_dtschemaTable = OleDBCon.GetOleDbSchemaTable(OleDbSchemaGuid.Tables,
                                                               New Object() {Nothing, Nothing, Nothing, "TABLE"})
                'Loop over the worksheets (the Excel file may contain temporary or deleted sheets)
                For Each schRow As DataRow In m_dtschemaTable.Rows
                    strSheetname = schRow("TABLE_NAME").ToString
                    If Not strSheetname.EndsWith("_") AndAlso strSheetname.EndsWith("$") Then
                        'Skip every sheet except Refunds
                        If Not strSheetname.ToLower.Contains("refunds") Then Continue For
                        Using cmd As New OleDbCommand("SELECT * FROM [" & strSheetname & "A1:D1]", OleDBCon)
                            Dim dtTable As New DataTable("Table1")
                            cmd.CommandType = CommandType.Text
                            Using daGetDataFromSheet As New OleDbDataAdapter(cmd)
                                daGetDataFromSheet.Fill(dtTable)
                                'Get the value from column 4 (index 3, because the index is zero-based)
                                strDate = dtTable.Rows(0).Item(3).ToString
                            End Using
                        End Using
                        'Once the first matching sheet is found there is no need to check the others
                        Exit For
                    End If
                Next
                OleDBCon.Close()
            End Using
        Catch ex As Exception
            Throw New Exception(ex.Message, ex)
        End Try
        Dts.Variables.Item("FileDate").Value = strDate
        Dts.TaskResult = ScriptResults.Success
    End Sub
In the Data Flow Task, add a Derived Column Transformation and add a derived column with the following expression:
@[User::FileDate]
2. Start Reading from Row Number 4
We assume that the Excel file path is stored in @[User::FilePath]
First, open the Excel Connection Manager and uncheck the box "First row has column names"
In the Data Flow Task, double-click the Excel Source
Set the data access mode to SQL Command
Use the following command: SELECT * FROM [Refunds$A4:D] , so that it starts reading from row number 4
The column names will be F1 ... F4. In the Excel Source you can go to the Columns tab and give the columns aliases, so that they are shown under those aliases in the Data Flow Task
3. Ignore all Rows Where Column1 is NULL
Add a conditional split after the Excel Source
Split the Flow based on the following expression
ISNULL([F1]) == False
Use F1 if you did not give the column an alias; otherwise use the alias
Finally, remember that you must add a derived column (as we said in the first sub-problem) that contains the date value
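The three sub-problems can also be sketched outside SSIS to make the row logic explicit. The following is only an illustration in Python, assuming the sheet has already been read into a list of rows; the function name shape_refund_rows and the sample data are hypothetical and not part of the package described above.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def shape_refund_rows(sheet: List[Row]) -> List[list]:
    """Apply the three steps to rows that were already read from the sheet."""
    file_date = sheet[0][3]           # step 1: the value of cell D1
    data_rows = sheet[3:]             # step 2: start reading at row 4
    return [
        list(row[:4]) + [file_date]   # derived Date column appended to A..D
        for row in data_rows
        if row[0] not in (None, "")   # step 3: drop rows where column A is empty
    ]


# Hypothetical sheet: D1 holds the date, rows 1-3 are preamble, data starts at row 4.
sample = [
    [None, None, None, "2017-05-01"],
    [None, None, None, None],
    [None, None, None, None],
    ["1", "a", "10", None],
    [None, "x", None, None],
    ["2", "b", "20", None],
]
shaped = shape_refund_rows(sample)
```

In the package itself these steps map to the Script Task, the SQL command range, and the Conditional Split respectively.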
Related
I have a CSV File I am parsing (it's specific, so I can't just dump it in). In one of the fields, there can be a degrees sign. As an example:
TR,TIC593_SP_TREND,,TIC593_SP,0,200,°F,1,YES,NONE,NONE,NONE,,,,NONE,YES,NO,REJECT,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
That is what it looks like in Notepad, and I need to be able to replicate that in the database. All of the fields I am inserting into are NVARCHAR(MAX).
EDIT
What I need is for the Degrees sign to be stored in SQL Server as a degrees sign, so that when I pull it out, it will go into the newly generated CSV file as a degrees sign instead of funky characters (�F is what I see when I export the same field from SQL)
/EDIT
Code-Wise, I am doing the following:
Iterating through all lines in the CSV File to pull out the lines I want to go into the database
Creating a Datatable:
Dim sqlstr = "select * from Tags where id=-1"
adap = New SqlDataAdapter(sqlstr, conn)
Dim dt As New DataTable
adap.Fill(dt)
Creating a New DataRow
Dim _new As DataRow = dt.NewRow
    For i As Integer = 0 To columnsNames.Length - 1
        Application.DoEvents()
        Dim colName As String = columnsNames(i).Replace("!", "")
        Dim colVal As String = columnValues(i)
        If colName <> "" Then
            _new("NodeName") = nodeName
            _new("FileName") = FileName
            _new(colName) = colVal
        End If
    Next
Adding to the Datatable:
dt.Rows.Add(_new)
And, Finally, committing the information to the table:
Dim combld As New SqlCommandBuilder(adap)
adap.Update(dt)
I'm sure there has to be an easy way, just haven't come across it with all the searching I've done (which, hopefully, my search techniques are not the problem!).
Thanks in advance.
I have a SSIS data import package that uses a source Excel spreadsheet and then imports data into a SQL Server database table. I have been unsuccessful in automating this process because the Excel file's worksheet name is changed every day. So, I have had to manually change the worksheet name before running the import each day. As a caveat, there will never be any other worksheets.
Can I make a variable for the worksheet name?
Can I use a wildcard character rather than the worksheet name?
Would I be better off creating an Excel macro or similar to change the worksheet name before launching the import job?
I use the following script task (C#):
    System.Data.OleDb.OleDbConnection objConn;
    DataTable dt;
    string connStr = ""; //Use the same connection string that you have in your package
    objConn = new System.Data.OleDb.OleDbConnection(connStr);
    objConn.Open();
    dt = objConn.GetOleDbSchemaTable(System.Data.OleDb.OleDbSchemaGuid.Tables, null);
    objConn.Close();
    foreach (DataRow r in dt.Rows)
    {
        //for some reason there is always a duplicate sheet with underscore.
        string t = r["TABLE_NAME"].ToString();
        //Note: if more than one sheet exists this will only capture the last one
        if (t.Substring(t.Length - 1) != "_")
        {
            Dts.Variables["YourVariable"].Value = t;
        }
    }
And then in SSIS, I add another variable to build my SQL, with an expression like:
"SELECT * FROM [" + @[User::YourVariable] + "]"
Finally, set the data source of your Excel Source to that SQL variable.
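The selection logic in that script task is small enough to show in isolation. Below is a rough Python sketch of the same idea, assuming the schema lookup has already returned a list of TABLE_NAME strings; the helper names pick_sheet_name and build_select and the sample sheet names are made up for illustration.

```python
from typing import List


def pick_sheet_name(table_names: List[str]) -> str:
    """Return the last name that does not end in an underscore.

    Mirrors the loop above: names ending in "_" are skipped, and when
    several sheets qualify the last one wins.
    """
    candidates = [name for name in table_names if not name.endswith("_")]
    if not candidates:
        raise ValueError("no usable worksheet name found")
    return candidates[-1]


def build_select(sheet_name: str) -> str:
    """Build the SQL text that the second SSIS variable would hold."""
    return "SELECT * FROM [" + sheet_name + "]"


# Hypothetical schema result: the real sheet plus its underscore duplicate.
names = ["Daily_0501$", "Daily_0501$_"]
query = build_select(pick_sheet_name(names))
```

Raising an error when nothing qualifies is a design choice of this sketch; the original script simply leaves the variable unchanged in that case.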
This works perfectly for me with the same scenario, in case it helps you or someone else:
Two package-level string variables are required:
varDirectoryList - You will use this inside SSIS for each loop variable mapping
varWorkSheet - This will hold your changing worksheet name. Since you only have 1, it's perfect.
Set up:
a. Add SSIS For Each Loop
b. Excel Connection Manager (connect to the first workbook while you test; at the end, go to its properties and, under Expressions, set "Excel File Path" to your varDirectoryList. Set DelayValidation to True on it as well as on your Excel Source task. This lets the package go through each workbook in your folder.)
c. Inside your For Each Loop, add a C# Script Task and title it "Get changing worksheet name into variable", or whatever you prefer.
d. Data Flow Task with your Excel Source to SQL Table Destination.
In your Script Task, add this code:
    using System.Data;
    using System.Data.OleDb;
    using System.Windows.Forms;

    public void Main()
    {
        // store file name passed into Script Task
        string WorkbookFileName = Dts.Variables["User::varDirectoryList"].Value.ToString();
        // setup connection string
        string connStr = String.Format("Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=\"EXCEL 12.0;HDR=Yes;IMEX=1;\"", WorkbookFileName);
        // setup connection to Workbook
        using (var conn = new OleDbConnection(connStr))
        {
            try
            {
                // connect to Workbook
                conn.Open();
                // get Workbook schema
                using (DataTable worksheets = conn.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null))
                {
                    // in ReadWrite variable passed into Script Task, store third column in the first
                    // row of the DataTable which contains the name of the first Worksheet
                    Dts.Variables["User::varWorkSheet"].Value = worksheets.Rows[0][2].ToString();
                    // Shows the first worksheet name of the Excel file. For testing purposes only;
                    // comment this line out before deploying the package.
                    MessageBox.Show(Dts.Variables["User::varWorkSheet"].Value.ToString());
                }
            }
            catch (Exception)
            {
                throw;
            }
        }
    }
After you have this set up and running, you will get a message box displaying the changing worksheet name for each workbook.
If you are using the Excel Source SQL Command access mode, you will need a third string
variable, such as varExcelSQL, holding an expression like: SELECT
columns FROM ['varWorkSheet$'], which will change dynamically to match
each workbook. You may or may not need the single quotes; adjust as
needed in varExcelSQL.
If you are not using Excel Source SQL and just loading straight from
the Table; go into Excel Source Properties --> AccessMode -->
OpenRowSet from Variable --> select varWorkSheet.
That should take care of it, as long as the column structures remain the same.
If you happen to get files with mixed data types in one column, you can use IMEX=1 in your connection string, which forces the data types to DT_WSTR on import.
Hope this helps :-)
If you are using SSIS to import the sheet, you could use a script task to find the name of the sheet and then change the name, or do whatever else you need in order to make it fit the rest of your import. Here is an example of finding the sheet that I found here:
    Dim excel As New Microsoft.Office.Interop.Excel.ApplicationClass
    Dim wBook As Microsoft.Office.Interop.Excel.Workbook
    Dim wSheet As Microsoft.Office.Interop.Excel.Worksheet
    wBook = excel.Workbooks.Open
    wSheet = wBook.ActiveSheet()
    For Each wSheet In wBook.Sheets
        MsgBox(wSheet.Name)
    Next
On the MsgBox line is where you could change the name or report it back for another process
I'm a self taught Excel VBA and SQL user. I'm testing out some simple queries before I add complexity. I must be missing something blindingly obvious here...
I am using an ADO connection to run a SQL SELECT statement on a table in the activeworkbook (ThisWorkBook). The Excel Table is named "tbl_QDB" and is on worksheet "MyQDB". The table starts on cell A1, so there are no blank or populated cells above the Table HeaderRowRange.
I have set up an ADO connection to ThisWorkBook and this is working fine. Here's the code:
Sub ConnectionOpen2()
    '### UNDER DEVELOPMENT
    Dim sconnect As String
    Const adUseClient = 3
    Const adUseServer = 2
    Const adLockOptimistic = 3
    Const adOpenKeyset = 1
    Const adOpenDynamic = 2
    Const adOpenStatic = 3 'needed because the ADO objects are late bound
    'used to connect to this workbook for SQL runs
    On Error GoTo err_OpenConnection2
    Set cn2 = CreateObject("ADODB.Connection")
    Set rec2 = CreateObject("ADODB.Recordset")
    rec2.CursorLocation = adUseClient
    rec2.CursorType = adOpenStatic
    rec2.LockType = adLockOptimistic
    datasource = ThisWorkbook.FullName
    sconnect = "Provider=Microsoft.ACE.OLEDB.12.0;" & _
               "Data Source=" & datasource & ";" & _
               "Extended Properties=""Excel 12.0;HDR=YES;ReadOnly=False;Imex=0"";"
    cn2.Open sconnect
    'etc, etc...
End Sub
I can run this simplest basic SELECT query:
SQLSTR="SELECT * FROM [MYQDB$]"
rec2.open SQLSTR, cn2
This works and produces 10 records i.e. rec2.recordcount=10.
However, if I try this, it errors:
SQLSTR="SELECT QID_1 FROM [MYQDB$]"
QID_1 is a valid field in the table on worksheet "MyQDB".
It doesn't change the error if I enclose QID_1 in () or [] or ``
I can even replace the field name with a made up field e.g. DonaldDuck and I get the same error.
Why would the SELECT statement work if I use "*" but not if I use any of the field names in the table? This seems so basic that I feel I must have missed a simple but key point.
I really will appreciate if someone can point out the mistake!
The SQL should work - if the field exists. Execute the Select * and dump the field list:
For i = 0 To rec2.Fields.Count - 1
Debug.Print rec2.Fields(i).Name
Next i
Thank you all for your comments.
That suggestion from @FunThomas was an eye opener! The results were F1, F2, F3 etc, so the field names (or column names if you prefer) were not being recognised.
This would explain why, after days of trying to join this table with another in a closed, external workbook, it was not working. SQL error messages can be quite obtuse and were not saying it didn't recognise the field name.
I have now fixed that issue. Here's what I can tell / warn others:
I started this table with rows above the header. In 2 of those cells
above I recorded the last connection time and status to another
workbook table. I realised before that these extra rows, with data
populated in ANY cell above the headers, were causing problems with
SQL. Despite having my data in an Excel Table, the SQL "engine" for
Excel looks at the sheet, i.e. [MYQDB$] where the data is stored
(although I am aware that you can specify a sheet and range, but
cannot use the actual table name as the range).
It is ok to have blank rows above the table headerrowrange. So, I
deleted the cells containing the data above the table
headerrowrange. Instead, I placed a Text Box and used a formula to
look at another sheet where the last connection time and status were
now stored to supply the text for the text box.
I can now see that even this Text Box, which occupies no cell, causes a problem for Excel SQL.
Before posting my question here, I made a copy of the workbook and removed the text box and the rows above the table headerrowrange. I still got errors. I still got F1, F2, F3 etc as field names (per @FunThomas's suggestion).
Only after deleting these rows and the text box and then resizing the table (actually, the same range as before) did the Excel SQL recognise the proper field names. I was then even able (just for curiosity) to insert a blank row above the table headerrowrange, and the SQL still worked.
It seems to me that Excel retained in memory the old table definition and only by removing all data above the table headerrowrange and then resizing the table did it refresh that. Perhaps I should be less lazy in future and call the sheetname and range (table address) in the sql: maybe that would ignore data in cells above the headerrowrange?
@PanagiotisKanavos: I was originally trying to compare two tables (actual Excel Tables, not just ranges, hence they have Field Names), one in ThisWorkBook and another in a closed Excel workbook. SQL is the best way to do this. Having failed to get a left join to work between these tables (and this Question might now have revealed why that wouldn't work!) I decided to bring the data from the external workbook into ThisWorkBook and compare there. Then I was going to find the differences, store in a recordset (hence SQL) and then INSERT INTO the external workbook.
Thanks for your help guys!
Every week, my analysts have a spreadsheet of invoices which they need to update with a check number and check date. The checks table exists in SQL server.
I've written them a macro that iterates through each row of the spreadsheet, opens an ADO recordset using a statement like this:
"SELECT CheckNumber, CheckDate FROM CHECKS WHERE Invoice_Number = " & Cells(i, 2)
... and then uses the fields from the recordset to write the number and date to the first two columns of that row in the Excel spreadsheet.
The code performs acceptably for a few hundred rows, but is slow when there are thousands of rows.
Is there a faster way to update an Excel spreadsheet than with a row-by-row lookup using ADO? For example, is there a way to do a SQL join between the spreadsheet and the table in SQL Server?
Edit: In response to Jeeped's questions, here's a bit of clarification.
What I'm really trying to do is find a way to "batch" update an Excel spreadsheet with information from SQL Server, instead of executing SQL lookups and writing the results a row at a time. Is there a way to do the equivalent of a join and return the entire result set in a single recordset?
The Invoice example above really represents a class of problems that I encounter daily. The end users have a spreadsheet that contains their working data (e.g. invoices) and they want me to add information from a SQL server table to it. For example, "Using the invoice number in column C, add the check number for that invoice in column A, and the check date in column B". Another example might be "For each invoice in column b, add the purchase order number to column a."
The Excel source column would be either a number or text. The "match" column in the SQL table would be of a corresponding data type, either varchar or integer. The data is properly normalized, indexed, etc. The updates would normally affect a few hundred or thousand rows, although sometimes there will be as many as twenty to thirty thousand.
If I can find a way to batch rows, I'll probably turn this into an Excel add-in to simplify the process. For that reason, I'd like to stay in VBA because my power users can extend or modify it to meet their needs--I'd rather not write it in a .NET language because then we need to dedicate developer time to modifying and deploying it. The security of the Excel application is not a concern here because the users already have access to the data through ODBC linked tables in an MS Access database and we have taken appropriate security precautions on the SQL Server.
Moving the process to SSIS would require a repeatability that doesn't exist in the actual business process.
In the past I've had success with pulling all of the data from SQL Server into a client-side disconnected ADO recordset. I then looped once through the entire recordset to create a VBA dictionary, storing the ID value (in this case the InvoiceNum) as the key and the recordset bookmark as the paired item. Then loop through each spreadsheet value, checking the invoice number against the dictionary using the "Exists" function. If you find a match, you can set your recordset to the bookmark and then update the values on the spreadsheet from the recordset. Assuming the Invoice table isn't a few million rows, this method should prove speedy.
EDIT: Added batch processing to try to limit returned records from large datasets. (Untested Code Sample)
Public Sub UpdateInvoiceData(invoiceNumRng As Range)
    'References: Microsoft ActiveX Data Objects x.x
    'References: Microsoft Scripting Runtime
    Dim cell As Range
    Dim tempCells As Collection
    Dim sqlRS As ADODB.Recordset
    Dim dict As Scripting.Dictionary
    Dim iCell As Range
    Dim testInvoiceNum As String
    Dim inClause As String
    Dim i As Long
    Set tempCells = New Collection 'must exist before the first Add
    i = 1
    For Each cell In invoiceNumRng
        tempCells.Add cell 'add every cell, including the one that closes a batch
        If i Mod 25 = 0 Or i = invoiceNumRng.Cells.Count Then 'break up loop into batches of 25:: Modify batch size here, try to find an optimal size.
            inClause = CreateInClause(tempCells) 'limit sql query with our test values
            Set sqlRS = GetInvoiceRS(inClause) 'retrieve batch results
            Set dict = CreateInvoiceDict(sqlRS) 'create our lookup dictionary
            For Each iCell In tempCells
                testInvoiceNum = CStr(iCell.Value) 'get the invoice number to test
                If dict.Exists(testInvoiceNum) Then 'test for match
                    sqlRS.Bookmark = dict.Item(testInvoiceNum) 'move our recordset pointer to the correct item
                    iCell.Offset(0, 1).Value = sqlRS.Fields("CheckNum").Value
                    iCell.Offset(0, 2).Value = sqlRS.Fields("CheckDate").Value
                End If
            Next iCell
            'prepare for next batch of cells
            Set tempCells = New Collection
        End If
        i = i + 1 'our counter to determine batch size
    Next cell
End Sub
Private Function CreateInClause(cells As Collection) As String
    Dim retStr As String
    Dim tempCell As Range
    retStr = ""
    For Each tempCell In cells
        retStr = retStr & "'" & tempCell.Value & "'" & ", " 'assumes textual value, omit single quotes if numeric/int
    Next tempCell
    If Len(retStr) > 0 Then
        CreateInClause = Left(retStr, Len(retStr) - 2) 'trim off the trailing comma and space
    Else
        CreateInClause = "" 'no items
    End If
End Function
Private Function GetInvoiceRS(inClause As String) As ADODB.Recordset
    'returns the listing of InvoiceData from SQL
    Dim cn As ADODB.Connection
    Dim rs As ADODB.Recordset
    Dim sql As String
    Set cn = New ADODB.Connection
    Set rs = New ADODB.Recordset 'the recordset must be created before its properties are set
    cn.ConnectionString = "Your Connection String"
    'table and column names are placeholders; use the same key column here and in CreateInvoiceDict
    sql = "SELECT * FROM [Invoices] WHERE InvoiceID IN(" & inClause & ")"
    cn.Open
    rs.CursorLocation = adUseClient 'use clientside cursor since we will want to loop in memory
    rs.CursorType = adOpenStatic 'client-side cursors are always static
    rs.Open sql, cn
    Set rs.ActiveConnection = Nothing 'disconnect from connection here
    cn.Close
    Set GetInvoiceRS = rs
End Function
Private Function CreateInvoiceDict(dataRS As ADODB.Recordset) As Scripting.Dictionary
    Dim dict As Scripting.Dictionary
    Set dict = New Scripting.Dictionary
    If Not (dataRS.BOF And dataRS.EOF) Then
        dataRS.MoveFirst 'make sure we are on first item in recordset
    End If
    'when there is no data the loop body never runs and an empty dictionary is returned
    Do While Not dataRS.EOF
        dict.Add CStr(dataRS.Fields("InvoiceNum").Value), dataRS.Bookmark
        dataRS.MoveNext
    Loop
    Set CreateInvoiceDict = dict
End Function
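The batching and dictionary lookup in the VBA above can be reduced to a few lines to show the control flow on its own. This is only a language-neutral sketch written in Python: fetch_batch is a hypothetical stand-in for one "WHERE ... IN (...)" round trip to SQL Server, and the sample check data is invented.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Sequence, Tuple

CheckRow = Tuple[str, str, str]  # (invoice number, check number, check date)


def chunked(values: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items, losing none."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def lookup_checks(
    invoice_numbers: Sequence[str],
    fetch_batch: Callable[[Sequence[str]], Iterable[CheckRow]],
    batch_size: int = 25,
) -> Dict[str, Tuple[str, str]]:
    """Run one query per batch and index the results by invoice number."""
    results: Dict[str, Tuple[str, str]] = {}
    for batch in chunked(invoice_numbers, batch_size):
        for invoice, check_number, check_date in fetch_batch(batch):
            results[str(invoice)] = (check_number, check_date)
    return results


# Hypothetical data source standing in for the SQL Server query.
CHECKS = {"1001": ("C-1", "2020-01-05"), "1003": ("C-3", "2020-01-07")}
calls: List[List[str]] = []


def fake_fetch(batch: Sequence[str]) -> List[CheckRow]:
    calls.append(list(batch))
    return [(inv, *CHECKS[inv]) for inv in batch if inv in CHECKS]


found = lookup_checks(["1001", "1002", "1003"], fake_fetch, batch_size=2)
```

With three invoice numbers and a batch size of two, the data source is queried twice rather than three times; the spreadsheet would then be filled from the resulting dictionary in a single pass.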
The best way to do this is to use SSIS and insert the information (through SSIS) into a range in the spreadsheet. Remember that SSIS expects the target range to be empty and one row above the target range should also be empty. If you do this you can schedule the SSIS job through the windows scheduler.
I have created a functioning SSIS package which pulls rows from a flat file into a SQL table. I just need to be able to delete old rows in the table, once they are older than 10 days.
The only thing is, there is no date column and I'm wondering if there is a way to do this, using the DateLastModified property from the source file? I'm not sure if this can be done via a script task or something else?
Your advice would be appreciated. :-)
So I've tried to include the date of the source file by creating a FileDate variable, along with FilePath and SourceFolder variables. I've utilized the FileDate variable by adding a derived column, Date_Imported, with the expression @[User::FileDate]. The FilePath variable is assigned the location "d:\inputfiles\*.txt", as indicated in the below code. The SourceFolder has been given the value "D:\InputFiles\".
However, I'm receiving an "Exception has been thrown by the target of an invocation.
System.MissingMemberException: Public member 'GetFiles' on type 'FileSystemObject' not found."
The following is the content of my script task to delete records older than 10 days; please disregard any commented out lines, as I've been trying different things...I appreciate any guidance you can give:
Public Sub Main()
' Add your code here
Dim FilePath As String
'Dim SourceFolder As String
Dim iMaxAge = 10
Dim oFSO = CreateObject("Scripting.FileSystemObject")
Dim myConnection As SqlConnection
Dim myCommand As SqlCommand
myConnection = New SqlConnection("server = localhost; uid=sa; pwd=; database=StampsProj")
FilePath = "d:\inputfiles\*.txt"
'SourceFolder = "d:\inputfiles"
'SourceFolder.ReadOnly = True
'To delete records, older than 10 days from AddUpIn table
'For Each oFile In oFSO.GetFolder(SourceFolder).Files
For Each oFile In oFSO.GetFiles(Dts.Variables("User::SourceFolder"))
Dim FileDate As Date = oFile.DateLastModified
If DateDiff("d", oFile.DateLastModified, Now) > iMaxAge Then
'If DateDiff("d", oFile.FileDate, Now) > iMaxAge Then
myCommand = New SqlCommand("Delete from AddUpIn", myConnection)
End If
Next
End Sub
It sounds like you need to either add a datetime column to your import table and set its value to the date you run the import, or create a separate FileImport table which logs the filename and an identifier, then add the identifier to the import table so you can identify the rows to delete.
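To illustrate the first option, here is a small sketch of purging by an import timestamp. It is written in Python against an in-memory SQLite database purely so that it is self-contained; SQLite stands in for SQL Server, the ImportedAt and Payload columns are hypothetical, and only the AddUpIn table name comes from the question.

```python
import sqlite3
from datetime import datetime, timedelta


def purge_old_rows(conn: sqlite3.Connection, now: datetime, max_age_days: int = 10) -> int:
    """Delete rows whose import timestamp is older than the cutoff; return the count."""
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    cur = conn.execute("DELETE FROM AddUpIn WHERE ImportedAt < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
# ImportedAt is stored as ISO 8601 text, which sorts in chronological order.
conn.execute("CREATE TABLE AddUpIn (Payload TEXT, ImportedAt TEXT)")

now = datetime(2024, 1, 31, 12, 0, 0)
conn.executemany(
    "INSERT INTO AddUpIn VALUES (?, ?)",
    [
        ("old", (now - timedelta(days=11)).isoformat()),
        ("recent", (now - timedelta(days=3)).isoformat()),
    ],
)

deleted = purge_old_rows(conn, now)
remaining = [row[0] for row in conn.execute("SELECT Payload FROM AddUpIn")]
```

In the SSIS package the timestamp would be filled by the derived column at load time, and the delete could run as an Execute SQL Task with the equivalent T-SQL date comparison.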