SSIS Load CSV with different number of columns on every load - sql-server

We are working on an SSIS job to load a CSV file into a SQL table. The job is to be scheduled for a daily load. The problem is that this CSV file comes with different columns each day. The structure of the file is as below:
Date       | New York | Washington | London
15-04-2020 | 2        | 3          | 20
16-04-2020 | 30       | 50         | 22
The Date column remains the same, whereas the number of city columns changes based on the data for that day. It could be one city column or many more. Each city column holds the number of likes from that city on that day.
I am thinking of converting the structure to a 3-column structure: Date, City Name, and Number of Likes.
But how would a flat file source component handle it, and how would I transform it to the new structure?

I will walk you through a script component to handle this.
I am assuming your csv looks like this and not the table above:
Date,New York,Washington,London
15-04-2020,2,3,20
16-04-2020,30,50,22
I named this file likes.txt and saved it on my D:\
Add a Data Flow Task.
Add a Script Component (Source).
Go to Inputs and Outputs and add your outputs (don't forget data types - e.g. DT_DBDATE for Date, DT_WSTR for City, DT_I4 for Likes).
Go into the script and paste the following code into CreateNewOutputRows:
string[] lines = File.ReadAllLines(@"d:\likes.txt");
List<string> cities = new List<string>();
int ctr = 0;
foreach (string line in lines)
{
    ctr++;
    // skip empty rows
    if (string.IsNullOrWhiteSpace(line)) continue;
    // get the city names from the header row
    if (ctr == 1) // header row
    {
        string[] headers = line.Split(',');
        for (int i = 1; i < headers.Length; i++)
        {
            cities.Add(headers[i]);
        }
        continue; // go to next line
    }
    // work with detail rows: emit one output row per city column
    string[] pieces = line.Split(',');
    for (int i = 1; i < pieces.Length; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.City = cities[i - 1];
        Output0Buffer.Date = DateTime.ParseExact(pieces[0], "dd-MM-yyyy", CultureInfo.InvariantCulture);
        Output0Buffer.Likes = int.Parse(pieces[i]);
    }
}
Add the following namespaces to make the code work:
using System.IO;
using System.Collections.Generic;
using System.Globalization;
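If you'd rather not hard-code the path, the script can pick it up from a package variable instead. A minimal sketch, assuming a hypothetical variable User::FilePath that has been added to the component's ReadOnlyVariables list:
// Hypothetical: User::FilePath must be listed under ReadOnlyVariables on the
// script component and hold the full path to the csv, e.g. d:\likes.txt.
string path = Variables.FilePath;
string[] lines = File.ReadAllLines(path);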
There's quite a bit to unpack in that script as it uses lists, arrays, file I/O, etc. Let me know if you have questions.
PS - This is a Corona Virus Answer (meaning I am bored) without any effort on your end. Please at least show what you tried in the future.

Related

Extra columns in the middle of existing columns in a pipe-delimited text file (how to ignore those column values in the flat file source) SSIS

I have used an SSIS flat file source to read a pipe-delimited text file, with the column delimiter set to | and the text qualifier set to none. If extra columns appear in the source file, I need to skip those column values; otherwise the data gets loaded into the wrong columns. How do I skip the values in those rows?
EDIT: I revised the answer to remove rows with more than the expected number of columns:
Parse it with a script component source. In the script component:
-Select the flat file connection manager under connection managers; below I left it named "Connection"
-Add columns to the output buffer and configure their data types. In my example, they are named "First" and "Second"
-In the script, add a reference to System.IO and do something like the following, which looks for an expected number of columns and only adds a row to the buffer if the line meets that criterion:
using System.IO;
...
public override void CreateNewOutputRows()
{
    var expectedNumOfColumns = 2;
    using (StreamReader sr = new StreamReader(Connections.Connection.ConnectionString))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            var parsedLine = line.Split(',');
            if (parsedLine.Length == expectedNumOfColumns)
            {
                Output0Buffer.AddRow();
                Output0Buffer.First = parsedLine[0];
                Output0Buffer.Second = parsedLine[1];
            }
        }
    }
}
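One caveat: a plain Split(',') breaks if a quoted field ever contains an embedded comma. If your data can look like that, one option is the TextFieldParser class (it lives in the Microsoft.VisualBasic assembly but is usable from C#), which honors text qualifiers. A minimal sketch, with the file path as an assumption:
// Requires a reference to Microsoft.VisualBasic and:
// using Microsoft.VisualBasic.FileIO;
using (TextFieldParser parser = new TextFieldParser(@"C:\data\input.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true; // treats "a, b" as one field
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        // apply the same column-count check and buffer logic as above
    }
}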

how to skip a bad row in ssis flat file source

I am reading a 17-column CSV file into a database.
Once in a while the file has a row with fewer than 17 columns.
I am trying to ignore such rows, but even when all columns are set to ignore, the package fails.
How do I ignore those rows?
Solution Overview
You can do this by adding one Flat File Connection Manager with only one column of data type DT_WSTR and a length of 4000 (assume its name is Column0), so all columns are treated as one big column.
In the Data Flow Task, add a Script Component after the Flat File Source.
Mark Column0 as an input column and add 17 output columns.
In the Input0_ProcessInputRow method, split Column0 by the delimiter; if the length of the array is 17, assign the values to the output columns, else ignore the row.
Detailed Solution
Add a Flat File Connection Manager and select the text file.
Go to the Advanced tab and delete all columns except one.
Change the data type of the remaining column to DT_WSTR with length 4000.
Add a Data Flow Task.
Inside the Data Flow Task, add a Flat File Source, a Script Component, and an OLEDB Destination.
In the Script Component, select Column0 as an input column.
Add 17 output columns (the desired output columns).
Change the OutputBuffer SynchronousInput property to None (this makes the output asynchronous, so the component can emit fewer rows than it receives).
Select Visual Basic as the script language.
In the script editor, write the following script:
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    If Not Row.Column0_IsNull AndAlso
       Not String.IsNullOrEmpty(Row.Column0.Trim) Then
        Dim strColumns As String() = Row.Column0.Split(CChar(";"))
        If strColumns.Length <> 17 Then Exit Sub
        Output0Buffer.AddRow()
        Output0Buffer.Column = strColumns(0)
        Output0Buffer.Column1 = strColumns(1)
        Output0Buffer.Column2 = strColumns(2)
        Output0Buffer.Column3 = strColumns(3)
        Output0Buffer.Column4 = strColumns(4)
        Output0Buffer.Column5 = strColumns(5)
        Output0Buffer.Column6 = strColumns(6)
        Output0Buffer.Column7 = strColumns(7)
        Output0Buffer.Column8 = strColumns(8)
        Output0Buffer.Column9 = strColumns(9)
        Output0Buffer.Column10 = strColumns(10)
        Output0Buffer.Column11 = strColumns(11)
        Output0Buffer.Column12 = strColumns(12)
        Output0Buffer.Column13 = strColumns(13)
        Output0Buffer.Column14 = strColumns(14)
        Output0Buffer.Column15 = strColumns(15)
        Output0Buffer.Column16 = strColumns(16)
    End If
End Sub
Map the Output Columns to the Destination Columns
C# solution for loading the CSV and skipping rows that don't have 17 columns:
Use a Script Component:
On the input/output screen, add all of your outputs with data types.
string fName = @"C:\test.csv"; // full file path: ideally passed in via a variable
string[] lines = System.IO.File.ReadAllLines(fName);
// add a counter
int ctr = 1;
foreach (string line in lines)
{
    string[] cols = line.Split(',');
    if (ctr != 1) // assumes a header row; drop this check if the 1st row has data
    {
        if (cols.Length == 17)
        {
            // write out to the output buffer, casting each column to its data type
            Output0Buffer.AddRow();
            Output0Buffer.Col1 = cols[0];                 // already a string
            Output0Buffer.Col2 = int.Parse(cols[1]);      // example cast to int
            Output0Buffer.Col3 = DateTime.Parse(cols[2]); // example cast to DateTime
            ... // rest of columns
        }
        // optional else to handle skipped lines
        // else
        //     write out line somewhere
    }
    ctr++; // increment counter
}
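To flesh out that optional else, one possibility is to append each skipped line to a reject file so the bad rows can be reviewed later. A minimal sketch that slots into the else branch above (the log path is an assumption):
// Hypothetical reject log: record each row that doesn't have 17 columns.
else
{
    System.IO.File.AppendAllText(@"C:\test_rejects.log",
        string.Format("Line {0}: {1}{2}", ctr, line, Environment.NewLine));
}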
This is for @SidC's comment on my other answer.
This lets you work with multiple files:
// set up variables
string line;
int ctr = 0;
string[] files = System.IO.Directory.GetFiles(@"c:/path", "filenames*.txt");
foreach (string file in files)
{
    using (var str = new System.IO.StreamReader(file))
    {
        while ((line = str.ReadLine()) != null)
        {
            // Work with line here similar to the other answer
        }
    }
}

SSIS - Various number of columns to output to flat file

I am currently creating an SSIS package that will gather data from a database and output it to a single comma-delimited flat file. The file will contain order details, in this format:
Order#1 details (51 columns)
Order#1 header (62 columns)
Order#2 details (51 columns)
Order#2 header (62 columns)
etc...
The order header has 62 columns and the order details have 51. I need to output this to a flat file, and I am running into an issue because SSIS does not handle varying columns. Given that my source is an OLEDB source with a query, how do I create a script component to output to a file?
The current package looks like the following:
Get a list of all orders. Pass orderid as a variable.
A For Loop container goes through each orderid and runs a data flow task to get the order details for that order, then another data flow task to get the order header.
I am just running into an issue outputting each line to the flat file.
If anyone can start me off with what the script component code should look like, that would be immensely appreciated. I have been struggling with this for a week now.
I have added what I have so far:
http://imgur.com/a/yTxfH
This is what my script looks like:
public void Main()
{
    // TODO: Add your code here
    DataTable RecordType300 = new DataTable();
    DataTable RecordType210 = new DataTable();
    DataTable RecordType220 = new DataTable();
    DataTable RecordType200 = new DataTable();
    OleDbDataAdapter adapter = new OleDbDataAdapter();
    adapter.Fill(RecordType300, Dts.Variables["User:rec_type300"].Value);
    adapter.Fill(RecordType210, Dts.Variables["User::rec_type_210"].Value);
    adapter.Fill(RecordType220, Dts.Variables["User::rec_type_220"].Value);
    adapter.Fill(RecordType200, Dts.Variables["User::rec_type200"].Value);
    using (StreamWriter outfile = new StreamWriter("C:\\myoutput.csv"))
    {
        for (var i = 0; i < RecordType300.Rows.Count; i++)
        {
            var detailFields = RecordType300.Rows[i].ItemArray.Select(field => field.ToString()).ToArray();
            // var poBillFields = RecordType210.Rows[i].ItemArray.Select(field => field.ToString()).ToArray();
            // var poShipFields = RecordType220.Rows[i].ItemArray.Select(field => field.ToString()).ToArray();
            // var poHeaderFields = RecordType200.Rows[i].ItemArray.Select(field => field.ToString()).ToArray();
            outfile.WriteLine(String.Join(",", detailFields));
            // outfile.WriteLine(string.Join(",", poBillFields));
            // outfile.WriteLine(string.Join(",", poShipFields));
            // outfile.WriteLine(string.Join(",", poHeaderFields));
        }
    }
    Dts.TaskResult = (int)ScriptResults.Success;
}
But every time I run it, it errors out. Am I missing something here? Also, how would I create the file only once at the start of each run? Meaning: every time the package runs, it should create a new file with a datestamp and then append each order's details to that file, order by order.
This code/method has not been tested but should give you a good idea of what to do.
Create 2 SSIS variables of type Object, one for the headers and one for the detail.
Create 2 Execute SQL Tasks and 1 Script Task as outlined here:
Set up your tasks to handle a full result set (the Detail version is described; do the same for Header, but map the result to the Header object and point your query at the header table):
Edit your script task and allow Detail and Header as read-only variables:
Now edit your actual script along these lines (this assumes you have exactly 1 detail row for each header row):
using System.IO;
using System.Linq;
using System.Data.OleDb;
// following to be inserted into the Main() function
DataTable detailData = new DataTable();
DataTable headerData = new DataTable();
OleDbDataAdapter adapter = new OleDbDataAdapter();
adapter.Fill(detailData, Dts.Variables["User::Detail"].Value);
adapter.Fill(headerData, Dts.Variables["User::Header"].Value);
using (StreamWriter outfile = new StreamWriter("myoutput.csv"))
{
    // we are making the assumption that there is exactly one detail row per header row
    for (var i = 0; i < detailData.Rows.Count; i++)
    {
        var detailFields = detailData.Rows[i].ItemArray.Select(field => field.ToString()).ToArray();
        var headerFields = headerData.Rows[i].ItemArray.Select(field => field.ToString()).ToArray();
        outfile.WriteLine(string.Join(",", detailFields));
        outfile.WriteLine(string.Join(",", headerFields));
    }
}
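On the datestamp question: one way is to build the file name from the current date and open the writer in append mode, so every run on a given day adds to that day's file and the next day starts a new one. A minimal sketch (the folder and name pattern are assumptions):
// Hypothetical: produces e.g. C:\exports\myoutput_2017-01-15.csv
string fileName = string.Format(@"C:\exports\myoutput_{0:yyyy-MM-dd}.csv", DateTime.Now);
using (StreamWriter outfile = new StreamWriter(fileName, true)) // true = append
{
    // same WriteLine logic as above
}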
Not a complete answer, just something to put you on the track of an alternative approach
SELECT Type, OrderBy, Col
FROM
(
    SELECT 'D' As Type, Ord As OrderBy,
           Col1 + ',' + CAST(Col2 AS VARCHAR(50)) + ',' + Col3 As Col
    FROM Details
    UNION ALL
    SELECT 'H' As Type, Ord As OrderBy,
           Col1 + ',' + CAST(Col2 AS VARCHAR(50)) + ',' + Col3 + ',' + Col4 As Col
    FROM Header
) S
ORDER BY OrderBy, Type
It's ugly, but it works as long as you cast all data types to varchar.
You can wrap this up in a view or a stored procedure and test it from the database (before you get to the SSIS part). You can even export it using BCP.EXE rather than SSIS.
What you have here is one column which happens to contain this kind of data:
A,B,C
D,E,F,G
From a metadata perspective there is consistently one column
From a CSV perspective there are variable columns

SSIS - import file with columns populated like sub-headers (omitted on some rows)

I am importing an Excel file which is formatted like a report - that is, some columns are only populated once for each group of rows they belong to, such as:
CaseID |Date |Code
157207 | |
|8/1/2012 |64479
|8/1/2012 |Q9967
|8/1/2012 |99203
I need to capture one of these group headers (CaseID, in the example above) and use it for subsequent rows where the field is blank, then save the next value that I encounter.
I have added a variable (User::CurrentCaseId) and a Script transform, with the following code:
public class ScriptMain : UserComponent
{
    string newCaseId;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (!Row.CaseIDName_IsNull && Row.CaseIDName.Length > 0)
            newCaseId = Row.CaseIDName;
        else
            newCaseId = "DetailRow";
    }

    public override void PostExecute()
    {
        base.PostExecute();
        if (newCaseId != "DetailRow")
            Variables.CurrentCaseId = newCaseId;
    }
}
Basically, I am trying to read the value when present and save it in this variable. I use a conditional split to ditch the rows that only have the CaseID, and then a derived column to put the variable's value into a new column to complete each detail row.
Alas, the value is always blank (I placed a data viewer after the derived column). I modified the script to always set the variable to a fixed string; the derived column is still blank.
This seemed like a good plan, but I received some feedback in the MS forums that you can't set a variable's value and use its new value within the same Data Flow Task. If that is so, the only solution I can think of is to write the CaseID out to a table when present and read it back when absent. I really hate to do that with several million rows (multiple Excel worksheets). Any better ideas?
Best,
Scott
This can be a good starting point for you.
I used the following file as the source. Saved it into C:\Temp\5.TXT
CaseID |Date |Code
157207 | |
|8/1/2012 |64479
|8/1/2012 |Q9967
|8/1/2012 |99203
157208 | |
|9/1/2012 |77779
|9/2/2012 |R9967
|9/3/2012 |11203
Put a DFT on the Control Flow surface.
Put a Script Component as Source on the DFT.
3.1. Go to the Inputs and Outputs section.
3.2. Add an output. Change its name to MyOutput.
3.2.1. Add the following output columns - CaseID, Date, Code.
3.2.2. The data types are four-byte unsigned integer [DT_UI4], string [DT_STR], string [DT_STR].
Now go to Script // Edit Script. Put the following code. Make sure to add
using System.IO;
to the namespace area.
public override void CreateNewOutputRows()
{
    string[] lines = File.ReadAllLines(@"C:\temp\5.txt");
    int iRowCount = 0;
    string[] fields = null;
    int iCaseID = 0;
    foreach (string line in lines)
    {
        if (iRowCount == 0)
        {
            iRowCount++; // skip the header row
        }
        else
        {
            fields = line.Split('|');
            // trim the field values
            for (int i = 0; i < fields.Length; i++)
            {
                fields[i] = fields[i].Trim();
            }
            if (!fields[0].Equals(string.Empty))
            {
                // group header row: remember the CaseID for the detail rows below it
                iCaseID = Convert.ToInt32(fields[0]);
            }
            else
            {
                // detail row: emit it with the remembered CaseID
                MyOutputBuffer.AddRow();
                MyOutputBuffer.CaseID = iCaseID;
                MyOutputBuffer.Date = fields[1];
                MyOutputBuffer.Code = fields[2];
            }
        }
    }
}
Testing your code: add a Union All component right beneath the Script component, connect the output of the Script component to it, and put a data viewer on the path.
Hopefully this helps. Please let us know. I responded to a similar question today; please check that one out as well. It may help in solidifying the concept - IMHO.

SSIS column count from a flat file

I'm trying to find a way to count the columns coming from a flat file. All my columns are concatenated in a single cell, separated with a '|',
and after various attempts, it seems that only a script task can handle this.
Can anyone help me with that? I sadly have no experience with scripting in C# or VB.
Thanks a lot
Emmanuel
To better understand, below is the output of what I want to achieve, e.g. a single cell containing all the headers coming from a FF. The thing is, to get to this result I manually appended all the column names to each other in a previous step (derived column) in order to concatenate them with a '|' separator.
Now, if my FF source layout changes, it won't work anymore because of this manual process. So I think I would have to use a script instead, which basically returns my number of (header) columns in a variable and would allow me to remove the hard-coded part in the derived column transformation, for instance.
This is a very old thread; however, I just stumbled on a similar problem: a flat file with a number of different record "formats" inside. Many different formats, not in any particular order - you might have 57 fields in one line, then 59 in the next 1000, then 56 in the next 10000, back to 57... well, I think you get the idea.
For lack of better ideas, I decided to break that file up based on the number of commas in each line, and then import the different record types (now bunched together) using a separate SSIS package for each type.
So the answer to this question is below, with a bit more code to produce the files.
Hope this helps somebody with the same problem.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace OddFlatFile_Transformation
{
    class RedistributeLines
    {
        /*
         * This routine opens a text file and reads it line by line.
         * For each line the number of "," (commas) is counted,
         * and then the line is written into another text file
         * based on the number of commas found.
         * For example, if there are 15 commas in a given line,
         * the line is written to WhateverFileName_15.Ext,
         * where WhateverFileName and Ext are the same file name and
         * extension as the original file being read.
         * The application tests WhateverFileName_NN.Ext for existence
         * and creates the file in case it does not exist yet.
         * To better control split records, a sequential identifier
         * based on the number of lines read is added to the beginning
         * of each line written, independently of the file and record number.
         */
        static void Main(string[] args)
        {
            // get the fully qualified file name from the console
            String strFileToRead;
            strFileToRead = Console.ReadLine();

            // create reader & open file
            StreamReader srTextFileReader = new StreamReader(strFileToRead);

            string strLineRead = "";
            string strFileToWrite = "";
            string strLineIdentifier = "";
            string strLineToWrite = "";
            int intCountLines = 0;
            int intCountCommas = 0;
            int intDotPosition = 0;
            const string strZeroPadding = "00000000";

            // processing begins
            Console.WriteLine("Processing begins: " + DateTime.Now);

            /* Main Loop */
            while (strLineRead != null)
            {
                // read a line of text, count commas, and create the line identifier
                strLineRead = srTextFileReader.ReadLine();
                if (strLineRead != null)
                {
                    intCountLines += 1;
                    strLineIdentifier = strZeroPadding.Substring(0, strZeroPadding.Length - intCountLines.ToString().Length) + intCountLines;
                    intCountCommas = 0;
                    foreach (char chrEachPosition in strLineRead)
                    {
                        if (chrEachPosition == ',') intCountCommas++;
                    }

                    // based on the number of commas determined above,
                    // the name of the file to be written to is established
                    intDotPosition = strFileToRead.IndexOf(".");
                    strFileToWrite = strFileToRead.Substring(0, intDotPosition) + "_";
                    if (intCountCommas < 10)
                    {
                        strFileToWrite += "0" + intCountCommas;
                    }
                    else
                    {
                        strFileToWrite += intCountCommas;
                    }
                    strFileToWrite += strFileToRead.Substring(intDotPosition, (strFileToRead.Length - intDotPosition));

                    // using the file name established above, the line captured
                    // during the read phase is written to that file
                    StreamWriter swTextFileWriter = new StreamWriter(strFileToWrite, true);
                    strLineToWrite = "[" + strLineIdentifier + "] " + strLineRead;
                    swTextFileWriter.WriteLine(strLineToWrite);
                    swTextFileWriter.Close();
                    Console.WriteLine(strLineIdentifier);
                }
            }

            // close the stream
            srTextFileReader.Close();
            Console.WriteLine(DateTime.Now);
            Console.ReadLine();
        }
    }
}
Please refer to my answers to the following Stack Overflow questions. Those answers might give you an idea of how to load a flat file that contains a varying number of columns.
The example in the following question reads a file containing data separated by the special character Ç (c-cedilla). In your case, the delimiter is a vertical bar (|):
UTF-8 flat file import to SQL Server 2008 not recognizing {LF} row delimiter
The example in the following question reads an EDI file that contains different sections with varying numbers of columns. The package reads the file and loads it, with parent-child relationships, into an SQL table:
how to load a flat file with header and detail parent child relationship into SQL server
Based on the logic used in those answers, you can also count the number of columns by splitting the rows in the file by the column delimiter (vertical bar |).
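As a rough illustration of that idea, here is a minimal script task sketch that reads only the header line and stores the column count in a package variable; the file path and the variable name User::ColumnCount are assumptions:
// Hypothetical: User::ColumnCount must be listed as a ReadWriteVariable on the script task.
string header;
using (var reader = new System.IO.StreamReader(@"C:\data\feed.txt"))
{
    header = reader.ReadLine(); // just the header row
}
int columnCount = header.Split('|').Length;
Dts.Variables["User::ColumnCount"].Value = columnCount;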
Hope that helps.
