SSIS column count from a flat file - file

I'm trying to find a way to count my columns coming from a Flat File. Actually, all my columns are concatened in a signe cell, sepatared with a '|' ,
after various attempts, it seems that only a script task can handle this.
Does anyone can help me upon that ? I've shamely no experience with script in C# ou VB.
Thanks a lot
Emmanuel
To better understand, below is the output of what I want to achieve to. e.g a single cell containing all headers coming from a FF. The thing is, to get to this result, I appended manually in the previous step ( derived column) all column names each others in order to concatenate them with a '|' separator.
Now , if my FF source layout changes, it won't work anymore, because of this manualy process. So I think I would have to use a script instead which basically returns my number of columns (header ) in a variable and will allow to remove the hard coded part in the derived column transfo for instance

This is an very old thread; however, I just stumbled on a similar problem. A flat file with a number of different record "formats" inside. Many different formats, not in any particular order, meaning you might have 57 fields in one line, then 59 in the next 1000, then 56 in the next 10000, back to 57... well, think you got the idea.
For lack of better ideas, I decided to break that file based on the number of commas in each line, and then import the different record types (now bunched together) using SSIS packages for each type.
So the answer for this question is there, with a bit more code to produce the files.
Hope this helps somebody with the same problem.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace OddFlatFile_Transformation
{
class RedistributeLines
{
/*
* This routine opens a text file and reads it line by line
* for each line the number of "," (commas) is counted
* and then the line is written into a another text file
* based on that number of commas found
* For example if there are 15 commas in a given line
* the line is written to the WhateverFileName_15.Ext
* WhaeverFileName and Ext are the same file name and
* extension from the original file that is being read
* The application tests WhateverFileName_NN.Ext for existance
* and creates the file in case it does not exist yet
* To Better control splited records a sequential identifier,
* based on the number of lines read, is added to the beginning
* of each line written independently of the file and record number
*/
static void Main(string[] args)
{
// get full qualified file name from console
String strFileToRead;
strFileToRead = Console.ReadLine();
// create reader & open file
StreamReader srTextFileReader = new StreamReader(strFileToRead);
string strLineRead = "";
string strFileToWrite = "";
string strLineIdentifier = "";
string strLineToWrite = "";
int intCountLines = 0;
int intCountCommas = 0;
int intDotPosition = 0;
const string strZeroPadding = "00000000";
// Processing begins
Console.WriteLine("Processing begins: " + DateTime.Now);
/* Main Loop */
while (strLineRead != null)
{
// read a line of text count commas and create Linde Identifier
strLineRead = srTextFileReader.ReadLine();
if (strLineRead != null)
{
intCountLines += 1;
strLineIdentifier = strZeroPadding.Substring(0, strZeroPadding.Length - intCountLines.ToString().Length) + intCountLines;
intCountCommas = 0;
foreach (char chrEachPosition in strLineRead)
{
if (chrEachPosition == ',') intCountCommas++;
}
// Based on the number of commas determined above
// the name of the file to be writen to is established
intDotPosition = strFileToRead.IndexOf(".");
strFileToWrite = strFileToRead.Substring (0,intDotPosition) + "_";
if ( intCountCommas < 10)
{
strFileToWrite += "0" + intCountCommas;
}
else
{
strFileToWrite += intCountCommas;
}
strFileToWrite += strFileToRead.Substring(intDotPosition, (strFileToRead.Length - intDotPosition));
// Using the file name established above the line captured
// during the text read phase is written to that file
StreamWriter swTextFileWriter = new StreamWriter(strFileToWrite, true);
strLineToWrite = "[" + strLineIdentifier + "] " + strLineRead;
swTextFileWriter.WriteLine (strLineToWrite);
swTextFileWriter.Close();
Console.WriteLine(strLineIdentifier);
}
}
// close the stream
srTextFileReader.Close();
Console.WriteLine(DateTime.Now);
Console.ReadLine();
}
}
}

Please refer my answers in the following Stack Overflow questions. Those answers might give you an idea of how to load a flat file that contains varying number of columns.
Example in the following question reads a file containing data separated by special character Ç (c-cedilla). In your case, the delimiter is Vertical Bar (|)
UTF-8 flat file import to SQL Server 2008 not recognizing {LF} row delimiter
Example in the following question reads an EDI file that contains different sections with varying number of columns. The package reads the file loads it accordingly with parent-child relationships into an SQL table.
how to load a flat file with header and detail parent child relationship into SQL server
Based on the logic used in those answers, you can also count the number of columns by splitting the rows in the file by the column delimiter (Vertical Bar |).
Hope that helps.

Related

SSIS Load CSV with different number of columns on every load

We are working on a SSIS job to load the CSV file to a SQL table. This job is to be scheduled for daily load. The problem is this CSV file comes with different columns each day. Structure of the file is as below:
<table border="1">
<tr><td>Date</td><td>New York</td><td>Washington</td><td>London</td></tr>
<tr><td>15-04-2020</td><td>2</td><td>3</td><td>20</td></tr>
<tr><td>16-04-2020</td><td>30</td><td>50</td><td>22</td></tr>
</table>
Date column remains the same where as number of columns for city changes based on the data for that day. It could 1 city column or many more city columns. Each city column means the number of likes from the that city on that day.
I am thinking to convert the structure to a 3 column structure including Date, City Name and Number of Likes.
But how would a flat file source component would handle it and how would I transform it to a new structure?
I will walk you through a script component to handle this:
I am assuming your csv looks like this and not the html above:
Date,New York,Washington,London
15-04-2020,2,3,20
16-04-2020,30,50,22
I named this file likes.txt and saved it on my D:\
Add a dataflow
Add a Script Component (Source)
Go to inputs and outputs and add your outputs (don't forget data types)
Go into the script and paste the following Code into CreateNewOutputRows:
string[] lines = File.ReadAllLines(#"d:\likes.txt");
List<string> cities = new List<string>();
int ctr = 0;
foreach (string line in lines)
{
ctr++;
//skip empty rows
if(string.IsNullOrWhiteSpace(line)) continue;
//Get Cities from Header
if (ctr == 1)//Header row
{
string[] headers = line.Split(',');
for (int i = 1; i < headers.Length; i++)
{
cities.Add(headers[i]);
}
continue; //Go to next line
}
//Work with details
string[] pieces = line.Split(',');
for (int i = 1; i < pieces.Length; i++)
{
Output0Buffer.AddRow();
Output0Buffer.City = cities[i-1];
Output0Buffer.Date = DateTime.ParseExact(pieces[0].ToString(), "dd-MM-yyyy", CultureInfo.InvariantCulture);
Output0Buffer.Likes = int.Parse(pieces[i]);
}
}
Add the following namespaces to make the code work:
using System.IO;
using System.Collections.Generic;
using System.Globalization;
Here are your results:
There's quite a bit to unpack in that script as it uses Lists, Arrays, File System Tasks, etc. Let me know if you have questions.
PS - This is a Corona Virus Answer (meaning I am bored) without any effort on your end. Please at least show what you tried in the future.

Extra columns in the middle of existing columns pipe delimited text file (How to ignore that column values in flat file source ) SSIS

I have used SSIS package flat file source to read pipe delimited | text file and used column delimiter as | and text qualifier as none. I need to handle that if extra columns in the source file need to skip that column values.
If new columns are in the source file the data get loaded into wrong columns. How to skip the values of that rows?
EDIT: I revised the answer to remove rows with more than the expected number of columns:
Parse it with an SSIS with a script component source. In the script component:
-Select the flat file connection manager under connection managers, below I left it named "Connection"
-Add columns to the output buffer and configure their data types. In my example, they are named "First" and "Second"
-In the script, add a reference to System.IO and do something like the following which looks for an expected number of columns and only adds a row to the buffer if it meets that criteria:
using System.IO;
...
public override void CreateNewOutputRows()
{
var expectedNumOfColumns = 2;
using (StreamReader sr = new StreamReader(Connections.Connection.ConnectionString))
{
string line;
while((line = sr.ReadLine()) != null)
{
var parsedLine = line.Split(',');
if (parsedLine.Length == expectedNumOfColumns)
{
Output0Buffer.AddRow();
Output0Buffer.First = parsedLine[0];
Output0Buffer.Second = parsedLine[1];
}
}
}
}

SSIS Flat File Missing First Column

I'm trying to import flat file using SSIS. This is the structure of the flat file:
HEADER001
1|2|3|4
5|6|7|8
I want to skip the header, already set the Header rows to skip to 1 and unchecked the Column names in the first data row.
Somehow the first column will disappear. I've tried to change the row delimiter to {CR} or {LF} but nothing different.
Anybody know how to fix this?
Thanks
Change the Header Row delimiter to {CR}{LF} instead of Vertical Bar. When you say the header row delimiter should be a vertical bar (and it should be skipped) it removes everything before the first |
script component source:
var lines = System.IO.File.ReadAllLines([filePath]);
int ctr=0;
foreach(var line in lines)
{
ctr++;
if(ctr!=1)
{
string[] cols = line.Split('|');
Output0Buffer.AddRow();
OutputBuffer0.Col1 = (int)cols[0];
OutputBuffer0.Col2 = (int)cols[1];
OutputBuffer0.Col3 = (int)cols[2];
OutputBuffer0.Col4 = (int)cols[3];
}
}

how to skip a bad row in ssis flat file source

I am reading in a 17-column CSV file into a database.
once in a while the file has a "less then 17-column" row.
I am trying to ignore the row, but even when all columns are set to ignore, I can't ignore that row and the package fails.
How to ignore those rows?
Solution Overview
you can do this by adding one Flat File Connection Manager add only one column with Data type DT_WSTR and a length of 4000 (assuming it's name is Column0) - So all column are considered as one big column
In the Dataflow task add a Script Component after the Flat File Source
In mark Column0 as Input Column and Add 17 Output Columns
In the Input0_ProcessInputRow method split Column0 by delimiter, Then check if the length of array is = 17 then assign values to output columns, Else ignore the row.
Detailed Solution
Add a Flat file connection manager, Select the text file
Go to the Advanced Tab, Delete all Columns except one Column
Change the datatype of the remianing Column to DT_WSTR and length = 4000
Add a DataFlow Task
Inside the Data Flow Task add a Flat File Source, Script Component and OLEDB Destination
In the Script Component Select Column0 as Input Column
Add 17 Output Columns (the optimal output columns)
Change the OutputBuffer SynchronousInput property to None
Select the Script Language to Visual Basic
In the Script Editor write the following Script
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
If Not Row.Column0_IsNull AndAlso
Not String.IsNullOrEmpty(Row.Column0.Trim) Then
Dim strColumns As String() = Row.Column0.Split(CChar(";"))
If strColumns.Length <> 17 Then Exit Sub
Output0Buffer.AddRow()
Output0Buffer.Column = strColumns(0)
Output0Buffer.Column1 = strColumns(1)
Output0Buffer.Column2 = strColumns(2)
Output0Buffer.Column3 = strColumns(3)
Output0Buffer.Column4 = strColumns(4)
Output0Buffer.Column5 = strColumns(5)
Output0Buffer.Column6 = strColumns(6)
Output0Buffer.Column7 = strColumns(7)
Output0Buffer.Column8 = strColumns(8)
Output0Buffer.Column9 = strColumns(9)
Output0Buffer.Column10 = strColumns(10)
Output0Buffer.Column11 = strColumns(11)
Output0Buffer.Column12 = strColumns(12)
Output0Buffer.Column13 = strColumns(13)
Output0Buffer.Column14 = strColumns(14)
Output0Buffer.Column15 = strColumns(15)
Output0Buffer.Column16 = strColumns(16)
End If
End Sub
Map the Output Columns to the Destination Columns
C# Solution for Loading CSV and skip rows that don't have 17 columns:
Use a Script Component:
On input/output screen add all of your outputs with data types.
string fName = #"C:\test.csv" // Full file path: it should reference via variable
string[] lines = System.IO.File.ReadAllLines(fName);
//add a counter
int ctr = 1;
foreach(string line in lines)
{
string[] cols = line.Split(',');
if(ctr!=1) //Assumes Header row. elim if 1st row has data
{
if(cols.Length == 17)
{
//Write out to Output
Output0Buffer.AddRow();
Output0Buffer.Col1 = cols[0].ToString(); //You need to cast to data type
Output0Buffer.Col2 = int.Parse(cols[1]) // example to cast to int
Output0Buffer.Col3 = DateTime.Parse(cols[2]) // example of datetime
... //rest of Columns
}
//optional else to handle skipped lines
//else
// write out line somewhere
}
ctr++; //increment counter
}
This is for #SidC comment in my other answer.
This lets you work with multiple files:
//set up variables
string line;
int ctr = 0;
string[] files = System.IO.Directory.GetFiles(#"c:/path", "filenames*.txt");
foreach(string file in files)
{
var str = new System.IO.StreamReader(file);
while((line = str.ReadLine()) != null)
{
// Work with line here similar to the other answer
}
}

Unzip from a SQL Server text column to an image column

I have images of various formats (.png, .jpg, .bmp, etc.) stored as compressed text in a text column in a SQL Server 2005 table. I need to read the row, unzip the image and store it in an image column in another table.
I am using the SharpZip library, and all of the examples deal with file sources and destinations. I can't find anything that covers unzipping from a variable to another variable. A code snippet illustrating this or a link to a relevant resource would be much appreciated.
EDIT: A bit more information - the data is stored in a TEXT column. It appears as follows (text column abbreviated for display):
ImageID ImageData
1 FORMAT-ZIPV3 UEsDBBQAAAAIAOV6wzxdTnDvshs...
2 FORMAT-ZIPV3 UEsDBBQAAAAIAAF2yjxGncjOLgA...
3 FORMAT-ZIPV3 UEsDBBQAAAAIAKd6yjyjnQNr6gg...
4 FORMAT-ZIPV3 UEsDBBQAAAAIALdNyzyrPC8EMJw...
5 FORMAT-ZIPV3 UEsDBBQAAAAIAA1rOD1nZY1t0f0...
6 FORMAT-ZIPV3 UEsDBBQAAAAIANZplj2seyJ+VmM...
7 FORMAT-ZIPV3 UEsDBBQAAAAIAC5vhD27LPbPcv8...
8 FORMAT-ZIPV3 UEsDBBQAAAAIAK1qKz5DJNH3xMg...
9 FORMAT-ZIPV3 UEsDBBQAAAAIAHVkEztC3th/9hs...
10 FORMAT-ZIPV3 UEsDBBQAAAAIAEtXKz7DXHUdvow...
What I know for certain is that the images were compressed at some point in the process using SharpZip before being inserted into the table. It appears that the format information was added to the beginning of the data prior to inserting.
Looking at this data, would anyone have any insight on how this image data has been manipulated? Again, I need to get the uncompressed image data into a column of a data type conducive to reading for display on a web page.
EDIT: Ok, I'm stumped. Executing the following code produces the error, "Failed to convert parameter value from a Int32 to a Byte[]". It appears to be placing the length of the byte array into the byte array's value...
commandUncompressed.Connection = connectionUncompressed;
commandUncompressed.Parameters.Add("#Image_k", SqlDbType.VarChar, 10);
commandUncompressed.Parameters.Add("#ImageContents", SqlDbType.Image);
commandUncompressed.CommandText = sqlSaveImage;
connectionUncompressed.Open();
reader = command.ExecuteReader();
if (reader.HasRows)
{
while (reader.Read())
{
Console.WriteLine(reader["Image_k"].ToString()); // Merely for testing
String format = reader["ImageContents_Compressed"].ToString().Substring(0, 12);
var offset = 13; //"FORMAT-ZIPV3 ".Length;
var s = reader["ImageContents_Compressed"].ToString().Substring(offset);
var bytes = Convert.FromBase64String(s);
if (format == "FORMAT-ZIPV2 ")
{
bytes = ConvertStringToBytes(s); // Not a Base-64 encoded string? External conversion function utilized.
}
using (var zis = new ZipInputStream(new MemoryStream(bytes)))
{
ZipEntry zipEntry = zis.GetNextEntry(); // Doesn't seem to work unless an entry has been referenced
byte[] buffer = new byte[zis.Length];
commandUncompressed.Parameters["#Image_k"].Value = reader["Image_k"].ToString();
commandUncompressed.Parameters["#ImageContents"].Value = zis.Read(buffer, 0, buffer.Length);
commandUncompressed.ExecuteNonQuery();
}
}
}
It appears to be reading the data from the source text column just fine. I just cannot figure out how to get that into the image type parameter. The value for buffer variable shows the length of the byte array, rather than the actual bytes. Maybe that's what the value property typically shows for byte arrays? I'm so close and yet so far away. :/
EDIT: Ok, I'm a knucklehead. I made the following correction, and it works!
zis.Read(buffer, 0, buffer.Length)
commandUncompressed.Parameters["#ImageContents"].Value = buffer;
At this point I am only able to process FORMAT-ZIPV3 data, as I haven't figured out how to decode the FORMAT-ZIP2 strings yet. Following is a sampling of the V2 data. If anyone is able to determine the encoding, let me know. Would it be different if zipped using BZIP instead of ZIP format?
ImageID ImageData
1 FORMAT-ZIPV2 504B03041400020008005157422A2E25FDBAF26701008D6901000E...
2 FORMAT-ZIPV2 504B03041400020008009159422A7FC94BA2B2540500D35705000E...
3 FORMAT-ZIPV2 504B0304140002000800685A422A0CAA51F4473A0600B97206000E...
4 FORMAT-ZIPV2 504B03041400020008001D5D422A770BD3ED201902002C4A02000E...
5 FORMAT-ZIPV2 504B0304140002000800325E422A4B6C2FB4045001001C6E01000E...
6 FORMAT-ZIPV2 504B03041400020008006F72422A5F793AC1A1F00200ECF302000E...
7 FORMAT-ZIPV2 504B0304140002000800D572422A1B348A731DE5000085EB00000E...
8 FORMAT-ZIPV2 504B03041400020008003D73422A8AEBB7F855640300DD1B04000E...
9 FORMAT-ZIPV2 504B03041400020008006368D528C5D0A6BA794900004A2502000E...
10 FORMAT-ZIPV2 504B03041400020008008E5B6C2A2D9E9C33D7AF05005CEC05000E...
In response to a similar question, someone on sqlmonster.com provided a nifty VarBinaryStream class. It works with a column type of varbinary(max).
If your data is stored in a varbinary(max), and is in zip format, you could use that class to instantiate a VarBinaryStream, then instantiate a ZipInputStream around that, and ba-da-boom, you're there. Just read from the ZipInputStream.
In C# it might look like this
using (var imageSrc = new VarBinarySource(connection,
"Table.Name",
"Column",
"KeyColName",
1))
{
using (var s = new VarBinaryStream(imageSrc))
{
using(var zis = new ZipInputStream(s))
{
....
}
}
}
If the images are small, then you probably wouldn't want all this streaming stuff. If the column is a binary(n) or a varbinary(n) where n is less than 8000, just use the SqlBinary type and read in all the data into memory, then instantiate a MemoryStream around that. Simpler. In VB.NET it looks something like this:
Dim bytes as Bytes()
bytes = dr.GetSqlBinary(columnNumber)
Using ms As New MemoryStream(bytes)
Using zis As New ZipInputStream(ms)
...
End Using
End Using
Finally, I'm going to question the wisdom of applying zip compression to .jpg images, and similar. The jpg format is already compressed; compressing it again before putting the data into SQL Server won't cause the data to become appreciably smaller. It only increases processing time. If possible, I'd suggest you reconsider your design for storage of compressed images.
ok, with the update you provided, containing the data format, you can draw some conclusions.
The data is an actual string. Suspecting that it is a Base64-encoded string, I did a small test and used Convert.ToBase64String() on a byte stream that contains a zip file. It looks like this: UEsDBBQAAAAIAJJyYyk3M56F+QIAA...
Aha! you have a base64-encoded (string) version of the byte data for a bonafide zip file. To decode it, strip the prefix and then use FromBase64String() to get the byte array, insert into a MemoryStream, then read it with ZipInputStream.
something like this:
var offset = "FORMAT-ZIPV3 ".Length();
var s = sqlReader["CompressedImage"].ToString().Substring(offset);
var bytes = Convert.FromBase64String(s);
using (var zis = new ZipInputStream(new MemoryStream(bytes)))
{
...
zis.Read(...);
...
}
If the data is "really long", you're going to want to stream it out of that table, rather than just read it into a big string and convert it. I don't know how large text columns can be, but supposing that it could be 500mb, you don't want a 500mb string, and you don't want to do a conversion of a 500mb string with Convert.FromBase64String(). In that case You need to use a Base64Stream, or the FromBase64Transform class in the System.Security.Cryptography namespace.
Editorial comment. It is sort of backwards to zip-compress image data. The images are probably compressed already. But to compound that backwardsness by then doing a base64 encode, thereby expanding the data... ??? That is triple backwards. That makes noooooo sense at all. I understand that's how your vendor supplied it.
Ok, with your furhter update, using this as the format:
ImageID ImageData
1 FORMAT-ZIPV2 504B03041400020008005157422A2E25FDBAF26701008D6901000E...
2 FORMAT-ZIPV2 504B03041400020008009159422A7FC94BA2B2540500D35705000E...
That data is still zipfile data, but it is encoded as simple hex digits. You need to convert that to a byte array. Here's some code to do it.
public static class ConvertEx
{
static readonly String prefix= "FORMAT-ZIPV2 ";
public static string ToHexString(byte[] b)
{
System.Text.StringBuilder sb1 = new System.Text.StringBuilder();
int i = 0;
for (i = 0; i < b.Length; i++)
{
sb1.Append(System.String.Format("{0:X2}", b[i]));
}
return sb1.ToString().ToLower();
}
public static byte[] ToByteArray(string s)
{
if (s.StartsWith(prefix))
{
System.Console.WriteLine("removing prefix");
s = s.Substring(prefix.Length);
}
s= s.Trim(); // whitespace
System.Console.WriteLine("length: {0}", s.Length);
var r= new byte[s.Length/2];
for (int i = 0; i < s.Length; i+=2)
{
r[i/2] = (byte) Convert.ToUInt32(s.Substring(i,2), 16);
}
return r;
}
}
You can use that this way:
string s = GetStringContentFromDatabase()
var decoded = ConvertEx.ToByteArray(s);
using (var ms = new MemoryStream(decoded))
{
// use DotNetZip to read the zip file
// SharpZipLib is something similar...
using (var zip = ZipFile.Read(ms))
{
// print out the list of entries in the zipfile
foreach (var e in zip)
{
System.Console.WriteLine("{0}", e.FileName);
}
}
}
The examples on the SharpZip Wiki use Stream objects - while the sample does use a File, you could easily use a MemoryStream object here and the sample would work the same.

Resources