I am creating an SSIS package to import CSV file data into a SQL Server table.
Some of the rows in the CSV files will have missing values.
For example, if a row has the format: value1,value2,value3 and value2 is missing,
then it will render as: value1,,value3 in the CSV file.
When the above happens (value2 is missing) in my SSIS package, I want NULL to go into the receiving SQL Server column that would hold value2.
I understand that I can add a "Script" task to my SSIS package to apply this rule. However, I'm concerned that this will drastically reduce the performance of my SSIS package. I'm not an expert on the inner workings of SSIS/SQL Server, but I worry that the script will cause my package to lose "BULK INSERT" capabilities (and other efficiencies), since the script will have to inspect every row and apply changes as needed.
Can anyone confirm if adding such a script will cause major performance impacts? Or does the SSIS/SQL-Server engine run the script on every row and then bulk-insert? Is there another way I can apply this rule without taking a performance hit?
Firstly, you can use a Script Task when required. A Script Task executes only once per execution of the whole package, not once per row. For per-row processing there is a different component, the Script Component, which runs inside the data flow. When the regular SSIS components are not enough to achieve what you want, you can certainly use a Script Component; it is not a performance killer unless you implement it badly.
Secondly, for this particular requirement you can simply use the Flat File Source to import your CSV file. With the "Retain null values from the source as null values in the data flow" option checked on the source, it will pass NULL through when a field has no value. I'm assuming this is a valid CSV file and each row has the correct number of commas for every field (total fields - 1, to be exact), even if the value is empty or null for some fields.
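If the empty fields still arrive as empty strings rather than NULL (for example, if you load into a staging table first), a minimal T-SQL sketch of the conversion could look like this; the table and column names are placeholders:
INSERT INTO dbo.FinalTable (Col1, Col2, Col3)
SELECT Col1, NULLIF(Col2, ''), Col3  -- NULLIF turns an empty string into NULL
FROM dbo.StagingTable;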
I will be importing records from thousands of CSV files using SSIS. These CSV files will contain a Postal Code column, which has the format A5A 5A5, where "A" is any letter and "5" is any number from 0 to 9.
There is a space between the "A5A" and "5A5" that I want to remove, so that all Postal Codes appear as "A5A5A5".
I am reviewing the documentation and see several options, and I'm trying to narrow down the best one, i.e. the one that requires the fewest steps. So far I am looking at the Derived Column transformation, but that would involve adding another column to my SQL table.
Is there a way I can trim the space without having to add an extra column?
As @Larnu answers via the comments, a Derived Column is likely the most appropriate component to use here.
The expression you're looking for is a REPLACE. Syntax ought to be
REPLACE([PostalCode], " ", "")
You have 10 columns from your CSV. The Derived Column can either replace an existing column or add a new column to the row buffer. I would advocate adding a new column, PostalCodeStripped or something like that. At some point, something weird is going to happen with the data and you'll get an A5A 5A5 that didn't get the space stripped. Having both the original and the parsed value available while debugging can help sort out problems (oh, this has a non-breaking space or a tab instead of a space, or in addition to one).
But, just because a column is in the buffer does not mean you need to create a column for that in the destination table. Just unmap the PostalCode from the row buffer and map PostalCodeStripped to the PostalCode column in the database. You'll see what I'm talking about in the destination component. By default, they'll map based on name matching but you're welcome to wire them up however you see fit.
ELT is an alternate option. Bulk load the data into a staging table, then do a simple SELECT into the destination to do the transformation. I might be tempted to not use SSIS at all. BCP or Import-DbaCsv (from the dbatools PowerShell module) would both be quick alternatives. If you know PowerShell and want to process the files in a pipe, you can pipe the files into Import-DbaCsv. The PowerShell script can also execute Invoke-DbaQuery to run update or insert queries to do the transformation.
SSIS can also just do the bulk load and then run the T-SQL to do the transformations. I don't like the overhead of maintaining and upgrading SSIS packages, and I'd take T-SQL jobs over SSIS jobs any day. (We have about half a year of FTE work to upgrade our SSIS packages to SQL 2019; the T-SQL jobs just keep working when moved to a new version.)
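As a rough sketch of that bulk-load-then-transform step for the postal code case, assuming a staging table named dbo.PostalStaging (the name is a placeholder), the T-SQL could be as simple as:
-- Strip the space from postal codes that were bulk loaded into the staging table
UPDATE dbo.PostalStaging
SET PostalCode = REPLACE(PostalCode, ' ', '')
WHERE PostalCode LIKE '% %';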
Or go the ETL route and do the transformation in the SSIS data flow. A Derived Column transformation between a flat file source and an OLE DB destination should do the trick.
To handle multiple files, you can use the Foreach Loop Container. There's an enumerator for files using a wildcard path. (The initial T-SQL task just truncates the table for testing.)
You'll need to parameterize the flat file connection so the source picks up each file in turn.
For PowerShell, it might be something like the script below (no transformation yet).
Get-ChildItem 'C:\TestFolder\*.csv' |
    Import-DbaCsv -SqlInstance 'localhost\DEV' -Database 'Test' -Schema 'dbo' -Table 'Test' -AutoCreateTable -Verbose
If you run this in the ISE, be aware of a bug where the connection might not be released after calling Import-DbaCsv, which will cause it to hang. This is not an issue from the command line as far as I can tell. (If this happens to you, you might have to kill the ISE process; closing it is not enough.)
I have the below within the Data Flow area. The problem I'm experiencing is that even if the result is 0 rows, it still creates the file.
Can anyone see what I'm doing wrong here?
This is pretty much expected and known annoying behavior.
SSIS will create an empty flat file even if the "column names in the first data row" option is unchecked.
The workarounds are:
remove the file with a File System Task right after the dataflow has executed, guarded by a precedence constraint expression such as @[User::RowCountWriteOff] == 0.
as an alternative, do not start the dataflow at all if the expected number of rows in the source is 0.
Update 2019-02-11:
Issue I have is that I have 13 of these export to CSV commands in the data flow and they are costly queries
Then querying the source twice just to check the row count ahead of time will be even more expensive; it is better to reuse the value of the @[User::RowCountWriteOff] variable.
The initial design has 13 dataflows; adding 13 precedence constraints and 13 File System Tasks to the main control flow would make the package more complex and harder to maintain.
Therefore, the suggestion is to use an OnPostExecute event handler, so the cleanup logic stays isolated to its particular dataflow.
Update 1 - Adding more details based on OP comments
Based on your comment, I will assume that you want to loop over many tables using SQL commands, check whether each table contains rows, export the rows to a flat file if it does, and ignore the table otherwise. Below are the steps you need to achieve that, along with links that contain more details for each step.
First, you should create a Foreach Loop Container to loop over the tables.
You should add an Execute SQL Task with a count command (SELECT COUNT(*) FROM ...) and store the result set in a variable (see the sketch after these steps).
Add a Data Flow Task that imports data from an OLE DB Source to a Flat File Destination.
After that, add a precedence constraint with an expression on the path leading to the Data Flow Task, similar to @[User::RowCount] > 0.
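As a sketch, the count command inside the Execute SQL Task could look like the statement below; the table name is a placeholder and would normally be built from the loop variable via an expression:
-- Returns a single row; map RowCnt to the SSIS variable (e.g. User::RowCount)
SELECT COUNT(*) AS RowCnt
FROM dbo.SourceTable;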
Also, it is worth checking the links provided below, because they contain a lot of useful information and step-by-step guides.
Initial Answer
Preventing SSIS from creating empty flat files is a common issue with many references online; several workarounds have been suggested:
Try to set the Data Flow Task Delay Validation property to True
Create another Data Flow Task within the package, used only to count the rows in the source; if the count is greater than 0, the precedence constraint should lead to the main Data Flow Task
Add a File System Task after the Data Flow Task that deletes the output file if the row count is 0; set the precedence constraint expression to ensure that.
References and helpful links
How to prevent SSIS package creating empty flat file at the destination
Prevent SSIS from creating an empty flat file
Eliminating Empty Output Files in SSIS
Prevent SSIS for creating an empty csv file at destination
Check for number of rows returned and do not create empty destination file
Set the Data Flow Task Delay Validation property to True
I am trying to use an SSIS package to insert data from a file into a table, but only if all the data in the file is good. I have read around and realise that I can split my good data and bad data with a Conditional Split.
However, I cannot come up with a way to avoid writing the good data when there are some bad rows.
I can solve my problem using a staging table. I just thought I would ask if I am missing a more elegant way to do this within the SSIS package, rather than load and then transform with T-SQL.
Thanks
The SSIS way allows wrapping the actions in a transaction. For your task, you need to count the bad rows in the dataflow, and if there is at least one bad row, do nothing, i.e. roll back.
Below is how I would do it in pure SSIS. Create a Sequence Container, specify TransactionOption=Required on it, and move your dataflow into the container. Add a Row Count transformation to your bad-rows path and store its result in a variable. After the dataflow inside the container, create a precedence constraint whose expression checks whether the bad-row-count variable is > 0, and follow it with a small Script Task that raises an error to roll back the transaction.
Pure SSIS - yes! Simpler than using a staging table - not sure.
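For comparison, a minimal sketch of the staging-table route mentioned in the question; the table names and the bad-row rule (a value of Amount that does not convert to int) are assumptions:
-- Load the whole file into dbo.Staging first, then move the rows only when none are bad
IF NOT EXISTS (SELECT 1 FROM dbo.Staging WHERE TRY_CONVERT(int, Amount) IS NULL)
BEGIN
    INSERT INTO dbo.Destination (Id, Amount)
    SELECT Id, TRY_CONVERT(int, Amount)
    FROM dbo.Staging;
END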
Part of an SSIS package is a data import from an external database via a SQL command embedded in an ADO.NET Source in the data flow. Whenever I make even the slightest adjustment to the query (such as changing a column name), it takes ages (1-2 hours in this case) until validation has finished. The query itself returns around 30,000 rows with 20 columns each.
Is there any way to cut these long intervals or is this something I have to live with?
I usually store the source queries in a table. The first part of my package executes a SELECT and stores the query returned from the table in a package variable, which is then used by the ADO.NET Source in the data flow. For the default value of the variable in the package, I use the same query that is stored in the database with a "WHERE 1=2" appended. Hence, at design time it still executes the query but only returns the column metadata. Let me know if you have any questions.
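A minimal sketch of that pattern, with assumed table and column names; the stored query is fetched at run time, while the variable's design-time default carries the WHERE 1=2 so validation only reads the column metadata:
-- Executed by the first task in the package; the result goes into the package variable
SELECT SourceQuery
FROM dbo.PackageQueries
WHERE PackageName = 'MyImport';
-- Design-time default value of the variable (same query, but returns no rows):
-- SELECT Col1, Col2 FROM dbo.SourceTable WHERE 1 = 2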
I am working on a generic SSIS package that receives a flat file, add new columns to it, and generate a new flat file.
The problem I have is that the number of new columns varies based on a stored procedure XML parameter. I tried to use the "Execute Process Task" to call BCP, but the XML parameter is too long for the command line.
I searched the web and found that you cannot dynamically change an SSIS package at runtime and that I would have to use a script task to generate the output. I started going down that path and found that you still have to let the script component know how many columns it will be receiving, and that is exactly what I do not know at design time.
I found a third party SSIS extension from CozyRoc, but I want to do it without any extensions.
Has anyone done something like this?
Thanks!
If the number of columns is unknown at run time then you will have to do something dynamically, and that means using a script task and/or a script component.
The workflow could be:
Parse the XML to get the number of columns
Save the number of columns in a package variable
Add columns to the flat file based on the variable
This is all possible using script tasks, although if there is no data flow involved, it might be easier to do the whole thing in an external Perl script or C# program and just call that from your package.