SQL Server 2016 SSIS Conditionally Overwrite Text File

Imagine the following scenario:
1. there are N number of jobs
2. the jobs write data to the same file once a day sequentially
3. task setting indicates whether the file should be overwritten or appended to
What I've tried thus far is using a Conditional Split in my data flow:
To test it out, Case 1 and Case 2 are:
What actually happens is that the Conditional Split tries to work out which data rows to send where, ends up sending all rows to one side and 0 rows to the other, and both sides end up opening the file (I think), hence the errors:
I get that I'm misusing the Conditional Split here, but come on, it's 2017; there must be a way to do this without resorting to Script Tasks clearing the files?

Your problem is that you are misusing the Conditional Split; it is designed to route data rows within a data flow, while you are trying to manage control flow. SSIS does not know in advance that you will use only one of the Flat File Destinations; it tries to initialize both, and in doing so it attempts to open the same file from two Destinations and fails with an error.
You can handle the task the SSIS way: manage control flow with tasks. In your case, the destination file should be either appended to or overwritten. Overwriting, however, can be viewed as overwriting the file with zero lines and then appending to it. Luckily for you, SSIS overwrites a file even when no records come out of a Data Flow.
So, before your dataflow, which should always append data, create another dataflow that always receives ZERO lines of data (the columns in the set can be arbitrary) and has a Flat File Destination that overwrites the file. Then use conditional execution of the control flow with precedence constraints to execute this "File Cleanup DataFlow Task" only when the file should be overwritten. You might also need to set DelayValidation=True on this "File Cleanup DataFlow Task".
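A minimal sketch of that cleanup branch, assuming the overwrite/append flag lives in a Boolean package variable named OverwriteFile (the variable name is illustrative, not from the question) and the precedence constraint is set to evaluate an Expression:
-- Zero-row source query for the "File Cleanup DataFlow Task" (any column list will do)
SELECT CAST(NULL AS varchar(10)) AS DummyColumn WHERE 1 = 0
-- Expression on the precedence constraint leading into the cleanup dataflow
@[User::OverwriteFile] == TRUE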

Related

Does adding simple script tasks to SSIS packages drastically reduce performance?

I am creating an SSIS package to import CSV file data into a SQL Server table.
Some of the rows in the CSV files will have missing values.
For example, if a row has the format: value1,value2,value3 and value2 is missing,
then it will render as: value1,,value3 in the csv file.
When the above happens (value2 is missing) in my SSIS package, I want NULL to go into the receiving SQL Server column that would hold value2.
I understand that I can add a "Script" task to my SSIS package to apply this rule. However, I'm concerned that this will drastically reduce the performance of my SSIS package. I'm not an expert on the inner workings of SSIS/SQL Server, but I'm concerned that the script will cause my package to lose "BULK INSERT" capabilities (and other efficiencies), since the script will have to inspect every row and apply the changes as needed.
Can anyone confirm if adding such a script will cause major performance impacts? Or does the SSIS/SQL-Server engine run the script on every row and then bulk-insert? Is there another way I can apply this rule without taking a performance hit?
Firstly, you can use a Script Task when required. A Script Task is executed only once per execution of the whole package, not once per row. For per-row processing there is a different component, the Script Component. When the regular SSIS tasks are not enough to achieve what you want, you can certainly use a Script Component; I don't believe it is a performance killer unless you implement it badly.
Secondly, for this particular requirement you can simply use a Flat File Source to import your csv file. It will put NULL where there is no value. I'm assuming the file is valid csv and each row has the correct number of commas for its fields (total fields - 1, in fact) even when some values are empty or null.
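One caveat: depending on whether "Retain null values from the source as null values in the data flow" is checked on the Flat File Source, a missing field may arrive as an empty string rather than NULL. If so, a Derived Column expression along these lines (the column name value2 is taken from the question; the length and code page are illustrative) can substitute a typed NULL:
LTRIM(RTRIM(value2)) == "" ? NULL(DT_STR, 50, 1252) : value2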

Not Creating the File when source has 0 rows

I have the below within the Data-flow area. The problem I'm experiencing is that even if the result is 0, it is still creating the file.
Can anyone see what I'm doing wrong here?
This is pretty much expected and known annoying behavior.
SSIS will create an empty flat file, even if "column names in the first data row" is unchecked.
The workarounds are:
remove such a file with a File System Task just after the execution of the dataflow, if @RowCountWriteOff = 0
as an alternative, do not start the dataflow at all if the expected number of rows in the source is 0 (a sketch follows below):
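A minimal sketch of that pre-check, assuming an Execute SQL Task stores the count in an Int32 variable, here called SourceRowCount (the variable and table names are illustrative), and the precedence constraint into the Data Flow Task is set to evaluate an Expression:
-- Execute SQL Task, single-row result set mapped to @[User::SourceRowCount]
SELECT COUNT(*) FROM dbo.SourceTable
-- Expression on the precedence constraint leading to the Data Flow Task
@[User::SourceRowCount] > 0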
Update 2019-02-11:
Issue I have is that I have 13 of these export to csv commands in the data flow and they are costly queries
Then querying the source twice just to check the row count up front will be even more expensive; it is better to reuse the value of the @RowCountWriteOff variable.
The initial design has 13 dataflows; adding 13 constraints and 13 File System Tasks to the main control flow will make the package more complex and harder to maintain.
Therefore, the suggestion is to use an OnPostExecute event handler, so the cleanup logic stays isolated to its particular dataflow:
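A sketch of that event-handler cleanup, assuming the dataflow populates @RowCountWriteOff (for example via a Row Count transformation) with the number of rows written: in the Data Flow Task's OnPostExecute event handler, add a File System Task with the Delete File operation pointed at the output file, and gate it with an expression, for instance on the task's Disable property:
-- Expression on the File System Task's Disable property (skip the delete when rows were written)
@[User::RowCountWriteOff] > 0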
Update 1 - Adding more details based on OP comments
Based on your comment, I will assume that you want to loop over many tables using SQL commands, check whether each table contains rows, and if so export the rows to flat files, otherwise ignore the table. I will list the steps you need to achieve that and provide links that contain more details for each step.
First, you should create a Foreach Loop container to loop over the tables.
You should add an Execute SQL Task with a count command (SELECT COUNT(*) FROM ...) and store the result set in a variable.
Add a Data Flow Task that imports data from the OLEDB Source to the Flat File Destination.
After that, you should add a precedence constraint with an expression on the Data Flow Task, similar to @[User::RowCount] > 0 (a sketch follows after these steps).
Also, it is good to check the links I provided, because they contain a lot of useful information and step-by-step guides.
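A minimal sketch of the count step and the constraint, assuming the Foreach Loop puts the current table name into a string variable TableName (illustrative) and the count lands in the Int32 variable RowCount from the steps above:
-- Expression on the Execute SQL Task's SqlStatementSource property, rebuilt on every iteration
"SELECT COUNT(*) FROM " + @[User::TableName]
-- Expression on the precedence constraint into the Data Flow Task
@[User::RowCount] > 0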
Initial Answer
Preventing SSIS from creating empty flat files is a common issue for which you can find a lot of references online; there are many suggested workarounds and methods that may solve the issue:
Try to set the Data Flow Task Delay Validation property to True
Create another Data Flow Task within the package, used only to count the rows in the source; if the count is greater than 0, the precedence constraint should lead to the other Data Flow Task
Add a File System Task after the Data Flow Task which deletes the output file if RowCount is 0; you should set the precedence constraint expression to ensure that.
References and helpful links
How to prevent SSIS package creating empty flat file at the destination
Prevent SSIS from creating an empty flat file
Eliminating Empty Output Files in SSIS
Prevent SSIS for creating an empty csv file at destination
Check for number of rows returned and do not create empty destination file
Set the Data Flow Task Delay Validation property to True

SSIS redirect empty rows as flat file source read errors

I'm struggling to find a built-in way to redirect empty rows as flat file source read errors in SSIS (without resorting to a custom script task).
As an example, you could have a source file with an empty row in the middle of it:
DATE,CURRENCY_NAME
2017-13-04,"US Dollar"
2017-11-04,"Pound Sterling"

2017-11-04,"Aus Dollar"
and your column types defined as:
DATE: database time [DT_DBTIME]
CURRENCY_NAME: string [DT_STR]
With all that, the package still runs and takes the empty row all the way to the destination, where it naturally fails. I want to be able to catch it early and identify it as a source read failure. Is it possible without a script task? A simple Derived Column perhaps, but I would prefer if this could be configured at the Connection Manager / Flat File Source level.
The only way to avoid relying on a script task is to define your source flat file with a single varchar(max) column, choose a delimiter that never occurs in the data, and write all the content into a SQL Server staging table. You can then remove the empty lines and parse the rest into a relational output using SQL.
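A rough sketch of that staging approach, with hypothetical table and column names based on the sample file above:
-- Staging table holding one raw file line per row
CREATE TABLE dbo.CurrencyStaging (RawLine varchar(max));
-- Drop the empty lines before parsing
DELETE FROM dbo.CurrencyStaging
WHERE RawLine IS NULL OR LTRIM(RTRIM(RawLine)) = '';
-- Parse the remaining lines into typed columns, skipping the header row
SELECT
    TRY_CONVERT(date, LEFT(RawLine, CHARINDEX(',', RawLine) - 1)) AS [DATE],
    REPLACE(SUBSTRING(RawLine, CHARINDEX(',', RawLine) + 1, 8000), '"', '') AS CURRENCY_NAME
FROM dbo.CurrencyStaging
WHERE RawLine <> 'DATE,CURRENCY_NAME';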
This approach is not very clean and takes a lot more effort than using a script task to drop empty lines or lines not matching a pattern. It isn't that hard to create a transformation with the Script Component.
That being said, my advice is to write a clear interface description and distribute it to all clients using your interface. Catch every file that throws an error while the flat file is being read, and mail the file to the responsible client with a note that it doesn't follow the interface rules and needs to be fixed.
Just imagine the flat file is generated manually, or even worse with something like Excel: you will struggle with wrong file encodings, missing columns, non-ASCII characters, wrong date formats, etc.
You will end up handling every exception caused by those quality issues.
Just add a Conditional Split component, and use the following expression to split rows
[DATE] == ""
And connect the default output connector to the destination
References
Conditional Split Transformation

Piping Data from CSV File to OLEDB Destination in SSIS

I have a SSIS package in which I use a ForEach Container to loop through a folder destination and pull a single .csv file.
The Container takes the file it finds and uses the file name for the ConnectionString of a Flat File Connection Manager.
Within the Container, I have a Data Flow Task to move row data from the .csv file (using the Flat File Connection Manager) into an OLEDB destination (this has another OLEDB Connection Manager it uses).
When I try to execute this container, it can grab the file name, load it into the Flat File Connection Manager, and begin to transfer row data; however, it continually errors out before moving any data - namely over two issues:
Error: 0xC02020A1 at Move Settlement File Data Into Temp Table, SettlementData_YYYYMM [1143]: Data conversion failed. The data conversion for column ""MONTHS_REMAIN"" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".
Error: 0xC02020A1 at Move Settlement File Data Into Temp Table, Flat File Source [665]: Data conversion failed. The data conversion for column ""CUST_NAME"" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
In my research so far, I know that you can set what conditions to force an error-out failure and choose to ignore failures from Truncation in the Connection Manager; however, because the Flat File Connection Manager's ConnectionString is re-made each time the Container executes, it does not seem to hold on to those option settings. It also, in my experience, should be picking the largest value from the dataset when the Connection Manager chooses the OutputColumnWidth for each column, so I don't quite understand how it is truncating names there (the DB is set up as VARCHAR(255) so there's plenty of room there).
As for the failed data conversions, I also do not understand how that can happen when the column referenced is using simple Int values, and both the Connection Manager AND the receiving DB are using floats, which should encompass the Int data (am I unaware that you cannot convert Int into Float?).
It's been my experience that some .csv files don't play well in SSIS when going directly into a DB destination; so, would it be better to transform the .csv into a .xlsx file, which plays much nicer going into a DB, or is there something else I am missing to easily move massive amounts of data from a .csv file into a DB - OR, am I just being stupid and turning a trivial matter into something bigger than it is?
Note: The reason I am dynamically setting the file in the Flat File Connection Manager is that the .csv file will have a set name appended with the month/year it was produced as part of a repeating process, and so I use the constant part of the name to grab it regardless of the date info
EDIT:
Here is a screen cap of my Flat File Connection Manager previewing some of the data that it will try to pipe through. I noticed some of these rows have quotes around them, and wanted to make sure that wouldn't affect anything adversely - the column having issues is the MONTHS_REMAIN one
Is it possible that one of the csv files in the suite you are processing is malformed? For instance, if one of the files had an extra column/comma, that could push a varchar value into an integer column, producing errors similar to the ones you have described. Have you tried using error row redirection to confirm that all of your csv files are formed correctly?
To use error row redirection, update your Flat File Source and adjust the Error Output settings to redirect rows. Your Flat File Source component will now have an extra red arrow which you can connect to a destination. Drag the red arrow from your source component to a new Conditional Split, then right-click the red line and add a data viewer. Now, when error rows are processed, they will flow over the red line into the data viewer so you can examine them. Finally, execute the package and wait for the data viewer to capture the errant rows for examination.
Do the data values captured by the data viewer look correct? Good luck!

Temporary storage for cleaned data in Integration Services

I have an Excel file that I need to process three times in integration services, once for projects, once for persons and once for time tracking data.
At each step I have the excel source and I do need to do some data clean up and type conversions (same in all three steps).
Is there an easy way of creating a step that does all this and that allows me to use the output as input to the other "real" steps?
I am starting to think about importing it into SQL server in a temp table, which is by all means ok, but it would be nice if I could skip that step.
This can actually be achieved using a single data flow.
You can read the Excel data source once and then use Multicast Transformation to create copies of the data set in memory. You can then process each of your three data flow branches accordingly and can also make use of parallel processing!
See the following reference for details:
http://msdn.microsoft.com/en-us/library/ms137701(SQL.90).aspx
I hope what I have detailed is clear and understandable but please feel free to contact me directly if you require further guidance.
Cheers, John
[Added in response to comments]
With regard to your further question, you can specify the precedence/flow control of your package using more than one flow. For example, you could use the Multicast task to create three data flows and then define your precedence flow control so that all transformation tasks in flow one must be completed before the transformations in flow two can begin.
You could use three separate Data Flow Tasks with a File System Task first. The file operation would be to copy the original Excel file to a temporary area. Each of the three Data Flow Tasks would start with the temp file and write to the temp file (I think they may need to write to a copy).
An issue with this is that this makes the data flows operate sequentially. This might not be an issue for your Excel file, but would be an issue for processing larger numbers of rows. In such a case, it would be better to process the three "steps" in parallel, and join the results at the final stage.
