I have an Excel file that I need to process three times in Integration Services: once for projects, once for persons and once for time tracking data.
At each step I have the Excel source and I need to do some data cleanup and type conversions (the same in all three steps).
Is there an easy way of creating a step that does all this and that allows me to use the output as input to the other "real" steps?
I am starting to think about importing it into SQL Server into a temp table, which is perfectly OK, but it would be nice if I could skip that step.
This can actually be achieved using a single data flow.
You can read the Excel data source once and then use Multicast Transformation to create copies of the data set in memory. You can then process each of your three data flow branches accordingly and can also make use of parallel processing!
See the following reference for details:
http://msdn.microsoft.com/en-us/library/ms137701(SQL.90).aspx
I hope what I have detailed is clear and understandable but please feel free to contact me directly if you require further guidance.
Cheers, John
[Added in response to comments]
With regard to your further question, you can specify the precedence/flow control of your package using more than one flow. So, for example, you could use the Multicast to create three data flows and then define your precedence constraints so that all transformation tasks in flow 1 must be completed before the transformations in flow 2 can begin.
You could use three separate Data Flow Tasks, with a File System Task first. The File System Task would copy the original Excel file to a temporary area. Each of the three Data Flow Tasks would start with the temp file and write to the temp file (I think they may need to write to a copy).
An issue with this is that it makes the data flows operate sequentially. That might not matter for your Excel file, but it would be an issue when processing larger numbers of rows. In such a case, it would be better to process the three "steps" in parallel and join the results at the final stage.
I have the below within the Data-flow area. The problem I'm experiencing is that even if the result is 0, it is still creating the file.
Can anyone see what I'm doing wrong here?
This is pretty much expected and known annoying behavior.
SSIS will create an empty flat file even when "Column names in the first data row" is unchecked.
The workarounds are:
Remove the file with a File System Task if @RowCountWriteOff = 0, just after the data flow has executed.
As an alternative, do not start the data flow at all if the expected number of rows in the source is 0.
Update 2019-02-11:
"Issue I have is that I have 13 of these export to CSV commands in the data flow and they are costly queries."
Then querying the source twice just to check the row count ahead of time will be even more expensive; it is better to reuse the value of the @RowCountWriteOff variable.
The initial design has 13 data flows; adding 13 precedence constraints and 13 File System Tasks to the main control flow will make the package more complex and harder to maintain.
Therefore, the suggestion is to use an OnPostExecute event handler, so the cleanup logic is isolated to its own data flow.
Update 1 - Adding more details based on OP comments
Based on your comment, I will assume that you want to loop over many tables using SQL commands, check whether each table contains rows, and if so export those rows to flat files, otherwise ignore the table. I will mention the steps you need to achieve that and provide links that contain more details for each step.
First, you should create a Foreach Loop container to loop over the tables.
You should add an Execute SQL Task with a count command (SELECT COUNT(*) FROM ...) and store the result set inside a variable (a sketch of this query is shown after these steps).
Add a Data Flow Task that imports data from an OLE DB Source to a Flat File Destination.
After that, you should add a precedence constraint with an expression to the Data Flow Task, with an expression similar to @[User::RowCount] > 0.
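For illustration, the count query inside the Execute SQL Task could look like the sketch below; dbo.SourceTable and the RowCnt alias are placeholders, and in practice the table name would come from the Foreach Loop variable.
-- Hypothetical count query for the Execute SQL Task inside the Foreach Loop.
-- dbo.SourceTable stands in for the table name supplied by the loop variable.
SELECT COUNT(*) AS RowCnt
FROM dbo.SourceTable;
-- Set the task's ResultSet to "Single row" and map RowCnt to @[User::RowCount],
-- which is then tested in the precedence constraint expression @[User::RowCount] > 0.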
Also, it is good to check the links I provided because they contain a lot of useful information and step-by-step guides.
Initial Answer
Preventing SSIS from creating empty flat files is a common issue for which you can find a lot of references online; there are many suggested workarounds and several methods that may solve the issue:
Try to set the Data Flow Task Delay Validation property to True
Create another Data Flow Task within the package, which is used only to count rows in the source; if the count is greater than 0, the precedence constraint should lead to the other Data Flow Task.
Add a File System Task after the Data Flow Task which deletes the output file if RowCount is 0; you should set the precedence constraint expression to ensure that.
References and helpful links
How to prevent SSIS package creating empty flat file at the destination
Prevent SSIS from creating an empty flat file
Eliminating Empty Output Files in SSIS
Prevent SSIS for creating an empty csv file at destination
Check for number of rows returned and do not create empty destination file
Set the Data Flow Task Delay Validation property to True
Hi,
I am trying to use the same data twice in the SSIS data flow panel; however, it only allows me to build one path. Is there any way I can build another path from it, or duplicate the data I want to use?
Thanks,
You're looking for the Multicast transformation.
Connect the above 'Convert Data Type 2' to a Multicast transformation.
From the Multicast you can take any number of output flows.
There are two ways to add a path; which one to use depends on your requirements:
Multicast transformation
The Multicast transformation distributes its input to one or more outputs. This transformation is similar to the Conditional Split transformation. Both transformations direct an input to multiple outputs. The difference between the two is that the Multicast transformation directs every row to every output, and the Conditional Split directs a row to a single output
Script Component multiple outputs
If you are looking to create many distinct paths based on the Script Component code, then the Script Component allows creating multiple outputs (check the link above for more details).
Option 1
The best and most SSIS way of doing this is by using the Multicast component. Connect it to the output path of your Script Transformation "Convert data type 2" and from there, you can connect it to both "Sort 1" and "Sort 3"
Option 2
If your Script Transformation is asynchronous (1 row in to many rows out, many rows in to 1 out, etc.), then you could add a second output and also send the data along. That answer is only provided for completeness: doing this would cause the amount of data required for a row in your pipeline to double (the Multicast component does some pointer-reference voodoo to avoid physically duplicating the data).
Finally, I'm not sure what business problem you're solving, but if performance is an issue, it'll be the package design and not SSIS itself. Without knowing more (aka a different question), it's hard to say more than that.
Imagine the following scenario:
1. there are N number of jobs
2. the jobs write data to the same file once a day sequentially
3. task setting indicates whether the file should be overwritten or appended to
What I've tried thus far is using a Conditional Split in my data flow:
To test it out, Cases 1 & 2 are:
What actually happens is that the Conditional Split works out which data rows to send where, ends up sending all rows to one side and 0 rows to the other, and both sides end up opening the file (I think), hence the errors:
I get that I'm misusing the Conditional Split here, but come on, it's 2017 outside; there must be a way to do this without resorting to Script Tasks clearing the files?
Your problem: you are misusing the Conditional Split. It is designed to manipulate data rows in a data flow, and you are trying to manage control flow. Speaking of SSIS, it does not know in advance that you will use only one of the Flat File Destinations; it tries to initialize both. By doing so, SSIS tries to open the same file from two destinations and fails with an error.
You can handle the task the SSIS way: manage control flow with tasks. In your case, the destination file should either be appended to or overwritten. But being overwritten can be viewed as being overwritten with zero lines and then being appended to. Luckily for you, SSIS overwrites a file even if no records come from a data flow.
So, before your data flow which should always append data, you create another data flow which always receives ZERO rows of data (the columns in the set can be arbitrary) and a Flat File Destination overwriting the file. Then use conditional execution of control flow with precedence constraints to execute this "File Cleanup DataFlow Task". You might also need to set DelayValidation=true on this "File Cleanup DataFlow Task".
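As a minimal sketch, assuming the cleanup data flow reads from an OLE DB source against SQL Server, its source query only needs to return zero rows while still defining some column metadata; the column name and type below are arbitrary placeholders.
-- Zero-row source query for the "File Cleanup DataFlow Task" (placeholder column).
-- No records flow through, so the Flat File Destination just overwrites the file.
SELECT CAST(NULL AS varchar(50)) AS Col1
WHERE 1 = 0;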
I am trying to use a SSIS package to insert data from a file into a table but only if all the data in the file is good. I have read around and realise that I can split my good data and bad data with a conditional split.
However, I cannot come up with a way to not write the good data if there are some bad data rows.
I can solve my problem using a staging table. I just thought I would ask if I am missing a more elegant way to do this within the SSIS package, rather than load then transform with T-SQL.
Thanks
The SSIS way allows wrapping actions in a transaction. Given your task, you need to count the bad rows in the data flow, and if there is at least one bad row, do nothing, i.e. roll back.
Below is how I would do it in pure SSIS. Create a Sequence Container and specify TransactionOption=Required on it, then move your data flow into the sequence. Add a Row Count transformation to your bad-rows path and store its result in a variable. After the data flow inside the sequence, create a conditional precedence constraint checking whether the bad row count variable is > 0, and on the next step run a little Script Task which raises an error to roll back the transaction.
Pure SSIS - yes! Simpler than using staging table - not sure.
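For comparison, here is a minimal sketch of the staging-table route mentioned in the question; the table and column names (StagingImport, TargetTable, IsBad, Col1, Col2) are hypothetical, and the bad-row rule would be whatever validation you apply during the load.
-- Hypothetical staging pattern: load everything into dbo.StagingImport first,
-- then copy to the real table only if no row was flagged as bad.
BEGIN TRANSACTION;

IF NOT EXISTS (SELECT 1 FROM dbo.StagingImport WHERE IsBad = 1)
    INSERT INTO dbo.TargetTable (Col1, Col2)
    SELECT Col1, Col2
    FROM dbo.StagingImport;

-- Clear the staging table for the next file (or keep bad rows for reporting, as needed).
TRUNCATE TABLE dbo.StagingImport;

COMMIT TRANSACTION;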
I'm sure this question is a common one, but I'm having a really challenging time coming up with a modular design for what should not be an impossible task. I have a situation where I have common destination tables, about five or six of them, but multiple input files which need to be massaged into a certain format for insertion. I've been tasked with making the design modular so that, as we work with new data providers with different formats, the pieces of the package that handle the insertion don't change, nor does the error reporting, etc.; only the input side changes. I've suggested using a common file format, which would mean taking the source files, transforming them, and running the rest of the common import process on them. It was suggested that I consider using tables for this process on the input side.
I guess what strikes me about this process is the fact that the package can be saved as a template and I can use the common pieces over and over, setting up new connections as we work with other data providers. Outside of that, I could see resorting to custom code in a Script Task to ensure a common format to be inserted into common input tables, but that's as far as I've gotten.
If anyone has ever dealt with a situation as such, I would appreciate design suggestions to accommodate functionality for now and in the future.
Update: I think the layered architectural design that is being emphasized in this particular instance would be as such (which is why I find it confusing):
There would be six layers. They are as follows:
A. File acquisition
B. File Preparation
C. Data Translation to common file format (in XML)
D. Transformation of data to destination format (XML - preparation for insertion into database)
E. Insert into database
F. Post-processing (reporting and output of errored-out records)
Since we will be dealing with several different data providers, the steps would be the same for processing the data, but the individual steps themselves may differ between providers, if that makes sense. Example: We may get data from provider A, but we would receive files from them and they are zipped CSV files. Provider B's would be in XML, uncompressed. One provider might send files to us and we may have to go pick files up (This would take place in the file acquisition step above).
So my questions are:
A. Is it more important to follow the architectural pattern here or to combine things where possible? One suggestion was to combine all the connection items in a single package as the top layer, so a single package would handle things like making a service call, SFTP, FTP, and anything else that was needed. I'm not sure quite how one would do multiple connections for different providers when a schedule is needed. It just seems to complicate things... I'm thinking a connection layer, but have it be specific to the provider, not a be-all-end-all.
B. Same thing with the file preparation layer.
I think a modular design is key, but stacking things into control tasks seems to make things more complicated in design than they should be. Any feedback or suggestions would be helpful in this area.
I would do what was suggested in another comment and import each file into its own temp table first, then union them all later. This gives you more modularity, makes adding or removing input files easier, and makes debugging easier because you can easily see which section failed. Here is an outline of what I would do:
Step 1: SQL Task
Create the temp table (repeat as needed for each task with a unique table):
IF EXISTS (SELECT * FROM sys.objects
           WHERE object_id = OBJECT_ID(N'[dbo].[tbl_xxxxxx_temp]')
             AND type IN (N'U'))
    DROP TABLE [dbo].[tbl_xxxxxx_temp]
GO

CREATE TABLE [dbo].[tbl_xxxxxx_temp](
    (columns go here)
) ON [PRIMARY]
GO
Step 2: Data Flow Task
Create a Data Flow Task and import each file into the unique temp table you created above.
Step 3: Data Flow Task
Create a second DFT and connect each temp table to a Union All Data Flow transformation (convert or derive columns as needed), then connect the output to your static database table.
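For reference, and ignoring any conversions or derived columns, that second data flow accomplishes roughly the equivalent of the T-SQL below; the destination and column names are placeholders, so you could alternatively run something like this from an Execute SQL Task.
-- Rough T-SQL equivalent of the Step 3 union-all data flow (placeholder names).
INSERT INTO dbo.tbl_static_destination (Col1, Col2)
SELECT Col1, Col2 FROM dbo.tbl_aaaaaa_temp
UNION ALL
SELECT Col1, Col2 FROM dbo.tbl_bbbbbb_temp
UNION ALL
SELECT Col1, Col2 FROM dbo.tbl_cccccc_temp;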
Step 4: SQL Task: Drop temp tables
DROP TABLE tbl_xxxxxx_temp
Please note it is necessary to set "DelayValidation" to True in each Data Flow Task in order for this to work.