I'm trying to manipulate a column in SSIS which looks like the sample below, after removing unwanted rows with a Derived Column and a Conditional Split in my data flow task. The source for this is a flat file.
XXX008001161022061116030S1TVCO3057
XXX008002161022061146015S1PUAG1523
XXX009001161022063116030S1DVLD3002
XXX009002161022063146030S1TVCO3057
XXX009003161022063216015S1PUAG1523
XXX010001161022065059030S1MVMA3020
XXX010002161022065129030S1TVCO3057
XXX01000316102206515901551PPE01504
The first three numbers after the "XXX" prefix ("008" in the first row) represent a series, and the next three ("001") represent a sequence number within that series. What I need is to renumber the series values sequentially, starting from "001", through to the end of the file.
The desired result would thus look like:
XXX001001161022061116030S1TVCO3057
XXX001002161022061146015S1PUAG1523
XXX002001161022063116030S1DVLD3002
XXX002002161022063146030S1TVCO3057
XXX002003161022063216015S1PUAG1523
XXX003001161022065059030S1MVMA3020
XXX003002161022065129030S1TVCO3057
XXX00300316102206515901551PPE01504
...
My potential solution would be to load the file into a temporary database table and query it with SQL from there, but I am trying to avoid this.
The final destination is a flat file.
Does anybody have any ideas on how to pull this off in SSIS? Other solutions are appreciated as well.
Thanks in advance.
I would definitely use the staging table approach and use window functions to accomplish this. I could see a use case for avoiding the database, though, if SSIS is on another machine than the database engine and there is a need to offload the processing to the SSIS box.
In that case I would create a script transformation. You can process each row and make the necessary changes before passing it to the output, and you can use C# or VB.
There are many examples out there. Here is an MSDN article: https://msdn.microsoft.com/en-us/library/ms136114.aspx
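As a rough sketch of what that could look like (untested, and assuming the whole record arrives in a single ReadWrite input column named Line, with rows arriving in file order):

    public class ScriptMain : UserComponent
    {
        // Original series value ("008", "009", ...) seen on the previous row
        private string lastSeries = null;
        // Replacement series counter; becomes 001 for the first series
        private int newSeries = 0;

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            string line = Row.Line;
            string series = line.Substring(3, 3); // characters 4-6 hold the series

            if (series != lastSeries) // a new series begins, so bump the counter
            {
                newSeries++;
                lastSeries = series;
            }

            // Splice the renumbered series back into the same fixed positions
            Row.Line = line.Substring(0, 3)
                     + newSeries.ToString("D3")
                     + line.Substring(6);
        }
    }

Because the component keeps state between rows, this only works if the script sees the rows in the original file order, which a single-path data flow from a flat file source does.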
I want to optimize my data flow to read just the data I really need.
I created a dataset that maps a view on my database. This dataset is used by different data flows, so I need a generic projection.
Now I am creating a new data flow and I want to read just a subset of the dataset.
Here is how I created the dataset:
And this is the generic projection:
Here is how I created the data flow. These are the source settings:
But now I want just a subset of my dataset:
It works, but I think I am doing it wrong:
I want to read data from my dataset (as you can see in the Source settings tab), but when I modify the projection I read from the underlying table instead (as you can see in the Source options). It seems inconsistent. Which is the correct way to manage this kind of customization?
Thank you
EDIT
The solution proposed does not solve my problem. If I go into the monitor and analyze the executions, this is what I see...
Before applying the proposed solution, with the approach I wrote above, I got this:
As you can see, I read just 8 columns from the database.
With the solution proposed, I get this:
And only then:
Just to be clear, the purpose of my question is:
How can I read only the data I really need, instead of reading all the data and filtering it afterwards?
I found a way (explained in my question), but there is an inconsistency in the configuration of the data flow (I set a dataset as input, but in the source options I write a query that reads from the database).
First, import the data as a source.
You can use the Select transformation in a Data Flow activity to select CustomerID from the imported dataset.
Here you can remove the unwanted columns.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/data-flow-select
I am trying to create a transformation which takes values from a table input (let's say 10 rows) and in turn creates variables from the values in those rows. For each row in the original set, I then need to run a new transformation using those variables.
How can I loop through a bunch of rows, one by one, reading each into variables which will be used later in a transformation of its own?
Looping in PDI is a bit complicated. You can proceed as follows:
Create a job in PDI. First execute a transformation which reads or generates the rows you need and ends with a "Copy rows to result" step.
After that, create an Execute Job step in the first job. Here you have to check the "Execute for every input row" option in the "Advanced" settings of the step. In this sub-job you create and execute the final transformation, which transforms your data.
In this transformation you have to use the "Get rows from result" step. Here you can finally read the variables you defined before.
Hope I could help you.
Best regards.
I need to import sheets which look like the following:
March Orders
***Empty Row
Week Order # Date Cust #
3.1 271356 3/3/10 010572
3.1 280353 3/5/10 022114
3.1 290822 3/5/10 010275
3.1 291436 3/2/10 010155
3.1 291627 3/5/10 011840
The column headers are actually in row 3. I can use an Excel Source to import the sheet, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.
Have a look at the threads below. The links have more details, but I've included some text from the pages (just in case the links go dead).
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading a text file into SQL Server via SSIS, we have the provision to skip any number of leading rows from the source and load the data into SQL Server. Is there any provision to do the same for an Excel file?
The source Excel file for me has some description in the leading 5 rows. I want to skip it and start the data load from row 6. Please provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in SQL Server) and then use a conditional split to filter out everything where the number <= 5.
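A hypothetical sketch of that numbering, done in a script transformation placed right after the Excel source (it assumes you add an output column named RowNumber, type DT_I4, to the component):

    public class ScriptMain : UserComponent
    {
        private int rowNumber = 0; // running count of rows seen so far

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            rowNumber++;
            Row.RowNumber = rowNumber; // expose the number to downstream components
        }
    }

The conditional split then routes rows with RowNumber <= 5 to a discard output and everything else on to the destination.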
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible, during data import from Excel to a DB table, to skip the first 6 rows, for example?
Also, the Excel data is divided into sections with headers. Is it possible, for example, to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In your Data Flow task, you will need to set the "OpenRowset" custom property of your Excel connection (right-click your Excel connection > Properties; in the Properties window, look for OpenRowset under Custom Properties). To ignore the first 5 rows in Sheet1 and import columns A-M, you would enter the following value for OpenRowset: Sheet1$A6:M (notice I did not specify a row number for column M; you can enter a row number if you like, but in my case the number of rows can vary from one iteration to the next).
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in each row that uniquely identifies it as a header row, and skip the rows that match this 'header logic'. Another option would be to import all the rows and then remove the header rows using a SQL script in the database... like a cursor that deletes every 12th row. Or you could add an identity field with a seed/increment of 1/1 and then delete all rows with row numbers that divide evenly by 12. Something like that...
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data beginning in the 7th row.
Unlike the same operation with a CSV file ('Header Rows to Skip' in the Connection Manager Editor), I can't seem to find a way to ignore the first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were 2 fields in the upper 5 rows that had some data which, I think, prevented the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel Source object, I used 'SQL Command' as the 'Data Access Mode' (it's a drop-down shown when you double-click the Excel Source object). From there I was able to build a query ('Build Query' button) that only grabbed the records I needed. Something like this:
SELECT F4, F5, F6 FROM [Spreadsheet$] WHERE (F4 IS NOT NULL) AND (F4 <> 'TheHeaderFieldName')
Note: I initially tried ISNUMERIC instead of 'IS NOT NULL', but that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't NULL (and fortunately F4 didn't contain any junk in the first 5 rows). I could skip the whole header row (row 6) with the second WHERE clause.
So that cleaned up my data source perfectly. All I needed to do then was add a Data Conversion object between the source and the destination (everything needed to be converted from Unicode in the spreadsheet), and it worked.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet the guidelines as much as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts), and they need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, and not repeat the headers multiple times. Most important of all, they must have the same columns with the same column titles in the same order every time. If they can't provide that, then you probably don't have something that will work for automated import, as you will get the file in a different format every time depending on the mood of the person who maintains the Excel spreadsheet.
Incidentally, we push really hard to never receive any data from Excel (this only works some of the time, but if they have the data in a database, they can usually accommodate). They also must know that any changes they make to the spreadsheet format will result in a change to the import package, and that they will be charged for those development changes (assuming these are outside clients and not internal ones). These changes must be communicated in advance and developer time scheduled; a file with the wrong format will fail and be returned to them to fix if not.
If that doesn't work, may I suggest that you open the file, delete the first two rows, and save it as a text file. Then write a data flow that will process the text file. SSIS does a lousy job of supporting Excel, and anything you can do to get the file into a different format will make life easier in the long run.
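If the conversion to text can be automated, a Script Task can strip the leading rows before the data flow runs. A minimal sketch for the task's Main method (the paths and row count here are placeholders, not taken from the question):

    using System.IO;
    using System.Linq;

    string inPath = @"C:\data\MarchOrders.txt";        // assumed input, already saved as text
    string outPath = @"C:\data\MarchOrders_clean.txt"; // file the data flow will read
    int skipRows = 2;                                  // the title row plus the empty row

    // Rewrite the file without its leading non-data rows
    File.WriteAllLines(outPath, File.ReadLines(inPath).Skip(skipRows));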
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
Not entirely correct.
SSIS forces you to work with the format as-is, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields, and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
The sky is the limit.
You can just use the OpenRowset property, which you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
Regards.
I have about 500 fixed-width columns in a flat file, and I want to apply the same logic to each of them: replace an empty string with NULL before it goes into the database.
I know the expression to replace an empty string with NULL, but I really don't want to have to use the GUI to enter it for every column.
So is there a tool out there that can do this all on the back end?
You could look at something like the EzAPI to create your data flow. In this answer, I have an example of how one creates an EzDerivedColumn and sets the formula within it:
Automatically mapping columns with EZApi with OLEDBSource
If you can install third-party components, I've seen a number of implementations of a trim-to-null functionality on codeplex.com.
BIML might be an option to generate your package as well. I'd need to play with it to figure out the syntax, though.
My google-fu worked a little better after lunch.
I was able to modify roughly the 5th comment down on http://social.msdn.microsoft.com/Forums/sqlserver/en-US/222e70f5-0a21-4bb8-a3fc-3f365d9c701f/ssis-custom-component-derivedcolumn-programmatically-problems?forum=sqlintegrationservices to work for my needs.
My C# code will now loop through all the input columns from a "Flat File Source" object and add a derived column for each.
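I don't have that original code to hand, but the heart of the loop is just building one derived-column expression string per input column. A hedged sketch (the column names and the DT_STR length are made up for illustration):

    // Build an SSIS expression per column that maps an all-blank value to NULL
    var columnNames = new[] { "Col001", "Col002" /* ... through Col500 */ };
    foreach (string col in columnNames)
    {
        string expression =
            "TRIM(" + col + ") == \"\" ? NULL(DT_STR, 50, 1252) : " + col;
        // Feed this expression to the derived column created for this input column
        System.Console.WriteLine(expression);
    }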
My problem is as follows. I have a CSV file (~100k rows) containing history information with the column format:
ID1,History1,ID2,History2,...,ID110,History110
Each row may have anywhere between 0 and 110 history entries. Each separate entry requires a stored procedure to be called.
If there were a small number of possible entries per row, I imagine the way to do this would be to transform the data using a script and send each entry down a unique path. Creating 110 paths would probably work, but it isn't very elegant (and would be quite time-consuming).
What would the best way to approach this be?
Just load the data (the raw CSV unchanged, one row per file line) into a staging table. Then call a stored procedure that uses a string splitter to break up and loop over the staging table rows, calling your other procedure for each history entry.
see: Arrays and Lists in SQL Server 2005 and Beyond
also see this previous answer: SQL comma delimted column => to rows then sum totals?
If you want to solve this in SSIS without the staging tables, you could create a destination script component. You could use a switch statement or a hashtable to look up the right sproc to execute for each data row.
It is unclear whether this is a better solution than the staging table approach above, but it is an alternative.
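A rough sketch of such a destination script component (hypothetical throughout: the ConnString package variable, the procedure name, and the single Line input column are all assumptions):

    using System.Data;
    using System.Data.SqlClient;

    public class ScriptMain : UserComponent
    {
        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            // Row.Line is assumed to hold the raw CSV record
            string[] fields = Row.Line.Split(',');

            using (var conn = new SqlConnection(Variables.ConnString))
            {
                conn.Open();
                // Fields come in ID,History pairs; stop at the first empty ID
                for (int i = 0; i + 1 < fields.Length; i += 2)
                {
                    if (string.IsNullOrEmpty(fields[i])) break;

                    using (var cmd = new SqlCommand("dbo.usp_ProcessHistory", conn))
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        cmd.Parameters.AddWithValue("@Id", fields[i]);
                        cmd.Parameters.AddWithValue("@History", fields[i + 1]);
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
    }

Opening a connection per row is wasteful at ~100k rows; in practice you would open it once in PreExecute and close it in PostExecute.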
I know you already accepted an answer, but couldn't you use an Unpivot task to achieve what you wanted to do here?