Import data from .xls to table by removing unwanted columns? [duplicate] - sql-server

I need to import sheets which look like the following:
March Orders
***Empty Row
Week Order # Date Cust #
3.1 271356 3/3/10 010572
3.1 280353 3/5/10 022114
3.1 290822 3/5/10 010275
3.1 291436 3/2/10 010155
3.1 291627 3/5/10 011840
The column headers are actually row 3. I can use an Excel Source to import them, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.

Have a look at the threads below. The links have more details, but I've included some text from the pages (just in case the links go dead):
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading the text file to SQL Server via SSIS, we have the
provision to skip any number of leading rows from the source and load
the data to SQL Server. Is there any provision to do the same for an Excel file?
The source Excel file for me has some description in the leading 5
rows, I want to skip it and start the data load from the row 6. Please
provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in
SQL Server) and then use a conditional split to filter out everything
where the number <=5
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible, when importing data from Excel to a DB table, to skip the first 6 rows, for example?
Also, the Excel data is divided into sections with headers. Is it possible, for example, to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In
your Data Flow task, you will need to set the "OpenRowset" Custom
Property of your Excel Connection (right-click your Excel connection >
Properties; in the Properties window, look for OpenRowset under Custom
Properties). To ignore the first 5 rows in Sheet1, and import columns
A-M, you would enter the following value for OpenRowset: Sheet1$A6:M
(notice, I did not specify a row number for column M. You can enter a
row number if you like, but in my case the number of rows can vary
from one iteration to the next)
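If you prefer the Excel source's 'SQL Command' data access mode instead of editing the OpenRowset property, the same range notation can be used in the query. A minimal sketch, assuming the data sits on Sheet1 in columns A through M starting at row 6:
SELECT * FROM [Sheet1$A6:M]
Because the range is part of the table reference, the driver simply never reads the rows above row 6.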
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in
each row that uniquely identifies it as a header row; skip the rows
that match this 'header logic'. Another option would be to import all
the rows and then remove the header rows using a SQL script in the
database...like a cursor that deletes every 12th row. Or you could
add an identity field with seed/increment of 1/1 and then delete all
rows with row numbers that divide perfectly by 12. Something like
that...
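A rough sketch of that last idea, with a hypothetical staging table (note that numbering pre-existing rows via ALTER TABLE isn't guaranteed to follow file order, so it is safer to create the identity column on the staging table before the load):
ALTER TABLE dbo.ExcelStaging ADD RowNum int IDENTITY(1,1);
DELETE FROM dbo.ExcelStaging
WHERE RowNum % 12 = 0;  -- drop every 12th row (the repeated section headers)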
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data
beginning in the 7th row.
Unlike the same operation with a csv file ('Header Rows to Skip' in
Connection Manager Editor), I can't seem to find a way to ignore the
first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow
Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were
2 fields in the upper 5 rows that had some data that I think prevented
the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel source object, I used 'SQL Command' as the 'Data Access Mode' (it's a drop-down when you double-click the Excel Source object). From there I was able to build a query ('Build Query' button) that only grabbed the records I needed. Something like this:
SELECT F4, F5, F6
FROM [Spreadsheet$]
WHERE (F4 IS NOT NULL) AND (F4 <> 'TheHeaderFieldName')
Note: I initially tried an ISNUMERIC instead of 'IS NOT NULL', but
that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't
NULL (and fortunately F4 didn't contain any junk in the first 5
rows). I could skip the whole header row (row 6) with the 2nd WHERE
clause.
So that cleaned up my data source perfectly. All I needed to do now
was add a Data Conversion object in between the source and destination
(everything needed to be converted from unicode in the spreadsheet),
and it worked.
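Applied to the sheet in the original question (title in row 1, a blank row 2, headers in row 3), a similar pass-through query against the unnamed F1..F4 columns might look like the sketch below; the sheet name, plus the assumptions that the title only occupies column A and that the connection uses HDR=NO so every row comes through as data, are mine:
SELECT F1, F2, F3, F4
FROM [Sheet1$]
WHERE F2 IS NOT NULL   -- drops the 'March Orders' title row and the blank row
AND F1 <> 'Week'       -- drops the header row itself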

My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet the guidelines as much as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts), and they need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, not repeat the headers multiple times, and, most important of all, have the same columns with the same column titles in the same order every time. If they can't provide that, then you probably don't have something that will work for automated import, as you will get the file in a different format every time depending on the mood of the person who maintains the Excel spreadsheet.
Incidentally, we push really hard to never receive any data from Excel at all (this only works some of the time, but if they have the data in a database, they can usually accommodate us). They also must know that any changes they make to the spreadsheet format will result in a change to the import package, and that they will be charged for those development changes (assuming these are outside clients and not internal ones). Those changes must be communicated in advance and developer time scheduled; a file with the wrong format will simply fail and be returned to them to fix.
If that doesn't work, may I suggest that you open the file, delete the first two rows, and save it as a text file. Then write a data flow that will process the text file. SSIS does a lousy job of supporting Excel, and anything you can do to get the file into a different format will make life easier in the long run.

My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
Not entirely correct.
SSIS forces you to use the format as supplied, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
Sky is the limit

You can just use the OpenRowset property, which you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
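For the sheet in the original question (headers in row 3 and four columns, A through D), the OpenRowset value would look something like this (sheet name assumed):
Sheet1$A3:D
With 'first row has column names' left enabled on the connection, row 3 is then treated as the header row and the data load starts at row 4; in 'SQL Command' mode the equivalent query would be SELECT * FROM [Sheet1$A3:D].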
Regards.

Related

SSIS Package that removes or ignores multiple rows in flat file

I'm learning how to develop SSIS packages for ETL systems this week. One of my first objectives is to discover different ways to import flat files into a database. As this is pretty straightforward for the most part, I've been playing around with different flat files that contain a variety of data.
One issue I ran into today was with an Excel document that contained data in the first row, the header information in the second row, and footer information in the last couple of rows. What I want to import into the database is the header and all the rows leading up to the footer. I do not want the first row and I do not want the footer.
My current solution is to create a Data Flow task and, under Advanced Settings, set OpenRowset to "Sheet1$A2:I20000". This allows me to open the sheet I want, select the second row (where my header resides) and then select all other rows between A2 and I20000.
This solution also lets me read the header information (which I want) and all the rows that follow for importation. Unfortunately, it also selects the footer rows, and it doesn't seem optimized for good performance, as the package has to scan a massive range of rows regardless of whether there is data in those rows or not.
The screenshot below contains the Excel sheet that I'm trying to import, based on the MS SQL sample database. The rows I want to remove or ignore are circled with the red box. Everything else not circled is what I want to import.
Any thoughts on how I can ignore the first row, read the second row for my header information, read the rows that follow the header for my data set and then ignore the last couple of rows that I'm deeming as the footer?
Additional Information About This File
The first row will never change.
The header row will never change.
The data set after the header will change values, not data types.
The first column of footer will never change.
The second column of footer will change values, not data types.
The rest of the footer columns will never change.
I figured out the solution to my own question.
I used the Conditional Split as shown in my diagram to filter out the rows I didn't need. For example, I put a condition that checks if the first column of data (member_no) was < (less than) a number. If TRUE, it goes to my OLE DB. If False, it goes nowhere. This prevented the "SUM TOTAL" from being passed to the database.
I also edited my start range to 'Sheet1$A2:I' as opposed to 'Sheet1$A2:I20000'. That way the package scans until there are no records left to scan and then stops (I assume).
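If you would rather land everything in a staging table and do the clean-up in SQL instead of (or in addition to) the conditional split, a rough equivalent with a hypothetical staging table could be:
DELETE FROM dbo.OrderStaging
WHERE member_no IS NULL
OR ISNUMERIC(member_no) = 0;  -- removes rows whose first column is not a numeric member number, e.g. the 'SUM TOTAL' footer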

What is a better alternative to Excel for loading data to a SQL Server database?

I have a huge amount of trouble loading spreadsheets into a SQL Server database.
Currently, I'm using an SSIS package to load the data and I have had to make lots of adjustments to get the data to load:
All numbers must be formatted as text (otherwise they don't load properly).
Sometimes numbers must be preceded with single quote (') to get them to load.
If a column has a mix of number cells and text cells, the text cells must come first in the file (otherwise only numbers load and text comes in as NULL).
If a user changes a column name the file will not load.
If a user changes a tab name the file won't load.
If a user adds a new column (even at the end of a sheet) the file won't load.
Extra sheets in the file are not a problem, thankfully!
Dates seem sensitive as to whether or not they will load properly.
Connection strings to the Excel file must include "IMEX=1" or things are worse.
Scheduled SSIS jobs must be run as 32-bit even on 64-bit system.
I've been loading the data (usually 200,000-500,000 rows per file) into a table with all fields defined as nvarchar. Then, once loaded, the next step of the SSIS package transfers that data to the working table with typed data fields.
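For reference, that second step is essentially a SELECT with explicit conversions; a minimal sketch with hypothetical table and column names (on SQL Server 2012 and later, TRY_CONVERT can be used instead of CAST to turn bad values into NULLs rather than errors):
INSERT INTO dbo.Orders (OrderNo, OrderDate, Amount)
SELECT CAST(OrderNo AS int),
CAST(OrderDate AS datetime),
CAST(Amount AS decimal(18, 2))
FROM dbo.OrdersStaging      -- the all-nvarchar landing table
WHERE OrderNo IS NOT NULL;  -- skip blank rows left over from the spreadsheet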
All of the requirements that I must put on the user for how to format the Excel file are really a pain. We usually have to send the file back multiple times until all the formatting issues are correct before the file will load. I'd like to eliminate this thrash.
I know I'm not the only one that is facing this type of problem. So, I must ask...
What is a better alternative to Excel for loading data into a SQL Server database?
Or, am I going about this the wrong way? Should I be using something other than SSIS to load Excel spreadsheets?
You can try OPENROWSET:
SELECT *
INTO SomeTable
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
    'Excel 8.0;Database=\\servername\c$\filename.xls;HDR=YES;IMEX=1', [Sheet2$])
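If you also need to skip leading rows, the range notation from the SSIS answers above can go into a pass-through query instead of the bare sheet name (the sheet name and range here are assumptions):
SELECT *
INTO SomeTable
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
    'Excel 8.0;Database=\\servername\c$\filename.xls;HDR=YES;IMEX=1',
    'SELECT * FROM [Sheet2$A3:D]')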
Not really a SQL answer, but an easy one:
You could require the users to copy and paste data into an Excel spreadsheet where everything but the data fields to be included is locked. This will prevent many of the pain points described.

SSIS 2008 R2 - How to load header and data row from CSV

I have a CSV file where there is a header row and data rows in the same file.
I want to get information from both rows during the same load.
What is the easiest way to do this?
i.e File Example - Import.CSV
2,11-Jul-2011
Mr,Bob,Smith,1-Jan-1984
Ms,Jane,Doe,23-Apr-1981
In the first row, there is a count of the number of rows and the date of transmission.
The second and subsequent rows contain the actual data, in this case Title, FirstName, LastName, Birthdate.
SQL Server Integration Services Conditional Split Transformation should do it.
I wonder what you would do with that info in the pipeline. However, there is only one way to read it in a single pass (take a look at the notes/limitations at the end):
Create a data flow.
Put a Flat File source component in it and set it up the way you want.
Add a Script Component that numbers the rows (a running counter, e.g. mycounter).
Put a Conditional Split transformation after it, where the condition is mycounter == 0.
One path from the conditional split will be the first row of the file (mycounter == 0) and the other path will be the rest of the rows (2 in your example).
Note #1: the file source can set only one set of metadata for each column in the source. This means that if your first column of data is a string (Mr, Ms, ...) then you have to set it as a string data type in the source. Otherwise, if you set it as an integer (DT_Ix) it will fail as soon as it encounters a row with string data (Mr, Ms, ...) in the first column of the file. This applies to all columns, not just the first one.
Note #2: SSIS will see only the number of columns you told it to. This means that you have to have the same number of columns in EACH row. Otherwise, you have a ragged CSV file and you need to take another approach - search the Internet. But those solutions also require a different layout of CSV.
Answers in the following links explain how to load parent-child data from a flat file into an SQL Server database when both parent and child rows exist in the same file next to each other.
How do I split flat file data and load into parent-child tables in database?
How to load a flat file with header and detail data into a database using SSIS package?

SQL import wizard drops leading zero

I've read all the posts about prefixing the numbers with ', and setting IMEX=1 in the connection string; nothing seems to do the trick for me.
Here's the setup: an Excel column with mixed data - 99% numbers (some starting with 0), 1% text.
Programmatically importing into a SQL Server 2005 table; the column type is varchar(255).
Import works fine locally, but once I move the code to production (GoDaddy), it drops the leading 0's in the column.
Any ideas?
p.s. I knew about the registry change solution, matter of fact - the value was set to 0 on my dev box, but the answer made me realize that the value wasn't set on the PRODUCTION SERVER :)
The ISAM driver only samples the first 8 rows, but you can change that behaviour through a registry change:
http://sqlserversd.wordpress.com/2008/09/14/ssis-excel-values-import-as-nulls/
But yes, using Excel for machine-to-machine data transfer is a nightmare...Is there no other way you can be sent the data?
Yes. The Excel driver only samples the first 8 rows to determine the data type.
This means that it assumes the column is numeric if "bob" does not appear in rows 1 to 8.
The target table column datatype is irrelevant.
This issue has been around for a long time; I first saw it in 2003.
BOL notes on excel import
We usually save the file as a .csv file or .txt file and then the issue doesn't occur.
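If the zeros have already been lost and the codes have a known fixed width, another option is to re-pad them after the load; a rough sketch with a hypothetical table/column and an assumed six-character code:
UPDATE dbo.ImportedData
SET CustNo = RIGHT('000000' + CustNo, 6)   -- re-pad to 6 characters
WHERE CustNo IS NOT NULL
AND LEN(CustNo) < 6;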
There is a quick and tricky way to handle this. Follow these steps, but first copy all the data (columns and rows) from the actual Excel sheet into another Excel sheet, just to be on the safe side, so that you have the original data to compare with.
Steps:
Copy all the values in the column and paste them into a notepad.
Now change the column type to Text in the Excel sheet (it will trim the leading/trailing zeros; don't worry about that).
Go to Notepad and copy all the values that you have pasted just now.
Go to your excel sheet and paste the values in same column.
If you have more than one column with 0 values then just repeat the same.
Now your Excel document is ready to be imported with the zero values intact :).
Happy Days.

How to import variable record length CSV file using SSIS?

Has anyone been able to get a variable record length text file (CSV) into SQL Server via SSIS?
I have tried time and again to get a CSV file into a SQL Server table, using SSIS, where the input file has varying record lengths. For this question, the two different record lengths are 63 and 326 bytes. All record lengths will be imported into the same 326 byte width table.
There are over 1 million records to import.
I have no control of the creation of the import file.
I must use SSIS.
I have confirmed with MS that this has been reported as a bug.
I have tried several workarounds. Most have been where I try to write custom code to intercept the record, and I can't seem to get that to work as I want.
I had a similar problem, and used custom code (Script Task), and a Script Component under the Data Flow tab.
I have a Flat File Source feeding into a Script Component. Inside there I use code to manipulate the incoming data and fix it up for the destination.
My issue was that the provider was using '000000' to mean no date available, and another column had a padding/trim issue.
You should have no problem importing this file. Just make sure when you create the Flat File connection manager, select the Delimited format, then set the SSIS column length to the maximum file column length so it can accommodate any data.
It appears you are using the Fixed width format, which is not correct for CSV files (since you have variable-length columns), or maybe you've incorrectly set the column delimiter.
Same issue. In my case, the target CSV file has header & footer records with formats completely different than the body of the file; the header/footer are used to validate completeness of file processing (date/times, record counts, amount totals - "checksum" by any other name ...). This is a common format for files from "mainframe" environments, and though I haven't started on it yet, I expect to have to use scripting to strip off the header/footer, save the rest as a new file, process the new file, and then do the validation. Can't exactly expect MS to have that out-of-the box (but it sure would be nice, wouldn't it?).
You can write a script task using C# to iterate through each line and pad it with the proper amount of commas to pad the data out. This assumes, of course, that all of the data aligns with the proper columns.
I.e. as you read each record, you can "count" the number of commas. Then, just append X number of commas to the end of the record until it has the correct number of commas.
Excel has an issue that causes this kind of file to be created when converting to CSV.
If you can do this "by hand" the best way to solve this is to open the file in Excel, create a column at the "end" of the record, and fill it all the way down with 1s or some other character.
Nasty, but can be a quick solution.
If you don't have the ability to do this, you can do the same thing programmatically as described above.
Why can't you just import it as a text file and set the column delimiter to "," and the row delimiter to CRLF?
