Power Query Joins and More Columns than Expected - inner-join

In Microsoft's Power Query, I'm trying to Merge (i.e. Join) two sources, and I'm getting a mysterious error.
I have a source file which is a large csv data dump. I read it in and clean it, unpivot some data, and load it as a Connection-only. Here's a preview of the Connection-Only:
The second source is a 1 column table. Here's its Preview:
Finally, I'm merging them, with this code, and getting this error:
I get the same error using Table.Join. I've tried swapping the order of the two tables in the join, but no joy.
Oddly, everything works fine while I'm in the Query Editor. It's only when I Close and Load that I hit the error.
Any idea what's causing it? Alternatively, any workarounds?

My hypothesis is that one of the values in the CSV file has a comma in the Part No, which makes Power Query think that value spans two columns, leaving you with an extra column.
The reason it works in the Query Editor but not when you Close and Load is probably that the editor only loads a preview of the rows, and the offending row falls outside that preview.
You might be able to resolve the issue by filtering out the row with a comma in it (assuming my hypothesis is correct). Another option would be to load the source as a text file and split the columns yourself in the Query Editor, instead of relying on the CSV import to handle it automatically.

Related

Duplicate SSIS Column Headers

I am taking a view in a SQL DB and placing it into a CSV file using SSIS. Before doing so, I convert everything to Unicode, which gives me two of everything. I was not having this issue until I recently made the change to append a date to the end of my output file by using an expression. I am receiving duplicate rows and everything is just pasted across twice. Any suggestions to only get them to come out once on the CSV? Image below.
I eventually figured it out. When I changed the connection string for my flat file to be an expression, it reset my columns and gave me duplicates. I just had to go into the Flat File Connection Manager and delete the extra columns.

Import Flat File into SQL Server Repeated Headers Break Integrity Constraints

I'm having several issues with importing a flat file into MS SQL Server using the SQL Server import / export wizard. I'd like to know how to effectively load the file into a SQL Server table.
File Conditions:
The flat file is fairly large (800 MB and several million rows)
It's poorly formatted
The first column is empty
The header is a 3 row set: top blank, middle has field names, bottom blank
This 3 row header is repeated approximately every 60,000 rows
Some values are nulls
It's tab delimited
First, I tried to load it as a flat file, but SQL Server failed to recognize the tab delimiters. Excel opens it correctly (although only partially), but SQL Server sticks it all in one column.
Second, I tried opening it in Excel, saving it as an Excel file, and loading that into the SQL Server import wizard (which I'm not sure re-saves all the data anyway). Now SQL Server parses the columns correctly, but it says integrity constraints are broken when it hits the repeated headers (every numeric field gets a string header every 60,000 rows).
If anyone can tell me how to get around this, that would be great. I'd ideally like to load it without the integrity constraints and then remove the extra headers with a DELETE that targets the header and blank rows. That's not the only solution I'll take, but it's an idea.
Also, this is my first stackoverflow post, so patience is appreciated.
Thanks,
Since I don't have a formal answer yet, I'll post what I ended up doing.
Essentially, I made everything a varchar so it would load into a table. Then I wrote several queries to clean up the garbage in it. Later I made new typed fields and filled them with an INSERT and CAST from the varchar fields.
I don't know that this will ever help someone, but at least there's an answer here.
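In case it helps anyone else, here is a minimal T-SQL sketch of that approach. The table names, column names, and header text (RawImport, TypedImport, 'SomeId', and so on) are hypothetical stand-ins, not the actual schema from the file:
-- Staging table: every field loaded as varchar so nothing fails on import
CREATE TABLE dbo.RawImport (
    SomeId     varchar(255),
    SomeAmount varchar(255),
    SomeDate   varchar(255)
);

-- Throw out the repeated header rows and the blank rows around them
DELETE FROM dbo.RawImport
WHERE SomeId IS NULL
   OR LTRIM(RTRIM(SomeId)) = ''
   OR SomeId = 'SomeId';   -- hypothetical header text repeated every ~60,000 rows

-- Copy into the real, typed table with explicit casts
INSERT INTO dbo.TypedImport (SomeId, SomeAmount, SomeDate)
SELECT CAST(SomeId AS int),
       CAST(SomeAmount AS decimal(18, 2)),
       CAST(SomeDate AS datetime)
FROM dbo.RawImport;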

Import data from .xls to table by removing unwanted columns? [duplicate]

I need to import sheets which look like the following:
March Orders
***Empty Row
Week Order # Date Cust #
3.1 271356 3/3/10 010572
3.1 280353 3/5/10 022114
3.1 290822 3/5/10 010275
3.1 291436 3/2/10 010155
3.1 291627 3/5/10 011840
The column headers are actually in row 3. I can use an Excel Source to import them, but I don't know how to specify that the information starts at row 3.
I Googled the problem, but came up empty.
have a look:
the links have more details, but I've included some text from the pages (just in case the links go dead)
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/97144bb2-9bb9-4cb8-b069-45c29690dfeb
Q:
While we are loading the text file to SQL Server via SSIS, we have the
provision to skip any number of leading rows from the source and load
the data to SQL server. Is there any provision to do the same for
Excel file.
The source Excel file for me has some description in the leading 5
rows, I want to skip it and start the data load from the row 6. Please
provide your thoughts on this.
A:
Easiest would be to give each row a number (a bit like an identity in
SQL Server) and then use a conditional split to filter out everything
where the number <=5
http://social.msdn.microsoft.com/Forums/en/sqlintegrationservices/thread/947fa27e-e31f-4108-a889-18acebce9217
Q:
Is it possible during import data from Excel to DB table skip first 6 rows for example?
Also Excel data divided by sections with headers. Is it possible for example to skip every 12th row?
A:
YES YOU CAN. Actually, you can do this very easily if you know the number of columns that will be imported from your Excel file. In
your Data Flow task, you will need to set the "OpenRowset" Custom
Property of your Excel Connection (right-click your Excel connection >
Properties; in the Properties window, look for OpenRowset under Custom
Properties). To ignore the first 5 rows in Sheet1, and import columns
A-M, you would enter the following value for OpenRowset: Sheet1$A6:M
(notice, I did not specify a row number for column M. You can enter a
row number if you like, but in my case the number of rows can vary
from one iteration to the next)
AGAIN, YES YOU CAN. You can import the data using a conditional split. You'd configure the conditional split to look for something in
each row that uniquely identifies it as a header row; skip the rows
that match this 'header logic'. Another option would be to import all
the rows and then remove the header rows using a SQL script in the
database...like a cursor that deletes every 12th row. Or you could
add an identity field with seed/increment of 1/1 and then delete all
rows with row numbers that divide perfectly by 12. Something like
that...
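As a rough T-SQL illustration of that last suggestion (the table and column names here are made up, and it assumes the rows land in the staging table in file order):
-- Staging table with an identity column so each imported row gets a sequential number
CREATE TABLE dbo.ExcelStaging (
    RowNum int IDENTITY(1, 1),
    Col1   varchar(255),
    Col2   varchar(255)
);

-- After the import, delete the repeated section headers (every 12th row in this example)
DELETE FROM dbo.ExcelStaging
WHERE RowNum % 12 = 0;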
http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/847c4b9e-b2d7-4cdf-a193-e4ce14986ee2
Q:
I have an SSIS package that imports from an Excel file with data
beginning in the 7th row.
Unlike the same operation with a csv file ('Header Rows to Skip' in
Connection Manager Editor), I can't seem to find a way to ignore the
first 6 rows of an Excel file connection.
I'm guessing the answer might be in one of the Data Flow
Transformation objects, but I'm not very familiar with them.
A:
rbhro, actually there were 2 fields in the upper 5 rows that had some data that I think prevented the importer from ignoring those rows completely.
Anyway, I did find a solution to my problem.
In my Excel Source object, I used 'SQL Command' as the 'Data Access Mode' (it's a drop-down when you double-click the Excel Source object).
From there I was able to build a query ('Build Query' button) that only grabbed the records I needed. Something like this:
SELECT F4, F5, F6
FROM [Spreadsheet$]
WHERE (F4 IS NOT NULL) AND (F4 <> 'TheHeaderFieldName')
Note: I initially tried an ISNUMERIC instead of 'IS NOT NULL', but
that wasn't supported for some reason.
In my particular case, I was only interested in rows where F4 wasn't
NULL (and fortunately F4 didn't contain any junk in the first 5
rows). I could skip the whole header row (row 6) with the 2nd WHERE
clause.
So that cleaned up my data source perfectly. All I needed to do now
was add a Data Conversion object in between the source and destination
(everything needed to be converted from unicode in the spreadsheet),
and it worked.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
We provide guidance to our customers and vendors about how files must be formatted before we can process them, and it is up to them to meet the guidelines as much as possible. People often aren't aware that files like that create a problem in processing (next month it might have six lines before the data starts), and they need to be educated that Excel files must start with the column headers, have no blank lines in the middle of the data, and not repeat the headers multiple times. Most important of all, they must have the same columns with the same column titles in the same order every time. If they can't provide that, then you probably don't have something that will work for automated import, because you will get the file in a different format every time, depending on the mood of the person who maintains the spreadsheet.
Incidentally, we push really hard to never receive any data from Excel (this only works some of the time, but if they have the data in a database, they can usually accommodate us). They also must know that any changes they make to the spreadsheet format will result in a change to the import package, and that they will be charged for those development changes (assuming these are outside clients and not internal ones). Those changes must be communicated in advance and developer time scheduled; a file with the wrong format will fail and be returned to them to fix if not.
If that doesn't work, may I suggest that you open the file, delete the first two rows, and save it as a text file. Then write a data flow that will process the text file. SSIS does a lousy job of supporting Excel, and anything you can do to get the file into a different format will make life easier in the long run.
My first suggestion is not to accept a file in that format. Excel files to be imported should always start with column header rows. Send it back to whoever provides it to you and tell them to fix their format. This works most of the time.
Not entirely correct.
SSIS forces you to use the format, and quite often it does not work correctly with Excel.
If you can't change the format, consider using our Advanced ETL Processor.
You can skip rows or fields and you can validate the data the way you want.
http://www.dbsoftlab.com/etl-tools/advanced-etl-processor/overview.html
Sky is the limit
You can just use the OpenRowset property you can find in the Excel Source properties.
Take a look here for details:
SSIS: Read and Export Excel data from nth Row
Regards.

What is a better alternative to Excel for loading data to a SQL Server database?

I have a huge amount of trouble loading spreadsheets into a SQL Server database.
Currently, I'm using an SSIS package to load the data and I have had to make lots of adjustments to get the data to load:
All numbers must be formatted as text (otherwise they don't load properly).
Sometimes numbers must be preceded with a single quote (') to get them to load.
If a column has a mix of number cells and text cells, the text cells must come first in the file (otherwise only numbers load and text comes in as NULL).
If a user changes a column name the file will not load.
If a user changes a tab name the file won't load.
If a user adds a new column (even at the end of a sheet) the file won't load.
Extra sheets in the file are not a problem, thankfully!
Dates seem sensitive as to whether or not they will load properly.
Connection strings to the Excel file must include "IMEX=1" or things are worse.
Scheduled SSIS jobs must be run as 32-bit even on a 64-bit system.
I've been loading the data (usually 200,000-500,000 rows per file) into a table with all fields defined as nvarchar. Then, once it's loaded, the next step of the SSIS package transfers that data to the working table with typed data fields.
All of the requirements that I must put on the user for how to format the Excel file are really a pain. We usually have to send the file back multiple times until all the formatting issues are correct before the file will load. I'd like to eliminate this thrash.
I know I'm not the only one that is facing this type of problem. So, I must ask...
What is a better alternative to Excel for loading data into a SQL Server database?
Or, am I going about this the wrong way? Should I be using something other than SSIS to load Excel spreadsheets?
You can try OPENROWSET:
SELECT *
INTO SomeTable
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
    -- HDR=YES reads the first row as column names; IMEX=1 treats mixed-type columns as text
    'Excel 8.0;Database=\\servername\c$\filename.xls;HDR=YES;IMEX=1', [Sheet2$]);
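One caveat, assuming your server still has the default configuration: ad hoc OPENROWSET queries are disabled out of the box, so you (or a sysadmin) may need to enable them first:
-- Allow ad hoc distributed queries so OPENROWSET can be called directly
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;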
Not really a SQL answer, but an easy one:
You could require the users to copy and paste their data into an Excel template in which everything except the data-entry fields is locked. This will prevent many of the pain points described.

Move Error data to another table

How do we redirect error/failed rows to another table in SQL Server while importing data in SSIS 2008?
In the data flow component in question, open Configure Error Output and choose to redirect the row. You may need to add some derived columns after that, and then Union All the errors from different parts of your package together if you want a single unified error output.
Cade's way will work for any errors.
If you have data that you know in advance you want to redirect (say, states that are not in a list of official states, or people with no address), then you can do a conditional split and redirect the rows that way. I prefer to check for known problem issues rather than relying on a failed insert, to avoid sending things to my database that might actually fit in the field but which are data I don't want. For instance, I got a file that had the phrase "Legislative restriction" in the last name field - this clearly wasn't a person, so I redirected the rows. But the actual text would have fit in our LastName field, and the record would have been inserted if I had just relied on the error output.
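If you'd rather do that "known problem" check in T-SQL after a raw staging load instead of inside the data flow, here is a rough sketch (the table and column names are invented for the example):
-- Move rows that fail a known-problem check into an error table
INSERT INTO dbo.ImportErrors (FirstName, LastName, StateCode)
SELECT FirstName, LastName, StateCode
FROM dbo.RawImport
WHERE StateCode NOT IN (SELECT StateCode FROM dbo.ValidStates)
   OR LastName = 'Legislative restriction';

-- Then remove them from staging so only clean rows move on to the real table
DELETE FROM dbo.RawImport
WHERE StateCode NOT IN (SELECT StateCode FROM dbo.ValidStates)
   OR LastName = 'Legislative restriction';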
