SQL Server EXCEPT operator - how to identify culprit columns

Just to give some background: we have created new SSIS packages to load feeds into SQL Server tables.
To make sure the new SSIS packages load data in the same manner as the current live process, we have written a comparison script to compare the Live and Test tables; both processes will run in parallel for a month.
I have used the EXCEPT operator to get the differences, and it does return the rows that differ.
The problem is that I have around 50 columns, and I need to check the data of each column against the other table to find the culprit.
In some cases I am getting around 40,000 rows of differences, and identifying the offending columns is too cumbersome.
In most cases the cause was a NULL value in Live and a blank value in Test. I have done the following to identify the columns:
1. Limit the number of columns.
2. Look at the values of a few columns, then change the SQL statement to include the ISNULL function, and so on (see the sketch below).
This is time consuming.
Is there a good way to get the names of the columns that cause the row differences?
It would be good if I could find a better way to handle this.
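For illustration, this is the shape of the comparison and of the per-column ISNULL adjustment described above. It is a sketch only: LiveTable, TestTable, the ID key, and Col1/Col2 are placeholder names, and string columns are assumed.

    -- The EXCEPT comparison that produces the difference rows:
    SELECT ID, Col1, Col2 FROM dbo.LiveTable
    EXCEPT
    SELECT ID, Col1, Col2 FROM dbo.TestTable;

    -- The manual per-column check, treating NULL and blank as equal via ISNULL
    -- and listing which columns differ for each key:
    SELECT l.ID,
           CASE WHEN ISNULL(l.Col1, '') <> ISNULL(t.Col1, '') THEN 'Col1 ' ELSE '' END
         + CASE WHEN ISNULL(l.Col2, '') <> ISNULL(t.Col2, '') THEN 'Col2 ' ELSE '' END
             AS DifferingColumns
    FROM dbo.LiveTable AS l
    JOIN dbo.TestTable AS t ON t.ID = l.ID
    WHERE ISNULL(l.Col1, '') <> ISNULL(t.Col1, '')
       OR ISNULL(l.Col2, '') <> ISNULL(t.Col2, '');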

Related

SSIS 2005 Fuzzy grouping does not use the representative value for all grouped records

I am using fuzzy grouping to find possible duplicate addresses in my dataset. I grouped the records and everything was fine until I noticed a potential issue in the grouping procedure that starts around a specific record: from that point on, the procedure seems to make several mistakes. For example, as the attached screenshot shows, although the procedure recognizes records 15373 and 15374 as belonging to the same group, it does not assign them the same PTNOME_Clean. Surprisingly, the issue only emerges at a certain stage. The dataset is ordered by the _key_out column.

Query SQL database from Excel

I'm attempting to create an MS Query to return data from a SQL database based on a value from a cell in Excel. I have actually accomplished this successfully, but only for one row. I can't figure out how to get it to copy down to other rows.
I've created a connection as follows:
Notice that the SQL statement includes a parameter. The parameter is set to point to a specific cell:
I guess this makes sense, as I'm only looking to return one value per row. The problem is that I have multiple lines to return values for. How do I return a value per row for multiple rows?
I've tried changing the cell reference in the Parameters dialog box, but this does not work because the Excel table is designed to grow dynamically.
Excel data connections work in such a way that every connection has only one SQL query. So in order to do what you're looking for, you would need many connections, and that's not best practice.
However, there are two ways you can solve this situation:
1. Make a single connection that pulls all of the data and create a pivot table based on it. Then use VLOOKUP/INDEX to bring the data into the requested cells (see the sketch below).
2. If the data is too big, you can use VBA code to create a smaller query based on the cells you mentioned and then continue as described in the first option.
Good luck.
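For reference, here is a minimal sketch of the two query shapes involved; dbo.Orders, OrderID, and OrderTotal are hypothetical names.

    -- Per-cell shape: the "?" parameter is bound to a single worksheet cell
    -- in the Parameters dialog, so it can serve only one row at a time.
    SELECT OrderTotal
    FROM dbo.Orders
    WHERE OrderID = ?;

    -- Pull-everything shape (option 1): load all rows through one connection,
    -- then match them to worksheet rows with VLOOKUP/INDEX on OrderID.
    SELECT OrderID, OrderTotal
    FROM dbo.Orders;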

How can I minimize validation intervals when changing the SQL in ADO.NET Source tasks

Part of an SSIS package imports data from an external database via a SQL command embedded in an ADO.NET Source data flow component. Whenever I make even the slightest adjustment to the query (such as changing a column name), it takes ages (1-2 hours in this case) until validation has finished. The query itself returns around 30,000 rows with 20 columns each.
Is there any way to cut these long intervals or is this something I have to live with?
I usually store the source queries in a table, and the first part of my package executes a SELECT that stores the query returned from the table in a package variable, which is then used by the ADO.NET Source data flow. For the variable's design-time default value I use the same query that is stored in the database, with "WHERE 1=2" appended at the end. Hence during design time it does execute the query, but it only returns the column metadata. Let me know if you have any questions.
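A minimal sketch of that setup; dbo.PackageQueries, dbo.SourceInvoices, and the column names are hypothetical.

    -- Run time: fetch the live source query into the package variable.
    SELECT SourceQuery
    FROM dbo.PackageQueries          -- placeholder config table
    WHERE PackageName = 'InvoiceImport';

    -- Design-time default value of the variable: same column list as the live
    -- query, but "WHERE 1 = 2" guarantees zero rows, so validation only has
    -- to read column metadata instead of running the full query.
    SELECT InvoiceID, InvoiceDate, Amount
    FROM dbo.SourceInvoices
    WHERE 1 = 2;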

SQLBulkCopy and Dates (1/1/1753)

I've got an application which has been working fine for quite a while, but there is an annoying item that continues to get in the way on occasion.
Let's say that I use an object such as OracleDataReader or MySqlDataReader to pass the data to the SqlBulkCopy object for insert. Let's assume that all the columns map just fine and, for the most part, it all works well.
Granted, I don't have control over the source application or database (which is either MySQL or Oracle). So some goof goes into a different application and puts a date of 5/31/0210 into the invoice table. He really meant to put in 5/31/2010, but the application he's using doesn't validate the data very tightly, and the Oracle database accepts it. For all intents and purposes, 5/31/0210 is a valid date for the Oracle db. It might be stupid in terms of data entry, but it is what it is at this point.
Now our OracleDataReader comes along and transfers this invoice table over to SQL Server via SqlBulkCopy. It passes the data to a perfectly matched table with the right column names and data types. You can see what is going to happen: the date of 05/31/0210 from Oracle is not accepted by the SQL Server engine, because the DATETIME type only allows dates from 1/1/1753 to 12/31/9999.
When it encounters this record, it simply fails with an overflow error. It doesn't skip the record; it kills the feed. So if it happens a thousand records into a million-record table, you don't get the remaining 999,000 records.
Is there any way to get around this issue so that the feed will continue?
Ideally, I'd like to move the receiving SQL Server DB to 2008 and use DATETIME2, which would allow these goofy dates, but unfortunately not all my clients are ready to move to that version yet, so I'm stuck with DATETIME in SQL 2000/2005/2008.
Any ideas on how to get around this without changing the SQL? Ideally, I wouldn't mind if it just skipped the record. I know that I could do this in the SQL for the data reader, but that would be extremely complicated when you have twenty date fields in a single query. It would be a maintenance nightmare.
Any thoughts would be appreciated.
One option would be to change the datetime column type to varchar, then add a derived column that converts the string back to datetime. The trick is to use a function in the derived column that validates the date and substitutes an arbitrary datetime if the conversion would fail (see the sketch below). If you do heavy date comparisons, persist the computed column and/or index it.
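A minimal sketch of that idea, assuming a hypothetical staging table with the date landed as text. ISDATE returns 0 for strings that will not convert to DATETIME (including years before 1753), so those rows get a sentinel date instead of killing the load.

    -- All names are placeholders.
    CREATE TABLE dbo.InvoiceStaging
    (
        InvoiceID      INT         NOT NULL,
        InvoiceDateRaw VARCHAR(30) NULL,
        InvoiceDate AS CASE WHEN ISDATE(InvoiceDateRaw) = 1
                            THEN CONVERT(DATETIME, InvoiceDateRaw)
                            ELSE CONVERT(DATETIME, '1900-01-01')  -- arbitrary sentinel
                       END
        -- Persisting/indexing this column requires a deterministic expression,
        -- e.g. CONVERT with an explicit style instead of ISDATE.
    );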
I say all of this under the impression that SqlBulkCopy is not able to do transforms. Maybe it can; hopefully someone will chime in with a way to do so.
SSIS would be great in this situation, as you could do the transform and also get the performance benefits of the bulk update lock.

SSIS, splitting a single row into multiple rows

My problem is as follows. I have a CSV file (~100k rows) containing history information with the column format of:
ID1,History1,ID2,History2...ID110,History110
Each row may have anywhere between 0 and 110 history entries. Each separate entry requires a stored procedure to be called.
If there were a small number of possible entries per row, I imagine the way to do this would be to transform the data using a script and send each entry down a unique path. Creating 110 paths would probably work, but isn't very elegant (and would be quite time consuming).
What would the best way to approach this be?
Just load the data (the raw CSV unchanged, one row per file line) into a staging table. Then call a stored procedure that uses a string splitter to break up and loop over the staging-table rows, calling your other procedure for each history entry; a sketch follows below the links.
see: Arrays and Lists in SQL Server 2005 and Beyond
also see this previous answer: SQL comma delimted column => to rows then sum totals?
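A sketch of that approach. Everything here is a placeholder: the staging table dbo.HistoryStaging(RawLine), a splitter function dbo.SplitString(@str, @delim) returning (pos, token) pairs (see the linked articles for real implementations), and the existing per-entry procedure dbo.ProcessHistoryEntry.

    DECLARE @line VARCHAR(MAX), @id VARCHAR(50), @hist VARCHAR(MAX);

    DECLARE line_cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT RawLine FROM dbo.HistoryStaging;

    OPEN line_cur;
    FETCH NEXT FROM line_cur INTO @line;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Pair every ID token (odd position) with the history token after it,
        -- then call the existing procedure once per pair.
        DECLARE pair_cur CURSOR LOCAL FAST_FORWARD FOR
            SELECT i.token, h.token
            FROM dbo.SplitString(@line, ',') AS i
            JOIN dbo.SplitString(@line, ',') AS h ON h.pos = i.pos + 1
            WHERE i.pos % 2 = 1;

        OPEN pair_cur;
        FETCH NEXT FROM pair_cur INTO @id, @hist;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            EXEC dbo.ProcessHistoryEntry @id, @hist;
            FETCH NEXT FROM pair_cur INTO @id, @hist;
        END;
        CLOSE pair_cur;
        DEALLOCATE pair_cur;

        FETCH NEXT FROM line_cur INTO @line;
    END;
    CLOSE line_cur;
    DEALLOCATE line_cur;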
If you want to solve this in SSIS without the staging tables, you could create a destination script component and use a switch statement or a hashtable to look up the right stored procedure to execute for each data row.
It is unclear whether this is a better solution than the staging-table approach above, but it is an alternative.
I know you already accepted an answer, but couldn't you use an Unpivot task to achieve what you wanted to do here? A T-SQL analogue of the idea is sketched below.
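To make the suggestion concrete, here is roughly what the SSIS Unpivot transform would do to the data, expressed in T-SQL; dbo.WideStaging, RowID, and the History columns are placeholders.

    -- Turn the wide History1..History110 columns into one row per entry.
    -- UNPIVOT drops NULL cells, which matches rows having 0-110 entries.
    SELECT RowID, EntrySlot, HistoryValue
    FROM dbo.WideStaging
    UNPIVOT (HistoryValue FOR EntrySlot
             IN (History1, History2, History3 /* ... History110 */)) AS u;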
