Exact matching w/ SSIS Fuzzy matching - sql-server

I'm using the Fuzzy Matching SSIS component to fuzzy-match new names (input column Name1) against known names (lookup column Alias). On its own this works great and is fast. However, when I restrict the matching to records that have the same Country code, SSIS becomes so slow that it never completes. Both country columns are char(3) ISO codes.
I've attempted every variation of indexing available on the [Reference Table] tab and I believe I'm using the combination of Relationships correctly per https://msdn.microsoft.com/en-us/library/ms186488.aspx
Anyone run into a similar issue and figure out a workable solution?
Adding answer to question asked in comments here (better formatting);
SQL Server version : 2014 Ent
SSIS Version : not sure, but created w/ SQL Server data tool for VS 2013
Data Volume Source : 65K
Data Volume Data to Match : 105K (but SSIS pipeline gets stuck around 10k)
SQL Server indicates it's waiting for SSIS to pull more data in
Looking at task manager DTSdebug is using ~15% CPU and will do so indefinitely
The really odd thing is that if I remove the country matching (which is set to exact) and use a larger source set (172K vs 65K), SSIS runs wonderfully.
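To make the intended semantics concrete, here is a minimal Python sketch of "exact match on country, fuzzy match on name". It is not an SSIS solution and says nothing about how the Fuzzy Matching component works internally. The function names, the sample rows, and the 0.8 threshold are assumptions made for illustration, and `difflib.SequenceMatcher` merely stands in for whatever similarity measure SSIS uses.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def index_by_country(reference_rows):
    """Group reference aliases by country code (hypothetical helper)."""
    by_country = defaultdict(list)
    for alias, country in reference_rows:
        by_country[country].append(alias)
    return by_country


def best_match(name, country, by_country, threshold=0.8):
    """Return (alias, score) for the best fuzzy match within the same country.

    Only aliases whose country equals `country` are compared (the exact
    part); the name comparison is fuzzy. Returns None if nothing reaches
    the assumed threshold.
    """
    best = None
    for alias in by_country.get(country, []):
        score = SequenceMatcher(None, name.lower(), alias.lower()).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (alias, score)
    return best


# Made-up sample data: (Alias, Country) pairs.
reference = [
    ("Jon Smith", "USA"),
    ("John Smith", "GBR"),
    ("Maria Garcia", "ESP"),
]
by_country = index_by_country(reference)

print(best_match("John Smith", "USA", by_country))
print(best_match("John Smith", "ESP", by_country))
```

Grouping the reference rows by country first means each incoming name is compared only against candidates that already satisfy the exact condition, which is the behaviour the question is trying to get out of the component.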

Related

Visual Studio Integration Services becomes unresponsive

I am developing ETL solutions in Visual Studio and as soon as I select a view from a SQL Server database, the Visual Studio freezes, and clicking anywhere results in the following notification: "Visual Studio is Busy".
It is very frustrating and I cannot finish creating my solution.
Any advice for making it faster and more responsive?
I. What happens when selecting a view as an OLE DB Source?
I created a SQL Server Profiler trace to track all T-SQL commands executed against the AdventureWorks2017 database while I select the [HumanResources].[vEmployee] view as an OLE DB Source.
The trace shows that the following command is executed twice:
set rowcount 1
select * from [HumanResources].[vEmployee]
This means that the OLE DB source limits the result set of the query to a single row and executes the SELECT * command against the selected view in order to extract the required metadata.
It is worth mentioning that SET ROWCOUNT 1 causes SQL Server to stop processing the query after the specified number of rows is returned. This means that only one row is requested, not all of the view's data.
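The same idea, asking for at most one row purely to learn the column metadata, can be illustrated with a small Python sketch. This uses the standard-library `sqlite3` module as a stand-in for SQL Server, and `LIMIT 1` plays the role that `SET ROWCOUNT 1` plays in the trace above. The table, the view, and the `column_names` helper are hypothetical.

```python
import sqlite3


def column_names(conn, relation):
    """Return the column names of a table or view by requesting one row.

    `relation` is assumed to be a trusted identifier; it is interpolated
    directly for the sake of a short illustration.
    """
    cur = conn.execute(f'SELECT * FROM "{relation}" LIMIT 1')
    # cursor.description holds one 7-tuple per column; item 0 is the name.
    return [col[0] for col in cur.description]


# Hypothetical in-memory schema standing in for a real database view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'Ada')")
conn.execute("CREATE VIEW v_employee AS SELECT id, name FROM employee")

print(column_names(conn, "v_employee"))
```

Note that limiting the rows returned does not by itself make an expensive view cheap: the engine may still have to do substantial work before it can produce the first row.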
II. Issue's possible reasons
The issue you mentioned mostly happens due to the following reasons:
(1) Third-party extensions installed in Visual Studio
In that case, try starting Visual Studio in safe mode to prevent third-party extensions from loading. You can use the following command:
devenv.exe /safemode
(2) View querying a large amount of data
Visual Studio may freeze if the view returns a huge amount of data or contains poorly written JOINs. You may solve this with a simple workaround: alter the view's SQL and add a condition that returns only a few rows (for example, SELECT TOP 1). Then use this view while designing the package. Once done, remove the added condition.
(3) Bad database design
Moreover, it is highly important that your views are well designed and that the underlying tables have the appropriate indexes. Besides, check that you don't have any issues related to the database design. For example:
(a) Index fragmentation
Index fragmentation is reported as a percentage, which can be retrieved from the SQL Server DMVs (for example, sys.dm_db_index_physical_stats). You can refer to the following article for more information:
How to identify and resolve SQL Server Index Fragmentation
(b) Large binary column
Make sure that the view does not include large binary columns, since they significantly slow query execution.
Best Practices for tables with VARBINARY(MAX)
How Your SQL Server Data Type Choices Can Affect Database Performance
(4) Hardware issues
Although I do not think this is the cause in this case, check the available resources on your machine. For example:
(a) Drive out of storage
If you are using Windows, check the storage on the C: drive (the default system databases directory) and on the drive where the databases are stored, and make sure they are not full.
(b) Server is out of memory
Make sure that your machine is not running out of memory. You can simply use the Task Manager to identify the amount of available memory.
(5) Optimizing Visual Studio performance
The last thing to mention is that there are several recommendations to improve the performance of Visual Studio. Feel free to check them:
Optimize Visual Studio performance
This can sometimes happen when you validate a SELECT statement against a huge table. Depending on the RDBMS, some data sources do a poor job of returning only metadata during validation and instead run SELECT * against the table, so validation can seem to take forever.
To check whether this is actually happening, look at the queries running on the RDBMS when you load the package.
Otherwise, try copying the package, switching to the XML view, and rebuilding it piece by piece until you find the issue. Remove the problem element from the XML file, save, and redraw it in the designer.

Importing a CSV into SQL Server - Truncation

I'm trying to import data into SQL Server using SQL Server Management Studio and I keep getting the "output column... failed because truncation occurred" error. This is because I'm letting the Studio autodetect the field length which it isn't very good at.
I know I can go back and extend the column length, but I'm thinking there must be a better way to get it right the first time without having to manually work out how long each column is.
I know that this must be a common issue but my Google searches aren't coming up with anything as I'm more looking for a technique rather than a specific issue.
One approach you may take, assuming the import is not something which would take hours to complete, is to just set every text column to VARCHAR(MAX), and then complete the CSV import. Once you have the actual table in SQL Server, you can inspect each column using LEN to see how wide it is. Based on that, you can either alter columns, or you could just take notes, drop the table, and reimport using appropriate widths.
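A different route to the same numbers, shown here only as a hedged sketch, is to pre-scan the CSV file before importing it instead of loading it and measuring with LEN afterwards. The Python snippet below reports the longest value seen in each column. The `max_column_widths` name and the sample data are assumptions for illustration, and the file is assumed to have a header row.

```python
import csv
import io


def max_column_widths(csv_file):
    """Return {column name: longest value length} for a CSV with a header row.

    Lengths are counted in characters, which matches byte length only for
    single-byte data; leave some headroom when choosing VARCHAR sizes.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    widths = [0] * len(header)
    for row in reader:
        # Ignore any extra fields beyond the header in malformed rows.
        for i, value in enumerate(row[:len(header)]):
            widths[i] = max(widths[i], len(value))
    return dict(zip(header, widths))


# Made-up sample; for a real file use open(path, newline="") instead.
sample = io.StringIO("name,city\nAlice,Rotterdam\nBartholomew,Oslo\n")
widths = max_column_widths(sample)
print(widths)
```

The reported widths can then be used to choose column lengths in the import wizard, ideally with a margin in case later files contain longer values.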
You should look into using SSIS for this task. There is a fixed up-front cost in setting up the import process for the CSV file and creating a physical table in the database. Ultimately, though, you will be able to set the data type for each column/field in your file. SSIS will also let you transform or reformat the data, among other things.
I would suggest downloading Visual Studio and SQL Server Data Tools. The latter contains the tools you would need to complete this task, including SSIS, SSRS, and SSAS.
The main point is being able to automate this task, especially if it's an ongoing project of uploading csv files into the database.

Convert or output SSIS package/job to SQL script?

I understand this may be a little far-fetched, but is there a way to take an existing SSIS package and get an output of the job it's doing as T-SQL? I mean, that's basically what it is, right? Transferring data from one database to another can be done with T-SQL as well.
I'm wondering this because I'm trying to get away from using SSIS packages for data transfer and instead use EF/LINQ to do this on the fly in my application. Currently I have an SSIS package that transfers and formats data from one database to another in preparation for export to an Excel file. This SSIS package runs nightly and helps speed up the generation of the Excel file, because once the data is transferred to the second database it is already formatted correctly.
However, if I could leverage EF and maybe some LINQ to SQL to format the data from the first database on the fly and export it to Excel quickly without having to use this second database, that would be great. So, can my original question be done? Can I extract the T-SQL representation of an SSIS package somehow?
SSIS packages are not exclusively T-SQL. They can consist of custom back-end code, file system changes, Office document creation steps, and so on, to name only a few. As a result, converting the entirety of an SSIS package's work into T-SQL isn't possible, because the full breadth of its work isn't limited to SQL Server.

Access to SQL migration. Error GROUP-BY expression must contain at least one column that is not an outer reference

I have migrated the tables of an Access 2010 database to SQL Server 2012 and linked them to the Access front end (FE). I have pass-through queries with Returns Records set to No. I use the final result tables, which are linked in the Access FE, for reporting purposes.
Each report has a number of SQL pass-through queries, which are run when the Mass run report selection is executed.
I have tested each query individually and they work fine.
Scenario 1 :
If I keep the access local tables and use the linked source tables for loading the final table , it works fine but they take an enormous amount of time. This was one reason for migration.
Scenario 2:
When I use the approach I mentioned in the description, I get an error stating "ODBC_failed":
SQL Native Client 11.0 Each GROUP BY expression must contain at least one column that is not an outer reference. I have gone through each query to check for the GROUP BY error. I have even removed the query that uses GROUP BY, but I still get the same error.
I have been researching this for almost a week and have tried a lot of suggestions, but I'm now at a dead end. The error message is not helping in any way.
I would really appreciate if someone could suggest me with some tips.

Changing VC++6 app database from Access to SQL Server - can I use linked tables?

We have a Visual C++ 6 app that stores data in an Access database using DAO. The database classes have been made using the ClassWizard, basing them on CDaoRecordset.
We need to move from Access to SQL Server because some clients have huge (1.5 GB+) databases that are really slow to run reports on (using Crystal Reports and a different app).
We're not too worried about performance on this VC++ app - it is downloading data from data recorders and putting it in the database.
I used the "Microsoft SQL Server Migration Assistant 2008 for Access" to migrate my database from Access into SQL Server - it then linked the tables in the original Access database. If I open the Access database then I can browse the data in the SQL Server database.
I've then tried to use that database with my app and keep running into problems.
I've changed all my recordsets to be dbOpenDynaset instead of dbOpenTable. I also changed the myrecordsetptr->open() calls to be myrecordsetptr->open(dbOpenDynaset, NULL, dbSeeChanges) so that I don't get an exception.
But... I'm now stuck getting an exception 3251 - 'Operation is not supported for this type of object' for one of my tables when I try to set the current index using myrecordsetptr->SetCurrentIndex(_T("PrimaryKey"));
Are there any tricks to getting the linked tables to work without rewriting all the database access code?
[UPDATE 17/7/09 - thanks for the tips - I'll change all the Seek() references to FindFirst() / FindNext() and update this based on how I go]
Yes, but I don't think you can set/change the index of a linked table in the recordset, so you'll have to change the code accordingly.
For instance, if your code expects to set an index and call Seek, you'll basically have to rewrite it to use the Find methods instead.
Why are you using SetCurrentIndex when you have moved your table from Access to SQL Server?
I mean, you are using Access only for the linked tables.
Also, according to the documentation, SetCurrentIndex can be used with table-type recordsets.
In what context are you using the command SetCurrentIndex? If it's a subroutine that uses SEEK you can't use it with linked tables.
Also, it's Jet-only and isn't going to be of any value with a different back end.
I advise against the use of SEEK (even in Access with Jet tables) except for the most unusual situations where you need to jump around a single table thousands of times in a loop. In all other DAO circumstances, you should either be retrieving a limited number of records by using a restrictive WHERE clause (if you're using SEEK to get to a single record), or you should be using .FindFirst/FindNext. Yes, the latter two are proportionally much slower than SEEK, but they are much more portable, and also the absolute performance difference is only going to be relevant if you're doing thousands of them.
Also, if your SEEK is on an ordered field, you can optimize your navigation by checking whether the sought value is greater or lesser than the value of the current record, and choosing .FindPrevious or .FindNext, accordingly (because the DAO recordset Find operations work sequentially through the index).
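The direction check described above can be sketched in Python. This is not DAO code: a sorted list stands in for a recordset ordered on the sought field, an integer index stands in for the current record, and stepping forward or backward stands in for .FindNext and .FindPrevious. The function name and sample values are assumptions for illustration.

```python
def find_from_current(values, current, target):
    """Sequentially search a sorted list, starting at index `current`.

    Compares `target` with the current value to decide whether to scan
    forward or backward, and stops early once the scan has passed the
    position where `target` would have to be. Returns the index of
    `target`, or None if it is not present.
    """
    if values[current] == target:
        return current
    step = 1 if target > values[current] else -1
    i = current + step
    while 0 <= i < len(values):
        if values[i] == target:
            return i
        # On sorted data, overshooting the target means it is absent.
        if (step == 1 and values[i] > target) or (step == -1 and values[i] < target):
            return None
        i += step
    return None


ids = [10, 20, 30, 40, 50]  # made-up ordered key values

print(find_from_current(ids, 2, 50))  # scans forward from 30
print(find_from_current(ids, 2, 10))  # scans backward from 30
print(find_from_current(ids, 2, 35))  # absent value
```

Choosing the direction first avoids scanning from the start of the ordered data on every lookup, which matters only when many lookups are performed in a loop.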
