Is it possible to perform T-SQL fuzzy lookup without SSIS? - sql-server

SSIS 2005/2008 does fuzzy lookups and groupings. Is there a feature that does the same in T-SQL?

Fuzzy lookup uses a q-gram approach, breaking strings up into tiny substrings and indexing them. You can then search the input by breaking it up into equally sized substrings. You can inspect the format of their index and write a CLR function to use the same style of index, but you might be talking about a fair chunk of work.
It is actually quite interesting how they did it, very simple yet provides very robust matching and is very configurable.
From what I recall of the index when I last looked at it, each q-gram or substring is stored in a row in a table (the index). That row contains an nvarchar column (among other values) that is used as binary data and contains references to the rows that match.
There is also an open feedback suggestion on Microsoft Connect for this feature.
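The q-gram idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the actual SSIS index format: the padding character, the gram length of 3, the Jaccard scoring, and the 0.5 threshold are all assumptions chosen for the example, as are the function names.

```python
from collections import defaultdict


def qgrams(text, q=3):
    """Split text into overlapping substrings of length q, padded at both ends."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_index(rows, q=3):
    """Map each q-gram to the set of row ids whose value contains it."""
    index = defaultdict(set)
    for row_id, value in rows.items():
        for gram in qgrams(value, q):
            index[gram].add(row_id)
    return index


def fuzzy_lookup(index, rows, query, q=3, threshold=0.5):
    """Return (row_id, score) pairs, best first, scored by q-gram set overlap."""
    query_grams = set(qgrams(query, q))
    # Only rows sharing at least one q-gram with the query are candidates.
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    results = []
    for row_id in candidates:
        row_grams = set(qgrams(rows[row_id], q))
        # Jaccard similarity: shared grams divided by all distinct grams.
        score = len(query_grams & row_grams) / len(query_grams | row_grams)
        if score >= threshold:
            results.append((row_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data for illustration only.
customers = {1: "Stonehouse", 2: "Stonehowse", 3: "Smith"}
index = build_index(customers)
matches = fuzzy_lookup(index, customers, "Stonehouse")
```

Here the misspelled "Stonehowse" still shares most of its q-grams with the query and is returned as a match, while "Smith" shares only one and falls below the threshold.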

SQL Server has a SOUNDEX() function:
SELECT *
FROM Customers
WHERE SOUNDEX(Lastname) = SOUNDEX('Stonehouse')
AND SOUNDEX(Firstname) = SOUNDEX('Scott')

Full Text Search is a great fuzzy tool. Brief primer here

On March 5 2009 I will have an article posted on www.sqlservercentral.com with a sample of Jaro-Winkler TSQL

Related

Fuzzy Logic Lookup - How to use calculated columns

We're starting to implement Unicode as we've added some international customers. There are some issues comparing character data in SSIS because of capitals, accents, and other data problems.
I've thought that the Fuzzy logic lookup could be a good solution. However, when testing this solution out, I realized that in a lot of our existing code we limit what data to process, and send in those values by parameters.
I've noticed that in the Fuzzy Lookup I can specify the name of the table, but I can't make changes such as removing a % from a field and turning it into a decimal. Any ideas how we can set up the lookup with calculated fields?
Thanks!
Create a view in your database that applies the transformation you require using a SQL query.

Algorithms for key value pair, where key is string

I have a problem involving a huge list of strings or phrases that might scale from 100,000 to 100 million. When I search for a phrase and it is found, I get back the ID or index into the database for further operations. I know a hash table can be used for this, but I am looking for another algorithm that can build an index over the strings and also be useful for other features such as autocomplete.
I read about suffix trees/arrays in some SO threads. They serve the purpose but consume a lot more memory than I can afford. Are there any alternatives?
My search is only over a huge list of millions of strings. There are no documents and no web pages, so I am not interested in a search engine like Lucene.
I also read about inverted indexes, which sound helpful, but which algorithm do I need to study for that?
If this Database index is within MS SQL Server you may get good results with SQL Full Text Indexing. Other SQL providers may have a similar function but I would not be able to help with those.
Check out: http://www.simple-talk.com/sql/learn-sql-server/understanding-full-text-indexing-in-sql-server/
and
http://msdn.microsoft.com/en-us/library/ms142571.aspx

Can Joins between views and table hurt performance?

I am new to SQL Server.
My manager has given me a task where I have to assess the performance of a query in SQL Server 2008.
The query is very complex, with joins between views and tables. I read on the internet that joins between views and tables hurt performance. Is that true?
Can any expert help me with this? Is there a good link where I can learn about it? How do I measure query performance in SQL Server?
Look at the query plan - it could be anything. Missing indexes on one of the underlying view tables, missing indexes on the join table or something else.
Take a look at this article by Gail Shaw about finding performance issues in SQL Server - part 1, part 2.
A view (that isn't indexed/materialised) is just a macro: no more, no less
That is, it expands into the outer query. So a join with 3 views (with 4, 5, and 6 joins respectively) becomes a single query with 15 JOINs.
This is a common "I can reuse it" mistake: DRY doesn't usually apply to SQL code.
Otherwise, see Oded's answer
Oded is correct, you should definitely start with the query plan. That said, there are a couple of things that you can already see at a high level, for example:
CONVERT(VARCHAR(8000), CN.Note) LIKE '%T B9997%'
LIKE searches with a wildcard at the front are bad news in terms of performance, because of the way indexes work. If you think about it, it's easy to find all people in the phone book whose name starts with "Smi". If you try to find all people who have "mit" anywhere in their name, you will find that you need to read the entire phone book. SQL Server does the same thing - this is called a full table scan, and is typically quite slow. The other issue is that the left side of the condition uses a function to modify the column (specifically, converting it to a varchar). Essentially, this again means that SQL Server cannot use an index, even if there was one for the column CN.Note.
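The phone book analogy can be made concrete with a small Python sketch. It is only a model of the idea, not of how SQL Server stores or reads an index: a sorted list stands in for the index, and the names and helper functions are invented for the example.

```python
from bisect import bisect_left

# A sorted list plays the role of an index on the name column.
names = sorted(["Adams", "Schmitt", "Smith", "Smithers", "Smits", "Jones", "Hammit"])


def starts_with(sorted_names, prefix):
    """Like LIKE 'Smi%': binary-search to the first candidate, then stop at the first miss."""
    position = bisect_left(sorted_names, prefix)
    matches = []
    while position < len(sorted_names) and sorted_names[position].startswith(prefix):
        matches.append(sorted_names[position])
        position += 1
    return matches


def contains(sorted_names, fragment):
    """Like LIKE '%mit%': the sort order does not help, so every entry is inspected."""
    return [name for name in sorted_names if fragment in name]


prefix_hits = starts_with(names, "Smi")
substring_hits = contains(names, "mit")
```

The prefix search touches only the run of adjacent entries that match, while the substring search has to look at the whole list no matter how it is sorted.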
My guess is that the column is a text column, and that you will not be allowed to change the filter logic to remove the wildcard at the beginning of the search. In this case, I would recommend looking into Full-Text Search / Indexing functionality. By enabling full text indexing, and using specific keywords such as CONTAINS, you should get better performance.
Again (as with all performance optimisation scenarios), you should still start with the query plan to see if this really is the biggest problem with the query.

SQL Server indexing for search criteria

In a SQL Server 2008 database we have a table with 10 columns. In a web application the user interface is designed to allow the user to specify search criteria on some or all of the columns. The web application calls a stored procedure which dynamically creates a SQL statement with only the specified options in the WHERE clause, then executes the query using sp_executesql.
What is the best way to index these columns? We currently have 10 indexes, each one with a different column. Should we have 1 index with all 10, or some other combination?
The bible on optimizing dynamic search queries was written by SQL Server MVP Erland Sommarskog:
http://www.sommarskog.se/dyn-search.html
For SQL Server 2008 specifically:
http://www.sommarskog.se/dyn-search-2008.html
There is a lot of information to digest here, and what you ultimately decide will depend on how the queries are formed. Are there certain parameters that are always searched on? Are there certain combinations of parameters that are usually requested together? Can you really afford to create an index on every column (remember that not all will necessarily be used even if multiple columns are mentioned in the where clause, and additional indexes are not "free" - you pay for them in maintenance)?
A compound index can only be used when the leftmost key is specified in the search condition. If you have an index on (A, B, C) it can be used to search values WHERE A=@a, WHERE A=@a AND B=@b, WHERE A=@a AND C=@c or WHERE A=@a AND B=@b AND C=@c. But it cannot be used if the leftmost key is not specified: WHERE B=@b or WHERE C=@c cannot use this index. Therefore 10 indexes, each on a single column, can each be used for one specific user criterion, but 1 index on 10 columns will only be useful if the user includes a criterion on the first column, and useless in all other cases. At least this is the 10000ft answer. There are more details if you start to dig into it.
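The leftmost-key rule can be modelled with a short Python sketch. This is a deliberately simplified picture, assuming equality predicates only and ignoring that the optimizer may still choose to scan an index it cannot seek on. The function name is made up for the example.

```python
def seekable_prefix(index_columns, predicate_columns):
    """Return the leading index columns that an equality search can seek on.

    The walk stops at the first index column that has no predicate, because
    later key columns are only ordered within each value of the earlier ones.
    """
    usable = []
    for column in index_columns:
        if column not in predicate_columns:
            break
        usable.append(column)
    return usable


index_abc = ["A", "B", "C"]
print(seekable_prefix(index_abc, {"A"}))            # seek on A
print(seekable_prefix(index_abc, {"A", "B"}))       # seek on A, B
print(seekable_prefix(index_abc, {"A", "C"}))       # seek on A only, C is filtered afterwards
print(seekable_prefix(index_abc, {"B"}))            # nothing to seek on
```

An empty result corresponds to the case where the index cannot be used for a seek at all.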
For a comprehensive discussion of your problem and possible solutions, see Dynamic Search Conditions in T-SQL.
It completely depends on what the data are: how well they index (e.g. an index on a column with only two values isn't going to help you much), how likely they are to be searched on, and how likely they are to be searched on together.
In particular, if column A is queried a lot, and column B tends only to be queried when also querying column A, a compound index over (A, B) will make queries that search for particular values of both columns very fast, and also give you the benefits of a single index on A (but not B) for free.
It is possible that one index per column makes sense for your data, but more likely not. There will probably be a better trade-off taking into account the nature of your data and schema.
Personally I would not bother using a stored procedure to create dynamic SQL. There are no performance benefits compared to doing it in whatever server-side scripting language you're using in the webapp itself, and the language you're writing the webapp in will almost always have more flexible, readable and secure string handling functions than SQL does. Generating SQL strings in SQL itself is an exercise in pain; you will almost certainly get some escaping wrong somewhere and give yourself an SQL-injection security hole.
One index per column. The problem is that you have no clue about the queries, and this is the most generic way.
In my experience, having combined indexes does make the queries faster. In this case, you can't have all possible combinations.
I would suggest doing some usage testing to determine which combinations are used most frequently. Then focus on indexes that combine those columns. If the most frequent combinations are:
C1, C2, C3
C1, C2, C5
... then make a combined index on C1 and C2.
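One way to do that usage testing is to log which columns each search specifies and count how often pairs of columns occur together. A minimal Python sketch, assuming the log is available as a list of column-name tuples (the log format, the sample data, and the function name are all hypothetical):

```python
from collections import Counter
from itertools import combinations


def most_common_pairs(searches, top=3):
    """Count how often each pair of columns is searched on together."""
    counts = Counter()
    for columns in searches:
        # Sort so that (C2, C1) and (C1, C2) are counted as the same pair.
        for pair in combinations(sorted(columns), 2):
            counts[pair] += 1
    return counts.most_common(top)


# Hypothetical search log: one tuple of column names per executed search.
logged_searches = [
    ("C1", "C2", "C3"),
    ("C1", "C2", "C5"),
    ("C2", "C1", "C3"),
    ("C4",),
]
best = most_common_pairs(logged_searches, top=1)
```

With this sample log the pair (C1, C2) comes out on top, which matches the suggestion of a combined index on C1 and C2.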

SQL Server String Comparison

Is there any information as to how SQL Server compares strings and handles searching in them (LIKE statements)? I am trying to find out if there is a way to determine how efficient it is to store information as a large string and use SQL Server to do a bunch of comparisons on rows to determine which match. I know this is potentially going to be slow (each string of information would be 2400 characters long), but I need something documenting how the string is compared, so I can show the efficiency (or inefficiency) of it.
each string of information would be 2400 characters long
Exactly 2400? So you've got fixed-width fields in there? Save your time and just split it into separate columns. You'll thank yourself later.
If you must have data, set up a test db and try it both ways. Then at least you'll have data that's specific to your system.
Searching in them will be slow because you won't be able to create an index, since an index key can't be over 900 bytes long/wide.
I would do what Joel Coehoorn suggests and split it up into columns.
You also might want to split it up into more tables, because you can only store 3 rows per page with 2400 chars per row.
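The "3 rows per page" figure follows from simple arithmetic on the 8 KB data page. A rough sketch in Python, assuming single-byte char/varchar data and ignoring per-row overhead (which can only lower the count):

```python
PAGE_SIZE = 8192     # bytes in a SQL Server data page
PAGE_HEADER = 96     # bytes reserved for the page header


def rows_per_page(row_bytes):
    """Upper bound on rows per page, ignoring per-row overhead and the slot array."""
    return (PAGE_SIZE - PAGE_HEADER) // row_bytes


single_byte = rows_per_page(2400)      # 2400 chars stored as char/varchar
double_byte = rows_per_page(2400 * 2)  # same string stored as nchar/nvarchar
```

With 2400 bytes per row the bound is 3 rows per page. Stored as two-byte Unicode, the same string takes 4800 bytes and only 1 row fits.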
There are full text search indexes that you can apply to sql server, which are often used for things like search engines. The full text indexes typically allow for boolean logic operators for the search.
Just some additional information to what has already been mentioned: if you need to filter the large string with LIKE, indexes are also not used (unless the wildcard % appears only at the end of the search string). So it's best to avoid LIKE and make the part you need to filter on available in its own field.
In the MSDN Article about Full-Text searches the following is called out regarding how the LIKE predicate uses character patterns.
Comparing LIKE to Full-Text Search
In contrast to full-text search, the LIKE Transact-SQL predicate works
on character patterns only. Also, you cannot use the LIKE predicate to
query formatted binary data. Furthermore, a LIKE query against a large
amount of unstructured text data is much slower than an equivalent
full-text query against the same data. A LIKE query against millions
of rows of text data can take minutes to return; whereas a full-text
query can take only seconds or less against the same data, depending
on the number of rows that are returned.

Resources