MS SQL - WHERE substring matches are phonetic?

I'd like to make a search feature that searches based on a "sounds like" match.
For instance, let's say I have a company list that looks like this (let's say we live in Bizarro world too):
Acme
Already allusion cite LTD
All ready illusion site INC
Apart assent
Assent sight
(Or something similar with names... George or Jeorge? "Yah-way" or "ye-hova"?)
When someone searches for something that "sounds like" "site", they should see results for "Sight" also (soundex("sight") == S230).
As most people who've used SOUNDEX before already know, normal substring matches obviously don't do this.
I'm trying to work out in my head how to make a WHERE clause that can match based on this, so instead of a typical WHERE company LIKE input, I'd like to run a soundex. Obviously if I run soundex on the whole company name, I won't be able to do substring searching (for example, a user searching "ALL" will never match a soundex of "All ready"). Soundex split on each word might not be worthwhile either, so I'm not sure running all combinations of a soundex is a good idea... or even if that's going to be computationally feasible in a database with more than 1000 records.
Basically the interaction I want to have is when (in an office or something) Tom says to Sally "That name was something like Rebekkah Schwartzkopff" and it can be searched phonetically for a fuzzy match.
Obviously we're going to run into issues with non-English company names because of SOUNDEX, but I'm willing to compromise on this one.
I'd like to do this without adding anything to the database, or a stored procedure.

If SOUNDEX is a good beginning for what you are doing, you can use DIFFERENCE.
For example:
SELECT *
FROM Person
WHERE DIFFERENCE(Person.FirstName, 'George') >= 3
Note that the DIFFERENCE function returns the difference between the SOUNDEX values of two strings using a value of 0-4; 4 meaning the strings are pretty close to the same and 0 meaning they are completely different (kind of a backwards scale to me, but I suppose it works).
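To get closer to the substring-style search the question asks for, DIFFERENCE can be applied to each word of the company name instead of the whole string. This is only a sketch: the table dbo.Companies and its column company are made-up names, and STRING_SPLIT assumes SQL Server 2016 or later (database compatibility level 130).

```sql
-- Hypothetical names: dbo.Companies(company).
-- STRING_SPLIT requires SQL Server 2016+ (compatibility level 130).
DECLARE @input nvarchar(100) = N'site';

SELECT DISTINCT c.company
FROM dbo.Companies AS c
CROSS APPLY STRING_SPLIT(c.company, N' ') AS w   -- one row per word of the name
WHERE DIFFERENCE(w.value, @input) >= 3;          -- 4 is the closest match, 0 the weakest
```

Nothing is added to the database, but the predicate cannot use an index, so every row is split and compared on each search. A multi-word input would need to be split the same way and compared word by word.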

Very interesting question. I did a little poking around and found this:
http://www.codeproject.com/KB/database/dmetaphone4.aspx
I haven't tested it myself but it seems like it would be worth checking out.
It would require you to add something to the database, but I don't see how you can implement the functionality you want with built-in SQL Server functionality...

Related

SQL Contains exact phrase

I'm trying to implement a search mechanism with CONTAINS() on SQL Server 2014.
I've read here https://technet.microsoft.com/en-us/library/ms142538%28v=sql.105%29.aspx and in the book "Pro Full-Text Search in SQL Server 2008" that I need to use double quotes to search for an exact phrase.
But, e.g., if I use CONTAINS(*, '"test"') I receive results containing words like "numerictest" as well. If I try CONTAINS(*, '" test "') it is the same. I've noticed that there are fewer results than when I search with CONTAINS(*, '*test*') for a prefix/suffix search, so there is definitely a difference between the searches.
I didn't expect the "numerictest" in the first statement. Is there an explanation for this behaviour?

I have been wracking my brain about a very similar problem and I recently found the solution.
In my case I was searching full-text fields for "#username", but using CONTAINS(body, '"#username"') returned plain "username" as well. I wanted it to match strictly with the # sign.
I could use LIKE '%#username%', but the query took over a minute, which was unacceptable, so I kept looking.
Some people in a chat room suggested using both CONTAINS and LIKE. So:
SELECT TOP 25 * FROM table WHERE
CONTAINS(body, '"#username"') AND body LIKE '%#username%';
This worked perfectly for me because the CONTAINS pulls both username and #username records, and then the LIKE keeps only the ones with the # sign. Queries take 2-3 seconds now.
I know this is an old question but I came across it in my searching so having the answer I thought I would post it. I hope this helps.

Contains(*,'"test"') will only match full words of "test" as you expect.
Contains(*,'" test "') same as above
Contains(*,'"*test*"') will actually do a PREFIX ONLY search; it basically strips out any special characters at the start of the word and only uses the 2nd *.
You cannot do suffix (postfix) searches using full text search.
My concern lies with the Contains(*) part: this searches every full-text indexed column in that row. Without seeing the data it is hard to tell, but my guess is that another column in the row you think is a bad match actually contains "test" somewhere.
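One way to check that guess is to run the same predicate against each full-text indexed column separately and label the hits. A minimal sketch, assuming a made-up table dbo.Documents with key Id and indexed columns Title and Body (substitute your own names):

```sql
-- Hypothetical names: dbo.Documents(Id, Title, Body), both columns full-text indexed.
SELECT 'Title' AS MatchedColumn, Id
FROM dbo.Documents
WHERE CONTAINS(Title, '"test"')
UNION ALL
SELECT 'Body' AS MatchedColumn, Id
FROM dbo.Documents
WHERE CONTAINS(Body, '"test"');
```

If the "numerictest" rows only show up because of a different column, that column is the one actually matching the whole word "test".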

Proper-Casing Street names in SSRS

With the application that I am working with and writing reports for, the user enters the Location in all upper case. The people receiving my reports have requested that the Location be in proper case. This was fine until I realized that proper case does not recognize abbreviations. Is there a way to write an expression in SSDT that will, while converting the street name into proper case, also make it so abbreviations like "SE" or "DR" stay upper case?

John Saunders is right, it's not simple, and it'd be better if you can fix the data at the source. But you can wrap your Proper Case function in a series of outer REPLACE Functions. It's not simple because you'll have to analyze your data and figure out all the abbreviations you want to handle, and manually code each one. It will get huge, so you might consider creating this function in SSRS custom code, so it doesn't look so cluttered in the expression builder.
Pseudocode would look something like this:
REPLACE(
REPLACE(
ProperCase(MyFieldName)
,"Se","SE")
,"Dr","DR")
Add a REPLACE(InnerExpression,ProperCaseExpression,UpperCaseExpression) for each individual abbreviation you want to handle. It won't be fun, but it will work.

How can I use SQL Server's full text search across multiple rows at once?

I'm trying to improve the search functionality on my web forums. I've got a table of posts, and each post has (among other less interesting things):
PostID, a unique ID for the individual post.
ThreadID, an ID of the thread the post belongs to. There can be any number of posts per thread.
Text, because a forum would be really boring without it.
I want to write an efficient query that will search the threads in the forum for a series of words, and it should return a hit for any ThreadID for which there are posts that include all of the search words. For example, let's say that thread 9 has post 1001 with the word "cat" in it, and also post 1027 with the word "hat" in it. I want a search for cat hat to return a hit for thread 9.
This seems like a straightforward requirement, but I don't know of an efficient way to do it. Using the regular FREETEXT and CONTAINS capabilities for N'cat AND hat' won't return any hits in the above example because the words exist in different posts, even though those posts are in the same thread. (As far as I can tell, when using CREATE FULLTEXT INDEX I have to give it my index on the primary key PostID, and can't tell it to index all posts with the same ThreadID together.)
The solution that I currently have in place works, but sucks: maintain a separate table that contains the entire concatenated post text of every thread, and make a full text index on THAT. I'm looking for a solution that doesn't require me to keep a duplicate copy of the entire text of every thread in my forums. Any ideas? Am I missing something obvious?

As far as I can see, there is no "easy" way of doing this.
I would create a stored procedure which simply splits up the search words, looks for the first word, and puts the matching ThreadIDs in a table variable. Then you look for the other words (if any) in the ThreadIDs you just collected (inner join).
If you're interested I can write a few bits of code, but I'm guessing you won't need it.
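For a fixed pair of words, the same idea can be sketched without a stored procedure by intersecting the thread IDs found for each word. The table name Posts is an assumption; ThreadID and Text come from the question.

```sql
-- Assumed table name: Posts(PostID, ThreadID, [Text]) with a full-text index on [Text].
-- Each branch finds the threads containing one word; INTERSECT keeps threads that have both.
SELECT ThreadID FROM Posts WHERE CONTAINS([Text], N'"cat"')
INTERSECT
SELECT ThreadID FROM Posts WHERE CONTAINS([Text], N'"hat"');
```

With the example from the question this would return thread 9, because post 1001 satisfies the first branch and post 1027 the second. A variable number of search words would still need dynamic SQL or the table-variable loop described above.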

What are you searching for?
CAT HAT as a complete phrase, in which case:
CONTAINS(*,'"CAT HAT"')
CAT OR HAT then..
CONTAINS (*,'CAT OR HAT')
Searching for "CAT HAT" and expecting just the post with CAT in doesn't make any sense. If the problem is parsing what the user types, you could just replace SPACES with OR (to search any of the words, AND if both required). The OR will give you both posts for thread 9.
SELECT DISTINCT ThreadId
FROM Posts
WHERE CONTAINS (*,'CAT OR HAT')
Better still, if it helps, you could use the brilliant Irony (http://irony.codeplex.com/), which parses a search string and translates it into a full-text query. It might help you.
It requires Google syntax for the original search, which can only be a good thing as most people are used to typing Google searches.
Plus, here is an article on how to use it.
http://www.sqlservercentral.com/articles/Full-Text+Search+(2008)/64248/

Make SQL Server index small numbers

We're using SQL Server 2005 in a project. The users of the system have the ability to search some objects by using 'keywords'. The way we implement this is by creating a full-text catalog for the significant columns in each table that may contain these 'keywords' and then using CONTAINS to search for the keywords the user inputs in the search box in that index.
So, for example, let's say you have the Movie object, and you want to let the user search for keywords in the title and plot of the movie. Then we'd index both the Title and Plot columns, and then do something like:
SELECT * FROM Movies WHERE CONTAINS(Title, keywords) OR CONTAINS(Plot, keywords)
(It's actually a bit more advanced than that, but nothing terribly complex)
Some users are adding numbers to their search, so for example they want to find 'Terminator 2'. The problem here is that, as far as I know, by default SQL Server won't index short words, thus doing a search like this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"')
is actually equivalent to doing this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator"') -- notice the missing '2'
and we are getting a plethora of spurious results.
Is there a way to force SQL Server to index small words? Preferably, I'd rather index only numbers like 1, 2, 21, etc. I don't know where to define the indexing criteria, or even if it's possible to be as specific as that.
Well, I did that and removed the noise words from the list. Now the behaviour is a bit different, but still not what you'd expect.
When I search for "Terminator 2" (I'm just making this up; my employer might not be really happy if I disclose what we are doing... anyway, the terms are a bit different but the principle is the same), I don't get anything, but I know there are objects containing the two words.
Maybe I'm doing something wrong? I removed all numbers 1 ... 9 from my noise configuration for ENG, ENU and NEU (neutral), regenerated the indexes, and tried the search.

These "small words" are considered "noise words" by the full text index. You can customize the list of noise words. This blog post provides more details. You need to repopulate your full text index when you change the noise words file.
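The noise word files apply to SQL Server 2005, which is what the question uses. From SQL Server 2008 onwards they were replaced by stoplists that are managed with T-SQL. A rough sketch of the equivalent change there, assuming a full-text index already exists on dbo.Movies, that MovieStoplist is a name of your choosing, and that '2' is present in the English system stoplist being copied:

```sql
-- SQL Server 2008+ only. MovieStoplist is a hypothetical name.
-- Copy the system stoplist, then remove the stopword we want indexed.
CREATE FULLTEXT STOPLIST MovieStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST MovieStoplist DROP '2' LANGUAGE 'English';

-- Attach the custom stoplist to the existing full-text index.
-- Unless WITH NO POPULATION is specified, this change repopulates the index.
ALTER FULLTEXT INDEX ON dbo.Movies SET STOPLIST MovieStoplist;
```

Repeat the DROP for every number (and every language) you want to become searchable.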

I knew about the noise words file, but I'm not sure why your "Terminator 2" example is still giving you issues. You might want to try asking this on the MSDN Database Engine forum, where people who specialize in this sort of thing hang out.

You can combine CONTAINS (or CONTAINSTABLE) with simple where conditions:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"') and Title like '%Terminator 2%'
While the CONTAINS finds every "Terminator" title, the LIKE condition eliminates the ones such as 'Terminator 1'.
Of course, the engine is smart enough to start with the CONTAINS, not the LIKE condition.

Full text catalog/index search for %book%

I'm trying to wrap my head around how to search for something that appears in the middle of a word / expression - something like searching for "LIKE %book% " - but in SQL Server (2005) full text catalog.
How can I do that? It almost appears as if both CONTAINS and FREETEXT really don't support wildcard at the beginning of a search expression - can that really be?
I would have imagined that FREETEXT(*, "book") would find anything with "book" inside, including "rebooked" or something like that.

Unfortunately, CONTAINS only supports prefix wildcards:
CONTAINS(*, '"book*"')

SQL Server Full Text Search is based on tokenizing text into words. There is no smaller unit than a word, so the smallest things you can look for are words.
You can use prefix searches to look for matches that start with certain characters, which is possible because word lists are kept in alphabetical order and all the Server has to do is scan through the list to find matches.
To do what you want a query with a LIKE '%book%' clause would probably be just as fast (or slow).

If you want to do some serious full text searching then I would (and have) use Lucene.Net. MS SQL Full Text search never seems to work that well for anything other than the basics.

Here's a suggestion that is a workaround for that wildcard limitation. You create a computed column that contains the same content but in reverse as the column(s) you are searching.
If, for example, you are searching on a column named 'ProductTitle', then create a column named ProductsRev. Then update that field's 'Computed Column Specification' value to be:
(reverse([ProductTitle]))
Include the 'ProductsRev' column in your search and you should now be able to return results that support a wildcard at the beginning of the word. Good luck!!
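In T-SQL the steps above might look roughly like this. It is only a sketch: the table name dbo.Products is assumed, and it relies on a full-text index already existing on the table and on your SQL Server version accepting a computed column in that index, as the answer implies.

```sql
-- Hypothetical table: dbo.Products(ProductTitle), already full-text indexed.
-- Store each title reversed, so a word's ending becomes a prefix.
ALTER TABLE dbo.Products
    ADD ProductsRev AS REVERSE(ProductTitle) PERSISTED;

-- Add the reversed column to the existing full-text index.
ALTER FULLTEXT INDEX ON dbo.Products ADD (ProductsRev);

-- Words ending in "book" now start with "koob" in the reversed column.
SELECT *
FROM dbo.Products
WHERE CONTAINS(ProductsRev, '"koob*"');
```

Note that this finds words that end with the search term (for example "notebook"). A term in the middle of a word, such as "book" in "rebooked", still matches neither the prefix search nor the reversed one.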

Full text has a table that lists all the words the engine has found. It should have orders of magnitude fewer rows than your full-text-indexed table. You could select from that table " where field like '%book%' " to get all the words that have 'book' in them, then use that list to write a full-text query. It's cumbersome, but it would work, and it would be OK in the speed department. HOWEVER, ultimately you are using full text wrong when you do this. It might actually be better to educate the source of these feature requests about what full text is doing. You want them to understand what it WANTS to do, so they can get high value from it. For example: only use wildcards at the end of a word, which means thinking of the words as an ordered list.
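On SQL Server 2008 and later, that word list can be read through the sys.dm_fts_index_keywords function (it is not available on 2005). A sketch of the first step, assuming the full-text indexed table is called dbo.Products:

```sql
-- SQL Server 2008+ only. dbo.Products is a hypothetical table name.
-- Lists every indexed word that contains "book" anywhere inside it.
SELECT DISTINCT display_term
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'dbo.Products'))
WHERE display_term LIKE N'%book%';
```

The returned words would then be OR-ed together into the CONTAINS search condition, which has to be built as a string in application code or dynamic SQL.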

Why not program an assembly in C# to compute all the non-repeated suffixes? For example, if you have the text "eat the red meat", you can store "eat at t the he e red ed d meat" in a field (note that it is not necessary to add eat, at and t again) and then use full text search on that field. A function for doing that can easily be written in C#.
x) I know it seems odd... it's a workaround.
x) I know I'm adding overhead to the insert / update... it is only justified if this overhead is insignificant compared with the improvement in the search function.
x) I know there is also an overhead in the size of the stored data.
But I'm pretty confident that it will be quite fast.
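The answer proposes a C# assembly; purely to illustrate what the suffix list contains, here is a T-SQL sketch instead. It assumes SQL Server 2016 or later for STRING_SPLIT and borrows sys.all_objects as a convenient source of row numbers.

```sql
-- Illustration only: lists the distinct suffixes of every word in @text.
-- Requires SQL Server 2016+ (STRING_SPLIT); sys.all_objects is just a row source.
DECLARE @text nvarchar(200) = N'eat the red meat';

WITH n AS (
    SELECT TOP (LEN(@text))
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects
)
SELECT DISTINCT SUBSTRING(w.value, n.i, LEN(w.value)) AS suffix
FROM STRING_SPLIT(@text, N' ') AS w
JOIN n ON n.i <= LEN(w.value);   -- one row per start position in each word
```

Concatenating these suffixes into an extra column and full-text indexing it turns a search for text inside a word into an ordinary prefix search such as CONTAINS(SuffixColumn, '"book*"'), where SuffixColumn is a hypothetical column name.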
