MS SQL : Full Text Search Results are not relevant - sql-server

I am trying a SQL Server full-text query on a single column, using the FREETEXTTABLE function.
When I query "Horse ride", the result set contains only videos whose title contains the word "ride".
As I understand it, FREETEXT and FREETEXTTABLE break the query string into words, generate the inflectional forms of those words, and build the result set from the matches.
If that is the process, why does the result set contain no videos with the word "horse" in the title? (I do have videos in the database whose titles contain "horse".)
Is it because the word breaker gives preference to verbs?
Please comment on how the word breaker and stemmer work for English.
Links with detailed information about the word breaker and stemmer would also be helpful.
It is very important for me to get relevant results every time.
Thank you.

Full-text search filters out noise words and punctuation, and you can add more noise words to the default list. To control how verbs, inflectional forms, or synonyms are handled, you can use different full-text predicates in the WHERE clause.
In your case, if you are looking for rows where both "horse" AND "ride" appear, you can simply use the CONTAINS predicate, something like this:
SELECT ColumnName
FROM TableName
WHERE Contains(ColumnName, '"horse" AND "ride"')
If you are looking for values that contain the word "horse" and any inflectional form of "ride" (ride, riding, and so on), you can use something like this:
SELECT ColumnName
FROM TableName
WHERE Contains(ColumnName, '"horse"') AND CONTAINS(ColumnName, 'FORMSOF(INFLECTIONAL, ride)')

Related

Customize Normalization in SQL Server Full Text Search by replacing characters

I want to customize SQL Server FTS to handle language-specific features better.
In many languages, such as Persian and Arabic, there are similar characters that a proper search should treat as identical, like these groups:
['آ' , 'ا' , 'ء' , 'ا']
['ي' , 'ی' , 'ئ']
Currently my best solution is to store a duplicate of the data in a new column with these characters replaced by a representative member, normalize the search term the same way, and search the duplicated column.
Is there any way to tell SQL Server to treat all members of these groups as an identical character?
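The duplicate-column workaround described in the question can be sketched outside the database. This is a minimal Python illustration, not a SQL Server feature: the `normalize` function and the choice of representative character for each group are assumptions made for the example, and the same function would have to be applied both when writing the duplicate column and when preparing the search term.

```python
# Illustrative mapping only: each group from the question collapses to one
# representative character (the choice of representative is an assumption).
GROUPS = {
    "\u0627": ["\u0622", "\u0627", "\u0621"],  # alef with madda, alef, hamza -> alef
    "\u06cc": ["\u064a", "\u06cc", "\u0626"],  # Arabic yeh, Farsi yeh, yeh with hamza -> Farsi yeh
}

# str.maketrans builds a per-character translation table from the groups.
TABLE = str.maketrans(
    {variant: rep for rep, variants in GROUPS.items() for variant in variants}
)


def normalize(text: str) -> str:
    """Replace every member of a group with its representative character."""
    return text.translate(TABLE)


stored = normalize("\u0622\u0628")  # value written to the duplicate column
term = normalize("\u0627\u0628")    # search term normalized the same way
print(stored == term)
```

Because both sides go through the same table, two spellings that differ only by characters in the same group compare as equal.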
As far as I understand, this would be used for suggestion purposes, so perfect accuracy is not essential.
In Farsi, the characters in the lists above do not actually share the same meaning, but they can share a short written form in some cases ('آ' != 'اِ', but both can be written as 'ا').
SCENARIO 1: THE INPUT TEXT IS IN COMPLETE FORM
Imagine "محمّد" is a record in a table named 'table' with the columns (id int, text nvarchar(12)).
After removing the special character on both sides, we can use the following command:
select * from [db].[dbo].[table] where REPLACE(text, N'ّ', '') = REPLACE(N'محمد', N'ّ', '');
the result would be
SCENARIO 2: THE INPUT IS IN SHORT FORM
Imagine "محمد" is a record in a table named 'table' with the columns (id int, text nvarchar(12)).
In this scenario we need to do some logical operations on the text before we query the database.
For example, if "محمد" is the input and we have a list of these special characters, it can easily be searched for with a query such as:
select * from [db].[dbo].[table] where REPLACE(text, N'ّ', '') = N'محمد';
Note:
This solution is not the best one, because the input should not have to be altered on the client side; it would be better if SQL Server could be configured to handle this.
For people who do not understand Farsi: he simply wants to tell SQL Server that A = ["B", "C"], that is, A has the same value as the characters in the list, so that
when the word "dad" is searched, any existing word "dbd" or "dcd" is returned too.
Addition:
Some sets of characters have the same meaning and some do not (['ي','أ'] are the same but ['آ','اِ'] are not), so for the first scenario we get:
select * from [db].[dbo].[table] where text like N'%هی[أي]ت' and text like N'هی[أي]ت%';

SQL Contains exact phrase

I am trying to implement a search mechanism with "CONTAINS()" on SQL Server 2014.
I've read here https://technet.microsoft.com/en-us/library/ms142538%28v=sql.105%29.aspx and in the book "Pro Full-Text Search in SQL Server 2008" that I need to use double quotes to search for an exact phrase.
But if, for example, I use CONTAINS(*, '"test"'), I also receive results containing words like "numerictest". If I try CONTAINS(*, '" test "') it is the same. I've noticed that there are fewer results than when I search with CONTAINS(*, '*test*') for a prefix/suffix search, so there is definitely a difference between the searches.
I didn't expect the "numerictest" in the first statement. Is there an explanation for this behaviour?
I have been racking my brain over a very similar problem and recently found a solution.
In my case I was searching full-text fields for "#username", but CONTAINS(body, '"#username"') returned rows with plain "username" as well. I wanted it to match strictly with the # sign.
I could use LIKE '%#username%', but the query took over a minute, which was unacceptable, so I kept looking.
Some people in a chat room suggested using both CONTAINS and LIKE:
SELECT TOP 25 * FROM table WHERE
CONTAINS(body, '"#username"') AND body LIKE '%#username%';
This worked perfectly for me because CONTAINS pulls both the username and #username records, and then LIKE filters out the ones without the # sign. Queries take 2-3 seconds now.
I know this is an old question, but I came across it in my searching, so I thought I would post the answer. I hope this helps.
CONTAINS(*, '"test"') will only match the full word "test", as you expect.
CONTAINS(*, '" test "') behaves the same as above.
CONTAINS(*, '"*test*"') will actually do a PREFIX-ONLY search: it basically strips out any special characters at the start of the word and only uses the second *.
You cannot do POSTFIX searches using full-text search.
My concern lies with the CONTAINS(*) part: this searches every full-text indexed column in the row. Without seeing the data it is hard to tell, but my guess is that another column in the row you think is wrong is actually matching on "test" somewhere.

SQL Server Full Text Search - Is it possible to search in the middle of a word?

I have full text search on my database.
Is it possible to search in the middle of a word for some text?
For example, I have a column Description that contains the following text:
Revolution
Is it possible to search for 'EVO' and have it find it in the word Revolution or am I stuck doing a LIKE:
SELECT * FROM Table WHERE Description LIKE '%EVO%'
Is there a FTS equivalent of the above query?
EDIT
I want to make clear what I am asking, because it appears a few people might be confused. I believe that SQL Server FTS can only search at the beginning of a word (prefix search). So if I query like this:
SELECT * FROM Table WHERE CONTAINS(Description, '"Revo*"')
Then it will find the word Revolution. I want to know if it is possible at all to search something in the MIDDLE of the word. Not at the end. Not at the beginning. From what it looks like this is not possible and it makes sense because how would SQL server index this, but I just wanted to be certain.
This looks like it has come up before and the short answer was "No".
Previous thread
You can use CONTAINS. See this link
Full text catalog/index search for %book%
The only way to do this search is to add a "rotational" breakdown of the words.
As an example, the word "locomotion" will be broken down into 9 new "words":
"ocomotion"
"comotion"
"omotion"
"motion"
"otion"
"tion"
"ion"
"on"
"n"
So now you can put this table into the full-text search (or create a new column with all these parts of the word) to find it quickly.
I wrote a paper on doing that without FTS (but it is in French):
https://blog.developpez.com/sqlpro/p13123/langage-sql-norme/like-mot-ou-les-index-rotatifs
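The rotational breakdown described above can be generated mechanically. Here is a minimal Python sketch; the function name `rotations` is hypothetical, and this is only an illustration of the idea, not the code from the linked paper.

```python
def rotations(word: str) -> list[str]:
    """Return every proper suffix of word, longest first."""
    return [word[i:] for i in range(1, len(word))]


parts = rotations("locomotion")
print(parts)

# A search for text in the middle of the word becomes a prefix search
# over the generated parts.
print(any(part.startswith("mot") for part in parts))
```

Storing these parts alongside the original word is what lets a prefix-only index answer a mid-word query.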
You can use CONTAINS instead of LIKE.
SELECT *
FROM Table
WHERE CONTAINS(Description, '"EVO*"')

Make SQL Server index small numbers

We're using SQL Server 2005 in a project. The users of the system can search some objects by using 'keywords'. We implement this by creating a full-text catalog for the significant columns in each table that may contain these 'keywords', and then using CONTAINS to search that index for the keywords the user types in the search box.
So, for example, let's say you have the Movie object and you want to let the user search for keywords in the title and plot of the movie; then we'd index both the Title and Plot columns, and then do something like:
SELECT * FROM Movies WHERE CONTAINS(Title, keywords) OR CONTAINS(Plot, keywords)
(It's actually a bit more advanced than that, but nothing terribly complex)
Some users are adding numbers to their search, so for example they want to find 'Terminator 2'. The problem here is that, as far as I know, by default SQL Server won't index short words, thus doing a search like this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"')
is actually equivalent to doing this:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator"') -- notice the missing '2'
and we are getting a plethora of spurious results.
Is there a way to force SQL Server to index small words? Preferably, I'd rather index only numbers like 1, 2, 21, etc. I don't know where to define the indexing criteria, or even if it's possible to be as specific as that.
Well, I did that: I removed the noise words from the list, and now the behaviour is a bit different, but still not what you'd expect.
When I search for "Terminator 2" (I'm just making this up, my employer might not be really happy if I disclose what we are doing... anyway, the terms are a bit different but the principle is the same), I don't get anything, even though I know there are objects containing the two words.
Maybe I'm doing something wrong? I removed all the digits 1 ... 9 from my noise configuration for ENG, ENU and NEU (neutral), regenerated the indexes, and tried the search.
These "small words" are considered "noise words" by the full-text index. You can customize the list of noise words. This blog post provides more details. You need to repopulate your full-text index when you change the noise words file.
I knew about the noise words file, but I'm not sure why your "Terminator 2" example is still giving you issues. You might want to try asking this on the MSDN Database Engine forum, where people who specialize in this sort of thing hang out.
You can combine CONTAINS (or CONTAINSTABLE) with simple where conditions:
SELECT * FROM Movies WHERE CONTAINS(Title, '"Terminator 2"') and Title like '%Terminator 2%'
CONTAINS finds every 'Terminator' row, and the LIKE condition then eliminates titles such as 'Terminator 1'.
The engine should be smart enough to start with the CONTAINS condition rather than the LIKE condition.
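The two-stage idea (a word-based match that ignores noise words, followed by an exact substring check) can be illustrated with a toy model. This Python sketch does not reproduce SQL Server's behaviour: the `NOISE` set, the tokenizer, and the case-insensitive comparison are all assumptions, and stage 1 is a simple all-words match rather than a real phrase search.

```python
import re

NOISE = {"1", "2", "3", "a", "the"}  # hypothetical noise-word list


def tokens(text):
    """Split text into lowercase words and drop noise words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in NOISE]


def word_match(title, query):
    """Stage 1: every non-noise query word appears in the title."""
    title_words = set(tokens(title))
    return all(t in title_words for t in tokens(query))


def search(titles, query):
    candidates = [t for t in titles if word_match(t, query)]
    # Stage 2: exact substring check, playing the role of LIKE '%...%'.
    return [t for t in candidates if query.lower() in t.lower()]


titles = ["Terminator", "Terminator 2", "Terminator 3", "Alien 2"]
print(search(titles, "Terminator 2"))
```

Stage 1 narrows the rows cheaply, so the expensive substring comparison only runs on the few candidates that survive.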

Full text catalog/index search for %book%

I'm trying to wrap my head around how to search for something that appears in the middle of a word or expression (something like "LIKE %book%") in a SQL Server (2005) full-text catalog.
How can I do that? It almost appears as if neither CONTAINS nor FREETEXT supports a wildcard at the beginning of a search expression. Can that really be?
I would have imagined that FREETEXT(*, "book") would find anything with "book" inside, including "rebooked" or something like that.
Unfortunately, CONTAINS only supports prefix searches, with the wildcard at the end of the term:
CONTAINS(*, '"book*"')
SQL Server Full Text Search is based on tokenizing text into words. There is no smaller unit than a word, so the smallest things you can look for are words.
You can use prefix searches to look for matches that start with certain characters, which is possible because word lists are kept in alphabetical order and all the Server has to do is scan through the list to find matches.
To do what you want a query with a LIKE '%book%' clause would probably be just as fast (or slow).
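The difference between the two kinds of lookup can be sketched with an ordinary sorted list. This is a minimal Python illustration of the reasoning above, not SQL Server's actual index structure; the function names and sample words are made up for the example.

```python
from bisect import bisect_left


def prefix_matches(sorted_words, prefix):
    """Jump to the first candidate with a binary search, then read forward."""
    start = bisect_left(sorted_words, prefix)
    found = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # sorted order guarantees no later word can match
        found.append(word)
    return found


def infix_matches(sorted_words, fragment):
    """No ordering helps here: every word has to be inspected."""
    return [word for word in sorted_words if fragment in word]


words = sorted(["book", "booked", "bookstore", "notebook", "rebooked", "table"])
print(prefix_matches(words, "book"))
print(infix_matches(words, "book"))
```

The prefix lookup stops as soon as it leaves the matching range, while the infix lookup is a full scan, which is why a LIKE '%book%' clause has no index-friendly equivalent.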
If you want to do some serious full text searching then I would (and have) use Lucene.Net. MS SQL Full Text search never seems to work that well for anything other than the basics.
Here's a suggestion that is a workaround for that wildcard limitation. You create a computed column that contains the same content but in reverse as the column(s) you are searching.
If, for example, you are searching on a column named 'ProductTitle', then create a column named ProductsRev. Then update that field's 'Computed Column Specification' value to be:
(reverse([ProductTitle]))
Include the 'ProductsRev' column in your search and you should now be able to return results that support a wildcard at the beginning of the word. Good luck!!
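Why the reversed column helps can be seen in a small Python sketch. It is an illustration only: it reverses individual words rather than a whole column value, and the function name `suffix_matches` is hypothetical. Note that the trick turns an ends-with search into a prefix search; it does not cover text in the middle of a word.

```python
def suffix_matches(words, suffix):
    """Find words ending in suffix by prefix-matching their reversed form."""
    # Plays the role of the reverse([ProductTitle]) computed column.
    reversed_column = {word[::-1]: word for word in words}
    target = suffix[::-1]  # the search term must be reversed as well
    return sorted(
        original
        for rev, original in reversed_column.items()
        if rev.startswith(target)
    )


print(suffix_matches(["notebook", "bookstore", "rebooked", "book"], "book"))
```

"rebooked" is not returned, because "book" sits in the middle of that word rather than at its end.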
Full-text search keeps a table that lists all the words the engine has found. It should have orders of magnitude fewer rows than your full-text-indexed table. You could select from that table with "where field like '%book%'" to get all the words that contain 'book', then use that list to write a full-text query. It's cumbersome, but it would work, and it would be OK in the speed department. HOWEVER, ultimately you are using full-text search wrong when you do this. It might actually be better to educate the source of these feature requests about what full-text search is doing. You want them to understand what it WANTS to do, so they can get high value from it. For example: only use wildcards at the end of a word, which means thinking of the words as an ordered list.
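The two-step approach above (filter the word list, then build a full-text query from the hits) might look like this in outline. This Python sketch assumes the word list has already been read into a set; `build_contains_condition` is a hypothetical helper, and the resulting string is only meant to show the shape of a search condition made of quoted terms joined with OR.

```python
def build_contains_condition(vocabulary, fragment):
    """Pick the indexed words containing fragment and join them with OR."""
    matches = sorted(word for word in vocabulary if fragment in word)
    return " OR ".join(f'"{word}"' for word in matches)


vocab = {"book", "rebooked", "notebook", "table"}
condition = build_contains_condition(vocab, "book")
print(condition)
```

The substring scan runs over the distinct words only, which is the reason the answer expects it to be much cheaper than scanning the indexed table itself.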
Why not write an assembly in C# to compute all the non-repeated suffixes? For example, if you have the text "eat the red meat", you can store "eat at t the he e red ed d meat" in a field (note that it is not necessary to add eat, at and t again) and then use full-text search on that field. A function for doing that can easily be written in C#.
x) I know it seems odd... it's a workaround.
x) I know I'm adding overhead to inserts and updates; it is only justified if that overhead is insignificant compared with the improvement in the search function.
x) I know there is also an overhead in the size of the stored data.
But I'm pretty confident that it will be quite fast.
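The suffix field from this answer can be computed in a few lines. The answer suggests C#; the sketch below uses Python purely as an illustration, and the function name `unique_suffixes` is hypothetical.

```python
def unique_suffixes(text: str) -> str:
    """List each word's suffixes in order, skipping any already emitted."""
    seen = set()
    out = []
    for word in text.split():
        for i in range(len(word)):
            suffix = word[i:]
            if suffix not in seen:
                seen.add(suffix)
                out.append(suffix)
    return " ".join(out)


print(unique_suffixes("eat the red meat"))
```

For the sample sentence this reproduces the field value given in the answer: "meat" contributes only itself, because "eat", "at" and "t" were already emitted for the first word.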
