SQL Server, Full text search word breakers

According to the SQL Server documentation for Full-Text Search, and sadly confirmed in production, a search using the English language matches exact phrases while ignoring punctuation between the words.
Books Online says:
Punctuation is ignored. Therefore, CONTAINS(testing, "computer
failure") matches a row with the value, "Where is my computer? Failure
to find it would be expensive."
Is there a word breaker for English that doesn't ignore punctuation, so that rows like the one in their example would not be returned?

That is a limitation of FTS or, depending on your point of view, a feature. FTS is designed for fast searching and for cases like this one, where you don't know the exact string.
If you need an exact match that respects punctuation, you have to use a LIKE search rather than FTS.
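
To illustrate the difference, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for SQL Server (an assumption made only so the example runs anywhere; the table name `testing` and column `body` are hypothetical). It shows that a LIKE pattern treats punctuation literally, so the Books Online example row is not returned:

```python
import sqlite3

def like_matches(rows, phrase):
    """Return the rows whose text contains `phrase` literally, using LIKE."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for SQL Server here
    conn.execute("CREATE TABLE testing (body TEXT)")
    conn.executemany("INSERT INTO testing (body) VALUES (?)", [(r,) for r in rows])
    # Note: % and _ inside `phrase` would act as LIKE wildcards; escape them
    # before using this approach on real user input.
    pattern = f"%{phrase}%"
    cursor = conn.execute(
        "SELECT body FROM testing WHERE body LIKE ? ORDER BY rowid", (pattern,)
    )
    found = [row[0] for row in cursor]
    conn.close()
    return found

rows = [
    "Where is my computer? Failure to find it would be expensive.",
    "A computer failure is expensive.",
]
print(like_matches(rows, "computer failure"))
```

Only the second row comes back, because the first has "? " between the two words. On SQL Server a pattern with a leading % cannot use an index seek, so on large tables it will be much slower than FTS. One possible compromise is to narrow the rows with CONTAINS first and then apply LIKE to the survivors.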

Related

Querytype=Full and searching for stop words returns no results

We use Azure Cognitive Search with the full query syntax. When a user searches for something like "the document", we create a query like this (a simplified example):
(Title:the OR Contents:the) AND (Title:document OR Contents:document)
(We need to split up the query for unrelated reasons.)
The problem is that "the" could be a stop word in the language we are searching in (we search in several languages), which causes the entire query to fail. We would like to be able to ignore stop words when generating queries like this, or have the search engine simply return true for the parts of the query that search for a stop word.
I figure the latter is not possible (or is it?). Might there be a way to query the stop words for specific language analyzers so we can exclude them ourselves? Or is there a way to alter our query so that it handles stop words better?

If you want to strip stop words from your search query, the only thing I can think of is calling the analyzer with the search query and checking the returned tokens.
In this example you would call the en.microsoft analyzer with the search query "the document".
The returned tokens contain only "document", so you know the analyzer treats "the" as a stop word. When searching in multiple languages, though, you might need to call multiple analyzers and strip the stop words for each of those languages.
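
A rough Python sketch of that idea follows. It assumes the analyzer response has already been fetched and that each token carries `startOffset` and `endOffset` fields; the token list is hard-coded here for illustration, and the helper names `surviving_terms` and `build_query` are hypothetical. A query term is kept only if at least one returned token overlaps it:

```python
import re

def surviving_terms(query, tokens):
    """Keep the whitespace-separated terms of `query` that the analyzer kept.

    `tokens` is assumed to be a list of dicts with startOffset/endOffset
    keys, as returned by an analyzer test call (shape is an assumption).
    """
    spans = [(t["startOffset"], t["endOffset"]) for t in tokens]
    kept = []
    for match in re.finditer(r"\S+", query):
        # A term survives if any token's character span overlaps it.
        if any(start < match.end() and end > match.start() for start, end in spans):
            kept.append(match.group())
    return kept

def build_query(terms, fields=("Title", "Contents")):
    """Build the per-term fielded query from the question.

    Terms are not escaped here; real input would need Lucene special
    characters escaped first.
    """
    groups = ["(" + " OR ".join(f"{f}:{t}" for f in fields) + ")" for t in terms]
    return " AND ".join(groups)

# Hard-coded stand-in for what en.microsoft might return for "the document".
tokens = [{"token": "document", "startOffset": 4, "endOffset": 12, "position": 1}]
terms = surviving_terms("the document", tokens)
print(build_query(terms))
```

If every term turns out to be a stop word, `build_query` returns an empty string, so the caller has to decide what to do in that case (for example, fall back to the original query).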

Azure Search: wildcard queries do not work with Japanese/Chinese characters

I used icu_tokenizer in a custom analyzer to create a search index for Japanese words, because it works better for Asian languages than the default Azure Search tokenizer. The index was created successfully.
When I query for a string such as 赤城, I see multiple search results (131 in total) from the index. But when I run a wildcard search on the same word, such as 赤城* (adding * at the end of the word) or /赤城.*/ (a regex query), I see 0 search results. The weird part is that * seems to work with a single Japanese character: 赤* gives me the same number of search results as 赤. As soon as I use more than one Japanese character, wildcard queries with * stop working and return 0 results. I am testing all of these queries in Search Explorer on the Azure portal using querytype=full (Lucene query syntax).
In my application, search terms are normally used as prefix searches, so we append * to the search string to fetch results, but it looks like these Lucene wildcard queries just do not work with Japanese characters. Any idea how I can make these prefix queries (with a trailing wildcard *) work when the search string is in Japanese?
Any quick help will be much appreciated!

I have now tested this on my installation and can confirm that wildcards only work with Japanese content when you use a Japanese analyzer.
In my example I set up one index with a property Body that has no specific analyzer defined. Then I set up another index where Body uses the ja.microsoft language analyzer. The content of both indexes is identical. I then searched for 自動車 (automobile) with a trailing wildcard.
自動車* returns multiple hits from the index that uses the Japanese analyzer. No hits are returned from the index without a specific analyzer defined.

Sorry for the late reply.
Have you tried using one of the Japanese language analyzers, for example ja.microsoft?
Also, if you want prefix search, you can try experimenting with the suggester feature, which is designed to be efficient for this scenario.

Multi-word CONTAINS full-text search only working partially in SQL Server

I'm using SQL Server 2012 and have created a full-text index on the NAME column of the COMPANY table. All the searches I've tested have the following format (with a variable number of words to search for), matching on the beginnings of words in any order:
select id, name from company where contains(name, '"ka*" AND "de*"')
The problem is that there are cases where this query doesn't return any results even though there should be a perfect match. For example, when the company name is "ka de we oy", the example above returns a match, but '"ka*" AND "de*" AND "we*"' does not, and neither does searching with all four 'words'.
There are also other cases where, strangely enough, the search does not return results even with exact words. This seems related to very short (two-letter) words. There are also some issues when searching with many (6+) words.
Is there some explicit restriction on the number of words in a single query, or on how short they can be? How can I fix or work around this?
Edit: it seems that certain common English words are excluded from the index entirely (like 'we' in the example). This is an issue, because it is a requirement that a few of these common words be searchable. Is there any way to change which words are not indexed, or for example to change the 'language' of the index so that a different set of common words is left out?

Apparently this is simply a matter of defining the correct stopwords / stoplist:
https://msdn.microsoft.com/en-us/library/ms142551.aspx
https://msdn.microsoft.com/en-us/library/cc280405.aspx
Another option is to set the full-text index language for the column to the actual language of the data, so that English stopwords don't cause issues.
Edit: in the end it was easiest to simply disable the stoplist for the table entirely:
ALTER FULLTEXT INDEX ON company SET STOPLIST = OFF
Hopefully this helps someone else.

SQL Server Full Text Index Contains search exact match containing "it"

I'm fairly new to full-text indexing in SQL Server. It has been working really well for me. Recently, however, someone did an exact match search for "IT Manager", and the "IT" part of the search seems to be ignored.
e.g.
SELECT * FROM CONTAINSTABLE(vCandidateSearch, SearchText, '"it manager"')
and
SELECT * FROM CONTAINSTABLE(vCandidateSearch, SearchText, '"manager"')
return the same results. What am I doing wrong?

The problem is that the fulltext engine sees "it" as a "noise" - or stop - word, and ignores it.
Assuming you're using SQL 2008+, then see the documentation here on stoplists and stopwords: https://msdn.microsoft.com/en-us/library/ms142551(v=sql.100).aspx
These are lists containing various "filler" words (e.g. "a" "the" "it" etc) in various languages, that are generally not useful in fulltext searches and are ignored.
My experience is that these default lists are great for searching larger bodies of text, but often not so useful for things like product (or indeed job) titles that need to be more specific.
You can create your own stoplists containing (or not) whatever stopwords are appropriate for your particular need.
For a job title search it may well be appropriate to use no stopwords at all for that particular column. You can choose which stoplist (containing stopwords) is associated with a particular fulltext index when the index is created. You can create an empty list if need be, and use it in an index on one column only (although you would have to adjust your queries to take this into account).
In the unlikely event you're on SQL 2005 or below, it uses a much more primitive system of "noise words" that are just held in a text file: https://msdn.microsoft.com/en-us/library/ms142551(v=sql.90).aspx

Double quotes ("") don't mean an exact match of the whole value. They just look for that phrase somewhere in the text.
If I have the value
The big red house
Examples that match
"big red house"
"big"
"house"
"red house"
Example that does not match
"the big yellow"
If you need only "The big red house" to match, then you might be better off creating a non-clustered index on that column and using a regular = predicate.
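
As a toy model of that behaviour (not how the full-text engine is actually implemented, and ignoring stopwords, stemming and language rules), a phrase search can be thought of as looking for a contiguous run of words. The helper names in this Python sketch are hypothetical:

```python
import re

def words(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def phrase_in_text(phrase, text):
    """True if the words of `phrase` appear as a contiguous run in `text`."""
    needle, haystack = words(phrase), words(text)
    n = len(needle)
    if n == 0:
        return False
    # Slide a window of n words across the text and compare.
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))

value = "The big red house"
for phrase in ["big red house", "big", "house", "red house", "the big yellow"]:
    print(phrase, "->", phrase_in_text(phrase, value))
```

Every phrase except "the big yellow" is found, which mirrors the list above. An equality predicate, by contrast, compares the whole column value.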

Parsing search queries for SQL 2008 FTS

We want to use SQL Server 2008 Full-Text Search and seem to be running into a lot of problems handling the search query.
If the user types in blue dog, it just crashes SQL unless we parse the search terms and put "" around the words, but that turns them into a phrase instead of keywords.
I want results where either blue or dog is included, but that means replacing spaces with ORs and so on. Unfortunately, there seem to be far too many combinations a user might type.
Are there any libraries out there (for .NET) that can already parse a search string into something FTS understands?
We'd like a Google-like syntax :)
Thanks

I was looking for the "FREETEXT" option and was using the "CONTAINS" keyword instead. My bad. FREETEXT is giving me the results I wanted.
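
If CONTAINS is still needed, the "replace spaces with ORs" step from the question can be sketched roughly as below. This is a minimal Python illustration, not a complete query parser: the function name `to_contains_condition` is hypothetical, quoted input is kept as a phrase, and every other word becomes its own quoted term.

```python
import re

def to_contains_condition(user_input, operator="OR"):
    """Turn free-form user input into a CONTAINS search condition string.

    Text inside double quotes stays together as a phrase; every other
    whitespace-separated word becomes a separate quoted term.
    """
    parts = []
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', user_input):
        # Drop stray double quotes so the term cannot break out of its quoting.
        term = (phrase or word).replace('"', "").strip()
        if term:
            parts.append(f'"{term}"')
    # An empty result means there is nothing to search for; the caller
    # should skip the CONTAINS predicate in that case.
    return f" {operator} ".join(parts)

print(to_contains_condition("blue dog"))
print(to_contains_condition('"blue dog" cat'))
```

The resulting string is meant to be passed to the query as a parameter rather than concatenated into the SQL text.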
