BRAVO value won't be found after a SQL SELECT CONTAINS - sql-server

I'm seeing very strange behavior in SQL Server.
I have a User table with one row that has BRAVO as the last name.
When I run this simple query:
select * from User u where contains (u.LastName, 'BRAVO')
it finds no result.
If I update the User table, changing the last name from BRAVO to CRAVO (or any other first letter), and run
select * from User u where contains (u.LastName, 'CRAVO')
it works.
Is BRAVO a reserved word in SQL Server? Am I missing something?
Thanks

By default, when you create a full-text index it is associated with the system stoplist. The default stoplist has more than 150 words for the English language. You can run the query below to see all the English stopwords for a particular database.
SELECT stopword,language_id FROM sys.fulltext_system_stopwords WHERE language_id = 1033
What is a stoplist?
Stopwords are managed in databases by using objects called stoplists. A stoplist is a list of stopwords that, when associated with a full-text index, is applied to full-text queries on that index.
What is a stopword?
To prevent a full-text index from becoming bloated, SQL Server has a mechanism that discards commonly occurring strings that do not help the search. These discarded strings are called stopwords. During index creation, the Full-Text Engine omits stopwords from the full-text index.
You can use the query below to list the system-defined stopwords for English together with the language name:
SELECT ssw.stopword, slg.name
FROM sys.fulltext_system_stopwords ssw
JOIN sys.fulltext_languages slg
ON slg.lcid = ssw.language_id
WHERE slg.lcid = 1033
So, if a full-text search uses one of these words, BRAVO included, it won't return the rows you expect, even for an exact match (with the term given in double quotes).
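To verify this on your own instance, a sketch along the following lines may help. It assumes an English index (LCID 1033) and the system stoplist (stoplist id 0 in sys.dm_fts_parser); adjust both if your index is configured differently. Note that sys.dm_fts_parser requires membership in the sysadmin fixed server role.

```sql
-- Is the term present in the English (LCID 1033) system stoplist?
SELECT stopword, language_id
FROM sys.fulltext_system_stopwords
WHERE language_id = 1033
  AND stopword = N'bravo';

-- How does the full-text parser classify the term?
-- Arguments: query string, LCID, stoplist id (0 = system stoplist), accent sensitivity.
SELECT keyword, special_term, display_term
FROM sys.dm_fts_parser(N'"BRAVO"', 1033, 0, 0);
```

If the first query returns a row, or the second reports a special_term of Noise Word, the term is being discarded as a stopword.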

Related

SQL Server fulltext search issue

I am working with SQL Server full text search. The issue is SQL Server is returning the wrong records.
For example, I am searching for the word was in the TitleStripped column of the articles table:
SELECT
TitleStripped
FROM
[pastic_com].[dbo].[Psa_Articles]
WHERE
FREETEXT (TitleStripped, 'was')
With this query, I found 8 records; for reference two of them are pasted below:
Seasonal dynamics and relative abundance of AM fungi in rhizosphere of rice (Oryza sativa L. cv. Basmati supper).
Seasonal dynamics of AM fungi in sugarcane (Saccharum officinarum L.CV.SPF-213) in relation to red rot (Colletotrichum falcatum) disease from Punjab, Pakistan.
You will notice that the title column does not contain the word "was".
For more reference here's a screenshot:
[1]: https://i.stack.imgur.com/w0gdI.png
The full text search depends on thesaurus files and stoplist objects. Please double check your configuration for entries related to was.
Also, note the difference between FREETEXT and CONTAINS. If you look for exact matches of the word was then try CONTAINS instead of FREETEXT for the reason below.
Here is a snippet from the documentation for FREETEXT; you probably want to avoid the actions it lists.
Is a predicate used in the Transact-SQL WHERE clause of a Transact-SQL SELECT statement to perform a SQL Server full-text search on full-text indexed columns containing character-based data types. This predicate searches for values that match the meaning and not just the exact wording of the words in the search condition. When FREETEXT is used, the full-text query engine internally performs the following actions on the freetext_string, assigns each term a weight, and then finds the matches:
Separates the string into individual words based on word boundaries (word-breaking).
Generates inflectional forms of the words (stemming).
Identifies a list of expansions or replacements for the terms based on matches in the thesaurus.
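As a rough sketch of how to check this, the queries below first run the exact-word search and then list the inflectional forms the engine generates for the term. The English LCID (1033) and the absence of a stoplist (NULL) are assumptions; use the language of your own index. Both sample titles contain "AM", which is a form of the same verb as "was", so stemming is a plausible (unconfirmed) explanation for the unexpected hits.

```sql
-- Exact-word match: CONTAINS with a quoted term applies no stemming or thesaurus expansion.
SELECT TitleStripped
FROM [pastic_com].[dbo].[Psa_Articles]
WHERE CONTAINS(TitleStripped, '"was"');

-- List the inflectional forms generated for 'was'.
-- Arguments: query string, LCID (1033 = English, assumed), stoplist id (NULL = none), accent sensitivity.
SELECT display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, was)', 1033, NULL, 0);
```

If the second query lists a form that appears in the returned titles, FREETEXT is matching on that form rather than on the literal word.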

Multi-word CONTAINS full-text search only working partially in SQL Server

I'm using SQL Server 2012 and have created full-text index for NAME column in COMPANY table. All the searches I've tested are of the following format (with variable number of words to search), matching by beginnings of words in any order:
select id, name from company where contains(name, '"ka*" AND "de*"')
The problem is that there are cases where this query doesn't return any results even though it should be a perfect match. For example, when the company name is "ka de we oy", the example above returns a match, but '"ka*" AND "de*" AND "we*"' does not, and neither does searching with all four 'words'.
There are also other cases where, strangely enough, the search does not return results even with exact words. This seems related to very short (two-letter) words. There are also some issues with searching with many (6+) words.
Is there some explicit restriction to the number of words in a single query or how short they can be? How can I fix or work around this?
Edit: it seems that certain common English words are entirely excluded from the index (like 'we' in the example). This is an issue, since it's a requirement that a few of the common words must be searchable. Is there any way to change which words are not indexed, or e.g. change the 'language' of the indexing so that a different set of common words is left out?
Apparently this is simply a case of defining correct stopwords / stoplist:
https://msdn.microsoft.com/en-us/library/ms142551.aspx
https://msdn.microsoft.com/en-us/library/cc280405.aspx
Or setting the full-text index language for the column to the actual language so that English words don't cause issues.
Edit: actually it was easiest to simply disable the stoplist for the table entirely:
ALTER FULLTEXT INDEX ON company SET STOPLIST = OFF
Hopefully this helps someone else.
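If disabling the stoplist entirely is too broad, a narrower alternative is a custom stoplist with only the needed words removed. This is a sketch, not the poster's solution: the stoplist name CompanyStoplist is hypothetical, and LANGUAGE 1033 assumes the column is indexed as English.

```sql
-- Start from a copy of the system stoplist (CompanyStoplist is a hypothetical name).
CREATE FULLTEXT STOPLIST CompanyStoplist FROM SYSTEM STOPLIST;

-- Remove the words that must stay searchable (English, LCID 1033, assumed).
ALTER FULLTEXT STOPLIST CompanyStoplist DROP 'we' LANGUAGE 1033;

-- Attach the custom stoplist to the full-text index on the table.
ALTER FULLTEXT INDEX ON company SET STOPLIST = CompanyStoplist;
```

This keeps the rest of the system stopwords out of the index while making 'we' searchable.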

Why is SQL Server 2012 applying stop words that are NOT present in our custom stoplist?

We use SQL Server 2012 CONTAINSTABLE full text search queries and we want certain words to be found: 'noord', 'oost', 'zuid', 'west'. The example is for Dutch but the issue is not language specific.
For example 'noord' is not found because this is a word in the Dutch system stoplist. This is understandable.
We therefore created a custom stoplist from the system stoplist, and removed the offending stop words: 'noord', 'west' and 'zuid' in this case.
Queries containing search term 'noord' now yield results, as expected. However search term 'west' still yields no results.
Despite correctly using the custom stoplist, rebuilding the full-text catalog and so on, SQL Server still applies the stop word 'west'. Why?
In short, this seems to be caused by other stop words: 'zuidzuidwest' and 'westzuidwest'. SQL Server applies some splitting mechanism, which causes 'west' to still be treated as a stop word. Possibly it uses a word breaker, or it applies the system stoplist to split words in the custom stoplist.
The fix is to remove the stop words 'zuidzuidwest' and 'westzuidwest' from the custom stoplist. This solves the issue.
Some details follow.
Whether words are in the system stoplist can be established using the following query.
SELECT * FROM sys.fulltext_system_stopwords WHERE language_id=1043
AND stopword IN ('noord', 'oost', 'zuid', 'west');
This yields
noord 1043
west 1043
zuid 1043
Create a custom stoplist from the system stoplist:
CREATE FULLTEXT STOPLIST CustomStoplist FROM SYSTEM STOPLIST;
Establish the stoplist id:
SELECT * FROM sys.fulltext_stoplists;
This yields (in this case) stoplist id 6, which is used in the queries below. Remove the offending stop words:
ALTER FULLTEXT STOPLIST CustomStoplist DROP 'noord' LANGUAGE 1043;
ALTER FULLTEXT STOPLIST CustomStoplist DROP 'west' LANGUAGE 1043;
ALTER FULLTEXT STOPLIST CustomStoplist DROP 'zuid' LANGUAGE 1043;
The following query shows that SQL Server will still filter 'zuid' and 'west':
SELECT * FROM sys.dm_fts_parser('"noord" or "oost" or "zuid" or "west"', 1043, 6, 0);
This shows that 'zuid' and 'west' are noise words, despite the words being removed from the custom stoplist.
Exact Match noord
Exact Match oost
Noise Word zuid
Noise Word west
Apply the fix described above:
ALTER FULLTEXT STOPLIST CustomStoplist DROP 'zuidzuidwest' LANGUAGE 1043;
ALTER FULLTEXT STOPLIST CustomStoplist DROP 'westzuidwest' LANGUAGE 1043;
Repeat the dm_fts_parser query: problem solved.
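To avoid hard-coding the stoplist id (6 in this case), the check can be sketched with a lookup by name. The variable name @stoplist_id is illustrative only; the stoplist name and LCID are the ones used above.

```sql
-- Look up the id of the custom stoplist by name instead of hard-coding it.
DECLARE @stoplist_id int =
    (SELECT stoplist_id
     FROM sys.fulltext_stoplists
     WHERE name = N'CustomStoplist');

-- Re-run the parser check against that stoplist (Dutch, LCID 1043).
SELECT display_term, special_term
FROM sys.dm_fts_parser(N'"noord" or "oost" or "zuid" or "west"', 1043, @stoplist_id, 0);
```

Any term still reported as Noise Word is being filtered by the stoplist.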
To find all compound words that could interfere:
SELECT * FROM sys.fulltext_stopwords WHERE stoplist_id=6
AND language_id=1043
AND (stopword LIKE '%noord%' OR stopword LIKE '%oost%'
OR stopword LIKE '%zuid%' OR stopword LIKE '%west%');
This yields, for example, 'zuidwest' and 'zuidzuidoost'. To be safe, words like these can also be dropped from the custom stoplist.
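When there are many such compound words, the DROP statements can be generated rather than typed by hand. The following is a sketch under the same assumptions as above (stoplist CustomStoplist, Dutch LCID 1043); it only produces the statements as text, which you would review and then run yourself.

```sql
-- Generate one ALTER ... DROP statement per compound stop word.
-- REPLACE doubles any single quote inside a stop word so the generated literal stays valid.
SELECT N'ALTER FULLTEXT STOPLIST CustomStoplist DROP '''
       + REPLACE(sw.stopword, N'''', N'''''')
       + N''' LANGUAGE 1043;' AS drop_statement
FROM sys.fulltext_stopwords AS sw
JOIN sys.fulltext_stoplists AS sl
  ON sl.stoplist_id = sw.stoplist_id
WHERE sl.name = N'CustomStoplist'
  AND sw.language_id = 1043
  AND (sw.stopword LIKE N'%noord%' OR sw.stopword LIKE N'%oost%'
       OR sw.stopword LIKE N'%zuid%' OR sw.stopword LIKE N'%west%');
```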
For completeness, here is a search query. Note that it cannot be run as is, because the table Contents and the columns Nr and Title are application specific.
SELECT c.Nr, c.Title FROM CONTAINSTABLE(Contents, (Title),
'"noord" or "oost" or "zuid" or "west"') x JOIN Contents c
ON x.[KEY]=c.Nr ORDER BY c.Nr;
The query yields a certain number of hits. After dropping stop word 'zuidzuidwest' the number of hits increases, which was the original goal. After dropping stop word 'westzuidwest' the number of hits increases even more. Thereafter dropping additional stop words like 'zuidwest' does not result in additional hits.

Full text search - Contains plus wildcard and single quote

I have a table with a name field containing this value:
Test O'neill 123
If I use
SELECT *
FROM table F
WHERE CONTAINS ( F.*, '"Test O''neill 123"' )
it works fine, but if I use a wildcard * I get no results.
SELECT *
FROM table f
WHERE CONTAINS ( F.*, '"Test O''neill 123*"' )
Why is this?
I am using a parser for my search terms, and it is adding the wildcard *.
I checked some sites about escaping the ' but I haven't found anything that covers this.
Thanks in advance
The problem is due to the combination of 1) using the Neutral language, 2) a stoplist on your full-text index, and 3) unexpected behavior when a wildcard is used in a search that includes stopwords.
The Neutral language doesn't cover all of the nuances of the English language, so at index-time it considers O'neill to be 2 separate words O and neill. Then your stoplist considers O to be a stopword so this "word" is not added to the index, only neill is.
At search-time, the search engine typically ignores stop words in multi-word phrases. For example, searching for Contains(*, '"we x people"') will match the text ...we the people..., "x" and "the" both being stopwords and thus automatically "matching" each other. (I use the term "matching" loosely because the search engine is not matching the stopwords, but rather it knows that people is 1 word away from we.)
So you might expect the wildcard search Contains(*, '"we the people*"') to also find its match, except that it does not when using a stoplist. If it weren't for the stopword "the" in the search phrase, or if "the" were not considered a stopword, the search would work fine. I really can't explain this behavior, but I suspect it has something to do with the way the word positions are computed. I also suspect it is not the intended behavior.
So back to your case, Contains(*, '"Test O''neill 123"') will find a match but the wildcard search Contains(*, '"Test O''neill 123*"') does not. (You can even simplify the search to Contains(*, '"O''neill*"') and you'll see that it still does not find a match.) The combination of the stopword O with a wildcard runs into the problem I explained in the last paragraph. This is the crux of the problem stated in your question.
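One way to see the word-breaking difference described above is to run the same phrase through sys.dm_fts_parser under both languages. This is a sketch that assumes the system stoplist (id 0); LCID 0 is the Neutral language and 1033 is English.

```sql
-- Neutral language (LCID 0): check whether O'neill is split and how each part is classified.
SELECT display_term, special_term, occurrence
FROM sys.dm_fts_parser(N'"Test O''neill 123"', 0, 0, 0);

-- English (LCID 1033): compare the terms and positions with the result above.
SELECT display_term, special_term, occurrence
FROM sys.dm_fts_parser(N'"Test O''neill 123"', 1033, 0, 0);
```

Comparing the display_term and special_term columns of the two results shows which pieces are treated as separate words and which are discarded as noise words.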
Solutions ranging from most-effective to least-effective-but-possibly-more-practical-for-your-case:
1) Change the language on your full text index to English and re-index. This will cause O'neill to be treated as 1 word and thus you'll avoid the weird wildcard behavior that I explained. You can change the language in the full text index properties via SQL Server Management Studio or by dropping and re-adding each column in the index as follows:
ALTER FULLTEXT INDEX ON MyTable DROP (Column1)
GO
ALTER FULLTEXT INDEX ON MyTable ADD (Column1 LANGUAGE [English])
-- repeat for each column in the index
2) If you need to keep using the Neutral language, consider removing O from your stoplist and re-index.
ALTER FULLTEXT STOPLIST MyStoplist DROP 'o' LANGUAGE 'Neutral';
3) Or don't use a stoplist if you don't need one.
ALTER FULLTEXT INDEX ON MyTable SET STOPLIST = OFF
4) If none of the above solutions are practical, consider removing stopwords from the search phrase, or at least the O' prefix in surnames.

SQL Server FullText search for phrase containing punctuation, e.g. IP address?

I'm using SQL Server fulltext to search text in large varchar/varbinary columns that may contain IP addresses. I understand that the dots in the address aren't in the index, but I thought a phrase search such as this would work:
select * from myFTETable where contains(myFTEcolumn, "192 168 100 101")
It doesn't. What am I doing wrong? Is there a way to search for IP addresses, or more generally is there a way to do a phrase search when the phrase in the original data contains punctuation?
Thanks.
You're right, the LIKE operator doesn't take advantage of the fulltext index resulting in long query runtimes as your database grows.
Have you tried querying the internal index table to see what numbers are being indexed? This can be accomplished by running -
SELECT * FROM sys.dm_fts_index_keywords(db_id('{database}'), object_id('{table}'))
Inserting 192.168.100.101 into a full-text indexed table shows up as 8 distinct entries internally (4 numeric, 4 character), and running a CONTAINS('"192 168 100 101"') query brings up the relevant row.
As a caveat, full-text search will strip some of the lower numbers as part of its stoplist mechanism. This can be overridden by specifying STOPLIST OFF during index creation or by removing the matching strings from the stoplist.
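Put together, the steps above might look like the following sketch. The table and column names are the ones from the question; the English LCID (1033) and the system stoplist (id 0) are assumptions.

```sql
-- Show how the word breaker splits the address and whether any part is a noise word.
SELECT display_term, special_term
FROM sys.dm_fts_parser(N'"192.168.100.101"', 1033, 0, 0);

-- Phrase search: the whole condition is a single-quoted string,
-- and the phrase inside it is wrapped in double quotes.
SELECT *
FROM myFTETable
WHERE CONTAINS(myFTEcolumn, '"192 168 100 101"');

-- If short numbers are being discarded as stopwords, one option is to turn the stoplist off.
ALTER FULLTEXT INDEX ON myFTETable SET STOPLIST = OFF;
```

Note that the query in the question passes the phrase in double quotes only, without the surrounding single quotes, which may be part of why it fails.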

Resources