SQL Server Full Text Index Contains search exact match containing "it" - sql-server

I'm fairly new to Full Text Index in SQL Server. It has been working really well for me; however, recently someone did an exact match search for "IT Manager" and the "IT" part of the search seems to be ignored.
e.g.
SELECT * FROM CONTAINSTABLE(vCandidateSearch, SearchText, '"it manager"')
and
SELECT * FROM CONTAINSTABLE(vCandidateSearch, SearchText, '"manager"')
return the same results. What am I doing wrong?

The problem is that the fulltext engine sees "it" as a "noise" - or stop - word, and ignores it.
Assuming you're using SQL 2008+, then see the documentation here on stoplists and stopwords: https://msdn.microsoft.com/en-us/library/ms142551(v=sql.100).aspx
These are lists containing various "filler" words (e.g. "a" "the" "it" etc) in various languages, that are generally not useful in fulltext searches and are ignored.
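One way to confirm that "it" is being treated as a stopword is to ask the full-text parser directly. This is a minimal sketch, assuming the index uses English (LCID 1033) and the system stoplist (stoplist id 0); note that sys.dm_fts_parser requires sysadmin membership.

```sql
-- Show how the full-text engine breaks up the search phrase.
-- Assumptions: language 1033 (English), system stoplist (0), accent-insensitive (0).
SELECT display_term, special_term
FROM sys.dm_fts_parser(N'"it manager"', 1033, 0, 0);

-- Same phrase with no stoplist at all (NULL), for comparison.
SELECT display_term, special_term
FROM sys.dm_fts_parser(N'"it manager"', 1033, NULL, 0);
```

If "it" is on the stoplist, the first query should report it with special_term = 'Noise Word', while the second should report both terms as 'Exact Match'.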
My experience is that these default lists are great for searching larger bodies of text, but often not so useful for things like product (or indeed job) titles that need to be more specific.
You can create your own stoplists containing (or not) whatever stopwords are appropriate for your particular need.
For a job title search it may well be appropriate to use no stopwords at all for that particular column. You choose which stoplist is associated with a fulltext index when the index is created, and you can change it later with ALTER FULLTEXT INDEX. You can create an empty stoplist if need be; note that a stoplist applies to the whole fulltext index, so this works best when the index covers only that one column (and you would have to adjust your queries to take this into account).
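For SQL Server 2008 and later, the stoplist setup described above might look roughly like this. It is a sketch under assumptions: JobTitleStoplist is a hypothetical name, and vCandidateSearch (the indexed view from the question) is assumed to already have a full-text index.

```sql
-- Option 1: an empty stoplist, so no word is treated as noise.
CREATE FULLTEXT STOPLIST JobTitleStoplist;

-- Option 2 (alternative to option 1): copy the system stoplist and
-- remove only the English (LCID 1033) stopword "it".
-- CREATE FULLTEXT STOPLIST JobTitleStoplist FROM SYSTEM STOPLIST;
-- ALTER FULLTEXT STOPLIST JobTitleStoplist DROP 'it' LANGUAGE 1033;

-- Associate the stoplist with the existing full-text index.
-- Unless WITH NO POPULATION is specified, this normally repopulates the index.
ALTER FULLTEXT INDEX ON vCandidateSearch SET STOPLIST = JobTitleStoplist;
```

Option 2 keeps the usual filler words ignored and only makes "it" searchable, which may be the gentler change if the same index also covers longer text.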
In the unlikely event you're on SQL 2005 or below, it uses a much more primitive system of "noise words" that are just held in a text file: https://msdn.microsoft.com/en-us/library/ms142551(v=sql.90).aspx

Double quotes ("") don't mean an exact match. They just look for that phrase anywhere in the text.
If I have the value:
The big red house
Example matches:
"big red house"
"big"
"house"
"red house"
Example of a non-match:
"the big yellow"
If you need only the exact value "The big red house" to match, then you might be better off creating a non-clustered index on that column and using a regular = predicate.
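A minimal sketch of that alternative follows. The table dbo.Candidate and its columns are hypothetical names used only for illustration; they do not appear in the question.

```sql
-- Hypothetical table and column names, for illustration only.
CREATE NONCLUSTERED INDEX IX_Candidate_JobTitle
    ON dbo.Candidate (JobTitle);

-- Matches only rows whose whole value equals the search text
-- (subject to the column's collation, e.g. case-insensitive by default).
SELECT CandidateId, JobTitle
FROM dbo.Candidate
WHERE JobTitle = N'IT Manager';
```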

Related

SQL Server fulltext search issue

I am working with SQL Server full-text search. The issue is that SQL Server is returning the wrong records.
For example, I am searching for the word "was" in the TitleStripped column of the articles table:
SELECT
TitleStripped
FROM
[pastic_com].[dbo].[Psa_Articles]
WHERE
FREETEXT (TitleStripped, 'was')
With this query, I found 8 records; for reference two of them are pasted below:
Seasonal dynamics and relative abundance of AM fungi in rhizosphere of rice (Oryza sativa L. cv. Basmati supper).
Seasonal dynamics of AM fungi in sugarcane (Saccharum officinarum L.CV.SPF-213) in relation to red rot (Colletotrichum falcatum) disease from Punjab, Pakistan.
You will notice that the title column does not contain the word "was".
For more reference here's a screenshot:
[1]: https://i.stack.imgur.com/w0gdI.png
Full-text search depends on thesaurus files and stoplist objects. Please double-check your configuration for entries related to "was".
Also, note the difference between FREETEXT and CONTAINS. If you are looking for exact matches of the word "was", try CONTAINS instead of FREETEXT, for the reason below.
Here is a snippet from the documentation for FREETEXT; these are the actions you probably want to avoid:
Is a predicate used in the Transact-SQL WHERE clause of a Transact-SQL
SELECT statement to perform a SQL Server full-text search on full-text
indexed columns containing character-based data types. This predicate
searches for values that match the meaning and not just the exact
wording of the words in the search condition. When FREETEXT is used,
the full-text query engine internally performs the following actions
on the freetext_string, assigns each term a weight, and then finds the
matches:
Separates the string into individual words based on word boundaries
(word-breaking).
Generates inflectional forms of the words (stemming).
Identifies a list of expansions or replacements for the terms based on
matches in the thesaurus.
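To make the difference concrete, here is a sketch that reuses the table from the question. It assumes English (LCID 1033) and the system stoplist; if "was" is on the stoplist in use, the CONTAINS query may return nothing except a noise-word warning.

```sql
-- CONTAINS with a quoted term looks for the word itself, without the
-- stemming and thesaurus expansion that FREETEXT applies.
SELECT TitleStripped
FROM [pastic_com].[dbo].[Psa_Articles]
WHERE CONTAINS(TitleStripped, '"was"');

-- Inspect which inflectional forms the engine would generate for "was"
-- (requires sysadmin membership).
SELECT display_term, expansion_type, special_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "was")', 1033, 0, 0);
```

The second query is only a diagnostic; it can help explain why FREETEXT returned rows that do not contain the literal word.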

FullText Search with FreeText but only return records that contain all words of an expression, except those on stopword list

I've searched for a solution for a while now, but I cannot come up with a way that returns the recordset I want.
I have a table full of different texts as a collection of all texts used in an HMI software.
Now when a user creates a new text I want to check if a similar text already exists in the table.
So far I have found that Full-Text Search on MS SQL Server should be the best way to do this. My problem is the following:
When I use FreeText on a new text that should be checked for similar values, I get way too many results. Every record is listed that contains even one of the relevant words in my search string.
Example:
Search text:
Deceleration Linear Motor Transfer to Top
Values that should be found:
'Deceleration linear motor transfer top'
'Deceleration linear motor handover to top'
Values that should not be found:
'Acceleration linear motor handover to top'
'linear motor handover to top'
So I want it to work just like FreeText does (with INFLECTIONAL and THESAURUS comparison), but to return only records that contain all words in the search string, except those that are on the stopword list (so filler words are also ignored).
I thought about using Contains in combination with Formsof for every single word in my search string. But then it does not ignore the words on the stopword list.
I hope I was able to specify my problem properly and hope someone can help me with it.
Thanks in advance.
For anyone who might also run into this kind of problem: I have solved it myself with the sledgehammer approach.
I just concatenate all words in my search expression like this:
(FORMSOF(THESAURUS, *Word1*) OR FORMSOF(INFLECTIONAL, *Word1*)) AND
(FORMSOF(THESAURUS, *Word2*) OR FORMSOF(INFLECTIONAL, *Word2*))
I skip the stopwords manually by checking each word against the stoplist before adding it to my where string.
This article helped me a lot with getting the correct language ID for the selected column in code:
Some Useful Full Text Index Stoplist Related Queries
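The approach above can be sketched in T-SQL roughly as follows. This is not the poster's actual code. It assumes SQL Server 2017 or later (for STRING_AGG), English (LCID 1033), the system stoplist (id 0), sysadmin membership for sys.dm_fts_parser, and a hypothetical table dbo.HmiTexts with a full-text indexed column TextValue.

```sql
DECLARE @search nvarchar(1000) = N'Deceleration Linear Motor Transfer to Top';
DECLARE @lcid int = 1033;              -- assumption: English
DECLARE @condition nvarchar(4000);     -- CONTAINS does not accept nvarchar(max)

-- Wrap the text in double quotes so the parser treats it as one phrase.
DECLARE @quoted nvarchar(1010) = N'"' + REPLACE(@search, N'"', N'') + N'"';

-- Let the word breaker split the text, keep only non-stopword terms, and
-- join one (THESAURUS OR INFLECTIONAL) group per word with AND.
SELECT @condition = STRING_AGG(
           N'(FORMSOF(THESAURUS, "' + display_term + N'") OR '
         + N'FORMSOF(INFLECTIONAL, "' + display_term + N'"))',
           N' AND ') WITHIN GROUP (ORDER BY occurrence)
FROM sys.dm_fts_parser(@quoted, @lcid, 0, 0)
WHERE special_term = N'Exact Match';   -- stopwords are reported as 'Noise Word'

-- Hypothetical table and column names.
SELECT TextValue
FROM dbo.HmiTexts
WHERE CONTAINS(TextValue, @condition);
```

Letting the parser flag the noise words avoids maintaining a separate stopword check in application code, at the cost of needing elevated permissions for sys.dm_fts_parser.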

Multi-word CONTAINS full-text search only working partially in SQL Server

I'm using SQL Server 2012 and have created full-text index for NAME column in COMPANY table. All the searches I've tested are of the following format (with variable number of words to search), matching by beginnings of words in any order:
select id, name from company where contains(name, '"ka*" AND "de*"')
The problem is that there are cases where this query doesn't return any results even though it should be a perfect match. For example, when the company name is "ka de we oy", the example above returns a match but '"ka*" AND "de*" AND "we*"' does not, and neither does searching with all four 'words'.
There are also other cases where, strangely enough, the search does not return results even with exact words. This seems related to very short (two-letter) words. There are also some issues with searching with many (6+) words.
Is there some explicit restriction to the number of words in a single query or how short they can be? How can I fix or work around this?
Edit: it seems that certain common English words are entirely excluded from the index (like 'we' in the example). This is an issue since it's a requirement that a few of the common words definitely should be searchable. Is there any way to change which words are not indexed, or e.g. change the 'language' of the indexing to apply a different set of common words that are left out?
Apparently this is simply a case of defining correct stopwords / stoplist:
https://msdn.microsoft.com/en-us/library/ms142551.aspx
https://msdn.microsoft.com/en-us/library/cc280405.aspx
Or setting the full-text index language for the column to the actual language so that English words don't cause issues.
Edit: actually it was easiest to simply disable the stoplist for the table entirely:
ALTER FULLTEXT INDEX ON company SET STOPLIST = OFF
Hopefully this helps someone else
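To check whether the stoplist really is the cause before and after switching it off, something like the following may help. It is a sketch that assumes the table lives in the dbo schema and that the column is indexed as English (LCID 1033).

```sql
-- Which stoplist does the full-text index on company use?
-- stoplist_id 0 means the system stoplist; NULL means no stoplist.
SELECT OBJECT_NAME(i.object_id) AS table_name, i.stoplist_id
FROM sys.fulltext_indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.company');

-- Which of the words from the example are system stopwords for English?
SELECT stopword
FROM sys.fulltext_system_stopwords
WHERE language_id = 1033
  AND stopword IN (N'ka', N'de', N'we', N'oy');

-- After SET STOPLIST = OFF and repopulation, re-run the failing search.
SELECT id, name
FROM company
WHERE CONTAINS(name, '"ka*" AND "de*" AND "we*"');
```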

Why solr.SnowballPorterFilterFactory cuts last letter of search term if protword file is empty?

I have a solr schema that uses solr.SnowballPorterFilterFactory. When I do admin/analysis
I see that for query "iphone", after SnowballPorterFilterFactory I get "iphon", even if the file specified in schema (protwords_ro.txt) is empty.
I have removed the filter and term text remains "iphone". Since my protwords_ro.txt file is empty I don't really need that filter right now, but I was wondering why is this happening.
Actually, this filter is for stemming.
In linguistic morphology and information retrieval, stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form
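So, for example, for the word resume this filter will give resum, etc.
Also,
The Snowball stemmers rely on algorithms and are considered fairly aggressive
The protected-words file only lists words that must not be stemmed, so an empty file means every term is stemmed. I think this is the reason why you got iphon even though your text file is empty.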

Oracle Text - Index a BLOB Field (which contains PDF data)

Do any of you have any experience with using Oracle Text to search for content inside PDF files?
I have a table, with a field called FILEDATA(blob).
I would like to do the following query:
SELECT id FROM ttc.contract_attachment WHERE CONTAINS(filedata, 'EXAMPLE') > 0;
However, I'm not too sure about the type of index to add to it.
I found the following code:
begin
ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER');
ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-');
end;
/
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context
parameters ('lexer doc_lexer sync (on commit)');
Ref: http://www.devx.com/dbzone/Article/21563/1954
I have no idea what BASIC_LEXER is. I'm at a bit of a loss. I shall endeavour to continue searching for an answer. Any help would be great.
Thanks.
I've used Oracle Text to index not only PDFs but also other data such as XML structures. Oracle has the concept of lexers, which take content, parse and tokenize it, and index the tokens. The basic lexer handles English words; there are other lexers for Chinese, Japanese, Korean, etc. The printjoins attribute allows you to index characters that are normally excluded, such as hyphens, quotes, etc.
The index you have defined above will work. Keep in mind that by default Oracle Text indexing is asynchronous: the commit occurs and the document is indexed only when the index is next synchronized, so you would normally need to synchronize the index as part of a scheduled job or the like. With the option "sync (on commit)" on your index, the document is indexed as part of the commit instead. This is noteworthy only if you are indexing sizable PDF documents, since indexing them can make commits take noticeably longer.
I would recommend utilizing progressive relaxation for any search you may want to run, as it can begin with a restrictive search and expand out to a more generic search, thereby providing the user with results in order of decreasing relevancy. For instance:
<query>
<textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="COUNT"/>
</query>
The above query tokenizes the search keywords "cat dog" and first attempts to find them as a phrase, then any documents containing cat AND dog (not necessarily beside each other), then any documents containing cat OR dog, where documents containing both words are scored higher than documents containing just one. Furthermore, the structure automatically de-duplicates the results as it returns them.
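A query template like this is passed as the text argument of CONTAINS. The sketch below applies it to the table and column from the question, assuming a CONTEXT index already exists on filedata.

```sql
-- The label 1 ties SCORE(1) to this CONTAINS call.
SELECT id, SCORE(1) AS relevance
FROM ttc.contract_attachment
WHERE CONTAINS(filedata,
  '<query>
     <textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
       <progression>
         <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
         <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
         <seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
       </progression>
     </textquery>
     <score datatype="INTEGER" algorithm="COUNT"/>
   </query>', 1) > 0
ORDER BY SCORE(1) DESC;
```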
All of that being said, you could simply define your index as:
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT)
indextype is ctxsys.context
parameters ('sync (on commit)');
and it would probably work very well for your needs. You would only need to change the behavior of the lexer if you have a need for doing so. I hope this helps.
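Applied to the table from the question, the simple index could be sketched as below. The index name idx_contract_att_filedata is a hypothetical choice, and the sketch assumes the default filter Oracle Text picks for binary columns is able to extract text from the stored PDFs.

```sql
-- CONTEXT index on the BLOB column, synchronized at commit time.
CREATE INDEX idx_contract_att_filedata
    ON ttc.contract_attachment (filedata)
    INDEXTYPE IS ctxsys.context
    PARAMETERS ('sync (on commit)');

-- If the index is created without sync (on commit), synchronize it
-- manually or from a scheduled job instead:
-- BEGIN
--   ctx_ddl.sync_index('idx_contract_att_filedata');
-- END;
-- /

-- The query from the question then works as written.
SELECT id
FROM ttc.contract_attachment
WHERE CONTAINS(filedata, 'EXAMPLE') > 0;
```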

Resources