Finding matched words using SQL Server Full Text Search FREETEXT function - sql-server

I'm trying to figure out where the matches were found when using FREETEXT so I can extract the paragraph in which they appear.
I can do this using CONTAINS by searching for the exact phrase in the document. However, because FREETEXT uses a more "fuzzy" approach and uses synonyms, I have no idea what it matched on.
For example, assume I have a column that includes the text "...this will help marginalized communities..."
SELECT * FROM some_table WHERE FREETEXT(some_column, 'marginalized groups')
will return the example row above. But I can't find the paragraph it appears in, because I'm looking for one search term while FREETEXT is smart enough to match similar terms.
Is there any way to find what the actual match was?

Related

SQL Server fulltext search issue

I am working with SQL Server full text search. The issue is SQL Server is returning the wrong records.
For example, I am searching for the word "was" in the TitleStripped column of the articles table:
SELECT
TitleStripped
FROM
[pastic_com].[dbo].[Psa_Articles]
WHERE
FREETEXT (TitleStripped, 'was')
With this query, I found 8 records; for reference two of them are pasted below:
Seasonal dynamics and relative abundance of AM fungi in rhizosphere of rice (Oryza sativa L. cv. Basmati supper).
Seasonal dynamics of AM fungi in sugarcane (Saccharum officinarum L.CV.SPF-213) in relation to red rot (Colletotrichum falcatum) disease from Punjab, Pakistan.
You will notice that the title column does not contain the word "was".
For more reference, here's a screenshot: https://i.stack.imgur.com/w0gdI.png
Full-text search depends on thesaurus files and stoplist objects. Please double-check your configuration for entries related to "was".
Also, note the difference between FREETEXT and CONTAINS. If you are looking for exact matches of the word "was", try CONTAINS instead of FREETEXT, for the reason below.
Here is a snippet from the documentation for FREETEXT; you probably want to avoid these actions:
Is a predicate used in the Transact-SQL WHERE clause of a Transact-SQL
SELECT statement to perform a SQL Server full-text search on full-text
indexed columns containing character-based data types. This predicate
searches for values that match the meaning and not just the exact
wording of the words in the search condition. When FREETEXT is used,
the full-text query engine internally performs the following actions
on the freetext_string, assigns each term a weight, and then finds the
matches:
Separates the string into individual words based on word boundaries
(word-breaking).
Generates inflectional forms of the words (stemming).
Identifies a list of expansions or replacements for the terms based on
matches in the thesaurus.
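As a rough illustration of the stemming step, here is a toy Python sketch (not how SQL Server implements it). It assumes a hand-written inflection table in which "was" and "am" are both forms of "be"; under that assumption, a search for "was" would also hit a title containing "AM fungi", which may be what is happening with the titles in the question. The table and the function names are hypothetical.

```python
import re

# Hypothetical, hand-written inflection table; the real stemmer is
# language-specific and far more complete.
INFLECTIONS = {
    "be": {"be", "am", "is", "are", "was", "were", "been", "being"},
}


def expand(term):
    """Return the term plus its inflectional forms from the toy table."""
    term = term.lower()
    for forms in INFLECTIONS.values():
        if term in forms:
            return forms
    return {term}


def freetext_like_match(text, query):
    """True if any query word, or one of its expanded forms, occurs in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(expand(q) & words for q in query.split())


title = "Seasonal dynamics and relative abundance of AM fungi in rhizosphere of rice"
print(freetext_like_match(title, "was"))  # True: "AM" is treated as a form of "be"
print("was" in title.lower().split())     # False: the literal word is absent
```

If this is the cause, an exact-word CONTAINS search would not return these rows, which is consistent with the advice above.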

FullText Search with FreeText but only return records that contain all words of an expression, except those on stopword list

I've been searching for a solution for a while now, but I cannot come up with a way to return the record set I want.
I have a table full of different texts, a collection of all texts used in an HMI software.
Now when a user creates a new text I want to check if a similar text already exists in the table.
So far I have found that full-text search on MS SQL Server should be the best way to do this. My problem is the following:
When I use FreeText on a new text that should be checked for similar values, I get way too many results. Every record that contains even one of the relevant words in my search string is listed.
Example:
Search text:
Deceleration Linear Motor Transfer to Top
Values that should be found:
'Deceleration linear motor transfer top'
'Deceleration linear motor handover to top'
Values that should not be found:
'Accelearion linear motor handover to top'
'linear motor handover to top'
So I want it to work just like FreeText does (with INFLECTIONAL and THESAURUS comparison), but return only records that contain all words in the search string, except those on the stopword list (so filler words are also ignored).
I thought about using Contains in combination with Formsof for every single word in my search string. But then it does not ignore those words on the stopword list.
I hope I was able to specify my problem properly and hope someone can help me with it.
Thanks in advance.
For anyone who might also run into this kind of problem: I have solved it myself by now with the sledgehammer approach.
I just concatenate all words in my search expression with
(Formsof(... Thesaurus, *Word1*) OR Formsof(... INFLECTIONAL, *Word1*)) AND
(Formsof(... Thesaurus, *Word2*) OR Formsof(... INFLECTIONAL, *Word2*))
I handle the stopwords manually: before adding a word to my WHERE string, I check whether it is on the stopword list and skip it if so.
This article helped me a lot with getting the correct language ID for the selected column in the code: "Some Useful Full Text Index Stoplist Related Queries".
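The approach described in this answer can be sketched in Python roughly as follows. This is a minimal sketch, not the answerer's actual code: the function name is made up, the stopword set is assumed to have been loaded from the database elsewhere, and treating "to" as a stopword is an assumption for the example.

```python
def build_contains_condition(search_text, stopwords):
    """Build a CONTAINS search condition that requires every non-stopword.

    stopwords: a set of lowercase words, assumed to be loaded elsewhere.
    """
    # Drop double quotes so a word cannot break out of its quoted term.
    words = [w.replace('"', "") for w in search_text.split()]
    words = [w for w in words if w and w.lower() not in stopwords]
    clauses = [
        f'(FORMSOF(THESAURUS, "{w}") OR FORMSOF(INFLECTIONAL, "{w}"))'
        for w in words
    ]
    return " AND ".join(clauses)


condition = build_contains_condition(
    "Deceleration Linear Motor Transfer to Top",
    stopwords={"to"},  # assumption: "to" is on the stoplist in use
)
print(condition)
```

The resulting string is meant to be passed as the search condition of CONTAINS, ideally as a query parameter rather than by concatenating it into the SQL text.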

Azure search contains word not working as expected

I am new to Azure Search. I am trying to use "contains" logic in my search query. I looked it up and found out that I need to add something like the following to my search query:
&queryType=full&search=/.*_search.*/
where _search is the string I want to search for. With this, the "contains" logic works fine. For example, when I search for sweep, I get well sweep-cmu in the results.
But when I search for well sweep-cmu, I get zero results. Why? And how can I improve my query to get results when I enter either partial or full strings?
If you want an exact match for the search query, surround the query with double quotes,
e.g. "well sweep-cmu"
This will return all documents which contain the exact phrase.
Since you've just started to play with Azure Search you might find this article particularly interesting. It explains how the full text search works in Azure Search.
https://learn.microsoft.com/en-us/azure/search/search-lucene-query-architecture
In order to get results for partial terms, you should use wildcard expressions in your search queries. The above article explains this in detail.
PS: Some wildcard queries can be very expensive and hence slow.
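The two query shapes discussed here can be put side by side in a small Python sketch. It only builds the query string: the parameter names queryType and search come from the question, the helper names are made up, and the sketch assumes the term contains no characters that are special in Lucene regular expressions.

```python
from urllib.parse import urlencode, parse_qs


def contains_query(term):
    """Query string for the regex-style "contains" search from the question."""
    return urlencode({"queryType": "full", "search": f"/.*{term}.*/"})


def phrase_query(phrase):
    """Query string for an exact phrase: the text wrapped in double quotes."""
    return urlencode({"queryType": "full", "search": f'"{phrase}"'})


# Decode again to show what the service would receive in each case.
print(parse_qs(contains_query("sweep"))["search"][0])
print(parse_qs(phrase_query("well sweep-cmu"))["search"][0])
```

A caller could pick the phrase form when the input contains whitespace and the wildcard form otherwise; whether that fits depends on how the index analyzes the field.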

Multi-word CONTAINS full-text search only working partially in SQL Server

I'm using SQL Server 2012 and have created a full-text index on the NAME column of the COMPANY table. All the searches I've tested have the following format (with a variable number of words to search), matching by beginnings of words in any order:
select id, name from company where contains(name, '"ka*" AND "de*"')
The problem is that there are cases where this query doesn't return any results even though it should be a perfect match. For example, when the company name is "ka de we oy", the example above returns a match, but '"ka*" AND "de*" AND "we*"' does not, and neither does searching with all four 'words'.
There are also other cases where, strangely enough, the search does not return results even with exact words. This seems related to very short (two-letter) words. There are also some issues with searching with many (6+) words.
Is there some explicit restriction to the number of words in a single query or how short they can be? How can I fix or work around this?
Edit: it seems that certain common English words are entirely excluded from the index (like 'we' in the example). This is an issue, since it's a requirement that a few of the common words must be searchable. Is there any way to change which words are not indexed, or e.g. to change the 'language' of the indexing so that a different set of common words is left out?
Apparently this is simply a case of defining correct stopwords / stoplist:
https://msdn.microsoft.com/en-us/library/ms142551.aspx
https://msdn.microsoft.com/en-us/library/cc280405.aspx
Or setting the full-text index language for the column to the actual language so that English words don't cause issues.
Edit: actually it was easiest to simply disable the stoplist for the table entirely:
ALTER FULLTEXT INDEX ON company SET STOPLIST = OFF
Hopefully this helps someone else.

Solr highlighting gives field/snippets with ANY term, instead of those that satisfy the query fully

I'm using Solr 5.x with the standard highlighter, and I'm getting snippets that match only one of the search terms, even if I specify q.op=AND.
I need ONLY the fields and snippets that match ALL the terms (unless I say q.op=OR or just omit it), i.e. the field/snippet must satisfy the query. Solr does return the field/snippet that has all the terms, but it also returns many others.
I'm using hl.fl=* to get only the fields that have the terms, and searching against the default field ('text', containing the full doc). I need to use * since I have multiple dynamic fields. Most fields are of 'text_general' type (for search and highlighting), and some are of 'string' type for faceting.
If it's not possible for snippets to have all the terms, I MUST get only the fields that satisfy the query fully (this question talks mostly about matching all the terms, but the search query can become arbitrarily complex, so the fields/snippets should match the query).
Next, I also need snippets highlighted for proximity-based searches. What should I do or use for this? The fields returned in highlighting in this scenario should also satisfy the proximity query (currently I get fields that contain any term, without regard to proximity constraints, other query terms, etc.).
Thanks for your help.
I've also encountered the same problem with highlighting. In my case, a query like
(foo AND bar) OR eggs
highlighted eggs and foo even though bar was not present in the document. I didn't manage to come up with a proper solution, but I devised a dirty workaround.
I use the following query:
id:highlighted_document_id AND text:(my_original_query)
with debugQuery set to true. Then I parse the explain text for highlighted_document_id. It contains the terms from the query that contributed to the score; the terms that should not be highlighted are not present in the explanation.
The Python regular expressions I use to extract the terms (valid for Solr 5.2.1):
import re
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')
Then I simply search for the markings in the highlighted text and remove them if the marked term doesn't match any of the terms found by term_regex and wildcard_term_regex.
The solution is probably pretty limited, but works for me.
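Put together, the workaround might look like the following sketch. It is not the answerer's code: the function names are made up, the highlight markers are assumed to be Solr's default <em>...</em>, the explain text is a simplified hand-written stand-in rather than real Solr output, and terms are compared by plain case-insensitive equality (so stemmed or wildcard terms would need extra handling).

```python
import re

# Regexes from the answer (reported there as valid for Solr 5.2.1).
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')

# Assumption: highlights use the default <em>...</em> markers.
highlight_regex = re.compile(r'<em>(.*?)</em>')


def scoring_terms(explain_text):
    """Collect the terms that show up in the explain text for one document."""
    terms = set(term_regex.findall(explain_text))
    terms.update(wildcard_term_regex.findall(explain_text))
    return {t.lower() for t in terms}


def strip_unmatched_highlights(snippet, explain_text):
    """Remove markers around words that did not contribute to the score."""
    terms = scoring_terms(explain_text)

    def keep_or_strip(match):
        word = match.group(1)
        return match.group(0) if word.lower() in terms else word

    return highlight_regex.sub(keep_or_strip, snippet)


# Simplified, hand-written stand-in for an explain entry (not real output).
sample_explain = """
0.5 = sum of:
  0.5 = weight(text:eggs in 0), result of:
"""
sample_snippet = "<em>foo</em> served with <em>eggs</em>"
print(strip_unmatched_highlights(sample_snippet, sample_explain))
# foo served with <em>eggs</em>
```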
