Optimising keyword retrieval using SQL Server's Full-Text Search

I have a SQL Server full-text index covering two columns in one of my tables.
I am showing suggested keywords in a web front end based on the user's input, so that entering a fragment like 'ban' yields words such as banana, banish, urban, husband, etc. The user then clicks one of these words to confirm their choice, or types further letters to narrow down the search.
The index holds the following total number of keywords, as reported by this query:
SELECT COUNT(*) FROM sys.dm_fts_index_keywords ( DB_ID(), OBJECT_ID('Search'))
217,998
To fetch the suggestions, I query the keywords with a query like the one below:
SELECT TOP 10 *, display_term, document_count
FROM sys.dm_fts_index_keywords ( DB_ID(), OBJECT_ID('Search'))
WHERE column_id=5
AND keyword != 0xFF
AND display_term like '%ban%'
AND display_term NOT LIKE 'nn%'
However, this currently takes circa 30 seconds to run! Clearly this is far too slow to be of any use.
So, as a workaround, I have created my own keywords table. Whenever I add content to my full-text search table, I run the query below to find out which keywords will be indexed:
SELECT display_term AS Term, COUNT(display_term) AS [Count]
FROM sys.dm_fts_parser('"There are many types of fruit, including apples, bananas and cherries." ', 1033, 0, 0)
WHERE display_term NOT LIKE 'nn%'
AND special_term NOT IN ('Noise Word', 'End of Sentence')
GROUP BY display_term
I then take these words and store them into my own keywords table, for later use by the web front end described above. This is much quicker.
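The workaround described above could be sketched roughly as follows. The table dbo.SearchKeywords and its columns are hypothetical names chosen for illustration, and the upsert via MERGE is just one possible way to accumulate counts; the parser call and its filters are the ones from the query above.

```sql
-- Hypothetical keywords table; the names are illustrative only.
-- nvarchar(450) is the longest nvarchar that fits the 900-byte clustered key limit.
CREATE TABLE dbo.SearchKeywords (
    Term        nvarchar(450) NOT NULL PRIMARY KEY,
    Occurrences int           NOT NULL
);
GO

-- Text that is being added to the full-text indexed table.
DECLARE @content nvarchar(3000) =
    N'There are many types of fruit, including apples, bananas and cherries.';

-- Wrap the text in double quotes so the parser treats it as a single phrase;
-- embedded double quotes are replaced with spaces to keep the phrase well formed.
DECLARE @phrase nvarchar(4000) = N'"' + REPLACE(@content, N'"', N' ') + N'"';

-- Upsert the parsed terms: add to the count of known terms, insert new ones.
MERGE dbo.SearchKeywords AS tgt
USING (
    SELECT display_term AS Term, COUNT(*) AS Occurrences
    FROM sys.dm_fts_parser(@phrase, 1033, 0, 0)
    WHERE display_term NOT LIKE 'nn%'
      AND special_term NOT IN ('Noise Word', 'End of Sentence')
    GROUP BY display_term
) AS src
    ON tgt.Term = src.Term
WHEN MATCHED THEN
    UPDATE SET tgt.Occurrences = tgt.Occurrences + src.Occurrences
WHEN NOT MATCHED THEN
    INSERT (Term, Occurrences) VALUES (src.Term, src.Occurrences);

-- Suggestion lookup used by the front end.
SELECT TOP 10 Term, Occurrences
FROM dbo.SearchKeywords
WHERE Term LIKE '%ban%'
ORDER BY Occurrences DESC;
```

The leading wildcard in '%ban%' still forces a scan, but it scans a small ordinary table rather than the keyword DMV, which is presumably where the speed-up comes from.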
However, I can't help feeling that I shouldn't need to create a workaround and that finding keywords is something that many people would need to be doing.
I have searched for other methods, tables, or other functionality contained within SQL Server, but all to no avail.
I have also looked into indexing the sys.dm_fts_index_keywords table. However, searching for the word "indexing" is problematic due to the nature of the subject matter.
Does anyone have another method that is quick to execute, and hopefully also requires less programmatic intervention?

Related

How to search for similar words in SQL Server

I am using CONTAINS and FREETEXT in a SQL query to search for text in large text fields.
I noticed that the search returns results only when the exact word matches, but what if I want to search for similar words?
For example, when I type Carlo, nothing is returned if the stored value is Carlos (with an s).
Below is a simple query similar to the one I use:
SELECT P.*
FROM MyTable AS P
WHERE CONTAINS(P.*, 'Carlo') OR FREETEXT(P.*, 'Carlo')
How can I make the search return words similar to Carlo, such as Carlos, Carla, etc., without hurting performance?
Try this:
SELECT P.*
FROM MyTable AS P
WHERE CONTAINS(P.*, 'FORMSOF(INFLECTIONAL, "Carlo")')
For reference, you can check the documentation.
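FORMSOF(INFLECTIONAL, ...) expands a word into its grammatical forms (verb tenses, singular and plural nouns), so it may not treat a proper noun such as Carlo the way you hope. If the goal is simply to match anything that begins with the typed text, a prefix term is another option. A minimal sketch, reusing the table from the question:

```sql
-- Prefix term: the double quotes and the trailing asterisk are both required.
-- Matches words that start with "Carlo", such as Carlos, but not Carla.
SELECT P.*
FROM MyTable AS P
WHERE CONTAINS(P.*, '"Carlo*"');

-- A shorter prefix widens the match to Carla as well,
-- along with every other word starting with "Carl".
SELECT P.*
FROM MyTable AS P
WHERE CONTAINS(P.*, '"Carl*"');
```

Prefix terms only match at the start of a word, so this will not find words that merely contain the typed text in the middle.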

Multiple Full Text Search SQL Queries Merged and Scored (Ranked Search Results)

I have a bunch of articles in one table that I'd like to query for search results. Using Full Text Search I can return a list of items that have the search keywords "near" each other.
Full text search does not seem to allow thesaurus (FORMSOF) with the NEAR delimiter.
What I'd like to do, in SQL, is create a query, or a number of queries, that search the same data in different ways and each return a score (or RANK if using Full Text Search). I would then like to merge these results so that there are no duplicates, and total up the ranks/scores so that I can ORDER BY those totals.
In addition, I would like to search a separate link table of "tags" that have been assigned to the documents, and give extra score to documents with matching tags.
What is the best-practice way of fulfilling these requirements?
Full-text search can run a proximity search such as ('"word*" NEAR "another*"') in a CONTAINSTABLE statement. The asterisk turns each term into a prefix match, so the query finds any words starting with 'word' and 'another' that appear near each other, and returns them with a ranking.
Separately, you can run a FORMSOF(THESAURUS, word) AND FORMSOF(THESAURUS, another) search, also with CONTAINSTABLE.
Then merge the two result sets and use ORDER BY to sort by the combined RANK values.
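One way the merge step could look is sketched below. All object names are assumptions made for illustration: dbo.Articles (with ArticleId as the full-text key column, and Title), dbo.ArticleTags (ArticleId, TagId) and dbo.Tags (TagId, Name). The flat bonus of 50 per matching tag is an arbitrary value, not a recommendation.

```sql
WITH hits AS (
    -- Proximity search with prefix terms.
    SELECT ct.[KEY] AS ArticleId, ct.[RANK] AS Score
    FROM CONTAINSTABLE(dbo.Articles, *, N'"word*" NEAR "another*"') AS ct

    UNION ALL

    -- Thesaurus search on the same terms.
    SELECT ct.[KEY], ct.[RANK]
    FROM CONTAINSTABLE(dbo.Articles, *,
         N'FORMSOF(THESAURUS, word) AND FORMSOF(THESAURUS, another)') AS ct

    UNION ALL

    -- Flat bonus for every matching tag (hypothetical tag tables).
    SELECT atg.ArticleId, 50
    FROM dbo.ArticleTags AS atg
    JOIN dbo.Tags AS t ON t.TagId = atg.TagId
    WHERE t.Name IN (N'word', N'another')
)
-- Collapse duplicates by summing the scores per article.
SELECT a.ArticleId, a.Title, SUM(h.Score) AS TotalScore
FROM hits AS h
JOIN dbo.Articles AS a ON a.ArticleId = h.ArticleId
GROUP BY a.ArticleId, a.Title
ORDER BY TotalScore DESC;
```

UNION ALL followed by GROUP BY keeps every partial score and removes duplicates in a single pass. RANK values are only meaningful relative to other rows of the same query, so adding ranks from different queries gives a rough ordering rather than a calibrated score.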

Multi-word CONTAINS full-text search only working partially in SQL Server

I'm using SQL Server 2012 and have created a full-text index on the NAME column of the COMPANY table. All the searches I've tested have the following format (with a variable number of search words), matching the beginnings of words in any order:
select id, name from company where contains(name, '"ka*" AND "de*"')
The problem is that there are cases where this query doesn't return any results even though there should be a perfect match. For example, when the company name is "ka de we oy", the example above returns a match, but '"ka*" AND "de*" AND "we*"' does not, and neither does searching with all four 'words'.
There are also other cases where, strangely enough, the search does not return results even with exact words. This seems related to very short (two-letter) words. There are also some issues when searching with many (6+) words.
Is there some explicit restriction on the number of words in a single query, or on how short they can be? How can I fix or work around this?
Edit: it seems that certain common English words are excluded from the index entirely (like 'we' in the example). This is an issue, since it's a requirement that a few of the common words be searchable. Is there any way to change which words are not indexed, or e.g. to change the 'language' of the indexing so that a different set of common words is left out?
Apparently this is simply a case of defining the correct stopwords / stoplist:
https://msdn.microsoft.com/en-us/library/ms142551.aspx
https://msdn.microsoft.com/en-us/library/cc280405.aspx
Or setting the full-text index language for the column to the actual language so that English words don't cause issues.
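For the stoplist route, a rough sketch of what that could look like is shown here. The stoplist name CompanyStoplist is made up for illustration, and 1033 is assumed to be the language (English) of the indexed column.

```sql
-- Start from a copy of the system stoplist (CompanyStoplist is a hypothetical name).
CREATE FULLTEXT STOPLIST CompanyStoplist FROM SYSTEM STOPLIST;

-- Remove the words that must stay searchable; 1033 = English.
ALTER FULLTEXT STOPLIST CompanyStoplist DROP 'we' LANGUAGE 1033;

-- Attach the custom stoplist to the full-text index on the table.
ALTER FULLTEXT INDEX ON company SET STOPLIST = CompanyStoplist;
```

This keeps the rest of the common words filtered out. As far as I understand, the index has to be repopulated before words that were previously dropped become searchable.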
Edit: actually it was easiest to simply disable the stoplist for the table entirely:
ALTER FULLTEXT INDEX ON company SET STOPLIST = OFF
Hopefully this helps someone else

Full Text Search by prefixed FORMSOF result

Is there a way to concatenate a prefix onto all the results of a FORMSOF() lookup when doing a CONTAINSTABLE() query? I work in the Nordic ski industry, and we sell "rollerskis" for summer training. As this is a pretty obscure word, the parser doesn't quite give me the inflectional forms I'd like. Specifically, if I run FORMSOF(INFLECTIONAL, "rollerski") through the parsing function sys.dm_fts_parser, it returns the following terms (no thesaurus, English language):
{"rollerski", "rollerskiing", "rollerskies", "rollerskied"}
That's close to what I need, but it's notably missing the pluralized rollerskis, which is used throughout our website, most notably in the name of several products and product categories. What I would like to do to get a more accurate list is return all the inflectional forms of "ski" and prefix each of them with "roller". That would give me the following list of terms:
{"rollerski", "rollerskis'", "rollerskis", "rollerskiing", "rollerskies", "rollerskied", "rollerski's"}
Is there a way I can achieve this within the CONTAINSTABLE() query?

Show matching keywords per each document returned from a SQL Server full-text query

Given an arbitrary full-text search (FTS) query, I need to list the keywords in each resulting document that match the query. For example, the query test OR rest returns 3 documents: the first contains only test, the second contains both words, and the third contains only rest. The explanation should produce 3 lists, (test), (test, rest) and (rest), so that the end user understands why the documents appeared in the query output.
The question is related to hit-highlighting, and I've explored existing solutions (e.g. http://www.codeproject.com/Articles/623815/Hit-Highlight-for-SQL-Server-Full-Text-Search or How to do hit-highlighting of results from a SQL Server full-text query). Those solutions rely on sys.dm_fts_parser fed with a hard-coded FORMSOF(INFLECTIONAL, ...) to produce all permutations of the search term.
In particular, solutions relying on sys.dm_fts_parser seem to stumble on prefix searches. For example, given the 2 queries test and "test*", select content from table where contains(content, @query, language 1033) produces different result sets, but select * from sys.dm_fts_parser(@query, 1033, 0, 1) yields 2 absolutely identical recordsets, which gives no clue as to why the query outputs differ.
Does anyone have any experience with similar cases?

Resources