SQL Server - Percent based Full Text Search - sql-server

I want to search a particular column of a table so that the result set satisfies the following two conditions:
The result set should contain records in which 90% of the characters match the given search text.
The result set should contain records in which 70% of the consecutive characters match the given search text.
This implies that when the 10-character word Sukhminder is searched:
it should return records like Sukhmindes, ukhminder, and Sukhmindzr, because they fulfil both of the above conditions.
But it should not return records like Sukhmixder, because it does not fulfil the second condition. Likewise, it should not return the record Sukhminzzz, because it does not fulfil the first condition.
I am trying to use the Full Text Search feature of SQL Server, but I have not been able to formulate the required query yet. Kindly reply ASAP.

You could try using a combination of the SOUNDEX and DIFFERENCE functions together with full-text searching.
Check out this Google book online, which talks about it.

Do you mean 70% of the original word? I think the only way you could do this exactly as stated would be to work out all the possible substrings that could satisfy the 70% criterion (for a 10-character word, every run of 7 consecutive characters) and bring back records matching any of those:
Col LIKE '%min%' AND (
Col LIKE '%Sukhmin%' OR Col LIKE '%ukhmind%'
OR Col LIKE '%khminde%' OR Col LIKE '%hminder%' )
Then do further processing to see if the 90% criterion is met.
Edit: Actually, you might find this link on Fuzzy Searching to be of interest: http://anastasiosyal.com/archive/2009/01/11/18.aspx
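A minimal Python sketch of the two steps in this answer: generating the substrings for the LIKE pre-filter, then the follow-up percentage check. The function names are hypothetical, and the reading of "90% of the characters match" (total size of the matching blocks found by difflib, measured against the length of the search term) is an assumption, since the question does not define it precisely.

```python
from difflib import SequenceMatcher


def like_windows(term, consecutive_pct=70):
    """Return every substring of `term` long enough to satisfy the consecutive rule."""
    n = len(term)
    width = -(-consecutive_pct * n // 100)  # integer ceiling of pct% of n
    return [term[i:i + width] for i in range(n - width + 1)]


def is_match(term, candidate, total_pct=90, consecutive_pct=70):
    """Check both percentage rules, case-insensitively, against len(term)."""
    a, b = term.lower(), candidate.lower()
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    total = sum(block.size for block in blocks)    # characters matched overall
    longest = max(block.size for block in blocks)  # longest consecutive run
    n = len(a)
    return 100 * total >= total_pct * n and 100 * longest >= consecutive_pct * n


windows = like_windows("Sukhminder")
candidates = ["Sukhmindes", "ukhminder", "Sukhmindzr", "Sukhmixder", "Sukhminzzz"]
accepted = [c for c in candidates if is_match("Sukhminder", c)]
```

Here `windows` holds the four 7-character substrings used in the LIKE clauses above, and `accepted` keeps only the candidates that pass both rules under this interpretation.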

Related

Vlookup within results of a multiple-match lookup

I am trying to find a way to return values from a table like the bottom table:
The tables are provided externally and my understanding is that I can't run filters or sorts as the table is full of extra irrelevant data that would not sort properly across all columns.
I'm approaching this in two steps:
First, I want to return the row information for any entry in the table that has a CPT matching the lookup value.
Second (where I'm stuck), when the lookup returns a DESC that corresponds to a matched CPT, the goal is to also pull in any Code A/Code B entries that correspond to that DESC value.
I found an existing formula that works for the first part, shown below (apologies for the formatting; SO keeps flagging my draft as having unformatted code).
=IFERROR(INDEX($B$3:$B$3000,SMALL(IF(I$2=$A$3:$A$3000,ROW($A$3:$A$3000)-MIN(ROW($A$3:$A$3000))+1,""),ROW()-1)),"")
Currently, this formula does return the DESC entries for any matching CPTs, and I use VLOOKUP to pull the rest of the relevant columns for a corresponding row.
I'm coming up short in cases where there are multiple Code A/Code B entries for a given DESC, as the lookup only returns information for the row containing the matching CPT and not any relevant codes contained in subsequent rows.
I was thinking I'd have to use something similar to the existing lookup formula to identify a matched row and then display any subsequent rows containing Code entries until the next row with a non-blank CPT entry.
Unfortunately, I don't know if that's actually the best way to approach this. Any resources/suggestions are greatly appreciated.
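The idea described above (find the matched row, then keep reading rows until the next non-blank CPT) can be sketched outside Excel to check the logic. This is only an illustration in Python: the four-column layout (CPT, DESC, Code A, Code B) and every value in the sample data are made up, because the original table is not shown.

```python
def rows_for_cpt(rows, lookup):
    """Return the matched row plus the continuation rows that follow it."""
    result = []
    in_block = False
    for cpt, desc, code_a, code_b in rows:
        if cpt:  # a non-blank CPT starts a new block
            in_block = (cpt == lookup)
        if in_block:
            result.append((cpt, desc, code_a, code_b))
    return result


# Hypothetical sample data: blank CPT/DESC cells continue the previous entry.
rows = [
    ("99213", "Office visit", "A1", "B1"),
    ("", "", "A2", "B2"),
    ("", "", "A3", ""),
    ("99214", "Extended visit", "A9", "B9"),
    ("", "", "A10", "B10"),
]
matched = rows_for_cpt(rows, "99213")
```

With this data, `matched` contains the header row for 99213 and its two continuation rows, and stops before the 99214 block.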

SQL Contains exact phrase

I am trying to implement a search mechanism with CONTAINS() on SQL Server 2014.
I've read here https://technet.microsoft.com/en-us/library/ms142538%28v=sql.105%29.aspx and in the book "Pro Full-Text Search in SQL Server 2008" that I need to use double quotes to search for an exact phrase.
But, e.g., if I use CONTAINS(*, '"test"'), I receive results containing words like "numerictest" as well. If I try CONTAINS(*, '" test "'), it is the same. I've noticed that there are fewer results than if I search with CONTAINS(*, '*test*') for a prefix/suffix search, so there is definitely a difference between the searches.
I didn't expect "numerictest" in the results of the first statement. Is there an explanation for this behaviour?
I have been wracking my brain about a very similar problem and I recently found the solution.
In my case I was searching full-text fields for "#username", but using CONTAINS(body, '"#username"') returned plain "username" as well. I wanted it to match strictly with the # sign.
I could use LIKE '%#username%', but the query took over a minute, which was unacceptable, so I kept looking.
Some people in a chat room suggested using both CONTAINS and LIKE. So:
SELECT TOP 25 * FROM table WHERE
CONTAINS(body, '"#username"') AND body LIKE '%#username%';
This worked perfectly for me because CONTAINS pulls both the username and #username records, and then LIKE filters out the ones without the # sign. Queries take 2-3 seconds now.
I know this is an old question but I came across it in my searching so having the answer I thought I would post it. I hope this helps.
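Why the combination is fast can be sketched in a few lines of Python. This is a toy model, not how SQL Server is implemented: `tokenize` is a crude stand-in for a word breaker, and the sample documents are invented. The index lookup plays the role of CONTAINS, and the substring check plays the role of LIKE.

```python
import re


def tokenize(text):
    """Crude stand-in for a word breaker: lowercase and drop punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(docs, index, needle):
    tokens = tokenize(needle)
    if not tokens:
        return []
    # Stage 1 (the CONTAINS role): cheap index lookup that ignores the '#'.
    candidates = set.intersection(*(index.get(t, set()) for t in tokens))
    # Stage 2 (the LIKE role): exact substring check on the few survivors.
    return sorted(d for d in candidates if needle.lower() in docs[d].lower())


docs = {
    1: "ping #username about it",
    2: "username is required",
    3: "nothing here",
}
index = build_index(docs)
hits = search(docs, index, "#username")
```

The expensive substring scan only runs on the documents the index already narrowed down, which is the same reason the combined query is much faster than LIKE alone.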
CONTAINS(*, '"test"') will only match the full word "test", as you expect.
CONTAINS(*, '" test "') behaves the same as above.
CONTAINS(*, '"*test*"') will actually do a PREFIX-ONLY search: it basically strips out any special characters at the start of the word and only uses the second *.
You cannot do POSTFIX searches using full-text search.
My concern lies with the CONTAINS(*) part: this searches every full-text-cataloged column in that row. Without seeing the data it is hard to tell, but my guess is that another column in the row you think is bad is actually matching "test" somewhere.

SQL Full Text Indexer, exact matches and escaping

I'm trying to replace a Keyword Analyser based Lucene.NET index with an SQL Server 2008 R2 based one.
I have a table that contains custom indexed fields that I need to query. The value of the Index column (see below) is a combination of name/value pairs of the custom index fields from a series of .NET types. The actual values are pulled from attributes at run time, because the structure is unknown.
I need to be able to search for set name and value pairs, using ANDs and ORs and return the rows where the query matches.
Id Index
====================================================================
1 [Descriptor.Type]=[5][Descriptor.Url]=[/]
2 [Descriptor.Type]=[23][Descriptor.Url]=[/test]
3 [Descriptor.Type]=[25][Descriptor.Alternative]=[hello]
4 [Descriptor.Type]=[26][Descriptor.Alternative]=[hello][Descriptor.FriendlyName]=[this is a test]
A simple query looks like this:
select * from Indices where contains ([Index], '[Descriptor.Url]=[/]');
That query results in the following error:
Msg 7630, Level 15, State 2, Line 1
Syntax error near '[' in the full-text search condition '[Descriptor.Url]=[/]'.
So with that in mind, I altered the data in the Index column to use | instead of [ and ]:
select * from Indices where contains ([Index], '|Descriptor.Url|=|/|');
Now, while that query is valid, when I run it all rows containing Descriptor.Url and a value starting with / are returned, instead of only the records (exactly one in this case) that match exactly.
My question is, how can I escape the query to account for the [ and ] and ensure that just the exact matching row is returned?
A more complex query looks a little like this:
select * from Indices where contains ([Index], '[Descriptor.Type]=[12] AND ([Descriptor.Url]=[/] OR [Descriptor.Url]=[/test])');
Thanks,
Kieron
Your main issue is in using a SQL word breaker and the CONTAINS syntax. By default, SQL word breakers eliminate punctuation and normalize numbers, dates, URLs, email addresses, and the like. They also lowercase everything and stem words.
So, for your input string:
[Descriptor.Type]=[5][Descriptor.Url]=[/]
You would have the following tokens added to the index (along with their positions):
descriptor type nn5 5 descriptor url
(Note: the nn5 is a way to simplify querying numbers and dates given in different formats; the original number is also indexed at the same position.)
So, as you can see, the punctuation is not even stored in the full-text index, and thus there is no way to query it using the CONTAINS statement.
So your statement:
select * from Indices where contains ([Index], '|Descriptor.Url|=|/|');
Would actually be normalized down to "descriptor url" by the query generator before submitting it to the full text index, thus the hits on all the entries that have "descriptor" next to "url", excluding punctuation.
What you need is the LIKE statement.
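The difference between the two kinds of matching can be seen in a small Python sketch using the four rows from the question. `normalize` is a deliberately rough stand-in for a word breaker (it ignores token positions, stemming, and the nn5 form), and `pair` and `ids_where` are hypothetical helpers. Note that in a real T-SQL LIKE pattern the `[` character is itself a wildcard and has to be escaped, for example as `[[]`.

```python
import re

ROWS = {
    1: "[Descriptor.Type]=[5][Descriptor.Url]=[/]",
    2: "[Descriptor.Type]=[23][Descriptor.Url]=[/test]",
    3: "[Descriptor.Type]=[25][Descriptor.Alternative]=[hello]",
    4: "[Descriptor.Type]=[26][Descriptor.Alternative]=[hello]"
       "[Descriptor.FriendlyName]=[this is a test]",
}


def normalize(text):
    """Very rough stand-in for a word breaker: lowercase, drop punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pair(name, value):
    """Build the literal name/value fragment stored in the Index column."""
    return f"[{name}]=[{value}]"


def ids_where(predicate):
    return [row_id for row_id, value in ROWS.items() if predicate(value)]


# Token-style matching: the query collapses to the words "descriptor" and "url".
query_tokens = normalize("|Descriptor.Url|=|/|")
token_hits = ids_where(lambda v: all(t in normalize(v) for t in query_tokens))

# Substring-style matching (what LIKE does): punctuation is significant.
exact_hits = ids_where(lambda v: pair("Descriptor.Url", "/") in v)

# AND/OR combinations are ordinary boolean expressions over substring tests.
complex_hits = ids_where(
    lambda v: pair("Descriptor.Type", "23") in v
    and (pair("Descriptor.Url", "/") in v or pair("Descriptor.Url", "/test") in v)
)
```

Token-style matching returns rows 1 and 2 because the slash and the delimiters never reach the index, while the substring test returns only row 1.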
Using "|" as your delimiter causes the CONTAINS query to interpret it as OR, which is why you are getting unexpected results. You should be able to escape the bracket like so:
SELECT * FROM Indices WHERE
CONTAINS([Index], '[[]Descriptor.Type]=[[]12]')

Finding alphabetical position in a large list

I have an AS/400 table containing roughly 1 million rows of full names / company names, which I would like to move to another datastore while still matching the speed of the original.
Currently, a user enters the search and almost instantaneously gets the alphabetical position of the search term in the table and a page of matches. The user can then paginate either up or down through the records very quickly.
There is almost no updating of the data and approximately 50 inserts per week. I'm thinking that any database can maintain an alphabetical index of the names, but I'm unsure of how to quickly find the position of the search term within the dataset. Any suggestions are greatly appreciated.
This sounds just like a regular pagination of results, except that instead of going to a specific page based on a page number or offset being requested, it goes to a specific page based on where the user's search fits in the results alphabetically.
Let's say you want to fetch 10 rows after this position, and 10 rows before.
If the user searches for 'Smith', you could do two selects such that:
SELECT name
FROM companies
WHERE name < 'Smith'
ORDER BY name DESC
LIMIT 10
and then
SELECT name
FROM companies
WHERE name >= 'Smith'
ORDER BY name
LIMIT 10
You could do a UNION to fetch that in one query; the above is just simplified.
The term the user searched for would fit half way through these results. If there are any exact matches, then the first exact match will be positioned such that it is eleventh.
Note that if the user searches for 'aaaaaaaa' then they'll probably just get the 10 first results with nothing before it, and for 'zzzzzzzz' they may get just the 10 last results.
I'm assuming that the SQL engine in question allows >= and < comparisons between strings (and can optimise them using indexes), but I haven't tested this, so it may not be possible. If, like MySQL, it supports internationalized collations, then you could even have the ordering done correctly for non-ASCII characters.
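The two selects can be tried out quickly with Python's built-in sqlite3 module. This is only a sketch under assumptions: SQLite stands in for whatever datastore is eventually chosen, the sample names are invented, and `page_around` is a hypothetical helper name.

```python
import sqlite3


def page_around(conn, term, page_size=10):
    """Return up to page_size names before `term` and page_size names from it on."""
    before = conn.execute(
        "SELECT name FROM companies WHERE name < ? ORDER BY name DESC LIMIT ?",
        (term, page_size),
    ).fetchall()
    after = conn.execute(
        "SELECT name FROM companies WHERE name >= ? ORDER BY name LIMIT ?",
        (term, page_size),
    ).fetchall()
    # The first query comes back in descending order, so flip it.
    return [row[0] for row in reversed(before)] + [row[0] for row in after]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.execute("CREATE INDEX idx_companies_name ON companies (name)")
names = ["Adams", "Baker", "Jones", "Smith", "Smythe", "Taylor", "Young"]
conn.executemany("INSERT INTO companies VALUES (?)", [(n,) for n in names])

page = page_around(conn, "Smith", page_size=2)
```

With a page size of 2, the search term lands in the middle of the returned window, as described above.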
If by "the position of the search" you mean the number of the record if they were enumerated alphabetically, you may want to try something like:
select count(*) from companies where name < 'Smith'
Most databases ought to optimize that reasonably well (but try it--theories you read on the web don't trump empirical data).
Just to add to the ordering suggestions:
Add an index to the name if this is your standard means of data retrieval.
You can paginate efficiently by combining LIMIT and OFFSET.
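A short sketch combining the last two suggestions: the count gives the alphabetical position, and that position can be used directly as the OFFSET. It uses Python's sqlite3 module with invented sample data; the helper names are hypothetical, and how well OFFSET performs on a million rows is something to measure on the real datastore rather than assume.

```python
import sqlite3


def alphabetical_position(conn, term):
    """Number of names that sort strictly before `term`."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM companies WHERE name < ?", (term,)
    ).fetchone()
    return count


def page_at(conn, offset, page_size=10):
    """Fetch one page of names starting at a zero-based alphabetical offset."""
    rows = conn.execute(
        "SELECT name FROM companies ORDER BY name LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.execute("CREATE INDEX idx_companies_name ON companies (name)")
names = ["Young", "Adams", "Smythe", "Baker", "Taylor", "Jones", "Smith"]
conn.executemany("INSERT INTO companies VALUES (?)", [(n,) for n in names])

position = alphabetical_position(conn, "Smith")
page = page_at(conn, position, page_size=2)
```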

Full text catalog/index search for %book%

I'm trying to wrap my head around how to search for something that appears in the middle of a word / expression (something like searching for "LIKE %book%"), but in a SQL Server (2005) full-text catalog.
How can I do that? It almost appears as if both CONTAINS and FREETEXT really don't support a wildcard at the beginning of a search expression. Can that really be?
I would have imagined that FREETEXT(*, "book") would find anything with "book" inside, including "rebooked" or something like that.
Unfortunately, CONTAINS only supports prefix wildcards:
CONTAINS(*, '"book*"')
SQL Server Full Text Search is based on tokenizing text into words. There is no smaller unit than a word, so the smallest things you can look for are words.
You can use prefix searches to look for matches that start with certain characters, which is possible because word lists are kept in alphabetical order and all the Server has to do is scan through the list to find matches.
To do what you want a query with a LIKE '%book%' clause would probably be just as fast (or slow).
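The point about alphabetical word lists can be illustrated with a small Python sketch. It is a simplified model, not SQL Server's actual index structure, and the word list is invented: a prefix search can jump to the right place in a sorted list and stop early, while an infix search has to look at every word.

```python
from bisect import bisect_left


def prefix_matches(sorted_words, prefix):
    """Jump to the first possible match, then read until the prefix stops matching."""
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


def infix_matches(sorted_words, fragment):
    """No shortcut available: every word has to be inspected."""
    return [word for word in sorted_words if fragment in word]


words = sorted(["book", "bookcase", "booked", "boot", "notebook", "rebooked"])
by_prefix = prefix_matches(words, "book")
by_infix = infix_matches(words, "book")
```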
If you want to do some serious full-text searching, then I would use (and have used) Lucene.Net. MS SQL Full Text Search never seems to work that well for anything other than the basics.
Here's a suggestion that is a workaround for that wildcard limitation. You create a computed column that contains the same content but in reverse as the column(s) you are searching.
If, for example, you are searching on a column named 'ProductTitle', then create a column named 'ProductsRev'. Then set that column's 'Computed Column Specification' value to:
(reverse([ProductTitle]))
Include the 'ProductsRev' column in your search and you should now be able to return results that support a wildcard at the beginning of the word. Good luck!!
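A small Python sketch of why the reversed column helps, assuming a search for words that end in a given string: reversing both the text and the search term turns the leading wildcard into a trailing one, which a prefix search can handle. The function name and sample titles are made up. Note that this covers words ending in the term, not a term buried in the middle of a word.

```python
def ends_with_search(titles, ending):
    """Find titles containing a word that ends with `ending`, via reversed prefixes."""
    needle = ending[::-1].lower()  # "book" becomes "koob"
    hits = []
    for title in titles:
        reversed_words = title[::-1].lower().split()  # what the reversed column holds
        if any(word.startswith(needle) for word in reversed_words):
            hits.append(title)
    return hits


titles = ["Red notebook", "Bookcase", "Rebooked flight"]
hits = ends_with_search(titles, "book")
```

Here only "Red notebook" is found: "Bookcase" is already reachable with a normal prefix search, and "Rebooked" has the term in the middle of the word, which neither direction covers.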
Full text has a table that lists all the words the engine has found. It should have orders of magnitude fewer rows than your full-text-indexed table. You could select from that table " where field like '%book%' " to get all the words that have 'book' in them, then use that list to write a full-text query. It's cumbersome, but it would work, and it would be OK in the speed department. HOWEVER, ultimately you are using full text wrong when you do this. It might actually be better to educate the source of these feature requests about what full text is doing. You want them to understand what it WANTS to do, so they can get high value from it. For example: only use wildcards at the end of a word, which means thinking of the words as an ordered list.
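The second half of that workaround, turning the list of matching words into a full-text condition, might look like the following Python sketch. The vocabulary is assumed to have been fetched already (how to read it from the engine is not covered here), the sample words are invented, and `build_contains_condition` is a hypothetical helper name.

```python
def build_contains_condition(vocabulary, fragment):
    """Build an OR-ed search condition from every indexed word containing `fragment`."""
    words = sorted({word for word in vocabulary if fragment in word})
    # Each word is wrapped in double quotes; the result is meant to be passed
    # to the query as a parameter, not concatenated into SQL text.
    return " OR ".join(f'"{word}"' for word in words)


vocabulary = ["book", "rebooked", "notebook", "boot", "red"]
condition = build_contains_condition(vocabulary, "book")
```

The resulting string can then be used as the search condition of a CONTAINS query.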
Why not write an assembly in C# to compute all the non-repeated suffixes? For example, if you have the text "eat the red meat", you can store "eat at t the he e red ed d meat" in a field (note that it is not necessary to add eat, at and t again) and then use full-text search on this field. A function for doing that can easily be written in C#.
x) I know it seems odd... it's a workaround.
x) I know I'm adding overhead to the insert/update... only justified if this overhead is insignificant compared to the improvement in the search function.
x) I know there is also an overhead in the size of the stored data.
But I'm pretty confident that it will be quite fast.
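The suffix computation itself is short. The answer suggests C#; the sketch below uses Python purely to illustrate the logic, with a hypothetical function name, and it reproduces the example from the answer.

```python
def suffix_field(text):
    """Return every distinct word suffix in `text`, in first-seen order."""
    seen = set()
    suffixes = []
    for word in text.split():
        for start in range(len(word)):
            suffix = word[start:]
            if suffix not in seen:  # skip suffixes already stored
                seen.add(suffix)
                suffixes.append(suffix)
    return " ".join(suffixes)


field = suffix_field("eat the red meat")
```

A prefix search such as "ea*" on this field then finds rows whose original text contains "ea" anywhere inside a word, at the cost of the extra storage mentioned above.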
