SQL Server CONTAINS and highlighting the matches - sql-server

Contains() with FORMSOF() is great for trying to capture the user's intent while searching, but is there any way to highlight the matches?
If I search for "said", it might return texts containing "says", "say", "spoke", etc. Is there a way I can highlight the match in the results, or surround the match with underscores? So I might get:
She _says_ yes.
I _say_ my name.
We _spoke_ for hours but he didn't _say_ much.
I've considered an after-the-fact (client-side) regex solution that would essentially remove common word endings like (e|ed|es|s|ing) and then look through my results with all those options (so bakes would become bak, and then I'd search for bak[a-z]?(s|d|es|ed|ing)). That works okay for words like that, but there are a whole lot of cases where the past tense doesn't follow that formula, like speak vs. spoke and spake.
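The client-side idea described above could be sketched roughly as follows. This is only an illustration in Python, not something SQL Server provides: the names `stem` and `highlight` are hypothetical, the suffix list is the one from the question, and irregular forms such as "spoke" are deliberately left unmatched to show the limitation.

```python
import re

# Common endings taken from the question; irregular forms are not covered.
SUFFIXES = ("ing", "ed", "es", "e", "s")

def stem(word):
    """Strip one common ending, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def highlight(text, term):
    """Surround words that share the term's crude stem with underscores."""
    pattern = re.compile(
        r"\b" + re.escape(stem(term.lower())) + r"(?:e|ed|es|s|ing)?\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: "_" + m.group(0) + "_", text)

print(highlight("He bakes and she baked.", "bakes"))
print(highlight("We spoke for hours.", "speak"))
```

The first call wraps both "bakes" and "baked"; the second leaves the sentence untouched, which is exactly the irregular-verb gap mentioned above.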

There are two SQL Server functions that can help you with this:
The SOUNDEX function helps you compare similar-sounding words,
and the DIFFERENCE function helps you evaluate how different two words sound (it returns a value from 0 to 4, where 4 means the closest match).

Related

Match keywords from a list and filter out the rest using regexp or other?

OK, so I'm familiar with the filter in Google Data Studio where strings can be filtered using AND or OR. The maximum, however, is 75 conditions and I need 100. It's also quite annoying having to add one filter at a time. Ideally I would be able to copy all the keywords and paste them in.
Any solutions to this? Can this be done with for instance regexp?
(?i)^(Keyword 1)$|^(Keyword 2)$|^(Keyword 3)$|^(Keyword 4)$
So that is what I came up with above and it seems to work.
During copy from Google Sheets, is there a way to format a column of keywords into the regex format above?
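One possible way to produce that pattern outside the spreadsheet is a small script. The sketch below is in Python and assumes the keyword column has been copied out as one keyword per line; the function name `keywords_to_regex` is hypothetical. It builds a single anchored alternation, which is equivalent to anchoring each keyword separately.

```python
import re

def keywords_to_regex(keywords):
    """Build one case-insensitive, exact-match alternation from a keyword list."""
    # Drop blank rows and surrounding whitespace from the pasted column.
    cleaned = [k.strip() for k in keywords if k.strip()]
    # Escape each keyword so characters like "." or "+" are matched literally.
    return "(?i)^(" + "|".join(re.escape(k) for k in cleaned) + ")$"

column = "Keyword 1\nKeyword 2\nKeyword 3\nKeyword 4"
pattern = keywords_to_regex(column.splitlines())
print(pattern)
print(bool(re.search(pattern, "keyword 3")))
print(bool(re.search(pattern, "Keyword 40")))
```

Whether Data Studio accepts every escape that Python's `re.escape` emits is an assumption worth checking on a short list first.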

Using solr shingle filter at query time

I am trying to build a field in my Solr Schema which will be able to join words together at query time and then search for this new joined word in the index.
Let's say I have the word "bluetooth" in my index and I want this to come up in results when I search "blue tooth".
So far I have been unsuccessful in trying varying combinations of ShingleFilterFactory and PositionFilterFactory as well as the keyword, standard and whitespace tokenizers.
I'm hoping someone might be able to point me in the right direction to solve this!
Your goal seems a little unusual to me, but for your specific use case the following char filter can be used:
"solr.PatternReplaceCharFilterFactory"
"pattern"="[\\W]"
"replacement"=""
It will turn "blue tooth" into "bluetooth". You can also apply that field analysis at query time only.
Note, though, that tokenization is usually used instead of concatenation. As an alternative, consider WordDelimiterFilter, which can split "BlueTooth" into "blue" and "tooth" based on case changes.
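To see what the two suggestions do to the text, here is a rough imitation in plain Python. It is not Solr code: `strip_non_word` mimics the effect of the pattern-replace char filter with the pattern above, and `split_on_case` only approximates the case-change splitting of WordDelimiterFilter (with lowercasing added as a separate step). Both function names are hypothetical.

```python
import re

def strip_non_word(query):
    # Remove every non-word character, as a char filter with
    # pattern [\W] and an empty replacement would.
    return re.sub(r"[\W]", "", query)

def split_on_case(token):
    # Split at lower-to-upper case changes, then lowercase the parts.
    parts = re.findall(r"[A-Z][a-z]*|[a-z]+", token)
    return [p.lower() for p in parts]

print(strip_non_word("blue tooth"))
print(strip_non_word("blue tooth headset"))
print(split_on_case("BlueTooth"))
```

Note that the second call glues all three words together, so a filter like this concatenates the entire query, not just the pair you had in mind.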

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with high-frequency terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to get removed by the stopword filter factory, as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate a POS tagger and it gives you useful results, that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or you could copyField your content to another (stored=false) field and process it without WordDelimiterFilterFactory altogether. Then you search over both copies of your data, possibly with a different boost for each field.
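The two-copies idea can be illustrated with a toy model in Python. This is not Solr configuration; the field names `body` and `body_exact`, the stop list, and the functions are all hypothetical, and the splitting is only a crude stand-in for what the real analyzers do.

```python
import re

STOPWORDS = {"a", "or", "the"}  # hypothetical stop list

def analyze_text(value):
    """Split on non-alphanumerics, lowercase, and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", value.lower())
    return [t for t in tokens if t not in STOPWORDS]

def analyze_exact(value):
    """Whitespace split and lowercase only, so part numbers stay whole."""
    return value.lower().split()

def index_document(text):
    """Store the same content twice, analyzed in two different ways."""
    return {"body": analyze_text(text), "body_exact": analyze_exact(text)}

doc = index_document("Order the 123-OR-A part")
print(doc["body"])
print(doc["body_exact"])
```

In the first field "OR" and "A" are lost, while the second keeps "123-or-a" as a single searchable token; a query would then be run against both fields.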

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for documents that have words containing a word token from the search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial" but generally speaking this helps significantly with recall.
Split the search input before sending it to search and add "*" to all terms. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search both (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*)).
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
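The second option (splitting the input and searching both the plain term and its prefix form) could look roughly like this on the client. It is a minimal Python sketch that assumes whitespace-separated terms and does no escaping of special query characters; `add_prefix_terms` is a hypothetical helper name.

```python
def add_prefix_terms(search_text):
    """Rewrite 'aa bb' as '(aa | aa*) (bb | bb*)'."""
    terms = search_text.split()
    # Each term is searched both as-is (so stemming still applies)
    # and as a prefix.
    return " ".join(f"({t} | {t}*)" for t in terms)

print(add_prefix_terms("aa bb"))
print(add_prefix_terms("entrepreneur"))
```

The result would then be passed as the value of the search parameter.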
perhaps this page might be of interest..?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

Parsing search queries for SQL 2008 FTS

We want to use SQL SERVER 2008 Full Text Search and seem to run into a lot of problems handling the search query.
If the user types in "blue dog" it just crashes SQL unless we parse the search terms to include the "" around the words, but that makes it a phrase instead of keywords.
I want results where blue or dog are included, but that means replacing spaces with OR(s) and so on. Unfortunately there seem to be far too many combinations a user might type.
Are there any libraries out there (for .net) that can already parse a search string into something FT understands?
We'd like a Google like syntax :)
thanks
I was looking for the "FREETEXT" option and was using the "CONTAINS" keyword instead, my bad. Freetext is giving me the results I wanted.
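If CONTAINS is still needed, the space-to-OR rewrite mentioned in the question could be sketched like this. The example is in Python rather than .NET and is only an illustration; `to_contains_query` is a hypothetical helper, and it handles nothing beyond plain keywords (no operators, no phrases).

```python
def to_contains_query(user_input):
    """Turn 'blue dog' into '"blue" OR "dog"' for a CONTAINS search condition."""
    # Drop any double quotes the user typed so the generated quoting stays balanced.
    terms = [t.replace('"', "") for t in user_input.split()]
    terms = [t for t in terms if t]
    # Quote each keyword individually and join the keywords with OR.
    return " OR ".join(f'"{t}"' for t in terms)

print(to_contains_query("blue dog"))
```

The resulting string should be passed to the query as a parameter rather than concatenated into the SQL text.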
