Searching for words that are contained in other words

Searching for words that are contained in other words - azure-cognitive-search

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for document that have words that contain a word token in search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks

We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending of what exactly are you looking for:
Set the target searchable field to a language specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial" but generally speaking this helps significantly with recall.
Split search input before sending it to search and add "" to all. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless take them into account and search both (e.g. something like search=aa bb -> (aa | aa) (bb | bb*))
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.

perhaps this page might be of interest..?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by
default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

Related

What Solr field type provides basic wildcard searches?

I posted a document with the field value "Pineapple upside down cake." I want to get hits for pineapple, pine*, *side, pi?????le, upside down, etc. I chose text_en which does not find *side nor pi?????le.
What out of the box field type will give me hits for all the above?
I'm using Solr 7.6.

If you want to retain all the tokens as is (as I commented on your previous question about this, the text_en type contains a stemmer), use a field type with just a WhitespaceTokenizer and a LowercaseFilter. You'll have to define this field yourself.
I'm guessing you can use text_general to get a decent enough answer (it uses the StandardTokenizer, so it'll split on a few more cases than just whitespace).
The reason is that wildcard searches happens without most processing taking place (as it's impossible to do proper handling of stemming, splitting, etc. when you don't have the complete token), so any wildcard search will be against the generated list of tokens after processing.

Using solr shingle filter at query time

I am trying to build a field in my Solr Schema which will be able to join words together at query time and then search for this new joined word in the index.
Lets say I have the word "bluetooth" in my index and I want this to come up in results when I search "blue tooth".
So far I have been unsuccessful in trying varying combinations of shinglefilterfactory and positionfilterfactory as well as keyword, standard and whitespace tokenizers.
I'm hoping someone might be able to point me in the right direction to solve this!

Your goal is looking obscure to me and strange a little bit. But for your specific use-case the following filter can be used:
"solr.PatternReplaceCharFilterFactory"
"pattern"="[\\W]"
"replacement"=""
It will make "blue tooth" to be replaced into "bluetooth". And also you can specify that field-analysis for query-time only.
But let me tell you that usually tokenization is used instead of concatenation. And let me also offer you the following filter - WordDelimiterFilter. In such case this guy can split "BlueTooth" into "blue" and "tooth" based on cases.

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with high IDF terms. However, the corpus includes a lot part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to get removed by the stopword filter factory as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!

Only you can decide whether this is the right approach. If you can integrate POS tagger in and it gives you useful results - that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (store=false) field and process it without WordDelimiterFilterFactory all together. Then you search over both copies of your data, possibly with different boost for different fields.

Solr highlighting gives field/snippets with ANY term, instead of those that satisfy the query fully

I'm using Solr 5.x, standard highlighter, and i'm getting snippets which matches even one of the search terms only, even if i indicate q.op=AND.
I need ONLY the fields and snippets that matches ALL the terms (unless i say q.op=OR or just omit it), i.e. the field/snippet must satisfy the query. Solr does return the field/snippet that has all the terms, but also return many others.
I'm using hl.fl=*, to get the only fields having the terms, and searching against the default field ('text' containing full doc). Need to use * since i have multiple dynamic fields. Most fields are 'text_general' type (for search and HL), and some are 'string' type for faceting.
If its not possible for snippets to have all the terms, i MUST get only the fields that satisfy the query fully (since the question is more talking about matching all the terms, but the search query can become arbitrarily complex, so the fields/snippets should match the query).
Also, next is to get snippets highlighted with proximity based search/terms. What should i do/use for this? The fields coming in highlighting in this scenario should also satisfy the proximity query (unlike i get a field that contain any term, without regard to proximity constrains and other query terms etc)
Thanks for your help.

I've also encountered the same problem with highlighting. In my case, the query like
(foo AND bar) OR eggs
highlighted eggs and foo despite bar was not present in the document. I didn't manage to come up with proper solution, however I devised a dirty workaround.
I use the following query:
id:highlighted_document_id AND text:(my_original_query)
with debugQuery set to true. Then I parse explain text for highlighted_document_id. The text contains the terms from the query, which have contributed to the score. The terms, which should not be highlighted, are not present in the explanation.
The Python regex expressions I use to extract the terms (valid for Solr 5.2.1):
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')
then I simply search the markings in the highlighted text and remove them if the term doesn't match against any of the term in term_regex and wildcard_term_regex.
The solution is probably pretty limited, but works for me.

Sunspot/Solr: word concatenation

I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users will get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof'. Or vice-versa.
I was going to try and address this by creating a set of alternate match fields by manually concatenating the words from the source documents together. This seems kind of hackish, and implementing the other side (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?

Did yo have a look at SOLR spellcheck (or spell check) component?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such case.