Sunspot/Solr: word concatenation

I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users will get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof'. Or vice-versa.
I was going to try to address this by creating a set of alternate match fields that manually concatenate the words from the source documents. This seems kind of hackish, and implementing the other side (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?

Did you have a look at the Solr SpellCheck component?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such cases.
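A minimal solrconfig.xml sketch for that checker, assuming a searchable field named text (the field name and parameter values here are assumptions, not from the thread):

```xml
<!-- Sketch: spellcheck component using WordBreakSolrSpellChecker.
     "text" is a placeholder field name; tune maxChanges for your data. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">text</str>
    <!-- suggest "foolproof" when the index has "fool proof" -->
    <str name="combineWords">true</str>
    <!-- suggest "fool proof" when the index has "foolproof" -->
    <str name="breakWords">true</str>
    <int name="maxChanges">2</int>
  </lst>
</searchComponent>
```

Requests would then pass spellcheck=true (and spellcheck.dictionary=wordbreak if several dictionaries are configured) to get word-break suggestions alongside the regular results.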

Related

Using Solr shingle filter at query time

I am trying to build a field in my Solr Schema which will be able to join words together at query time and then search for this new joined word in the index.
Let's say I have the word "bluetooth" in my index and I want this to come up in results when I search for "blue tooth".
So far I have been unsuccessful trying varying combinations of ShingleFilterFactory and PositionFilterFactory, as well as the keyword, standard, and whitespace tokenizers.
I'm hoping someone might be able to point me in the right direction to solve this!
Your goal looks a little obscure and strange to me, but for your specific use case the following char filter can be used:
"solr.PatternReplaceCharFilterFactory"
"pattern"="[\\W]"
"replacement"=""
It will cause "blue tooth" to be rewritten as "bluetooth" before tokenization. You can also restrict that analysis to query time only.
But let me tell you that usually tokenization is used instead of concatenation. Let me also offer you another filter, WordDelimiterFilter: it can split "BlueTooth" into "blue" and "tooth" based on case changes.
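A sketch of the query-time-only setup (the field type name is hypothetical, and the surrounding tokenizers are assumptions about the rest of your chain):

```xml
<!-- Sketch: strip non-word characters at query time so "blue tooth"
     is analyzed as the single token "bluetooth". -->
<fieldType name="text_joined" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- char filters run before tokenization, so the space is removed first -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[\W]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that the KeywordTokenizer on the query side collapses the whole query string into a single token, so this only behaves sensibly for short, single-concept queries like "blue tooth".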

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with very common terms. However, the corpus includes a lot of part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to be removed by the stopword filter factory, as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate a POS tagger and it gives you useful results, that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or you could copyField your content to another (stored="false") field and process it without WordDelimiterFilterFactory altogether. Then you search over both copies of your data, possibly with different boosts for the different fields.
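A rough sketch of that copyField approach (all field and type names here are made up for illustration):

```xml
<!-- Sketch: keep a second, lightly-analyzed copy of the content so
     tokens like "123-OR-A" survive intact and stay searchable. -->
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="content_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>

<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- deliberately no StopFilterFactory or WordDelimiterFilterFactory -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With edismax you could then search both copies, e.g. qf=content content_exact^2, so exact part-number matches rank above the stopword-filtered field.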

Apache Solr wildcard searching with multiple words

We are using Apache Solr with PHP.
There is a problem with wildcard searching.
We want to search for "project manage*", which should list results like "project manager", "project management", etc. However, wildcard searching does not work whenever the query contains two words.
For example, "projectmanage*" works, whereas "project manage*" does not. We also tried escaping the space, but that does not work either.
Looking forward to any valuable input. Thanks in advance.
When applying a wildcard, the regular analysis chain is not performed at query time. This results in Solr looking for tokens starting with "project manage" - and if you have an analysis chain when indexing, your text is usually split into multiple tokens, so no single indexed token starts with that prefix.
You can use a Shingle filter to index multiple tokens as a single token, which can be used to get around the issue (be sure to use the same separator as you use in your text).
Another option is to lowercase the field when indexing and querying and use a regular StrField, which isn't processed in any way, or a KeywordTokenizer, which keeps the indexed content as a single token.
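A sketch of the shingle approach (the field type name is hypothetical; two-word shingles are enough for this example):

```xml
<!-- Sketch: index two-word shingles such as "project manager" as
     single tokens so a prefix query can match across the space. -->
<fieldType name="text_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2"
            outputUnigrams="true" tokenSeparator=" "/>
  </analyzer>
</fieldType>
```

The query then needs the space escaped (e.g. q=field:project\ manage*) so the whole prefix reaches the index as a single term.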

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries match documents that have words containing a token from the search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options, depending on what exactly you are looking for:
Set the target searchable field to a language-specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language, we'll do stemming, which helps with things such as "run" versus "running". It won't help with your specific example of "entrepreneurial", but generally speaking this helps significantly with recall.
Split the search input before sending it to search and add "*" to every term. Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search for both (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*)).
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
Perhaps this page might be of interest?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by
default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

Apache Solr: Correct use of CompoundWordFilter

I'm trying to figure out how to best configure Solr for my app. I'm indexing (mostly German) PDF documents, and I'm using dismax queries to query Solr.
If a document contains the word "Firmenprofil" (a German compound word, -> 'company profile'), it will only be returned by queries for exactly that word. However, it would be desirable for queries containing only "Profil" to also return this document.
I downloaded a german dictionary file and applied a DictionaryCompoundWordTokenFilter to both the index- and the query-analyzer.
The problem is that the filter decomposes the query into very small parts (e.g. "pro" in the case of "Firmenprofil"), which then results in all sorts of documents that contain words like "Product" being returned...
I tried removing the filter from the query analyzer, which leads to Solr not finding the document at all. I also tried leaving the query-time filter in but explicitly setting the onlyLongestMatch option to true, but that didn't seem to have any effect at all.
Ok, it seems like my dictionary file was simply too big (~20 MB). I replaced it with a more compact one and now it works just fine...
Without your actual config files, it's a bit of a guessing game.
Did you check if profil is part of the dictionary?
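For reference, index-time-only decompounding along the lines discussed above might be configured like this; it is a sketch, with the dictionary filename and size parameters as assumptions (raising minSubwordSize is one way to keep tiny fragments like "pro" out of the index):

```xml
<!-- Sketch: decompound German words at index time only, so a query
     for "Profil" can match documents containing "Firmenprofil". -->
<fieldType name="text_de_compound" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary-de.txt"
            minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the subwords are added at index time, a query for "Profil" can match without the query string itself being decomposed.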
