So right now I'm just using the admin interface to run search queries. I know that a tilde (~) suffix turns a term into a fuzzy search.
However, what about a phrase? I tried "some words"~ but it doesn't seem to be returning results when it should be. Any idea why? Do I need a special fieldtype or special filters?
Right now, everything is pretty vanilla, but I did import a lot of data (about 12 million rows). I know there are things in there that should be returned by a good fuzzy match but are not.
Any help is appreciated.
Also, if it makes a difference, I would like to use the Levenshtein algorithm.
ComplexPhraseQueryParser can be used to handle wildcard and fuzzy phrase queries.
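As a rough, untested sketch of what such a query could look like: the parser is selected with the {!complexphrase} local-params prefix, and each word inside the quotes carries its own ~ suffix. The helper name fuzzy_phrase_query and the field name "text" are made up for illustration, and this assumes the parser is available in your Solr install.

```python
from urllib.parse import urlencode


def fuzzy_phrase_query(field, phrase, in_order=True):
    """Build a complexphrase query in which every word of the phrase is fuzzy.

    `field` is a hypothetical field name; use one from your own schema.
    """
    words = phrase.split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    # Each word gets its own ~ suffix inside the quoted phrase.
    fuzzy_words = " ".join(word + "~" for word in words)
    order = "true" if in_order else "false"
    return '{!complexphrase inOrder=%s}%s:"%s"' % (order, field, fuzzy_words)


query = fuzzy_phrase_query("text", "some words")
# URL-encode the query so it can be appended to a /select request.
params = urlencode({"q": query})
print(query)
print(params)
```

The printed query string can be pasted into the q box of the admin interface to try it out.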
I am trying to build a field in my Solr Schema which will be able to join words together at query time and then search for this new joined word in the index.
Let's say I have the word "bluetooth" in my index and I want it to come up in the results when I search for "blue tooth".
So far I have been unsuccessful with various combinations of ShingleFilterFactory and PositionFilterFactory, as well as the keyword, standard and whitespace tokenizers.
I'm hoping someone might be able to point me in the right direction to solve this!
Your goal seems a little obscure and unusual to me. But for your specific use case, the following char filter can be used:
"solr.PatternReplaceCharFilterFactory"
"pattern"="[\\W]"
"replacement"=""
It will turn "blue tooth" into "bluetooth". You can also apply this field analysis at query time only.
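To see what that pattern does to the input, here is a small Python imitation of the replacement. It is only a sketch of the regex behaviour, not Solr's implementation, and the function name is made up.

```python
import re

# Same idea as pattern="[\\W]" with replacement="": drop every non-word character.
NON_WORD = re.compile(r"\W")


def strip_non_word_chars(text):
    """Imitate the char filter by deleting spaces, punctuation and other non-word characters."""
    return NON_WORD.sub("", text)


print(strip_non_word_chars("blue tooth"))
print(strip_non_word_chars("blue tooth headset"))
```

Note that a char filter runs before the tokenizer, so the whole query string is glued together: a three-word query becomes one long token as well, which is probably only what you want on a dedicated field.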
But let me point out that tokenization is usually used instead of concatenation. Let me also suggest another filter, WordDelimiterFilter. It can split "BlueTooth" into "Blue" and "Tooth" at the case change (a lowercase filter then yields "blue" and "tooth").
The Solr syntax for fuzzy search is:
q~n, where q is the query term and n is the Levenshtein distance (e.g. 1-3).
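For readers unfamiliar with the metric, here is a minimal textbook implementation of Levenshtein distance. It only illustrates what n counts (single-character insertions, deletions and substitutions); it is not how Lucene evaluates fuzzy queries internally.

```python
def levenshtein(a, b):
    """Return the number of single-character edits needed to turn a into b."""
    # Keep the shorter string in b so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("samson", "samsung"))
```

So a term that is two edits away from the indexed word would need n to be at least 2 to match.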
The syntax for prefix search is:
q*, where q is a query term and * is a wildcard.
Combining both as q~n* (even with n=1) has the side effect that nearly everything matches
(for a reason I still need to find out).
Combining both as q*~n (even with n=1) has the side effect that the query behaves as if it were a prefix search only.
In our use case we need to offer suggestions based on historical queries stored in the index. That also seems to be what Google does when you type in a misspelled term, and it is a great solution for suggestions.
The problem is that we can offer either suggestions that start with the same prefix, or suggestions within a defined Levenshtein distance <= 3, which is impractical for long terms.
Now, I know that a similar question was asked 3 years ago, where the answer says this can't be expressed in Solr syntax and that the whole case doesn't make any particular sense. But in my opinion it does make sense, and a combination would be a perfect solution to practical problems.
This is not a tested solution, but did you think of using q* OR q~1? For example: name:S* OR name:S~1
A larger example: name:Samson~3 OR name:Samson* returned: <str name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133</str></doc>
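If you build the query in application code, the OR combination is easy to generate. A small sketch, with a made-up helper name and the field name taken from the example above:

```python
def prefix_or_fuzzy(field, term, max_edits=1):
    """Build '<field>:<term>* OR <field>:<term>~<n>' for a single-word term."""
    if not term or any(ch.isspace() for ch in term):
        raise ValueError("term must be a single word without whitespace")
    return "%s:%s* OR %s:%s~%d" % (field, term, field, term, max_edits)


print(prefix_or_fuzzy("name", "S"))
print(prefix_or_fuzzy("name", "Samson", max_edits=3))
```

This does not give you a fuzzy prefix in one clause; it simply unions the prefix matches with the fuzzy matches, which may be close enough for suggestions.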
I have not tried this specifically, but it looks like you might be able to do what you want with the ComplexPhraseQueryParser.
It looks like the ComplexPhraseQueryParser is slated to be distributed with 4.8, but for now you can get the plugin (there are install instructions in the zip files) from Solr's Jira. https://issues.apache.org/jira/browse/SOLR-1604
There is some discussion using distance here. http://lucene.472066.n3.nabble.com/ComplexPhraseQueryParser-and-wildcards-td2742244.html
I would expect with the ComplexPhraseQueryParser you could do a query like "q*"~n.
I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users will get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof'. Or vice-versa.
I was going to try and address this by creating a set of alternate match fields by manually concatenating the words from the source documents together. This seems kind of hackish, and implementing the other side (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?
Did you have a look at the Solr SpellCheckComponent?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such a case.
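To give an idea of what a word-break style checker does, here is a toy Python version that works against a plain set of known words. This is only an illustration of the idea, not the actual Solr implementation; the function names and the vocabulary are invented.

```python
def word_break_suggestions(term, vocabulary, min_part_length=2):
    """Suggest ways to split one term into two words that both exist in the vocabulary."""
    suggestions = []
    for i in range(min_part_length, len(term) - min_part_length + 1):
        left, right = term[:i], term[i:]
        if left in vocabulary and right in vocabulary:
            suggestions.append(left + " " + right)
    return suggestions


def word_combine_suggestions(words, vocabulary):
    """Suggest adjacent word pairs that exist in the vocabulary as one joined word."""
    suggestions = []
    for first, second in zip(words, words[1:]):
        joined = first + second
        if joined in vocabulary:
            suggestions.append(joined)
    return suggestions


# Hypothetical vocabulary standing in for the terms of an index.
vocabulary = {"fool", "proof", "foolproof", "bluetooth"}
print(word_break_suggestions("foolproof", vocabulary))
print(word_combine_suggestions(["blue", "tooth"], vocabulary))
```

The real component draws its vocabulary from the index, so it covers both directions (splitting and combining) without maintaining extra concatenated fields by hand.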
Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.
For specific documents you can use QueryElevationComponent, which uses a special XML file in which you list the queries for which you want specific doc ids to be elevated.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it at Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query, you can define which values to search for and how much weight each of them carries in the search.
This can be done in many ways:
Setting the boost
The boost can be set by appending "^" and a boost factor to a term, for example printer^2.
Using the plus operator
If you put the + operator in front of a term, that term becomes required, so only documents containing it are returned.
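Since you control the final query, the boosts can be added in application code before the query is sent to Solr. A minimal sketch, assuming you keep your own table of important terms (the helper name and the boost values are made up):

```python
def boost_terms(question, boosts):
    """Append ^<factor> to every word that appears in the boosts table."""
    parts = []
    for word in question.split():
        key = word.lower()
        if key in boosts:
            parts.append("%s^%s" % (key, boosts[key]))
        else:
            parts.append(key)
    return " ".join(parts)


# Hypothetical table of globally important terms.
important = {"printer": 3, "paper": 3}
print(boost_terms("This morning my printer ran out of paper", important))
```

This sketch does not escape Lucene special characters in user input, so real input would need that handled first.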
For a better understanding of Solr, it is best to get familiar with the Lucene query syntax. Refer to this link to get more info.
I have indexed a site with Solr. It works very well if stemming is not enabled. With stemming, however, Solr does not return any hits when searching for the root of a word. I use Swedish stemming.
For example, searching for support gives hits if not using stemming. Using stemming, searching for support gives no hits. Though, searching for supporten returns hits that match support.
By debugging the query, I can see that it stems the word support to suppor (which is incorrect, by the way, but that should not matter). However, even with the word stemmed to suppor, I want it to search for matches with the original query word as well.
I'd appreciate any help on this!
AFAIK, there is no way to keep the original word when stemming...
I assume that you are using solr.SnowballPorterFilterFactory. The Snowball algorithm is too aggressive.
You should try a Hunspell stemmer or maybe solr.SwedishLightStemFilterFactory.
A workaround is to reformat your query into "support support*" or "support support~". In Lucene syntax, * is wildcard matching and ~ is fuzzy matching. I know you didn't mention a need for wildcard or fuzzy search, but I found that stemming is not applied to wildcard or fuzzy terms at query time, so "support" is preserved in the second term. Stemming is still applied to the first word, so results for both forms will be returned, if any. As an added benefit, fuzzy search makes the query more tolerant of typos in users' queries.
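The rewrite can be done in application code before the query reaches Solr. A small sketch for single-word queries (the function name and the mode argument are just for illustration):

```python
def stem_safe_query(word, mode="wildcard"):
    """Return '<word> <word>*' or '<word> <word>~' so one copy of the word escapes stemming."""
    suffix = {"wildcard": "*", "fuzzy": "~"}[mode]
    return "%s %s%s" % (word, word, suffix)


print(stem_safe_query("support"))
print(stem_safe_query("support", mode="fuzzy"))
```

Whether the two terms are combined with OR or AND depends on the default operator configured for your request handler, so that is worth checking before relying on this.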