How to ditch single match results in solr query? - solr

I'm a newbie to solr.
Here is my problem:
When I search 天天爱消除,my tokenizer will split it to: 天天 & 爱 & 消除.
I use disMax mode, and the query goes like this:
http://search.dev/solr/collection1/select?q=%E7%88%B1%E6%B6%88%E9%99%A4&defType=dismax&qf=name&wt=json&indent=true&omitHeader=true&rows=500&hl=true&hl.fl=name&hl.simple.pre=<em>&hl.simple.post=<%2Fem>&sort=score+desc
This would match results like this: 我爱记单词 & 我爱读书. For me, the relevances of them are to low, so I don't want to show them in the results.
In conclusion, I want to ditch the results that match one term and the term length is 1;
Hope to get some help. Thanks.
update:
Need I to set a custom Collector?

If you are talking about having at least 2 term matches from your query , I do not think you need a special collector for this.
You need to use &mm=2 with the rest of your URL/query, here is the documentation with example => http://wiki.apache.org/solr/DisMaxQParserPlugin

Related

How to save value with wildcard in Solr?

all.
I have the following trouble with Solr. I need to implement "reverse" search with wildcards. I mean I want to keep value like "auto*" and this item should be found with request like "autocar", "autoplan" or "automate". Could someone help me with this, please? Thanks.
If you want to match shorter indexed value (auto) against longer searched value (autobus), you want a custom analysis chain that includes EdgeNGramFilter on the query side only. Then, the incoming search word will get split into possible prefixes and matched against the indexed term.

Using solr shingle filter at query time

I am trying to build a field in my Solr Schema which will be able to join words together at query time and then search for this new joined word in the index.
Lets say I have the word "bluetooth" in my index and I want this to come up in results when I search "blue tooth".
So far I have been unsuccessful in trying varying combinations of shinglefilterfactory and positionfilterfactory as well as keyword, standard and whitespace tokenizers.
I'm hoping someone might be able to point me in the right direction to solve this!
Your goal is looking obscure to me and strange a little bit. But for your specific use-case the following filter can be used:
"solr.PatternReplaceCharFilterFactory"
"pattern"="[\\W]"
"replacement"=""
It will make "blue tooth" to be replaced into "bluetooth". And also you can specify that field-analysis for query-time only.
But let me tell you that usually tokenization is used instead of concatenation. And let me also offer you the following filter - WordDelimiterFilter. In such case this guy can split "BlueTooth" into "blue" and "tooth" based on cases.

SOLR / Lucene MultiFieldQueryParser

I wish to query a Lucene index and ask the question "..does the string ABC occur in Field A AND string DEF in Field B ..."
BOTH conditions (ABC in Field A and DEF in Field B) must be true ....I've fooled around
with a few searches and don't seem to be hit the proper combination.
Any ideas / examples ...seems that the MultiFieldQueryParser may be the answer but I've had no luck so far.
The standard query parser supports this sort of query, like:
+fielda:ABC +fieldb:DEF
The + character is the required operator, so this query will require a match on both fielda:ABC and fieldb:XYZ.
See the query parser syntax documentation, for more information.
MultiFieldQueryParser is used to automatically search for the same content in multiple fields, so not quite what you are looking for.
Turns out on a SOLR browser search, the q.OP=AND on the URL will provide the ANDING condition I was looking for.

Terms Prevalence in SolR searches

Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.
For specific documents you can use QueryElevationComponent, which uses special XML file in which you place your specific terms for which you want specific doc ids.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it at Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query you can define what are the values and how much these fields have weight on the search.
This can be done in many ways:
Setting the boost
The boost can be set by using "^ "
Using plus operator
If you define + operator in your query, if there is a exact result for that filed value it is shown in the result.
For a better understanding of solr, it is best to get familiar with lucene query syntax. Refer to this link to get more info.

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming, because my first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory. .
This will create indexes for ngrams or parts of words. Documents, with a min ngram size of 5 and max ngram size of 8, would index: Docum Docume Document Documents
There is a bit of a tradeoff for index size and time. One of the Solr books quotes as a rough guide: Indexing takes 10 times longer Uses 5 times more disk space Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries. As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution

Resources