I am implementing solr fuzzy search using complex phrase query.
But i am phasing a weired case:
q={!complexphrase}name:"woo~1 grou~2" return "wood group" as a result.
q={!complexphrase}name:"woo~1 gro~2" does not return "wood group".
althouth distance between gro and group is 2!
searching for this query:
q={!complexphrase}name:"Anderso~1 Interes~2" returns 'Anderson Interests'.
duistance between Interes and Interests is same as gro and group!!!
any idea whats the reason?
I believe you are running into a problem with query rewrites.
Any multi-term query (fuzzy queries, prefix queries, etc.) gets expanded, in Lucene, into the exact terms that it matches. There is a maximum to the number of terms that can be generated this way though, so when rewriting the query, it will just try to pick the best within that limit. I suspect there are just too many matches for gro~2.
Perhaps you'll find it odd that there are so many matches that it can't incorporate all of them into the query. It looks like you are trying to search for words begining with gro, and with up to two more letters tacked onto the end. How many could there be? But that isn't what you're searching for. Fuzzy queries are based on levenshtein distance. The matches for that term include:
g__ -- Three-letter words beginning with g
_r_ -- Three-letter words with an r in the middle
__o -- Three-letter words with an o on the end
gr__ -- Any four-letter word beginning with gr
etc.
In short, it could match a massive list of terms, and in terms of similarity algorithm, "arm" and "cron" match just as well as "group".
If you really just want to match terms that start with "gro", use a prefix query instead: "woo* gro*".
If you want to actually search with a fuzzy query, including the list of possible matches seen above, you can enlarge the MaxBooleanClauses, in your solrconfig's query section.
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
Related
I'm looking into the possibility of de-boosting a set of documents during
query time. In my application, when I search for e.g. "preferences", I want
to de-boost content tagged with ContentGroup:"Developer" or in other words,
push those content back in the order. Here's the catch. I've the following
weights on query fields and boost query on source
qf=text^6 title^15 IndexTerm^8
As you can see, title has a higher weight.
Now, a bunch of content tagged with ContentGroup:"Developer" consists of a
title like "Preferences.material" or "Preferences Property" or
"Preferences.graphics". The boost on title pushes these documents at the
top.
What I'm looking is to see if there's a way to deboost all documents that are
tagged with ContentGroup:"Developer" irrespective of the term occurrence is
text or title. I tried something like, but didn't make any difference.
Source:simplecontent^10 Source:Help^20 (-ContentGroup-local:("Developer"))^99
I'm using edismax query parser.
Any pointers will be appreciated.
Thanks,
Shamik
You're onto something with your last attempt, but you have to start with *:*, so that you actually have something to subtract the documents from. The resulting set of documents (those not matching your query) can then be boosted.
From the Solr Relevancy FAQ
How do I give a negative (or very low) boost to documents that match a query?
True negative boosts are not supported, but you can use a very "low" numeric boost value on query clauses. In general the problem that confuses people is that a "low" boost is still a boost, it can only improve the score of documents that match. For example, if you want to find all docs matching "foo" or "bar" but penalize the scores of documents matching "xxx" you might be tempted to try...
q = foo^100 bar^100 xxx^0.00001 # NOT WHAT YOU WANT
...but this will still help a document matching all three clauses score higher then a document matching only the first two. One way to fake a "negative boost" is to give a large boost to everything that does not match. For example...
q = foo^100 bar^100 (*:* -xxx)^999
NOTE: When using (e)dismax, people sometimes expect that specifying a pure negative query with a large boost in the "bq" param will work (since Solr automatically makes top level purely negative positive queries by adding an implicit ":" -- but this doesn't work with "bq", because of how queries specified via "bq" are added directly to the main query. You need to be explicit...
?defType=dismax&q=foo bar&bq=(*:* -xxx)^999
What I want to achieve is searching for 'deodorant spray' matches 'antiperspirant spray', 'deo spray' etc.
I'm using a SynonymFilterFactory to add synonyms at index time for deodorant, deo and antiperspirant. I can see this working correctly in the analyzer.
After this I'm running a ShingleFilterFactory (maxShingleSize="3") to split into combinations of words. This, again gives me the correct result, e.g. analysing 'test shingle phrase' gives:
test
test shingle
test shingle phrase
shingle
shingle phrase
phrase
Which is the desired result. The problem comes when I combine synonym terms with shingles. For example, searching for 'deodorant spray' should give me:
deodorant spray
deo spray
antiperspirant spray
for all my synonyms. But what I actually see is:
deodorant
deodorant deo
deodorant deo antiperspirant
deo
deo antiperspirant
deo antiperspirant spray
antiperspirant
antiperspirant spray
Which clearly is making shingles from each of the synonym terms too. I've tried swapping the order of my filter factories but can't seem to get it to work. What am I doing wrong?
The only thing you can do is to use synonym filter without expanding - the one that reduces all synonyms to the first in the list. Then you have to use it at index time, as well as at query time.
Such approach would not cause problem described in the documentation, since you have to apply the filter also on the index.
Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Televesion and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score then docs that match "TV" comparably -- which may be somewhat counter intuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
However, you might still run into problems if you want to support multi-word synonyms as described in the documentation.
I do not know if shingles consisting of synonyms will affect search results anyhow, but if not, then only what it costs you is extra space in the index, so consider if it is something you want to save on.
I have a series of docs containing a nickname ( even with spaces ) and an ID.
The nickname can be like ["example","nick n4me", "nosp4ces","A fancy guy"].
I have to find a query that allow me to find profiles by a perfect matching, a fuzzy, or event with partial character.
So if a write down "nick" or "nick name" or "nick name", the document "nickname" has always to come out.
I tried with something like:
nickname:(%1%^4 %1%~^3 %1%*^1)
where "%1%" is what I'm searching, but it doesn't work, especially for spaces or numbers nicknames. For example if i try to search "nick n" the query would be:
nickname:(nick n^4 nick n~^3 nick n*^1)
Boosting with ^ will only affect the scoring and not the matching, i.e. if your query does not match at all, boosting the terms or not won't make any difference.
In your specific example, the query won't match because:
1) nick n won't match because that would require that either the token nick or n have been tokenized;
2) EDIT: I found out that fuzzy queries work only on single terms, if you use the standard query parser. In your case, you should probably rewrite nick n~ using ComplexPhraseQueryParser, so you can do a fuzzy query on the whole PhraseQuery. Also, you can specify a threshold for your fuzzy query (technically, you are specifying a minimum Levenshtein distance). Obviously you have to adjust the threshold, and that usually requires some trial and error.
An easier tactic is to load all nicknames into one field -- in your example you would have 4 values for your nickname field. If you want embedded spaces in your nicknames, you will need to use a simpler analyzer than StandardAnalyzer or use phrase searches.
I've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search and it will be rare enough to call it "never" that any two searches will have the same set of FLRIDs to limit on.
What we're doing right now is, roughly:
q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))
Each of those FQs parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limitations on the number of terms that can be ORed together.
The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.
How can we do this better?
Things we've tried or considered:
Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
What we're hoping for:
An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching.
A way to create a match vector that gets passed to the query, because strings of fqs in the query seems to be a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
solr search within subset defined by list of keys
Searching within a subset of data - Solr
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming, because my first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory. .
This will create indexes for ngrams or parts of words. Documents, with a min ngram size of 5 and max ngram size of 8, would index: Docum Docume Document Documents
There is a bit of a tradeoff for index size and time. One of the Solr books quotes as a rough guide: Indexing takes 10 times longer Uses 5 times more disk space Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries. As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution