Lucene and Nickname matching - solr

I have a series of docs, each containing a nickname (possibly with spaces) and an ID.
The nickname can be like ["example","nick n4me", "nosp4ces","A fancy guy"].
I need a query that lets me find profiles by an exact match, a fuzzy match, or even a partial string.
So if I write "nick" or "nick name" or "nick name", the document "nickname" must always come out.
I tried with something like:
nickname:(%1%^4 %1%~^3 %1%*^1)
where "%1%" is what I'm searching for, but it doesn't work, especially for nicknames containing spaces or numbers. For example, if I try to search for "nick n" the query would be:
nickname:(nick n^4 nick n~^3 nick n*^1)

Boosting with ^ will only affect the scoring and not the matching, i.e. if your query does not match at all, boosting the terms or not won't make any difference.
In your specific example, the query won't match because:
1) nick n won't match, because that would require that either the token nick or the token n was produced when the field was tokenized at index time;
2) EDIT: I found out that fuzzy queries work only on single terms if you use the standard query parser. In your case, you should probably rewrite nick n~ using ComplexPhraseQueryParser, so you can do a fuzzy query on the whole PhraseQuery. You can also specify a threshold for your fuzzy query (technically a minimum similarity in older Lucene versions, or a maximum edit distance in newer ones). You will have to tune the threshold, and that usually requires some trial and error.

An easier tactic is to load all nicknames into one field -- in your example you would have 4 values for your nickname field. If you want embedded spaces in your nicknames, you will need to use a simpler analyzer than StandardAnalyzer or use phrase searches.
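As a rough sketch of the "simpler analyzer" route: if the nickname field is indexed as a single untokenized term (an assumption, for instance via a keyword-style tokenizer), the space in the user input has to be backslash-escaped so the query parser does not split it into separate clauses. The helper names below are hypothetical, and the escaped character set follows the classic Lucene query syntax.

```python
import re

# Characters with special meaning in the classic Lucene query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')


def escape_term(text):
    """Backslash-escape query syntax characters and whitespace."""
    return _SPECIAL.sub(r"\\\1", text)


def nickname_query(user_input, field="nickname"):
    """Build exact (boost 4), fuzzy (boost 3) and prefix (boost 1) clauses.

    Hypothetical helper: assumes the field holds the whole nickname as one term.
    """
    term = escape_term(user_input.strip())
    return f"{field}:({term}^4 {term}~^3 {term}*^1)"


print(nickname_query("nick n"))
```

For the input "nick n" this produces nickname:(nick\ n^4 nick\ n~^3 nick\ n*^1). Fuzzy and wildcard terms may bypass parts of the analysis chain (lowercasing, for example), so the input may need to be normalized the same way the field was indexed.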

Related

Solr, Ignore wildcard query for certain field

I'd like to query a certain 2 fields in solr. Let's say I have "description" and "keywords". Now I want to search for "dogs" or "cats" by doing this:
q=dog* OR cat*
I'm also passing the fields to be searched:
qf=description^1 keywords^1
So far so good. Now I want "description" to ignore the wildcards so that the search performs better. Is there any way to do this in the fieldTypes or in the query itself?
Yes. Well, not exactly that, but you can get the same functionality while also gaining performance:
Use different analysis for description and keywords. In keywords, use an EdgeNGramFilterFactory. This gives you the same functionality as the *, but with much better performance (at the expense of a bigger index, but it is worth it!).
In description, just don't use the ngram filter, and partial matches will not be found.

Solr Fuzzy Search Weird Case

I am implementing Solr fuzzy search using the complex phrase query parser.
But I am facing a weird case:
q={!complexphrase}name:"woo~1 grou~2" returns "wood group" as a result.
q={!complexphrase}name:"woo~1 gro~2" does not return "wood group",
although the distance between gro and group is 2!
Searching with this query:
q={!complexphrase}name:"Anderso~1 Interes~2" returns 'Anderson Interests'.
The distance between Interes and Interests is the same as between gro and group!
Any idea what the reason is?
I believe you are running into a problem with query rewrites.
Any multi-term query (fuzzy queries, prefix queries, etc.) gets expanded, in Lucene, into the exact terms that it matches. There is a maximum to the number of terms that can be generated this way though, so when rewriting the query, it will just try to pick the best within that limit. I suspect there are just too many matches for gro~2.
Perhaps you'll find it odd that there are so many matches that it can't incorporate all of them into the query. It looks like you are trying to search for words beginning with gro, with up to two more letters tacked onto the end. How many could there be? But that isn't what you're searching for. Fuzzy queries are based on Levenshtein distance. The matches for that term include:
g__ -- Three-letter words beginning with g
_r_ -- Three-letter words with an r in the middle
__o -- Three-letter words with an o on the end
gr__ -- Any four-letter word beginning with gr
etc.
In short, it could match a massive list of terms, and in terms of similarity algorithm, "arm" and "cron" match just as well as "group".
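The claim that "arm" and "cron" are as close to "gro" as "group" can be checked with a textbook Levenshtein implementation. This is a minimal sketch of the plain edit distance, not Lucene's own implementation, which may differ in details.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


for candidate in ("group", "arm", "cron"):
    print(candidate, levenshtein("gro", candidate))
```

All three candidates come out at distance 2 from "gro", which is why a ~2 fuzzy term expands to far more index terms than a prefix would.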
If you really just want to match terms that start with "gro", use a prefix query instead: "woo* gro*".
If you actually want to search with a fuzzy query, including the list of possible matches seen above, you can raise maxBooleanClauses in the query section of your solrconfig.xml:
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
</query>

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for documents that have words containing a word token from the search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language-specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language, we'll do stemming, which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial", but generally speaking it helps significantly with recall.
Split the search input before sending it to search and add "*" to every term. Depending on your target language this is relatively easy (e.g. if words are separated by spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search for both (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*)).
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
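The second option (split the input and search both the plain term and its prefix form) could be sketched as follows. The function name is hypothetical, and the sketch assumes whitespace-separated terms with no operators or special characters in the user input.

```python
def prefix_expand(search_text):
    """Turn 'aa bb' into '(aa | aa*) (bb | bb*)'.

    Each term is searched both as-is (so stemming still applies to it)
    and as a prefix. Whitespace splitting only; no escaping is done.
    """
    terms = search_text.split()
    return " ".join(f"({t} | {t}*)" for t in terms)


print(prefix_expand("aa bb"))
```

The result would then be sent as the value of the search parameter.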
Perhaps this page might be of interest?
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.

SOLR / Lucene MultiFieldQueryParser

I wish to query a Lucene index and ask the question "..does the string ABC occur in Field A AND string DEF in Field B ..."
BOTH conditions (ABC in Field A and DEF in Field B) must be true. I've fooled around
with a few searches and don't seem to hit the proper combination.
Any ideas / examples ...seems that the MultiFieldQueryParser may be the answer but I've had no luck so far.
The standard query parser supports this sort of query, like:
+fielda:ABC +fieldb:DEF
The + character is the required operator, so this query will require a match on both fielda:ABC and fieldb:DEF.
See the query parser syntax documentation, for more information.
MultiFieldQueryParser is used to automatically search for the same content in multiple fields, so not quite what you are looking for.
It turns out that, for a Solr search issued from the browser, adding q.op=AND to the URL provides the ANDing condition I was looking for.
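When either form is sent over HTTP, the + signs and colons need URL-encoding, otherwise a literal + in a URL is decoded as a space. A small sketch of building both parameter strings with the standard library (the helper names are hypothetical, and no particular Solr host or core is assumed):

```python
from urllib.parse import urlencode


def required_clauses_params(clauses):
    """Query string for '+fielda:ABC +fieldb:DEF' style required clauses."""
    q = " ".join(f"+{field}:{value}" for field, value in clauses.items())
    return urlencode({"q": q})


def default_and_params(clauses):
    """Same clauses without '+', relying on q.op=AND as the default operator."""
    q = " ".join(f"{field}:{value}" for field, value in clauses.items())
    return urlencode({"q": q, "q.op": "AND"})


clauses = {"fielda": "ABC", "fieldb": "DEF"}
print(required_clauses_params(clauses))
print(default_and_params(clauses))
```

The first call yields q=%2Bfielda%3AABC+%2Bfieldb%3ADEF and the second yields q=fielda%3AABC+fieldb%3ADEF&q.op=AND; either string can be appended to the select handler URL.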

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave oddly with stemming. This is because wildcard expansion works by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
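The mismatch can be reproduced with a toy model. The stemmer below is a deliberately crude stand-in (an assumption for illustration, not what any real Solr stemmer does), and the "index" is just a sorted list of stems.

```python
def toy_stem(word):
    """Toy stemmer (illustrative only): lowercase, strip a trailing 'es', 'e' or 's'."""
    w = word.lower()
    for suffix in ("es", "e", "s"):
        if w.endswith(suffix):
            return w[: -len(suffix)]
    return w


def build_term_index(words):
    """Index time: only the stemmed forms are stored."""
    return sorted({toy_stem(w) for w in words})


def expand_prefix(prefix, term_index):
    """Wildcard 'prefix*': compared against stored terms as-is, with no stemming."""
    return [t for t in term_index if t.startswith(prefix)]


terms = build_term_index(["Gatorade", "gatorades"])
print(terms)                             # stored stems
print(toy_stem("gatorade") in terms)     # plain query term is stemmed, so it matches
print(expand_prefix("gatorade", terms))  # wildcard prefix is not stemmed, no matches
print(expand_prefix("gatorad", terms))   # prefix of the stem does match
```

In this model the plain query finds the documents because it goes through the same stemmer as the indexed text, while gatorade* finds nothing because no stored term starts with "gatorade".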
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
So what I was looking for is to turn the search term 'gatorade' into 'gatorade OR gatorade*', which will give me all the matches I'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory.
This will create index entries for ngrams, i.e. parts of words. With a min ngram size of 5 and a max ngram size of 8, "Documents" would be indexed as: Docum Docume Documen Document
There is a tradeoff in index size and indexing time. One of the Solr books quotes as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries. You aren't doing a wildcard search; you are matching a search term against ngrams (parts of words).
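To make the edge n-gram idea concrete, here is a small sketch of which grams are produced for one token and how a partial search term then becomes an exact term lookup. It only models the gram generation; the function name and parameters are illustrative, not Solr's API.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Leading substrings of token with lengths from min_gram to max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


grams = edge_ngrams("Documents", 5, 8)
print(grams)

# Query time: the user's partial input is looked up as a plain term,
# with no wildcard character involved.
indexed_terms = set(grams)
print("Docume" in indexed_terms)
print("Doc" in indexed_terms)  # shorter than min_gram, so not indexed
```

For "Documents" with sizes 5 to 8 the grams are Docum, Docume, Documen and Document; a search for "Docume" hits an indexed term directly, while "Doc" does not because it is shorter than the minimum gram size.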
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution

Resources