Solr fullname search - solr

I'm trying to set up a full-name search in Solr. I thought it was working until I noticed something strange, and I can't figure out how to correct it.
I want to be able to search on full names. My index is built from a database: I take the first name and last name and put them in one multivalued field that uses the keyword tokenizer.
Here's my field type:
<fieldType name="text_auto" class="solr.TextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Everything works fine: I can search on only a first name or a last name and get the matching full names, and it also works for full names in any order as long as there is no misspelling.
But I just noticed a problem. For example, if I search for Dupont dupont, I get every Dupont that exists, even those whose first name doesn't match dupont. I guess it's because dup is found a second time in the full name... The problem is that if someone searches for "dupont d", they'll find every Dupont, because "d" is contained in Dupont. That's not what I want: I want to find every Dupont with a d in their first name (the other string).
So I need a way to make this work. I tried many different tokenizers and filters, but I'm afraid it's not possible...
Thank you for any help you can provide!

Sounds like you are searching with something like:
q=dupont d
Which will have no problem with finding the terms in any order, or even as the same term in the index, in the case of dupont dupont (I'm assuming, by the way, that you are setting the default operator to AND, since this sort of behavior is surprising). If you want to find the phrase "dupont d" in that order, you should search with a quoted phrase query:
q="dupont d"
or for dupont dupont
q="dupont dupont"
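If the search box feeds user input straight into q, the quoting has to happen on the client side. Here is a minimal sketch of that step. The helper name phrase_query is hypothetical, and it assumes that only backslashes and double quotes need escaping inside a phrase:

```python
from urllib.parse import urlencode


def phrase_query(user_input: str) -> str:
    """Wrap raw user input in double quotes so it is sent as one phrase."""
    # Escape backslashes first, then embedded double quotes.
    escaped = user_input.strip().replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# URL-encode the quoted phrase as the q parameter of a select request.
params = urlencode({"q": phrase_query("dupont d")})
print(params)
```

The encoded string can then be appended to whatever select URL your core exposes.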

Related

Solr special character search on person name

I am a newbie to Solr. I have indexed people's names in my collection.
When searching for a name like Àlvarez Rubén, I am unable to retrieve the results. I tried escaping using / but I didn't get the correct result. Please help.
Use mapping-ISOLatin1Accent.txt before tokenizing your incoming values.
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
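To see why mapping characters before tokenizing helps, here is a rough Python sketch of the idea. The ACCENT_MAP table is a tiny hypothetical excerpt, not the contents of the real mapping file, and the lowercase-and-split step only stands in for whatever tokenizer and filters the field actually uses:

```python
# Hypothetical excerpt of an accent mapping; the real file covers far more characters.
ACCENT_MAP = str.maketrans({
    "À": "A", "Á": "A", "É": "E",
    "à": "a", "á": "a", "é": "e", "ñ": "n",
})


def char_filter(text: str) -> str:
    """Replace accented characters before any tokenizing happens."""
    return text.translate(ACCENT_MAP)


def analyze(text: str) -> list[str]:
    """Char filter first, then a stand-in lowercase + whitespace tokenizer."""
    return char_filter(text).lower().split()


print(analyze("Àlvarez Rubén"))
print(analyze("Alvarez Ruben"))
```

Because the same mapping runs at index time and at query time, the accented and unaccented spellings end up as the same tokens.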

Solr search index on different tokens of a sentence

Folks,
We want to search in Solr in a way that gives priority to partial matches of a sentence.
For example:
The sentence is "Have wonderful evening today here".
If the user supplies "today here", it should match.
If the user supplies "wonderful evening", it should match.
If the user supplies "Have wonderful", it should match.
We want to give plain keyword matches a lower priority than the matches above.
A keyword match could be "today", "wonderful", "evening", etc.
Is there any way to achieve this in Solr, given that Solr works on an inverted index of the words in a sentence?
You can use a separate field with a ShingleFilter defined - this will combine runs of adjacent tokens into single tokens, so that "Have wonderful evening today here" can be indexed as "have wonderful", "wonderful evening", "evening today" and "today here".
Make hits in this field a higher priority than hits in your regular search field by using qf=shinglefield^<boostvalue> - what the exact boost value needs to be depends on the scoring profile of your index and if you're doing other boosts.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
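What that analyzer chain produces can be approximated in a few lines of Python. This is only a sketch of the whitespace, lowercase and two-word shingle steps, not Solr's implementation, and the function name is made up for illustration:

```python
def bigram_shingles(text: str) -> list[str]:
    """Lowercase, split on whitespace, and join each pair of adjacent tokens."""
    tokens = text.lower().split()
    # With unigrams switched off, only the two-word shingles are emitted.
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


indexed = bigram_shingles("Have wonderful evening today here")
print(indexed)

# A two-word query becomes one shingle that is present in the index;
# a single keyword produces no shingle and has to match in the regular field.
print(bigram_shingles("today here")[0] in indexed)
print(bigram_shingles("today"))
```

This is why hits in the shingle field are a good signal for "adjacent words matched", which the boost can then reward.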

Solr - search word immediately followed by partial match (with wildcard)

I have a Solr index filled with documents, with a field named issuer.
There is a document with issuer=first issuer.
I'm trying to implement matching of two consecutive words. The first word needs to match completely, the second needs to match partially.
What I am trying to achieve is:
I search for something like: issuer:first\ iss*
I expect it to match "first issuer"
I tried the following solutions but none is working:
issuer:first\ iss* -> returns nothing
issuer:"first iss"* -> returns everything
issuer:(first iss*) -> also returns "issuer first"
Does anybody have a clue on how to achieve the desired result?
My suggestion is to add a field type based on the shingle filter to your schema. Below is a simple definition:
<fieldType name="shingle" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
</fieldType>
You then add another field with this type as shown below:
<field name="issuer_sh" type="shingle" indexed="true" stored="false"/>
At query time, you can issue the following query:
issuer_sh:"first iss*"
The shingleFilter creates n-gram tokens from your text. For instance, if the issuer field contains "first issue", then Solr will create and index the following tokens:
first
issue
first issue
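The reason shingles help here is that "first issuer" becomes a single indexed term, so a prefix match can run against it. A small Python sketch of that idea, assuming the query ends up as a prefix on one term (how a given query parser gets there is not covered here) and using made-up function names:

```python
def shingle_terms(text: str, min_size: int = 2, max_size: int = 5) -> list[str]:
    """Single words plus word shingles of min_size..max_size tokens."""
    tokens = text.split()
    terms = list(tokens)  # the single words are kept alongside the shingles
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            terms.append(" ".join(tokens[start:start + size]))
    return terms


def prefix_matches(text: str, prefix: str) -> bool:
    """True if any indexed term starts with the given prefix."""
    return any(term.startswith(prefix) for term in shingle_terms(text))


print(shingle_terms("first issue"))
print(prefix_matches("first issuer", "first iss"))
print(prefix_matches("issuer first", "first iss"))
```

The reversed document "issuer first" is not matched, because none of its terms starts with "first iss".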
You can't search with wildcards in phrase queries. Without changing how you are indexing (see @ameertawfik's answer), the standard query parser doesn't provide a good way to do this. You can, however, use the surround query parser to search using spans. This query would then look like:
1N(first, iss*)
Keep in mind, surround query parser does not analyze, so 1N(first, iss*) and 1N(First, iss*) will not find the same results.
You could also construct this query using Lucene's SpanQueries directly, of course, like:
// Exact term "first", immediately followed by any term starting with "iss".
SpanQuery[] queries = new SpanQuery[2];
queries[0] = new SpanTermQuery(new Term("issuer", "first"));
queries[1] = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("issuer", "iss")));
// slop = 0 and inOrder = true: the two spans must be adjacent and in this order.
Query finalQuery = new SpanNearQuery(queries, 0, true);
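For intuition only, the condition such an ordered span query with zero slop checks can be sketched in plain Python. This assumes the indexed terms are simply the whitespace-separated tokens of the field value, and the function name is hypothetical:

```python
def span_near_prefix(field_value: str, first: str, prefix: str) -> bool:
    """True if `first` is immediately followed by a token starting with `prefix`."""
    tokens = field_value.split()
    # Compare each token with its right-hand neighbour, in order, with no gap.
    return any(a == first and b.startswith(prefix)
               for a, b in zip(tokens, tokens[1:]))


print(span_near_prefix("first issuer", "first", "iss"))
print(span_near_prefix("issuer first", "first", "iss"))
# No analysis is applied to the query terms, so case differences do not match.
print(span_near_prefix("first issuer", "First", "iss"))
```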

Solr: Localities & solr.ICUCollationField usage?

I'm learning Solr and have become confused trying to figure out ICUCollation: what it does, what it is for, and how to use it. I haven't found any good explanation of this online. The docs appear to say that I need to use ICUCollation and imply that it does magical things for me, but they do not explain exactly why or what, or how it integrates with anything else.
Say I have a text field in French and I want stopwords removed, accents, punctuation and case ignored and stemming... how does ICUCollation come into this? Do I set solr.ICUCollationField and locale='fr' and it will do everything else automatically? Or do I set solr.ICUCollationField and then tokenizer and filters on this in addition? Or do I not use solr.ICUCollationField at all because that's for something completely different? And if so, then what?
Collation is the organisation of written information into an order - ICUCollationField (the API documentation also provides a good description) is meant to enable you to provide locale-aware sorting, as the sort order is defined by cultural norms and specific language properties. This is useful to allow different sorting based on those rules, such as the difference between Norwegian and Swedish, where a Swede would order Å before Æ/Ä and Ø/Ö, while a Norwegian would order them Æ/Ä, Ø/Ö and then Å.
Since you usually don't want to sort by a tokenized field (exception: KeywordTokenizer) or a multivalued field, these fields are usually not processed any more than allowing for the sorting / collation to be performed.
There is a case to be made for collation filters for searching as well, as search in practice is just comparison. This means that if you're aiming to search for two words that would be identical when compared in the locale provided, it would be a hit. The tokens indexed will not make any sense when inspected, but as long as the values are reduced to the same token both when indexing and searching, it would work. There's an example of this on the wiki under UnicodeCollation.
Collation does not affect stopwords (StopFilterFactory), accents (ICUFoldingFilterFactory), punctuation, case (LowerCaseFilterFactory or ICUNormalizer2FilterFactory; this depends on the locale - if the locale for sorting is case aware, then it does not) or stemming (SnowballPorterFilterFactory). Have a look at the suggested filters for that. Most filters and tokenizers in Solr do very specific tasks and avoid doing "everything and the kitchen sink" in a single filter.
You normally have two or more fields for one text input if you want to do different things like:
search: text analysis
sort: language sensitive / case insensitive sorting
facet: string
For search use something like:
<fieldType name="textFR" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
For sorting use:
<fieldType name="textSortFR" class="solr.ICUCollationField"
locale="fr"
strength="primary" />
or simply:
<fieldType name="textSort" class="solr.ICUCollationField"
locale=""
strength="primary" />
(Use this if you have to support many languages. It should work well enough in most cases.)
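To get a feel for what a primary-strength sort field buys you, here is a rough Python approximation that ignores accents and case when building the sort key. It is not ICU and it knows nothing about locale-specific rules such as the Norwegian and Swedish orderings; the function name is made up for illustration:

```python
import unicodedata


def primary_key(value: str) -> str:
    """Approximate a primary-strength key: drop accents, ignore case."""
    decomposed = unicodedata.normalize("NFD", value)
    # Combining marks carry the accents; removing them leaves the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


names = ["Zola", "émile", "Eric", "alain"]
print(sorted(names))                   # plain code-point order
print(sorted(names, key=primary_key))  # accent- and case-insensitive order
```

The second ordering is the kind of result users usually expect from a sorted name list, which is why the sort field is kept separate from the search field.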
Do make use of the Analysis UI in the Solr Admin: open the analysis view for your index, select the field type (e.g. your sort field), add a representative input value in the left text area and a test value in the right field (in the case of sorting, the right-hand value is less interesting, as the sort field is not used for matching).
The output will show you whether:
accents are removed
elisions are removed
lower casing is applied
etc.
For example, if you see that elisions (l'atelier) are not removed (atelier) but you would like to discard them for sorting, you would have to add the elision filter (see the example for the search field type above).
https://cwiki.apache.org/confluence/display/solr/Language+Analysis

SOLR eDISMAX product search

I'm new to SOLR and am implementing it to search our product catalog. I'm creating ngrams and edge ngrams on the brand name, display name and category fields.
I'm using edismax and have qf defined as displayname_nge displayname_ng category_nge category_ng brandname_nge brandname_ng.
When I search for 'vitamin c' (without the quotes) I get all of the vitamins. If I surround it with quotes then I only get vitamin c. The problem is that I can't always surround the query string with quotes because a person might enter 'chewable vitamin c', or 'vendor x vitamin c'. I've tried the mm parameter without luck. I've also tried applying different boost levels and still not getting the expected results.
Any suggestions would be greatly appreciated. Thank you
Was there a reason for using only ngrams fields for searching? I'm not sure this is the problem in your case, but you may want to look at your ngrams analysis configuration in schema.xml. One from one of my indexes looks like this:
<fieldType name="ngram" class="solr.TextField" >
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>
Though you can see this is actually using the safer EdgeNGramFilterFactory, the important thing to note here is minGramSize="2". This means that during the indexing process only grams of at least two characters will be created. The word 'c'? That doesn't get any grams at all. While you could set minGramSize="1" and rebuild your index, single character grams are a very bad idea, as your search for 'c' would match against any document with a word that starts with 'c' (or contains the letter 'c' with NGramFilterFactory).
If you're currently using NGrams with minGramSize="2", a search for 'ca' would find any documents with any words containing the letters 'ca' consecutively in that order. This may not be exactly what you want, either.
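The difference between the two gram types, and what minGramSize="2" does to a one-letter word, is easy to see in a small Python sketch. These helpers only imitate the gram generation and are not Solr code:

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 15) -> list[str]:
    """Prefixes of the token, from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def ngrams(token: str, min_gram: int = 2, max_gram: int = 15) -> list[str]:
    """Every substring of the token with a length in min_gram..max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


print(edge_ngrams("vitamin"))       # prefixes only
print(edge_ngrams("c"))             # too short: nothing is indexed for 'c'
print("ca" in edge_ngrams("scale"))  # edge grams: 'ca' is not a prefix
print("ca" in ngrams("scale"))       # full grams: 'ca' occurs inside the word
```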
My top suggestion would be to drop the ngrams in favor of a more vanilla Text field. Whether you want to keep the edge-ngrams around for better truncation support is up to you, but I suspect you'll have better luck if the Text field is at least in the mix.
You could also take a look at this question on StackOverflow: "Can I protect short words from an n-gram filter in Solr?" if you want to pursue the ngrams further.
Also, you should consider using Solr's built-in analysis tool to figure out where your searches are failing. You choose a field or fieldType, and provide values for what was entered into the index and what is being searched. It will show you how the analysis works against both values so you can see how each string is broken down and why it does or doesn't create matching tokens. The URL for the tool depends on whether you're in a multi-core environment, but if you go to Solr's web interface you should be able to find the Analysis link on the left.
Update:
Now that I have a little more detail from you and am thinking about it again, the results you're getting are very explainable.
With minGramSize="1", your unquoted search for 'vitamin c' is looking for records with the word 'vitamin' (or a longer word containing 'vitamin'), and the word 'c' (or a longer word containing 'c'). Since most records are likely to have a 'c' somewhere, this is hardly a limiting factor and your results will be very close to or exactly the same as your results for just the word 'vitamin'.
In the quoted search for 'vitamin c', the 'c' now has to appear in a word immediately following vitamin, making it a much more useful search, but still not great. You should be able to test this by finding records that have a word following vitamin that isn't a vitamin designation. For example, a record mentioning "vitamin tablets" should be found when searching for "vitamin b" (because there's a 'b' in "tablets"), and a record mentioning "vitamin chart" or "vitamin deficiency" should be found when searching for "vitamin c".
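Those predictions can be checked with a toy model. The sketch below assumes full n-grams with minGramSize="1" at index time, a maximum gram size at least as long as each query word, and a query that is not n-grammed; under those assumptions a query word matches an indexed word whenever it occurs inside it. The function name is hypothetical:

```python
def phrase_hits(document: str, phrase: str) -> bool:
    """Phrase match where each query word only needs to occur inside a document word."""
    doc_words = document.lower().split()
    query_words = phrase.lower().split()
    span = len(query_words)
    for start in range(len(doc_words) - span + 1):
        window = doc_words[start:start + span]
        # With 1-character grams, containment is enough for a term to match.
        if all(q in word for q, word in zip(query_words, window)):
            return True
    return False


print(phrase_hits("vitamin tablets", "vitamin b"))
print(phrase_hits("vitamin chart", "vitamin c"))
print(phrase_hits("vitamin e", "vitamin c"))
```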
The upshot of this is that I strongly recommend having a set of fields for searching separate from your fields for autocomplete. The NGrams with minGramSize="1" are just not going to give you reasonable results for the actual search step.
Another option is to use the edismax 'mm' parameter, where you can give a minimum-match percentage. If you give 100%, every term has to match, so the matching is precise; 75% will give you the list of all vitamins. You can handle the percentage programmatically according to your needs.
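As a sketch of how the percentage plays out, assuming (as I understand the documented behaviour) that a positive percentage of the optional clauses is rounded down, and using hypothetical helper names:

```python
def required_clauses(optional_clauses: int, mm_percent: int) -> int:
    """Number of query terms that must match for a positive mm percentage."""
    # Integer division models the rounding down of the computed percentage.
    return optional_clauses * mm_percent // 100


def edismax_params(query: str, mm_percent: int) -> dict[str, str]:
    """Request parameters for an edismax search with the given mm value."""
    return {"defType": "edismax", "q": query, "mm": f"{mm_percent}%"}


# 'vitamin c' has two terms: 100% needs both, 75% rounds down to one.
print(required_clauses(2, 100))
print(required_clauses(2, 75))
print(edismax_params("vitamin c", 100))
```

With two terms, 75% only requires one of them, which is why that setting returns every vitamin.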
You may consider replacing the query keywords this way: "'vitamin c' vitamin c". In that case, records matching 'vitamin c' can get a higher score than those matching 'vitamin' and 'c' separately. Your search results will still return all matching records. Please see if this helps, and feel free to comment.
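A minimal sketch of that rewrite, assuming the phrase part is sent with double quotes and that the clauses stay optional so the phrase only adds to the score. The function name is made up for illustration:

```python
def phrase_plus_terms(user_query: str) -> str:
    """Return the input as a quoted phrase followed by its individual terms."""
    # Drop any quotes the user typed and collapse repeated whitespace.
    cleaned = " ".join(user_query.replace('"', " ").split())
    return f'"{cleaned}" {cleaned}'


print(phrase_plus_terms("vitamin c"))
print(phrase_plus_terms("  chewable  vitamin c "))
```

The same rewrite works for longer inputs such as 'chewable vitamin c', since the person typing never has to add the quotes.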
