Solr special character search on person name - solr

I am new to Solr. I have indexed people's names in my collection as below.
When searching for a name like Àlvarez Rubén, I am unable to retrieve any results. I tried escaping with / but still didn't get the correct result. Please help.

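Apply mapping-ISOLatin1Accent.txt through a MappingCharFilterFactory before tokenizing your incoming values. The charFilter goes inside the field type's <analyzer>, ahead of the <tokenizer>: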
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
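To see roughly what this kind of accent mapping does to a name, here is a minimal Python sketch. It is not Solr code: it approximates the mapping file with Unicode NFD decomposition, and fold_accents is a hypothetical helper name. The real mapping file also covers characters that decomposition alone does not handle (ligatures, for example).

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both the indexed value and the query text go through the same folding,
# so an accented and an unaccented spelling end up as the same string.
print(fold_accents("Àlvarez Rubén"))  # Alvarez Ruben
```

Because the same analyzer normally runs at index time and at query time, both Àlvarez and Alvarez would then hit the same indexed term.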

Related

DSE Search And Solr - Issues with whitespace in UDT search queries

I'm trying to get my DSE Search query working (with Solr). However, while constructing queries with user-defined types (UDTs), I'm running into issues with whitespace characters.
For example, I have a Student table and a Name type, where the Student table has a names column of type list<frozen<Name>>. The Name type has, say, firstname and lastname. If I run the query below, it throws an error:
Unable to execute CQL Script : no field name specified in query and no default specified via ‘df’ param.
SELECT * from Student where solr_query= '{!tuple}names.firstname:John Smith';
So I tried escaping the whitespace as below and it works just fine.
SELECT * from Student where solr_query= '{!tuple}names.firstname:John\ Smith';
But when I use the above UDT field with an AND operator, it fails again.
SELECT * from Student where solr_query= 'student_id:123456 AND {!tuple}names.firstname:John\ Smith';
Unable to execute CQL Script : org.apache.solr.search.SyntaxError: Cannot parse names.firstname … Lexical error at line 1, column … Encountered: after : “”
This is the field type for first name:
<fieldType class="org.apache.solr.schema.TextField" name="DelimitedTextField">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[,\s]"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
As a beginner with Solr, I've been banging my head trying to make these queries work. Any help would be deeply appreciated. Thanks!
I am no expert on the DSE system you seem to be using, but looking at this resource [1], it seems you may be building boolean queries the wrong way.
This seems to be a correct approach:
+{!tuple v='father.name.firstname:Sam'} +{!tuple v='mother.name.firstname:Anne'}
Hope it helps
[1] http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchTupleUDTqueries.html
I was able to get this working by doing a few things:
I created a new org.apache.solr.schema.TextField and added a PatternTokenizerFactory tokenizer, with a comma (,) as the pattern.
I trimmed the whitespace at the beginning and end of the value, and replaced each whitespace character within the text with '?', which matches any single character. This was OK to do in my case.
I had to wrap the entire query in parentheses ().
Hence, with the updated schema.xml file and the other changes mentioned above, I have the following query working now:
SELECT * from Student where solr_query= '(student_id:123456 AND {!tuple}names.firstname:John?Smith)';
Even though this would match John Smith, John-Smith, or even John.Smith, this was OK in my case since we were supposed to return those results anyway.
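For anyone building the query string in application code, the trimming and '?' substitution could look something like the Python sketch below. The helper names tuple_term and student_query are made up for illustration, and the sketch does not escape other Solr special characters, so treat it as a starting point only.

```python
import re


def tuple_term(field: str, value: str) -> str:
    """Trim the value and replace each inner whitespace character with '?'."""
    cleaned = re.sub(r"\s", "?", value.strip())
    return "{!tuple}" + field + ":" + cleaned


def student_query(student_id: int, first_name: str) -> str:
    """Wrap the whole boolean expression in parentheses."""
    term = tuple_term("names.firstname", first_name)
    return "(student_id:" + str(student_id) + " AND " + term + ")"


print(student_query(123456, "  John Smith "))
```

The resulting string is what goes between the single quotes of the solr_query value in the CQL statement.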

Solr query data with white space needs to be queried

I am new to Solr. I have data in Solr such as "name":"John Lewis".
Query formed looks and searches perfectly as fq=name%3A+%22John+Lewis%22
This is formed in Solr console and works well.
My requirement is to search for a term that comes from my Java layer as "JohnLewis". It has to match "John Lewis" in the Solr index.
This search is not restricted to the name field (two words with a space in between).
I have other values like "Cash Reward Credit Cards", which has four words, and the user would query for "CashRewardCreditCards".
Could someone tell me whether this can be handled in schema.xml with any of the analysis components available in Solr?
You need to create a custom fieldType.
First, define a fieldType in your Solr schema:
<fieldType name="word_concate" class="solr.TextField" indexed="true" stored="false">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*" replacement=""/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
Here the fieldType is named word_concate.
It uses a char filter, solr.PatternReplaceCharFilterFactory.
A char filter is a component that pre-processes input characters. Char filters can be chained like token filters and are placed in front of the tokenizer. PatternReplaceCharFilterFactory uses regular expressions to replace or change character patterns.
The pattern \s* means zero or more whitespace characters; since the replacement is empty, all whitespace is removed before tokenizing.
Second, create a field with word_concate as its type:
<field name="cfname" type="word_concate"/>
Then copy your name field to cfname with a copyField:
<copyField source="name" dest="cfname"/>
Third, reindex the data.
Now you can query cfname:"JohnLewis" and it will return the document whose name is John Lewis.
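If it helps to picture what ends up in the cfname field, here is a small Python simulation of the whitespace-stripping step. It is only an illustration, not Solr code: word_concate here is a plain function standing in for the char filter, and the dictionary stands in for the index.

```python
import re


def word_concate(text: str) -> str:
    """Mimic the char filter: remove every run of whitespace."""
    return re.sub(r"\s+", "", text)


# Pretend index: concatenated form -> original stored value.
names = ["John Lewis", "Cash Reward Credit Cards"]
index = {word_concate(name): name for name in names}

print(index.get("JohnLewis"))              # John Lewis
print(index.get("CashRewardCreditCards"))  # Cash Reward Credit Cards
```

Note that the field type above has no lowercase filter, so in this setup the match stays case-sensitive.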
Assuming your input is CamelCase as shown, I would use Solr's Word Delimiter Filter with the splitOnCaseChange parameter on the query side of your analyzer as a starting point. This will take an input token such as CashRewardCreditCards and generate the tokens Cash, Reward, Credit, and Cards.
See also:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
Look at WordDelimiterFilterFactory
It has a splitOnCaseChange property. If you set that to 1, JohnLewis will be indexed as John Lewis.
You'll need to add this to your query analyzer. If the user searches for JohnLewis, the search will be translated to John Lewis.
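As a rough picture of what splitting on case change means, the Python sketch below breaks a token at every lowercase-to-uppercase transition. It only imitates that one behavior; the real filter has more rules (digits, delimiters, and so on), and split_on_case_change is a hypothetical name.

```python
import re
from typing import List


def split_on_case_change(token: str) -> List[str]:
    """Split a token wherever a lowercase letter is followed by an uppercase one."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
    return spaced.split()


print(split_on_case_change("JohnLewis"))
print(split_on_case_change("CashRewardCreditCards"))
```

With the query token split this way, the individual words can match the terms that were indexed from "John Lewis".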

Query Solr accented and unaccented

I'm configuring a Solr core that stores Brazilian Portuguese data.
Regarding accents, I need queries to behave like this:
search | return
computação | computacao
computacao | computação
Basically, I need a query to return both forms of a word, whether or not the query includes the accent.
I tried:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
Without success
I'm using Solr 5.2.1
Try adding BrazilianStemFilterFactory as a filter on the field type you use for searching this field.
It is written specifically for Brazilian Portuguese.
This could solve your issue.
When using a multilingual index, what I have done is create a separate field for each language, each using the language-specific text field type.
So let's say you have English and Portuguese; you would declare two fields:
descriptionPt, using text_pt
descriptionEn, using text
Now when you run your search, you specify which field to use (or both) via qf, and set defType=edismax.
This worked fine for me.

Solr - search word immediately followed by partial match (with wildcard)

I have a Solr index filled with documents, with a field named issuer.
There is a document with issuer=first issuer.
I'm trying to implement matching of two consecutive words. The first word needs to match completely; the second needs to match partially.
What I am trying to achieve is:
I search for something like: issuer:first\ iss*
I expect it to match "first issuer"
I tried the following solutions but none is working:
issuer:first\ iss* -> returns nothing
issuer:"first iss"* -> returns everything
issuer:(first iss*) -> also returns "issuer first"
Does anybody have a clue on how to achieve the desired result?
My suggestion is to add a shingle-filter-based field type to your schema. Below is a simple definition:
<fieldType name="shingle" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
</fieldType>
You then add another field with this type as shown below:
<field name="issuer_sh" type="shingle" indexed="true" stored="false"/>
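At query time, you can issue the following query (the space is escaped rather than quoted, because wildcards are not expanded inside a quoted phrase):
issuer_sh:first\ iss*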
The ShingleFilter creates word n-gram tokens (shingles) from your text. For instance, if the issuer field contains "first issue", then Solr will create and index the following tokens:
first
issue
first issue
You can't search with wildcards in phrase queries. Without changing how you are indexing (see @ameertawfik's answer), the standard query parser doesn't provide a good way to do this. You can, however, use the surround query parser to search using spans. This query would then look like:
1N(first, iss*)
Keep in mind, surround query parser does not analyze, so 1N(first, iss*) and 1N(First, iss*) will not find the same results.
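You could also construct this query using Lucene's SpanQuery classes directly, of course, like:
SpanQuery[] queries = new SpanQuery[2];
// First clause: the exact term "first"
queries[0] = new SpanTermQuery(new Term("issuer","first"));
// Second clause: any term starting with "iss", wrapped so it can take part in a span query
queries[1] = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("issuer","iss")));
// Slop 0 and inOrder=true: the two clauses must be adjacent and in this order
Query finalQuery = new SpanNearQuery(queries, 0, true);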

Solr fullname search

I'm trying to set up a full-name search in Solr. I thought my setup was fine until I found something strange, and I can't figure out how to correct it.
I want to be able to search on full names. My index is built from a database, from which I take the first name and last name and put them in one multivalued field with a keyword tokenizer.
Here's my fieldType:
<fieldType name="text_auto" class="solr.TextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Everything works fine: I can search for only a first name or a last name and get the matching full names, and it also works for full names in any order as long as there is no misspelling.
I just noticed something wrong, though. For example, if I search for Dupont dupont, it gives me every Dupont that exists, even the ones whose first name doesn't match dupont. I guess it's because dup is found a second time in the full name. The problem is that if someone searches for "dupont d", they'll find every Dupont that exists, because "d" is contained in Dupont. That's not what I want: I want to find every Dupont with a d in their first name (the other string).
So I need to find a way to make this work. I tried many different tokenizers and filters, but I'm afraid it's not possible.
Thank you for any help you can provide!
Sounds like you are searching with something like:
q=dupont d
Which will have no problem with finding the terms in any order, or even as the same term in the index, in the case of dupont dupont (I'm assuming, by the way, that you are setting the default operator to AND, since this sort of behavior is surprising). If you want to find the phrase "dupont d" in that order, you should search with a quoted phrase query:
q="dupont d"
or for dupont dupont
q="dupont dupont"
