Query Solr accented and unaccented

I'm configuring a Solr core that stores Brazilian Portuguese data.
For accents, I need queries to behave like this:
search | return
computação | computacao
computacao | computação
Basically, whether or not the query contains accents, I need both forms of the word to be returned.
I tried:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
Neither worked.
I'm using Solr 5.2.1
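To make the goal concrete, here is a minimal Python sketch of the behaviour I'm after. It is only an illustration, not how Solr implements folding: the helper name fold_accents is made up, and the point is that the same folding has to be applied to both the indexed value and the query term.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks so accented and unaccented spellings compare equal."""
    # NFD splits characters such as 'ç' into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Fold the stored value once, as an index-time analyzer would.
indexed = fold_accents("computação".lower())

# Fold each query the same way; both spellings now match the indexed form.
for query in ("computação", "computacao"):
    assert fold_accents(query.lower()) == indexed
```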

Try adding BrazilianStemFilterFactory as a filter on the field type used for searching this field.
It is written specifically for Brazilian Portuguese, so it may solve your issue.

When using a multilingual index, what I have done is create a separate field for each language, each using the language-specific text field type.
So let's say you have English and Portuguese and thus you would declare two fields:
descriptionPt and use text_pt
descriptionEn and use text
Now when you run your search, you specify which field to use (or both) via qf, and set defType=edismax.
Worked fine for me.
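As a rough sketch of what such a request could look like, here are the query parameters built in Python. The helper name build_edismax_params is hypothetical; descriptionPt and descriptionEn are the two fields from above, and you would append the resulting query string to your own core's select URL.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, fields):
    """Build edismax parameters that search every field listed in qf."""
    return {
        "q": user_query,
        "defType": "edismax",      # parameter name is case-sensitive
        "qf": " ".join(fields),    # space-separated list of fields to query
    }


params = build_edismax_params("computação", ["descriptionPt", "descriptionEn"])
query_string = urlencode(params)  # URL-encoded, ready to append to a select URL
```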

Related

DSE Search And Solr - Issues with whitespace in UDT search queries

I'm trying to get my DSE Search query working (with Solr). However, when constructing queries with user-defined types (UDTs), I'm running into issues with the whitespace character.
For example, I have a Student table and a Name type, where the Student table has a list<frozen<Name>> names column. The Name type has, say, firstname and lastname. If I run the query below, it throws an error:
Unable to execute CQL Script : no field name specified in query and no default specified via ‘df’ param.
SELECT * from Student where solr_query= '{!tuple}names.firstname:John Smith';
So I tried escaping the whitespace as below and it works just fine.
SELECT * from Student where solr_query= '{!tuple}names.firstname:John\ Smith';
But, when I use the above UDT field with an AND operator, it FAILS again.
SELECT * from Student where solr_query= 'student_id:123456 AND {!tuple}names.firstname:John\ Smith';
Unable to execute CQL Script : org.apache.solr.search.SyntaxError: Cannot parse names.firstname … Lexical error at line 1, column … Encountered: after : “”
This is the field type for first name:
<fieldType class="org.apache.solr.schema.TextField" name="DelimitedTextField">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[,\s]"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
As a beginner with Solr, I've been banging my head trying to make these queries work. Any help would be deeply appreciated. Thanks!
I am no expert on the DSE system you seem to be using, but looking at this resource [1], it seems you may be building boolean queries the wrong way.
This seems to be the correct approach:
+{!tuple v='father.name.firstname:Sam'} +{!tuple v='mother.name.firstname:Anne'}
Hope it helps
[1] http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchTupleUDTqueries.html
I was able to get this working by doing a few things:
I created a new org.apache.solr.schema.TextField and added a PatternTokenizerFactory tokenizer, with a comma (,) as the pattern.
I trimmed the whitespace at the beginning and at the end, and replaced the whitespace within the text with '?', which matches any single character. This was OK to do in my case.
I had to wrap the entire query in parentheses ().
Hence, with the updated schema.xml file and the other changes mentioned above, I have the following query working now:
SELECT * from Student where solr_query= '(student_id:123456 AND {!tuple}names.firstname:John?Smith)';
Even though this would match John Smith, John-Smith, or even John.Smith, this was OK in my case since we were supposed to return these results anyway.
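For what it's worth, the client-side preparation described above can be sketched roughly like this in Python. The function names are made up for illustration, and it assumes the values contain no other characters that are special to the query parser.

```python
import re


def to_single_char_wildcards(value: str) -> str:
    """Trim the value and replace every inner whitespace character with '?'."""
    return re.sub(r"\s", "?", value.strip())


def build_solr_query(student_id: str, first_name: str) -> str:
    """Wrap the whole boolean expression in parentheses, as described above."""
    name = to_single_char_wildcards(first_name)
    # Doubled braces produce the literal {!tuple} prefix inside an f-string.
    return f"(student_id:{student_id} AND {{!tuple}}names.firstname:{name})"


solr_query = build_solr_query("123456", "  John Smith ")
```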

Solr special character search on person name

I am a newbie to Solr. I have indexed people's names in my collection as below.
When searching for a name like Àlvarez Rubén, I am unable to retrieve the results. I tried escaping using /, but I didn't get the correct result. Please help.
Apply mapping-ISOLatin1Accent.txt before tokenizing your incoming values:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

Synonyms are not working in IBM Watson Retrieve and Rank

This is my synonyms.txt
file system => filesystem
file set => fileset
version , release
latest, new
content, information
I have changed synonyms.txt, but the synonyms are not working. Please also help me with how to define space-separated synonyms,
e.g.
foo bar => foobar
The field type "watson_text_en" used in Retrieve and Rank doesn't include a synonym filter by default. You need to update your schema.xml to add that filter. Here is where and what to add: in your schema.xml, in the <fieldType name="watson_text_en"> section, add <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> to the <analyzer> filter list.
Depending on your requirements, you can add it to either or both of <analyzer type="index"> and <analyzer type="query">, which tell Solr whether to apply it at index time and/or query time. Adding it to "index" requires reindexing for the change to take effect, while adding it to "query" does not. Also, the filter list runs in the order you write it, so you can choose where to put this filter to make it run before or after certain other filters. For example, if you put it before solr.LowerCaseFilterFactory, it's better to set ignoreCase="true", because it will run before everything is transformed into lower case.
Just a note regarding adding the filter to "query": according to the Solr docs, http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory, this is a Very Bad Thing to Do.

Solr - search word immediately followed by partial match (with wildcard)

I have a Solr index filled with documents, with a field named issuer.
There is a document with issuer=first issuer.
I'm trying to implement matching of two consecutive words. The first word needs to match completely, the second needs to match partially.
What I am trying to achieve is:
I search for something like: issuer:first\ iss*
I expect it to match "first issuer"
I tried the following solutions but none is working:
issuer:first\ iss* -> returns nothing
issuer:"first iss"* -> returns everything
issuer:(first iss*) -> also returns "issuer first"
Does anybody have a clue on how to achieve the desired result?
My suggestion is to add a shingle-filter-based field type to your schema. Below is a simple definition:
<fieldType name="shingle" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
</fieldType>
You then add another field with this type as shown below:
<field name="issuer_sh" type="shingle" indexed="true" stored="false"/>
At query time, you can issue the following query:
issuer_sh:"first iss*"
The shingleFilter creates n-gram tokens from your text. For instance, if the issuer field contains "first issue", then Solr will create and index the following tokens:
first
issue
first issue
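To illustrate why a prefix match on the shingled field finds the right documents, here is a small Python sketch of the token generation. It is only an approximation of what ShingleFilterFactory emits (token order and position details differ), and the shingles function is a made-up helper.

```python
def shingles(text, min_size=2, max_size=5, output_unigrams=True):
    """Return single words plus word n-grams of min_size..max_size words."""
    words = text.split()  # whitespace tokenization, as in the field type above
    tokens = list(words) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


prefix = "first iss"

# "first issue" yields a two-word shingle that starts with the prefix.
matching = [t for t in shingles("first issue") if t.startswith(prefix)]

# With the words reversed, no token starts with the prefix, so nothing matches.
reversed_order = [t for t in shingles("issuer first") if t.startswith(prefix)]
```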
You can't search with wildcards in phrase queries. Without changing how you are indexing (see @ameertawfik's answer), the standard query parser doesn't provide a good way to do this. You can, however, use the surround query parser to search using spans. The query would then look like:
1N(first, iss*)
Keep in mind, surround query parser does not analyze, so 1N(first, iss*) and 1N(First, iss*) will not find the same results.
You could also construct this query using Lucene's SpanQueries directly, of course, like:
// Match "first" immediately followed (slop 0, in order) by a term starting with "iss".
SpanQuery[] queries = new SpanQuery[2];
queries[0] = new SpanTermQuery(new Term("issuer", "first"));
queries[1] = new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term("issuer", "iss")));
Query finalQuery = new SpanNearQuery(queries, 0, true);

Solr: Localities & solr.ICUCollationField usage?

I'm learning Solr and have become confused trying to figure out ICUCollation: what it does, what it is for and how to use it. I haven't found any good explanation of this online. The docs appear to be saying that I need to use ICUCollation and imply that it does magical things for me, but they do not seem to explain exactly why or exactly what, or how it integrates with anything else.
Say I have a text field in French and I want stopwords removed, accents, punctuation and case ignored and stemming... how does ICUCollation come into this? Do I set solr.ICUCollationField and locale='fr' and it will do everything else automatically? Or do I set solr.ICUCollationField and then tokenizer and filters on this in addition? Or do I not use solr.ICUCollationField at all because that's for something completely different? And if so, then what?
Collation is the organisation of written information into an order. ICUCollationField (the API documentation also provides a good description) is meant to let you provide locale-aware sorting, as the sort order is defined by cultural norms and specific language properties. This is useful to allow different sorting based on those rules, such as the difference between Norwegian and Swedish, where a Swede would order Å before Æ/Ä and Ø/Ö, while a Norwegian would order it Æ/Ä, Ø/Ö and then Å.
Since you usually don't want to sort by a tokenized field (exception: KeywordTokenizer) or a multivalued field, these fields are usually not processed any more than allowing for the sorting / collation to be performed.
There is a case to be made for collation filters for searching as well, as search in practice is just comparison. This means that if you're aiming to search for two words that would be identical when compared in the locale provided, it would be a hit. The tokens indexed will not make any sense when inspected, but as long as the values are reduced to the same token both when indexing and searching, it would work. There's an example of this on the wiki under UnicodeCollation.
Collation does not affect stopwords (StopFilterFactory), accents (ICUFoldingFilterFactory), punctuation, case (LowerCaseFilterFactory or ICUNormalizer2FilterFactory; this depends on the locale: if the locale used for sorting is case aware, then it does not) or stemming (SnowballPorterFilterFactory). Have a look at the suggested filters for those. Most filters and tokenizers in Solr do one very specific task and avoid doing "everything and the kitchen sink" in a single filter.
You normally have two or more fields for one text input if you want to do different things like:
search: text analysis
sort: language sensitive / case insensitive sorting
facet: string
For search use something like:
<fieldType name="textFR" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
For sorting use:
<fieldType name="textSortFR" class="solr.ICUCollationField"
locale="fr"
strength="primary" />
or simply:
<fieldType name="textSort" class="solr.ICUCollationField"
locale=""
strength="primary" />
(If you have to support many languages. Should work fine enough in most cases.)
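To get a feel for what strength="primary" means for sorting, here is a crude Python approximation. It is not ICU and ignores locale-specific rules entirely; it only mimics the idea that accents and case are not significant at primary strength. The primary_key helper is made up for illustration.

```python
import unicodedata


def primary_key(value: str) -> str:
    """Rough stand-in for a primary-strength key: base letters only, no case."""
    decomposed = unicodedata.normalize("NFD", value)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


words = ["zèbre", "étage", "Eclair", "atelier"]

# Plain code point order puts capitals first and accented letters last.
naive = sorted(words)

# Sorting on the folded key gives the order a reader would expect.
collated = sorted(words, key=primary_key)
```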
Do make use of the Analysis UI in the Solr Admin: open the analysis view for your index, select the field type (e.g. your sort field), add a representative input value in the left text area and a test value in the right field (for sorting, the right-side value is less interesting, as the sort field is not used for matching).
The output will show you whether:
accents are removed
elisions are removed
lower casing is applied
etc.
For example, if you see that elisions (l'atelier) are not removed (atelier) but you would like to discard them for sorting, you would have to add the elision filter (see the example search field type above).
https://cwiki.apache.org/confluence/display/solr/Language+Analysis

Resources