I would like to index some words with special characters all together.
For example, given m&m, I would like to index it as a whole, rather than delimiting it as m and m (normally & would be considered as a delimiter).
Is there a way to achieve this by using a standard tokenizer/filter, or do I have to write one myself?
Basically, the text field type filters out special characters before indexing. You can use the string type instead, but it is not advisable for searching. Alternatively, you can use the types option of WordDelimiterFilterFactory, or you can convert the special characters to words:
% => percent
& => and
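As a minimal sketch of the types option mentioned above (not a definitive setup): the field type name `text_keep_symbols` and the file name `wdfftypes.txt` are assumptions made for illustration. The whitespace tokenizer is used because the standard tokenizer would drop `&` before the filter ever sees it.

```xml
<fieldType name="text_keep_symbols" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "&" and "%" reach the filter below -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- wdfftypes.txt is a hypothetical file in the core's conf directory.
         Assumed contents, one rule per line:
           & => ALPHA
           % => ALPHA
         With these rules the filter treats "&" and "%" as letters
         instead of as delimiters. -->
    <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, m&m should come out of the analysis chain as the single token m&m; the Analysis screen in the Solr admin UI is a good place to confirm that.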
The StandardTokenizerFactory splits the given text at special characters. To index tokens with their special characters intact, you could either write your own custom tokenizer or do the following:
Take a list of the characters at which you want to split the text. For example, my list is {" ", ";"}.
Use a PatternTokenizer with the above list of characters instead of the StandardTokenizer. Your configuration will look like:
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=" |;" />
</analyzer>
You can use WhitespaceTokenizerFactory.
http://docs.lucidworks.com/display/solr/Tokenizers#Tokenizers-WhiteSpaceTokenizer
It tokenizes only on whitespace. For example, "m&m" will be treated as a single token and indexed as such.
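A minimal sketch of such a field type, assuming a hypothetical name `text_ws` and an optional lowercase filter that is not part of the answer above:

```xml
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokens are separated by whitespace only; "m&m" stays intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Optional: make matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that punctuation attached to a word, such as a trailing comma, also stays in the token, so "m&m," and "m&m" would be indexed as different terms.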
Related
After setting splitOnNumerics="0", I can search for words that mix digits and letters, such as "90s" and "omega30", but it still does not work with special characters, as in "80"" and "40)", even when I escape them: 80\", 40\). Do you have any idea?
I have a requirement where I have to match a substring in a query.
For example, if the field has the value:
PREFIXabcSUFFIX
I have to create a query that matches abc. I always know the length of the prefix.
I cannot use EdgeNGram or NGram because of space constraints (they would enlarge the index).
So I need to do this at query time and not at index time. Using a leading wildcard, as in *abc*, has a high impact on performance.
Since I know the length of the prefix, I am hoping for something like ....abc*, where the dots represent the exact length of the prefix, so that the query is not as expensive as scanning the whole index, as the wildcard query *abc* does.
Is this possible in Solr? Thanks for your time.
Solr version : 4.10
Sure. Wildcard syntax is documented here; you could search for something like ????abc*. You could also use a regex query.
However, the performance benefit from this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
You could use the Regular Expression Pattern Tokenizer (solr.PatternTokenizerFactory) for this. For the sample below I assumed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would be indexed as abcSUFFIX. This way you can search for abc*
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
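One caveat with the sample above: if the same analyzer also runs at query time, a plain (non-wildcard) query such as abcSUFFIX would lose its own first six characters. A hedged sketch of one way around this is to declare separate index and query analyzers; the field type name `text_skip_prefix` and the choice of a keyword tokenizer on the query side are assumptions, not part of the original answer.

```xml
<fieldType name="text_skip_prefix" class="solr.TextField">
  <!-- Index time: drop the first 6 characters, keep the rest as one token -->
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
  </analyzer>
  <!-- Query time: leave the user's input untouched as a single token -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because only one token is stored per value, this should add far less to the index than an NGram-based field would.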
How can we map non-ASCII characters to ASCII characters?
For example, in our Solr index we have words containing the characters ñ, Ñ [LATIN CAPITAL LETTER N WITH TILDE] or plain n, N.
What filter/tokenizer should we use so that a search with either plain N or Ñ matches both?
Merging the answers of Solr, Special Chars, and Latin to Cyrilic char conversion
Take a look at Solr's Analyzers, Tokenizers, and Token Filters which give you a good intro to the type of manipulation you're looking for.
Probably the ASCIIFoldingFilterFactory does exactly what you want.
When changing an analyzer to remove the accents, keep in mind that you need to reindex. Otherwise the accented characters will stay within the index, but no user input can be created to match them.
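A minimal sketch of a field type that uses this filter, assuming a hypothetical name `text_folded` and an added lowercase filter so that Ñ, ñ, N and n all end up as the same term:

```xml
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> element applies to both indexing and querying -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds characters outside Basic Latin to ASCII equivalents, e.g. ñ to n -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```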
Update
I tried using the ICUFoldingFilterFactory, and it works fine with those accents. If it is tricky to set up, have a look at the SO question Can not use ICUTokenizerFactory in Solr
This analyzer
<fieldType name="spanish" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory" />
</analyzer>
</fieldType>
got me these analysis results (the screenshot is taken from the Solr admin UI).
I need to index words in Spanish and have tested with ASCIIFoldingFilterFactory. This filter works great for accented characters (it converts á -> a), but it also converts ñ -> n, which is not valid behaviour (it gives wrong results for some words).
Is there a way to exclude a letter from ASCIIFoldingFilterFactory or another filter to try?
Thanks
You can use MappingCharFilter and customise the mappings that are in mapping-FoldToASCII.txt
<charFilter class="solr.MappingCharFilterFactory"
mapping="/solr/trunk/solr/example/solr/conf/mapping-FoldToASCII.txt"/>
(change the file location to the location on your system)
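As a sketch of what a customised mapping could look like for this case: the field type name `text_es` and the file name `mapping-spanish.txt` are hypothetical, and the listed mappings are only an assumed subset. The point is that ñ and Ñ are simply left out of the file, so they pass through unchanged.

```xml
<fieldType name="text_es" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- mapping-spanish.txt is a hypothetical file in the core's conf directory.
         Assumed contents (ñ and Ñ are deliberately not listed):
           "á" => "a"
           "é" => "e"
           "í" => "i"
           "ó" => "o"
           "ú" => "u"
           "ü" => "u"
           "Á" => "A"
           "É" => "E"
           "Í" => "I"
           "Ó" => "O"
           "Ú" => "U"
           "Ü" => "U" -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-spanish.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Char filters run before the tokenizer, so the accents are already removed by the time the text is split into tokens.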
You can try extending BaseTokenFilterFactory and pointing to it in the schema.xml file as one of your index/search filters.
I have an index containing AT&T as a field value, but when I search for it I cannot put the & sign in the query, so it is encoded as AT%26T. Searching for AT%26T returns nothing.
Is there any way to use an analyzer or filters to index this type of term?
NOTE: I have used the WordDelimiter analyzer with preserveOriginal=1, but that didn't work.
You can try to search for AT&T
Otherwise, you can find out in admin/analysis what happens to the term AT&T at the query and index stages. With verbose on, you can see exactly what the analyzers do with your terms.
Another possible cause, besides those mentioned by others, is unescaped special characters. You should escape everything in this list:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
Just try using a backslash before the ampersand.
You need to tune WordDelimiter a bit further. See the adjustments I made for Jetwick to search for hashtags à la #java
https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/JetwickFilterFactory.java#L49
The background: AT&T is normally tokenized as AT and T because '&' is removed, since it is neither a digit nor a letter. With the class above you can have the '&' sign handled as a digit, so anything containing '&' will be tokenized as 'AT&T' (and, I think, also as 'AT' and 'T'), but only if preserveOriginal=1. Alternatively, you can handle it as a letter, but then I think it won't be split into 'AT' and 'T', because every position of the string is detected as a letter.
BTW: you'll need to reindex and apply the same analyzer/tokenizer to the query string too!
Maybe you can try catenateWords="1", so that AT&T will also be indexed as ATT.
Also make sure your analyzer appears under both:
<analyzer type="query"> //this will define how the query is parsed and split into tokens before searching it
and
<analyzer type="index">// this will define how the field is indexed
If you only have the single tag <analyzer>, then the analyzer will be used at both query and index time.
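Putting the suggestions from this thread together, a field type might look roughly like the sketch below. The name `text_company` and the exact combination of options are assumptions for illustration; the whitespace tokenizer is chosen so that '&' survives tokenization and reaches the word delimiter filter.

```xml
<fieldType name="text_company" class="solr.TextField" positionIncrementGap="100">
  <!-- Defines how the field is indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- preserveOriginal keeps AT&T, generateWordParts adds AT and T,
         catenateWords adds ATT -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Defines how the query is split into tokens before searching -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After changing the field type, reindex, then compare the index and query token streams for AT&T on the admin analysis page.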