How to process the AT&T token in a Solr index

I have an index containing AT&T as a field value, but when I search for it, the & sign cannot be put in the query as-is, so it is encoded as AT%26T. Searching for AT%26T returns nothing.
Is there any way to use an analyzer or filters to index this type of term?
NOTE: I have used the WordDelimiter filter with preserveOriginal=1, but that didn't work.

You can try to search for AT&T
Otherwise, you can check in admin/analysis what happens to the term AT&T at the query and index stages. With verbose on, you can see exactly what the analyzers do with your terms.

Another possible cause, besides those given in the other answers, is unescaped special characters. You should escape every character in this list:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
Just try putting a backslash before the ampersand.
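As a rough illustration of the escaping advice, here is a minimal Python sketch of a client-side helper. The function name escape_term is made up for this example, and it assumes you build the query string yourself and then URL-encode it for the q parameter. It escapes & and | individually, which is stricter than the list strictly requires.

```python
import re
from urllib.parse import quote

# Single characters taken from the special-character list above.
# Assumption: escaping a lone & or | is harmless, so they are escaped too.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\])')

def escape_term(term: str) -> str:
    """Put a backslash in front of every query-syntax special character."""
    return _SPECIAL.sub(r'\\\1', term)

escaped = escape_term("AT&T")      # AT\&T
encoded = quote(escaped, safe="")  # value to place after q=field: in the URL
print(escaped, encoded)
```

Escaping only helps if the analyzer keeps the & at index time; otherwise the escaped term is still split or stripped like the indexed one.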

You need to tune WordDelimiter a bit further. See the adjustments I made for Jetwick to search for hashtags such as #java:
https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/JetwickFilterFactory.java#L49
The background: AT&T is normally tokenized as AT and T because '&' is removed, as it is neither a digit nor a letter. With the class above you can have the '&' sign handled as a digit, so anything containing '&' is then tokenized as 'AT&T' (and, I think, also as 'AT' and 'T'), but only if preserveOriginal=1. Alternatively, you can have it handled as a letter, but then I think it won't be split into 'AT' and 'T', because every position of the string is detected as a letter.
BTW: you'll need to reindex and apply the same analyzer/tokenizer on the query string too!

Maybe you can try catenateWords="1", so that AT&T will also be indexed as ATT.
Also make sure your analyzer appears under both:
<analyzer type="query"> //this will define how the query is parsed and split into tokens before searching it
and
<analyzer type="index">// this will define how the field is indexed
If you only have a plain <analyzer> tag, then the analyzer is used at both query and index time.

Related

How to configure Solr to use synonyms with KeywordTokenizerFactory

Synonym, e.g.: "AAA" => "AVANT AT ALJUNIED"
If I search for AAA*BBB, I want to get AVANT AT ALJUNIEDBBB.
I used StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative position of the search words.
I also tried StandardTokenizerFactory and other filters such as WordDelimiterFilterFactory to split words on *. That doesn't work.
You can't. Synonyms work on tokens, and KeywordTokenizer keeps the whole string as a single token, so you can't expand just one part of the string when indexing if you're using KeywordTokenizer.
In addition, the SynonymFilter isn't MultiTermAware, so it isn't invoked at query time for a wildcard search. That means you can't expand synonyms for parts of the string there either, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr or, if the number of replacements is small, for having filters do pattern replacements inside the strings when indexing, so that both versions are indexed.
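The preprocessing idea could look roughly like the following Python sketch, run in the client before the query is sent. The function name and the synonym table are hypothetical. It only replaces whole literal segments between wildcard characters.

```python
import re

def expand_synonyms(query: str, synonyms: dict) -> str:
    """Replace literal segments of a wildcard query that match a synonym key."""
    # Split on the wildcard characters * and ?, keeping them in the result.
    parts = re.split(r'([*?])', query)
    return "".join(synonyms.get(part, part) for part in parts)

# Hypothetical synonym table mirroring the example in the question.
SYNONYMS = {"AAA": "AVANT AT ALJUNIED"}

print(expand_synonyms("AAA*BBB", SYNONYMS))
```

The expanded text contains spaces, so with the standard query parser it would still need those spaces escaped before being used as a single wildcard term.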

DSE Search And Solr - Issues with whitespace in UDT search queries

I'm trying to get my DSE search query working (with Solr). However, while constructing queries with user-defined types (UDTs), I'm running into issues with whitespace characters.
For example, I have a Student table and a Name type, where the Student table has a list<frozen<Name>> names column. The Name type has, say, firstname and lastname. If I run the query below, it throws an error:
Unable to execute CQL Script : no field name specified in query and no default specified via ‘df’ param.
SELECT * from Student where solr_query= '{!tuple}names.firstname:John Smith';
So I tried escaping the whitespace as below and it works just fine.
SELECT * from Student where solr_query= '{!tuple}names.firstname:John\ Smith';
But, when I use the above UDT field with an AND operator, it FAILS again.
SELECT * from Student where solr_query= 'student_id:123456 AND {!tuple}names.firstname:John\ Smith';
Unable to execute CQL Script : org.apache.solr.search.SyntaxError: Cannot parse names.firstname … Lexical error at line 1, column … Encountered: after : “”
This is the field type for first name:
<fieldType class="org.apache.solr.schema.TextField" name="DelimitedTextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[,\s]"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
As a beginner with Solr, I've been banging my head trying to make these queries work. Any help would be deeply appreciated. Thanks!
I am no expert on the DSE system you seem to be using, but looking at this resource [1], it seems you may be building boolean queries the wrong way.
This seems to be a correct approach:
+{!tuple v='father.name.firstname:Sam'} +{!tuple v='mother.name.firstname:Anne'}
Hope it helps
[1] http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchTupleUDTqueries.html
I was able to get this working by doing a few things.
I created a new org.apache.solr.schema.TextField and added a PatternTokenizerFactory tokenizer with a comma (,) as the pattern.
I trimmed the whitespace at the beginning and at the end, and replaced the whitespace within the text with '?', which matches any single character. This was OK to do in my case.
I had to wrap the entire query in parentheses ().
Hence, with the updated schema.xml file and the other changes mentioned above, I have the following query working now:
SELECT * from Student where solr_query= '(student_id:123456 AND {!tuple}names.firstname:John?Smith)';
Even though this would match John Smith, John-Smith, or even John.Smith, that was OK in my case, since we were supposed to return these results anyway.
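The value clean-up and query assembly described in this answer could be sketched in Python as follows. The helper names are made up for illustration, and the sketch only builds the solr_query string; it does not talk to DSE.

```python
import re

def tuple_clause(field: str, value: str) -> str:
    """Build a {!tuple} clause, replacing inner whitespace with '?'."""
    # Trim both ends, then turn each remaining whitespace character into '?',
    # the single-character wildcard.
    cleaned = re.sub(r"\s", "?", value.strip())
    return "{!tuple}" + field + ":" + cleaned

def build_solr_query(*clauses: str) -> str:
    """AND the clauses together and wrap the whole query in parentheses."""
    return "(" + " AND ".join(clauses) + ")"

query = build_solr_query(
    "student_id:123456",
    tuple_clause("names.firstname", "  John Smith "),
)
print(query)
```

The resulting string goes inside the single quotes of the solr_query predicate. As noted above, '?' also matches characters other than a space.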

Substring match in Solr query

I have a requirement to match a substring in a query.
For example, if the field has the value:
PREFIXabcSUFFIX
I have to create a query that matches abc. I always know the length of the prefix.
I cannot use EdgeNGram or NGram because of space constraints (they would make the index larger).
So I need to do this at query time, not at index time. Using a leading wildcard, something like *abc*, will have a high impact on performance.
Since I know the length of the prefix, I am hoping there is a way to do something like ....abc*, where the dots represent the exact length of the prefix, so that the query is not as expensive as scanning the whole index, as a wildcard query (*abc*) would be.
Is this possible in Solr? Thanks for your time.
Solr version : 4.10
Sure. Using the wildcard syntax, you could search for something like ????abc*. You could also use a regex query.
However, the performance benefit of this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
You could use the Regular Expression Pattern Tokenizer (solr.PatternTokenizerFactory) for this. For the sample below I guessed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would become abcSUFFIX, so you could then search for abc*.
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
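To check what the pattern above keeps, the same regular expression can be tried outside Solr. The following Python sketch only approximates what the tokenizer does with group="1". The 6-character prefix length is the answer's guess, and the function name is made up for illustration.

```python
import re

PREFIX_LENGTH = 6  # assumption carried over from the answer above

def strip_prefix_tokens(text: str) -> list:
    """Emit capture group 1 of each match, as group="1" does in the tokenizer."""
    pattern = re.compile(r".{%d}(.+)" % PREFIX_LENGTH)
    return [m.group(1) for m in pattern.finditer(text)]

tokens = strip_prefix_tokens("PREFIXabcSUFFIX")
print(tokens)
# A prefix query such as abc* then corresponds to a simple startswith check.
print([t for t in tokens if t.startswith("abc")])
```

Values shorter than the prefix length plus one character produce no token at all with this pattern, so such documents would not be searchable on this field.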

Solr and searching phrases with double quotes

I have an ecommerce site where I am implementing Solr (using the Solarium library), and there are product names and descriptions that contain double quotes (usually standing for inches). Before I started to grasp the analyzer and tokenizer portion of Solr, I simply assigned the field type text_en_splitting to the fields that would contain this data. If someone searches for the phrase - blue 1" binder - the double quote is removed, and the first 10 results returned are not necessarily binders; they seem to match the word blue and the number 1. Looking through the analysis of the query in the Solr admin, I see the double quotes are being removed by the WordDelimiterFilterFactory. I like WordDelimiterFilterFactory for other reasons (like dealing with the phrase post-it note), so I'm trying to find a happy medium here. Is there a better way to both index and query fields that contain double quotes that should be kept in place when performing searches (because they actually mean something)?
What I ended up doing was adding a replacement filter before the word delimiter and used the word inch.
<filter class="solr.PatternReplaceFilterFactory" pattern='(\d)"' replacement='$1 inch' replace="all"/>
Solr query parsers (such as DisMax) use a call to
SolrPluginUtils.stripUnbalancedQuotes(userQuery)
to remove unbalanced quotes. Balanced quotes are for phrase queries.
So you would really need to design your own query parser.
You may also consider replacing the quote with the word inch in the front end, before the query reaches Solr.
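A front-end replacement of this kind might look like the Python sketch below. It mirrors the PatternReplaceFilterFactory pattern shown earlier, with Python's \1 in place of Java's $1. The function name is hypothetical, and it assumes a double quote directly after a digit always means inches.

```python
import re

def inches_to_word(user_query: str) -> str:
    """Rewrite a digit followed by a double quote as '<digit> inch'."""
    return re.sub(r'(\d)"', r'\1 inch', user_query)

print(inches_to_word('blue 1" binder'))
```

If the index-time filter does the same replacement, both sides end up with the token inch, so the quote never reaches the word delimiter filter or the quote stripping in the query parser.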

How to index words with special character in Solr

I would like to index some words with special characters all together.
For example, given m&m, I would like to index it as a whole, rather than delimiting it as m and m (normally & would be considered as a delimiter).
Is there a way to achieve this with a standard tokenizer/filter, or do I have to write one myself?
Basically, the text field type filters out special characters before indexing. You can use the string type, but it is not advisable for searching. You can use the types option of WordDelimiterFilterFactory, and you can convert those special characters to words:
% => percent
& => and
The StandardTokenizerFactory splits the given text at special characters. To index terms with their special characters, you could either write your own custom tokenizer or do the following:
Take a list of characters at which you want to split the text. For example, my list is {" ", ";"}.
Use a PatternTokenizer with that list of characters instead of the StandardTokenizer. Your configuration will look like:
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern=" |;" />
</analyzer>
You can use WhitespaceTokenizerFactory.
http://docs.lucidworks.com/display/solr/Tokenizers#Tokenizers-WhiteSpaceTokenizer
It tokenizes only on whitespace. For example,
"m&m" will be treated as a single token, so it will be indexed as such.
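To compare the two suggestions, the splitting behaviour can be approximated in Python. This is a sketch only: the function names are made up, and the real tokenizers also record positions and offsets, which are ignored here.

```python
import re

def pattern_tokens(text: str, pattern: str = r" |;") -> list:
    """Split on the delimiter pattern, as a delimiter-mode pattern tokenizer would."""
    return [token for token in re.split(pattern, text) if token]

def whitespace_tokens(text: str) -> list:
    """Split on runs of whitespace only."""
    return text.split()

sample = "m&m candy;bar"
print(pattern_tokens(sample))     # splits at the space and at the semicolon
print(whitespace_tokens(sample))  # splits at the space only
```

In both cases m&m stays a single token, provided no later filter in the analyzer chain (such as a word delimiter filter) splits it again.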
