Solr ignoring slash order

I have an index field called: texts
The field contains values like: 12/1
And also: 1/12
The problem is that when I query: texts:"1/*"
it also finds 12/1, as if the slash had no meaning.
How can I limit the results by order?
(I've tried texts:"1\/*" and it doesn't work.)
The type of the field:
<fieldType class="org.apache.solr.schema.TextField" name="TextField">

The problem is that you're using the TextField type, which tokenizes your text and then applies additional filtering such as lower-casing. As a result, the index doesn't contain the value 12/1; it contains two tokens, 12 and 1, and the same two tokens for 1/12. Your search for 1/* therefore matches both records, because it is effectively performed against the token 1 that was produced by tokenization.
To keep the string from being tokenized, you can either:
use the StrField type instead, in which case the string is indexed as-is, without lower-casing, etc.; or
if you want lower-casing, etc., define a new type for your field that uses solr.KeywordTokenizerFactory as its tokenizer, and add the corresponding filters.
You can read more in the DataStax documentation. Also note that, starting with version 6, the default type for text data is StrField, and you need to explicitly define a TextField if you need tokenization, etc.
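As a minimal sketch of the second option, assuming a schema.xml-style definition: the type name text_keyword_lc is hypothetical, and the field attributes are illustrative rather than taken from the question.

```xml
<!-- Hypothetical type name: the whole value is kept as one token, then lower-cased -->
<fieldType name="text_keyword_lc" class="solr.TextField">
  <analyzer>
    <!-- KeywordTokenizer emits the entire input (e.g. 12/1) as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The field from the question, switched to the new type (attributes are assumptions) -->
<field name="texts" type="text_keyword_lc" indexed="true" stored="true"/>
```

With a type like this, an unquoted wildcard query such as texts:1\/* should match 1/12 but not 12/1, since each value is indexed as a single token. The data has to be reindexed after the field type changes.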

Related

Solr query not working as expected when it contains the `#` character

I have a field called email_txt of type text_general that holds a list of emails of the form abc#xyz.com,
and I'm trying to create a query that only searches the username and disregards the domain.
My query looks something like this:
email_txt:*abc*#*
This produces 0 results. I expect to receive results where the username contains abc, like abcdefg#xyz.com, fooabc#xyzbuzz.com, barabcefg#fizzxyz.com, abc#fizz.com. And yes, I am confident that I have data of that type; it doesn't work even if I try email_txt:*#*.
If I try something like:
email_txt:*abc*
It works, and produces multiple results, including the desired ones from above, but also cases where the domain contains abc, like fizz#helpmeabc.com, which is not desired.
I've had a look at the documentation (just in case I'm going crazy) and it confirms that # is not a special character. Even so, I have tried to escape it like this (just in case, I am going crazy):
email_txt:*abc*\#*
still, 0 results
Now the actual question. Is # a special character? If so, how can it be escaped, if not what am I doing wrong in the query? I genuinely can't tell if there is a flaw in my logic, or if there is something that I am missing.
Notes: I'm using solr version 6.3.0, the doc is for 6.6 (the closest available)
When you're using the StandardTokenizer (which the default field types text_general, text_en, etc. use), the content is split into tokens wherever the # sign occurs. That means that for your example fizz#helpmeabc.com, there are actually two or three tokens being stored: (fizz and helpmeabc.com) or (fizz, helpmeabc and com).
A wildcard match is applied against the individual tokens (unless you're using the complex phrase query parser), and no tokenization or filtering takes place on the wildcard term itself (except for multi-term aware filters such as the lowercase filter).
The effect is that your query, *abc*#*, attempts to match a token containing #, but since indexing splits on # and separates the tokens at that character, no token contains #, and thus you get no hits.
You can use the string field type, or a KeywordTokenizer paired with filters such as the lowercase filter, to get the original input indexed more or less as a single complete token instead.
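One way to apply this without losing the tokenized field is to copy the value into a second, untokenized field and run the wildcard query there. A sketch, assuming a hypothetical field name email_str and the stock string type (solr.StrField) being present in the schema:

```xml
<!-- Hypothetical untokenized copy of email_txt; multiValued because the source holds a list -->
<field name="email_str" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- Copies the raw input of email_txt (before analysis) into email_str at index time -->
<copyField source="email_txt" dest="email_str"/>
```

A query such as email_str:*abc*#* would then be compared against whole addresses like abcdefg#xyz.com, where the # is still part of the indexed value. Matching on a string field is case-sensitive, and leading wildcards can be slow on large indexes.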

SOLR: facet.field is working for each word in a field differently, how to apply facet.field for whole field sentence?

In facet.field, I have added "MerchantName" field, so I got result as below
"facet_fields":{
"MerchantName":[
"amazon",133281,
"factory",99566,
"club",99566,
"fashion",4905,
"swish",4905,
"store",1001,
"swank",1001,
"the",1001
]
}
In the above array, "club factory", "swish fashion" and "the swank store" are each the value of a single field, but as you can see they are split up and counted as separate words.
So how can I apply faceting to the whole field, so that the result contains the whole field values?
The MerchantName field is used for faceting, so it should be defined in schema.xml as a string (type="string") in order for the facet to use the whole text.
As you are using a text-based field with the field type text_general, the value is split into multiple tokens, and the facet counts are computed per token, according to the way the value was tokenized.
You can also add docValues="true" to the MerchantName field; DocValues will then automatically be used any time the field is used for sorting, faceting or function queries.
DocValues are a special way of recording field values internally that is more efficient than traditional indexing for some purposes, such as sorting and faceting.
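A sketch of the field definition described above, assuming the stock string type exists in the schema; the indexed and stored values are assumptions, not taken from the question:

```xml
<!-- MerchantName as an untokenized string, so each facet bucket is a whole value -->
<field name="MerchantName" type="string" indexed="true" stored="true" docValues="true"/>
```

After reindexing, a request with facet=true&facet.field=MerchantName should return buckets such as "club factory" and "the swank store" rather than individual words.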

Solr exact match field boosting

I have this requirement: if the query text matches a particular field value (the title field) exactly, that result must come first, or at least be boosted.
So I need to boost the results with an exact match.
My solution is to index the title as an untokenized field, so that it only matches exactly, and boost that field with an edismax query.
Is there any other way?
How can I index a field untokenized, i.e. without tokenizing on spaces?
Use a KeywordTokenizer - this will index the field as a single value, but still allow you to attach filters - for example to lowercase the text before storing the token.
If you don't want to perform lowercasing either, you can use a string (StrField) field - a string field will only give a hit if the value is exactly the same.
This is usually what you'll do to give exact hits a larger boost than other hits, and you can use the qf parameter to dismax (which you're probably using already) to give the list of fields and their boosts. Use copyField to index the content into separate fields with different definitions.
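A minimal sketch of that setup, assuming the title field uses text_general and introducing a hypothetical title_exact field for the untokenized copy:

```xml
<!-- Tokenized field used for normal matching (its type is an assumption) -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- Hypothetical untokenized copy, used only to boost exact matches -->
<field name="title_exact" type="string" indexed="true" stored="false"/>

<!-- Index the same content into both fields -->
<copyField source="title" dest="title_exact"/>
```

The query could then pass something like defType=edismax&qf=title_exact^10 title, where the boost of 10 is an arbitrary illustration. Multi-word queries may need to be sent as a quoted phrase to reach title_exact as a single term, and a KeywordTokenizer-based type with a lowercase filter can replace string if the exact match should be case-insensitive.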

Solr Text field and String field - different search behaviour

I am working on Solr 4+.
I have several fields in my Solr schema with different Solr field types.
Does searching on a text field differ from searching on a string field?
I ask because I am trying to search on a string field (which is a copy field of a few facet fields), and it does not work as expected. The destination string field is both indexed and stored.
However, when I change the destination field to a text field (indexed only), it works fine.
Can you suggest why this happens? What exactly is the difference between text and string fields in Solr with respect to searching?
TextFields usually have a tokenizer and text analysis attached, meaning that the indexed content is broken into separate tokens where there is no need for an exact match - each word / token can be matched separately to decide if the whole document should be included in the response.
StrFields cannot have any tokenization or analysis / filters applied, and will only give results for exact matches. If you need a StrField with analysis or filters applied, you can implement this using a TextField and a KeywordTokenizer.
The Solr example schema describes text_general as a general text field that has reasonable, generic cross-language defaults: it tokenizes with StandardTokenizer, removes stop words from case-insensitive "stopwords.txt" (empty by default), and down cases. At query time only, it also applies synonyms.
The StrField type is not analyzed, but indexed/stored verbatim.
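For comparison, the two kinds of type look roughly like this in the example schemas shipped with older Solr releases. This is a sketch from memory of those defaults; details such as the synonym filter class vary between versions, so check the schema of your own installation.

```xml
<!-- Analyzed text type: tokenized, stop words removed, lower-cased -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Synonyms are applied at query time only -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Unanalyzed string type: the value is indexed verbatim as one term -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
```

A search on a text_general field can match any single word of the value, while a search on a string field only matches when the query equals the entire stored value, including case and spacing.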

Solr copyField mixed with RegexTransformer

Scenario:
In the database I have a field called Categories which is of type string and contains a number of pipe-delimited numbers, such as 1|8|90|130|
What I want:
In Solr index, I want to have 2 fields:
Field Categories_pipe which would contain the exact string as in the DB i.e. 1|8|90|130|
Field Categories which would be a multi-valued field of type INT containing values 1, 8, 90 and 130
For the latter, in the entity specification I can use a regexTransformer then I specify the following field in data-config.xml:
<field column="Categories" name="Navigation" splitBy="\|"/> and then specify the field as multi-valued in schema.xml
What I do not know is how I can 'copy' the same field twice and perform regex splitting on only one copy. I know there is the copyField facility that can be defined in schema.xml; however, I can't find a way to transform the copied field because, from what I know (and I may be wrong here), transformers are only available in the entity specification.
As a workaround I can also send the same field twice from the entity query, but in reality the field Categories is a computed field (nested selects) which is somewhat expensive, so I would like to avoid that.
Any help is appreciated, thanks.
Instead of splitting it in data-config.xml, you could do that in your schema.xml. Here is what you could do:
Create a fieldType with the tokenizer PatternTokenizerFactory, which uses a regex to split on |.
FieldSplit: create a multivalued field using this new fieldType; it will end up with the tokens 1, 8, 90, 130.
FieldOriginal: create a string field (if you need no analysis on it) that preserves the original value 1|8|90|130|.
Now you can use copyField to copy values between FieldSplit and FieldOriginal, based on your needs.
Check this Question, it is similar.
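The schema-side approach could be sketched as follows. The type name pipe_delimited is hypothetical, FieldOriginal and FieldSplit are the placeholder names used above, and the indexed/stored attributes are assumptions:

```xml
<!-- Hypothetical type: splits the incoming value on the literal | character -->
<fieldType name="pipe_delimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\|"/>
  </analyzer>
</fieldType>

<!-- Keeps the original value, e.g. 1|8|90|130|, untouched -->
<field name="FieldOriginal" type="string" indexed="true" stored="true"/>

<!-- Indexed as separate terms; add multiValued="true" if several values are sent per document -->
<field name="FieldSplit" type="pipe_delimited" indexed="true" stored="false"/>

<!-- The expensive column is sent once and copied at index time -->
<copyField source="FieldOriginal" dest="FieldSplit"/>
```

Note that with this approach the split values are indexed as text terms, not as an integer type, so it fits term lookups such as FieldSplit:90 better than numeric range queries.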
You can create two columns from the same data and treat them separately.
SELECT categories, categories as categories_pipe FROM category_table
Then you can split the "categories" column, but index the other one as-is.
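In data-config.xml this might look like the sketch below. The entity name is hypothetical, the target field names follow the question, and the column name casing returned by the query depends on the database:

```xml
<!-- Hypothetical entity: one query, two aliases of the same column -->
<entity name="category" transformer="RegexTransformer"
        query="SELECT categories, categories AS categories_pipe FROM category_table">
  <!-- Split on the literal | into multiple values -->
  <field column="categories" name="Categories" splitBy="\|"/>
  <!-- Indexed as-is, with no splitting -->
  <field column="categories_pipe" name="Categories_pipe"/>
</entity>
```

Because the alias is resolved in the same SELECT, the computed column only has to be evaluated once per row by the database in most engines, though that is worth confirming for the query in question.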
