Solr comma-separated field - facet search

I have a field in my Solr index which holds comma-separated values like "area1,area2,area3,area4". There are also documents in which the value is just a single value like "area6".
Now I want to do a facet search over all of these values.
Example (this is what I want):
area1:10
area2:4297
area3:54
area4:65
area6:87
This is what I get:
area1,area2,area3,area4: 7462
area6: 87
Does Solr provide a solution for this problem, or must I separate the different values on my own?

While indexing, you need to get tokens out of the data by splitting on ,. You can use the PatternTokenizerFactory tokenizer with , as the pattern; this will split your text wherever it finds a ,.
The field in your schema.xml should be multivalued.
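A minimal schema.xml sketch of this, assuming the field is called "areas"; the field and type names are placeholders:
<!-- split the incoming value on every comma at index and query time -->
<fieldType name="comma_delimited" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
  </analyzer>
</fieldType>
<field name="areas" type="comma_delimited" indexed="true" stored="true" multiValued="true"/>
With this in place, a document indexed as "area1,area2,area3,area4" contributes one count to each of the facet values area1, area2, area3 and area4.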

Related

SOLR: facet.field is working for each word in a field differently, how to apply facet.field for whole field sentence?

In facet.field I have added the "MerchantName" field, and I got the result below:
"facet_fields":{
"MerchantName":[
"amazon",133281,
"factory",99566,
"club",99566,
"fashion",4905,
"swish",4905,
"store",1001,
"swank",1001,
"the",1001
]
}
In the above array, "club factory", "swish fashion" and "the swank store" each live in a single field, but as you can see they are treated as separate words.
So how do I apply the facet query on the whole field so that it returns an array with the whole field value?
The field MerchantName is used for faceting. This field should be defined in schema.xml as a string (type="string") in order for the facet to use the whole text.
As you are using a text-based field with the field type text_general, the value is split into multiple tokens, and the same is the case with the MerchantName field. Otherwise it will divide the value according to the way it has been tokenized.
You can also add docValues="true" to the MerchantName field; DocValues will then automatically be used any time the field is used for sorting, faceting or function queries.
For faceting, Solr can make use of DocValues, which is a special way of recording field values internally that is more efficient for some purposes, such as sorting and faceting, than traditional indexing.
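A minimal sketch of such a field definition; the attributes other than the name and type are assumptions:
<field name="MerchantName" type="string" indexed="true" stored="true" docValues="true"/>
Faceting on this field then returns "club factory", "swish fashion" and "the swank store" as whole values instead of individual words.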

SOLR - Searching record based on SOLR field in passed string

I have a CSV string field, say "field1", in Solr which can have a value similar to 1,5,7.
Now, I want to get this record if I pass values:
1,5,6,7
OR
1,5,7,10
OR
1,5,7
Basically, any of these inputs should return this record from Solr.
Is there any way to achieve this? I am open to a schema change if it helps.
The Standard Tokenizer (used in text fields like text_general) will not split on commas when there is no space between the characters.
That means that "1,2,3" will be indexed as a single token ("1,2,3"), but "1, 2, 3" will be indexed as three tokens ("1", "2", "3").
If you can make sure there is a space after each comma, both in the value that you are indexing and in the value that you use in your search query, you might be able to achieve what you want by indexing your field as text_general.
You can use the Analysis Screen in Solr to see how your value will be indexed and searched and see if any of the built-in field types gives you what you want.
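A small sketch of what that looks like, assuming the field is defined as text_general and the indexed values contain spaces after the commas:
<field name="field1" type="text_general" indexed="true" stored="true"/>
A document indexed with field1 = "1, 5, 7" is tokenized into "1", "5" and "7", so a query such as
q=field1:(1 5 6 7)
matches it, because with the default OR operator only one of the query tokens needs to match.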

Solr Text field and String field - different search behaviour

I am working on Solr 4+.
I have several fields in my Solr schema with different Solr field types.
Does search on a text field and on a string field differ?
I am asking because I am trying to search on a string field (which is a copy field of a few facet fields) and it does not work as expected. The destination string field is both indexed and stored.
However, when I change the destination to a text field (only indexed), it works fine.
Can you suggest why this happens? What exactly is the difference between text and string fields in Solr with respect to searches?
TextFields usually have a tokenizer and text analysis attached, meaning that the indexed content is broken into separate tokens and there is no need for an exact match - each word / token can be matched separately to decide whether the whole document should be included in the response.
StrFields cannot have any tokenization or analysis / filters applied, and will only give results for exact matches. If you need a StrField with analysis or filters applied, you can implement this using a TextField and a KeywordTokenizer, as sketched below.
text_general is a general text field that has reasonable, generic cross-language defaults: it tokenizes with StandardTokenizer, removes stop words from the case-insensitive "stopwords.txt" (empty by default), and lower-cases. At query time only, it also applies synonyms.
The StrField type is not analyzed, but indexed/stored verbatim.
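A minimal sketch of the TextField + KeywordTokenizer approach mentioned above, assuming you also want case-insensitive matching; the type name "string_ci" is a placeholder:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- keep the whole value as a single token, then lower-case it -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
This behaves like a string field for matching purposes (one token per value), but still lets you attach filters such as lower-casing.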

Solrnet facet returning spaces

I'm using Solrnet to return search results and am also requesting the facets, in particular categories which is a multi-valued field.
The problem I'm coming up against is that the category "house products" is being returned as two separate facets because of the space.
Is there a way of ensuring this is returned as a single facet value, or should I be escaping the value when it is added to the index?
Thanks in advance
Al
If tokens are generated for "house products", then you are using text analysis for the field.
Text fields are not recommended for faceting.
You won't get the desired behavior, as text fields are tokenized and filtered, leading to the generation of multiple tokens, which is what you see in the facets returned in the response.
Use a copy field to copy the field to a string field so you can facet on it without splitting the words.
From the SolrFacetingOverview:
Because faceting fields are often specified to serve two purposes, human-readable text and drill-down query value, they are frequently indexed differently from fields used for searching and sorting:
They are often not tokenized into separate words
They are often not mapped into lower case
Human-readable punctuation is often not removed (other than double-quotes)
There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
Try to use string fields and they will be good enough without any overhead.
Faceting works on tokens, so if you have a field that is tokenized into many words it will split the facet too.
I suggest you create another field of type string used only for faceting.
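A minimal sketch of that copyField setup, assuming the searchable field is called "category"; the field names are placeholders:
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="category_facet" type="string" indexed="true" stored="false"/>
<copyField source="category" dest="category_facet"/>
Search against category, but facet on category_facet; "house products" then comes back as a single facet value.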

Solr copyField mixed with RegexTransformer

Scenario:
In the database I have a field called Categories which is of type string and contains a number of digits, pipe-delimited, such as 1|8|90|130|
What I want:
In Solr index, I want to have 2 fields:
Field Categories_pipe, which would contain the exact string as in the DB, i.e. 1|8|90|130|
Field Categories, which would be a multi-valued field of type INT containing the values 1, 8, 90 and 130
For the latter, in the entity specification I can use a RegexTransformer and then specify the following field in data-config.xml:
<field column="Categories" name="Navigation" splitBy="\|"/> and then specify the field as multi-valued in schema.xml
What I do not know is how I can 'copy' the same field twice and perform the regex splitting on only one of them. I know there is the copyField facility that can be defined in schema.xml, however I can't find a way to transform the copied field because, from what I know (and I may be wrong here), transformers are only available in the entity specification.
As a workaround I could also send the same field twice from the entity query, but in reality the field Categories is a computed field (nested selects) which is somewhat expensive, so I would like to avoid that.
Any help is appreciated, thanks.
Instead of splitting it in data-config.xml, you could do that in your schema.xml. Here is what you could do:
Create a fieldType with the tokenizer PatternTokenizerFactory that uses a regex to split on |.
FieldSplit: create a multivalued field using this new fieldType; it will eventually hold 1, 8, 90, 130.
FieldOriginal: create a string field (if you need no analysis on it) that preserves the original value 1|8|90|130|.
Now you can use copyField to copy between FieldSplit and FieldOriginal based on your need, as sketched below.
Check this Question, it is similar.
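A minimal schema.xml sketch of this setup; the field and type names follow the description above, everything else is an assumption:
<fieldType name="pipe_delimited" class="solr.TextField">
  <analyzer>
    <!-- split the raw value on every pipe character -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\|"/>
  </analyzer>
</fieldType>
<field name="FieldOriginal" type="string" indexed="true" stored="true"/>
<field name="FieldSplit" type="pipe_delimited" indexed="true" stored="false" multiValued="true"/>
<copyField source="FieldOriginal" dest="FieldSplit"/>
Index the raw 1|8|90|130| value into FieldOriginal once; copyField feeds the same raw value into FieldSplit, whose tokenizer breaks it into 1, 8, 90 and 130.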
You can also create two columns from the same data and treat them separately:
SELECT categories, categories AS categories_pipe FROM category_table
Then you can split the "categories" column, but index the other one as-is.
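A sketch of the corresponding data-config.xml entity, assuming the RegexTransformer is enabled; the entity name is a placeholder and the query is the one above:
<entity name="category" transformer="RegexTransformer"
        query="SELECT categories, categories AS categories_pipe FROM category_table">
  <!-- split only this column into multiple values -->
  <field column="categories" name="Categories" splitBy="\|"/>
  <!-- index the other column verbatim -->
  <field column="categories_pipe" name="Categories_pipe"/>
</entity>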
