solr facet field values appear to be generated by solr - solr

I want to facet on a specific field. The field is defined as
<field name="specials_de" type="textgen" indexed="true" stored="true" multiValued="true"/>
Two of the values in specials_de are "Cunard Hamburg Specials" and "Cunard the New Yorker". I want to use these two values as facets, but the solr query returns facet fields with values like
<int name="cunard">11</int>
<int name="new">9</int>
<int name="yorker">9</int>
<int name="hamburg">5</int>
<int name="hamburgspecialscunard">3</int>
<int name="hamburgspecials">2</int>
What am I doing wrong?
Just to clarify: I'm not referring to the counts (11, 9, etc.), but to the names, i.e. "cunard", "new", etc.

Text fields are not suggested to be used for Faceting.
You won't get the desired behavior as the text fields would be tokenized and filtered leading to the generation of multiple tokens which you see from the facets returned as response.
SolrFacetingOverview :-
Because faceting fields are often specified to serve two purposes,
human-readable text and drill-down query value, they are frequently
indexed differently from fields used for searching and sorting:
They are often not tokenized into separate words
They are often not mapped into lower case
Human-readable punctuation is often not removed (other than double-quotes)
There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for
value retrieval.
Try to use String fields and it would be good enough without any overheads.

Related

Solr supporting varying facets for different types of products

I am using Solr for indexing different types of products. The product types (category) have different facets. For example:
camera: megapixel, type (slr/..), body construction, ..
processors: no. of cores, socket type, speed, type (core i5/7)
food: type, origin, shelf-life, weight
tea: type (black/green/white/..), origin, weight, use milk?
serveware: type, material, color, weight
...
And they have common facets as well:
Brand, Price, Discount, Listing timeframe (like new), Availability, Category
I need to display the relevant facets and breadcrumbs when user clicks on any category, or brand page or a global search across all types of products. This is same as what we see on several ecommerce sites.
The query that I have is:
Since the facet types are more or less unique across different types of products, do I put them in separate schemas? Is that the way to do it? The fundamental worry is that those fields will not have any data for other types of products. And are there any implementation principles here that makes retrieving the respective faces for a given product type easier?
I would like to have a design that is scalable to accommodate lots of items in each product type as we go forward, as well as that is easy to use and performance oriented, if possible. Right now I am having a single instance of Solr.
The only risk of underpopulated facets are when they misrepresent the search. I'm sure you've used a search site where the metadata you want to facet on is underpopulated so that when you apply the facet you also eliminate from your result set a number of records that should have been included. The thing to watch is that the facet values are populated consistently where they are appropriate. That means that your "tea" records don't need to have a number of cores listed, and it won't impact anything, but all of your "processor" records should, and (to whatever extent possible) they should be populated consistently. This means that if one processor lists its number of cores as "4", and another says "quadcore", these are two different values and a user applying either facet value will eliminate the other processor from their result. If a third quadcore processor is entirely missing the "number of cores" stat from the no_cores facet field (field name is arbitrary), then your facet could be become counterproductive.
So, we can throw all of these records into the same Solr, and as long as the facets are populated consistently where appropriate, it's not really necessary that they be populated for all records, especially when not applicable.
Applying facets dynamically
Most of what you need to know is in the faceting documentation of Solr. The important thing is to specify the appropriate arguments in your query to tell Solr which facets you want to use. (Until you actually facet on a field, it's not a facet but just a field that's both stored="true" and indexed="true".) For a very dynamic effect, you can specify all of these arguments as part of the query to Solr.
&facet=true
This may seem obvious, but you need to turn on faceting. This argument is convenient because it also allows you to turn off faceting with facet=false even if there are lots of other arguments in your query detailing how to facet. None of it does anything if faceting is off.
&facet.field=no_cores
You can include this field over and over again for as many fields as you're interested in faceting on.
&facet.limit=7
&f.no_cores.facet.limit=4
The first line here limits the number of values for returned by Solr for each facet field to 7. The 7 most frequent values for the facet (within the search results) will be returned, with their record counts. The second line overrides this limit for the no_cores field specifically.
&facet.sort=count
You can either list the facet field's values in order by how many appear in how many records (count), or in index order (index). Index order generally means alphabetically, but depends on how the field is indexed. This field is used together with facet.limit, so if the number of facet values returned is limited by facet.limit they will either be the most numerous values in the result set or the earliest in the index, depending on how this value is set.
&facet.mincount=1
There are very few circumstances that you will want to see facet values that appear zero times in your search results, and this can fix the problem if it pops up.
The end result is a very long query:
http://localhost/solr/collecion1/search?facet=true&facet.field=no_cores&
facet.field=socket_type&facet.field=processor_type&facet.field=speed&
facet.limit=7&f.no_cores.facet.limit=4&facet.mincount=1&defType=dismax&
qf=name,+manufacturer,+no_cores,+description&
fl=id,name,no_cores,description,price,shipment_mode&q="Intel"
This is definitely effective, and allows for the greatest amount of on-the-fly decision-making about how the search should work, but isn't very readable for debugging.
Applying facets less dynamically
So these features allow you to specify which fields you want to facet on, and do it dynamically. But, it can lead to a lot of very long and complex queries, especially if you have a lot of facets you use in each of several different search modes.
One option is to formalize each set of commonly used options in a request handler within your solrconfig.xml. This way, you apply the exact same arguments but instead of listing all of the arguments in each query, you just specify which request handler you want.
<requestHandler name="/processors" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="fl">id,name,no_cores,description,price,shipment_mode</str>
<str name="qf">name, manufacturer, no_cores, description</str>
<str name="sort">score desc</str>
<str name="rows">30</str>
<str name="wt">xml</str>
<str name="q.alt">*</str>
<str name="facet.mincount">1</str>
<str name="facet.field">no_cores</str>
<str name="facet.field">socket_type</str>
<str name="facet.field">processor_type</str>
<str name="facet.field">speed</str>
<str name="facet.limit">10</str>
<str name="facet.sort">count</str>
</lst>
<lst name="appends">
<str name="fq">category:processor</str>
</lst>
</requestHandler>
If you set up a request hander in solrconfig.xml, all it does is serve as a shorthand for a set of query arguments. You can have as many request handlers as you want for a single solr index, and you can alter them without rebuilding the index (reload the Solr core or restart the server application (JBoss or Tomcat, e.g.), to put changes into effect).
There are a number of things going on with this request handler that I didn't get into, but it's all just a way of representing default Solr request arguments so that your live queries can be simpler. This way, you might make a query like:
http://localhost/solr/collection1/processors?q="Intel"
to return a result set with all of your processor-specific facets populated, and filtered so that only processor records are returned. (This is the category:processor filter, which assumes a field called category where all the processor records have a value processor. This is entirely optional and up to you.) You will probably want to retain the default search request handler that doesn't filter by record category, and which may not choose to apply any of the available (stored="true" and indexed="true") fields as active facets.

solr - string search field multiple value, all word must match

Currently i have sample data like this :
<doc>
<int name="name">Nice Dress</int>
<arr name="keyword">
<str>best cocktail dress</str>
<str>platform complete pumps</str>
<str>platform pumps</str>
<str>slip dress</str>
</arr>
I used multiple value for "keyword" field.
case 1
defType:edismax
qf:keyword
q:cocktail dress
solr will return the data.
case 2
defType:edismax
qf:keyword
q:coctail dress pump
it still return the data, If we see from the sample data, no keyword contain all this 3 word ('coctail' 'dress' 'pump') in one row of each keyword.
How to make solr not to return this result?
Thanks.
Check for two parameters
positionIncrementGap - For multivalued fields this parameter would decide what is it distance between the two fields in the multivalued fields. If this value is 100 so the distance between the two multivalued fields would be 100 positions.
Note - The default positionIncrementGap is 0
Check for the qs query slop parameter for dismax which will will decide the slop match between the terms.
Try this query:
q:(coctail dress pump)~100
with your positionIncrementGap set to something like 300.
Those values will need to change depending on how long are your data.

Searching multi-valued fields in the same position

Let's say we have a document like this:
<arr name="pvt_rate_type">
<str>CORPORATE</str>
<str>AGENCY</str>
</arr>
<arr name="pvt_rate_set_id">
<str>1</str>
<str>2</str>
</arr>
Now I do a search where I want to return the document only if it contains pvt_rate_set_id = 1 AND pvt_rate_type = AGENCY in the same position in their mutli-valued fields so the above document should NOT be returned (because pvt_rate_set_id 1 has a pvt_rate_type of CORPORATE)
Is this possible at all in SOLR ? or is my schema badly designed ? how else would you design tat schema to allow for the searching I want?
This may not be available Out of the Box.
You would need to modify the schema to have fields with pvt rate type as field name and id as its value
e.g.
CORPORATE=1
AGENCY=2
This can be achieved by having dynamic fields defined.
e.g.
<dynamicField name="*_pvt_rate_type" type="string" indexed="true" stored="true"/>
So you can input data as corporate_pvt_rate_type or agency_pvt_rate_type with the respective values.
The filter queries will be able to match the exact mappings fq=corporate_pvt_rate_type:1
Unfortunately Solr does not seem to support this.
Another way to do this in Solr would be to store a concatenated string field type_and_id with a delimiter (say comma) separating the type and the id and query like:
q=type_and_id:AGENCY%2C1
(where %2C is the URL encoding for comma).

Solr query must match all words/tokens in a field

I have a text-field called name in my schema.xml. A query q=name:(organic) returns the following documents:
<doc>
<str name="id">ontology.category.1483</str>
<str name="name">Organic Products</str>
</doc>
<doc>
<str name="id">ontology.keyword.4896</str>
<str name="name">Organic Stores</str>
</doc>
This is perfectly right in a normal Solr Search, however I would like to construct the query so that it doesn't return anything because 'organic' only matches 1 of the 2 words available in the field.
A better way to say it could be this: Only return results if all tokens in the field are matched. So if there are two words (tokens) in a field and I only match 1 ('organic', 'organics','organ' etc.) I shouldn't get a match because only 50% of the field has been searched on.
Is this possible in Solr? How do I construct the query?
you are probably using StandardTokenizerFactory (or something similar), one solution is to use KeywordTokenizerFactory and issue a phrase query and then only perfect matches will work. Of course remember other filters you might want to use (like LowerCaseFilterFactory etc). Note that: "stores organic" will not match your doc either
Due to time contraints, I had to resort to the following (hacky) solution.
I added the term count to the index via a DynamicField field called tc_i.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
Now at query time I count the terms and append it to the query, so q=name:(organic) becomes q=name:(organic) AND tc_i:(1) and this won't return documents for "organic stores" / "organic products" obviously because their tc_i fields are set at 2 (two words).

How would I search for blank facets in a multi valued facet field and at the same time in Solr?

I have an application where users can pick car parts. They pick their vehicle and then pick vehicle attributes as facets. After they select their vehicle, they can pick facets like engine size, for example, to narrow down the list of results. The problem was, not all documents have an engine size (it's an empty value in Solr), as it doesn't matter for all parts. For example, an engine size rarely matters for an air filter. So even if a user picked 3.5L for their engine size, I still wanted to show the air filters on the screen as a possible part the user could pick.
I did some searching and the following facet query works perfectly:
enginesize:"3.5" OR enginesize:(*:* AND -enginesize:[* TO *])
This query would match either 3.5 or would match records where there was no value for the engine size field (no value meant it didn't matter, and it fit the car). Perfect...
THE PROBLEM: I recently made the vehicle attribute fields multivalued fields, so I could store attributes for each part as a list. I then applied faceting to it, and it worked fine. However, the problem came up when I applied the query previously mentioned above. While selecting the enginesize facet narrowed down the number of documents displayed to only documents that have that engine size, records (I also use the word record to mean document) that had empty values (i.e. "") for enginesize were not appearing. The same query above does not work for multivalued facets the same way it did when enginesize was a single valued field.
Example:
<doc>
<str name="part">engine mount</str>
<arr name="enginesize">
<str/>
<str/>
<str>3.5</str>
<str>3.5</str>
<str>3.5</str>
<str>3.5</str>
<str>3.5</str>
</arr>
<doc>
<doc>
<str name="part">engine bolt</str>
<arr name="enginesize">
<str>6</str>
<str>6</str>
<str>6</str>
<str>6</str>
<str>6</str>
</arr>
<doc>
<doc>
<str name="part">air filter</str>
<arr name="enginesize">
<str/>
<str/>
<str></str>
<str></str>
<str></str>
<str></str>
<str></str>
</arr>
<doc>
What I am looking for is a query that will pull back documents 1 and 3 above when I do a facet search for the engine size for 3.5. The first document (the engine mount) matches, because it contains the value in one of the multivalued fields "enginesize" that I am looking for (contains 3.5 in one of the fields). However, the third document for the air filter doesn't get returned because of the empty <str> values. I do not want to return the second document at all because it doesn't match the facet value
I basically want a query that will match empty string values for a given facet and also match the actual value, so I get both documents returned.
Does someone have a query that would return document 1 and document 3 (the engine bracket and the air filter), but not the engine bolt document?
I tried the following without success (including the one at the very top of this question):
// returns everything
enginesize:"3.5" OR (enginesize:[* TO *] )
// only returns document 1
enginesize:"3.5" OR (enginesize:["" TO ""] AND -enginesize:"3.5")
// only returns document 1
enginesize:"3.5" OR (enginesize:"")
I imported the data above using a CSV file, I set the field keepEmpty=true. I tried instead manually inserting a space into the field when I generated the CSV file (which would give you <str> </str>, instead of the previous , and then retried the queries. Doing that, I got the following results:
// returns document 1
enginesize:"3.5" OR enginesize:(*:* AND -enginesize:[* TO *])
// returns all documents
enginesize:"3.5" OR (enginesize:["" TO ""] AND -enginesize:"3.5")
// returns all documents
enginesize:"3.5" OR (enginesize:"")
Does anyone have a query that would work for either situation, whether I have a space as the blank value or simply no value at all?
How about changing how you index, instead of how you query?
Instead of trying to index "engine size doesn't matter" as an empty record, index it as "ANY".
Then your query simply becomes enginesize:"3.5" OR (enginesize:ANY)
i've just been playing with this and found a hint that seems to do the trick for me. translated to your query it should be:
enginesize:"3.5" OR (-enginesize:["" TO *])
hth,
andi
update: after some more testing i don't think this works reliably — for some indexes it had to be the other way round and without the minus sign, i.e. enginesize:[* TO ""]. this might depend on the index type, if it's multi-valued or even on the actual values.
in any case it seems too much of a hack. i'll probably resolve to substituting the empty value with a special marker...
I had the same problem, but solved it in https://stackoverflow.com/a/35633038/13365:
enginesize:"3.5" OR (*:* NOT enginesize:["" TO *])
The -enginesize solution didn't work for me.

Resources