Solr Distinct on multi-valued field

I have Solr documents with a multi-valued field, and I need the distinct values from it. I have to filter by a different field, but my result doesn't have to include anything other than the distinct categories.
Documents:
{CountryCode: 'US', Product:'A', Categories:[1,2,3]},
{CountryCode: 'US', Product:'B', Categories:[1,3,77,88]},
{CountryCode: 'JP', Product:'B', Categories:[1,2]},
{CountryCode: 'JP', Product:'B', Categories:[444,555]}
Filter for only CountryCode = 'US'
Result:
{[1,2,3,77,88]}
I tried field collapsing/grouping, but it doesn't work on multi-valued fields.
I tried terms (thanks to a suggestion by Persimmonium), but it can't be restricted to only the 'US' categories. The fact that terms returns how many times a category occurs is a bonus, but not required in this case.
Any suggestions?

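Edited after your comment.
One way to achieve this is:
use an fq to restrict the result to the set of docs you are interested in (here, CountryCode:US)
then facet on Categories, setting facet.limit high enough (or to -1, which means no limit) to get all values, and facet.mincount=1 so that values occurring only in filtered-out docs are dropped
A fancier way might be to use Streaming Expressions, but faceting is simpler.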
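The fq plus facet approach can be sketched in Python roughly as follows. This is only an illustration: it builds the request parameters and reads the distinct values out of a hand-written response (not real Solr output), assuming the default JSON response format where a field facet comes back as a flat [value, count, value, count, ...] list.

```python
from urllib.parse import urlencode

# Request parameters: filter with fq, return no documents, facet on Categories.
params = {
    "q": "*:*",
    "fq": "CountryCode:US",
    "rows": 0,
    "facet": "true",
    "facet.field": "Categories",
    "facet.limit": -1,    # -1 removes the cap on the number of facet values
    "facet.mincount": 1,  # drop values that only occur in filtered-out docs
}
query_string = urlencode(params)


def distinct_values(response, field):
    """Return the values from Solr's flat [value, count, ...] facet list."""
    flat = response["facet_counts"]["facet_fields"][field]
    return flat[0::2]


# Hand-written response for the two US documents in the question.
sample = {"facet_counts": {"facet_fields": {
    "Categories": ["1", 2, "3", 2, "2", 1, "77", 1, "88", 1]}}}

values = sorted(distinct_values(sample, "Categories"), key=int)
print(query_string)
print(values)
```

The counts in the facet list are simply ignored here, since only the distinct values are needed.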

Related

MongoDB aggregation pipeline, $gte, $lte?

I have documents for software that contain the fields _id, category, brand, etc. There is a price field of type string. Some documents have invalid or null prices. I want to use an aggregation pipeline so that the price is >=4 and <=8, and convert the price to double. There is also a date field that I want to be >=10. I also want to use $out to write the result to a new collection. This is what I have so far. Could someone let me know how I can retrieve the documents without losing or changing the other fields, only the price and date?
db.sw.aggregate([{$match: {}},
{$project: {priceLen: {"$strLenCP": "$price"}}},
{"$match": {priceLen: {"$gte": 4, "$lte": 8}}},
{$project: {price: {$trim: {input: "$price", chars: "$"}}}},
{$project: {price: {$toDouble: "$price"}}}])
My thought process for the $match was to retrieve all the fields. Any help would be really appreciated.
No idea what your requirements are in terms of being "correct".
$project removes all fields (apart from _id) and outputs only the fields you specify. If you want to keep the existing fields, use $addFields or its alias $set; both add or overwrite the named fields and pass every other field through unchanged.
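As a rough sketch of that advice, here is the question's pipeline rewritten with $addFields, expressed as Python dicts in the form pymongo accepts. The target collection name sw_clean and the helper field priceLen are assumptions for illustration, the date condition is left out because the question does not say what ">=10" refers to, and wrapping "$" in $literal (so it is not parsed as a field path) and using $convert with onError are my own choices, not something from the original pipeline.

```python
import json

# Hypothetical names: "sw_clean" (output collection) and "priceLen" (helper field).
pipeline = [
    # Keep only documents whose price is a string, so $strLenCP never sees null.
    {"$match": {"price": {"$type": "string"}}},
    # $addFields adds priceLen and leaves every other field in place.
    {"$addFields": {"priceLen": {"$strLenCP": "$price"}}},
    {"$match": {"priceLen": {"$gte": 4, "$lte": 8}}},
    # Overwrite price only: strip "$" characters, then convert to double.
    # onError/onNull produce null for prices that still cannot be parsed.
    {"$addFields": {"price": {"$convert": {
        "input": {"$trim": {"input": "$price", "chars": {"$literal": "$"}}},
        "to": "double",
        "onError": None,
        "onNull": None,
    }}}},
    # An exclusion-only $project drops the helper and keeps all other fields.
    {"$project": {"priceLen": 0}},
    # Write the result to a new collection.
    {"$out": "sw_clean"},
]

print(json.dumps(pipeline, indent=2))
# With pymongo this would be run as: db.sw.aggregate(pipeline)
```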

SOLR - group by field and then get distinct value by another field

I'm using Apache Solr for searching records. In my case I have a table with the columns category, sub-category, etc.
I want to group by category and then get the distinct list of sub-categories from the grouped results. Is that possible in Apache Solr?
If yes, please help me solve this.
Thanks in advance.
You can do that with a pivot facet:
facet=on&facet.pivot=category,subcategory
This will give you a facet with all the sub categories for each category.
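To show what comes back, here is a small Python sketch that builds the pivot request and flattens the nested pivot response into a category to sub-categories mapping. The sample response is hand-written with made-up values (books, music, ...), following the usual facet_pivot layout of field/value/count/pivot entries.

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "rows": 0,  # only the facet is needed, not the documents
    "facet": "on",
    "facet.pivot": "category,subcategory",
}
query_string = urlencode(params)


def subcategories_by_category(response):
    """Map each category value to the list of its sub-category values."""
    pivots = response["facet_counts"]["facet_pivot"]["category,subcategory"]
    return {
        top["value"]: [child["value"] for child in top.get("pivot", [])]
        for top in pivots
    }


# Hand-written sample in the shape of a pivot facet response.
sample = {"facet_counts": {"facet_pivot": {"category,subcategory": [
    {"field": "category", "value": "books", "count": 3, "pivot": [
        {"field": "subcategory", "value": "fiction", "count": 2},
        {"field": "subcategory", "value": "poetry", "count": 1}]},
    {"field": "category", "value": "music", "count": 1, "pivot": [
        {"field": "subcategory", "value": "jazz", "count": 1}]},
]}}}

grouped = subcategories_by_category(sample)
print(grouped)
```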
You can also use the JSON Facet API. Example adapted from that page:
top_categories:{
  type: terms,
  field: category,
  limit: 5,
  facet:{
    top_subcategories:{
      type: terms,
      field: subcategory,
      limit: 20
    }
  }
}
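A minimal Python sketch of sending that nested facet and reading the result, assuming the facet is passed as a JSON string in the json.facet request parameter and that the response nests buckets under facets. The sample response is hand-written with made-up values.

```python
import json
from urllib.parse import urlencode

# Same nested terms facet as above, written as a Python dict.
facet = {
    "top_categories": {
        "type": "terms",
        "field": "category",
        "limit": 5,
        "facet": {
            "top_subcategories": {
                "type": "terms",
                "field": "subcategory",
                "limit": 20,
            }
        },
    }
}
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)}
query_string = urlencode(params)


def buckets_to_dict(response):
    """Map each category bucket to the values of its nested sub-buckets."""
    result = {}
    for cat in response["facets"]["top_categories"]["buckets"]:
        subs = cat.get("top_subcategories", {}).get("buckets", [])
        result[cat["val"]] = [sub["val"] for sub in subs]
    return result


# Hand-written sample in the shape of a JSON Facet API response.
sample = {"facets": {"count": 4, "top_categories": {"buckets": [
    {"val": "books", "count": 3, "top_subcategories": {"buckets": [
        {"val": "fiction", "count": 2}, {"val": "poetry", "count": 1}]}},
    {"val": "music", "count": 1, "top_subcategories": {"buckets": [
        {"val": "jazz", "count": 1}]}},
]}}}

nested = buckets_to_dict(sample)
print(nested)
```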

Solr - Nested Edismax Query

I am using Solr (with pySolr) to search products in my database, returning products, facets and facet.pivots:
result = solr.search(query_s, **{
'rows': '24',
'sort': formatted_sort,
'facet': 'on',
'facet.limit': '-1',
'facet.mincount': '1',
'facet.field': ['gender', 'material'],
'facet.pivot': 'brand,series',
'fq': '-in_stock:(0 OR 99 OR 100 OR 101)'
})
The query_s selects specific fields, for example: brand:Target AND gender:Men's.
I would like to combine the above query with a DisMax query which will allow me to combine the above query with a full text search over specified fields. I found an article which demonstrates nested queries. I have tried to implement something like this:
q: "gender:* AND _query_:"{!edismax qf=brand series}Summer""
For some reason 'Target' will return results for Target brand shirts, but only with correct capitalization. 'Summer' which is a series of Target, won't return any results. Why am I not seeing a list of docs ordered by relevancy?
Am I overcomplicating things by using Dismax altogether?
The dismax parsers are useful for making sense of more "natural" queries, i.e. queries where users just type what they're looking for, which is how most search engines work.
In your case it sounds like brand:Target AND gender:Men's are filters for which documents should be shown, while the query is the part that the user has typed. Usually you'll want to have the filters in fq as they don't affect score (i.e. they're exact values matching a field value), and the query in q.
I assume that Summer is what the user would have typed into your search box, which would give you:
q=Summer&defType=edismax&qf=series
But this assumes that the series field is defined as a text field that has an analyzer attached, so that the values are lowercased and split appropriately.
If you also have a description field you'd like to search, you can do:
q=Summer&defType=edismax&qf=series^20 description
This searches for Summer in both the series and description fields, but gives 20 times more weight to a hit in the series field. This is a good way to naturally boost documents that match more exact data in your documents. If you also include the brand field in qf, your users can search for "target summer" and similar queries.
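Put together, the request might look like the sketch below: the user's text goes in q with edismax, and the fixed selections go in fq. The quoting of the gender value and the choice of qf fields are assumptions for illustration; the in_stock filter and rows value are taken from the question.

```python
from urllib.parse import urlencode

user_input = "Summer"  # what the user typed into the search box

params = {
    "q": user_input,
    "defType": "edismax",
    "qf": "series^20 description",  # series hits weigh 20x more
    # Filters do not affect scoring; each fq is applied separately.
    "fq": [
        "brand:Target",
        'gender:"Men\'s"',
        "-in_stock:(0 OR 99 OR 100 OR 101)",
    ],
    "rows": 24,
}
# doseq=True emits one fq=... pair per list entry.
query_string = urlencode(params, doseq=True)
print(query_string)
```

With pySolr the same parameters can be passed as keyword arguments, as in the question: solr.search(user_input, **{k: v for k, v in params.items() if k != "q"}).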

dynamic fields as facet in solr

I am trying to develop a filter system using dynamic fields in solr. These dynamic fields may vary from product to product and have a prefix attribute_filter_ to help me recognize the filter field. So given a search query, I want to get faceted results based on these dynamic fields.
For example, I have 3 products as docs in solr
{ID:1, attribute_filter_color:"white", attribute_filter_brand:"Dell"}
{ID:2, attribute_filter_color:"red", attribute_filter_category:"electronics"}
{ID:3, attribute_filter_size:"mobiles", attribute_filter_brand:"samsung"}
When my search query matches doc 1 and doc 2, I want only the filters color, brand and category, so the facet fields are attribute_filter_color, attribute_filter_brand and attribute_filter_category.
When my search query matches doc 2 and doc 3, I want the filters color, size, category and brand, so the facet fields are attribute_filter_color, attribute_filter_size, attribute_filter_category and attribute_filter_brand.
When my search query matches doc 1 and doc 3, I want the filters color, brand and size, so the facet fields are attribute_filter_color, attribute_filter_brand and attribute_filter_size.
Also, there can be ~300 of these filters in total over 10^5 products. This creates another problem: a GET URL with 300 facet fields might exceed the URL length limit.
This jira ticket shows how regex could have helped in this situation.
My solution would be to index the field names into an additional field, so that each document also has, for example, "facet_fields": ["attribute_filter_color","attribute_filter_brand"] listing the dynamic fields it contains.
First generate a facet on that field across your document result set, then use the result in a second query to generate facets across only the fields you're interested in. It will be an extra query, but it should scale decently. The expensive part will be the larger number of different fields you're faceting on anyway; the facet on the facet_fields field will be quick to calculate and return.
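The two-step flow could look roughly like this in Python. The first request facets on the facet_fields field; the second uses the returned names as its facet.field list. The query text "laptop" and the sample response are made up for illustration (the sample corresponds to a result set of doc 1 and doc 2).

```python
from urllib.parse import urlencode


def first_query_params(q):
    """Step 1: ask only for the names of dynamic fields present in the results."""
    return {"q": q, "rows": 0, "facet": "true", "facet.field": "facet_fields",
            "facet.limit": -1, "facet.mincount": 1}


def field_names(response):
    """Read the field names out of the flat [value, count, ...] facet list."""
    flat = response["facet_counts"]["facet_fields"]["facet_fields"]
    return [name for name, count in zip(flat[0::2], flat[1::2]) if count > 0]


def second_query_params(q, names):
    """Step 2: facet only on the fields that occur in the result set."""
    return {"q": q, "rows": 10, "facet": "true", "facet.field": names,
            "facet.mincount": 1}


# Hand-written step 1 response for a query matching doc 1 and doc 2.
sample = {"facet_counts": {"facet_fields": {"facet_fields": [
    "attribute_filter_color", 2,
    "attribute_filter_brand", 1,
    "attribute_filter_category", 1]}}}

step1 = urlencode(first_query_params("laptop"))
names = field_names(sample)
step2 = urlencode(second_query_params("laptop", names), doseq=True)
print(names)
print(step2)
```

Because the second request only names the fields that actually occur in the result set, it usually carries far fewer than the ~300 possible facet.field parameters.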

Solr Stats per Group

I've been experimenting with Solr (5.2.1) groups and stats and I am getting nowhere. I have a bunch of documents grouped by a key. I am returning the groups in my results and I want to return the minimum value of a field for each group. Note that I ONLY need it for the groups being returned in the search query.
I am able to get the stats component working, however it just returns the results for all groups; like regular facets.
Here is the query:
facet=true&stats=true&stats.field={!tag=t1}pr&facet.pivot={!stats=t1}groupid
I also tried to use stats.facet component without any luck. Am I missing something here or is this not in Solr?
For example, suppose you have the following fields and documents:
id, name, category, score
11,name1,A,1
22,name2,A,2
33,name3,B,1
44,name4,B,2
55,name5,B,3
Then you can group by category and, inside each group, get stats on the score field:
q=*%3A*&fl=count&wt=json&indent=true&facet=true&stats=true&stats.field={!tag=t1}score&facet.pivot={!stats=t1}category
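The results would look like this (note that this output uses the field names sentiment_cat and sentiment_score and the values FIRST and SECOND where the example above uses category, score, A and B):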
"facet_counts":{
  "facet_queries":{},
  "facet_fields":{},
  "facet_dates":{},
  "facet_ranges":{},
  "facet_intervals":{},
  "facet_heatmaps":{},
  "facet_pivot":{
    "sentiment_cat":[{
        "field":"sentiment_cat",
        "value":"SECOND",
        "count":3,
        "stats":{
          "stats_fields":{
            "sentiment_score":{
              "min":1.0,
              "max":3.0,
              "count":3,
              "missing":0,
              "sum":6.0,
              "sumOfSquares":14.0,
              "mean":2.0,
              "stddev":1.0}}}},
      {
        "field":"sentiment_cat",
        "value":"FIRST",
        "count":2,
        "stats":{
          "stats_fields":{
            "sentiment_score":{
              "min":1.0,
              "max":2.0,
              "count":2,
              "missing":0,
              "sum":3.0,
              "sumOfSquares":5.0,
              "mean":1.5,
              "stddev":0.7071067811865476}}}}]}}
As you can see, min, max, and sum are computed on the score field for each group. This is what combining pivot facets with stats gives you. Let me know if you need something different from the above.
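Since the pivot covers every group, one way to keep only the minimum for the groups actually returned is to filter on the client. A small Python sketch, using a trimmed copy of the output above as sample data; the function name and the wanted parameter are just for illustration.

```python
def min_per_group(response, pivot_field, stats_field, wanted=None):
    """Return {group value: min of stats_field}, optionally only for wanted groups."""
    result = {}
    for bucket in response["facet_counts"]["facet_pivot"][pivot_field]:
        if wanted is not None and bucket["value"] not in wanted:
            continue
        stats = bucket["stats"]["stats_fields"][stats_field]
        result[bucket["value"]] = stats["min"]
    return result


# Trimmed copy of the response shown above.
sample = {"facet_counts": {"facet_pivot": {"sentiment_cat": [
    {"field": "sentiment_cat", "value": "SECOND", "count": 3,
     "stats": {"stats_fields": {"sentiment_score": {"min": 1.0, "max": 3.0}}}},
    {"field": "sentiment_cat", "value": "FIRST", "count": 2,
     "stats": {"stats_fields": {"sentiment_score": {"min": 1.0, "max": 2.0}}}},
]}}}

all_mins = min_per_group(sample, "sentiment_cat", "sentiment_score")
only_first = min_per_group(sample, "sentiment_cat", "sentiment_score",
                           wanted={"FIRST"})
print(all_mins)
print(only_first)
```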
