I have a Solr storage with a huge number of documents. Here's an example of my document structure:
{
"country":"USA",
"company":"Corsair",
"product":"RM650X 650W",
"price":"140",
"on_stock":"yes"
}
I'd like to make a facet request to Solr data to receive a certain number of rows (e.g. 200).
Here's a desired result:
The problem is I can't limit the data properly.
In Solr documentation it says that "facet.limit parameter specifies the maximum number of constraint counts (essentially, the number of facets for a field that are returned) that should be returned for the facet fields. This parameter can be specified on a per-field basis to apply a distinct limit to each field with the syntax of f.<fieldname>.facet.limit "
And here comes the tricky part.
I tried to use a limit of 200 for the first column (Country / Region). Here's my request:
country:{
  type: terms,
  field: country,
  limit: 200, # Limit's here
  facet:{
    company:{
      type: terms,
      field: company,
      limit: -1,
      facet:{
        product:{
          type: terms,
          field: product,
          limit: -1
        }
      }
    }
  }
}
This query returns 200 results for a country facet, but since every country has a different number of nested companies and every company has a different number of nested products, I get thousands of rows of data.
Then I tried to use a limit of 200 for the last column (Product). Here's my request:
country:{
  type: terms,
  field: country,
  limit: -1,
  facet:{
    company:{
      type: terms,
      field: company,
      limit: -1,
      facet:{
        product:{
          type: terms,
          field: product,
          limit: 200 # Limit's here
        }
      }
    }
  }
}
This query returns up to 200 products for every company within every country. In other words, the limit is local to each nested category, not global. And again I get thousands of rows of data.
Is it possible to achieve my goal in Solr?
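To make the "local, not global" behaviour concrete, here is a minimal Python sketch that flattens a nested terms-facet response into (country, company, product, count) rows and cuts the list off on the client. It is an illustration under assumptions, not a Solr-side solution: the response layout (val, count and buckets keys) follows the usual JSON Facet API output, and the company and product names other than Corsair and RM650X 650W are made-up sample data.

```python
import itertools
from typing import Any, Dict, Iterator, List, Tuple

Row = Tuple[str, str, str, int]


def flatten_rows(facets: Dict[str, Any]) -> Iterator[Row]:
    """Yield one (country, company, product, count) row per innermost bucket."""
    for country in facets.get("country", {}).get("buckets", []):
        for company in country.get("company", {}).get("buckets", []):
            for product in company.get("product", {}).get("buckets", []):
                yield (country["val"], company["val"],
                       product["val"], product["count"])


def first_rows(facets: Dict[str, Any], max_rows: int) -> List[Row]:
    """Apply a global row limit on the client, after Solr has answered."""
    return list(itertools.islice(flatten_rows(facets), max_rows))


# Hypothetical "facets" section of a response (sample data for illustration).
sample_facets = {
    "count": 4,
    "country": {"buckets": [
        {"val": "USA", "count": 3, "company": {"buckets": [
            {"val": "Corsair", "count": 2, "product": {"buckets": [
                {"val": "RM650X 650W", "count": 1},
                {"val": "Example PSU A", "count": 1},
            ]}},
            {"val": "Example Corp", "count": 1, "product": {"buckets": [
                {"val": "Example PSU B", "count": 1},
            ]}},
        ]}},
        {"val": "Germany", "count": 1, "company": {"buckets": [
            {"val": "Corsair", "count": 1, "product": {"buckets": [
                {"val": "RM650X 650W", "count": 1},
            ]}},
        ]}},
    ]},
}

if __name__ == "__main__":
    for row in first_rows(sample_facets, 3):
        print(row)
```

The row count is the sum over all innermost buckets, which is why a limit on any single level cannot bound it. Truncating on the client still makes Solr compute every bucket, so this only shows the shape of the problem.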
My use case is to make a query to Solr, and to extract counts of unique terms for certain fields within the result set. The trick is that within my counts, I need to limit the output to only terms that match a certain input string--without adjusting the main Solr query. E.g., "Solr, give me results for 'War and Peace', and give me the first ten facets on author where the author field has 'doge' in it, and give me a count of all unique author values in the result set where the author field has 'doge' in it."
The Solr JSON Facet API allows me to facet using stat functions; in this case, I'm interested in using the unique() function to get the counts I need. So, e.g.,
{
"author_count": "unique(author)"
}
...tells me the total number of unique values for 'author' in the result set. This is good.
I can limit the output of a facet using the domain change option, like so:
{
"author_facet": {
"type": "terms",
"field": "author",
"mincount": 1,
"limit": 10,
"offset": 0,
"domain": {
"filter": "author:doge"
}
}
}
This is also good.
The problem I'm having is that when I send both of these choices, the result of the unique() call (in author_count) is a count of all unique author values in the base result set, regardless of whether the author contains 'doge'. The author_facet results do correctly limit the output to only authors with 'doge' in them. But I need to also apply that limit to the results of the unique() function.
I cannot alter the base query, because it represents user input that is independent of the facet filtering input. E.g., the user will have searched for "War and Peace," and now wants to see only those facets where the author is 'doge', with a count of the total authors matching 'doge'.
If it is meaningful to the answer, I am running Solr 9.0.0.
Is there a way to apply domain filtering to Solr stat functions in the JSON Facet API, such as unique()?
EDIT: To clarify: The number of authors with 'doge' may be very large, and so would exceed the number of actual facets that should be returned. I'm limiting the facet response to 100, but there could be 978 authors with 'doge'. I want to inform the user of that 978 count while only returning the top 100.
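One direction that may be worth trying, sketched here in Python as a request builder: wrap the stat function in a facet of type query, so that unique() is computed only over documents matching the author filter, and ask the terms facet for numBuckets so it reports the total bucket count alongside the limited list. The facet type query and the numBuckets option are part of the JSON Facet API; the function name, the label unique_authors, and the assumption that author is single-valued are mine, and this has not been verified against Solr 9.0.0.

```python
import json
from typing import Any, Dict


def build_author_facets(term: str, limit: int = 100) -> Dict[str, Any]:
    """Build a json.facet body; `term` is inserted as-is (no escaping)."""
    author_filter = f"author:{term}"
    return {
        # Limited list of matching authors, plus the total number of buckets.
        "author_facet": {
            "type": "terms",
            "field": "author",
            "mincount": 1,
            "limit": limit,
            "offset": 0,
            "numBuckets": True,
            "domain": {"filter": author_filter},
        },
        # Query facet: the nested stat only sees docs matching the filter.
        "author_count": {
            "type": "query",
            "q": author_filter,
            "facet": {"unique_authors": "unique(author)"},
        },
    }


if __name__ == "__main__":
    # The JSON string can be sent as the json.facet request parameter.
    print(json.dumps(build_author_facets("doge"), indent=2))
```

If author is multi-valued, unique(author) inside the query facet would also count the other authors of the matching documents, so the number could be higher than the count of 'doge' authors.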
I'm using facet.field and have run into a problem. My store has base products and variants, and when I use facet.field the counts include both base products and variants:
Category:
Chairs(30) <- this is count of base products and variants
Tables(20) <- this is count of base products and variants
I want to restrict facet.field so that the facet returns counts of variants only. Every product has a field such as "productType":"baseProduct" or "productType":"variantProduct", and I want to use that field.
Any ideas on how I can express this in a query?
You can use facet.pivot to get distinct counts for each type:
&facet.pivot=productType,category
You can also use the JSON Facet API to do two separate facets:
{
base: {
type: terms,
field: category,
domain: { filter: "productType:baseProduct" }
},
variant: {
type: terms,
field: category,
domain: { filter : "productType:variantProduct" }
}
}
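As a rough illustration of the second option, the Python sketch below builds the same two-facet request from a label-to-productType mapping and reads the per-category counts back out of a response. The helper names are hypothetical, the response layout (facets, buckets, val, count) is the usual JSON Facet API shape, and the counts in the sample response are invented.

```python
from typing import Any, Dict


def category_facets_by_type(types: Dict[str, str]) -> Dict[str, Any]:
    """Map each label (e.g. 'variant') to a category facet filtered by productType."""
    return {
        label: {
            "type": "terms",
            "field": "category",
            "domain": {"filter": f"productType:{product_type}"},
        }
        for label, product_type in types.items()
    }


def category_counts(response: Dict[str, Any], label: str) -> Dict[str, int]:
    """Extract {category: count} for one labelled facet from a Solr response."""
    buckets = response["facets"][label]["buckets"]
    return {bucket["val"]: bucket["count"] for bucket in buckets}


request_facets = category_facets_by_type(
    {"base": "baseProduct", "variant": "variantProduct"}
)

# Invented sample response, trimmed to the keys the helper reads.
sample_response = {
    "facets": {
        "count": 50,
        "base": {"buckets": [{"val": "Chairs", "count": 10},
                             {"val": "Tables", "count": 5}]},
        "variant": {"buckets": [{"val": "Chairs", "count": 20},
                                {"val": "Tables", "count": 15}]},
    }
}

if __name__ == "__main__":
    print(category_counts(sample_response, "variant"))
```

Requesting only the variant facet would be enough if the base-product counts are never displayed.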
I'm running an instance of Solr 6.2. One of the use cases I'm exploring is to return records grouped by a field, including summed columns (facets) and sorted by those columns. I realize Solr is not meant to be utilized as a relational database, but is this possible?
Using the JSON API, I send the following data payload to the query endpoint of my Solr instance:
{
query: "*:*",
filter: ["status:1", "date:[2016-10-11T00:00:00Z-7DAYS/DAY TO 2016-10-11T00:00:00Z]"],
limit: 10,
params: {
group: true,
group.field: name,
group.facet: true
},
facet: {
funcs: {
type: terms,
field: name,
sort: { sum_v1: desc },
limit: 10,
facet: {
sum_v1: "sum(v1)",
sum_v2: "sum(v2)",
sum_v3: "sum(v3)"
}
}
}
}
This returns 10 records at a time in both the groups key and facets key of the response JSON. However, the sorted facet buckets do not match up with the grouped records. How can I get the facet counts with the relevant groups?
The only workaround I can come up with is to do a query for the grouped records first, then do another query using the id's from that query to get the facet counts. However, the downside is that I'd lose the ability to sort or filter by any of the facet counts.
Imagine a Solr index with documents similar to this:
[
{
ProductId: 123,
Contract: abc
},
{
ProductId: 123,
Contract: def
},
{
ProductId: 123
},
{
ProductId: 567
},
{
ProductId: 567,
Contract: bar
}
]
For every ProductId there is always one document without a Contract.
Additionally, there may be 0 to n documents with a Contract.
I need a query that takes a Contract and returns one document for every ProductId: the one with the given Contract if it exists, otherwise the single document without a Contract.
For example, a query with Contract: def (somehow) should give me this:
[
{
ProductId: 123,
Contract: def
},
{
ProductId: 567
}
]
The document with Contract: abc is not part of the result.
The document with ProductId: 123 but without a Contract is not part of the result.
The document with ProductId: 567 and no Contract is part of the result, because there is no document with this ProductId and Contract: def.
In other words, what I need is something like:
Give me one document per ProductId, with Contract:X XOR -Contract*, but not both.
Step 1 Write your query so that records without contracts as well as all records with the matching contract are returned, but the ones with the matching contract have the highest score. This gets around the problem that you will sometimes want items in your results that don't match the contract value: q=Contract:"def" OR (*:* -Contract:[* TO *]). The (*:* -Contract:[* TO *]) clause matches all records without contracts, and the Contract:"def" clause matches records with the correct contract. The records matching Contract:"def" should naturally have a higher score than those with no contract, but if there's any trouble or you just want to be sure, you can add a boost to that clause: Contract:"def"^2.
Step 2 Add Result Grouping to the query, configured so that you are requesting only the highest scoring record for any given ProductId:
q=Contract:"def" OR (*:* -Contract:[* TO *])&group=true&group.field=ProductId
This requires that the ProductId field be configured in your schema.xml as multiValued="false", as multiValued fields cannot be used as groups. I'm also assuming that you are using the Standard Query Parser, either set as a default in your solrconfig.xml or by adding the argument defType=lucene when you make the query.
The results should look something like this:
'grouped'=>{
  'ProductId'=>{
    'matches'=>3,
    'groups'=>[{
        'groupValue'=>123,
        'doclist'=>{'numFound'=>2,'start'=>0,'docs'=>[
            {
              'ProductId'=>123,
              'Contract'=>'def'}]
        }},
      {
        'groupValue'=>567,
        'doclist'=>{'numFound'=>1,'start'=>0,'docs'=>[
            {
              'ProductId'=>567}]
        }}]}}
Note that neither the matches nor the numFound values in the result set will tell you how many groups have been returned, but the argument rows=XX can be used to define the maximum number of desired groups (in this case ProductIds).
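The two steps can be sketched in Python as a parameter builder plus a small helper that pulls the top document out of each group. This is a sketch under assumptions: the function names are hypothetical, the contract value is inserted without escaping, and the sample response is a JSON rendering of the grouped result trimmed to the keys the helper reads.

```python
from typing import Any, Dict, List
from urllib.parse import urlencode


def contract_query_params(contract: str, rows: int = 10) -> Dict[str, Any]:
    """Query params: boosted contract match OR no contract, grouped by ProductId."""
    query = f'Contract:"{contract}"^2 OR (*:* -Contract:[* TO *])'
    return {
        "q": query,
        "group": "true",
        "group.field": "ProductId",
        "rows": rows,  # maximum number of groups (ProductIds) to return
    }


def docs_per_group(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the highest-scoring document of each ProductId group."""
    groups = response["grouped"]["ProductId"]["groups"]
    return [g["doclist"]["docs"][0] for g in groups if g["doclist"]["docs"]]


# Sample grouped response for Contract "def", trimmed for illustration.
sample_response = {
    "grouped": {"ProductId": {"groups": [
        {"groupValue": 123,
         "doclist": {"docs": [{"ProductId": 123, "Contract": "def"}]}},
        {"groupValue": 567,
         "doclist": {"docs": [{"ProductId": 567}]}},
    ]}}
}

if __name__ == "__main__":
    # Query string to append to the collection's select handler URL.
    print(urlencode(contract_query_params("def")))
    print(docs_per_group(sample_response))
```

urlencode takes care of percent-encoding the quotes, brackets and spaces in the q parameter.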
I'm new to Solr and I'm interested in implementing a special facet.
Sample documents:
{ hostname: google.com, time_spent: 100 }
{ hostname: facebook.com, time_spent: 10 }
{ hostname: google.com, time_spent: 30 }
{ hostname: reddit.com, time_spent: 20 }
...
I would like to return a facet with the following structure:
{ google.com: 130, reddit.com: 20, facebook.com: 10 }
Although Solr's return values are much more verbose than this, the important point is that the "counts" for the facets are the sums of the time_spent values of the matching documents rather than the number of documents matching each facet value.
Idea #1:
I could use a pivot:
q=*:*
&facet=true
&facet.pivot=hostname,time_spent
However, this returns the counts of all the unique time spent values for every unique hostname. I could sum this up in my application manually, but this seems wasteful.
Idea #2:
I could use the stats module:
q=*:*
&stats=true
&stats.field=time_spent
&stats.facet=hostname
However, this has two issues. First, the returned results contain all the hostnames. This is really problematic as my dataset has over 1m hostnames. Further, the returned results are unsorted - I need to render the hostnames in order of descending total time spent.
Your help with this would be really appreciated!
Thanks!
With Solr >=5.1, this is possible:
Facet Sorting
The default sort for a field or terms facet is by bucket count descending. We can optionally sort ascending or descending by any facet function that appears in each bucket. For example, if we wanted to find the top buckets by average price, then we would add sort:"x desc" to the previous facet request:
$ curl http://localhost:8983/solr/query -d 'q=*:*&
json.facet={
categories:{
type : terms,
field : cat,
sort : "x desc", // can also use sort:{x:desc}
facet:{
x : "avg(price)",
y : "sum(price)"
}
}
}
'
See Yonik's Blog: http://yonik.com/solr-facet-functions/
For your use case this would be:
json.facet={
hostname_time:{
type: terms,
field: hostname,
sort: "time_total desc",
facet:{
time_total: "sum(time_spent)",
}
}
}
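To get from that response to the compact structure asked for in the question, a small Python helper along these lines should do. It is a sketch: the helper name is hypothetical, the response layout (facets, buckets, val, count) is the usual JSON Facet API shape, and the sample response is built from the four documents in the question.

```python
from typing import Any, Dict


def totals_by_hostname(response: Dict[str, Any]) -> Dict[str, float]:
    """Map hostname -> summed time_spent, keeping Solr's bucket order."""
    buckets = response["facets"]["hostname_time"]["buckets"]
    return {bucket["val"]: bucket["time_total"] for bucket in buckets}


# Sample response for the question's documents, already sorted by
# "time_total desc" as requested in the facet.
sample_response = {
    "facets": {
        "count": 4,
        "hostname_time": {"buckets": [
            {"val": "google.com", "count": 2, "time_total": 130.0},
            {"val": "reddit.com", "count": 1, "time_total": 20.0},
            {"val": "facebook.com", "count": 1, "time_total": 10.0},
        ]},
    }
}

if __name__ == "__main__":
    print(totals_by_hostname(sample_response))
```

Python dicts preserve insertion order, so the descending order produced by the sort clause carries through. Adding a limit to the terms facet keeps the response small when there are over a million hostnames.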
Calling sum() in nested facets worked for us only in 6.3.0.
I believe what you are looking for is an aggregation component, but be aware that Solr is a full-text search engine and not a database.
So the answer to your question is: go with Idea #1. Otherwise, consider Elasticsearch, MongoDB, or even Redis, which come with such aggregation components.