I'm new to Solr and I'm interested in implementing a special facet.
Sample documents:
{ hostname: google.com, time_spent: 100 }
{ hostname: facebook.com, time_spent: 10 }
{ hostname: google.com, time_spent: 30 }
{ hostname: reddit.com, time_spent: 20 }
...
I would like to return a facet with the following structure:
{ google.com: 130, reddit.com: 20, facebook.com: 10 }
Although Solr's return values are much more verbose than this, the important point is that the "counts" for the facets are the sums of the time_spent values for the matching documents, rather than the actual number of documents matching each facet value.
Idea #1:
I could use a pivot:
q=*:*
&facet=true
&facet.pivot=hostname,time_spent
However, this returns the counts of all the unique time spent values for every unique hostname. I could sum this up in my application manually, but this seems wasteful.
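For reference, the manual summing would look something like this (a sketch in Python; the response structure below is the usual facet.pivot output shape, hand-written here with the sample documents rather than captured from a real server):

```python
# Sketch of the client-side summing the question calls wasteful, assuming the
# usual facet.pivot response shape: facet_pivot maps "hostname,time_spent" to a
# list of hostname buckets, each with nested time_spent buckets.
pivot_response = [
    {"field": "hostname", "value": "google.com", "count": 2,
     "pivot": [{"field": "time_spent", "value": 100, "count": 1},
               {"field": "time_spent", "value": 30, "count": 1}]},
    {"field": "hostname", "value": "facebook.com", "count": 1,
     "pivot": [{"field": "time_spent", "value": 10, "count": 1}]},
]

# For each hostname bucket, sum value * count over the nested buckets.
totals = {
    bucket["value"]: sum(p["value"] * p["count"] for p in bucket.get("pivot", []))
    for bucket in pivot_response
}
print(totals)  # {'google.com': 130, 'facebook.com': 10}
```

This transfers one nested bucket per unique time_spent value per hostname over the wire, which is exactly the waste the question describes.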
Idea #2:
I could use the stats module:
q=*:*
&stats=true
&stats.field=time_spent
&stats.facet=hostname
However, this has two issues. First, the returned results contain all the hostnames. This is really problematic as my dataset has over 1m hostnames. Further, the returned results are unsorted - I need to render the hostnames in order of descending total time spent.
Your help with this would be really appreciated!
Thanks!
With Solr >=5.1, this is possible:
Facet Sorting
The default sort for a field or terms facet is by bucket count
descending. We can optionally sort ascending or descending by any
facet function that appears in each bucket. For example, if we wanted
to find the top buckets by average price, then we would add sort:"x
desc" to the previous facet request:
$ curl http://localhost:8983/solr/query -d 'q=*:*&
json.facet={
categories:{
type : terms,
field : cat,
sort : "x desc", // can also use sort:{x:desc}
facet:{
x : "avg(price)",
y : "sum(price)"
}
}
}
'
See Yonik's Blog: http://yonik.com/solr-facet-functions/
For your use case this would be:
json.facet={
hostname_time:{
type: terms,
field: hostname,
sort: "time_total desc",
facet:{
time_total: "sum(time_spent)"
}
}
}
Calling sum() in nested facets worked for us only in 6.3.0.
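On the client side, reading the totals back out of the JSON Facet response is then straightforward (a sketch; the response dict below is hand-written to match the shape that request should produce, not captured from a live server):

```python
# Sketch: pulling hostname totals out of a JSON Facet API response shaped like
# the request above. Bucket and field names follow that request; values are
# the sample documents' sums.
response = {
    "facets": {
        "count": 4,
        "hostname_time": {
            "buckets": [
                {"val": "google.com", "count": 2, "time_total": 130.0},
                {"val": "reddit.com", "count": 1, "time_total": 20.0},
                {"val": "facebook.com", "count": 1, "time_total": 10.0},
            ]
        }
    }
}

# Buckets arrive already sorted by time_total desc, per the sort parameter.
totals = {b["val"]: b["time_total"]
          for b in response["facets"]["hostname_time"]["buckets"]}
print(totals)  # {'google.com': 130.0, 'reddit.com': 20.0, 'facebook.com': 10.0}
```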
I believe what you are looking for is an aggregation component, but be aware that Solr is a full-text search engine, not a database.
So the answer to your question is: go with idea #1. Otherwise you could use Elasticsearch, MongoDB, or even Redis, which are equipped with such aggregation components.
Related
For 1000+ records satisfying the search criteria in the filter query, the Solr collection gives a different percentile value every time. I've been using the same filter query, with a JSON facet query to get the percentile inside one query facet.
Sample Query :
json.facet = {
time: "sum(time)",
users: "sum(numofusers)",
queryfacet: {
q: "time: [0 TO 50000}",
type: query,
facet: {
timepercentile: "percentile(time, 95)"
}
}
}
The percentile function is an approximation and is not an exact value. It's documented under the stats functions:
percentiles
A list of percentile values based on cut-off points specified by the parameter value, such as 1,99,99.9. These values are an approximation, using the t-digest algorithm. This statistic is computed for numeric field types and is not computed by default.
The percentile function in the JSON Facet API uses the same method:
Percentile estimates via t-digest algorithm. When sorting by this metric, the first percentile listed is used as the sort value.
You can read more about the t-digest algorithm on the GitHub repository.
Since these values are based on estimates, I'm guessing there's some minor variance in which elements get sampled; it might also depend on the structure of your index (number of nodes, when they get updated, when commits get issued, etc.).
How can I group my Solr query results using a numeric field into x buckets, where the bucket start and end values are determined when the query is run?
For example, if I want to count and group documents into 5 buckets by a wordCount field, the results should be:
250-500 words: 3438 results
500-750 words: 4554 results
750-1000 words: 9854 results
1000-1250 words: 3439 results
1250-1500 words: 38 results
Solr's faceting API docs assume that the facet buckets are known in advance, but that isn't possible for numeric fields, where the bucket boundaries depend on the search results.
My current query (which doesn't work) is:
curl http://localhost:8983/solr/pages/query -d '
q=*:*&
rows=0&
json.facet={
wordCount : {
type: range,
field : wordCount,
start : min(wordCount),
end : max(wordCount),
gap : 1000
}
}'
I have read this question, which suggests calculating the buckets in the application code before sending them to Solr for counting. This is not ideal because it involves querying Solr multiple times, and the answer is several years out of date; since then, Solr has added the JSON faceting API, which allows more complicated faceting configurations.
In SQL, this type of dynamic bucketing is possible with union queries, in which each query in the union calculates a specific bucket's lower and upper bounds and counts the results in that bucket. So it seems odd that in Solr, where a lot of effort has gone into making faceting easy, this kind of query is not possible.
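Absent a single-request solution, the usual workaround is two requests: first fetch the min and max (for example with json.facet={lo:"min(wordCount)",hi:"max(wordCount)"}), then derive the range-facet parameters client-side. A sketch in Python, with invented numbers:

```python
# Sketch of the two-request workaround: given min/max from a first query,
# compute start/end/gap for an ordinary Solr range facet. Numbers are made up.
import math

def range_facet_params(lo, hi, num_buckets):
    """Build start/end/gap for a range facet splitting [lo, hi] into num_buckets."""
    gap = math.ceil((hi - lo) / num_buckets) or 1  # guard against lo == hi
    return {"type": "range", "field": "wordCount",
            "start": lo, "end": lo + gap * num_buckets, "gap": gap}

params = range_facet_params(lo=250, hi=1500, num_buckets=5)
print(params)  # {'type': 'range', 'field': 'wordCount', 'start': 250, 'end': 1500, 'gap': 250}
```

These parameters then go into a second json.facet request; the extra round trip is cheap since the first query can run with rows=0.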
I'd like to use the new Semantic Knowledge Graph capability in Solr to answer this question:
Given a set of documents from several different publishers, compute a "relatedness" metric between a given publisher and every other publisher, based on the text content of their respective documents.
I've watched several of Trey Grainger's talks regarding the Semantic Knowledge Graph functionality in Solr (this is a great recent one: https://www.youtube.com/watch?v=lLjICpFwbjQ). I have a reasonably good understanding of Solr's faceted search functionality, and I have a working Solr instance with my dataset indexed and searchable. So far I've been unable to construct a facet query that does what I want.
Here is an example curl command which I thought might get me what I want:
curl -sS -X POST http://localhost:8983/solr/plans/query -d '
{
params: {
fore: "publisher_url:life.church",
back: "*:*"
},
query:"*:*",
limit: 0,
facet:{
pub_type: {
type: terms,
field: "publisher_url",
limit: 5,
sort: { "r1": "desc" },
facet: {
r1: "relatedness($fore,$back)"
}
}
}
}'
Below are the resulting facets. Notice that after the first bucket (which matches the foreground query), the others all have exactly the same relatedness, which leads me to believe that the "relatedness" is only based on the publisher_url field rather than the entire text content of the documents.
{
"facets":{
"count":2152,
"pub_type":{
"buckets":[{
"val":"life.church",
"count":141,
"r1":{
"relatedness":0.38905,
"foreground_popularity":0.06552,
"background_popularity":0.06552}},
{
"val":"10ofthose.com/us/products/1039/colossians",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"14DAYMARRIAGECHALLENGE.COM",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"23blast.com",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}},
{
"val":"2911worship.com",
"count":1,
"r1":{
"relatedness":-0.00285,
"foreground_popularity":0.0,
"background_popularity":4.6E-4}}]}}}
I'm not very familiar with the relatedness function, but as far as I understand, the relatedness score is generated from the similarity between your foreground and background sets of documents for that facet bucket.
Since your foreground set only contains that single value (and none of the others), the first bucket is the only one that will generate a different similarity score when you're faceting on the same field you use to select the foreground documents.
I'm not sure your use case is a good match for this feature: relatedness indicates that single terms in a field are related between the two sets you're comparing, not a similarity score computed across a different field for those two sets.
You probably want something more structured than a text field to generate relatedness() scores, as they're usually more useful for finding single values that give statistical insight into the structure of your query set.
The More Like This functionality might actually be a better match for getting the most similar other sites instead.
Again, this is based on my understanding of the functionality at the moment, so someone else can hopefully add more details and correct me as necessary.
I have people indexed into solr based on structured documents. For simplicity's sake, let's say they have the following schema
{
personName: text,
games :[ { gamerScore: int, game: text } ]
}
An example of the above would be
{
personName: john,
games: [
{ gamerScore: 80, game: Zelda },
{ gamerScore: 20, game: Space Invader },
{ gamerScore: 60, game: Tetris }
]
}
Here, gamerScore is a value between 1 and 100 indicating how good the person is at the specified game.
Relevance matching in Solr is all done through the text field 'game'. However, I want my final result list to be a combination of the relevance to the query as provided by Solr and my own gamerScore. Namely, I need to re-rank the results based on the following formula:
personFinalScore = (0.8 * solrScore) + (0.2 * gamerScore)
What I am trying to achieve is the combination of two different scores in a weighted manner in Solr. A similar question was asked a long time ago, and I was wondering if there is something in Solr 7.x that can tackle this.
I can change the schema around if a solution requires it.
In effect your formula simplifies to boosting by gamerScore with a weight of 0.25 (0.2 / 0.8): the absolute value of the score is irrelevant, only how much the gamerScore field affects the score of the document.
The dismax based handlers supports bf:
The bf parameter specifies functions (with optional boosts) that will
be used to construct FunctionQueries which will be added to the user’s
main query as optional clauses that will influence the score.
Since bf is an additive boost, you can use bf=product(gamerScore,0.25) to make the gamerScore count for 20% of the total score.
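To sanity-check the 0.25 claim: ranking by 0.8 * solrScore + 0.2 * gamerScore is identical to ranking by solrScore + 0.25 * gamerScore, because the first expression is just the second scaled by 0.8. A quick sketch with invented scores:

```python
# Sketch verifying that the 0.8/0.2 weighted formula and the additive
# 0.25 * gamerScore boost produce the same ranking. Names and scores invented.
docs = [("john", 4.0, 80), ("jane", 5.0, 20), ("alex", 3.5, 95)]  # (name, solrScore, gamerScore)

weighted = sorted(docs, key=lambda d: 0.8 * d[1] + 0.2 * d[2], reverse=True)
boosted = sorted(docs, key=lambda d: d[1] + 0.25 * d[2], reverse=True)

# Both orderings agree, since 0.8*s + 0.2*g == 0.8 * (s + 0.25*g).
assert [d[0] for d in weighted] == [d[0] for d in boosted]
print([d[0] for d in boosted])  # ['alex', 'john', 'jane']
```

Note this only holds exactly if bf contributes the raw product to the final score; in practice the Solr relevance score isn't normalized to a fixed range, so you may need to tune the 0.25 factor empirically.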
Is there a way to get the count of different facets in solr?
This is an example facet:
Camera: 20
Computer: 80
Monitor: 40
Laptop: 120
Tablet: 30
What I need to get here is "5", as in 5 different electronic items. Of course I can count the facet values myself, but some facets have many items inside, and fetching and counting them really slows things down.
You need to apply Solr patch SOLR-2242 to get the facet distinct count.
Solr 5.1 will include a "JSON Facet API" with the function "unique(state)":
http://yonik.com/solr-facet-functions/
Not an exact answer to this question, but if facet distinct count doesn't work for anyone, you can get that information using group.ngroups:
group = true
group.field = [field you are faceting on]
group.ngroups = true
The disadvantage of this approach is that if you want ungrouped results, you will have to run the query a second time.
Of course the patch is recommended, but the cheesy way to do this for a simple facet, as you have here, is just to use the length of the facet divided by 2. For example, since you are faceting on type:
facetCount = facet_counts.facet_fields.type.length/2
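This works because, with the default flat JSON response format (i.e. json.nl not set to map), facet_fields lists each field's values as alternating term/count pairs. A quick sketch using the counts from the question:

```python
# Sketch of the length/2 trick: Solr's default flat facet_fields format is
# [term1, count1, term2, count2, ...], so the number of distinct terms is half
# the list length. Response fragment below is hand-written, not from a server.
facet_fields = {
    "type": ["Laptop", 120, "Computer", 80, "Monitor", 40, "Tablet", 30, "Camera", 20]
}

distinct = len(facet_fields["type"]) // 2
print(distinct)  # 5
```

Keep in mind this still requires fetching every facet value (mind facet.limit), which is exactly what gets slow with many distinct values; unique() avoids that transfer.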
Use this GET request:
http://localhost:8983/solr/core_name/select?json.facet={x:"unique(electronic_items)"}&q=*:*&rows=0