How can I group my Solr query results using a numeric field into x buckets, where the bucket start and end values are determined when the query is run?
For example, if I want to count and group documents into 5 buckets by a wordCount field, the results should be:
250-500 words: 3438 results
500-750 words: 4554 results
750-1000 words: 9854 results
1000-1250 words: 3439 results
1250-1500 words: 38 results
Solr's faceting API docs assume that the facet buckets are known in advance, but this isn't possible for numeric fields because the lower and upper buckets depend on the search results.
My current query (which doesn't work) is:
curl http://localhost:8983/solr/pages/query -d '
q=*:*&
rows=0&
json.facet={
wordCount : {
type: range,
field : wordCount,
start : max(wordCount),
end : min(wordCount),
gap : 1000
}
}'
I have read this question, which suggests calculating the buckets in the application code before sending them to Solr for counting. This is not ideal because it involves querying the database multiple times. The answer is also several years out of date: since then, Solr has added the JSON faceting API, which allows more complicated faceting settings.
In SQL, this type of dynamic bucketing is possible with union queries, in which each query in the union calculates a specific bucket's lower and upper bounds and counts the results in that bucket. So it seems weird that in Solr, where a lot of effort has gone into making faceting easy, this kind of query is not possible.
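For reference, the two-step workaround from the linked answer might look roughly like the sketch below: one request asks for min(wordCount) and max(wordCount) through the JSON Facet API, and a second request uses a range facet derived from those values. This is only a sketch in Python. The helper names (stats_facet, range_facet) and the sample values 250 and 1499 are made up for illustration, it assumes an integer field, and it only builds the request parameters rather than sending them.

```python
import json
import math


def stats_facet(field):
    """json.facet body for the first request: min and max of the field."""
    return {"lo": f"min({field})", "hi": f"max({field})"}


def range_facet(field, lo, hi, buckets):
    """json.facet body for the second request: `buckets` equal-width ranges.

    Assumes an integer field. Solr range buckets exclude their upper bound by
    default, so the span is widened by 1 to keep the maximum value inside the
    last bucket.
    """
    gap = math.ceil((hi + 1 - lo) / buckets)
    return {
        field: {
            "type": "range",
            "field": field,
            "start": lo,
            "end": lo + gap * buckets,
            "gap": gap,
        }
    }


# First request: read "lo" and "hi" from the "facets" section of the response.
first = {"q": "*:*", "rows": 0, "json.facet": json.dumps(stats_facet("wordCount"))}

# Second request, built from the values the first one returned
# (250 and 1499 are made-up values standing in for that response).
second = {
    "q": "*:*",
    "rows": 0,
    "json.facet": json.dumps(range_facet("wordCount", 250, 1499, 5)),
}
print(second["json.facet"])
```

This still costs two round trips, which is exactly the drawback described above; it does not make the bounds dynamic inside a single Solr request.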
Related
I need to get only the first n documents sorted by the prevId field from Solr (that is, not fetch all the docs, but cut the result off at the rows value). It seems to perform poorly, and moreover it returns the wrong value for the number of found docs. Is there any way to do this from the Solr GUI or with a raw request?
numFound is the total number of documents in the index that match your query (which in this case is all the documents in the index); it's not the number of documents returned.
You can enable docValues on your field if sorting is slow for that field, but caching usually helps a lot when doing multiple sorts (as long as your index hasn't been modified in between). That being said, your query took 285ms on the Solr side, so maybe the slowness you're experiencing comes from somewhere other than Solr?
Different output formats (&wt=json etc.) might also be more efficient to deserialize in your language of choice (and to display in your browser, which does a lot of syntax highlighting for XML).
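A rough sketch of what the raw request could look like, and of how numFound differs from the number of documents actually returned. It is written in Python; first_n_params is a hypothetical helper, and the sample response is invented purely to show the shape of the result.

```python
from urllib.parse import urlencode


def first_n_params(n, sort_field="prevId"):
    """Query parameters for the first n documents in ascending sort order."""
    return {"q": "*:*", "sort": f"{sort_field} asc", "rows": n, "wt": "json"}


# Query string to append to the collection's /select URL.
print(urlencode(first_n_params(10)))

# Invented response fragment, only to show where the two numbers live.
sample = {
    "response": {
        "numFound": 5000,
        "start": 0,
        "docs": [{"prevId": 1}, {"prevId": 2}],
    }
}
matched = sample["response"]["numFound"]    # every document matching q
returned = len(sample["response"]["docs"])  # at most `rows` documents
print(matched, returned)
```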
I've noticed something curious with our SOLR 7 results.
We have faceting enabled on, for example, a manufacturer field.
When a search is performed for a particular manufacturer, the facet data will include a number of results for that manufacturer (in this case, 99 results). Also, all the facet results add up to match the total number of documents matching the query (which makes sense).
If a "blank" search is performed (resulting in a *:* query), all documents are returned from SOLR (~242,000). The facet results for the manufacturer field are no longer adding up to the total number of documents returned, however. It ends up being ~36,000 documents short. The specific manufacturer that I searched for in the prior example, which DID return a count of 99 in the facet data for that manufacturer, now returns nothing for that manufacturer. There is no facet result shown for that manufacturer.
If I query solr for the specific manufacturer value in the specific field we're faceting on, then it finds the 99 matches, and the facet data also shows the 99 results.
I think this problem is only happening when a *:* (or blank q) query is done.
Any suggestions?
Please let me know if you require more information.
Thanks,
Bill
I'm not sure I understand your problem correctly, but here are some typical solutions.
You can use the "enum" facet method for huge facets.
facet.method=enum
Furthermore you need to control facet counts with:
facet.limit=10000 //maximum number of returned facets
facet.offset=0
For more information about Solr facet params, go to:
https://wiki.apache.org/solr/SimpleFacetParameters
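As a rough illustration, the parameters above could be combined into a single request as in the Python sketch below. The helper name facet_page_params is hypothetical, and the field name manufacturer is assumed from the question's description. Since facet.limit defaults to 100, a value with a low count can drop off the facet list on a *:* query, which might be why a manufacturer with only 99 documents disappears there.

```python
from urllib.parse import urlencode


def facet_page_params(field, limit=10000, offset=0):
    """Parameters for one page of facet values on `field`, returning no documents."""
    return [
        ("q", "*:*"),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.method", "enum"),
        ("facet.limit", limit),    # maximum number of facet values returned
        ("facet.offset", offset),  # how many facet values to skip
    ]


# Query string to append to the collection's /select URL.
print(urlencode(facet_page_params("manufacturer")))
```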
I would like to check: will using result grouping with group.ngroups (which includes the number of groups that have matched the query) in the search affect the performance of Solr? I found that searching slowed down quite significantly after I added the group.ngroups parameter.
I need the number of groups that have matched the query. Is there any other way to retrieve that value?
I have more than 10 million documents, with an index size of more than 500GB, and I'm using Solr 5.4.0.
Regards,
Edwin
Yes, it will affect performance. Everything that needs to be done to a result set (such as grouping) will affect performance in some way. Exactly how much depends on too many factors to say (but you've already observed the effect yourself).
You can get the number of unique values (which should be the same as grouping for that field and counting the number of groups) for a field in a number of ways, which Yonik shows in his Count Distinct Values blog post.
The unique facet function is Solr’s fastest implementation to calculate the number of distinct values.
$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
x : "unique(manu_exact)" // manu_exact is the manufacturer indexed as a single string
}'
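For anyone calling this from application code, here is a small Python sketch of building the same json.facet parameter and reading the distinct count back. The helper name unique_count_params is hypothetical, rows=0 is an addition to skip returning documents, and the sample response is invented to show where the value appears (under "facets", keyed by the label used in the request).

```python
import json


def unique_count_params(field, label="x"):
    """Request parameters asking for the number of distinct values of `field`."""
    return {
        "q": "*:*",
        "rows": 0,  # only the facet result is needed, not the documents
        "json.facet": json.dumps({label: f"unique({field})"}),
    }


print(unique_count_params("manu_exact")["json.facet"])

# Invented response fragment, only to show the shape of the result.
sample = {"facets": {"count": 32, "x": 14}}
distinct_values = sample["facets"]["x"]
print(distinct_values)
```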
On our webshop I want to implement a feature which should do the following:
If a user e.g. searches for "phone magnum", there will be no results.
If there were no results I want to give him the possibility to see
that search for "phone" will give him 139 results
and search for "magnum" will get 12 results.
I don't want to run several queries just to get those counts, but at the moment I have no idea how to do that.
I read the Solr-wiki for faceting, but didn't find anything useful for my problem. Maybe I missed something ....
Not sure why you want to avoid multiple queries. If your first search on the phrase "phone magnum" does not return any results, you could issue one query per search keyword with rows=0 which will give you only the counts. This should be efficient, since you are not building any result documents and only getting the result counts.
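A minimal Python sketch of the one-query-per-keyword idea. The helper name count_queries is hypothetical, the field name name is an assumption, and terms containing Solr query syntax characters would need escaping, which is not handled here.

```python
from urllib.parse import urlencode


def count_queries(search, field="name"):
    """One rows=0 query string per keyword in the search phrase."""
    return [
        urlencode({"q": f"{field}:{term}", "rows": 0, "wt": "json"})
        for term in search.split()
    ]


# Each string goes to the /select handler; numFound in each response is the count.
for query_string in count_queries("phone magnum"):
    print(query_string)
```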
However, if you really want to avoid the subsequent queries, here is one approach: have a field in your index which does not take IDF into account. (See this on how to do that.) Once that field is available (call it, say, name_no_idf), issue a query against this field: name_no_idf:(phone magnum). Notice that this is not a phrase search.
The documents which contain both phone and magnum in the name_no_idf field will get a score of 2, while the docs matching only one word will get a score of 1. To this query you add facet=true&facet.field=name. Then the facet counts you get for these two words will be the counts you are looking for. A few warnings, though:
if one of the words is very infrequent, you may need to increase facet.limit
facet queries are expensive
We want to be able to return the "n" most frequent indexed terms for certain documents selected from a base query. Is that possible using Solr?
Yes, you can do this by turning faceting on and faceting on the field from which you're trying to get the frequently indexed terms. You might actually get more information than you need (Solr will return all terms ordered by frequency rather than the top n):
?q=keyword&facet=true&facet.field=myfield
If you also use &rows=0, Solr will return only the faceting information and not the actual search results.
EDIT: Actually, by default Solr returns the top 100 facet terms. Use the facet.limit parameter to change this number. So, to return the top n terms, do the following:
?q=keyword&facet=true&facet.field=myfield&facet.limit=n
Use a negative number for facet.limit to return all terms. More information here: http://wiki.apache.org/solr/SimpleFacetParameters
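A small Python sketch of building that request and reading the result. The helper names top_terms_query and pair_up are hypothetical, and the sample list is invented. It assumes the default JSON output, where each entry under facet_counts.facet_fields is a flat list alternating term and count.

```python
from urllib.parse import urlencode


def top_terms_query(keyword, field, n):
    """Query string asking only for the n most frequent terms of `field`."""
    return urlencode(
        {
            "q": keyword,
            "rows": 0,
            "facet": "true",
            "facet.field": field,
            "facet.limit": n,
        }
    )


def pair_up(flat):
    """Turn the flat list [term, count, term, count, ...] into (term, count) pairs."""
    return list(zip(flat[::2], flat[1::2]))


print(top_terms_query("keyword", "myfield", 5))

# Invented facet_fields entry, only to show the pairing.
print(pair_up(["solr", 12, "lucene", 7]))
```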