I have an index with around 35 million documents. When a user issues a query with any combination of search words and filters, I need to get a count of unique values on another field. The purpose is to answer the question "How many unique (field x) are there with a given query?".
I'm pretty sure that Azure Search doesn't have any capability to do this, so I thought I would try another query that selects just the field I want to count distinct values of, but I think that would be very time-consuming with such a large index. I'm also under the impression that I can only skip at most 100,000 records, which would make this impossible if a query returned more than 100k results.
Any ideas on how to go about this?
Thanks!
Azure Search doesn't directly support distinct count of values today. In order to support it in a single query combined with $filter, it would either have to be supported as a new facet type, or maybe with a combination of $count and $filter where the field being counted is the key field (note that $count and $filter can't be combined today).
Feel free to add distinct count to the Azure Search feedback forum to help prioritize the feature.
Original Answer
If you wanted a count of documents per unique value, you could use facets. For example, if you're searching for shoes under $100 and you want to know, out of the hits, how many shoes of each color there are, you would do this:
GET /indexes/products/docs?search=shoes&$filter=price+lt+100&facet=color&api-version=2015-02-28
The response will contain an @search.facets property that contains buckets for each unique value along with a count. You can find more info in the Azure Search documentation on faceted navigation.
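As a rough sketch of the shape of that response (the color values are illustrative), the facet portion looks something like this:

"@search.facets": {
  "color": [
    { "value": "black", "count": 124 },
    { "value": "brown", "count": 56 }
  ]
}

Counting the entries in that array gives the number of distinct colors among the hits, but only up to the number of facet buckets requested, so it isn't a full distinct count on a high-cardinality field.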
Related
We have a use case where we need to find the distinct (unique) records.
A document has 5 different keys, all of which are searchable, and we need to find the distinct records using one of those keys.
I also need to implement pagination on those distinct records.
See https://docs.vespa.ai/documentation/grouping.html. The Vespa grouping language also supports pagination.
Example:
select ... | all(group(key) max(10) each( max(3) each(output(summary()))))
This groups hits by the key field, returns at most 10 unique key values, and for each unique key value renders the 3 best hits. Groups are by default ordered by the maximum relevancy of a hit in the group. When using max() you can paginate using the continuation parameter to fetch more groups or more hits.
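For the distinct-count side of this, a small sketch (the field name key is illustrative): moving the output to the group-list level should return the number of unique key values instead of the groups themselves:

select ... | all(group(key) output(count()))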
I use Solr to index and search on a system with about 100,000 products and 300,000 users. The field to index is "price". But for each user, the price may be different.
For example:
- Take Product1 and two users, User1 and User2.
- User1 sees Product1's price of $100.
- User2 cannot see the price (User2 has to fulfill some conditions to see it), although User2 still sees Product1 when searching.
At indexing time, we cannot determine whether to set the price for a specific user. The product has a flag called "Required Contract", and when a user logs in, we check whether the user has applied the "contract" for that product in order to show or hide the price.
The straightforward solution to this problem is to create a different "price" field for each user: when indexing, we loop through the list of users and index a "price" field per user, and when searching, we use the "price" field that matches the logged-in user. Obviously, this is not a practical solution.
My question is how to index the "price" field in this case, or are there other approaches to this problem?
It seems that you don't have specific prices for each user, but for groups of users, presumably a not-very-high number of unique groups. If so, a simple solution is to have multiple price fields (price_group_1, price_group_2, ...), one for each user group, and let your display code show the one matching the user's group.
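As a sketch of what that looks like at query time, assuming hypothetical fields price_group_1, price_group_2, ... and a user who belongs to group 2:

q=shoes&fl=id,name,price_group_2&sort=price_group_2 asc

The application decides which price_group_* field to request, sort on, or filter by, based on the logged-in user's group; the index itself stays the same for everyone.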
After researching for a while I came up with 2 solutions:
1/ The first one is to duplicate each document that requires the condition. The duplicated document contains no price (no value is assigned to it). Because the number of documents requiring conditions is not too large, duplicating them is not a waste of resources. When searching, by combining some other flags, the results are the duplicated documents rather than the originals, which fulfills my requirements.
2/ Another approach is using Function Queries. I figured this out by searching on Google for my problem as I didn't know about that feature at the beginning.
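The answer doesn't show the function-query part in detail, but here is a minimal sketch, assuming a per-document field user_price that is only populated when the price may be shown (field names are illustrative):

fl=id,name,effective_price:def(user_price,-1)

def(field, default) returns the field's value when it exists and the fallback (-1 here) otherwise, so the UI can hide the price for documents that come back with -1.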
Hope that can help someone's problems!
I have a few near-duplicate documents stored in Solr. The schema has an autogenerated UUID as the unique key, so duplicates can get into the index. I need to get the count of duplicated documents based on one or more fields in the schema.
I am trying to get quick numbers without writing a client program and iterating through the full result set, ideally from the Solr console itself.
I tried facets but was not able to get the total counts. The query below gives the duplicate count for each value of 'idfield', but the results have to be iterated to the last page and summed up (over a couple of million entries).
q=*:*&facet=true&facet.mincount=2&facet.field=idfield
A JSON facet query can be used to find the number of unique values, as explained in this blog:
http://yonik.com/solr-count-distinct/
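Following that blog, a minimal sketch that returns the number of distinct idfield values without paging through facet buckets:

q=*:*&rows=0&json.facet={distinct_ids:"unique(idfield)"}

(hll(idfield) can be used instead for an approximate count on very high-cardinality fields.)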
Or it can be done using the collapse filter and finding the difference:
q=*:*&fq={!collapse=true field=idfield} - take the numFound and subtract it from the numFound of the MatchAllDocs query (*:*)
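Spelled out, the collapse approach is two quick queries where only numFound matters (rows=0):

q=*:*&rows=0
q=*:*&rows=0&fq={!collapse=true field=idfield}

The first numFound is the total number of documents, the second is the number of distinct idfield values, and the difference is the number of extra duplicate documents.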
You can also use facet.mincount=2 to get duplicate documents by faceting on the unique id field. Ex: /solr/core/select?q=*:*&facet=on&facet.field=uniqueidfield&facet.mincount=2&facet.missing=true
You can also add facet.limit=-1&rows=0 to get all the id values that have duplicates.
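Putting those parameters together, a sketch of the full request:

/solr/core/select?q=*:*&rows=0&facet=on&facet.field=uniqueidfield&facet.mincount=2&facet.limit=-1&facet.missing=true

Every bucket in the facet output is then a duplicated uniqueidfield value together with the number of documents sharing it.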
We have a situation where we are keeping two indexes with different schemas.
For example: suppose we have an index for sellers where the key is the seller id and the other attributes are seller information. Another index is for books, where the book id is the unique key and it holds book-related information.
Is it possible to query both these indexes in a single query and get collective results?
I have checked Solr, and as far as I can tell we can do this through distributed search in Solr, but that works on the same kind of schema distributed across at most 3 indexes.
I am a newbie to Solr so please ignore if this is a stupid question.
You need to think about what makes sense for a search query but there are some rules.
The first requirement is that the unique keys need to have the same name and be unique across collections or Solr cannot collate results.
If you are then hoping to get some kind of sensible ranking of your results you need some common fields. For example I have two collections: one of product data and one containing product related documents. I have a unique key: id and I have common title and contents fields for when I want to query across the two collections. I also have an advanced search interface where I can query on specific fields like product id.
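As a sketch of what such a query can look like in SolrCloud, assuming both collections share the id, title and contents fields described above (collection names are illustrative):

/solr/products/select?q=title:handbook&collection=products,documents

The collection parameter fans the query out to both collections and collates the results by the shared unique key.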
A "unification core" is a typical way of handling search across two or more cores, see this Stack Overflow answer on how to set that up
Query multiple collections with different fields in solr
Other techniques are to use federated search with something like Carrot or to issue two queries and show the results in different tabs in the search results.
I'm trying to understand how to approach search requirements I have.
The first one is a normal product search that I know Solr can handle appropriately, where you search for a term and Solr returns relevant documents.
The second one is a search for products within a certain category. I have a hierarchical structure in my database that consists of categories with many subcategories, and those subcategories have products.
The thing is, when certain very specific words are searched for, the first approach shouldn't be used; instead, a search for a category should be done and only products within that category should be returned, which for me is a very basic SQL query (select * from products where categoryId = 1000).
Should or can Solr be used in the second case? If so, what is the normal approach?
Besides the filter queries that @Mysterion proposed, you should take a look at Solr facets, which give you very powerful category-like searching.
You might also want to consider a multivalued field for categoryParentIds, which contains the parent categories that the product is in; combined with filter queries and/or facets, this gives you parent-category searching.
Yes, you could use a similar approach in Solr by attributing your products with a categoryId and later, while searching, adding a filter query similar to the SQL condition: categoryId:1000
For more info about filter query, take a look here - http://wiki.apache.org/solr/CommonQueryParameters#fq
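A minimal sketch combining a keyword search with the category filter (and, if you also index the hierarchy as a multivalued parent field as suggested above, with the parent filter):

q=shoes&fq=categoryId:1000
q=shoes&fq=categoryParentIds:1000

The fq clause restricts the results to the category without affecting relevance scoring, which mirrors the WHERE clause in the SQL example.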