How to find distinct records in vespa.ai?

We have a use case where we need to find out the distinct (unique) records.
Each document has 5 different keys, all of them searchable, and we need to find the distinct records based on one of those keys.
I also need to implement pagination over those distinct records.

See https://docs.vespa.ai/documentation/grouping.html. The Vespa grouping language also supports pagination.
Example:
select ... | all(group(key) max(10) each( max(3) each(output(summary()))))
This groups hits by the key field, displays at most 10 unique key values and, for each unique key value, renders the 3 best hits. Groups are ordered by default by the maximum relevancy of a hit in the group. When you use max(), you can paginate with the continuation parameter to fetch more groups or more hits.
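As a rough sketch of how the grouping expression above could be assembled for successive pages, the helper below builds the YQL string in Python. The function name, the base query and the token value are hypothetical, and the placement of the continuation token (a 'continuations' annotation in front of the grouping expression) is an assumption that should be checked against the grouping documentation for your Vespa version.

```python
from typing import Optional


def build_grouping_yql(base_yql: str, key: str, max_groups: int = 10,
                       max_hits: int = 3,
                       continuation: Optional[str] = None) -> str:
    """Append a grouping clause to base_yql, optionally resuming from a token."""
    grouping = (
        f"all(group({key}) max({max_groups}) "
        f"each(max({max_hits}) each(output(summary()))))"
    )
    if continuation:
        # Assumption: the token returned with the previous page is passed back
        # through a 'continuations' annotation before the grouping expression.
        grouping = f"{{ 'continuations':['{continuation}'] }}{grouping}"
    return f"{base_yql} | {grouping}"


# Hypothetical base query; replace with your own select statement.
first_page = build_grouping_yql("select * from sources * where true", "key")
next_page = build_grouping_yql("select * from sources * where true", "key",
                               continuation="TOKEN_FROM_PREVIOUS_RESPONSE")
```

The first call produces the same grouping clause as the example above; the second shows where a token taken from the previous response would go.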

Related

How to find the number of duplicate documents in solr based on an indexed field

I have a few near-duplicate documents stored in solr. The schema has an autogenerated uuid as the unique key, so duplicates can get into the index. I need to get the counts of duplicated documents based on a field or fields in the schema.
I am trying to get quick numbers without writing a client program and going through the full result set, ideally from the solr console itself.
I tried to use facets but was not able to get the total counts. The query below gives the duplicates for each value of 'idfield', but they have to be iterated through to the last page and summed up (over a couple of million entries).
q=*:*&facet=true&facet.mincount=2&facet.field=idfield
A JSON facet query can be used to count unique values, as explained in this blog post:
http://yonik.com/solr-count-distinct/
Alternatively, use the collapse filter and take the difference:
q=*:*&fq={!collapse field=idfield}
Take numFound from this query and subtract it from the numFound of the match-all query (*:*).
You can also use facet.mincount=2 to get duplicate documents by faceting on the unique id field. Example: /solr/core/select?q=*:*&facet=on&facet.field=uniqueidfield&facet.mincount=2&facet.missing=true
You can also add facet.limit=-1&rows=0 to list every duplicated id value without returning the documents themselves.
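To get a single number out of the facet approach, the counts still have to be summed somewhere. The sketch below does that in Python, assuming the default flat facet layout in which values and counts alternate in one list; the function name and the sample list are made up for illustration.

```python
def count_duplicates(flat_facets):
    """Summarise a flat Solr facet list of the form [value, count, value, count, ...].

    Returns (number of values occurring more than once,
             number of extra copies beyond the first of each value).
    """
    values = flat_facets[0::2]
    counts = flat_facets[1::2]
    duplicated = [
        count for value, count in zip(values, counts)
        # A None value is the facet.missing bucket (documents without the
        # field); those are not duplicates of each other, so skip it.
        if value is not None and count >= 2
    ]
    return len(duplicated), sum(count - 1 for count in duplicated)


# Made-up facet list: "a" occurs 3 times, "b" twice, 4 documents lack the field.
sample = ["a", 3, "b", 2, None, 4]
values_with_duplicates, extra_copies = count_duplicates(sample)
```

The second number corresponds to what the collapse approach measures: documents that have the field minus the number of distinct values.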

Grouping documents in solr

I want to index orders and corresponding order entries in Solr to display it in our e-commerce site.
I am planning to adopt a denormalized approach, repeating the order details with every order entry to reduce request latency. At the same time, I need to group records by orderid to find the order total for a specified duration.
Is it possible to achieve this without going for a separate index for orders alone?
Yes, this is possible: you can use Result Grouping / Field Collapsing for your query. In your case the group field should be orderid, and you should add group=true&group.field=orderid to your request to Solr to enable this feature.
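As an illustration, the query string for such a request could be built like this in Python. The helper name is hypothetical and the main query is a placeholder; only group and group.field come from the answer above, and group.ngroups (used in the usage line) is an optional Solr grouping parameter that asks for the number of groups.

```python
from urllib.parse import urlencode


def build_group_query(q="*:*", group_field="orderid", extra=None):
    """Return a URL-encoded Solr query string with result grouping enabled."""
    params = {"q": q, "group": "true", "group.field": group_field}
    # Optional additional parameters, e.g. a date-range filter for the duration.
    params.update(extra or {})
    return urlencode(params)


# Grouping by orderid, also asking Solr for the number of distinct groups.
query_string = build_group_query(extra={"group.ngroups": "true"})
```

Note that grouping only collects the entries of each order together; summing them into an order total still has to happen on the client or with another Solr feature.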

How could I go about getting distinct field counts in Azure Search

I have an index with around 35 million documents. When a user issues a query with any combination of search words and filters, I need to get a count of unique values on another field. The purpose is to answer the question "How many unique (field x) are there with a given query?".
I'm pretty sure that Azure Search doesn't have any capability to do this, so I thought I would try to do another query where I select just the field I want to count distinct values of, but I think this would be very time consuming with such a large index. I'm also under the impression that I can only skip at max 100,000 records, which would make it impossible for me to do this if a query returned more than 100k results.
Any ideas on how to go about this?
Thanks!
Azure Search doesn't directly support distinct count of values today. In order to support it in a single query combined with $filter, it would either have to be supported as a new facet type, or maybe with a combination of $count and $filter where the field being counted is the key field (note that $count and $filter can't be combined today).
Feel free to add distinct count to the Azure Search feedback forum to help prioritize the feature.
Original Answer
If you wanted a count of documents per unique value, you could use facets. For example, if you're searching for shoes under $100 and you want to know, out of the hits, how many shoes of each color there are, you would do this:
GET /indexes/products/docs?search=shoes&$filter=price+lt+100&facet=color&api-version=2015-02-28
The response will contain a @search.facets property that contains buckets for each unique value along with a count. You can find more info here and here.
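If the number of unique values is small enough to fit in the returned buckets, counting the buckets gives a rough distinct count. The Python sketch below assumes the response has already been parsed into a dict and that each bucket carries "value" and "count" keys; the function name and the sample response are invented for illustration. Facets return only a limited number of buckets unless a larger count is requested, so treat the result as a lower bound.

```python
def distinct_facet_values(response, field):
    """Return (number of buckets, {value: count}) for one faceted field."""
    buckets = response.get("@search.facets", {}).get(field, [])
    per_value = {bucket["value"]: bucket["count"] for bucket in buckets}
    return len(per_value), per_value


# Invented response fragment for the shoes query above.
sample_response = {
    "@search.facets": {
        "color": [
            {"value": "black", "count": 12},
            {"value": "brown", "count": 7},
            {"value": "red", "count": 2},
        ]
    },
    "value": [],
}
distinct_colors, docs_per_color = distinct_facet_values(sample_response, "color")
```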

Is there a way to subfacet a table which has been already “faceted”?

I have a table on which I'm applying a customized facet in order to find duplicates (on a column). Now I'd like to apply a new facet (on another column) on the table with the facet.
Is that possible? It seems that it can be used only one facet per time, and not combined ones together.
Cheers,
elisa
Facets can be combined on the same column or on multiple ones to narrow down your data.
When you facet on two different columns at the same time, the result is the combination of facet 1 AND facet 2. So in your case it will be, within your duplicate records, the records that match the criteria of your second facet.
You can also combine facets within the same column. You can read more about faceting here: http://googlerefine.blogspot.ca/2011/09/use-google-refine-to-navigate-data.html
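The AND behaviour can be pictured outside the tool with a few lines of Python. This is only an analogy over a list of dicts, not how the tool is implemented; the column names and rows are made up.

```python
from collections import Counter


def rows_matching_both_facets(rows, dup_column, other_column, other_value):
    """Keep rows that are duplicates on dup_column AND match other_column."""
    occurrences = Counter(row[dup_column] for row in rows)
    return [
        row for row in rows
        if occurrences[row[dup_column]] > 1      # facet 1: duplicates only
        and row[other_column] == other_value     # facet 2: chosen value
    ]


# Made-up table: "ann" appears twice, once per city.
table = [
    {"name": "ann", "city": "Rome"},
    {"name": "ann", "city": "Milan"},
    {"name": "bob", "city": "Rome"},
]
selected = rows_matching_both_facets(table, "name", "city", "Rome")
```

Only the first row survives: it is a duplicate on name and its city is Rome, while bob matches the city but is not a duplicate.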

How to select lots of counts for lots of different criteria

I'm trying to build a detailed Member Search page. It uses Ajax throughout, like the Linkedin search pages do.
But I don't know how to select counts for multiple criteria. You can see what I mean in the attachment. If I select every count with a separate query, it's going to take forever.
Should I store the count values in another table? That would make further development hard and time consuming.
I need your advice.
On this web site, you enter just a keyword and it shows you all the available fields ordered by count DESC.
You can create an Indexed View that groups by your criteria and uses COUNT_BIG to get totals.
CREATE VIEW dbo.TagCount
WITH SCHEMABINDING
AS
SELECT Tag, COUNT_BIG(*) AS CountOfDocs
FROM dbo.Docs
GROUP BY Tag
GO
CREATE UNIQUE CLUSTERED INDEX IX_TagCount ON dbo.TagCount (Tag)
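To see what the view returns, the same aggregation can be tried on a throwaway database. The sketch below uses Python's built-in sqlite3 module with an in-memory table; SQLite has no indexed views and no COUNT_BIG, so this only demonstrates the GROUP BY result that the indexed view would keep materialized, not the indexing itself. The table contents are invented.

```python
import sqlite3


def tag_counts(tags):
    """Return (Tag, CountOfDocs) rows ordered by count descending, then tag."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Docs (Tag TEXT NOT NULL)")
        conn.executemany("INSERT INTO Docs (Tag) VALUES (?)",
                         [(tag,) for tag in tags])
        # Same grouping as the view; plain COUNT(*) stands in for COUNT_BIG(*).
        return conn.execute(
            "SELECT Tag, COUNT(*) AS CountOfDocs FROM Docs "
            "GROUP BY Tag ORDER BY CountOfDocs DESC, Tag"
        ).fetchall()
    finally:
        conn.close()


counts = tag_counts(["sql", "solr", "sql", "vespa", "sql", "solr"])
```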
