How to find the number of duplicate documents in Solr based on an indexed field - solr

I have a few near-duplicate documents stored in Solr. The schema uses an autogenerated UUID as the unique key, so duplicates can get into the index. I need to get the count of duplicated documents based on one or more fields in the schema.
I am trying to get quick numbers without writing a client program and going through the full result set, ideally something on the Solr console itself.
I tried to use facets but was not able to get the total counts. The query below gives the duplicates for each value of 'idfield', but the results need to be iterated through to the last page and summed up (over a couple of million entries).
q=*:*&facet=true&facet.mincount=2&facet.field=idfield

A JSON Facet query can be used to find the number of unique values, as explained in this blog post:
http://yonik.com/solr-count-distinct/
Alternatively, it can be done with the collapse filter by taking the difference:
q=*:*&fq={!collapse field=idfield} - get numFound from this query and subtract it from the numFound of the match-all query (*:*)
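Both approaches above come down to one subtraction. The following is a minimal Python sketch of that arithmetic, not a definitive implementation: the function names are hypothetical, the facet label `distinct` is an arbitrary choice, and it assumes every document has a value in the field (documents without one would otherwise be counted as duplicates).

```python
def unique_count_params(field):
    """Request parameters for a JSON Facet query counting distinct values.

    The label "distinct" is an arbitrary name chosen for this sketch.
    """
    return {
        "q": "*:*",
        "rows": 0,  # only the counts are needed, not the documents
        "json.facet": '{distinct:"unique(%s)"}' % field,
    }


def duplicates_from_unique(response):
    """Extra documents beyond the first per distinct value.

    `response` is the parsed JSON body of the request built above.
    """
    facets = response["facets"]
    return facets["count"] - facets["distinct"]


def duplicates_from_collapse(total_num_found, collapsed_num_found):
    """Same figure from the collapse approach: numFound(*:*) minus numFound(collapsed)."""
    return total_num_found - collapsed_num_found
```

For example, a response whose `facets` section reports a count of 10 and 7 distinct values means 3 documents are duplicates of some other document.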

You can also use facet.mincount=2 to get duplicate documents by faceting on the field that is supposed to be unique. Ex: /solr/core/select?q=*:*&facet=on&facet.field=uniqueidfield&facet.mincount=2&facet.missing=true
You can also add facet.limit=-1&rows=0 to return every duplicated value with its count in a single response, without any document rows.
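Once every bucket comes back in one response, the totals can be summed in a few lines. This is a rough Python sketch, assuming the default flat facet layout (`[value, count, value, count, ...]`) found under `facet_counts.facet_fields.<field>`; the function name is hypothetical.

```python
def count_duplicates(facet_list):
    """Summarise a flat Solr facet list of the form [value, count, value, count, ...].

    Returns (number of duplicated values, number of extra documents).
    """
    values = facet_list[0::2]
    counts = facet_list[1::2]
    duplicated_values = 0
    extra_documents = 0
    for value, count in zip(values, counts):
        # facet.missing=true adds a bucket whose value is null (None in Python);
        # it counts documents without the field, so it is skipped here.
        if value is None or count < 2:
            continue
        duplicated_values += 1
        extra_documents += count - 1  # keep one document per value
    return duplicated_values, extra_documents
```

With a couple of million buckets the response will be large, so the JSON Facet or collapse approach is likely cheaper when only the total is needed.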

Related

How could I go about getting distinct field counts in Azure Search

I have an index with around 35 million documents. When a user issues a query with any combination of search words and filters, I need to get a count of unique values on another field. The purpose is to answer the question "How many unique (field x) are there with a given query?".
I'm pretty sure that Azure Search doesn't have any capability to do this, so I thought I would try to do another query where I select just the field I want to count distinct values of, but I think this would be very time consuming with such a large index. I'm also under the impression that I can only skip at max 100,000 records, which would make it impossible for me to do this if a query returned more than 100k results.
Any ideas on how to go about this?
Thanks!
Azure Search doesn't directly support distinct count of values today. In order to support it in a single query combined with $filter, it would either have to be supported as a new facet type, or maybe with a combination of $count and $filter where the field being counted is the key field (note that $count and $filter can't be combined today).
Feel free to add distinct count to the Azure Search feedback forum to help prioritize the feature.
Original Answer
If you wanted a count of documents per unique value, you could use facets. For example, if you're searching for shoes under $100 and you want to know, out of the hits, how many shoes of each color there are, you would do this:
GET /indexes/products/docs?search=shoes&$filter=price+lt+100&facet=color&api-version=2015-02-28
The response will contain a @search.facets property that contains buckets for each unique value along with a count. You can find more info here and here.
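As a rough illustration of the facet request and its response, here is a Python sketch. The helper names and the `max_buckets` parameter are hypothetical; the `count:N` facet option raises the number of buckets returned, and the number of buckets equals the number of distinct values only when that limit is not reached.

```python
from urllib.parse import urlencode


def facet_query(search, filter_expr, facet_field, max_buckets, api_version="2015-02-28"):
    """Build the query string for a faceted search request."""
    return urlencode({
        "search": search,
        "$filter": filter_expr,
        # "count:N" asks for up to N buckets for this facet
        "facet": f"{facet_field},count:{max_buckets}",
        "api-version": api_version,
    })


def facet_counts(response, facet_field):
    """Map each facet value to its document count from a parsed JSON response."""
    buckets = response["@search.facets"][facet_field]
    return {bucket["value"]: bucket["count"] for bucket in buckets}
```

Calling `len(facet_counts(response, "color"))` then gives the number of distinct colors among the hits, or a lower bound if the bucket limit was hit.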

How do I get the first and last document per SOLR facet, sorted by some field?

I have documents with multiple facets. I have different views on the website I'm creating to view the facet stats.
As well as showing the facet stats, I would like to show example documents from each facet - specifically, the first and last documents ordered by another field.
For example, properties for sale, I want to see the first and last (based on price) for each facet (the facet can be street, area, city, post code etc).
I can solve this by calling SOLR multiple times for each facet, but it seems like something that should be built in and if so, it would reduce roundtrips a LOT. (it would mean probably 2 SOLR calls per page instead of 30 or possibly more)
Instead of faceting, you can look into
https://wiki.apache.org/solr/FieldCollapsing
Then you only need two queries: one with group.sort ascending and one with group.sort descending on the field you want to sort by.
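The two grouped queries could be built along these lines. This is a sketch only: the function name is hypothetical, and `street` and `price` are example field names taken from the question.

```python
def group_params(query, group_field, sort_field, direction):
    """Parameters for a grouped Solr query returning one document per group."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return {
        "q": query,
        "group": "true",
        "group.field": group_field,
        "group.sort": f"{sort_field} {direction}",  # ordering inside each group
        "group.limit": 1,  # only the top document of each group
    }


# One request for the cheapest property per street, one for the most expensive.
cheapest = group_params("*:*", "street", "price", "asc")
priciest = group_params("*:*", "street", "price", "desc")
```

Each view (street, area, city, post code) would still need its own pair of requests, since a request groups on one field at a time in this sketch.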

Is there any way to convert a solr multifield value to single field for sort?

I have records that have multiple values, so I put those values in a multivalued field in the Solr document. The issue is that I also need to return an ordered list of these values. I have way too many records to pull all document values and sort them myself. I tried to create separate Solr documents to store just these values with the needed information, but managing this has become a nightmare. Attempting to keep comments low and managing memory has not been ideal for this solution.
Is there any way to copy these multivalued field values into single-valued fields for the same document and sort on these single-valued fields in Solr?
Thanks for any help.
Doesn't faceting help you? You won't need a copyField for multivalued/non-multivalued fields: just store the values in a multivalued field, facet on it, and set the sorting criterion for the facet (default: number of occurrences of each value).
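A sketch of what such a facet request might look like, in Python. The function names are hypothetical; `facet.sort=index` orders buckets by indexed term order (lexicographic for string fields) instead of by count.

```python
def sorted_values_params(query, field):
    """Facet parameters returning the field's values in index (term) order."""
    return {
        "q": query,
        "rows": 0,            # only the facet values are needed
        "facet": "true",
        "facet.field": field,
        "facet.sort": "index",  # the alternative is "count"
        "facet.limit": -1,      # return every value
        "facet.mincount": 1,    # skip values with no matching documents
    }


def values_in_order(facet_list):
    """Extract the values from a flat [value, count, value, count, ...] list."""
    return facet_list[0::2]
```

Note that this yields the ordered values across all matching documents, not a per-document ordering.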

Solr MultiValued scoring boost with search based on array

First, I'm fairly new to Solr and I'm far from sure that Solr is the right solution to this problem. The documents that I'm working on are already there, so if Solr can solve it, that would be great :)
One of the fields in a document is of type string and has the multiValued attribute set to true. It contains a list of ids that the current document relates to.
The task ahead is that I now have a second list of ids (same domain), and if any of these ids match, I would like to boost the score of the document (if more than one id matches, I want a higher score).
Use Boost Query if you are using dismax or edismax.
For example, bq=id:1 OR id:2 OR id:3 will boost documents which have at least one of the 3 ids. It will also give a higher boost to documents with more matching ids.
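Building the bq value from a list of ids is mechanical. Here is a minimal Python sketch; the function name is hypothetical, and it assumes the ids are simple tokens that need no query escaping.

```python
def build_boost_query(field, ids):
    """Join ids into a boost query such as 'id:1 OR id:2 OR id:3'."""
    if not ids:
        raise ValueError("at least one id is required")
    return " OR ".join(f"{field}:{value}" for value in ids)


# The result is sent as the bq parameter alongside defType=dismax or edismax.
boost = build_boost_query("id", [1, 2, 3])
```

The field name passed in should be the multivalued field that holds the related ids in your schema.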

Is there a way to subfacet a table which has been already “faceted”?

I have a table on which I'm applying a customized facet in order to find duplicates (on a column). Now I'd like to apply a new facet (on another column) on the table with the facet.
Is that possible? It seems that only one facet can be used at a time, and that facets cannot be combined.
Cheers,
elisa
Facets can be combined on the same column or on multiple columns to narrow down your data.
When you facet on two different columns at the same time, the result is the combination facet 1 AND facet 2. So in your case it will be the records, within your duplicates, that match the criteria of your second facet.
You can also combine facets within the same column. You can read more about faceting here: http://googlerefine.blogspot.ca/2011/09/use-google-refine-to-navigate-data.html
