Is there a way to subfacet a table which has been already “faceted”? - google-refine

I have a table on which I'm applying a customized facet to find duplicates in one column. Now I'd like to apply a second facet, on another column, to the table that is already faceted.
Is that possible? It seems that only one facet can be used at a time, and that facets cannot be combined.
Cheers,
elisa

Facets can be combined, on the same column or on multiple columns, to narrow down your data.
When you facet on two different columns at the same time, the result is the combination facet 1 AND facet 2. So in your case you will get, within your duplicate records, the records that match the criteria of your second facet.
You can also combine facets within the same column. You can read more about faceting here: http://googlerefine.blogspot.ca/2011/09/use-google-refine-to-navigate-data.html
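To make the AND behaviour concrete, here is a minimal Python sketch of the same logic outside Refine. The rows, the column names and the two helper functions are made up for illustration; this is not how Refine implements facets, only a model of the result you should expect.

```python
from collections import Counter

# Hypothetical sample rows; the column names are made up for illustration.
rows = [
    {"email": "a@example.com", "country": "IT"},
    {"email": "a@example.com", "country": "FR"},
    {"email": "b@example.com", "country": "IT"},
    {"email": "c@example.com", "country": "IT"},
    {"email": "c@example.com", "country": "IT"},
]


def duplicate_facet(rows, column):
    """Keep rows whose value in `column` occurs more than once."""
    counts = Counter(row[column] for row in rows)
    return [row for row in rows if counts[row[column]] > 1]


def value_facet(rows, column, value):
    """Keep rows whose value in `column` equals `value`."""
    return [row for row in rows if row[column] == value]


# Facet 1 AND facet 2: the second facet only sees rows that passed the first.
duplicates = duplicate_facet(rows, "email")
result = value_facet(duplicates, "country", "IT")
```

The second filter never sees the row for b@example.com, because it was already excluded by the duplicate facet.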

Related

How to find distinct records in vespa.ai?

We have a use case where we need to find the distinct (unique) records.
Each document has 5 different keys, all of them searchable, and we need to find the distinct records using one of those keys.
I also need to implement pagination over those distinct records.
See https://docs.vespa.ai/documentation/grouping.html. The Vespa grouping language also supports pagination.
Example:
select ... | all(group(key) max(10) each( max(3) each(output(summary()))))
This groups hits by the key field, displays at most 10 unique key values and, for each unique key value, renders the 3 best hits. By default, groups are ordered by the maximum relevance of a hit in the group. When you use max(), you can paginate with the continuation parameter to fetch more groups or more hits.
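If you build the request in client code, a small helper along these lines may help. It is only a sketch: the endpoint URL and the match-all select statement are assumptions that stand in for your own query, the helper names are hypothetical, and continuation handling is left out.

```python
from urllib.parse import urlencode


def grouping_expression(field, max_groups, hits_per_group):
    """Build a grouping expression of the same shape as the example."""
    return (
        f"all(group({field}) max({max_groups}) "
        f"each(max({hits_per_group}) each(output(summary()))))"
    )


def search_url(endpoint, select, field, max_groups=10, hits_per_group=3):
    """Attach the grouping expression to a YQL select and URL-encode it."""
    expression = grouping_expression(field, max_groups, hits_per_group)
    yql = f"{select} | {expression}"
    return f"{endpoint}?{urlencode({'yql': yql})}"


# Assumed values: a local endpoint and a match-all select statement.
url = search_url(
    "http://localhost:8080/search/",
    "select * from sources * where true",
    "key",
)
```

Changing max_groups and hits_per_group controls the page size for groups and for hits inside each group.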

Solr group by multiple fields

Does Solr allow grouping by multiple fields, much like SQL does with GROUP BY?
If so how?
For example: If I wanted to group by name and email. I have tried adding multiple group fields...
group=true&group.field=name&group.field=email
But it only groups by one of the fields.
I have looked at other posts with similar questions, but none had a verified answer.
I'm afraid you cannot get the intersection of grouped results at query time using group.field.
The remaining solution is to combine the two fields into a new field at index time and group on that new field, which gives you the results you want.
See: https://lucene.apache.org/solr/guide/6_6/result-grouping.html
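A minimal Python sketch of the index-time combination, assuming you can pre-process documents before sending them to Solr. The target field name name_email and the "|" separator are hypothetical choices, and the field would also have to exist in your schema.

```python
def add_group_key(doc, fields=("name", "email"), target="name_email", sep="|"):
    """Return a copy of `doc` with the chosen fields joined into one field."""
    combined = sep.join(str(doc.get(field, "")) for field in fields)
    return {**doc, target: combined}


# Made-up documents for illustration.
docs = [
    {"id": "1", "name": "Ann", "email": "ann@example.com"},
    {"id": "2", "name": "Ann", "email": "ann@example.com"},
    {"id": "3", "name": "Ann", "email": "other@example.com"},
]
indexed = [add_group_key(doc) for doc in docs]

# Query parameters that group on the combined field instead of two fields.
group_params = {"q": "*:*", "group": "true", "group.field": "name_email"}
```

Pick a separator that cannot occur in either field, otherwise two different name and email pairs could produce the same combined value.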

How to find the number of duplicate documents in Solr based on an indexed field

I have a few near-duplicate documents stored in Solr. The schema uses an autogenerated UUID as the unique key, so duplicates can get into the index. I need to get the count of duplicated documents based on one or more fields in the schema.
I am trying to get quick numbers without writing a client program that walks the full result set; ideally something I can run from the Solr console itself.
I tried to use facets but could not get the total count. The query below gives the duplicates for each value of 'idfield', but the results have to be iterated to the last page and summed up (over a couple of million entries).
q=*:*&facet=true&facet.mincount=2&facet.field=idfield
A JSON facet query can be used to count the unique values, as explained in this blog post:
http://yonik.com/solr-count-distinct/
Or it can be done using the collapse filter and taking the difference:
q=*:*&fq={!collapse field=idfield} - take numFound from this query and subtract it from the numFound of the match-all query (*:*)
You can also use facet.mincount=2 to get duplicate documents by faceting on the unique id field. Ex: /solr/core/select?q=*:*&facet=on&facet.field=uniqueidfield&facet.mincount=2&facet.missing=true
You can also add facet.limit=-1&rows=0 to list every duplicated id value without returning the documents themselves.
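If you do end up summing on the client side, the arithmetic is short. The sketch below is an illustration only: the facet values and numFound figures are made up, the helper names are hypothetical, and it assumes facet counts arrive in Solr's default flat [value, count, ...] layout and that every document has a value in idfield.

```python
def facet_pairs(flat):
    """Turn a flat [value, count, value, count, ...] list into pairs."""
    return list(zip(flat[0::2], flat[1::2]))


def extra_copies_from_facets(flat):
    """Sum (count - 1) over facet buckets that hold more than one document."""
    return sum(count - 1 for _, count in facet_pairs(flat) if count > 1)


def extra_copies_from_collapse(total_num_found, collapsed_num_found):
    """Documents beyond the first one for each distinct value."""
    return total_num_found - collapsed_num_found


# Made-up facet output for facet.field=idfield with facet.mincount=2.
facet_counts = ["a1", 3, "b7", 2, "c9", 2]

# Made-up numFound values: 10 documents in total, 6 after collapsing.
from_facets = extra_copies_from_facets(facet_counts)
from_collapse = extra_copies_from_collapse(10, 6)
```

Both approaches count the surplus copies, that is, the documents you would remove to leave exactly one per idfield value.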

Checking if two fields have same value in solr for a document

I have a huge Solr index. I want to find all the documents that have the same value in two different fields of a SolrDocument.
Is that possible with a Solr query?
For example, if sn1 and sn2 are two fields in the Solr schema, I am interested in finding all results where the values in sn1 and sn2 are equal.

How do I get the first and last document per SOLR facet, sorted by some field?

I have documents with multiple facets. I have different views on the website I'm creating to view the facet stats.
As well as showing the facet stats, I would like to show example documents from each facet - specifically, the first and last documents ordered by another field.
For example, properties for sale, I want to see the first and last (based on price) for each facet (the facet can be street, area, city, post code etc).
I can solve this by calling SOLR multiple times for each facet, but it seems like something that should be built in and if so, it would reduce roundtrips a LOT. (it would mean probably 2 SOLR calls per page instead of 30 or possibly more)
Instead of faceting, you can look into
https://wiki.apache.org/solr/FieldCollapsing
Then you only need two queries, one with group.sort ascending and one descending, on the field by which you want to sort.
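A minimal sketch of the two queries, built as URL parameters in Python. The field names street and price come from the example in the question; the helper name is hypothetical, and group.limit is set to 1 so that each group returns only its top document.

```python
from urllib.parse import urlencode


def group_query(group_field, sort_field, direction):
    """Build Solr params that return one document per group, sorted in-group."""
    return urlencode({
        "q": "*:*",
        "group": "true",
        "group.field": group_field,
        "group.sort": f"{sort_field} {direction}",
        "group.limit": 1,
    })


# One query for the cheapest property per street, one for the most expensive.
first_per_group = group_query("street", "price", "asc")
last_per_group = group_query("street", "price", "desc")
```

Append either string to your select handler URL; repeat with a different group_field for area, city or post code.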

Resources