Retrieving distinct documents from Solr

I've had a hard time explaining and finding what I need, so please put yourself in my shoes for a moment.
My requirement comes from a relational database background. I may be using Solr for something it wasn't designed to do, or maybe it can do what I need; I still need to confirm that. Hopefully you can assist me.
After indexing numerous documents into Solr, I need to retrieve distinct documents based on a filter. Think of it as retrieving distinct rows while also applying a WHERE condition.
For example, in a relational database, I may have the following columns
(Country) (City) (Whatever)
Egypt Cairo Hospitals
Egypt Alex Schools
Egypt Mansoura Hospitals
Egypt Cairo Schools
If I perform this query: SELECT DISTINCT Country, City FROM mytable
I should get the following rows
(Country) (City)
Egypt Alex
Egypt Mansoura
Egypt Cairo
Now, after indexing the original table (SELECT * FROM mytable), how can I achieve the SAME output from Solr? How can I retrieve documents that are distinct based on some fields? I will also need to apply a not-null filter on a specific field.
I don't need statistics of any kind, I only need to get the documents.
I hope I was clear enough. Thank you for your time.

This would be achievable with field collapsing by grouping on multiple fields, but unfortunately only one field is supported right now. There is an open issue for this; check it out.

Did you try faceting?
You could do something like this:
http://localhost:8983/solr/select/?q=*:*&facet=on&facet.field=city&facet.field=country
It will return every distinct city and country value, each with its count.
The wiki has more details if you want to learn more about it.
I hope this helps.

Another good solution available from Solr 4 is based on Pivot (Decision Tree) Faceting.
Try with:
/solr/collection1/select?q=*:*&facet=true&facet.pivot=Country,City
This should return:
"facet_counts" : {
"facet_queries" : {},
"facet_fields" : {},
"facet_dates" : {},
"facet_ranges" : {},
"facet_pivot" : {
"Country,City" : [ {
"field" : "Country",
"value" : "Egypt",
"count" : 4,
"pivot" : [ {
"field" : "City",
"value" : "Cairo",
"count" : 2
}, {
"field" : "City",
"value" : "Alex",
"count" : 1
}, {
"field" : "City",
"value" : "Mansoura",
"count" : 1
} ]
} ]
}
}
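Since the question asks only for the distinct (Country, City) pairs and not the counts, the pivot section can be flattened on the client. The following is a minimal sketch in Python, assuming the response has already been parsed into a dict shaped like the output above; the function name distinct_pairs is made up for this illustration.

```python
def distinct_pairs(facet_pivot, key="Country,City"):
    """Flatten a two-level facet_pivot section into (parent, child) value pairs."""
    pairs = []
    for parent in facet_pivot.get(key, []):
        # Each parent bucket carries its child buckets under "pivot"
        for child in parent.get("pivot", []):
            pairs.append((parent["value"], child["value"]))
    return pairs


# Sample data copied from the pivot output shown above
facet_pivot = {
    "Country,City": [
        {"field": "Country", "value": "Egypt", "count": 4,
         "pivot": [
             {"field": "City", "value": "Cairo", "count": 2},
             {"field": "City", "value": "Alex", "count": 1},
             {"field": "City", "value": "Mansoura", "count": 1},
         ]}
    ]
}

print(distinct_pairs(facet_pivot))
```

If only the pairs are needed, adding rows=0 to the request avoids fetching the documents themselves, and a filter such as fq=Whatever:[* TO *] should cover the not-null requirement from the question (the field name is taken from the example table).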

Related

Solr faceting on a Query Function result

Is it possible to produce solr facets for a field which is the result of Query Function?
I have an index of products with a price field for each store they are available in:
{
"id" : "p1",
"name_s" : "Product 1",
"description_s" : "The first product",
"price_l1_d" : 19.99,
"price_l2_d" : 20.00,
"price_l3_d" : 20.99,
"price_l4_d" : 19.99,
"price_l5_d" : 25.00,
"price_l6_d" : 18.00
},
{
"id" : "p2",
"name_s" : "Product 2",
"description_s" : "The second product",
"price_l1_d" : 12.99,
"price_l2_d" : 15.00,
"price_l3_d" : 13.49,
"price_l4_d" : 14.00,
"price_l5_d" : 12.50,
"price_l6_d" : 16.00
}
and I need my query to return the cheapest price in the customer's 3 closest stores.
I know I can return this value using fl=min(price_l2_d, price_l4_d, price_l6_d) and I can even sort on this but is it possible to return a "Price" facet based on this value for each document? Ideally I'd like to be able to show all products whose minimum price (in my 3 stores) is between 0-5, 5-10, 10-15, 15-20 etc etc and filter on this.
I've tried using min(price_l2_d, price_l4_d, price_l6_d) as facet.field but I receive an undefined field error. Is there a better way?
I cannot produce this value at index time because the closest 3 stores could be any combination of three price fields (there are 6 in this example, but there are likely to be over 200).
While not THE solution, I have found A solution which should work. Unfortunately it's not possible to create a traditional facet for price ranges as you would with a single integer attribute, but a two-point slider is possible.
Using the JSON facet API (as suggested by a comment on the original question) and the following:
{
"max" : "max(min(price_l2_d, price_l4_d, price_l6_d))",
"min" : "min(min(price_l2_d, price_l4_d, price_l6_d))"
}
I can return the boundaries of the slider with the smallest minimum price at the three stores and the biggest minimum price.
The values on this slider can then be applied using the {!frange} function as follows:
fq={!frange l=0 u=20}min(price_l2_d, price_l4_d, price_l6_d)
where l is the lower bound and u is the upper bound
Hopefully this helps anyone else looking for an answer to this.
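To get back to the fixed ranges from the question (0-5, 5-10, ...), the same {!frange} filter could be generated once per range. This is only a sketch: the helper name price_bucket_filters is hypothetical, and it assumes frange's incu=false option to make the upper bound exclusive so that neighbouring ranges do not overlap.

```python
def price_bucket_filters(price_fields, step, upper):
    """Build one {!frange} filter string per price range [low, low + step)."""
    func = "min({})".format(",".join(price_fields))
    filters = []
    low = 0
    while low < upper:
        high = low + step
        label = "{}-{}".format(low, high)
        # incu=false excludes the upper bound, so 5.00 falls in 5-10, not 0-5
        fq = "{{!frange l={} u={} incu=false}}{}".format(low, high, func)
        filters.append((label, fq))
        low = high
    return filters


# The three closest stores from the question
stores = ["price_l2_d", "price_l4_d", "price_l6_d"]
for label, fq in price_bucket_filters(stores, 5, 20):
    print(label, fq)
```

Each generated string can be used as an fq value when the user picks a range. The same strings may also work as facet.query values to get a count per range, but that is untested here.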

Group by and Count(*) in Datastax Search/Solr

Hi, we have a Solr index with different fields in it, like business, businessType, regionName, StateName, ...
Now I need a Solr query to get the number of businesses with businessType = 'event', grouped by regionName.
As a SQL query it would be: select region_name, count(business) from solr where businessType='event' group by region_name
Any pointers would be helpful.
I finally figured out how to do this. Note, if you need to query on a field with a space or a special character, you need to put the search term in quotes, e.g. businessType:"(fun) event".
curl http://localhost:8983/solr/yourCollection/query -H 'Content-Type: application/json' -d '
{ "query": "*:*",
"filter": "businessType:event",
"limit": 0,
"facet": { "category" : {
"type": "terms",
"field" : "region_name",
"limit" : -1 }}
}'
One more note: if you want to count over 2 fields, you have to do a nested facet.
curl http://localhost:8983/solr/yourCollection/query -H 'Content-Type: application/json' -d '
{ "query": "*:*",
"filter": "businessType:event",
"limit": 0,
"facet": { "category1" : {
"type": "terms",
"field" : "regionName",
"limit" : -1,
"facet" : { "category2" : {
"type": "terms",
"field" : "stateName",
"limit" : -1
}}}}
}'
Add another facet chunk after the "limit":-1 item if you need to group by a third dimension. I tried this on my company's Solr and it hung, never returning anything but a timeout error. In general, working with Solr isn't very easy... and the documentation, IMO, is pretty terrible. And absolutely nothing about the syntax or names of the commands seem intuitive at all...
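Nesting the facet chunks by hand gets error-prone past two levels, so the request body could be generated instead. The sketch below is one possible way to do it in Python; the helper name nested_terms_facet is made up, and the top-level keys ("query", "filter", "limit", "facet") assume Solr's JSON Request API naming.

```python
import json


def nested_terms_facet(fields, limit=-1):
    """Build a nested JSON Facet API "terms" facet, one level per field."""
    facet = None
    # Work from the innermost field outwards so each level can wrap the previous one
    for depth in range(len(fields), 0, -1):
        level = {"type": "terms", "field": fields[depth - 1], "limit": limit}
        if facet is not None:
            level["facet"] = facet
        facet = {"category{}".format(depth): level}
    return facet


body = {
    "query": "*:*",
    "filter": "businessType:event",
    "limit": 0,
    "facet": nested_terms_facet(["regionName", "stateName"]),
}

print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the POST body to the /query endpoint. A third grouping field only requires appending it to the list. Note that "limit": -1 at every level asks for all buckets, which may be the cause of the timeout described above on large indexes.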
Use facets. Your Solr query will look like this: q=*:*&fq=businessType:event&facet=true&facet.field=region_name&rows=0
If you want to group by multiple fields, use facet.pivot=state,region_name

Solr Facet, how to exclude only the main query?

My Solr core "catalog" has these fields:
"category":yyyyy
"s_brand":zzzzz
"visibilty":hhhhh
"color":wwww
The starting point, which I call the "main query", is:
q=category:computer
So I only want computers in my results.
Then I add a facet on s_brand:
facet.field=s_brand&facet=on&q=category:computer
Now I have the list of computers and the facet "s_brand" with constraints like:
s_brand :
ACER (5)
APPLE(10)
HP(4)
BANG&OLUFSEN(0)
ADIDAS(0)
REEBOK(0)
This list includes all brands, not only the brands that match the main query category:computer.
So I set the minimum count of the s_brand facet to 1 (facet.mincount=1):
s_brand :
ACER (5)
APPLE(10)
HP(4)
That's good. BUT!
Now I add another facet, such as color:
s_brand :
ACER (5)
APPLE(11)
HP(4)
color:
PINK(7)
BLUE(13)
Then I add a filter query on the brand (always keeping the main query category:computer):
fq=s_brand:ACER
and I get:
s_brand :
ACER(5)
color:
PINK(2)
BLUE(3)
I want to keep the other constraints visible (for UX reasons), so I exclude the filter from the facet:
fq={!tag=dt}s_brand:ACER&facet=true&facet.field={!ex=dt}s_brand
Results:
s_brand :
ACER (5)
APPLE(11)
HP(4)
color:
PINK(7)
BLUE(13)
So I get every facet value that matches my main query, but the counts don't reflect my filter query.
How can I get facets that:
1. respect my main query (only category computer)
2. return all facet values that match this main query
3. but return the counts computed after the filter query
Like this:
s_brand :
ACER (5)
APPLE(0)
HP(0)
color:
PINK(2)
BLUE(3)
Edit (11/10/2019): my current solution:
Two facets on the same field, one with the exclusion and one without. Then I merge the results in PHP. I don't like it, but it works.
I also found this: https://lucene.apache.org/solr/guide/7_5/json-facet-api.html
It covers buckets and sub-facets (nested facets).
The results are quite unusual and need heavy post-processing afterwards, but it is really powerful and seems to be the right way.
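The merge step of the two-facet workaround is small. Here is a rough sketch of it, written in Python rather than the PHP mentioned above; the function name merge_facets is made up, and it assumes both facet results have already been converted to value-to-count dicts (Solr's default JSON output returns facet_fields as a flat [value, count, ...] list).

```python
def merge_facets(unfiltered, filtered):
    """Keep every value from the unfiltered facet, but show the filtered count (0 if absent)."""
    return {value: filtered.get(value, 0) for value in unfiltered}


# Facet computed with {!ex=dt}: every brand matching the main query
all_brands = {"ACER": 5, "APPLE": 11, "HP": 4}
# Same facet without the exclusion: only brands left after fq on ACER
after_filter = {"ACER": 5}

print(merge_facets(all_brands, after_filter))
```

The excluded facet supplies the list of values to display and the non-excluded facet supplies the counts, which matches the desired output in the question.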

Treat two facets as the same value

Assume a list of books with an Author field. How might one facet on the Author field, but treat the values "Stephen King" and "Richard Bachman" as the same? So that these results:
Hemmingway: 8
Stephen King: 10
Edgar Allan Poe: 20
Richard Bachman: 5
Would be displayed as:
Hemmingway: 8
Stephen King: 15
Edgar Allan Poe: 20
Note that it is unimportant if the facet title is "Stephen King", "Richard Bachman", or something else. It is only important that they are faceted together.
Note that a query-time solution is needed. Unfortunately the schema cannot be changed for this index: it is a general-purpose index, and if every user could make their own schema 'tweak' it would get out of hand.
You can achieve that by combining facet fields with facet queries.
Add these to your query:
&facet=true
&facet.field=author
&facet.query=author:("Hemmingway" OR "Stephen King")
Facets returned will look like this:
facet_counts: {
facet_queries: {
"author:("Hemmingway" OR "Stephen King")" : 18
}
facet_fields: {
author: {
"Hemmingway" : 8,
"Stephen King" : 10,
"Edgar Allan Poe" : 20,
"Richard Bachman" : 5
}
}
}
You can also add an 'alias' to the facet query. Change this
&facet.query=author:("Hemmingway" OR "Stephen King")
To
&facet.query={!ex=dt key="Hemmingway"}author:("Hemmingway" OR "Stephen King")
And the facet query output will be:
facet_queries: {
"Hemmingway" : 18
}
I'm not sure if you can merge both output fields (facet_queries and facet_fields) from Solr, but doing that from any client should be straight-forward.
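The client-side merge could look roughly like the sketch below. It assumes both sections have been parsed into dicts and that the facet query was aliased with a key, as shown above. The function name merge_grouped_facets, the groups mapping, and the sample count of 15 (taken from the expected result in the question, for a query over "Stephen King" OR "Richard Bachman") are illustrative assumptions.

```python
def merge_grouped_facets(facet_fields, facet_queries, groups):
    """Replace grouped field values with the single aliased facet.query count."""
    grouped = {member for members in groups.values() for member in members}
    # Keep the ungrouped values as they are
    merged = {value: count for value, count in facet_fields.items() if value not in grouped}
    # Add one entry per alias, using the count Solr computed for the facet query
    for alias in groups:
        merged[alias] = facet_queries[alias]
    return merged


facet_fields = {"Hemmingway": 8, "Stephen King": 10, "Edgar Allan Poe": 20, "Richard Bachman": 5}
facet_queries = {"Stephen King": 15}  # hypothetical aliased facet.query result
groups = {"Stephen King": ["Stephen King", "Richard Bachman"]}

print(merge_grouped_facets(facet_fields, facet_queries, groups))
```

Taking the count from the facet query rather than summing the two field counts keeps the number correct even if the author field is multi-valued and one book lists both names.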
You need an analysis chain that converts the strings. I think SynonymFilter will do this for you if you apply it at index time and at query time. You would need to make sure the synonym mapping goes one way only.
I assume you do not need the whole list of facets, just the top n authors. If so, you can do it in a post-processing step.
You know your synonyms, so if you request a slightly higher facet.limit (let's say 2*n), you only have to filter the synonyms out of the result set. If you end up with fewer than n results, repeat the previous step (in the worst case you need one or more extra requests, depending on the number of synonyms).
For example: ...&facet=true&facet.field=author&facet.limit=100&facet.mincount=1
This approach has nothing to do with Solr itself, but considering all the restrictions it might just cut it.
Best regards,

Update a new field to existing document

Is it possible to add a new field to an existing document?
For example:
There is a document with several fields, e.g.
ID=99999
Field1:text
Field2:text
This document is already in the index. Now I want to add a new field to this document WITHOUT resending the old data:
ID=99999
Field3:text
Currently, the old document is deleted and a new document with the same ID is created. So if I now search for ID 99999, the result is:
ID=99999
Field3:text
I read this on the Solr Wiki:
How can I update a specific field of an existing document?
I want update a specific field in a document, is that possible? I only need to index one field for a specific document. Do I have to index all the document for this?
No, just the one document. Let's say you have a CMS and you edit one document. You will need to re-index this document only by using the add solr statement for the whole document (not one field only).
In Lucene to update a document the operation is really a delete followed by an add. You will need to add the complete document as there is no such "update only a field" semantics in Lucene.
So is there any solution for this? Will this feature be implemented in a future version (I currently use 3.6.0)? As a workaround, I thought about writing a script or an application that collects the existing fields, adds the new field, and updates the whole document. But I think performance will suffer. Do you have any other ideas?
Best regards
I have 2 answers for you (both more or less bad):
1. To update a field within a document in Solr you have to reindex the whole document (to update Field3 within document ID:99999 you have to reindex that document with values for all fields).
2. Solr 4 implemented a feature like that, but with a condition: all fields have to be stored, not just indexed. What happens is that Solr uses the stored values and reindexes the document in the background. If you are interested, there is a nice article about it: http://solr.pl/en/2012/07/09/solr-4-0-partial-documents-update/ This solution has an obvious flaw: the size of the index when you are storing all fields.
I hope this helps with your problem. If you have more questions, please ask.
It is possible to do this in Solr 4. For example, consider the following document:
{
"id": "book123",
"name" : "Solr Rocks"
}
In order to add an author field to the document, the field value would be a JSON object with a "set" attribute holding the new value:
$ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
[
{"id" : "book123",
"author" : {"set":"The Community"}
}
]'
The updated document
$ curl 'http://localhost:8983/solr/get?id=book123'
will be
{
"doc" : {
"id" : "book123",
"name" : "Solr Rocks",
"author": "The Community"
}
}
"set" will add or replace the author field. Along with "set" you also have the options to increment ("inc") and to add a value ("add").
From Solr 4 onwards you can update a single field in Solr; there is no need to resend the whole document. Various modifiers are supported:
set – set or replace a particular value, or remove the value if null is specified as the new value
add – adds an additional value to a list
remove – removes a value (or a list of values) from a list
removeregex – removes all values from a list that match the given Java regular expression
inc – increments a numeric value by a specific amount (use a negative value to decrement)
Example document:
{
"id": "1",
"name" : "Solr",
"views" : 2
}
Now update it with
$ curl http://localhost:8983/solr/demo/update -H 'Content-type:application/json' -d '
[
{"id" : "1",
"author" : {"set":"Neal Stephenson"},
"views" : {"inc":3}
}
]'
which will result in
{
"id": "1",
"name" : "Solr",
"views" : 5,
"author" : "Neal Stephenson"
}
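When building such updates from code, generating the body with a JSON library avoids slips like missing or trailing commas. The following is a small sketch in Python; the helper name atomic_update is made up, and it only builds the request body (sending it to the /update endpoint is left out).

```python
import json

# Modifier names as listed above
ALLOWED_MODIFIERS = {"set", "add", "remove", "removeregex", "inc"}


def atomic_update(doc_id, changes):
    """Build an atomic update body; changes maps field -> (modifier, value)."""
    doc = {"id": doc_id}
    for field, (modifier, value) in changes.items():
        if modifier not in ALLOWED_MODIFIERS:
            raise ValueError("unsupported modifier: {}".format(modifier))
        doc[field] = {modifier: value}
    # Solr expects a JSON array of documents
    return json.dumps([doc])


payload = atomic_update("1", {
    "author": ("set", "Neal Stephenson"),
    "views": ("inc", 3),
})
print(payload)
```

The resulting string can be used as the -d argument of the curl command above, together with the Content-type:application/json header.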

Resources