Distributed search in SOLR - solr

I am using Solr 1.3.0 to perform a distributed search over existing Lucene indices. Is there any way to find out which shard a given result came from?
P.S.: I am using the REST API.

For Solr sharding -
Documents must have a unique key and the unique key must be stored
(stored="true" in schema.xml)
I think the logic should already exist on your side, in whatever code feeds the data to the shards, since the ids need to be unique across shards.
e.g. the simplest scheme is an odd/even split, but you may have a more complex rule for distributing the data across the shards.
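A minimal sketch of the odd/even idea, assuming numeric unique keys. The function names, the shard names and the `shard` field are hypothetical, chosen only for illustration; the point is that the same rule used at feed time can tell you later where a document lives.

```python
def shard_for_id(doc_id: int, num_shards: int = 2) -> int:
    """Return the shard index for a numeric unique key (odd/even when num_shards is 2)."""
    return doc_id % num_shards


def tag_with_shard(doc: dict, shard_names: list) -> dict:
    """Return a copy of the document with a 'shard' field added (hypothetical field name)."""
    tagged = dict(doc)
    tagged["shard"] = shard_names[shard_for_id(doc["id"], len(shard_names))]
    return tagged


if __name__ == "__main__":
    shards = ["shard_even", "shard_odd"]  # assumed names, not from the question
    print(tag_with_shard({"id": 7, "title": "example"}, shards))
```

Because the rule is deterministic, the client can also recompute the shard from the id of any search hit without storing anything extra.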

You may be able to get some information using debugQuery=on, but if this is something that you'll query often I'd add a specific stored field for the shard name.
PS: Solr doesn't have a REST API.
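As a rough sketch of what such a request might look like: the snippet below only builds the query string for a distributed search, using the `shards`, `fl` and `debugQuery` parameters. The host names and the stored `shard` field are assumptions for illustration, and nothing here talks to a server.

```python
from urllib.parse import urlencode


def distributed_query_url(base_url, shards, query, debug=False):
    """Build a /select URL that fans the query out to the given shards."""
    params = {
        "q": query,
        "shards": ",".join(shards),  # comma-separated host:port/path entries
        "fl": "id,shard",            # 'shard' is a hypothetical stored field
    }
    if debug:
        params["debugQuery"] = "on"
    # Keep ':', '/' and ',' readable instead of percent-encoding them.
    return base_url + "/select?" + urlencode(params, safe=":/,")


if __name__ == "__main__":
    print(distributed_query_url(
        "http://localhost:8983/solr",
        ["host1:8983/solr", "host2:8983/solr"],  # assumed hosts
        "lucene",
        debug=True,
    ))
```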

Related

Is it possible to retrieve Hbase data along with Solr data?

I have a pipeline of HBase, Lily, Solr and Hue set up for search and visualization. I am able to search the data indexed in Solr using Hue, but I cannot view all the required data because not all of the fields from HBase are stored in Solr. I'm not planning on storing all of the data either.
So is there a way of retrieving those fields from HBase along with the Solr response for visualizing the data with Hue?
From what I know, I believe it is possible to set up the Solr search handler to do this, but I haven't been able to find a concrete example to help me understand better (I am very new to both Solr and HBase, so examples help).
My question is similar to this question. But I am unable to comment there for further information.
Current Solution thanks to suggestion by Romain:
Used HTML widget to provide a link for each record in Hue Search page back to the Hbase record on the Hbase Browser.
One approach is to fetch the required ids from Solr and then get the actual data from HBase. Solr gives you the matching documents and counts for your query, plus faceting features; once you have those, the full data is always available in HBase. Solr is best for index search, so given the trade-off between speed and space, this design can help. Another main reason is that HBase gives you good fetch times for an entire row when it is looked up by row key. So the overall performance also depends on your HBase row key design.
I think you are using the Lily HBase Indexer, if I am not wrong. By default the doc id is then the HBase row key, which might make things easy.
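The two-step lookup described above could be sketched like this. It assumes the Solr JSON response shape (`response` -> `docs`) and that each doc `id` is the HBase row key, as the last answer suggests; `fetch_row` is a hypothetical callable standing in for whatever HBase client you use, and the demo replaces it with a plain dict.

```python
def enrich_with_hbase(solr_response: dict, fetch_row) -> list:
    """Merge each Solr hit with the full row fetched by its id (the HBase row key)."""
    enriched = []
    for doc in solr_response["response"]["docs"]:
        row = fetch_row(doc["id"]) or {}  # missing rows become empty dicts
        merged = dict(row)
        merged.update(doc)                # fields returned by Solr win on conflict
        enriched.append(merged)
    return enriched


if __name__ == "__main__":
    # In-memory stand-in for HBase, for illustration only.
    fake_hbase = {"row1": {"cf:body": "full text kept only in HBase"}}
    response = {"response": {"numFound": 1,
                             "docs": [{"id": "row1", "title": "hello"}]}}
    print(enrich_with_hbase(response, fake_hbase.get))
```

Only the ids on the current result page are looked up, so the cost on the HBase side stays proportional to the page size rather than to the index size.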

Solr Cloud Document Routing

Currently I have a zookeeper multi solr server, single shard setup. Unique ids are generated automatically by solr.
I now have a zookeeper multi solr server, multi shard requirement. I need to be able to route updates to a specific shard.
After reading http://searchhub.org/2013/06/13/solr-cloud-document-routing/ I am concerned that I cannot allow solr to generate random unique ids if I want to route updates to a specific shard.
Can anyone confirm this for me and perhaps explain the best approach?
Thanks
There is no way you can route your documents to a particular shard, since that is managed by ZooKeeper.
The solution to your problem is to create two collections instead of two shards. Use two servers for your 1st collection and the third server for your 2nd collection; you can then send your updates to a particular collection. The design should look like
collection1---->shard1---->server1,server2
collection2---->shard1----->server3
This way you can separate your indexes as per your requirement.
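On the client side, the two-collection layout above might be handled with something like the following sketch. The server URLs are hypothetical placeholders matching the diagram, and the function only builds the update URL for a collection; it does not send anything.

```python
# Hypothetical hosts, mirroring the collection -> server layout shown above.
COLLECTION_SERVERS = {
    "collection1": ["http://server1:8983/solr", "http://server2:8983/solr"],
    "collection2": ["http://server3:8983/solr"],
}


def update_url(collection: str, replica: int = 0) -> str:
    """Return the update endpoint of one server hosting the given collection."""
    servers = COLLECTION_SERVERS[collection]
    return f"{servers[replica % len(servers)]}/{collection}/update"


if __name__ == "__main__":
    print(update_url("collection1", replica=1))
    print(update_url("collection2"))
```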

Document tagging

I have a very large Solr index. I want to tag every document with the terms that best represent it, like this. Do clustering results of this type also count as document tagging?
Which approach is better: index-time document tagging, or query-time document tagging like carrot2?
Query time has the obvious drawback that this makes the query more expensive.
However, the clustering results at query time are supposedly better, because at that time, more information has been seen and user feedback can be incorporated.
Note that technically, this is probably more frequent pattern mining than cluster analysis.
Maybe you should just try this variant of frequent pattern mining on your whole data set. You might not even need to store which documents were tagged which way - the solr engine should already be optimized to retrieve them again when needed.
I understood from your question that you want to know how to implement something similar to carrot2 faceting using solr.
IMO you can add a multivalued field tag to your documents (see this Stack Overflow Question for an example) with the cluster names for that doc, and then build facets using that field as explained in Solr wiki here and here.
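A small sketch of that idea, assuming the multivalued field is simply called `tag` as in the answer. The helper names are made up for illustration; the query uses the standard `facet`, `facet.field` and `facet.limit` parameters and asks for zero rows so that only the facet counts come back.

```python
from urllib.parse import urlencode


def tagged_doc(doc_id, text, tags):
    """Build a document whose multivalued 'tag' field holds its cluster names."""
    return {"id": doc_id, "text": text, "tag": list(tags)}


def tag_facet_query(query="*:*", limit=10):
    """Build query parameters that return facet counts over the 'tag' field."""
    params = [
        ("q", query),
        ("rows", 0),            # facet counts only, no documents
        ("facet", "true"),
        ("facet.field", "tag"),
        ("facet.limit", limit),
    ]
    return urlencode(params)


if __name__ == "__main__":
    print(tagged_doc("doc1", "intro to solr sharding", ["solr", "sharding"]))
    print(tag_facet_query())
```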

Is it possible to use Solr to query multiple, lucene and non-lucene indexes

I'm wondering if it's possible to use Solr to query more than one index and combine the results.
The concrete problem is a web site based on various PDFs & DOCs as well as Notes documents. The Notes documents are user-restricted and should not appear in search results unless the user is authorised to view the document.
I think the simple docs could be searched for using Solr and Lucene and the Notes documents using Notes search.
Is there a way to extend Solr to search multiple indexes and merge the results?
I don't think that's possible. It sounds like that logic should be in the application layer. One approach to consider is a field in the schema that indicates the type of document (like notes) or access level (public, private); you could then exclude those documents from the search results:
q=search+keywords&fq=-DocType:notes
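In the application layer, that filter could be applied conditionally, roughly as below. This is only a sketch: `can_view_notes` is a hypothetical flag that your own authorisation check would supply, and `DocType` is the field name used in the query above.

```python
from urllib.parse import urlencode


def search_params(keywords: str, can_view_notes: bool) -> str:
    """Build the query string, hiding Notes documents from unauthorised users."""
    params = [("q", keywords)]
    if not can_view_notes:
        # Negative filter query: drop every document whose DocType is 'notes'.
        params.append(("fq", "-DocType:notes"))
    return urlencode(params)


if __name__ == "__main__":
    print(search_params("search keywords", can_view_notes=False))
    print(search_params("search keywords", can_view_notes=True))
```

Note that `urlencode` percent-encodes the colon, so the filter appears as `fq=-DocType%3Anotes` on the wire; Solr decodes it back to the same filter.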

Solr/SolrNet: How can I update a document given a document unique ID?

I need to update a few fields of each document in the Solr index separately from the main indexing process. According to the documentation, "Create" and "Update" are both mapped onto the "Add()" function. http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exist, will it replace the entire document or just the fields that I have specified?
If it'll replace the entire document then the only way that I can think of in order to update is to search the document by unique id, update the document object and then "Add" it again. This doesn't sound feasible because of the frequency of update ops required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields for a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or the original source) and then resending the document (adding via SolrNet), is the only way to update documents in Solr.
Please see the previous question: Update specific field on Solr index for more details about Solr not supporting individual field updates and an open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in SOLR, you might need to rethink your entire solution. In typical solutions that use SOLR and require lots of frequent updates to documents, the way it is usually done is that the documents reside in some SQL or NoSQL database, and they are modified there. Then you use DIH or something similar to bulk update the SOLR index from the database, possibly just dropping the index and re-indexing all content. SOLR can index documents very quickly so that is typically not a problem.
Partial updating of documents is now supported in the newer versions of Solr, for example 4.10 does pretty well. Please look at the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The only detail is that you need to declare your fields as stored=true to allow for partial updates.
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
In this specific module: Content: Schemas, Documents and Indexing
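For the partial updates mentioned in the last answer, the JSON body sent to the update handler might look like the output of this sketch. It uses the `set` modifier from the page linked above; the unique key name `id` and the example field are assumptions, and the snippet only builds the payload rather than posting it.

```python
import json


def atomic_update_payload(doc_id: str, changes: dict) -> str:
    """Build a JSON body that sets only the given fields on one document."""
    doc = {"id": doc_id}  # assumes the unique key field is named 'id'
    for field, value in changes.items():
        doc[field] = {"set": value}  # 'set' replaces the field's current value
    return json.dumps([doc])


if __name__ == "__main__":
    print(atomic_update_payload("doc1", {"price": 99}))
```

Fields not listed in `changes` are left as they are, which is why the answer notes that fields need to be declared as stored for this to work.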

Resources