I am relatively new to this, so I am trying to understand the relationships among Zookeeper, SolrCloud, and HTTP requests.
My understanding is:
Zookeeper (accessible through port 2181) keeps the config files for SolrCloud,
and all HTTP requests go directly to a SolrCloud instance rather than going through Zookeeper.
Therefore, Zookeeper, in this particular case, is not used for routing (API) requests? I don't really think that should be the case, but based on the tutorials on the official Solr site, it seems all requests need to go through Solr's port 8983.
Solr uses Zookeeper to keep its clusterstate (which servers have which cores / shards / parts of the complete collection) as well as configuration files and anything else that should be available throughout the cluster.
The request itself is made to Solr, and Solr uses information from Zookeeper in the background to route the request internally to the correct location. A client can also be cloud aware (such as SolrJ): it can query Zookeeper directly by itself and then contact the correct Solr server straight away, instead of having Solr route the request internally. In SolrJ this is implemented as CloudSolrClient (named CloudSolrServer in older versions of SolrJ), as opposed to the regular SolrServer, which would contact the Solr instance you're referencing and route the request from there.
If you look at the documentation of CloudSolrClient, you can see that it takes the Zookeeper connection information as its argument, not a Solr server address. SolrJ queries Zookeeper, retrieves the clusterstate, and then makes the HTTP request directly to the servers hosting the relevant shard or collection.
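To make that contrast concrete, here is a minimal, stdlib-only sketch of what a cloud-aware client does conceptually. Everything here (class name, node URLs, the hard-coded clusterstate) is illustrative and is not SolrJ's actual API; the real CloudSolrClient keeps this state fresh via Zookeeper.

```java
import java.util.List;
import java.util.Map;

// Illustration only: a "cloud aware" client keeps its own copy of the
// clusterstate (which it would normally fetch from Zookeeper) and picks
// a node hosting the target collection itself, instead of sending every
// request to one fixed Solr URL and letting Solr proxy it internally.
public class CloudAwareRoutingSketch {
    // collection -> base URLs of the nodes hosting a replica of it
    static final Map<String, List<String>> CLUSTER_STATE = Map.of(
        "techproducts",
        List.of("http://solr1:8983/solr", "http://solr2:8983/solr"));

    static int requestCounter = 0;

    // Round-robin over the nodes that actually host the collection.
    static String pickNode(String collection) {
        List<String> nodes = CLUSTER_STATE.get(collection);
        if (nodes == null || nodes.isEmpty()) {
            throw new IllegalArgumentException("unknown collection: " + collection);
        }
        return nodes.get(requestCounter++ % nodes.size());
    }

    public static void main(String[] args) {
        // Consecutive requests land directly on different hosting nodes.
        System.out.println(pickNode("techproducts") + "/techproducts/select?q=*:*");
        System.out.println(pickNode("techproducts") + "/techproducts/select?q=*:*");
    }
}
```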
Related
I am using the Nutch REST API to run Nutch searches on a separate server. I would like to retrieve the crawled data back to my local machine. Is there a way I can use the Nutch dump functionality to dump the data and retrieve it via the API, or am I better off indexing the data into Solr and retrieving it from there?
Thanks for your help.
Currently, the REST API doesn't provide such functionality. The main purpose of the REST API is to configure and launch your crawl jobs. At its core, it allows you to set the configuration of a new crawl job and manage it (to some extent).
The transfer of the crawled data is up to you. That being said, I do have a couple of recommendations:
If you're sending the data into Solr/ES (or any other indexer), I would recommend getting the data directly from there. Both Solr and ES already provide a REST API, with the additional benefit that you can filter which data to "copy over".
If you're running Nutch in distributed mode (i.e., in a Hadoop cluster), try to use the Hadoop libraries to copy the data to the destination.
If neither of these applies, then relying on something like rsync might be worth considering.
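As a sketch of the first recommendation: since the data ends up in Solr anyway, you can page it out through Solr's own select API using cursorMark. This stdlib-only snippet just builds the request URLs; the host, collection name, and sort field are placeholder assumptions for your setup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a cursorMark-paged /select URL for pulling documents out of
// Solr in stable batches. The first page uses cursorMark=*; each JSON
// response then carries a nextCursorMark to feed into the next request,
// until the cursor value stops changing.
public class SolrExportUrlSketch {
    static String exportUrl(String baseUrl, String collection, String cursorMark, int rows) {
        String q = URLEncoder.encode("*:*", StandardCharsets.UTF_8);
        // cursorMark requires a deterministic sort that includes the uniqueKey
        String sort = URLEncoder.encode("id asc", StandardCharsets.UTF_8);
        String cursor = URLEncoder.encode(cursorMark, StandardCharsets.UTF_8);
        return baseUrl + "/" + collection + "/select?q=" + q
             + "&rows=" + rows + "&sort=" + sort
             + "&cursorMark=" + cursor + "&wt=json";
    }

    public static void main(String[] args) {
        // Placeholder host/collection: fetch this URL, read nextCursorMark
        // from the response, and repeat until it no longer changes.
        System.out.println(exportUrl("http://solr:8983/solr", "nutch", "*", 500));
    }
}
```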
In a SolrCloud setup, there are 8 Solr nodes and 3 Zookeeper nodes. One load balancer receives all the indexing and search queries and distributes them to the 8 Solr nodes in the cloud. Before sending a query to a particular Solr node, it first checks whether that service endpoint is active, and only then sends the request to that node. Zookeeper handles leader election within each shard. In this setup, Zookeeper is not handling the query distribution. Is this setup bad for distributed queries? What other functionality offered by SolrCloud is missed because the load balancer does the work of query distribution?
Please note that the load balancer is necessary because different clients (Java, Ruby, JavaScript) access the Solr service, and only SolrJ has the ability to communicate with Zookeeper (using the CloudSolrServer class). It also helps to scale the Zookeeper nodes without changing any settings on the client side.
The SolrJ CloudSolrClient has a couple of advantages:
Node autodiscovery: It always knows what nodes are in the cluster, using the same ZK mechanism that the SolrCloud cluster itself uses.
Query-specific routing: Although any request can go to any node in the SolrCloud cluster, many of these will result in a simple proxy to the actual node that should handle the request.
2a: Indexing requests are routed directly to the leader of the shard handling that document's id. For a bulk-insert request, this can mean several sub-requests, farming out batches of documents directly to each appropriate shard.
2b: Queries to a collection are routed to a node that has a shard from that collection.
The CloudSolrClient already knows this stuff and routes directly, avoiding the proxy request within the cluster.
All that said, the internal routing requests are pretty lightweight. You'll add some latency to the requests, increase internal network bandwidth, and add the tiniest bit of CPU usage to the SolrCloud cluster.
So what I'm saying is that if it's too difficult to reproduce these advantages, Solr will handle things, and you'll probably get by just fine without them.
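Point 2a above can be sketched as follows. Real SolrCloud's compositeId router hashes the document id with MurmurHash3 onto a 32-bit hash range; this illustration simplifies that to a plain hash modulo the shard count, and the leader URLs are made up.

```java
import java.util.List;

// Simplified illustration of id-based routing: every document id maps
// deterministically to one shard, and an indexing request goes straight
// to that shard's leader (whose address would come from Zookeeper).
public class ShardRoutingSketch {
    // leader URL per shard, in shard order (placeholder addresses)
    static final List<String> SHARD_LEADERS = List.of(
        "http://solr1:8983/solr", "http://solr2:8983/solr");

    static String leaderFor(String docId) {
        // floorMod keeps the index non-negative even for negative hashes
        int shard = Math.floorMod(docId.hashCode(), SHARD_LEADERS.size());
        return SHARD_LEADERS.get(shard);
    }

    public static void main(String[] args) {
        // The same id always routes to the same leader.
        System.out.println(leaderFor("doc-42").equals(leaderFor("doc-42"))); // prints true
    }
}
```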
I am implementing Solr Cloud for the first time. I've worked with normal Solr and have that down pretty well, but I'm not finding a lot on what you can and can't do with Solr Cloud. So my question is about Managed Resources. I know you can CRUD stop words and synonyms using the new RESTful API in Solr. However, with the cloud, do I need to CRUD my changes to each individual Solr server in the cloud, or do I send them to a different URL that passes them through to each server? I'm new to cloud and Zookeeper. I have not found anything in the Solr wiki about working with managed resources in the cloud setup. Any advice would be helpful.
In SolrCloud, configuration and other files like stopwords are stored and maintained by Zookeeper, which means you do not need to send updates to each server individually.
Once you have SolrCloud, before putting in any data, you will create a collection. Each collection has its own set of resources/config.
So, for example, if you have a collection called techproducts spread across two servers, localhost1 and localhost2, the two commands below operate on the same resource:
curl "http://localhost1:8983/solr/techproducts/schema/analysis/synonyms/english"
curl "http://localhost2:8983/solr/techproducts/schema/analysis/synonyms/english"
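As a rough illustration of working with one shared resource, the JSON body for adding a synonym mapping (which you would PUT to either of the URLs above; Zookeeper propagates it to the whole collection) could be built like this. The term and synonyms are just examples.

```java
// Builds the JSON body for a Managed Resources PUT such as
//   PUT /solr/techproducts/schema/analysis/synonyms/english
// Note: no JSON escaping is done here, so this is only a sketch.
public class ManagedSynonymsSketch {
    static String synonymPayload(String term, String... synonyms) {
        StringBuilder sb = new StringBuilder("{\"").append(term).append("\":[");
        for (int i = 0; i < synonyms.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(synonyms[i]).append('"');
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // After the PUT succeeds, the collection typically needs a RELOAD
        // before analysis picks up the new mapping.
        System.out.println(synonymPayload("mad", "angry", "upset"));
        // prints {"mad":["angry","upset"]}
    }
}
```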
I am making a search query against a local Apache Solr server in the browser and can see the results.
I want to make the same query on the production server.
Since the Tomcat port is blocked on production, I cannot test the query results in the browser.
Is there any way to make the query and see the results?
Solr is a Java web application: if you can't access the port it's listening on, you can't access Solr itself. There's no other way to retrieve data from a remote location. Usually in production Solr is put behind an Apache proxy, which shields Solr as a whole and exposes only the needed contexts, such as solr/select in your case, to make queries.
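For illustration, a minimal mod_proxy rule along those lines might look like the following; the paths and core name are assumptions about your deployment:

```apache
# Forward only the query handler; the rest of Solr stays unreachable
# from the outside.
ProxyPass        /solr/select http://localhost:8983/solr/mycore/select
ProxyPassReverse /solr/select http://localhost:8983/solr/mycore/select
```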
We're running a multi-tenant website (multiple hosts, different configs for each host, but one application) where every customer on every request can get routed to client-specific databases and Solr instances. So, depending on which URL is mapped to the application, different connection strings are provided for each request. This works well for normal databases, where IConnectionProvider provides a different connection string on each request depending on the hostname. We're using SolrNet for our text indexing and will have multiple instances running for the different hosts. Presently the SolrNet facility for Castle Windsor gets registered once with a solrUrl at configuration time. We want to be able to resolve an instance of SolrNet on every request with a different solrUrl depending on the tenant/host configuration. Is this possible?
Use the multi-core / multi-instance support in the SolrNet Windsor facility, then use an IHandlerSelector to select the appropriate ISolrOperations<T> depending on the tenant/host config.