Setting up solrcloud with data on two machines - solr

I want to set up solr cloud with data split across 2 machines. For now, I need no replication, load balancing, or fault tolerance. Is there a simple way of achieving this? Most of the tutorials end up talking a lot about external zookeeper dependencies, which I think aren't needed for the barebones configuration I mentioned, and it has been hard to use those to create what I want.

If you do not need any fault tolerance, you can just start two SolrCloud instances and point them at the embedded Zookeeper. You'd need three Zookeeper nodes anyway for it to be able to establish a quorum on failure.
The embedded Zookeeper runs on port <solrport + 1000>, and you'd start the nodes with -z host1:port,host2:port. In this case you can just point the second instance at the first one, since you don't need any fault tolerance.
This is the same configuration as given in Getting Started with Solr Cloud.
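As a concrete sketch of that two-machine setup (hostnames and ports are placeholders, and the bin/solr script assumes Solr 5 or later):

```shell
# Machine 1 (host1): start Solr in cloud mode. The embedded
# ZooKeeper comes up on port 9983 (Solr port 8983 + 1000).
bin/solr start -c -p 8983

# Machine 2 (host2): join the same cluster by pointing -z at
# the embedded ZooKeeper on the first machine.
bin/solr start -c -p 8983 -z host1:9983
```

Collections created afterwards can then be sharded across both nodes.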


Deploy SolrCloud to multiple servers

I am a little bit confused with SolrCloud. How can I deploy SolrCloud on multiple servers? Will it be multiple nodes, one per server, or will it be one SolrCloud node with multiple shards, one per server?
And how will all of this communicate with Zookeeper? (As far as I understand, Zookeeper also has to be deployed on a separate server - is this correct?)
Can you help me, or maybe give a link to a good tutorial?
The SolrCloud section of the reference manual should be able to help you out about the concepts of Solr Cloud.
You can run multiple nodes on a single server, or you can run one node on each server. That's really up to you - but all the nodes running on a single server will disappear when that server goes down. The usual reasons for running multiple nodes on one server are experimenting, or very particular requirements where you're trying to get certain speedups from the single-threaded parts of Lucene - so unless you're doing low-level optimization, one node per server is what you want.
The exception to that rule is for development and experimenting - running multiple nodes on a single machine is fine when the data doesn't matter.
All the nodes make up a single SolrCloud cluster - so you'd be running multiple nodes, not multiple clusters.
Zookeeper should (usually) be deployed on three to five servers - depending on what kind of resiliency you want for failovers. While Solr bundles a Zookeeper instance you can use if you don't want to set up Zookeeper yourself, that is not recommended for production. In a production environment you'd run Zookeeper as a separate process - but that may not mean that you'll be running it on separate servers. Depending on how much traffic and use you'll see for Zookeeper for your nodes, running them on the same server as your cloud nodes will work perfectly fine. The point is to avoid using the bundled version to have full control over Zookeeper and its configuration, and to be able to upgrade/manage the instances outside of Solr.
If the need arises later you can move Zookeeper to its own cluster of servers then (at least three).
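For reference, a minimal standalone three-node ensemble configuration might look like the following (hostnames and dataDir are placeholders); each of the three servers gets the same zoo.cfg plus a myid file containing its own id:

```shell
# Hypothetical zoo.cfg for a three-node ZooKeeper ensemble;
# hostnames and dataDir are placeholders.
cat > conf/zoo.cfg <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
EOF

# Each server identifies itself via a myid file in dataDir,
# matching its server.N entry above (here: server 1).
echo 1 > /var/lib/zookeeper/myid
```

Solr nodes would then be started with -z zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181.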

Solr Cloud: Distribution of Shards across nodes

I'm currently using SolrCloud 6.1; the following behavior can also be observed up to 7.0.
I'm trying to create a Solr collection with 5 shards and a replication factor of 2. I have 5 physical servers. Normally, this would distribute all 10 replicas evenly among the available servers.
But, when starting Solr Cloud with a -h (hostname) param to give every Solr instance an individual, but constant hostname, this doesn't work any more. The distribution then looks like this:
solr-0: wikipedia_shard1_replica1, wikipedia_shard2_replica1, wikipedia_shard3_replica2, wikipedia_shard4_replica1, wikipedia_shard4_replica2
solr-1: (nothing)
solr-2: wikipedia_shard3_replica1, wikipedia_shard5_replica1, wikipedia_shard5_replica2
solr-3: wikipedia_shard1_replica2
solr-4: wikipedia_shard2_replica2
I tried using Rule-based Replica Placement, but the rules seem to be ignored.
I need to use hostnames, because Solr runs in a Kubernetes cluster, where IP addresses change frequently and Solr won't find its cores after a container restart. I first suspected a newer Solr version to be the cause, but I narrowed it down to the hostname problem.
Is there any solution for this?
The solution was actually quite simple (but not really documented):
When creating a Service in OpenShift/Kubernetes, all matching Pods get backed by a load balancer. So even though every Solr instance is assigned a unique hostname, those hostnames all resolve to one single IP address (that of the load balancer).
Solr somehow can't deal with that and fails to distribute its shards evenly.
The solution is to use headless services in Kubernetes. Headless services aren't backed by a load balancer, so every hostname resolves to a unique IP address.
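A minimal sketch of such a headless Service (the name, label selector, and port are assumptions); clusterIP: None is what makes it headless, so each pod's DNS name resolves to its own pod IP instead of a load-balanced virtual IP:

```shell
# Hypothetical headless Service for Solr pods. clusterIP: None
# disables the cluster load balancer, so per-pod hostnames
# resolve directly to pod IPs.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: solr-headless
spec:
  clusterIP: None
  selector:
    app: solr
  ports:
    - name: solr
      port: 8983
EOF
```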

Why do we need an external zookeeper for Solrcloud?

I want to set up a Solr cloud using 3 different machines. All the examples I came across advise me to download Zookeeper as well.
However, Solr comes with an inbuilt Zookeeper. Why can't we use that?
Right now the embedded Zookeeper is meant mostly for testing and development. The reasons are that it is less battle-tested in production, and that if the Solr process goes down, Zookeeper goes down with it, making the cluster less resilient than a separate process would.
That being said, some voices say that setting up a separate Zookeeper ensemble is too troublesome, so maybe this will change in the future.

How to setup Solr Cloud with two search servers?

Hi, I'm developing a Rails project with Sunspot Solr and configuring SolrCloud.
My environment: Rails 3.2.1, Ruby 2.1.2, Sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need a more stable system - oftentimes the search server goes down for maintenance and the web application stops working in production. So I'm thinking about how to run 2 identical search servers instead of one, to make the system more stable: if one server goes down, the other will continue working.
I cannot find any good tutorial that is simple, easy to understand, and described in detail...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it is working inside:
synchronizing data between the two servers (is it automatic?)
balancing search requests between the two servers
when one server suddenly stops working, the other should become the master (is that automatic?)
are there SolrCloud features other than those listed?
Read more about SolrCloud here: https://wiki.apache.org/solr/SolrCloud
A couple of inputs from my experience.
If your application only reads data from Solr and does not write to it in real time (you index using an ETL or similar), then you can just go for a master-slave hierarchy.
Define one master: point all writes here. If this master is down, you will no longer be able to index data.
Create 2 (or more) slaves: this is a feature of Solr, and it will take care of synchronizing data from the master at the interval you specify (say, every 20 seconds).
Create a load balancer over the slaves and point your application to read data from the load balancer.
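The polling interval mentioned above is configured on each slave's replication handler; a sketch of that snippet (the master URL, core name, and the 20-second interval are placeholders):

```shell
# Hypothetical slave-side replication handler; this snippet goes
# inside the <config> element of the slave core's solrconfig.xml.
cat <<'EOF'
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master.example.com:8983/solr/core1/replication</str>
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>
EOF
```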
Pros:
With the above setup you don't have high availability for the master (data writes), but you will have high availability for reads until the last slave goes down.
Cons:
Assume one slave went down and you brought it back after an hour: this slave will be behind the other slaves by one hour. So it's a manual task to check for data consistency against the other slaves before adding it back to the ELB.
How about SolrCloud?
No master here, so you can achieve high availability for writes too.
No need to worry about the data inconsistency described above; the SolrCloud architecture takes care of that.
What suits you best:
Define an external Zookeeper ensemble with a 3-node quorum.
Define at least 2 Solr servers.
Split your current index into 2 shards (by default each shard will reside on one of the 2 Solr nodes defined in step #2).
Define the replication factor as 2 (this will create a replica of each shard on each node).
Define an LB pointing to the above Solr nodes.
Point your indexing as well as your application to this LB.
With the above setup, you can sustain failover of either node.
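The sharding and replication steps above can be done in one call to the Collections API (the collection name and host are placeholders; maxShardsPerNode=2 is needed here because 2 shards x 2 replicas puts two cores on each of the two nodes):

```shell
# Create a 2-shard collection with 2 replicas per shard via the
# Collections API; "mycollection" and host1 are placeholders.
curl "http://host1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&maxShardsPerNode=2"
```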
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Solr - Multi Core vs Multiple Instance for Many Database Tables

I have a performance concern and want a suggestion on which will be best, multi-core or multi-instance (with different ports)? Let's have a look at my case first:
Currently I am running Solr with multiple cores and it's running OK. There is only one issue: sometimes it runs out of heap memory while processing facet fields, and then I have to restart Solr. (To minimize the number of restarts, I start Solr with high memory: java -Xms1000M -Xmx8000M -jar start.jar.)
I have an Amazon EC2 instance with 8 cores at 2.8 GHz / 15 GB RAM and an optimized hard disk.
I have many database tables (about 100) and have to create a different schema for each (which leads to creating a different core for each).
Each table has millions of documents, with 7-9 indexed fields and 10-50 stored fields per document.
My web portals should handle very high traffic (currently I'm seeing 10 requests/second; this may increase to 50-100/second). I know Solr can handle that, but I mention it to make clear that I am concerned about even the smallest performance issue.
I search Solr from PHP via cURL against a specific core, so searching across different Solr instances is not a problem either.
Question:
As far as I know, Solr handles one request at a time. So I think that if I create multiple instances of Solr and start them on different ports, my web portal can handle more requests at a time (if users search in different tables).
So, what would you suggest? Multiple cores in a single Solr instance, or multiple instances with a single/dual core in each?
Is there any problem in having multiple Solr instances running on different ports?
NOTE: Here, I can/may/will combine less-searched or small core(s) in one instance AND heavy-traffic core(s) in a separate instance, OR two or three heavy-traffic cores in one instance, etc., because creating a different instance for each table (~100 here) would take too much hardware resource.
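For context on the search side: a core is addressed by its own URL path, so querying a different core, or a different instance on another port, is just a different URL. A sketch (host, port, and core name are placeholders):

```shell
# Query a specific core; to target a different instance, change
# the port; to target a different core, change the path.
curl 'http://localhost:8983/solr/core1/select?q=*:*&rows=10&wt=json'
```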
As I didn't get any answer for more than a week, and I had also tried many cases with Solr (and read some articles), I want to share my experience as an answer to my own question. This may help future viewers. I tried on Server Fault too, with no success.
Solr can handle more than one request at a time.
I tested this by running a long query [qTime=7203, approx. 7 sec] and several small queries after the long one [qTime=30]; Solr responded to the small queries first, even though they were issued after the long one.
This point weighs heavily in the answer: use a single Solr instance with multiple cores. Just assign high memory to the JVM.
Other Points:
1. Each Solr instance will require RAM, so running multiple instances will require more resources, which will be expensive. And if you are using facets or sort fields, then you need to allocate even more RAM to each instance.
As you can see, in my case I need to start Solr with high memory (8 GB). You can look at the case of the Danish Web Archive, which uses multiple instances, allocates 9 GB of RAM to each, and has 256 GB of total RAM.
2. You can run multiple instances of Solr on different ports with java -Djetty.port=8984 -jar start.jar. Everything ran OK, but I hit one problem.
While indexing, it may give a "not enough memory" error, and then the Solr instance will be killed. So you again need to start the second instance with high memory, which leads to an even bigger RAM requirement.
3. Solr resource requirements and performance problems can be understood here. According to this, a 64-bit environment and 12 GB of RAM are recommended for good performance. Solr optimization is explained here.
