I have a SolrCloud cluster running Solr 5.1.0 that consists of a set of powerful machines for searches and updates.
I have a set of additional, slower machines that are supposed to be replicas only. These machines do not receive any direct traffic.
However, the log files of the slower machines show a lot of query traffic originating from the other nodes.
I want the slow replicas for recovery only; they should not process any searches.
Is there a possibility to configure this behaviour in SolrCloud?
No. SolrCloud does not provide a mechanism by which a node participates in replication but is not a candidate for per-shard fan-out queries.
That said...
You can do this with master/slave replication.
If all the shards for your collection(s) are on every node, then the preferLocalShards=true parameter might accomplish the same thing. See https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
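As a rough sketch of how that parameter would be passed on a query, assuming a hypothetical host `solr-fast-1:8983` and a collection named `mycol` (neither comes from the question), the request URL could be built like this:

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, prefer_local=True):
    """Build a /select URL for a collection.

    preferLocalShards=true asks the node that receives the request to use
    replicas hosted on that same node when it has them.
    """
    params = {"q": query, "wt": "json"}
    if prefer_local:
        params["preferLocalShards"] = "true"
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
url = build_select_url("http://solr-fast-1:8983/solr", "mycol", "*:*")
print(url)
```

This only helps if the queries are sent to the fast machines and those machines hold a replica of every shard; otherwise the receiving node still has to fan out to other nodes.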
Related
I see master (searching) and master (replicable) fields on the host/solr/#/core_1/replication page and wonder how they affect me, given that I don't have any master/slave replication.
If I don't have any replicas, does enabling or disabling replication have any impact on performance?
Also, when I am indexing new documents, I see the master (searching) size growing even when I disable replication from this UI. What does that imply?
Basically nothing. The replication request handler has been implicitly defined since Solr 5 (if I remember correctly), which is why that section is enabled.
The replication handler does a kind of versioning of the index (to enable replication procedures), so:
if you are on a standalone instance, I don't think it has any performance impact
if you are in a master/slave scenario, it is used for replicating the index
if you are on SolrCloud, replication is used behind the scenes between nodes (e.g. for node recovery)
So, in a few words: don't worry about that :)
I know there have been many articles written about database replication. Trust me, I spent some time reading those articles, including this SO one that explains the pros and cons of replication. This SO article goes into depth about replication and clustering individually, but doesn't answer these simple questions that I have:
When do you replicate your database, and when do you cluster?
Can both be performed at the same time? If yes, what are the inspirations for each?
Thanks in advance.
MySQL currently supports two different solutions for creating a high availability environment and achieving multi-server scalability.
MySQL Replication
The first form is replication, which MySQL has supported since MySQL version 3.23. Replication in MySQL is currently implemented as an asynchronous master-slave setup that uses a logical log-shipping backend.
A master-slave setup means that one server is designated to act as the master. It is then required to receive all of the write queries. The master executes and logs the queries, and the log is then shipped to the slave, which executes the same queries and thereby keeps the same data across all of the replication members.
Replication is asynchronous, which means that the slave server is not guaranteed to have the data when the master performs the change. Normally, replication will be as close to real time as possible. However, there is no guarantee about the time required for the change to propagate to the slave.
Replication can be used for many reasons. Some of the more common reasons include scalability, server failover, and backup solutions.
Scalability can be achieved because you can now run SELECT queries against any of the slaves. Write statements, however, are generally not improved, because writes have to occur on each of the replication members.
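To make the read-scaling idea concrete, here is a minimal sketch of application-side read/write splitting. The class name `ReadWriteRouter` and the server labels are hypothetical, and a real application would return database connections rather than strings:

```python
import itertools


class ReadWriteRouter:
    """Send SELECT statements to slaves in round-robin order, everything else to the master."""

    def __init__(self, master, slaves):
        self.master = master
        # With no slaves configured, all traffic falls back to the master.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, sql):
        words = sql.split()
        first = words[0].upper() if words else ""
        if first == "SELECT" and self._slaves is not None:
            return next(self._slaves)
        return self.master


router = ReadWriteRouter("master", ["slave1", "slave2"])
print(router.route("SELECT * FROM users"))           # read goes to a slave
print(router.route("INSERT INTO users VALUES (1)"))  # write goes to the master
```

Because replication is asynchronous, a SELECT routed to a slave may not yet see a write that was just sent to the master, so reads that must observe their own writes are usually sent to the master as well.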
Failover can be implemented fairly easily using an external monitoring utility that uses a heartbeat or similar mechanism to detect the failure of a master server. MySQL does not currently do automatic failover, as the logic is generally very application dependent. Keep in mind that because replication is asynchronous, it is possible that not all of the changes made on the master will have propagated to the slave.
MySQL replication works very well even across slower connections, and with connections that aren't continuous. It also is able to be used across different hardware and software platforms. It is possible to use replication with most storage engines including MyISAM and InnoDB.
MySQL Cluster
MySQL Cluster is a shared nothing, distributed, partitioning system that uses synchronous replication in order to maintain high availability and performance.
MySQL Cluster is implemented through a separate storage engine called NDB Cluster. This storage engine will automatically partition data across a number of data nodes. The automatic partitioning of data allows for parallelization of queries that are executed. Both reads and writes can be scaled in this fashion since the writes can be distributed across many nodes.
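The partitioning idea can be illustrated with a small sketch that assigns rows to partitions by hashing the primary key. This is only an illustration: MD5 is a stand-in chosen for determinism, not the hash function NDB actually uses, and the function names are made up for the example.

```python
import hashlib


def partition_for(primary_key, num_partitions):
    """Map a primary key to a partition number in [0, num_partitions)."""
    digest = hashlib.md5(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def scatter(rows, num_partitions):
    """Distribute (key, value) rows across partitions by hashed key."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in rows:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


rows = [(i, f"row-{i}") for i in range(20)]
for partition, members in scatter(rows, 4).items():
    print(partition, [key for key, _ in members])
```

Since every key maps to exactly one partition, a write touches only the nodes that hold that partition, which is why writes as well as reads can be spread across the data nodes.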
Internally, MySQL Cluster also uses synchronous replication in order to remove any single point of failure from the system. Since two or more nodes are always guaranteed to have each data fragment, at least one node can fail without any impact on running transactions. Failure detection is handled automatically, with the dead node being removed transparently to the application. Upon node restart, it will automatically be re-integrated into the cluster and begin handling requests as soon as possible.
There are a number of limitations that currently exist and have to be kept in mind while deciding if MySQL Cluster is the correct solution for your situation.
Currently all of the data and indexes stored in MySQL Cluster are stored in main memory across the cluster. This does restrict the size of the database based on the systems used in the cluster.
MySQL Cluster is designed to be used on an internal network as latency is very important for response time.
As a result, it is not possible to run a single cluster across a wide geographic distance. In addition, while MySQL Cluster will work over commodity network setups, in order to attain the highest performance possible special clustering interconnects can be used.
In a master-slave configuration, write operations are performed by the master and reads by the slave. All SQL requests first reach the master, a queue of requests is maintained, and a read operation is executed only after the write has completed. A common problem in master-slave configurations, which I have also witnessed, is that when the queue becomes too large for the master to maintain, the architecture collapses and the slave starts behaving like a master.
For clusters, I have worked with Cassandra, where a request reaches a node (table) and a commit hash is maintained that records the changes made on a node and updates the other nodes based on that commit hash. So here the operations do not all depend on a single node.
We use master-slave when the write data is not large in size or count; otherwise we use clusters.
Clusters are expensive in space and master-slave in time, so your decision about which to choose depends on what you want to save.
We can also use both at the same time; I have done this at my current company.
We moved the tables with the most write operations to Cassandra and wrote four APIs to perform the CRUD operations on the tables in Cassandra. Whenever an HTTP request comes in, it first hits our web server, and the code running on our web server decides which operation has to be performed (among CRUD) and then calls the corresponding API to make changes to the Cassandra database.
We have a SolrCloud managed by ZooKeeper. One concern that we have is with updating the schema or dataConfig on the fly. All changes that we are planning to make are on the indexing server node in the SolrCloud. Once the changes to the schema or dataConfig are made, we do a full dataimport.
The concern is that the replication of the new indexes to the slave nodes in the cloud would not happen immediately, but only after the replication interval. Also, replication will happen at different times on the different slave nodes, which might cause inconsistent results.
For example:
The index replication interval is 5 mins.
Slave node A started at 10:00 => next index replication would be at 10:05.
Slave node B started at 10:03 => next index replication would be at 10:08.
If we make changes to the schema in the indexing server and re-index the results at 10:04, then the results of this change would be available on node A at 10:05, but in node B only at 10:08. Requests made to the SolrCloud between 10:05 and 10:08 would have inconsistent results depending on which slave node the request gets redirected to.
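The timing in this example can be worked through in a few lines, assuming each slave polls at a fixed interval counted from its own start time (the polling model described in the question, not SolrCloud behavior). The calendar date and the function name are arbitrary choices for the sketch:

```python
from datetime import datetime, timedelta


def next_poll(started, interval, after):
    """First poll strictly after `after` for a slave polling every `interval` since `started`."""
    polls_done = (after - started) // interval  # timedelta // timedelta gives an int
    return started + (polls_done + 1) * interval


interval = timedelta(minutes=5)
reindexed_at = datetime(2024, 1, 1, 10, 4)  # arbitrary date; only the times matter

node_a = next_poll(datetime(2024, 1, 1, 10, 0), interval, reindexed_at)
node_b = next_poll(datetime(2024, 1, 1, 10, 3), interval, reindexed_at)

window_start, window_end = min(node_a, node_b), max(node_a, node_b)
print("node A picks up the change at", node_a.strftime("%H:%M"))
print("node B picks up the change at", node_b.strftime("%H:%M"))
print("inconsistent window in minutes:", int((window_end - window_start).total_seconds() // 60))
```

Under this model the window can never exceed one replication interval, and it shrinks to zero only if all slaves happen to poll at the same moment.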
Please let me know if there is any way to make the results more consistent.
#Wish, what you are describing is not the behavior of SolrCloud.
In SolrCloud, indexing requests are routed to the shard leaders, and each leader sends copies to all of its replicas.
At any point in time, if ZooKeeper identifies that a replica is not in sync with its leader, the replica is put into recovering mode. In this mode it will not serve any requests, including queries.
P.S.: In SolrCloud, configs are maintained in ZooKeeper and not at the node level.
I guess you are confusing SolrCloud with master/slave mode a little. Please confirm which setup you are using.
I'm new to SolrCloud (and Solr).
I need your help understanding collection shard and replicas.
I have two SolrCloud instances running on two different servers.
I have a collection, mycol, with two shards. Each SolrCloud instance hosts one shard.
Because I'm running two nodes, I am thinking of adding redundancy. I have some questions about it:
First Way:
Add one new core on each SolrCloud instance: on the instance hosting mycol shard1, assign it to mycol shard2, and on the instance hosting mycol shard2, assign it to mycol shard1. The new cores will become replicas, and each node will hold the complete collection in case of hardware failure.
Second way:
Add two SolrCloud instances on two more servers. They will become replicas automatically.
Third way:
Add two more SolrCloud instances, this time on the existing servers. They will become replicas automatically.
I'm going crazy trying to understand which way is correct.
Can you help me?
Thank you
Regards
Giova
It's a bit hard to dissect what you are looking for based on your question; however, the standard practice is to deploy two or more SolrCloud nodes. Make sure they can talk to each other and to ZooKeeper. Once that is set up, you can configure your collections with the numShards and replicationFactor parameters. These parameters determine how many shards are created and how many replicas are created for each shard. Shards are used to break the collection up into smaller chunks; shards by themselves don't provide any redundancy. Shard replicas are exact copies of your shards, and these are what actually provide redundancy.
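As an illustration of how those two parameters are passed, here is a sketch that builds a Collections API CREATE request for the two-shard collection from the question. The host `server1:8983` is a placeholder, and depending on the Solr version additional parameters may be required when several replicas land on one node:

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL for the given shard and replica counts."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder host; 2 shards x 2 copies = 4 cores spread over the cluster.
url = create_collection_url("http://server1:8983/solr", "mycol", 2, 2)
print(url)
```

With two nodes, two shards and a replication factor of 2, each node can end up holding one copy of every shard, which matches the "first way" described in the question.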
Once you fire off this command to any of the replicas in the SolrCloud cluster, your collection will be created. The replicas are created on the second server to provide redundancy if the first one goes down. At this point, you should be able to query any replica and SolrCloud will automatically route the query internally and provide results.
I am wondering how a load balancer can be set up on top of SolrCloud, or whether a load balancer is not needed at all.
If the former, do the shard leaders need to be added to the load balancer? Then what if the shard leader changes for some reason? Or is it better to add all machines in the cluster (including replicas) to the load balancer?
If the latter, I guess a CNAME needs to point to the SolrCloud cluster, and it should be round-robin DNS?
Any advice from actual SolrCloud operating experience would be really appreciated.
Usually SolrCloud is used in combination with ZooKeeper, and the client uses CloudSolrServer to access SolrCloud.
The query is processed in the following flow.
Note that I have only read part of the Solr source code, so there is a lot of guesswork here. Also, what I read was the source code of Solr 4.1, so it might be outdated.
ZooKeeper holds the list of IPAddress:Port of all SolrCloud servers.
(Client Side) The instance of CloudSolrServer retrieves the list of servers from ZooKeeper.
(Client Side) The instance of CloudSolrServer chooses one of the SolrCloud servers at random and sends the query to it. (Also, LBHttpSolrServer chooses the server in round-robin order?)
(Server Side) The SolrCloud server that received the query chooses randomly among the replicas of each shard (one server per shard) from the server list and forwards the query to them. (Note that every SolrCloud server holds the server list, which it receives from ZooKeeper.)
An update is handled in the same manner as above, but it is also propagated to all servers.
Note that in SolrCloud there is only a small difference between the leader and a replica, and we can send a query or update to any of the servers. It is automatically forwarded to the other servers.
In short, the loadbalancing is done in both client side and server side.
So you don't need to worry about it.
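The flow above can be sketched as a toy simulation. The cluster layout and host names are invented, and the selection logic is a simplification of the steps described in this answer, not Solr's actual implementation:

```python
import random

# Hypothetical layout: shard name -> servers holding a replica of that shard.
CLUSTER = {
    "shard1": ["host1:8983", "host2:8983"],
    "shard2": ["host3:8983", "host4:8983"],
}


def live_nodes(cluster):
    """All distinct servers, as a client would learn them from ZooKeeper."""
    return sorted({node for replicas in cluster.values() for node in replicas})


def client_pick_node(cluster, rng):
    """Client side: choose any server to receive the query."""
    return rng.choice(live_nodes(cluster))


def fan_out(cluster, rng):
    """Server side: choose one replica per shard to execute the sub-queries."""
    return {shard: rng.choice(replicas) for shard, replicas in cluster.items()}


rng = random.Random()
print("query sent to:", client_pick_node(CLUSTER, rng))
print("sub-queries go to:", fan_out(CLUSTER, rng))
```

Running it several times shows why every replica sees query traffic: both the entry node and the per-shard replicas are chosen independently for each request.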
Load balancing is needed, and it is handled through ZooKeeper, which is used in conjunction with SolrCloud.
When you use SolrCloud, you must set up sharding and replication through ZooKeeper, using either the embedded ZooKeeper server that comes bundled with Solr or a stand-alone ZooKeeper ensemble (which is recommended for redundancy).
Then you would use CloudSolrClient (called CloudSolrServer in older SolrJ versions), which reads the cluster state from ZooKeeper and sends your query to an appropriate node in your cluster. The client requires the addresses of your ZooKeeper instances upon instantiation, and load balancing is handled from there.
Please see the following excellent tutorial:
http://www.francelabs.com/blog/tutorial-solrcloud-amazon-ec2/
Solr Docs:
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
This quote refers to the latest version of Solr, which at the time of writing was 7.1.
SolrCloud - Distributed Requests
When a Solr node receives a search request, the request is routed
behind the scenes to a replica of a shard that is part of the
collection being searched.
The chosen replica acts as an aggregator: it creates internal requests
to randomly chosen replicas of every shard in the collection,
coordinates the responses, issues any subsequent internal requests as
needed (for example, to refine facets values, or request additional
stored fields), and constructs the final response for the client.
SolrCloud - Read Side Fault Tolerance
In a SolrCloud cluster each individual node load balances read
requests across all the replicas in collection. You still need a load
balancer on the 'outside' that talks to the cluster, or you need a
smart client which understands how to read and interact with Solr’s
metadata in ZooKeeper and only requests the ZooKeeper ensemble’s
address to start discovering to which nodes it should send requests.
(Solr provides a smart Java SolrJ client called CloudSolrClient.)
I am in a similar situation where I can't rely on CloudSolrServer for load balancing. A possible solution that I am evaluating is to use Airbnb's synapse (http://nerds.airbnb.com/smartstack-service-discovery-cloud/) to dynamically reconfigure an existing haproxy load balancer based on the status of the SolrCloud cluster that we get from ZooKeeper.
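The core of that idea, turning cluster state into a list of haproxy backends, could look roughly like the sketch below. It assumes a simplified cluster-state JSON in which each replica has `state` and `base_url` fields; the sample document and function names are made up, and this is not how synapse itself is implemented:

```python
import json
from urllib.parse import urlparse

# Made-up sample of cluster state as it might be read from ZooKeeper.
SAMPLE_STATE = json.dumps({
    "mycol": {"shards": {
        "shard1": {"replicas": {
            "core_node1": {"state": "active", "base_url": "http://host1:8983/solr"},
            "core_node2": {"state": "recovering", "base_url": "http://host2:8983/solr"},
        }},
        "shard2": {"replicas": {
            "core_node3": {"state": "active", "base_url": "http://host3:8983/solr"},
        }},
    }}
})


def active_backends(state_json):
    """Return the sorted host:port of every node hosting at least one active replica."""
    backends = set()
    for collection in json.loads(state_json).values():
        for shard in collection.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                if replica.get("state") == "active":
                    backends.add(urlparse(replica["base_url"]).netloc)
    return sorted(backends)


def haproxy_server_lines(backends):
    """Render one haproxy `server` line per backend, with health checks enabled."""
    return [f"    server solr{i} {addr} check" for i, addr in enumerate(backends, start=1)]


print("\n".join(haproxy_server_lines(active_backends(SAMPLE_STATE))))
```

A real setup would watch ZooKeeper for changes and reload haproxy whenever the generated list differs from the previous one.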