Solr 4: Adding a Shard to an Existing Cluster

Background: I just finished reading the Apache Solr 4 Cookbook. In it the author mentions that setting up shards needs to be done wisely, because new ones cannot be added to an existing cluster. However, this was written for Solr 4.0, and at present I am using 4.1. Is this still the case? I wish I hadn't run into this issue, and I'm hoping someone can tell me otherwise.
Question: Am I expected to know how much data I'll store in the future when setting up shards in a SolrCloud cluster?
I have played with Solandra and read up on Elasticsearch, but quite honestly I am a fan of Solr as it is (and its large community!). I also like ZooKeeper. Am I stuck for now, or is there a workaround/patch?
Edit: If the answer to the question above is no, could I build a SolrCloud with a bunch of shards (maybe 100 or more), let them grow internally, and as my data grows start peeling them off one by one onto larger, faster servers with more resources?

Yes, of course you can. You have to set up a new Solr server pointing to the same ZooKeeper instance. During bootstrap the server connects to the ZooKeeper ensemble and registers itself as a cluster member.
Once the registration process is complete, the server is ready to create new cores. You can create replicas of the existing shards using the CoreAdmin API. You can also create new shards, but they won't be rebalanced: because of the Lucene index format (not all fields are stored), the cluster may not have all the document information needed to redistribute existing documents, so only newly indexed or updated documents will land on the new server (doing this is not recommended).
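As an illustration, on Solr 4.x you can add a replica of an existing shard by creating a core on the new node through the CoreAdmin API (the host, collection, and core names here are hypothetical):
/admin/cores?
action=CREATE&
name=collection1_shard1_replica2&
collection=collection1&
shard=shard1
Once the new core comes up it registers itself in ZooKeeper, recovers the index from the shard leader, and starts serving as a replica.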
When you set up your SolrCloud you have to plan the cluster around your expected document growth. For example, if you have 1M documents at first and they grow by 10k docs/day, set up the cluster with 5 shards; at the start you can host those shards on your initial two machines, and in the future, as needed, you can add new servers to the cluster and move shards onto them. Be careful not to over-shard your cluster, because in Lucene a single 20GB index split across 5 shards won't yield a 4GB index on every shard. Each shard will take about (single_index_size / num_shards) * 1.1 (due to dictionary compression), e.g. (20GB / 5) * 1.1 ≈ 4.4GB per shard. This may change depending on your term frequencies.
The last option you have is to add the new servers to the cluster and, instead of adding new shards or replicas to the existing collection, set up a different collection on the new shards and reindex into it in parallel. Then, once the reindex process finishes, swap the new collection in for the old one.
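One common way to do that swap, in Solr 4.2 and later, is a collection alias, assuming your clients query through the alias rather than the collection name directly (names here are hypothetical):
/admin/collections?
action=CREATEALIAS&
name=myindex&
collections=myindex_v2
Once the alias points at myindex_v2, queries move over atomically and the old collection can be deleted at your leisure.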

One solution to the problem is to use the "implicit router" when creating your Collection.
Let's say you have to index all the "audit trail" data of your application into Solr, and new data gets added every day. You would most probably want to shard by year.
You could do something like the following during the initial setup of your collection:
admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards - one each for the current and the last 4 years 2010,2011,2012,2013,2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
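For instance, a document posted like this (the host and the non-year fields are hypothetical) lands on the "2014" shard purely because of its year value:
curl 'http://localhost:8983/solr/AuditTrailIndex/update?commit=true' -H 'Content-Type: application/json' -d '[{"id":"1001","action":"user-login","year":"2014"}]'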
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API). Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When it's 2015, all data will automatically get indexed into the "2015" shard, assuming your data has the "year" field correctly populated with 2015.
In 2015, if you think you don't need the 2010 shard (based on your data retention requirements) - you could always use the DELETESHARD API to do so:
/admin/collections?
action=DELETESHARD&
shard=2010&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit" router when creating your collection. It does NOT work when you use the default "compositeId" router, i.e. collections created with the numShards parameter.
This feature is truly a game changer: it allows shards to be added dynamically based on the growing demands of your business.

Related

Need to migrate the whole cluster from one DC to another DC

I have a SolrCloud cluster consisting of 5 hosts in one DC.
The collection is configured with 5 shards, 3 replicas, and a maximum of 3 shards per host.
Solr version used is 5.3.1.
Because of some unforeseen maintenance activity, it needs to be moved to another DC temporarily. To minimize the impact, we need the indexed data to be available in the new setup. Each node holds roughly 100GB of indexed data.
I have already tried copying the whole setup to the new DC and restarting it after updating the host information in the config files. It always complains that some shard or other is not available from the hosts while querying data (error code 503).
Note: The backup was taken from a running setup.
I have also tried creating the whole cluster again with the same configuration and copying only the data directories from the backup. That also results in shards not being available from the hosts.
I want to understand whether there is something wrong in the process I am following. One thing I suspect is that the backup should be taken after stopping a particular node.
Is there a simpler and better way? I am using Solr 5.3.1.
The right way to do this is to use the backup and restore feature. It was already available in version 5.3; check the appropriate docs and follow the steps. It should work just fine.
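In Solr 5.x this works per core through the ReplicationHandler (the hosts, core names, and paths below are hypothetical): take a backup on one core of each shard in the old DC, copy the snapshots across, and restore them into the matching cores in the new DC:
http://old-host:8983/solr/mycollection_shard1_replica1/replication?command=backup&name=shard1backup&location=/backups
http://new-host:8983/solr/mycollection_shard1_replica1/replication?command=restore&name=shard1backup&location=/backups
You can poll command=restorestatus on the new core to check progress before moving on to the next shard.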

How to set up SolrCloud with two search servers?

Hi, I'm developing a Rails project with Sunspot and Solr and configuring SolrCloud.
My environment: Rails 3.2.1, Ruby 2.1.2, Sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need a more stable system. Our search server often goes down for maintenance, and the web application stops working in production. So I'm thinking about how to run two identical search servers instead of one, to make the system more stable: if one server goes down, the other keeps working.
I cannot find any good tutorial that is simple, easy to understand, and described in detail...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it works internally:
does it synchronize data between the two servers (is it automatic)?
does it balance search requests between the two servers?
when one server suddenly stops working, does the other become the master (is it automatic)?
are there SolrCloud features other than those listed?
Read more about SolrCloud here: https://wiki.apache.org/solr/SolrCloud
A couple of inputs from my experience.
If your application just reads data from Solr and does not write to it in real time (e.g., you index using an ETL job or similar), then you can simply go for a master/slave hierarchy.
Define one master: point all writes here. If this master is down, you will no longer be able to index data.
Create 2 (or more) slaves: replication is a built-in Solr feature, and it takes care of synchronizing data from the master at the interval you specify (say, every 20 seconds; see the config sketch after this list).
Create a load balancer over the slaves and point your application to read data through the load balancer.
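A minimal sketch of what that looks like in solrconfig.xml; the master URL, core name, and 20-second poll interval are assumptions to adapt.
On the master:
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>
On each slave:
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core1</str>
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>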
Pros:
With the above setup you don't have high availability for the master (data writes), but you will have high availability for reads until the last slave goes down.
Cons:
Assume one slave went down and you brought it back after an hour; this slave will be behind the other slaves by one hour. So it's a manual task to check data consistency against the other slaves before adding it back to the ELB.
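One way to do that check (host and core names hypothetical) is to query the ReplicationHandler on the returning slave and compare its state against the master's before re-adding it:
http://slave-host:8983/solr/core1/replication?command=details
The response reports the slave's index version and generation, which you can compare with the master's.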
How about SolrCloud?
There is no master here, so you can achieve high availability for writes too.
No need to worry about the data inconsistency described above; the SolrCloud architecture takes care of that.
What suits you best:
Define an external ZooKeeper quorum with 3 nodes.
Define at least 2 Solr servers.
Split your current index into 2 shards (by default, the shards will reside one each on the 2 Solr nodes defined in step 2).
Set the replication factor to 2 (this creates a replica of each shard on the other node).
Define a load balancer pointing to the above Solr nodes.
Point your indexing pipeline as well as your application at this load balancer.
With the above setup, you can sustain the failure of either node (see the collection-creation sketch below).
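A sketch of how that collection might be created once both Solr nodes are registered with the ZooKeeper quorum; the collection name is hypothetical, and the counts follow the steps above:
/admin/collections?
action=CREATE&
name=mycollection&
numShards=2&
replicationFactor=2&
maxShardsPerNode=2
With numShards=2 and replicationFactor=2 across 2 nodes, each node hosts one shard leader and one replica of the other shard, so either node can fail without losing reads or writes.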
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Solr 4.1 Solr Cloud Shard Structure

If I plan on holding 7TB of data in SolrCloud, is it bad practice to begin with 1 server holding 100 shards and then populate the collection, so that once the size grows, each shard is ultimately peeled off onto its own dedicated server (holding ~70GB each, with its own dedicated resources and replicas)?
That is, I would start the collection with 100 shards locally; then, as data grew, I could peel off one shard at a time and give it its own server, dedicated and with plenty of resources.
Is this okay to do, or would I somehow incur a massive bottleneck internally by putting that many shards on 1 server to start with while data volume was low?
This seems a valid strategy for Solr 4.1, as Mark Miller, one of the main contributors to the project, also mentioned this approach; check here.
However, if you can upgrade or are planning to upgrade, it is possible to add new shards dynamically in Solr after 4.3 via shard splitting;
check here.
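Shard splitting is exposed through the Collections API SPLITSHARD action, available from Solr 4.3 (collection and shard names here are hypothetical):
/admin/collections?
action=SPLITSHARD&
collection=mycollection&
shard=shard1
This creates two sub-shards covering the two halves of shard1's hash range, copies the existing documents into them, and routes new updates to the sub-shards; the now-inactive parent shard can be deleted afterwards.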

Solr 3.6 Distribution dynamically adding shards

I am looking at using distribution and sharding with Solr 3.6 versus Solr 4+ (SolrCloud).
I can see that 3.6 can have multiple shards set up; ideally each shard would rest on a different box. At large scale, once the boxes start to run low on memory, I would like to add new shards to the index. From what I have seen, this cannot be done / isn't documented.
Does this require a full re index of the data?
Can a 3-shard index be re-indexed into a 4-shard instance?
Can queries still be invoked on the index during a re-index?
What are the space overheads required to re index?
The schema.xml (field names and types) would not be changed, just a new shard location added.
Self answer: From what I have seen, it would be best to stop filling the 3 existing shards and fill only the new 4th shard with data, then update the shards parameter to include the new shard in searches.
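In Solr 3.x, distributed search is driven entirely by the shards parameter on each query, so adding the 4th shard is just a matter of extending that list (hosts and core names hypothetical):
/solr/core0/select?q=*:*&shards=host1:8983/solr/core0,host2:8983/solr/core0,host3:8983/solr/core0,host4:8983/solr/core0
Every shard in the list is queried, and the node that received the request merges the results.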

Solr appears to block update requests while committing

We're running a master-slave setup with Solr 3.6 using the following auto-commit options in solrconfig.xml (maxTime is in milliseconds, i.e. 10 minutes):
<autoCommit>
  <maxDocs>500000</maxDocs>
  <maxTime>600000</maxTime>
</autoCommit>
We have approx 5 million documents in our index which takes up approx 550GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15GB). We don't have a particularly high write throughput - about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25 minutes, during which time we're unable to add any new documents to Solr.
One of the answers to the following question suggests using 2 cores and swapping them once the inactive one is fully updated. However, this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
In line with the bounty reason and the problem mentioned in the question, here is a solution from Solr:
Solr has a capability called SolrCloud, beginning with the 4.x versions. Instead of the previous master/slave architecture there are leaders and replicas. Leaders are responsible for indexing documents, and replicas answer queries. The system is managed by ZooKeeper. If a leader goes down, one of its replicas is elected as the new leader.
All in all, if you want to spread out your indexing, SolrCloud handles that automatically, because there is one leader per shard and each leader is responsible for indexing its shard's documents. When you send a query into the system, there will be some Solr nodes (provided there are more Solr nodes than shards) that are not busy indexing and are ready to answer the query. When you add more replicas you will get faster query results (but it will cause more inbound network traffic when indexing, etc.).
For those facing a similar problem: the cause of my problem was that I had too many fields in my documents. I used dynamic fields (*_t), and the number of fields grew pretty fast; once it reached a certain point, it hogged Solr and commits would take forever.
Secondly, I did some profiling, and it turned out most of the time was consumed by calls to String.intern(); the number of fields in the document seems to matter, and as that number goes up, String.intern() gets slower.
The Solr 4 source appears to no longer use String.intern(), but a large number of fields still kills performance quite easily.
