With master-slave implementation of distributed Solr (prior to Solr 4.x) it was a straight design solution to have master which takes load for indexing, merging and optimizing index. Then the index gets copied to replicas while replicas meanwhile are always serving searches.
Could someone explain how this is done now with SolrCloud?
Seems like SolrCloud sends indexing commands to each replica from leader. But how the search performance could be achieved then? Indexing and searching on each replica makes load on each node server (to index and run merge thread in background) and since my index is quite big it takes a lot of time usually to merge segments or simply optimize.
Should I deliver that all now to merge policy and not worry at all? Does TieredMergePolicy provide both good search performance and low resource load (CPU, I/O) at the same time?
I'll try to answer part of your questions: SolrCloud indeed indexes on all nodes, and therefore it has a performance impact on replicas. This is done due to 'hot replication' model instead of 'cold replication' as you are used to. It comes to solve data integrity issues as well as real time search on a cluster. You get consistent data and faster data availability as a price of performance impact. Actually, you can always split data to shards (at a price of additional hardware), and have comparable performance.
In either case, it's up to you to decide whether SolrCloud suits your needs. You can use Solr 4 without cloud model and manage it yourself as before.
Related
Can't seem to find the answer to this obvious question.
We have 6 servers currently configured as "Search" workload running DSE.
My question is:
Is it possible to run Search (Solr) and Cassandra on the same physical box? (Not) Possible / (Not) Recommended?
I'm very confused with the fact that we currently are running all nodes as Solr nodes and I'm still able to use them as Cassandra (real time queries) - so it's technically both?
The "Services /Best Practice" tells me that:
"Please replace the current search nodes that have vnodes enabled with nodes without vnodes."
Our ideal situation would be:
a. Use all 6 servers as cassandra storage (+ real time queries)
b. Use 1 or 2 of the SAME servers as Solr Search.
The only documentation that I've found that somewhat resemble what we want to is -
http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/deploy/deployWkLdSep.html
but as far as I understand it still says that I need to physically split the load, meaning dedicate 4 servers for cassandra and 2 nodes for solr/search ?
Can anyone explain/suggest anything?
Thank you!
DSE Search - C* and Solr on the Same node:
As Rock Brain mentioned, DSE Search will run Solr and Cassandra on the same node. More specifically, it will run it on the same JVM. This has heap implications. Recommendation is to bump your heap up to 14gb rather than the c* only 8gb.
As RB also mentioned, CPU consumption will be greater with Solr. However, I
often see Search DC's with fewer, beefier, nodes than C* nodes. Again this depends on your workload and how much data you're indexing.
Note: DSE Search Performance Tip
The main rule of thumb for performance is to try to fit all your DSE Indexes in the OS page cache so you may need more RAM than for a Cassandra only node to get optimal performance.
DSE Search and Workload Isolation:
You will find in the DataStax docs, that we recommend for you to run separate data centers for your cassandra workloads and for your search or analytics workloads. This basically prevents Search driven contention from affecting your cassandra ingestions.
The reason behind this recommendation is that many DSE customers have super-tight micro second sla's and very large workloads. You can get away with running search and c* in the same nodes (same DC) if you have looser SLA's and smaller workloads. Your best bet is to POC it with your workload on your hardware and see how it performs.
Can I activate DSE Search on just 2 of my 6 DSE nodes?
Not really, you most likely want to turn on search on your whole DC or not at all. For the following reasons:
the DSESimpleSnitch will automatically split them up into separate DC's so you'd have to use another snitch.
you will get cannot find endpoints errors on your Solr DC's if there aren't enough nodes with the right copies of your data. Remember, Cassandra is still responsible for replication and the Solr core on each node will only index the corresponding data that is on that node.
Turn on search in all 6, but feel free to direct c* queries at all of them and search queries only at 2 if you want. Not sure why you would want to though, you'll clearly see those 2 nodes will be under higher load in OpsCenter.
Remember that you can leverage Search queries right from CQL now as of DSE 4.6.
Vnodes vs. Non Vnodes for DSE Search
For your question on the comment above. Vnodes are not recommended for DSE Search as you will incur a performance hit. Specifically, pre 4.6 it was a large hit, ~300%. But as of 4.6 it's only a 30% performance hit for Search queries. The bigger the num_vnodes the larger the hit.
You can run vnodes on one DC and single tokens on the other DC. DSE will, by default, run single tokens.
Is it possible to run Search (Solr) and Cassandra on the same physical box? (Not) Possible / (Not) Recommended?
Yes, this is how DSE Search works, Cassandra and Solr run in the same process with the full functionality of both available.
Solr uses more CPU than Cassandra, so you will want more Solr nodes than dedicated Cassandra nodes. You will setup separate Cassandra and Solr data centers to divide the work load types.
I am trying to index a large data in solr/lucene. Since It is a legacy system and because of some other reasons, I have to do it via a C++ layer. But before doing that I wanted to optimize the process so I did google for that. I found out following things for that:
Indexing in batches: which will help me in scenario where indexing will fail in between because of some failure. So i can start with remaining batches again.
buffer lookup
indexer concurrency
I found the last 2 terms somewhere while looking for different issues, but I am unable to understand it fully.
So if anyone can help me in understanding these two issues and any other issue which may arise.
I'm not sure what you mean when you're mentioning "Buffer Lookup" - usually this is the case of allowing a server to have a decent in-memory cache, where as many queries as possible can be answered without having to recalculate the intersection between documents and which documents are contained in a certain set for each query. For Solr this is configured using the different *cache-settings. The requirements will be different for most applications, depending on query load, field definitions, etc. Performing a commit (making documents visible in the index) usually expires caches, as the cache might no longer be valid.
Indexer Concurrency allows a server to insert documents into the actual index from many threads at the same time, without locking between the threads. Lucene made concurrent indexing possible back in 2011 (for Lucene 4.0), and allows faster and more efficient updates of the index. Whether this matters depends on your application.
Solr 3x "Repeaters" and Multiple Data Centers:
Solr 3x let a node behave as both a slave and master, pull from one master, and then feed copies downstream to its own slaves. This was so common/useful it even had a name, a "Repeater".
This was useful if you wanted span multiple data centers. You could have the real master in data center A (DCA), and a "repeater" in data center B (DCB). That repeater would then grab content from DCA and feed all of the other nodes in DCB, saving on bandwidth.
Suppose you want to upgrade this setup to Solr 4x and SolrCloud. (Note that Solr 4x still supports Solr 3x-style legacy replication)
It's said that you should NOT have a single SolrCloud cluster span disparate data centers. So data center B should have it's own SolrCloud.
One idea is to have the DCA -> DCB link still use Solr 3x-style Master/Slave replication. And then the "repeater" in DCB, being also a SolrCloud node, would automatically be propagated to other nodes.
Main question:
Can a Solr node participate in both Solr 3x-style master/slave mode (as a slave) and also be part of a SolrCloud cluster? And if so, how is this configured?
Complications:
In the simple case, if it's just 1 shard with replicas, it's easy to see how that might work in terms of data. It's a little less clear if you have multiple shards in DCB, how do I tell each shard to only replicate its own share of data? Note that SolrCloud normally replicates via transactions, whereas 3x uses binary indices.
Another complexity is if you're doing replication. How do you tell just the master node for each shard to pull from the remote DCA node?
Alternatives:
On solution is to upgrade to 4x but continue using 3x-style replication in DCB, so just don't use SolrCloud.
I realize that another solution would be to have the data feed send it's updates to both data centers, or usE something like RabbitMQ. For the sake of this question, let's assume thats not an option (long story...)
Maybe there's some other way I haven't thought of?
Has anybody actually tried having SolrCloud span data centers? How horrible is it?
Somebody must have asked this question before!
But I've looked on Google and, although it finds tons of pages with the keywords, I haven't seen this specific "hybrid" mode fleshed out. I found one thread from 2013 but it didn't really talk about the configuration and complexity.
To answer your first question, a Solr slave in 3.X style cannot be a node in a Solr Cloud. The reason is the slave in a master/slave 3.X Solr config simply replicates, byte for byte, all the index files on the master. That's all it does. It can, in the repeater config, then also be a master for others to replicate from, or be a dedicated query slave or both. But that's it.
A node in a Solr Cloud config is a full participant in a distributed computing cluster where indexing is generally intended to be distributed across all nodes, and all nodes participate in queries. It's a very powerful feature which automatically handles failed nodes and significantly eases the work load of scaling up that was very manual in 3.X style.
However, part of what you pay for that is increased complexity (Zookeeper), requirements for lower latency inter-node communications (because all the nodes now talk to each other and to Zookeeper) and the loss of the simplicity of Master/Slave replication.
At 20M docs you are well within the constraints of a single node master index with an effectively unlimited number of slaves and therefor very high query capacity. I do this today with a production environment where each master has on the order of 60M docs in it with no significant problems.
The question is do you need NRT, multi-node indexing, automated failover, the ability to autoscale well past 100M docs? If so then Master/Slave it probably not going to work for you.
You could take a look at writing the same data to two different Solr Cloud clusters, one in each datacenter. You could do that directly, or use something like Apache Flume to do it for you - in either there are some issues with doing this and so the real question is are dealing with those issues worth it to get the added benefit of Solr Cloud?
My client does not have the budget to setup and maintain a SOLR server to use in their production environment. If I understand the Sitecore 7 Content Search API correctly, it is not a big deal to configure things to use Lucene instead. For the most part the configuration will be similar and the code will be the same, and a SOLR server can be swapped in later.
The site build has
faceted search page
listing components on landing and on other pages that will leverage the Content Search API
buckets with custom facets
The site has around 5,000 pages and components not including media library items. Are there any concerns about simply using Lucene?
The main question is, when, during your architecture or design phase do you know that you should definitely choose SOLR over Lucene? What are the major signs that lead you recommend that?
I think if you are dealing with a customer on a limited budget then Lucene will work perfectly well and perform excellently for the scale of things you are doing. All the things you mention are fully supported by the implementation in Lucene.
In a Sitecore scenario I would begin to consider Solr if:
You need to index a large number of items - id say 50 thousand upwards - Lucene is happy with these sorts of number but Solr has improved query caching and is designed for these large numbers of items.
The resilience of the search tier is of maximum business importance (ie the site is purely driven by search) - Solr provides a more robust replication/sharding and failover system with SolrCloud.
Re-purposing of the search tier in other application is important (non Sitecore) - Solr is a search application so can be accessed over HTTP with XML/JSON etc which makes integration with external systems easier.
You need some specific additional feature of Solr that Lucene doesn't have.
.. but as you say if you want swap out Lucene for Solr at a later phase, we have worked hard to make sure that the process as simple as possible. Worth noting a few points here:
While your LINQ queries will stay the same your configuration will be slightly different and will need attention to port across.
The understanding of how Solr works as an application and how the schema works is important to know but there are some great books and a wealth of knowledge out there.
Solr has slightly different (newer) analyzers and scoring mechanisms so your search results may be slightly different (sometimes customers can get alarmed by this :P)
.. but I think these are things you can build up to over time and assess with the customer. Im sure there are more points here and others can chime in if they think of them. Hope this helps :)
Stephen pretty much covered the question - but I just wanted to add another scenario. You need to take into account the server setup in your production environment. If you are going to be using multiple content delivery servers behind a load balancer I would consider Solr from the start, as trying to make sure that the Lucene index on each delivery server is synchronized 100% of the time can be painful.
I would recommend planning an escape plan from Lucene as early as you start thinking about multiple CDs and here is why:
A) Each server has to maintain its own index copy:
Any unexpected restart might cause a few documents not to be added to the index on the one box, making indexes different from server to server.
That would lead to same page showing differently by CDs
Each server must perform index updates - use CPU & disk space; response rate drops after publish operation is over =/
According to security guide, CDs should have Sitecore Shell UI removed, so index cannot be easily rebuilt from Control Panel =\
B) Lucene is not designed for large volumes of content. Each search operation does roughly following:
Create an array with size equal to total number of documents in the index
If document matches search, set flag in the array
While this works like a charm for low sized indexes (~10K elements), huge performance degradation is produced once the volume of content grows.
The allocated array ends in Large Object Heap that is not compacted by default, thereby gets fragmented fast.
Scenario:
Perform search for 100K documents -> huge array created in memory
Perform one more search in another thread -> one more huge array created
Update index -> now 100K + 10 documents
The first operation was completed; LOH has space for 100K array
Seach triggered again -> 100K+10 array is to be created; freed memory 'hole' is not large enough, so more RAM is requested.
w3wp.exe process keeps on consuming more and more RAM
This is the common case for Analytics Aggregation as an index is being populated by multiple threads at once.
You'll see a lot of RAM used after a while on the processing instance.
C) Last Lucene.NET release was done 5 years ago.
Whereas SOLR is actively being developed.
The sooner you'll make the switch to SOLR, the easier it would be.
I've had this week an issue with a Solr index: http://lucene.472066.n3.nabble.com/corrupted-index-in-slave-td4054769.html,
Today, that error started to happen constantly for almost every request, and I created a JIRA issue becaue I thought it was a bug https://issues.apache.org/jira/browse/SOLR-4707
As you can read, at the end it was due to a fail in the Solr master-slave replication, and now I don't know if we should think about migrating to SolrCloud, since Solr master-slave replications seems not to fit to our requirements:
index size: ~20 million documents, ~9GB
~1200 updates/min
~10000 queries/min (distributed over 2 slaves) MoreLikeThis, RealTimeGet, TermVectorComponent, SearchHandler
I would thank you if anyone could help me to answer these questions:
Would it be advisable to migrate to SolrCloud? Would it have impact on the replication performance?
In that case, what would have better performance? to maintain a copy of the index in every server, or to use shard servers?
How many shards and replicas would you advice for ensuring high availability?
Kind Regards,
Victor
Well, answer to all your questions depends on what exactly you want from solrcloud.
Yes,it would be advisable to move over to solrcloud as it provides High availability,scalability and Near real time search plus automated hot replication.But these features comes at the cost of slightly performance degradation (You want notice even in well configured cluster).
I would suggest you should use shared configuration to allow solr to maintain index data for you (I am sure you will bring smile to TechOps people if you do so). This will reduce human errors and resource requirement as well.
Answer to your last question entirely depends on your cloud deployment.You should try with 2 shard 2 replica configuration and then create test deployment to ensure that it serves your needs.If not, try with different combinations of shard and replica counts until u get what u want(I know its pain !).
At last don't forget to estimate your future growth(How much data you will add to your cluster in next couple of years), and keeping in mind you should decide shards and replicas