Solr/Lucene documentation says the following:
1) A high merge factor leads to better indexing performance, because writing the index to disk is minimized and segments are merged less frequently, but it leads to lower query speed because the number of segments is high and searching across them takes time.
2) A low merge factor leads to poorer indexing performance but faster queries, for the opposite reasons.
I have also learnt that the merging happens in parallel in the background and is not part of the indexing request.
Questions:
1) When I have a low merge factor, what is causing the low indexing performance: having to write the index to disk more often, or the merging? Writing to disk is an understandable bottleneck. But if frequent merging is also a cause, and merging happens in the background, then it should slow down querying too, since the query threads would be contending for the CPU along with the merging threads.
2) Is querying blocked while a segment merge happens?
1) Frequent merging (a low merge factor) causes low indexing performance. But a low merge factor is likely to improve search performance, because there are fewer segments to search.
2) No.
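For context, the old mergeFactor knob corresponds to the merge policy's settings (in Solr this is configured under <indexConfig> in solrconfig.xml). Below is a rough Lucene-level sketch of where that knob lives; this assumes a Lucene 8/9-era API, and the index path is just a placeholder:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class MergePolicyExample {
    public static void main(String[] args) throws Exception {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        // Fewer segments per tier => more frequent merging and fewer segments to
        // search (the behaviour the old "mergeFactor" setting controlled).
        mergePolicy.setSegmentsPerTier(5);
        mergePolicy.setMaxMergeAtOnce(5);

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setMergePolicy(mergePolicy);

        // "/tmp/index" is a placeholder path for this sketch.
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index")), config)) {
            // add documents here; merges run on background threads
            // via the default ConcurrentMergeScheduler.
        }
    }
}
```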
In the open-source version, Scylla recommends keeping up to 50% of disk space free for “compactions”. At the same time, the documentation states that each table is compacted independently of the others. Logically, this suggests that in applications with dozens (or even just several) tables there is only a small chance that many compactions will coincide.
Is there a mathematical model for calculating how multiple compactions might overlap in an application with several tables? Based on a cursory analysis, it seems that the likelihood of multiple overlapping compactions is small, especially when we are dealing with dozens of independent tables.
You're absolutely right:
With the size-tiered compaction strategy, a compaction may temporarily double the disk requirements. But it doesn't double the entire disk requirements - only the space of the sstables involved in this compaction (see also my blog post on size-tiered compaction and its space amplification). There is indeed a difference between "the entire disk usage" and just "the sstables involved in this compaction", for two reasons:
As you noted in your question, if you have 10 tables of similar size, compacting just one of them will work on just 10% of the data, so the temporary disk usage during compaction might be 10% of the disk usage, not 100%.
Additionally, Scylla is sharded, meaning that different CPUs handle their sstables, and compactions, completely independently. If you have 8 CPUs on your machines, each CPU only handles 1/8th of the data, so when it does compaction, the maximum temporary overhead will be 1/8th of the table's size - not the full table size.
The second reason cannot be counted on - since shards choose when to compact independently, if you're unlucky all shards may decide to compact the same table at exactly the same time, and worse - may happen to do the biggest compactions all at the same time. This "unluckiness" can also happen at 100% probability if you start a "major compaction" (nodetool compact).
The first reason, the one which you asked about, is indeed more useful and reliable: beyond it being unlikely that all shards will choose to compact all sstables at exactly the same time, there is an important detail in Scylla's compaction algorithm which helps here: each shard only does one compaction of a (roughly) given size at a time. So if you have many roughly equal-sized tables, no shard can be doing a full compaction of more than one of those tables at a time. This is guaranteed - it's not a matter of probability.
Of course, this "trick" only helps if you really have many roughly-equal-sized tables. If one table is much bigger than the rest, or tables have very different sizes, it won't help you too much to control the maximum temporary disk use.
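To put that first reason into numbers, here is a back-of-the-envelope sketch of my own (not a formula from Scylla's docs). It assumes size-tiered compaction can temporarily need roughly 100% extra space for the sstables being compacted, and that in the worst case every shard happens to be compacting the largest table at the same time, so the temporary overhead is roughly the size of that one table rather than of all the data:

```java
import java.util.List;

public class CompactionHeadroom {

    // Worst case under the stated assumptions: all shards compact the biggest
    // table simultaneously, so the temporary overhead is about that table's size.
    static double worstCaseTemporaryBytes(List<Double> tableSizesBytes) {
        return tableSizesBytes.stream().mapToDouble(Double::doubleValue).max().orElse(0);
    }

    public static void main(String[] args) {
        // Ten roughly equal tables of 100 GB each => ~1 TB of data,
        // but the worst-case temporary overhead is ~100 GB (10%), not ~1 TB.
        List<Double> tables = List.of(100e9, 100e9, 100e9, 100e9, 100e9,
                                      100e9, 100e9, 100e9, 100e9, 100e9);
        double total = tables.stream().mapToDouble(Double::doubleValue).sum();
        double temp = worstCaseTemporaryBytes(tables);
        System.out.printf("total=%.0f GB, worst-case temporary=%.0f GB (%.0f%%)%n",
                total / 1e9, temp / 1e9, 100 * temp / total);
    }
}
```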
In issue https://github.com/scylladb/scylla/issues/2871 I proposed an idea of how Scylla can guarantee that when disk space is low, the sharding (point 2) is also used to reduce temporary disk space usage. We haven't implemented this idea, but instead implemented a better one - "incremental compaction strategy", which does huge compactions in pieces ("incrementally") to avoid most of the temporary disk usage. See this blog post for how this new compaction strategy works, and graphs demonstrating how it lowers the temporary disk usage. Note that Incremental Compaction Strategy is currently part of the Scylla Enterprise version (it's not in the open-source version).
For a large Cassandra partition, read latencies are usually huge.
But does write latency get impacted in this case? Since Cassandra is a columnar database and holds immutable data, shouldn't the write (which appends data at the end of the row) take less time?
In all the experiments I have conducted with Cassandra, I have noticed that write throughput is not affected by data size, while read performance takes a big hit if your SSTables are too big or the concurrent_reads threads are too few (check with nodetool tpstats whether ReadStage is going into a pending state, and increase concurrent_reads in cassandra.yaml if so). Using LeveledCompaction seems to help, as data for the same key remains in the same SSTable. Make sure your data is distributed evenly across all nodes. Cassandra optimization is tricky, and you may have to implement "hacks" to obtain the desired performance on the minimum possible hardware.
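If you want to try LeveledCompaction, a minimal sketch of switching a table's compaction strategy through the DataStax Java driver is shown below; the keyspace and table names are placeholders, and it relies on the driver's default localhost contact point:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class SwitchToLcs {
    public static void main(String[] args) {
        // Connects to 127.0.0.1:9042 by default; keyspace/table names are placeholders.
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute(
                "ALTER TABLE my_keyspace.my_table " +
                "WITH compaction = {'class': 'LeveledCompactionStrategy'}");
        }
    }
}
```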
If my index is, say, 80% fragmented and is used in joins, can the overall performance be worse than if that index didn't exist? And if so, why?
Your question is too vague to answer consistently, or even to know what you're actually after, but consider this:
A fragmented index means you'll have a lot more actual disk activity than a given query would need with a properly maintained index.
Take a look at DBCC SHOWCONTIG
Among other useful information, it shows you a figure for Scan Density. A very low "hit rate" on this can imply that you're doing heaps more IO than you'd need to with a properly maintained index. This could even exceed the amount of IO you'd need to perform a table scan, but it all depends on the size of your objects and your data access pattern.
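As a rough sketch of how you might pull those figures programmatically over JDBC (the connection string and table name are placeholders; WITH TABLERESULTS makes DBCC SHOWCONTIG return its output, including the scan density, as a result set):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class ScanDensityCheck {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // DBCC SHOWCONTIG ... WITH TABLERESULTS returns a row per object
            // with fragmentation and scan density columns.
            boolean hasResults = stmt.execute("DBCC SHOWCONTIG ('dbo.MyTable') WITH TABLERESULTS");
            if (hasResults) {
                try (ResultSet rs = stmt.getResultSet()) {
                    ResultSetMetaData md = rs.getMetaData();
                    while (rs.next()) {
                        for (int i = 1; i <= md.getColumnCount(); i++) {
                            System.out.println(md.getColumnName(i) + " = " + rs.getString(i));
                        }
                    }
                }
            }
        }
    }
}
```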
One area where a poorly maintained (= highly fragmented) index will hurt you double is that it hurts performance for inserts, updates AND selects.
With this in mind, it's a pretty common practice for ETL processes to drop indexes before processing large batches of information and recreate them afterwards. In the meantime, the indexes would only hurt write performance and be too fragmented to help lookups.
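A minimal sketch of that drop-then-recreate pattern around a batch load; the index, table, and connection details here are hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EtlIndexToggle {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Drop the index so the bulk load doesn't pay for index maintenance.
            stmt.execute("DROP INDEX IX_Orders_CustomerId ON dbo.Orders");

            // ... bulk-load the batch here ...

            // Recreate the index, freshly built and unfragmented, for lookups.
            stmt.execute("CREATE NONCLUSTERED INDEX IX_Orders_CustomerId ON dbo.Orders (CustomerId)");
        }
    }
}
```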
Besides that: it's easy to do index maintenance. I'd recommend deploying Ola Hallengren's index maintenance solution and then no longer worrying about it.
I've done 2 performance tests to measure the indexing speed with a collection of 235280 documents:
1st test: 1 Solr instance without SolrCloud: indexing speed = 6191 doc/s
2nd test: 4 Solr instances (4 shards) linked with SolrCloud: indexing speed = 4506 doc/s
I use 8 CPUs.
So, I have some questions about these results:
Q1: In general, does increasing the number of Solr instances improve or degrade indexing speed?
Q2: Does SolrCloud degrade indexing speed?
Q3: Why do I get a decrease in performance when I use SolrCloud? Did I miss something (a setting?)?
Edit :
I use a CSV update handler to index my collection.
Based on the performance test that I carried out, sharding across multiple nodes in a SolrCloud infrastructure improved my indexing performance. Replicating shards across multiple nodes to handle failovers did slow down indexing performance, for obvious reasons. Also consider bulk indexing over single-document updates.
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
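For reference, a minimal sketch of bulk-loading a CSV file through the CSV update handler with SolrJ (SolrJ 7/8-era API; the Solr URL, collection name, and file path are placeholders):

```java
import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class BulkCsvIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            // Stream the whole CSV file in one request instead of single-document updates.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
            req.addFile(new File("documents.csv"), "text/csv");
            // Rely on commitWithin / autoCommit rather than committing per request.
            req.setCommitWithin(15000);
            req.process(client);
        }
    }
}
```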
There are many settings in Solr, as well as the hardware specs, that can affect indexing performance. Besides the obvious solution of throwing more machines at it, tuning Solr is more of an art than a science. Here is my experience, so take it with a grain of salt. Generally you should see 6K to 8K documents per second indexing performance.
Hardware specs: 4 machines, each with 40 cores (hyperthreaded), 256GB of RAM, and SSDs.
I also use updateCSV API for importing documents.
My baseline metrics are measured with 1 of those machines (1 shard).
My SolrCloud metrics are measured with all 4 of them (4 shards with 1 replica per collection).
For a large collection (82GB), I saw 3.68x throughput.
For a medium collection (7GB), 2.17x.
For a small collection (1.29GB), 1.17x.
So to answer your question:
Q1: Generally, the more Solr nodes you have per collection, the higher the indexing speed. It might plateau at some point, but indexing performance certainly should not degrade. Maybe your collection is too small to justify the SolrCloud horizontal-scaling overhead?
Q2: No, SolrCloud should not degrade indexing speed.
Q3: It really depends on how you set it up. I see a performance gain with just the default settings. But here are the things I came across that boosted performance even more:
Don't set commit=true in your updateCSV API call.
You can use more shards per collection than the number of live Solr nodes if system utilization is low (see the sketch after this list).
solr.hdfs.blockcache.slab.count should be set so the cache uses between 10% and 20% of available system memory.
autoCommit generally should be 15 seconds.
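As a sketch of the shards-per-collection point above, here is roughly how you could create an over-sharded collection with SolrJ's collections API (SolrJ 8-era API; the ZooKeeper address, collection name, and configset name are placeholders):

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateOvershardedCollection {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client =
                 new CloudSolrClient.Builder(Collections.singletonList("localhost:9983"),
                                             Optional.empty()).build()) {
            // e.g. 8 shards with 1 replica each on a 4-node cluster,
            // if node utilization is low enough to carry two shards per node.
            CollectionAdminRequest
                .createCollection("mycollection", "_default", 8, 1)
                .process(client);
        }
    }
}
```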
I have a relatively small index, about 1 million documents, for a high-load site. I'm running relatively complex function queries against it, and the performance is not acceptable. So I'm hesitating about moving the current master+slaves topology to SolrCloud with at least 3 shards and n replicas, so that all function queries would be distributed across the shards and the response time should be roughly 3 times smaller, plus a small footprint for merging the result sets (is that true?).
So my question is: is it worth sharding (and adding complexity) to solve performance problems rather than index-size problems (the most common reason to shard your index)?