Solr sharding for complex query performance optimization

I have a relatively small index, about 1 million documents, serving a high-load site. I'm running relatively complex function queries against it and the performance is not acceptable. So I'm hesitating about moving the current master+slaves topology to SolrCloud with at least 3 shards and n replicas, so that all function queries are distributed across shards; response time should then be roughly 3 times lower, plus a small overhead for merging the result sets (is that true?).
So my question: is it worth sharding (and adding complexity) to solve performance problems rather than index-size problems (the most common reason to shard an index)?
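For illustration, a hypothetical function query of the kind in question (the fields popularity, views, and doc_age_days are invented here): the function is evaluated per matching document, so the work is CPU-bound per shard and parallelizes well across shards.
http://host:8983/solr/collection/select?q={!func}div(sum(popularity,log(views)),sqrt(doc_age_days))&rows=10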

Related

Solr performance issues

I'm using Solr to handle search on a very large set of documents, and I'm starting to have performance issues with complex queries that use facets and filters.
This is a Solr query used to get some data:
Full request: http://host/solr/discovery/select?q=&fq=domain%3Acom+OR+host%3Acom+OR+public_suffix%3Acom&fq=crawl_date%3A%5B2000-01-01T00%3A00%3A00Z+TO+2000-12-31T23%3A59%3A59Z%5D&fq=%7B%21tag%3Dcrawl_year%7Dcrawl_year%3A%282000%29&fq=%7B%21tag%3Dpublic_suffix%7Dpublic_suffix%3A%28com%29&start=0&rows=10&sort=score+desc&fl=%2Cscore&hl=true&hl.fragsize=200&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E&hl.snippets=10&hl.fl=content&hl.mergeContiguous=false&hl.maxAnalyzedChars=100000&hl.usePhraseHighlighter=true&facet=true&facet.mincount=1&facet.limit=11&facet.field=%7B%21ex%3Dcrawl_year%7Dcrawl_year&facet.field=%7B%21ex%3Ddomain%7Ddomain&facet.field=%7B%21ex%3Dpublic_suffix%7Dpublic_suffix&facet.field=%7B%21ex%3Dcontent_language%7Dcontent_language&facet.field=%7B%21ex%3Dcontent_type_norm%7Dcontent_type_norm&shards=shard1
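Decoded for readability, the request combines:
q=  (empty)
fq=domain:com OR host:com OR public_suffix:com
fq=crawl_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z]
fq={!tag=crawl_year}crawl_year:(2000)
fq={!tag=public_suffix}public_suffix:(com)
start=0, rows=10, sort=score desc, fl=,score
highlighting: hl=true, hl.snippets=10, hl.fl=content, hl.maxAnalyzedChars=100000, hl.usePhraseHighlighter=true
faceting: facet=true, facet.field on crawl_year, domain, public_suffix, content_language, content_type_norm (each with an {!ex=...} exclusion)
shards=shard1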
When this query is run locally against about 50,000 documents it takes about 10 seconds, but when I try it on the host with 200 million documents it takes about 4 minutes. I know it is naturally going to take much longer on the host, but I wonder if anyone has had the same issue and was able to get faster results, knowing that I'm using two shards.
Waiting for your responses.
You're doing a number of complicated things at once: date ranges, highlighting, faceting, and distributed search (non-SolrCloud, by the looks of it).
Still, 10 seconds for a 50k-doc index seems really slow to me. Try selectively removing aspects of your search to see if you can isolate which part is slowing things down and then focus on that. I'd expect that you can find simpler queries that are fast, even if they match a lot of documents.
Either way, check out https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
There are a lot of useful tips there, but the #1 performance issue is usually not having enough memory, especially for large indexes.
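As a concrete starting point for that isolation, strip the request down to a bare filter query and re-add one feature at a time (highlighting, then the faceting block), measuring QTime after each step:
http://host/solr/discovery/select?q=*:*&fq=crawl_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z]&rows=10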
Check how many segments you have in Solr: the more segments, the worse the query response time.
If you have not set a merge factor in your solrconfig.xml, you will probably have close to 40 segments, which is bad for query response time.
Set your merge factor accordingly; if no new documents are to be added, set it to 2.
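For example, on Solr 4+ the CoreAdmin STATUS call reports the current segment count (the core name is a placeholder):
http://host:8983/solr/admin/cores?action=STATUS&core=collection1
Look for the segmentCount value in the index section of the response.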
mergeFactor
The mergeFactor roughly determines the number of segments.
The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.
For example, if you set mergeFactor to 10, a new segment will be created on disk for every 1,000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1,000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments at each index size level.
These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section):
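A minimal sketch of that section, assuming an older solrconfig.xml that still uses mainIndex (newer Solr versions configure merging through a merge policy instead); the values are illustrative:
<mainIndex>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
</mainIndex>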
mergeFactor Tradeoffs
High value merge factor (e.g., 25):
Pro: Generally improves indexing speed
Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
Pro: Smaller number of index files, which speeds up searching.
Con: More segment merges slow down indexing.

How to decide SolrCloud shards per node?

We have 16 machines, each with 64 GB of RAM and 4 cores. The index size is around 200 GB. Initially we decided to have 64 shards, i.e. 4 shards per node. We arrived at 4 shards per node because we have 4-core machines (4 cores can process 4 shards at a time). When we tested, the qtime of the queries was pretty high. We re-ran the performance test with fewer shards: once with 32 total shards (2 shards per node) and once with 16 total shards (1 shard per node). The qtime went down drastically (by up to 90%) for 16 shards.
So how are shards per node decided? Is there a formula based on machine configuration and index volume?
One other thing you will want to review is the type and volume of queries you are sending to Solr. There is no single magic formula you can use; my best advice would be to test a few different alternatives and see which one performs best.
One thing to keep in mind is the JVM size and index size per server. I think it'd be nice if you could cache the entire index in memory on each box.
Additionally, make sure you are testing query response time with the queries you will actually be running, not just made up things. Things like grouping and faceting will make a huge difference.
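To make the memory point concrete with the numbers above: 200 GB of index across 16 nodes is about 12.5 GB per node regardless of the shard count, which caches comfortably in 64 GB of RAM alongside a modest JVM heap. What changes with 64 shards is that every query fans out to 64 per-shard searches, 4 of them competing for each node's 4 cores, followed by a 64-way result merge; that overhead is consistent with the 90% qtime drop observed at 16 shards.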

How to increase Solr query speed when the query is formed of thousands of terms?

I have a Solr index which hosts 4 million documents and whose size is 65 GB. When I browse my index using the web UI everything is fast, but my real queries, which are made of about 2,000 terms (all coming from the same field), are way too slow.
To increase the speed of my Solr queries I first copied the index into RAM, which makes things much faster, but I still need more speed.
I have also created a multi-threaded version of my query, using Java 7's RecursiveTask, where I basically halve the number of query terms until it passes below a threshold, then aggregate the results of the sub-queries to build the final response. It makes things faster but creates other kinds of problems.
Here is the code I use for the multi-term query:
// Pre-6.0 Lucene API (newer versions build this via MultiPhraseQuery.Builder).
MultiPhraseQuery query = new MultiPhraseQuery();
query.add(queryTerms); // queryTerms is a Term[] over one field; all terms share one position
TopDocs tops = searcher.search(query, rows); // rows = number of top hits to collect
ScoreDoc[] scoreDoc = tops.scoreDocs;
Does anyone has some nice suggestions to improve the speed performance ?
Thank you
I believe that 2,000 terms is too much for a single index. You may have to refactor your design.
That said, one way to scale is to use SolrCloud with many replicas in order to improve the query response time of your index.
Also, do not forget the stored="false" option on the field definition, which might make the index much smaller.
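To make that last point concrete, a hypothetical schema.xml field definition (name and type are placeholders) that keeps the field searchable while dropping its stored copy:
<field name="content" type="text_general" indexed="true" stored="false"/>
On the query itself: since all ~2,000 terms come from one field and are added at a single position, the MultiPhraseQuery above matches the same documents as one large disjunction (scoring differs). A hedged sketch of that flat BooleanQuery alternative, assuming the same pre-6.0 Lucene API as the snippet in the question; it skips phrase-position bookkeeping and may be faster:
// Assumed imports: org.apache.lucene.index.Term and org.apache.lucene.search.*
// Raise the clause limit first: BooleanQuery allows 1,024 clauses by default.
BooleanQuery.setMaxClauseCount(Math.max(BooleanQuery.getMaxClauseCount(), queryTerms.length));
BooleanQuery bq = new BooleanQuery();
for (Term t : queryTerms) {
    bq.add(new TermQuery(t), BooleanClause.Occur.SHOULD); // match any one of the terms
}
TopDocs tops = searcher.search(bq, rows);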

Indexing speed performances with and without Solrcloud

I've run 2 performance tests to measure indexing speed with a collection of 235,280 documents:
1st test: 1 Solr instance without SolrCloud: indexing speed = 6,191 docs/s
2nd test: 4 Solr instances (4 shards) coordinated by SolrCloud: indexing speed = 4,506 docs/s
I use 8 CPUs.
So, I have some questions about these results:
Q1: Does the number of Solr instances usually improve or degrade indexing speed?
Q2: Does SolrCloud degrade indexing speed?
Q3: Why do I see a decrease in performance when I use SolrCloud? Did I miss something (a setting?)?
Edit :
I use a CSV update handler to index my collection.
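For reference, a typical invocation of that handler might look like this (host, core name, and file are placeholders):
curl 'http://host:8983/solr/collection1/update/csv?separator=,&commit=false' -H 'Content-Type: text/csv' --data-binary @docs.csv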
Based on the performance test that I carried out, sharding across multiple nodes in a SolrCloud infrastructure improved my indexing performance. Replicating shards across multiple nodes to handle failover did slow down indexing, for obvious reasons. Also, consider bulk indexing over single updates.
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
There are many settings in Solr, as well as hardware specs, that can affect indexing performance. Besides the obvious solution of throwing more machines at it, tuning Solr is more of an art than a science. Here is my experience; take it with a grain of salt. Generally you should see 6K to 8K documents per second indexing performance.
Hardware specs: 4 machines, each with 40 (hyperthreaded) cores, 256 GB of RAM, and SSDs.
I also use the updateCSV API for importing documents.
My baseline is measured with 1 of those machines (1 shard).
My SolrCloud numbers are measured with all 4 of them (4 shards, 1 replica per collection).
For a large collection (82 GB), I saw 3.68x throughput.
For a medium collection (7 GB), 2.17x.
For a small collection (1.29 GB), 1.17x.
So to answer your question:
Q1: Generally, the more Solr nodes you have per collection, the higher the indexing speed. It might plateau at some point, but indexing performance certainly should not degrade. Maybe your collection is too small to justify the SolrCloud horizontal-scaling overhead?
Q2: No, SolrCloud should not degrade indexing speed.
Q3: It really depends on how you set it up. I saw performance gains with just the default settings, but here are the things I came across that boosted performance even more:
Don't set commit=true in your updateCSV API call.
You can use more shards per collection than the number of live Solr nodes if system utilization is low.
solr.hdfs.blockcache.slab.count should be between 10 and 20% of available system memory (this applies when the index is stored on HDFS).
autoCommit should generally be around 15 seconds (see the snippet below).
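A sketch of the corresponding autoCommit block in solrconfig.xml (15 seconds, without opening a new searcher on every commit):
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>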

Confused about the effect of mergeFactor on searching and indexing

Solr/Lucene documentation says the following:
1) A high mergeFactor leads to better indexing performance, as writing the index to disk is minimized and segment merging happens less frequently, but it leads to lower query speed because the number of segments is high and searching them takes time.
2) A low mergeFactor leads to poorer indexing performance but faster queries, for the same reasons as above.
I have also learnt that the merging happens in parallel in the background and is not part of the indexing request.
Questions:
1) When I have a low mergeFactor, what is causing the low indexing performance? Having to write the index to disk more often, or the merging? Writing to disk is an understandable bottleneck. But if frequent merging is also a cause, and it happens in the background, then it should slow down querying too, since the querying threads would contend for CPU with the merging threads.
2) Is querying blocked while a segment merge happens?
1) Frequent merging (a low merge factor) causes low indexing performance. But a low merge factor is likely to improve search performance, because there are fewer segments to search.
2) No. Searches run against the point-in-time set of segments that was open when the searcher was created; a merge writes its output to new files, and the old segments are removed only once no searcher references them, so queries are never blocked by merging.
