How to decide Solr cloud shard per node? - solr

We have 16 machines, each with 64 GB RAM and 4 cores. The index size is around 200 GB. Initially we decided to have 64 shards, i.e. 4 shards per node. We settled on 4 shards per node because each machine has 4 cores (4 cores can process 4 shards at a time). When we tested, the QTime of the queries was quite high. We re-ran the performance test with fewer shards: once with 32 total shards (2 shards per node) and once with 16 total shards (1 shard per node). The QTime went down drastically (by up to 90%) with 16 shards.
So how is the number of shards per node decided? Is there a formula based on machine configuration and index volume?

One other thing you will want to review is the type and volume of queries you are sending to Solr. There is no single magic formula; my best advice is to test a few different alternatives and see which one performs best.
One thing to keep in mind is the JVM heap size and the index size per server. I think it would be ideal if the entire index on each box could be cached in memory.
Additionally, make sure you are testing query response time with the queries you will actually be running, not made-up ones. Features like grouping and faceting make a huge difference.
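
As a rough illustration of the "index size per server" check, here is a minimal sketch using the numbers from the question (200 GB index, 16 nodes, 64 GB RAM each). The 16 GB JVM heap is an assumed value, not something stated in the question, and the sketch ignores replicas and other processes on the box.

```python
def index_vs_page_cache(total_index_gb, nodes, ram_gb_per_node, jvm_heap_gb):
    """Compare the index slice held by one node with the RAM left for the OS page cache."""
    index_per_node_gb = total_index_gb / nodes
    # Whatever is not given to the JVM heap is (at best) available for caching index files.
    cache_available_gb = ram_gb_per_node - jvm_heap_gb
    fits = index_per_node_gb <= cache_available_gb
    return index_per_node_gb, cache_available_gb, fits


# 200 GB index over 16 nodes with 64 GB RAM each; the 16 GB heap is an assumption.
per_node, cache, fits = index_vs_page_cache(200, 16, 64, jvm_heap_gb=16)
print(f"index per node: {per_node} GB, cache available: {cache} GB, fits: {fits}")
```

With these inputs each node holds 12.5 GB of index, so memory alone is unlikely to explain the difference between 1 and 4 shards per node; testing with real queries remains the deciding factor.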

Related

Solr performance issues

I'm using Solr to handle search on a very large set of documents, and I'm starting to have performance issues with complex queries that use facets and filters.
This is a Solr query used to get some data:
solr full request : http://host/solr/discovery/select?q=&fq=domain%3Acom+OR+host%3Acom+OR+public_suffix%3Acom&fq=crawl_date%3A%5B2000-01-01T00%3A00%3A00Z+TO+2000-12-31T23%3A59%3A59Z%5D&fq=%7B%21tag%3Dcrawl_year%7Dcrawl_year%3A%282000%29&fq=%7B%21tag%3Dpublic_suffix%7Dpublic_suffix%3A%28com%29&start=0&rows=10&sort=score+desc&fl=%2Cscore&hl=true&hl.fragsize=200&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E&hl.snippets=10&hl.fl=content&hl.mergeContiguous=false&hl.maxAnalyzedChars=100000&hl.usePhraseHighlighter=true&facet=true&facet.mincount=1&facet.limit=11&facet.field=%7B%21ex%3Dcrawl_year%7Dcrawl_year&facet.field=%7B%21ex%3Ddomain%7Ddomain&facet.field=%7B%21ex%3Dpublic_suffix%7Dpublic_suffix&facet.field=%7B%21ex%3Dcontent_language%7Dcontent_language&facet.field=%7B%21ex%3Dcontent_type_norm%7Dcontent_type_norm&shards=shard1"
When this query is run locally with about 50,000 documents, it takes about 10 seconds, but when I try it on the host with 200 million documents it takes about 4 minutes. I know it will naturally take much longer on the host, but I wonder if anyone has had the same issue and was able to get faster results. Note that I'm using two shards.
Waiting for your responses.
You're doing a number of complicated things at once: Date ranges, highlighting, faceting, and distributed search. (Non-solrcloud, looks like)
Still, 10 seconds for a 50k-doc index seems really slow to me. Try selectively removing aspects of your search to see if you can isolate which part is slowing things down and then focus on that. I'd expect that you can find simpler queries that are fast, even if they match a lot of documents.
Either way, check out https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
There are a lot of useful tips there, but the #1 performance issue is usually not having enough memory, especially for large indexes.
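
One way to "selectively remove aspects" of the search is to generate variants of the query string with one feature stripped at a time, then time each variant against Solr. The sketch below only builds the variants; the grouping of parameters into features (`hl*`, `facet*`, `fq`) and the function name are my own assumptions for illustration, not something from the original question.

```python
from urllib.parse import parse_qsl, urlencode

# Parameter-name prefixes that switch a feature on; this grouping is an assumption.
FEATURE_PREFIXES = {
    "highlighting": ("hl",),
    "faceting": ("facet",),
    "filters": ("fq",),
}


def query_variants(query_string):
    """Return the full query plus one variant per feature with that feature removed."""
    params = parse_qsl(query_string, keep_blank_values=True)
    variants = {"full": urlencode(params)}
    for feature, prefixes in FEATURE_PREFIXES.items():
        kept = [
            (key, value)
            for key, value in params
            if not any(key == p or key.startswith(p + ".") for p in prefixes)
        ]
        variants["without_" + feature] = urlencode(kept)
    return variants


example = "q=*:*&fq=crawl_year:2000&hl=true&hl.fl=content&facet=true&facet.field=domain"
for name, qs in query_variants(example).items():
    print(name, "->", qs)
```

Sending each variant to the same `/select` endpoint and comparing QTime should show which feature dominates the response time.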
Check how many segments your Solr index has: the more segments there are, the worse the query response time.
If you have not set the merge factor in your solrconfig.xml, you will probably have close to 40 segments, which is too many for good query response time.
Set your merge factor accordingly.
If no new documents are going to be added, set it to 2.
mergeFactor
The mergeFactor roughly determines the number of segments.
The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.
For example, if you set mergeFactor to 10, a new segment will be created on the disk for every 1000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments in each index size.
These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section):
mergeFactor Tradeoffs
High value merge factor (e.g., 25):
Pro: Generally improves indexing speed
Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
Pro: Smaller number of index files, which speeds up searching.
Con: More segment merges slow down indexing.
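
The "base of a number system" description of mergeFactor can be sketched as a small simulation. This is a simplified model of the behavior described above (equal-size segments merged as soon as mergeFactor of them exist), not Lucene's actual merge policy code; the function name is hypothetical.

```python
def segment_counts(flushes, merge_factor):
    """Simulate flushes and return the number of segments at each size level.

    Index 0 holds the smallest segments; each higher index holds segments
    merge_factor times larger than the level below it.
    """
    levels = []
    for _ in range(flushes):
        if not levels:
            levels.append(0)
        levels[0] += 1  # every flush writes one smallest-size segment
        i = 0
        # When a level reaches merge_factor segments, merge them into one
        # segment of the next level, like a carry in base-merge_factor counting.
        while levels[i] == merge_factor:
            levels[i] = 0
            if i + 1 == len(levels):
                levels.append(0)
            levels[i + 1] += 1
            i += 1
    return levels


for factor in (2, 10, 25):
    counts = segment_counts(999, factor)
    print(f"mergeFactor={factor}: {sum(counts)} segments, per level {counts}")
```

In this model no level ever holds more than mergeFactor - 1 segments, and a lower mergeFactor leaves fewer segments overall at the cost of more merge work during indexing.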

solr multicore vs sharding vs 1 big collection

I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing.
The data in the collection is an amalgamation of records from more than 1,000 customers. The number of documents per customer is around 100,000 on average.
That being said, I'm trying to get a handle on the growing number of deleted documents. Because of the growing index size, both disk space and memory are being used up, and I would like to reduce the index to a manageable size.
I have been thinking of splitting the data into multiple cores, one for each customer. This would let me manage the smaller collections easily and create or update a collection quickly. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem?
Solr: 4.9
Index size:25 GB
Max doc: 40 million
Doc count:29 million
Thanks
I had a similar issue, with multiple customers and a large amount of indexed data.
I implemented it with version 3.4 by creating a separate core for each customer,
i.e. one core per customer. Creating a core per customer splits the data much as sharding does.
Here you are splitting the large index into several smaller indexes.
Any search is then carried out against a smaller index, so the response time should be faster.
I have almost 700 cores created as of now, and it's running fine for me.
So far I have not faced any issues with managing the cores.
I would suggest going with a combination of cores and sharding.
It will help you achieve the following:
Each core can have its own configuration and behavior, without any impact on other cores.
You can perform actions like update, load, etc. on each core separately.

DSE SOLR OOMing

We have had a 3-node DSE SOLR cluster running and recently added a new core. After about a week of running fine, all of the SOLR nodes are now OOMing. They fill up both the JVM heap (set at 8 GB) and the system memory. They are also constantly flushing the memtables to disk.
The cluster is DSE 3.2.5 with RF=3
here is the solrconfig from the new core:
http://pastie.org/8973780
How big is your Solr index relative to the amount of system memory available for the OS to cache file system pages? Basically, your Solr index needs to fit in the OS file system cache (the amount of system memory available after DSE is started but has not yet processed any significant amount of data).
Also, how many Solr documents (Cassandra rows) and how many fields (Cassandra columns) are populated on each node? There is no hard limit, but 40 to 100 million is a good guideline as an upper limit - per node.
And, how much system memory and how much JVM heap is available if you restart DSE, but before you start putting load on the server?
For RF=N, where N is the total number of nodes in the cluster or at least the search data center, all of the data will be stored on all nodes, which is okay for smaller datasets, but not okay for larger datasets.
For RF=n, this means that each node will have X/N*n rows or documents, where X is the total number of rows or documents across all column families in the data center. X/N*n is the number that you should try to keep below 100 million. That's not a hard limit: some datasets and hardware might be able to handle substantially more, and some datasets and hardware might not even be able to hold that much. You'll have to discover the number that works best for your own app, but the 40 million to 100 million range is a good start.
In short, the safest estimate is for X/N*n to be kept under 40 million for Solr nodes. 100 million may be fine for some data sets and beefier hardware.
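
The X/N*n estimate is simple enough to check with a few lines of Python. This is only a sketch of the arithmetic described above; the function name and the example totals are made up for illustration.

```python
def docs_per_node(total_docs, nodes, replication_factor):
    """Estimate documents held per node: X / N * n."""
    if replication_factor > nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    return total_docs * replication_factor / nodes


SAFE_LIMIT = 40_000_000    # conservative guideline from the answer
UPPER_LIMIT = 100_000_000  # upper guideline from the answer

# Hypothetical example: 90 million rows on a 3-node cluster with RF=3,
# so every node stores every row.
estimate = docs_per_node(90_000_000, nodes=3, replication_factor=3)
print(f"{estimate:,.0f} docs per node")
print("within safe estimate:", estimate <= SAFE_LIMIT)
print("within upper guideline:", estimate <= UPPER_LIMIT)
```

With RF equal to the node count, as in the 3-node RF=3 cluster from the question, adding the new core increases the per-node count by the full size of that core.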
As far as tuning, one common source of using lots of heap is heavy use of Solr facets and filter queries.
One technique is to use "DocValues" fields for facets since DocValues can be stored off-heap.
Filter queries can be marked as cache=false to save heap memory.
Also, the various Solr caches can be reduced in size or even set to zero. That's in solrconfig.xml.
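
As a sketch of the `cache=false` suggestion, the snippet below prefixes each filter query with Solr's `{!cache=false}` local parameter before building the query string. The field names and helper functions are hypothetical; only the local-parameter syntax comes from Solr itself.

```python
from urllib.parse import urlencode


def uncached_filter(clause):
    """Mark a filter query so Solr does not store its result in the filter cache."""
    return "{!cache=false}" + clause


def build_select_params(query, filters):
    """Build a /select query string in which every fq is marked cache=false."""
    params = [("q", query)]
    params += [("fq", uncached_filter(clause)) for clause in filters]
    return urlencode(params)


# Hypothetical field names, for illustration only.
print(build_select_params("*:*", ["customer_id:42", "created:[NOW-1DAY TO NOW]"]))
```

This is most useful for filters that rarely repeat, such as per-user or time-based ones, since caching them costs heap without producing cache hits.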

Indexing speed performances with and without Solrcloud

I've run 2 performance tests to measure the indexing speed with a collection of 235280 documents:
1st test : 1 solr instance without SolrCloud: indexing speed = 6191 doc/s
2nd test : 4 solr instance (4 shards) linked with SolrCloud : indexing speed = 4506 doc/s
I use 8 CPUs.
So, I have some questions about these results:
Q1: Usually, does the number of Solr instances improve or degrade indexing speed?
Q2: Does SolrCloud degrade indexing speed?
Q3: Why do I get a decrease in performance when I use SolrCloud? Did I miss something (a setting?)?
Edit :
I use a CSV update handler to index my collection.
Based on the performance tests that I carried out, sharding across multiple nodes in a SolrCloud infrastructure improved my indexing performance. Replicating shards across multiple nodes to handle failover did slow down indexing, for obvious reasons. Also consider bulk indexing rather than single updates.
You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.
There are many settings in Solr, as well as hardware specs, that can affect indexing performance. Besides the obvious solution of throwing more machines at it, tuning Solr is more of an art than a science. Here is my experience, so take it with a grain of salt. Generally you should see indexing performance of 6K to 8K documents per second.
Hardware specs: 4 x 40 cores (hyperthreaded) with 256GB of RAM with SSD
I also use updateCSV API for importing documents.
My baseline metric is measured with 1 of those machines (1 shard).
My SolrCloud metric is measured with all 4 of them (4 shards with 1 replica per collection).
For large collection (82GB), I saw 3.68x throughput.
For medium collection (7GB), 2.17x.
For small collection (1.29GB), 1.17x.
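
To put those speedups in perspective, they can be divided by the number of machines to get a scaling efficiency. The sketch below uses only the figures reported above; the function name is my own.

```python
def scaling_efficiency(speedup, nodes):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / nodes


NODES = 4
reported_speedups = {
    "large (82GB)": 3.68,
    "medium (7GB)": 2.17,
    "small (1.29GB)": 1.17,
}

for name, speedup in reported_speedups.items():
    efficiency = scaling_efficiency(speedup, NODES)
    print(f"{name}: {speedup}x on {NODES} nodes = {efficiency:.0%} of linear")

# For comparison, the asker's SolrCloud run relative to the single instance.
print(f"asker's ratio: {4506 / 6191:.2f}x")
```

The efficiencies come out at roughly 92%, 54% and 29%, which shows how quickly the distribution overhead dominates as the collection gets smaller.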
So to answer your question:
Q1: Generally, the more Solr nodes you have per collection, the higher the indexing speed. It might plateau at some point, but indexing performance certainly should not degrade. Maybe your collection is too small to justify the overhead of SolrCloud's horizontal scaling?
Q2: No, SolrCloud should not degrade indexing speed.
Q3: It really depends on how you set it up. I see a performance gain with just the default settings. But here are the things I came across that boosted performance even more:
Don't set commit=true in your updateCSV API call.
You can use more shards per collection than the number of live Solr nodes if system utilization is low.
solr.hdfs.blockcache.slab.count should be set so that the block cache takes between 10 and 20% of available system memory.
autoCommit generally should be 15 seconds.
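
A minimal sketch of the first tip: build a CSV update request that leaves out `commit=true` unless explicitly asked for, so that commits are left to autoCommit. The `/update` path with an `application/csv` content type is an assumption based on Solr's documented CSV update usage (older versions expose `/update/csv` instead); the host, collection name, and function name are hypothetical. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_csv_update_request(base_url, collection, csv_bytes, commit=False):
    """Build (but do not send) a POST request that uploads CSV documents to Solr."""
    url = f"{base_url}/{collection}/update"
    if commit:
        # An explicit commit per request is what the tip above advises against.
        url += "?" + urlencode({"commit": "true"})
    return Request(
        url,
        data=csv_bytes,
        headers={"Content-Type": "application/csv"},
        method="POST",
    )


request = build_csv_update_request(
    "http://localhost:8983/solr",  # hypothetical host
    "mycollection",                # hypothetical collection name
    b"id,title\n1,hello\n",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; batching many rows per request fits the earlier advice to prefer bulk indexing over single updates.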

Can Apache Solr Handle TeraByte Large Data

I have been an Apache Solr user for about a year. I have used Solr for simple search tools, but now I want to use Solr with 5 TB of data. I assume that the 5 TB of data will become 7 TB once Solr indexes it, given the filters that I use. After that, I will add nearly 50 MB of data per hour to the same index.
1- Are there any problems with using a single Solr server with 5 TB of data (without shards)?
a- Can the Solr server answer queries in an acceptable time?
b- What is the expected time for committing 50 MB of data to a 7 TB index?
c- Is there an upper limit on index size?
2- What suggestions can you offer?
a- How many shards should I use?
b- Should I use Solr cores?
c- What commit frequency would you suggest? (Is 1 hour OK?)
3- Are there any test results for this kind of large data?
I don't have 5 TB of data available; I just want to estimate what the result would be.
Note: You can assume that hardware resources are not a problem.
If your sizes refer to text, rather than binary files (whose extracted text is usually much smaller), then I don't think you can expect to do this on a single machine.
This sounds a lot like Loggly, and they use SolrCloud to handle this amount of data.
OK, if they are all rich documents then the total text size to index will be much smaller (for me it's about 7% of the starting size). Anyway, even with that reduced amount, I think you still have too much data for a single instance.
