I deployed an instance of Solr onto an Ubuntu machine with Tomcat. I then have a single-threaded client program that reads and injects data into Solr. I am observing memory and CPU usage and realized that I still have a lot of resources (in terms of memory and CPUs) to spare. I wonder if I should change my indexing code to multi-threading to inject into Solr? Indexing 20 million documents with the current single-threaded program takes about 14 hours, which is why I wonder whether I should switch to multi-threading. Thanks in advance for your suggestions and help! :)
Multi-threading while indexing in Solr is widely used.
It is not entirely clear from what you say whether you can also multi-thread the reading from your source, but I think that is the way to go.
I suggest you try it, but first analyze your code and see which part is the slowest, and include that part in the multi-threading.
Also keep an eye on your commit strategy.
From the Solr documentation: (http://wiki.apache.org/solr/SolrPerformanceFactors)
"In general, adding many documents per update request is faster than one per update request. ...
Reducing the frequency of automatic commits or disabling them entirely may speed indexing. Beware that this can lead to increased memory usage, which can cause performance issues of its own, such as excessive swapping or garbage collection."
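Here is a minimal sketch of what that could look like with SolrJ, assuming a ConcurrentUpdateSolrClient (older 4.x releases call it ConcurrentUpdateSolrServer); the URL, field names, queue size, thread count and batch size below are placeholders you would have to adapt:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;
import java.util.ArrayList;
import java.util.List;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // Queue up to 10k docs and drain them with 4 background threads.
        ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                "http://localhost:8983/solr/mycollection")
                .withQueueSize(10_000)
                .withThreadCount(4)
                .build();

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 20_000_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_s", "document " + i);   // placeholder field
            batch.add(doc);

            // Many documents per update request, not one request per document.
            if (batch.size() == 1_000) {
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }

        solr.commit();   // commit once at the end instead of auto-committing constantly
        solr.close();
    }
}

The part that produces the documents is usually worth parallelising as well, since the client above already pushes updates from several threads.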
I can use only one server to run both my application and my Solr server. I was wondering whether, performance- and availability-wise, it makes sense to deploy several nodes of SolrCloud and ZooKeeper on this machine (e.g. using VMs or Docker). Since I will be vulnerable to hardware failure either way, my main concerns are protection against software failure and performance.
So, will adding a few nodes (3 maybe?) help me get a Solr server with higher availability or better performance? Or will it have the opposite effect?
Using multiple JVMs on one piece of hardware isn't generally going to help much.
As you've mentioned, using many JVMs on one machine doesn't reduce your vulnerability to hardware failure, and it adds a lot of cognitive complexity: having three replicas no longer means two of them can fail, unless you're extra careful about where you put each of the three.
For most purposes, just using additional shards in a single JVM/Solr instance is simpler, and accomplishes the same performance goal of keeping your index size per core down to manageable levels. This is a central feature of SolrCloud.
The only exception to this I'm aware of is if you're dealing with an index or usage pattern that requires a very large JVM heap. A very large JVM heap can lead to high max GC pause times, and GC tuning can only help so much. In this case, using multiple JVMs, with a single replica/shard per JVM, can constrain the worst-case GC pause to that required for a single replica.
You also mention Zookeeper, so it's worth noting that ZK is a somewhat different beast. You should probably host ZK separately, you should always use an odd number of ZK nodes, and never more than one per physical host.
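For what it's worth, here is a rough sketch of the "additional shards in a single Solr instance" approach described above, using SolrJ's collections API; the ZooKeeper address, collection name and config set are placeholders, and this assumes the node is already running in SolrCloud mode:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import java.util.Collections;
import java.util.Optional;

public class CreateShardedCollection {
    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
            // 3 shards, replicationFactor 1: the index is split three ways,
            // but everything still lives in the single Solr JVM on this host.
            CollectionAdminRequest
                    .createCollection("mycollection", "_default", 3, 1)
                    .process(client);
        }
    }
}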
I have a particular use case for multiple in-memory key-value maps that need very fast lookup times. They are set just once a day, so they can be considered immutable for all practical purposes. Redis is not an option since it gets CPU-throttled when multiple threads access it, and multi-instance Redis takes up too much memory because of data replication. The important thing to consider here is that the read rate is very high in bursts: around 10 million requests in bursts from around 40-50 workers simultaneously.
I was thinking of creating a simple client-server architecture with multiple readers connecting to a server to read from shared-memory maps. However, I wonder if such an architecture already exists and has been tested extensively for this use case, in which case I should not be reinventing the wheel.
So, to sum up, what is my best alternative? TIA.
Might not be suitable for you but you could try RBLDNSD and store your values in DNS. It's high performance and results will be cached, and it's easy to read the values from pretty much any programming environment. To write values to it you'll need to write directly to its zone files, but the format is simple and easy to write.
You don't mention the size of your maps, but given that performance is so critical, it sounds like you may want to consider keeping copies of your 'multiple in memory key value maps' with each worker.
You could then implement a simple mechanism to notify each worker that it's time to refresh their maps (e.g. Redis PUBLISH, or any other pubsub type framework).
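A sketch of that "copy per worker plus pub/sub refresh" idea, assuming Jedis is available and Redis is only used for the notification (the map itself stays on-heap in each worker); the channel name and loadMapsFromSource() are placeholders:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class WorkerCache {
    // Readers always see a complete, immutable snapshot; a refresh swaps the reference.
    private final AtomicReference<Map<String, String>> snapshot =
            new AtomicReference<>(new ConcurrentHashMap<>());

    public String lookup(String key) {
        return snapshot.get().get(key);
    }

    public void listenForRefresh() {
        new Thread(() -> {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        // Rebuild the local map once a day when the publisher says so.
                        snapshot.set(loadMapsFromSource());
                    }
                }, "map-refresh");
            }
        }, "map-refresh-listener").start();
    }

    private Map<String, String> loadMapsFromSource() {
        // Placeholder: load the day's key/value data from wherever it is produced.
        return new ConcurrentHashMap<>();
    }
}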
At the risk of running afoul of the Stack Overflow self-promotion police :-) eXtremeDB might be a consideration. It's not schema-less, but your schema can simply define a key-value pair. It supports MVCC (optimistic, non-blocking) concurrency, so even the relatively infrequent writes won't get in the way of readers, and you'll be able to utilize all the CPU cores.
I am trying to index a large amount of data in Solr/Lucene. Since it is a legacy system, and for some other reasons, I have to do it via a C++ layer. Before doing that I wanted to optimize the process, so I searched around and found the following:
Indexing in batches: this helps in the scenario where indexing fails partway through, so I can restart from the remaining batches.
Buffer lookup
Indexer concurrency
I found the last two terms while looking into different issues, but I am unable to fully understand them.
It would be great if anyone could help me understand these two, as well as any other issues that may arise.
I'm not sure what you mean by "Buffer Lookup" - usually this refers to giving a server a decent in-memory cache, so that as many queries as possible can be answered without having to recalculate, for each query, which documents match and how result sets intersect. For Solr this is configured using the different *cache settings. The requirements will be different for most applications, depending on query load, field definitions, etc. Performing a commit (making documents visible in the index) usually expires caches, as the cached entries might no longer be valid.
Indexer Concurrency allows a server to insert documents into the actual index from many threads at the same time, without locking between the threads. Lucene made concurrent indexing possible back in 2011 (for Lucene 4.0), and allows faster and more efficient updates of the index. Whether this matters depends on your application.
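On the batching point from the question, here is a rough sketch of "commit a checkpoint after every successful batch so a failed run can resume"; SolrJ/Java is used purely for illustration, since the same pattern applies to whatever client the C++ layer ends up calling, and readBatch() and saveCheckpoint() are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import java.util.Collections;
import java.util.List;

public class ResumableIndexer {
    public static void index(HttpSolrClient solr, long startBatch, long totalBatches) throws Exception {
        for (long batchNo = startBatch; batchNo < totalBatches; batchNo++) {
            List<SolrInputDocument> docs = readBatch(batchNo);
            solr.add(docs);
            solr.commit();               // make the batch durable and visible
            saveCheckpoint(batchNo + 1); // persist where a restarted run should resume from
        }
    }

    // Placeholder: read the batchNo-th slice of the legacy data.
    private static List<SolrInputDocument> readBatch(long batchNo) { return Collections.emptyList(); }

    // Placeholder: write the checkpoint somewhere durable (file, DB row, ...).
    private static void saveCheckpoint(long batchNo) { }
}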
We are looking for ways to improve query/index performance in Solr. We are not concerned about storage requirements; we have a lot of storage space.
Essentially we want to speed up solr query/indexing by throwing more storage at the solr index.
I have already reviewed http://wiki.apache.org/solr/SolrPerformanceFactors. But it doesn't cover this particular scenario.
P.S. You can tell me that this is a stupid question, and I won't mind :)
Indexing side:
You could potentially increase indexing speed by using a very high mergeFactor. That way you end up with very many Lucene index segments that merge only rarely (merging is what takes a lot of the time). Then, by the time you are done indexing, set the mergeFactor back to something sensible, like 10. This will make sure you only have a few segments to read through.
Also consider SSDs, some folks have reported better perf with these.
Querying side:
It is nearly impossible to recommend anything without knowing your particular setup and use case, but do monitor the cache usage. If a particular cache has a high utilization rate yet still shows evictions, get rid of the evictions by giving it more RAM.
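If you want to watch those numbers without the admin UI, here is a small sketch that pulls the cache MBean stats Solr exposes on its admin handler; the core name is a placeholder and the exact path/params can vary between versions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class CacheStats {
    public static void main(String[] args) throws Exception {
        URL url = new URL(
            "http://localhost:8983/solr/mycore/admin/mbeans?stats=true&cat=CACHE&wt=json");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // look at hitratio, evictions and size per cache
            }
        }
    }
}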
My Solr 4 instance is slow and I don't know why.
I am attempting to modify the configurations of JVM, Tomcat6 and Solr 4 in order
to optimize performance, with queries per second as the key metric.
Currently I am running on an EC2 small tier with Debian squeeze, but ready to switch to Ubuntu if needed.
There is nothing special about my use case. The index is small. Queries do include a moderate number of unions (e.g. 10), plus faceting, but I don't think that's unusual.
My understanding is that these areas could need tweaking:
Configuring the JVM Garbage collection schedule and memory allocation ("GC tuning is a precise art form", ref)
Other JVM settings
Solr's Query Result cache, Filter cache, Document cache settings
Solr's Auto-warming settings
There are a number of ways to monitor the performance of Solr:
SolrMeter
Sematext SPM
New Relic
But none of these methods indicate which settings need to be adjusted, and there's no guide that I know of that steps through an exhaustive list of settings that could possibly improve performance. I've reviewed the following pages (one, two, three, four), and gone through some rounds of trial and error so far without improvement.
Questions:
How to tell JVM to use all the 2 GB memory on the small EC2 instance?
How to debug and optimize JVM Garbage Collection?
How do I know when I/O throttling, such as the new EBS IOPS pricing, is the issue?
Using figures like the NewRelic examples below, how to detect what is problematic behavior, and how to approach solutions.
Answers:
I'm looking for link to good documentation for setting up and optimizing Solr 4, from a DevOps or server admin perspective (not index or application design).
I'm looking for the top trouble spots in catalina.sh, solrconfig.xml, solr.xml (other?) that are most likely causes of problems.
Or any tips you think address the questions.
First, you should not focus on switching your Linux distribution. A different distribution might bring some changes, but considering the information you gave, nothing proves that these changes would be significant.
You are mentioning lots of possibilities for your optimisations, which can be overwhelming. You should consider a tweaking area only once you have proven that the problem lies in that particular part of your stack.
JVM Heap Sizing
You can use the parameter -Xmx1700m to give a maximum of 1.7GB of RAM to the JVM. HotSpot might not need it, so don't be surprised if your heap capacity does not reach that number.
You should set the minimum heap size to a low value, so that HotSpot can optimise its memory usage. For instance, to set a minimal heap size of 128MB, use -Xms128m.
Garbage Collector
From what you say, you have limited hardware (1-core at 1.2GHz max, see this page)
M1 Small Instance
1.7 GiB memory
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit)
...
One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor
Therefore, using that low-latency GC (CMS) won't do any good. It won't be able to run concurrently with your application since you have only one core. You should switch to the Throughput GC using -XX:+UseParallelGC -XX:+UseParallelOldGC.
Is the GC really a problem?
To answer that question, you need to turn on GC logging. It is the only way to see whether GC pauses are responsible for your application response time. You should turn these on with -Xloggc:gc.log -XX:+PrintGCDetails.
But I don't think the problem lies here.
Is it a hardware problem?
To answer this question, you need to monitor resource utilization (disk I/O, network I/O, memory usage, CPU usage). You have a lot of tools to do that, including top, free, vmstat, iostat, mpstat, ifstat, ...
If you find that some of these resources are saturating, then you need a bigger EC2 instance.
Is it a software problem?
In your stats, the document cache hit rate and the filter cache hit rate are healthy. However, I think the query result cache hit rate is pretty low. This implies a lot of query operations.
You should monitor the query execution time. Depending on that value you may want to increase the cache size or tune the queries so that they take less time.
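A quick way to watch query execution time from a client, assuming SolrJ; the core name and query below are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QueryTiming {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {
            QueryResponse rsp = solr.query(new SolrQuery("title_s:document"));
            System.out.println("QTime (ms):   " + rsp.getQTime());        // time spent inside Solr
            System.out.println("Elapsed (ms): " + rsp.getElapsedTime());  // includes network and parsing
        }
    }
}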
More links
JVM options reference: http://jvm-options.tech.xebia.fr/
A write-up I did on an application performance audit: http://www.pingtimeout.fr/2013/03/petclinic-performance-tuning-about.html
Hope that helps!