Solr indexing slowdown

I'm working on a product which indexes a high volume of small documents.
When Solr starts, it indexes at a rate of 35k docs/sec for around 20 minutes, then slows down to 24k/sec.
If I restart the server, it will again index at 35k/sec for 20 minutes and then slow down again.
I have a softCommit every 5 seconds and a hard commit every minute.
I was wondering if someone might have some insight about this?
I don't think it is related to merges, since I see merge threads kicking in after 2-3 minutes.

You should check the usual suspects:
There is a problem with the Java (or whatever language you are using) application you're using to index. If that's the case, please specify the implementation details and I will provide more guidance;
Your NRT cache fills up after 20 minutes and the hard commit doesn't happen quickly enough. To check this option, set the maximum number of documents to buffer before the cached docs are written to disk, like so: <autoCommit><maxDocs>10000</maxDocs></autoCommit>. If this turns out to be the issue, you can then tune the autocommit settings or the NRT cache management.
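If the bottleneck turns out to be on the client side, batching documents instead of sending one request per document usually helps. A minimal sketch of the idea, with `send_batch` standing in for whatever your real indexing client does (e.g. a bulk add or an /update POST); the class and batch size are illustrative assumptions, not part of any Solr API:

```python
# Sketch of client-side batching: accumulate documents and flush them in
# fixed-size batches instead of one request per document.

class BatchingIndexer:
    def __init__(self, send_batch, batch_size=1000):
        self.send_batch = send_batch  # placeholder for the real indexing call
        self.batch_size = batch_size
        self.buffer = []

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []

# Usage: 2500 docs end up as 3 batches (1000, 1000, 500) instead of
# 2500 individual requests.
batches = []
indexer = BatchingIndexer(batches.append, batch_size=1000)
for i in range(2500):
    indexer.add({"id": i})
indexer.flush()
```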

Related

Solr PERFORMANCE WARNING: Overlapping onDeckSearchers

We've been having a number of problems with our Solr search engine in our test environments. We have a SolrCloud setup on version 4.6: single shard, 4 nodes. The CPU flat-lines at 100% on the leader node for several hours, then the server starts to throw OutOfMemory errors, 'PERFORMANCE WARNING: Overlapping onDeckSearchers' starts appearing in the logs, the leader enters recovery mode, the filter cache and query cache warmup times hit around 60 seconds (normally less than 2 seconds), the leader node goes down, and we suffer an outage for the whole cluster for a few minutes while it recovers and elects a new leader. We think we're hitting a number of Solr bugs in the 4.6 and 4.x branch, and so are looking to move to 5.3.
We also recently dropped our soft commit time from 10 minutes to 2 minutes. I am seeing regular CPU spikes every 2 minutes on all nodes, but the spikes are low, from 20-50% (max 100), on a 2-minute cycle. When the CPU is maxed out, obviously I can't see those spikes. Hard commits are every 15 seconds, with openSearcher set to false. We have a heavy query and index load scenario.
I am wondering whether the frequent soft commits are having a significant effect on this issue, or whether the long autowarm times on the caches are caused by the other issues we are experiencing (cause or symptom)? We recently increased the indexing load on the server, but we need to address these issues in the test environment before we can promote to production.
Cache settings:
<filterCache class="solr.FastLRUCache"
size="5000"
initialSize="5000"
autowarmCount="1000"/>
<queryResultCache class="solr.LRUCache"
size="20000"
initialSize="20000"
autowarmCount="5000"/>
We had this problem with Solr 4.10 (and, very rarely, 5.1). In our case, we were indexing quite frequently and commits were starting to become too close together. Sometimes our optimize command would run a bit longer than expected.
We solved it by making sure no indexing or commits occurred for at least ten minutes after the optimize operation started. We also auto warmed fewer queries for our caches. The following links will probably be useful to you if you haven't found them already:
Overlapping onDeckSearchers - Solr mailing list
The Solr Wiki
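The "overlapping onDeckSearchers" condition follows from simple arithmetic: if autowarming a new searcher takes longer than the interval between searcher-opening commits, warming searchers pile up. A rough back-of-the-envelope check, with all numbers purely illustrative:

```python
# If warmup time exceeds the commit interval, new searchers are opened
# faster than old ones finish warming, and they overlap on deck.

def searchers_warming(warmup_s, commit_interval_s):
    """Approximate number of searchers warming concurrently (ceiling division)."""
    return max(1, -(-warmup_s // commit_interval_s))

# Healthy: 2 s warmup against a 120 s soft commit interval.
assert searchers_warming(2, 120) == 1

# A 60 s warmup against the same interval is still one at a time,
# but once warmup blows past the interval, searchers overlap.
assert searchers_warming(60, 120) == 1
assert searchers_warming(300, 120) == 3  # overlapping onDeckSearchers
```

This is why reducing `autowarmCount` (less warmup work) or lengthening the soft commit interval both address the warning.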

Best strategy to index document with Solr

I'm using Solr version 4 (the Spring Data Solr API to index and fetch documents) and I have to decide which strategy to apply for indexing my documents.
I hesitate between 2 strategies:
Launch a batch periodically to index all documents
Only index a document when it has changed
Which strategy is best? Maybe a mix? Or another?
I have some ideas about the pros and cons of each, but I don't have much experience with Solr.
It depends on how long indexing all your documents takes and how soon you want your index to be updated.
We have several Solr cores - some have less than 100K very small docs and a full import via data import handler (with optimize=true) runs under 1 minute. We can tolerate delays of up to 15 minutes for them, so we run a full import for this core every 15 min.
Then there are cores at the other extreme with several million docs, each of fairly large size, and full indexing will take several hours to complete. For such cores, we have a changelog table in MySQL which only records the docs that changed and we do an incremental indexing only for those docs every few min.
Finally, there are cores that are in the middle, having about 500K docs of decent size, but on these we need atomic updates every 5 to 10 min for certain fields and full document update for certain docs every few min as well. We run delta imports for these. Full index itself takes about 1.5 to 2 hours to run, which we do nightly.
So the answer to your question really depends on what your requirements are.
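The changelog approach from the middle case can be sketched in a few lines; here `changelog` stands in for the MySQL table of changed docs and `reindex` for whatever pushes them to Solr (both are illustrative assumptions):

```python
# Sketch of changelog-driven incremental indexing: only docs recorded as
# changed since the last run are re-sent to Solr.

def incremental_index(changelog, last_run, reindex):
    """Reindex docs changed after `last_run`; return the new watermark."""
    changed = [doc_id for doc_id, ts in changelog if ts > last_run]
    if changed:
        reindex(changed)
    return max((ts for _, ts in changelog), default=last_run)

# Usage with a toy changelog of (doc_id, timestamp) rows.
changelog = [("a", 100), ("b", 205), ("c", 210)]
sent = []
watermark = incremental_index(changelog, last_run=200, reindex=sent.extend)
# Only "b" and "c" changed after t=200; the next run starts from t=210.
```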

SOLR autoCommit vs autoSoftCommit

I'm very confused about autoCommit and autoSoftCommit. Here is what I understand:
autoSoftCommit - after an autoSoftCommit, if the SOLR server goes down, the autoSoftCommit documents will be lost.
autoCommit - does a hard commit to the disk, making sure all the autoSoftCommit commits are written to disk, and commits any other pending documents.
My following configuration seems to only work with autoSoftCommit. autoCommit on its own does not seem to be doing any commits. Is there something I am missing?
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<autoSoftCommit>
<maxDocs>1000</maxDocs>
<maxTime>1200000</maxTime>
</autoSoftCommit>
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>120000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
</updateHandler>
Why isn't autoCommit working on its own?
I think this article will be useful to you. It explains in detail how hard commits and soft commits work, and the tradeoffs that should be taken into account when tuning your system.
I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?
We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval, see the bulk indexing bit below.
These are places to start.
HEAVY (BULK) INDEXING
The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.
Set your soft commit interval quite long. As in 10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn't about near real time searching, so don't do the extra work of opening any kind of searcher.
Set your hard commit intervals to 15 seconds, openSearcher=false. Again the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first.
Only after you’ve tried the simple things should you consider refinements, they’re usually only required in unusual circumstances. But they include:
Turning off the tlog completely for the bulk-load operation
Indexing offline with some kind of map-reduce process
Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic, if the node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
etc.
INDEX-HEAVY, QUERY-LIGHT
By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.
Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer. Maybe even hours with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand.
Set your hard commit to 15 seconds, openSearcher=false
INDEX-LIGHT, QUERY-LIGHT OR HEAVY
This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update
Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
INDEX-HEAVY, QUERY-HEAVY
This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start
Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.
Set your hard commit interval to 15 seconds.
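Translated into solrconfig.xml, the bulk-indexing recommendation above (long soft commit, 15-second hard commit with openSearcher=false) would look roughly like this; the values are the article's examples, not universal settings:

```xml
<autoCommit>
  <maxTime>15000</maxTime>           <!-- hard commit every 15 s: durability, bounded tlog replay -->
  <openSearcher>false</openSearcher> <!-- do not open a new searcher on hard commit -->
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>          <!-- visibility only every 10 min during bulk loads -->
</autoSoftCommit>
```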
In my case (index heavy, query heavy), master-slave replication was taking too long, slowing down the queries to the slave. By increasing the softCommit interval to 15 min and the hardCommit interval to 1 min, the performance improvement was great. Now replication works with no problems, and the servers can handle many more requests per second.
This is my use case though; I realized I don't really need the items to be available on the master in real time, since the master is only used for indexing items, and new items are available on the slaves every replication cycle (5 min), which is totally OK for my case. You should tune these parameters for your own case.
You have openSearcher=false for hard commits. Which means that even though the commit happened, the searcher has not been restarted and cannot see the changes. Try changing that setting and you will not need soft commit.
SoftCommit does reopen the searcher. So if you have both sections, soft commit shows new changes (even if they are not hard-committed) and - as configured - hard commit saves them to disk, but does not change visibility.
This allows you to set the soft commit to 1 second, so documents show up quickly, while hard commits happen less frequently.
Soft commits are about visibility.
Hard commits are about durability.
Optimize is about performance.
Soft commits are very fast; their changes are visible, but they are not persisted (they are only in memory), so during a crash these changes might be lost.
Hard commit changes are persisted to disk.
Optimize is like a hard commit, but it also merges Solr index segments into a single segment to improve performance. It is very costly.
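The visibility/durability split can be made concrete with a toy timeline: with soft commits every second and hard commits every 15 seconds, a document becomes searchable at the next soft commit and durable at the next hard commit. A sketch under those assumptions (the intervals are illustrative, not recommendations):

```python
# Toy model: when does a doc added at time t become visible (next soft
# commit) and durable (next hard commit), for commits at 0, i, 2i, ...?

def next_commit(t, interval):
    """Time of the first commit at or after t."""
    return ((t + interval - 1) // interval) * interval

def doc_lifecycle(added_at, soft_every=1, hard_every=15):
    return {
        "visible_at": next_commit(added_at, soft_every),
        "durable_at": next_commit(added_at, hard_every),
    }

# A doc added at t=7 s is searchable almost immediately (soft commits every
# second) but not crash-safe until the hard commit at t=15 s.
life = doc_lifecycle(7)
```

A crash between `visible_at` and `durable_at` is exactly the window in which soft-committed documents can be lost.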
A commit (hard commit) operation makes index changes visible to new search requests. A hard commit uses the transaction log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have been flushed to stable storage and no data loss will result from a power failure.
A soft commit is much faster since it only makes index changes visible and does not fsync index files or write a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less expensive" in terms of time, but not free, since it can slow throughput.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately.
Auto commit properties can be managed from the solrconfig.xml file.
<autoCommit>
<maxTime>1000</maxTime>
</autoCommit>
<!-- SoftAutoCommit
Perform a 'soft' commit automatically under certain conditions.
This commit avoids ensuring that data is synched to disk.
maxDocs - Maximum number of documents to add since the last
soft commit before automatically triggering a new soft commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new soft commit.
-->
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
References:
https://wiki.apache.org/solr/SolrConfigXml
https://lucene.apache.org/solr/guide/6_6/index.html

Solr appears to block update requests while committing

We're running a master-slave setup with Solr 3.6 using the following auto-commit options:
maxDocs: 500000
maxTime: 600000
We have approx 5 million documents in our index which takes up approx 550GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15GB). We don't have a particularly high write throughput - about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25mins during which time we're unable to add any new documents to Solr.
One of the answers in the following question suggests using 2 cores and swapping them once one is fully updated. However this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
As per the bounty reason and the problem mentioned in the question, here is a solution from Solr:
Solr has a capability called SolrCloud, beginning with the 4.x versions of Solr. Instead of the previous master/slave architecture there are leaders and replicas. Leaders are responsible for indexing documents and replicas answer queries. The system is managed by ZooKeeper. If a leader goes down, one of its replicas is selected as the new leader.
All in all, if you want to divide your indexing process, that happens automatically with SolrCloud, because there is one leader per shard and it is responsible for indexing that shard's documents. When you send a query into the system there will be some Solr nodes (of course, provided there are more Solr nodes than shards) that are not responsible for indexing but are ready to answer the query. When you add more replicas, you will get faster query results (but it will cause more inbound network traffic when indexing, etc.)
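The division of indexing work falls out of document routing: each document maps to exactly one shard, whose leader indexes it. A simplified sketch of hash-based routing (real SolrCloud partitions a hash ring over the document id; this only illustrates the partitioning idea):

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Map a document id to one of num_shards partitions by hashing.

    Deterministic: the same id always routes to the same shard, so each
    shard's leader indexes a disjoint subset of the documents.
    """
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_shards

# Each doc lands on exactly one of 4 shards; replicas of the other shards
# stay free to serve queries while that shard's leader indexes it.
assignments = {d: shard_for(d, 4) for d in ["doc1", "doc2", "doc3"]}
```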
For those facing a similar problem: the cause of my problem was that I had too many fields in the document. I used automatic fields (*_t), and the number of fields grows pretty fast; when it reaches a certain number, it just hogs Solr and a commit takes forever.
Secondly, I did some profiling, and it turned out most of the time was consumed by String.intern() calls; the number of fields in the document seems to matter, and as it goes up, String.intern() seems to get slower.
The Solr 4 source appears to no longer use String.intern(). But a large number of fields still kills performance quite easily.
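The field explosion from `*_t` dynamic fields is easy to measure before it bites: count distinct field names across a sample of documents. A sketch with a made-up document sample (the field names are hypothetical):

```python
# Each distinct "*_t" key becomes its own Lucene field; with user- or
# item-generated keys the total grows without bound. Counting distinct
# names per batch gives early warning.

def distinct_fields(docs):
    names = set()
    for doc in docs:
        names.update(doc.keys())
    return names

docs = [
    {"id": 1, "title_t": "a", "color_t": "red"},
    {"id": 2, "title_t": "b", "weight_t": "3kg"},  # a new field appears
    {"id": 3, "sku_12345_t": "x"},                 # per-item field: unbounded
]
fields = distinct_fields(docs)
# Already 5 distinct field names across only 3 docs.
```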

Sunspot with Solr 3.5. Manually updating indexes for real time search

I'm working with Rails 3 and Sunspot Solr 3.5. My application uses Solr to index user-generated content and make it searchable by other users. The goal is to let users search this data as soon as possible after a user uploads it. I don't know if this qualifies as real-time search.
My application has two models
Posts
PostItems
I index posts by including data from post items, so that when a user searches based on a certain description provided in a post_item record, the corresponding post object is made available in the search.
Users frequently update post_items so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item will be available during search.
So at the moment whenever I receive a new post_item object I run
post_item.post.solr_index!
which, according to this documentation, instantly updates the index and commits. This works, but is this the right way to handle indexing in this scenario? I read here that calling index while searching may break Solr. Also, frequent manual index calls are not the way to go.
Any suggestions on the right way to do this? Are there alternatives other than switching to ElasticSearch?
Try this gem: https://github.com/bdurand/sunspot_index_queue
You will then be able to batch reindex, let's say, every minute, and it definitely will not break the index.
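The gem's approach can be approximated in plain code: enqueue post ids as post_items change, then periodically reindex the deduplicated set in one batch instead of committing per change. A language-neutral sketch in Python (in Rails this would live in a background job; `reindex_batch` stands in for a batched Sunspot index call):

```python
# Sketch of queued reindexing: collect changed post ids, flush them as one
# deduplicated batch per cycle instead of committing on every change.

class ReindexQueue:
    def __init__(self, reindex_batch):
        self.pending = set()              # dedupes repeated updates
        self.reindex_batch = reindex_batch

    def enqueue(self, post_id):
        self.pending.add(post_id)

    def flush(self):
        """Run once per cycle (e.g. every minute) from a background worker."""
        batch = sorted(self.pending)
        self.pending.clear()
        if batch:
            self.reindex_batch(batch)
        return batch

# Five updates to three posts collapse into one batch of three ids.
flushed = []
q = ReindexQueue(flushed.extend)
for pid in [1, 2, 1, 3, 2]:
    q.enqueue(pid)
batch = q.flush()
```

The deduplication matters: a post whose items change five times between cycles is reindexed once, not five times.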
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. Our conclusion was that Solr was built/optimized for indexing huge documents (Word/PDF content) in large numbers (billions?), with the index updated once a day or every couple of days, when nobody is searching.
It was the wrong choice for a consumer Rails application where documents are small and numerous (in the millions), updates are random and continuous, and the search needs to be somewhat real time (a delay of 5-10 sec is fine).
Some of the tricks we applied to tune the server:
removed all commits (i.e., !) from the Rails code,
used Solr auto-commit every 5/20 seconds,
ran a master/slave configuration,
ran index optimization (on the master) every hour,
and more.
And we still see high CPU usage on the slaves when the commit triggers. As a result some searches take a long time (> 60 seconds at times).
I also doubt that the batching sunspot_index_queue gem can remedy the high CPU issue.
