I have a client program which generates 1-50 million Solr documents and adds them to Solr.
I'm using ConcurrentUpdateSolrServer to push the documents from the client, 1000 documents per request.
The documents are relatively small (a few small text fields).
I want to improve the indexing speed.
I've tried increasing "ramBufferSizeMB" to 1 GB and "mergeFactor" to 25, but didn't see any change in indexing speed.
I was wondering if there are any other recommended settings for improving Solr indexing speed.
Any links to relevant material would be appreciated.
It looks like you are doing a bulk import of data into Solr, so you don't need to search any data right away.
First, you can increase the number of documents per request. Since your documents are small, I would even try increasing it to 100K docs per request or more and see how it goes.
Second, you want to reduce the number of times commits happen when you are bulk indexing. In your solrconfig.xml look for:
<!-- AutoCommit

     Perform a hard commit automatically under certain conditions.
     Instead of enabling autoCommit, consider using "commitWithin"
     when adding documents.

     http://wiki.apache.org/solr/UpdateXmlMessages

     maxDocs - Maximum number of documents to add since the last
               commit before automatically triggering a new commit.

     maxTime - Maximum amount of time in ms that is allowed to pass
               since a document was added before automatically
               triggering a new commit.

     openSearcher - if false, the commit causes recent index changes
                    to be flushed to stable storage, but does not cause a new
                    searcher to be opened to make those changes visible.
-->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
You can disable autoCommit altogether and then call a commit after all your documents are posted. Otherwise you can tweak the numbers as follows:
The default maxTime is 15 seconds, so an auto commit happens every 15 seconds if there are uncommitted docs. You can set this to something large, say 3 hours (i.e. 3*60*60*1000 ms). You can also add <maxDocs>50000000</maxDocs>, which means an auto commit happens only after 50 million documents are added. After you post all your documents, call commit once manually or from SolrJ - it will take a while to commit, but this will be much faster overall.
Also after you are done with your bulk import, reduce maxTime and maxDocs, so that any incremental posts you will do to Solr will get committed much sooner. Or use commitWithin as mentioned in solrconfig.
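For example, here is a rough SolrJ sketch of that bulk approach (SolrJ 4.x-style class names to match ConcurrentUpdateSolrServer; the URL, queue size, thread count and batch size are illustrative values, not tuned recommendations):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // A modest request queue with a few background threads so the client keeps
        // streaming batches while your generator keeps producing documents.
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 10, 4);

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {              // stand-in for your document generator
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "document " + i);
            batch.add(doc);
            if (batch.size() == 100_000) {                 // ~100K docs per add() call
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.blockUntilFinished();                       // wait for queued batches to be flushed
        server.commit();                                   // one hard commit at the very end
        server.shutdown();
    }
}

With autoCommit effectively disabled (or set very high) as described above, the single commit at the end is the only expensive commit in the whole run.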
In addition to what was written above, when using SolrCloud you may want to consider using CloudSolrClient with SolrJ. The CloudSolrClient class is ZooKeeper-aware and can connect directly to the leader of each shard, which speeds up indexing in some cases.
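A minimal sketch, assuming a SolrJ 7.x/8.x CloudSolrClient.Builder API (the ZooKeeper address, collection name and fields are placeholders):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build();
        client.setDefaultCollection("collection1");   // updates are routed to the leader of each shard

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "hello");
        client.add(doc);
        client.commit();
        client.close();
    }
}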
Related
Solr returns old data when faceting, coming from deleted or updated documents.
For example, we are faceting on name, and name changes frequently in our application. When we reindex a document after changing the name, we get both the old name and the new name in the facet results. After digging into this, I learned that Solr indexes are composed of write-once segments, each containing a set of documents. When a hard commit happens these segments are closed, and even if a document is later deleted or updated, the old segment still contains it (just marked as deleted). These documents are not cleaned up immediately; they do not appear in search results, but faceting is somehow still able to see their values.
Optimizing fixes the issue, but we cannot run an optimize every time a customer changes data in production. I tried the options below and they did not work for me.
1) expungeDeletes.
I added the lines below to solrconfig.xml:
<autoCommit>
  <maxTime>30000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- softAutoCommit is like autoCommit except it causes a
     'soft' commit which only ensures that changes are visible
     but does not ensure that data is synced to disk. This is
     faster and more near-realtime friendly than a hard commit.
-->
<autoSoftCommit>
  <maxTime>10000</maxTime>
</autoSoftCommit>

<commit waitSearcher="false" expungeDeletes="true"/>
2) Using TieredMergePolicyFactory might not help, since its merge thresholds may not always be reached and users would still see old data in the meantime.
3) One more option is calling the optimize() method exposed by SolrJ once a day, but I'm not sure what impact that will have on performance (see the sketch below).
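For clarity, here is roughly how I would issue options 1) and 3) from SolrJ (a sketch; the core URL is a placeholder, and it assumes a SolrJ 6+ HttpSolrClient):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class CleanupDeletes {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Option 1): a hard commit with expungeDeletes=true sent from the client,
        // asking Solr to merge segments that are full of deleted documents.
        UpdateRequest req = new UpdateRequest();
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false);
        req.setParam("expungeDeletes", "true");
        req.process(client);

        // Option 3): a full optimize, e.g. triggered by a daily scheduled job.
        client.optimize();

        client.close();
    }
}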
The number of documents we index per server will be at most 2-3 million.
Please suggest if there is any solution to this.
Let me know if more details are needed.
How often do I need to run a full reindex on SolrCloud?
A full reindex takes more than 12 hours and we run it every night, but is it really necessary, given that the delta import runs correctly?
New data comes in at a rate of 2000 documents per delta import, which runs every 30 seconds.
Total index size: 20 GB
Solr: 6.5.2
If the delta runs correctly, there should be no need to run a full reindex at all. The exception might be if you have disabled merging while the index is live; in that case you can end up with a very fragmented index file-wise, and the reindex would rebuild the complete set into a single segment instead. But that isn't usually how Solr is configured, and if it is, it's done for a reason.
So - if your delta is working correctly and you run Solr with fairly standard settings, you can safely skip reindexing unless you're starting over with an empty index (or have a situation where the schema has changed). But be sure that this also includes deletions - a reindex would probably not include deleted elements, so the question then becomes whether your delta import handles deletions as well.
None of our Solr-based services reindex at all - everything is done with live updates and a decent merge factor.
I'm working on adding documents to Solr on the fly. Testing live, both methods (soft and hard commit) take approximately the same time (around 5 seconds), so I decided to use this configuration:
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>86400000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>300000</maxTime>
</autoSoftCommit>
Basically: perform a hard commit when 10000 documents have been added without a commit or one day has passed, and perform a soft commit every 5 minutes. I couldn't find any difference in time or CPU between hard and soft commits - is that right? I have a Solr index of about 1 GiB.
My concern is the memory needed for this: how do I estimate the memory required to hold those 10000 documents, or does Solr not use any memory to hold them while waiting for the commit?
And how does Solr count maxTime - from the first uncommitted document added or from the last? My tests on the server suggest it is from the first, but that doesn't seem to make sense, right?
Solr doesn't hold documents in memory after a soft commit. One major difference between a soft commit and a hard commit is that a soft commit is much faster, since it only makes index changes visible and does not fsync index files or write a new index descriptor.
You can read more about soft commit and hard commit behavior here:
http://www.opensourceconnections.com/2013/04/25/understanding-solr-soft-commits-and-data-durability/
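Both kinds of commit can also be triggered explicitly from SolrJ if you want to compare them yourself; a minimal sketch (the URL is a placeholder):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CommitKinds {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();

        // Hard commit: softCommit=false. Flushes and fsyncs index files
        // and writes a new index descriptor.
        client.commit(true, true, false);

        // Soft commit: softCommit=true. Only makes recent changes visible
        // to new searchers; nothing is fsynced to disk.
        client.commit(true, true, true);

        client.close();
    }
}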
I send a query to Solr every 10 minutes (fig. 1), and you can see that many of the "invoke time" values are greater than 1000 ms (the "len" field indicates the result length returned by Solr; each window corresponds to a server).
However, if I change the query frequency to once every 10 seconds (fig. 2), nearly all "invoke time" values drop to around 10 ms. Am I missing any configuration in solrconfig? (Solr version 3.6, all settings left at their default values.)
Check the Solr caching configuration in the solrconfig.xml file.
Each cache has a maximum size. When the maximum size is exceeded, the oldest and least recently used entries are evicted to make room for new ones.
Most likely, when you fire the query every 10 seconds the result is still in the cache, so Solr returns it within 10 ms.
However, when you fire the query every 10 minutes the cache has probably evicted the entry in the meantime (or it has been invalidated, for example by a commit opening a new searcher), and Solr has to recompute the result.
Check the cache statistics on the Solr admin page and tune the cache settings accordingly.
We have millions of documents in MongoDB that we are looking to index in Solr. Obviously, the first time we do this we need to index all the documents.
But after that, we should only need to index documents as they change. What is the best way to do this? Should we call addDocument and then call commit() from cron? What do addDocument, commit, and optimize each do? (I am using Apache_Solr_Service.)
If you're using Solr 3.x you can forget about optimize, which merges all segments into one big segment. Commit makes changes visible to new IndexReaders; it's expensive, so I wouldn't call it for each document you add. Instead of calling it from cron, I'd use the autocommit in solrconfig.xml. You can tune the value depending on how long you can wait for new documents to become searchable.
The document won't actually be added to the index until you do commit() - it could be rolled back. optimize() will (ostensibly; I've not had particularly good luck with it) reduce the size of the index (documents that have been deleted still take up room unless the index is optimized).
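For illustration, a minimal sketch of that add-then-commit flow, shown with SolrJ rather than the PHP Apache_Solr_Service client (the URL and field names are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddThenCommit {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "mongo-doc-123");
        doc.addField("title", "changed document");
        client.add(doc);    // buffered in the index writer, not yet visible to searches

        client.commit();    // now the change becomes visible (or rely on autocommit instead)

        client.close();
    }
}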
If you set autocommit for your index, then you can be sure that any documents added via an update have been committed once the autocommit interval has passed. I have used a 5-minute interval and it works fine even when a few thousand updates happen within those 5 minutes. After a full reindex is complete, I wait 5 minutes and then tell people that it is done. In fact, when people ask how quickly updates get into the index, I tell them that we poll for changes every minute, but that there are variables (such as a sudden big batch) and it is best not to expect things to be updated for 5 or 6 minutes. So far, nobody has really claimed a business need to have it update faster than that.
This is with a 350,000-record index totalling roughly 10 GB in RAM.