How much memory does Apache Solr use before a commit? - solr

I'm working on adding documents to Solr on the fly. Testing on the live system, both methods (soft and hard commit) take approximately the same time (around 5 seconds), so I decided to use this configuration:
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>86400000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>300000</maxTime>
</autoSoftCommit>
Basically: perform a hard commit once 10,000 documents have been added without a commit or one day has passed, and perform a soft commit every 5 minutes. I couldn't find any difference in time or CPU between hard and soft commits - is that right? I have a Solr index of about 1 GiB.
My concern is the memory needed to do this. How do I estimate the memory required for those 10,000 pending documents, or does Solr not use any memory to hold them while waiting for the commit?
And how does Solr count maxTime - from the first document added or from the last? My tests on the server suggest it is from the first, but that doesn't make sense, does it?

Solr doesn't hold the documents in memory after a soft commit. One major difference between a soft commit and a hard commit is that a soft commit is much faster, since it only makes index changes visible and does not fsync index files or write a new index descriptor.
You can also read more about soft commit and hard commit behavior here:
http://www.opensourceconnections.com/2013/04/25/understanding-solr-soft-commits-and-data-durability/
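For illustration, here is a minimal SolrJ sketch of issuing the two kinds of commit explicitly; the URL, the core name "mycore" and the field names are assumptions rather than anything from the question, and with the autoCommit/autoSoftCommit settings above Solr would normally do this server-side anyway.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name and base URL, for illustration only.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title_s", "example");
            solr.add(doc);

            // Soft commit: makes the change searchable but does not fsync index files.
            solr.commit(true, true, true);   // waitFlush, waitSearcher, softCommit

            // Hard commit: flushes segments to stable storage for durability.
            solr.commit(true, true, false);
        }
    }
}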

Related

Frequency of Full reindex on SolrCloud

How often do I need to run a full reindex on SolrCloud?
A full reindex takes more than 12 hours and we run it every night, but is it really necessary, given that the delta import runs correctly?
New data comes in at a rate of 2,000 documents per delta, every 30 seconds.
Total index size : 20GB
Solr: 6.5.2
If the delta runs correctly, there should be no need to run a reindex at all. The exception might be if you have disabled merging while the index is operative; in that case you could end up with an index that is very fragmented file-wise, and a full reindex would rebuild the complete set as a single index file instead. But that isn't usually how Solr is configured, and if it is, it's done for a reason.
So - if your delta is working correctly and you run Solr with fairly standard settings, you can safely skip reindexing unless you're starting over with an empty index (or have a situation where the schema has changed). But be sure that this also includes deletions - a reindex would probably not include deleted elements, so the question then becomes whether your delta import handles deletions as well.
None of our Solr based services reindex at all - everything is done with live updates and a decent merge factor.

When to optimize a Solr Index

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to Solr, none of the changes you make will appear in search results until you run the commit command. So when to run commit really depends on how quickly you need changes to appear on your site through the search engine. It is, however, a heavy operation, so it should be done in batches, not after every update.
Optimize: This is similar to defragmenting a hard drive. It reorganizes the index into fewer segments (increasing search speed) and removes any deleted (replaced) documents. Solr's underlying index segments are write-once, so every time you reindex a document the old version is marked as deleted and a brand-new document is created to replace it. Optimize removes these deleted documents. You can see the searchable vs. deleted document count on the Solr statistics page by comparing the numDocs and maxDocs values; the difference between the two is the number of deleted (non-searchable) documents still in the index.
Optimize also builds a whole new index from the old one and then switches to it when complete. The command therefore requires double the space, so you need to make sure the size of your index does not exceed 50% of your available hard drive space. (This is a rule of thumb; it usually needs less than 50% because of the deleted documents.)
Index Server / Search Server:
Paul Brown was right in that the best design for Solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the searching servers. You can tune the index server to expose multiple index endpoints,
e.g. http://solrindex01/index1; http://solrindex01/index2
And since the index server is not serving searches, you can set it up with a different memory footprint, index-warming commands, etc.
Hope this is useful info for everyone.
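To make the "commit in batches" advice concrete, here is a hedged SolrJ sketch (the core name, field names and batch size are illustrative assumptions): adds are buffered and committed once per batch, and optimize is left as a rare, explicit call rather than something run per update.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedAdIndexer {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/ads").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 5000; i++) {
                SolrInputDocument ad = new SolrInputDocument();
                ad.addField("id", "ad-" + i);
                ad.addField("title_s", "Classified ad " + i);
                batch.add(ad);
                if (batch.size() == 1000) {   // commit in batches, never after every single add
                    solr.add(batch);
                    solr.commit();
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
                solr.commit();
            }
            // Optimize rarely (e.g. nightly at most) -- it rewrites the whole index.
            solr.optimize();
        }
    }
}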
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while being optimized, and makes things really slow.
2- Committing after each add is NOT a good idea; it's better to commit a couple of times a day, and then run an optimize only once a day at most.
3- Commit should be configured via "autoCommit" in the solrconfig.xml file, and tuned there according to your needs.
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

Booking system search using solr

I am working on a booking system where a booking is made for a selected date range (start time to end time). I maintain an inventory table for availability, with a slot datetime column that gives the amount of inventory available for that time slot.
We currently do a range query in SQL, but we would like to have this inventory data indexed in Solr for faster range-query searches.
The problem I see is that whenever a booking is made the inventory data has to be updated - won't this constant updating affect Solr's performance?
This is what Solr's Near Real Time Search has been built for. You will need to fine tune this, but you should not face performance impediments, if used right.
Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed: additions and updates to documents are seen in 'near' real time. Solr does not block updates while a commit is in progress. Nor does it wait for background merges to complete before opening a new search of indexes and returning.
With NRT, you can modify a commit command to be a soft commit, which avoids parts of a standard commit that can be costly. You will still want to do standard commits to ensure that documents are in stable storage, but soft commits let you see a very near real time view of the index in the meantime.
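As a sketch only (the core name and the availability field name are assumptions, not from the question), a booking could be pushed as an atomic update with commitWithin, so the client never issues an explicit commit per booking and NRT/soft commits make the change visible:

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class InventoryUpdate {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/inventory").build()) {
            SolrInputDocument update = new SolrInputDocument();
            update.addField("id", "slot-2024-01-01T10:00");
            // Atomic update: only the availability counter changes, the rest of the document is untouched.
            update.addField("available_i", Collections.singletonMap("set", 7));
            // commitWithin (ms): Solr makes the change visible within roughly a second,
            // typically via a soft commit, so no explicit commit call is needed here.
            solr.add(update, 1000);
        }
    }
}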
When searching for further resources about this topic on the web, you should come across these articles:
Hard commits, soft commits and transaction logs by Erick Erickson
Benchmarking the new Solr ‘Near Realtime’ Improvements by Mark Miller
and of course the Solr reference

How to configure Solr for improved indexing speed

I have a client program which generates 1-50 million Solr documents and adds them to Solr.
I'm using ConcurrentUpdateSolrServer to push the documents from the client, 1000 documents per request.
The documents are relatively small (few small text fields).
I want to improve the indexing speed.
I've tried increasing "ramBufferSizeMB" to 1G and "mergeFactor" to 25 but didn't see any change.
I was wondering if there are other recommended settings for improving Solr indexing speed.
Any links to relevant materials will be appreciated.
It looks like you are doing a bulk import of data into Solr, so you don't need to search any data right away.
First, you can increase the number of documents per request. Since your documents are small, I would even increase it to 100K docs per request or more and try.
Second, you want to reduce the number of times commits happen when you are bulk indexing. In your solrconfig.xml look for:
<!-- AutoCommit
Perform a hard commit automatically under certain conditions.
Instead of enabling autoCommit, consider using "commitWithin"
when adding documents.
http://wiki.apache.org/solr/UpdateXmlMessages
maxDocs - Maximum number of documents to add since the last
commit before automatically triggering a new commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new commit.
openSearcher - if false, the commit causes recent index changes
to be flushed to stable storage, but does not cause a new
searcher to be opened to make those changes visible.
-->
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
You can disable autoCommit altogether and then call a commit after all your documents are posted. Otherwise you can tweak the numbers as follows:
The default maxTime is 15 secs so an auto commit happens every 15 secs if there are uncommitted docs, so you can set this to something large, say 3 hours (i.e. 3*60*60*1000). You can also add <maxDocs>50000000</maxDocs> which means an auto commit happens only after 50 million documents are added. After you post all your documents, call commit once manually or from SolrJ - it will take a while to commit, but this will be much faster overall.
Also after you are done with your bulk import, reduce maxTime and maxDocs, so that any incremental posts you will do to Solr will get committed much sooner. Or use commitWithin as mentioned in solrconfig.
In addition to the above, if you are using SolrCloud, consider using CloudSolrClient with SolrJ. The CloudSolrClient class is ZooKeeper-aware and can connect directly to the leader shard, which speeds up indexing in some cases.
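For illustration, here is a hedged SolrJ sketch of the bulk-loading pattern described above: large buffered batches via ConcurrentUpdateSolrClient (the newer name of ConcurrentUpdateSolrServer), autoCommit effectively out of the picture, and a single explicit commit at the end. The URL, queue size and thread count are assumed starting points, not tuned values.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        long totalDocs = 1_000_000L;   // the question mentions 1-50 million; kept smaller here
        try (SolrClient solr = new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/bigindex")
                .withQueueSize(100000)     // buffer many small documents per request
                .withThreadCount(4)        // stream updates on several background threads
                .build()) {
            for (long i = 0; i < totalDocs; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Long.toString(i));
                doc.addField("text_s", "small document " + i);
                solr.add(doc);             // buffered client-side, sent in the background
            }
            solr.commit();                 // one explicit hard commit when everything is posted
        }
    }
}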

solr indexing strategy

We have millions of documents in mongo we are looking to index on solr. Obviously when we do this the first time we need to index all the documents.
But after that, we should only need to index documents as they change. What is the best way to do this? Should we call addDocument and then call commit() from cron? And what do addDocument, commit, and optimize each do? (I am using Apache_Solr_Service.)
If you're using Solr 3.x you can forget the optimize, which merges all segments into one big segment. The commit makes changes visible to new IndexReaders; it's expensive, I wouldn't call it for each document you add. Instead of calling it through a cron, I'd use the autocommit in solrconfig.xml. You can tune the value depending on how much time you can wait to get new documents while searching.
The document won't actually be added to the index until you do commit() - it could be rolled back. optimize() will (ostensibly; I've not had particularly good luck with it) reduce the size of the index (documents that have been deleted still take up room unless the index is optimized).
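The question mentions PHP's Apache_Solr_Service, but the same pattern can be sketched in SolrJ for illustration (the core and field names are hypothetical): push each changed document with commitWithin and issue no per-change commit() or optimize(), leaving visibility to the server-side autocommit settings.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IncrementalIndexer {
    // Push one changed document; no explicit commit() or optimize() per change --
    // visibility is left to commitWithin / the server-side autocommit interval.
    public static void push(SolrClient solr, String id, String body) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("body_t", body);
        solr.add(doc, 300_000);   // commitWithin 5 minutes, matching a 5-minute autocommit-style window
    }

    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            push(solr, "mongo-42", "document synced from MongoDB");
        }
    }
}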
If you set autocommit for your database, then you can be sure that any documents added via an update have been committed once the autocommit interval has passed. I have used a 5-minute interval and it works fine even when a few thousand updates happen within those 5 minutes. After a full reindex is complete, I wait 5 minutes and then tell people that it is done. In fact, when people ask how quickly updates get into the db, I tell them that we poll for changes every minute, but that there are variables (such as a sudden big batch) and it is best not to expect things to be updated for 5 or 6 minutes. So far, nobody has really claimed a business need to have it update faster than that.
This is with a 350,000 record db totalling roughly 10G in RAM.
