When should we apply Hard commit and Soft commit in SOLR? - solr

I want to know when we should do hard commit and when we should do soft commit in SOLR.
Thanks

In the same vein as the question you just asked but deleted, this is explained thoroughly on the internet:
Soft commit when you want something to be made available as soon as possible without waiting for it to be written to disk. Hard commit when you want to make sure it is persisted to disk.
From the link above:
Soft commits
Soft commits are about visibility, hard commits are about durability. The thing to understand most about soft commits are that they will make documents visible, but at some cost. In particular the “top level” caches, which include what you configure in solrconfig.xml (filterCache, queryResultCache, etc) will be invalidated! Autowarming will be performed on your top level caches (e.g. filterCache, queryResultCache), and any newSearcher queries will be executed. Also, the FieldValueCache is invalidated, so facet queries will have to wait until the cache is refreshed. With very frequent soft commits it’s often the case that your top-level caches are little used and may, in some cases, be eliminated. However, “segment level caches”, used for function queries, sorting, etc., are “per segment”, so will not be invalidated on soft commit; they can continue to be used.
Hard commits
Hard commits are about durability, soft commits are about visibility. There are really two flavors here, openSearcher=true and openSearcher=false. First we’ll talk about what happens in both cases. If openSearcher=true or openSearcher=false, the following consequences are most important:
The tlog is truncated: A new tlog is started.
Old tlogs will be deleted if there are more than 100 documents in newer, closed tlogs.
The current index segment is closed and flushed.
Background segment merges may be initiated.
The above happens on all hard commits.
That leaves the openSearcher setting
openSearcher=true: The Solr/Lucene searchers are re-opened and all caches are invalidated. Autowarming is done etc. This used to be the only way you could see newly-added documents.
openSearcher=false: Nothing further happens other than the four points above. To search the docs, a soft commit is necessary.

Related

Replicate a database using snapshots and transaction logs

For learning purposes, I want to write my own database that is able to replicate itself. I have made some progress, but now I am facing a problem that I cannot solve. Suppose I have a database (let's call this the source) that I would like to replicate to another database (let's call this the target).
The basic principle is easy: in the source you don't store actual tables, but instead a log of transactions. It's easy to send the transaction log over to the target, where the database then rebuilds itself. If you want to update the target, you simply request the part of the transaction log that has been added since the last update. Basically this is what almost every database does.
While this works, it has one major drawback: If a table already exists for a long time, the transaction log is very long, and hence replicating the table requires lots of time…
To avoid this you can store the current state as well. This means you have an up-to-date snapshot that you can copy fast. Additionally, the target has to subscribe to the transaction log of the source. Once it contains additional entries, the target applies them to its copied table. This works well, too, and it's way better in terms of performance and transferred volume.
But now I am facing a problem: suppose the snapshot is large; then changes may be made to it while it is being delivered. That means the copied snapshot contains some old and some new data. Now, how do I get the target database into a consistent state? Even if I know where to start in the transaction log, I either have to apply a change that was already applied to some of the records, or I have to leave it out, but then the change is not applied at all to some other records.
Of course I could use the serializable isolation level, but then performance drops. Or I could do what e.g. CouchDB does: remember the current table revision in every record, and keep a copy of every record for every revision. But then the required space grows enormously.
So, what shall I do?
Everything I was able to find on the web relies either on replaying the entire transaction log or on a process like CouchDB's, which takes up huge amounts of space.
Any ideas?
Your snapshot needs to be consistent and you need to know at what time (in regards to the tx log) it is consistent. You then apply any transactions that have been committed since this point.
Obtaining a consistent snapshot can be done with exclusive locking, which may delay other transactions from committing, or using row versions (MVCC).
Good luck with your project.

Multiple connections updating solr documents

I am designing a web application where multiple connections make changes to a database and Solr 5 using two-phase commit. My question is: is there a way to isolate the changes each connection makes, so that changes made by one connection are not visible to the other connections until they have been successfully committed to both the database and Solr?
After numerous searches I read this interesting article, which suggests it is not possible. Has anyone done something similar?
Thanks in advance!
It is not possible. A Solr commit is a server-side operation that exposes all documents accumulated to date, regardless of which connection sent them.
One alternative approach you could use is to index into separate collections and then make them all available together as a single multi-collection alias. This, of course, implies that no document could theoretically exist in two collections at once (or you will get duplicates).
I faced the same issue. Solr provides two kinds of commit.
Hard commit - writes your documents to disk, for example automatically after a given period of time.
Soft commit - makes your documents visible to searches in near real time without waiting for them to be written to disk. Use it if you need documents to be searchable quickly.
You can configure both in solrconfig.xml, for example:
Hard commit - with this setting, documents are committed to disk at most 15 seconds after they are added.
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>
Soft commit - maxTime is the interval in milliseconds between automatic soft commits. Note that the default of -1 used here disables automatic soft commits unless the solr.autoSoftCommit.maxTime property is set to a positive value.
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

What is the best approach to guarantee commits in Apache SOLR?

Question: How can I get "guaranteed commits" with Apache SOLR, where persisting data to disk and visibility are both equally important?
Background: We have a website which requires high-end search functionality for machine learning and also requires guaranteed commits for financial transactions. We want to use SOLR as our only datastore to keep things simple and do not want to use another database on the side.
I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query SOLR for the record after it has been persisted, but this can mean a longer wait time. Is there a better solution?
Can anyone please suggest a solution for achieving "guaranteed commits" with SOLR?
As you were told on the mailing list, Solr does not have transactions. If you index from a dozen clients, and a commit happens from somewhere (either autoSoftCommit, commitWithin on the update request, or an explicit commit from one of those dozen clients), all of the documents indexed by those dozen clients will be visible to all searchers.
With a transactional database, each of the dozen clients that is sending updates would have to issue a commit, which would only make the changes made by that specific client visible.
Solr usually does not make any guarantees regarding commits. If you issue ten commits in parallel, that will most likely exceed the maxWarmingSearchers configuration, which is typically set to 2. Most of those ten commits wouldn't actually create a new searcher, which is what makes new documents visible.
If you do manual commits in such a way that you are never exceeding maxWarmingSearchers, then when that commit finishes without error, you can take that as a sign that all changes are now visible.
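The commitWithin option mentioned above can be attached to an individual update. A minimal sketch of an XML update message, assuming a schema with id and amount fields (both hypothetical) and a 10-second window chosen only for illustration:

```xml
<!-- Posted to the collection's /update handler. -->
<!-- commitWithin asks Solr to commit this add within 10000 ms. -->
<add commitWithin="10000">
  <doc>
    <field name="id">txn-0001</field>    <!-- hypothetical field and value -->
    <field name="amount">42.50</field>   <!-- hypothetical field and value -->
  </doc>
</add>
```

As with any other commit, the resulting commit exposes everything indexed so far by all clients, not just this document.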
The answer is that Solr is not designed to be the primary data store. Its data structures and indexing/retrieval are designed for other use cases, even if it all seems like CRUD on the surface. You should have your data persisted somewhere else and then index it in Solr afterwards, in a way that makes it easy to find. The same goes for Elasticsearch and other search-oriented software.
If you absolutely have to combine those things, look at the commercial products that included Solr on top of Cassandra or other similar databases.
Solr provides two types of commit.
Soft commit: A soft commit writes the documents into Solr's in-memory data structures. Solr guarantees visibility of the documents after every soft commit, but the index data is not actually flushed to disk. If the Solr instance goes down, these changes are lost from the index unless the update log (tlog) is enabled, in which case they can be replayed on restart.
Hard commit: Every time the application indexes data into Solr, it can perform a hard commit. The hard commit persists the data to disk, so it is recoverable even if the instance goes down. The disadvantage of frequent hard commits is that Solr ends up flushing many small segments and merging them frequently, which is CPU intensive.
You can configure the autoCommit option in solrconfig.xml according to your needs.
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
Each approach has merits and demerits. You can find more information on the Apache wiki page on commits and in the LucidWorks article "Understanding Transaction Logs, Soft Commit and Commit in SolrCloud".

SOLR autoCommit vs autoSoftCommit

I'm very confused about autoCommit and autoSoftCommit. Here is what I understand:
autoSoftCommit - after an autoSoftCommit, if the SOLR server goes down, the soft-committed documents will be lost.
autoCommit - does a hard commit to disk, making sure all the soft-committed documents are written to disk along with any other uncommitted documents.
My following configuration seems to work only with autoSoftCommit. autoCommit on its own does not seem to be doing any commits. Is there something I am missing?
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoSoftCommit>
    <maxDocs>1000</maxDocs>
    <maxTime>1200000</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>120000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
Why is autoCommit not working on its own?
I think this article will be useful for you. It explains in detail how hard commit and soft commit work, and the tradeoffs that should be taken in account when tuning your system.
I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?
We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval, see the bulk indexing bit below.
These are places to start.
HEAVY (BULK) INDEXING
The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.
Set your soft commit interval quite long. As in 10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn’t about near real time searching so don’t do the extra work of opening any kind of searcher.
Set your hard commit intervals to 15 seconds, openSearcher=false. Again the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first.
Only after you’ve tried the simple things should you consider refinements, they’re usually only required in unusual circumstances. But they include:
Turning off the tlog completely for the bulk-load operation
Indexing offline with some kind of map-reduce process
Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic, if the node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
etc.
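The two starting settings for bulk indexing might be written in solrconfig.xml roughly as follows. This is a sketch of the values suggested above (10 minutes and 15 seconds, expressed in milliseconds), not a tuned configuration:

```xml
<!-- Inside the <updateHandler> section of solrconfig.xml. -->
<autoCommit>
  <maxTime>15000</maxTime>            <!-- hard commit every 15 seconds -->
  <openSearcher>false</openSearcher>  <!-- durability only, no new searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>           <!-- soft commit every 10 minutes -->
</autoSoftCommit>
```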
INDEX-HEAVY, QUERY-LIGHT
By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.
Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer. Maybe even hours with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand.
Set your hard commit to 15 seconds, openSearcher=false
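Issuing a commit on demand, as mentioned above, can be done by posting an XML update message to the collection's update handler. A sketch of the two variants, each sent as its own request; the attribute names are the standard commit parameters as far as I know, so check them against your Solr version:

```xml
<!-- Variant 1: soft commit on demand (visibility only). -->
<commit softCommit="true"/>

<!-- Variant 2: hard commit on demand that also opens a new searcher. -->
<commit openSearcher="true"/>
```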
INDEX-LIGHT, QUERY-LIGHT OR HEAVY
This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update.
Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
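For this mostly static case, a sketch of a hard-commit-only configuration, assuming a 10-minute interval (the upper end of the range above) and no NRT requirement:

```xml
<!-- Inside the <updateHandler> section of solrconfig.xml. -->
<autoCommit>
  <maxTime>600000</maxTime>          <!-- hard commit every 10 minutes -->
  <openSearcher>true</openSearcher>  <!-- make the new documents searchable -->
</autoCommit>
<autoSoftCommit>
  <maxTime>-1</maxTime>              <!-- -1 disables automatic soft commits -->
</autoSoftCommit>
```

If a single external indexing process issues the hard commit itself, the automatic interval simply acts as a safety net.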
INDEX-HEAVY, QUERY-HEAVY
This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start:
Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.
Set your hard commit interval to 15 seconds.
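Because the soft commit interval here needs experimentation, one option is to make it overridable at startup using the property substitution syntax of solrconfig.xml. A sketch, where the 5-second default is purely an illustrative assumption and not a recommendation:

```xml
<!-- Inside the <updateHandler> section of solrconfig.xml. -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>     <!-- hard commit, 15 s default -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>  <!-- soft commit, assumed 5 s default -->
</autoSoftCommit>
```

Starting Solr with -Dsolr.autoSoftCommit.maxTime=10000 would then try a 10-second interval without editing the file.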
In my case (index heavy, query heavy), master-slave replication was taking too long, slowing down the queries to the slave. By increasing the soft commit interval to 15 min and the hard commit interval to 1 min, the performance improvement was great. Now replication works with no problems, and the servers can handle many more requests per second.
This is my use case though. I realized I don't really need the items to be available on the master in real time, since the master is only used for indexing, and new items become available on the slaves every replication cycle (5 min), which is totally OK for my case. You should tune these parameters for your own case.
You have openSearcher=false for hard commits. This means that even though the commit happened, the searcher has not been reopened and cannot see the changes. Try changing that setting and you will not need soft commit.
A soft commit does reopen the searcher. So if you have both sections, the soft commit shows new changes (even if they are not hard-committed) and the hard commit, as configured, saves them to disk but does not change visibility.
This lets you set the soft commit interval to 1 second so that documents show up quickly, while hard commits happen less frequently.
Soft commits are about visibility.
Hard commits are about durability.
Optimize is about performance.
Soft commits are very fast and make changes visible, but the changes are not persisted (they are only in memory), so they might be lost in a crash.
Hard commits persist changes to disk.
Optimize is like a hard commit, but it also merges the Solr index segments into a single segment to improve performance. It is very costly.
A commit (hard commit) operation makes index changes visible to new search requests. A hard commit uses the transaction
log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have
been flushed to stable storage and no data loss will result from a power failure.
A soft commit is much faster since it only makes index changes visible and does not fsync index files or write
a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard
commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly
visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less
expensive" in terms of time, but not free, since it can slow throughput.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single
segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since
it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as
determined by the merge policy), and optimize just forces these merges to occur immediately.
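An optimize can be requested explicitly through Solr's XML update interface. A sketch, with maxSegments set to 1 to match the single-segment behaviour described above:

```xml
<!-- Posted to the collection's /update handler; rewrites the whole index. -->
<optimize maxSegments="1"/>
```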
Auto-commit properties can be managed in the solrconfig.xml file:
<autoCommit>
  <maxTime>1000</maxTime>
</autoCommit>
<!-- SoftAutoCommit
     Perform a 'soft' commit automatically under certain conditions.
     This commit avoids ensuring that data is synched to disk.
     maxDocs - Maximum number of documents to add since the last
               soft commit before automatically triggering a new soft commit.
     maxTime - Maximum amount of time in ms that is allowed to pass
               since a document was added before automatically
               triggering a new soft commit.
-->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
References:
https://wiki.apache.org/solr/SolrConfigXml
https://lucene.apache.org/solr/guide/6_6/index.html

solr optimize command and concurrency

I know that when Solr performs optimization, either explicitly by the optimize command, or implicitly by Lucene due to the mergeFactor, readers are not blocked. That is, the server is still available for searching
Is it also available for updates? Can other threads in my application send documents updates to solr, and possibly also send commits? Will those updates pass through into the index, or will they be blocked?
This is an old question, but some more info may help here.
The optimize command in Solr is a call to IndexWriter's forceMerge() method. This method does take a lock on the IndexWriter instance itself. However, adding documents does not require any lock on the IndexWriter instance, nor does it need the commitLock or fullFlushLock.
Moreover, even with forceMerge(), it is the ConcurrentMergeScheduler that picks up the merge and runs it in a different thread altogether.
A normal merge (not forceMerge, which is not recommended anyway) needs to lock the IndexWriter instance only while preparing the merge info, when it needs to determine which segments to merge, what the new merged segment will be called, and so on. Once it has this information, the merge happens concurrently.
So, yes, you can keep adding documents even while an optimize is in progress. They will be buffered in RAM until the next commit/optimize or close() of the IndexWriter.
Having said that, I might as well add that you cannot have concurrent commits to different segments; that is, Lucene will do only one commit at a time. Adding documents does not flush them to any segment at all; it just puts them in a buffer.
The answer is "Yes". The server will respond to search requests, but updated documents will not show up in the search results until you send a commit command. The updated documents will stack up and be committed whenever a client/thread issues a commit command to the server. If you have multiple clients/threads issuing updates and commits they will not block each other, and the updates will show as soon as the commit command completes.

Resources