Solr autoCommit vs autoSoftCommit

I'm very confused about autoCommit and autoSoftCommit. Here is what I understand:
autoSoftCommit - after an autoSoftCommit, if the Solr server goes down, the soft-committed documents will be lost.
autoCommit - does a hard commit to disk, making sure all soft-committed documents are written to disk, and commits any other pending documents.
The following configuration seems to work only with autoSoftCommit. autoCommit on its own does not seem to be doing any commits. Is there something I am missing?
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<autoSoftCommit>
<maxDocs>1000</maxDocs>
<maxTime>1200000</maxTime>
</autoSoftCommit>
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>120000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
</updateHandler>
Why isn't autoCommit working on its own?

I think this article will be useful for you. It explains in detail how hard commit and soft commit work, and the tradeoffs that should be taken into account when tuning your system.
I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?
We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval, see the bulk indexing bit below.
These are places to start.
HEAVY (BULK) INDEXING
The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.
Set your soft commit interval quite long, as in 10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn’t about near real time searching, so don’t do the extra work of opening any kind of searcher.
Set your hard commit intervals to 15 seconds, openSearcher=false. Again the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first.
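A rough sketch of what that updateHandler section might look like for a bulk load (the 15-second and 10-minute values are just the starting points suggested above, not magic numbers):
<autoCommit>
<maxTime>15000</maxTime> <!-- hard commit every 15 seconds -->
<openSearcher>false</openSearcher> <!-- durability only, no new searcher -->
</autoCommit>
<autoSoftCommit>
<maxTime>600000</maxTime> <!-- visibility only every 10 minutes -->
</autoSoftCommit>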
Only after you’ve tried the simple things should you consider refinements, they’re usually only required in unusual circumstances. But they include:
Turning off the tlog completely for the bulk-load operation
Indexing offline with some kind of map-reduce process
Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic, if the node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
etc.
INDEX-HEAVY, QUERY-LIGHT
By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.
Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer. Maybe even hours with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand.
Set your hard commit to 15 seconds, openSearcher=false
INDEX-LIGHT, QUERY-LIGHT OR HEAVY
This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update.
Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
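Purely as an illustration, assuming a 10-minute visibility window is acceptable, the configuration could drop autoSoftCommit entirely and open a searcher on each hard commit (or, as noted, have the indexing client send the commit instead):
<autoCommit>
<maxTime>600000</maxTime> <!-- hard commit every 10 minutes -->
<openSearcher>true</openSearcher> <!-- make the committed documents visible -->
</autoCommit>
<!-- no <autoSoftCommit> section needed in this scenario -->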
INDEX-HEAVY, QUERY-HEAVY
This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start:
Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.
Set your hard commit interval to 15 seconds.
In my case (index heavy, query heavy), master-slave replication was taking too long, slowing down the queries to the slave. By increasing the softCommit interval to 15 min and increasing the hardCommit interval to 1 min, the performance improvement was great. Now the replication works with no problems, and the servers can handle many more requests per second.
This is my use case though: I realized I don't really need the items to be available on the master in real time, since the master is only used for indexing items, and new items are available in the slaves every replication cycle (5 min), which is totally OK for my case. You should tune these parameters for your own case.

You have openSearcher=false for hard commits, which means that even though the commit happened, the searcher has not been reopened and cannot see the changes. Try changing that setting and you will not need soft commits.
SoftCommit does reopen the searcher. So if you have both sections, soft commit shows new changes (even if they are not hard-committed) and - as configured - hard commit saves them to disk, but does not change visibility.
This allows you to set the soft commit interval to 1 second, have documents show up quickly, and have hard commits happen less frequently.
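Applied to the configuration in the question, that combination might look like the following (the 1-second and 60-second intervals are only illustrative):
<autoSoftCommit>
<maxTime>1000</maxTime> <!-- new documents become searchable within about a second -->
</autoSoftCommit>
<autoCommit>
<maxTime>60000</maxTime> <!-- flush to disk every minute for durability -->
<openSearcher>false</openSearcher> <!-- visibility comes from the soft commit -->
</autoCommit>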

Soft commits are about visibility.
Hard commits are about durability.
Optimize is about performance.
Soft commits are very fast: their changes are visible, but the changes are not persisted (they are only in memory), so in case of a crash those changes might be lost.
Hard commit changes are persisted to disk.
Optimize is like a hard commit, but it also merges Solr index segments into a single segment to improve performance. It is very costly.
A commit (hard commit) operation makes index changes visible to new search requests. A hard commit uses the transaction
log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have
been flushed to stable storage and no data loss will result from a power failure.
A soft commit is much faster since it only makes index changes visible and does not fsync index files or write
a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard
commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly
visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less
expensive" in terms of time, but not free, since it can slow throughput.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single
segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since
it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as
determined by the merge policy), and optimize just forces these merges to occur immediately.
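For reference, a commit or optimize can also be triggered explicitly by posting an XML update message to the /update handler. A minimal sketch (attribute support can vary slightly between Solr versions):
<commit waitSearcher="true"/> <!-- hard commit; wait until the new searcher is registered -->
<optimize maxSegments="1"/> <!-- force-merge the index down to a single segment -->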
Auto commit properties can be managed from the solrconfig.xml file.
<autoCommit>
<maxTime>1000</maxTime>
</autoCommit>
<!-- SoftAutoCommit
Perform a 'soft' commit automatically under certain conditions.
This commit avoids ensuring that data is synched to disk.
maxDocs - Maximum number of documents to add since the last
soft commit before automatically triggering a new soft commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new soft commit.
-->
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
References:
https://wiki.apache.org/solr/SolrConfigXml
https://lucene.apache.org/solr/guide/6_6/index.html

Related

When should we apply Hard commit and Soft commit in SOLR?

I want to know when we should do hard commit and when we should do soft commit in SOLR.
Thanks
In the same vein as the question you just asked but deleted, this is explained thoroughly on the internet:
Soft commit when you want something to be made available as soon as possible without waiting for it to be written to disk. Hard commit when you want to make sure it is being persisted to disk.
From the link above:
Soft commits
Soft commits are about visibility, hard commits are about durability. The thing to understand most about soft commits is that they will make documents visible, but at some cost. In particular the “top level” caches, which include what you configure in solrconfig.xml (filterCache, queryResultCache, etc) will be invalidated! Autowarming will be performed on your top level caches (e.g. filterCache, queryResultCache), and any newSearcher queries will be executed. Also, the FieldValueCache is invalidated, so facet queries will have to wait until the cache is refreshed. With very frequent soft commits it’s often the case that your top-level caches are little used and may, in some cases, be eliminated. However, “segment level caches”, used for function queries, sorting, etc., are “per segment”, so will not be invalidated on soft commit; they can continue to be used.
Hard commits
Hard commits are about durability, soft commits are about visibility. There are really two flavors here, openSearcher=true and openSearcher=false. First we’ll talk about what happens in both cases. Whether openSearcher=true or openSearcher=false, the following consequences are most important:
The tlog is truncated: A new tlog is started.
Old tlogs will be deleted if there are more than 100 documents in newer, closed tlogs.
The current index segment is closed and flushed.
Background segment merges may be initiated.
The above happens on all hard commits.
That leaves the openSearcher setting:
openSearcher=true: The Solr/Lucene searchers are re-opened and all caches are invalidated. Autowarming is done etc. This used to be the only way you could see newly-added documents.
openSearcher=false: Nothing further happens other than the four points above. To search the docs, a soft commit is necessary.

Solr indexing slowdown

I'm working on a product which indexes a high volume of small documents.
When Solr starts, it provides an indexing rate of 35k docs/sec for around 20 minutes and then starts to slow down to 24k/sec.
If I restart the server, it will again index at 35k/sec for 20 minutes and then slow down again.
I have a softCommit every 5 seconds and a hard commit every minute.
I was wondering if someone might have some insight about this?
I don't think it is related to merges since I see merger threads kicking in after 2-3 minutes.
You should check the usual suspects:
There is a problem with the Java (or whatever language you are using) application that you're using to index. If that's the case, please specify the implementation details and I will provide more guidelines;
Your NRT cache fills up after 20 minutes and the hard commit doesn't happen quickly enough. To check this option, set the maximum number of documents to index before writing the docs from cache to disk, in the following way: <autoCommit><maxDocs>10000</maxDocs></autoCommit>. If this is the issue, you can then tune the autoCommit settings or the NRT cache management.

What is the best approach to guarantee commits in Apache SOLR?

Question: How can I get "guaranteed commits" with Apache Solr, where persisting data to disk and visibility are both equally important?
Background: We have a website which requires high-end search functionality for machine learning and also requires guaranteed commits for financial transactions. We just want to use Solr as our only datastore to keep things simple and do not want to use another database on the side.
I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query Solr for the record after submitting it, but this can mean a long wait time. Is there a better solution?
Can anyone please suggest a solution for achieving "guaranteed commits" with Solr?
As you were told on the mailing list, Solr does not have transactions. If you index from a dozen clients, and a commit happens from somewhere (either autoSoftCommit, commitWithin on the update request, or an explicit commit from one of those dozen clients), all of the documents indexed by those dozen clients will be visible to all searchers.
With a transactional database, each of the dozen clients that is sending updates would have to issue a commit, which would only make the changes made by that specific client visible.
Solr usually does not make any guarantees regarding commits. If you issue ten commits in parallel, that will most likely exceed the maxWarmingSearchers configuration, which is typically set to 2. Most of those ten commits wouldn't actually create a new searcher, which is what makes new documents visible.
If you do manual commits in such a way that you are never exceeding maxWarmingSearchers, then when that commit finishes without error, you can take that as a sign that all changes are now visible.
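One related setting worth knowing about: commitWithin on an update request performs a soft commit by default, so it gives visibility rather than durability. If durability is the priority, solrconfig.xml can (to the best of my knowledge, in Solr 4 and later) be told to treat commitWithin as a hard commit; a sketch:
<updateHandler class="solr.DirectUpdateHandler2">
<commitWithin>
<softCommit>false</softCommit> <!-- make commitWithin perform hard commits -->
</commitWithin>
</updateHandler>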
The answer is that Solr is not designed to be the primary data store. Its data structures and indexing/retrieval are designed for other use cases, even if it all seems like CRUD on the surface. You should have your data persisted somewhere else and then indexed in Solr - in the way that makes it easy to find - later. The same goes for Elasticsearch and other search-oriented software.
If you absolutely have to combine those things, look at the commercial products that included Solr on top of Cassandra or other similar databases.
Solr provides two type of commits to persist the data in solr.
Soft Commit: A soft commit persists the data into Solr's data structures. Solr guarantees visibility of the document after every soft commit. It does not actually store the data on disk, so if the Solr instance goes down, this information cannot be recovered.
Hard Commit: Every time the application indexes data to Solr, it can perform a hard commit of the data. A hard commit persists the data to disk, and it is recoverable even if the instance goes down. The disadvantage of frequent hard commits is that Solr has to perform segment merges frequently, which is CPU intensive.
You can configure the autoCommit option in solrconfig.xml according to your needs.
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
There are merits and demerits of each approach. You can find more information on the Apache wiki page on commits and in an article from Lucidworks on commits in SolrCloud: Understanding Transaction Logs, Soft Commit and Commit in SolrCloud.

How much RealTime is Elasticsearch, Solr and DSE realtime search?

For the last couple of weeks, I have been working with Elasticsearch and Solr, trying to do OLTP processing in real time. However, what strikes me is that they claim (especially ES) to be real time. The meaning of "real time" looks quite fuzzy to me.
If we go deeper into it, both ES and Solr define a refresh rate or a soft-commit rate, after which newly indexed documents become available for search, effectively providing only near-real-time capabilities.
It looks like calling it real-time search is either a marketing statement, or the term is made fuzzy by talking about real-time search rather than batch or analytical processing.
Am I correct, or correct me if I am wrong: is real-time search possible in a typical OLTP system, where every transaction has search visibility of the last document?
Elasticsearch is a near-real-time engine for search. It is real time for operations like Create, Update, Delete and Get.
By default, the refresh interval is 1 second. In some use cases, it can appear to be real time. For example, I was working for a French government service and we were producing statistics per day, so for our use case it was effectively real time from our perspective.
For logs for example, 1 second is enough in most use cases.
You can modify this default value but it comes with a cost.
If you really need real time, then you probably want to use a SQL database.
My 2 cents.
Yes, DSE Search is indeed near real-time and has not yet achieved the mythical goal of absolute zero latency. But... even traditional "real" real-time is not real-time once you factor in the time to do the actual database update, plus the fact that a lot of traditional database updates are batch-oriented; and even if the actual update operation is not batched, there is likely to be some human process that delays the start of the database update from the original source of a data change.
Also keep in mind that the latency of a database update needs to include maintaining the required (tunable) consistency for replicating data updates in the cluster.
Rather than push you back towards SQL if you want real-time, I would challenge you to fully justify the true latency requirements of the app. For example, with complex distributed applications you need to be prepared for occasional resource outages, such as network delays, so that it is usually much better to design a modern distributed application to be a lot more flexible and asynchronous than a traditional, synchronous, fragile (think HealthCare.gov) app architecture that improperly depends on a perception of zero-latency distributed operations.
Finally, we are working on enhancements to reduce the actual latency of database updates, coupled with ongoing improvements in hardware performance that further shrink the update latency window.
But ultimately, all computing real-time measures will have some non-zero latency and modern distributed apps must be designed for at least some degree of decoupling between database updates and absolute dependency on those updates.
Worst case scenario, apps that need to synchronize with database updates may need to implement a polling strategy to wait for the update to complete.
Elasticsearch has real-time features for CRUD operations. On GET operations, it checks the transaction log to look for any uncommitted changes and returns the most recent version of the document.
The Percolator feature enables real time in search queries as well. It allows you to register queries (percolations) that will be used at indexing time to match incoming documents against those predefined queries.
This workflow looks like this:
Register specific query (percolation) in Elasticsearch
Index new content (passing a flag to trigger percolation)
The response to the indexing operation will contain the matched percolations
A very good blog with live example that explains the Percolator concept:
http://blog.qbox.io/elasticsesarch-percolator

Are long high volume transactions bad

Hello,
I am writing a database application that does a lot of inserts and updates with a fake serializable isolation level (snapshot isolation).
To avoid tons of network round trips, I'm batching inserts and updates in one transaction with PreparedStatements. They should fail very seldom because the inserts are prechecked and nearly conflict-free with other transactions, so rollbacks don't occur often.
Having big transactions should be good for WAL, because it can flush big chunks and doesn't have to flush for mini transactions.
1.) I can only see positive effects of a big transaction. But I often read that they are bad. Why could they be bad in my use case?
2.) Is the checking for conflicts so expensive when the local snapshots are merged back into the real database? The database will have to compare all write sets of possible conflicts (parallel transactions). Or does it do some high-speed shortcut? Or is that quite cheap anyway?
[EDIT] It might be interesting if someone could bring some clarity into how a snapshot isolation database checks whether transactions that overlap on the timeline have disjoint write sets. Because that's what the fake serializable isolation level is all about.
The real issues here are twofold. The first possible problem is bloat. Large transactions can result in a lot of dead tuples showing up at once. The other possible problem is long-running transactions. As long as a long-running transaction is running, the tables it touches can't be vacuumed, so they can collect lots of dead tuples as well.
I'd say just use check_postgresql.pl to check for bloating issues. As long as you don't see a lot of table bloat after your long transactions you're ok.
1) The manual says that it is good: http://www.postgresql.org/docs/current/interactive/populate.html
I can also recommend: use COPY, remove indexes (but test first), increase maintenance_work_mem, increase checkpoint_segments, and run ANALYZE (or VACUUM ANALYZE) afterwards.
I will not recommend the following unless you are sure: removing foreign key constraints, or disabling WAL archival and streaming replication.
2) Data is always merged on commit, but there are no checks; the data is just written. Read again: http://www.postgresql.org/docs/current/interactive/transaction-iso.html
If your inserts/updates do not depend on other inserts/updates, you don't need a "wholly consistent view". You may use read committed and the transaction will never fail.
