Multiple connections updating solr documents - solr

I am designing a web application where multiple connections make changes to a databse and solr 5 using two-phase commit. My questions is, is there a way to isolate the changes each connection make so that changes made by a connection will not be visible to the rest of connections until the changes have been successfully committed to the database and solr?
After numerous searches I read this interesting article it suggests it is not possible. Anyone has done something similar?
Thanks in advance!

It is not possible. Solr's commit is server side to expose all accumulated documents to date.
One alternative approach you could use is to index into separate collections and then make them all available together as a single multi-collection alias. This, of course, implies that no document could theoretically exist in two collections at once (or you will get duplicates).

I faced the same issue. Solr provides 2 ways of commits.
Hard Commit - It commits your documents into disk after given period of time.
Soft Commit - It is real time commits your documents. It is good to use it. If you want to commits your document in real time.
You can config you solrconfig.xml such as
Hard Commit - It will properly commit your documents to disk within interval of 15 seconds.
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>true</openSearcher>
</autoCommit>
Soft Commit - It will commit your documents in real-time whenever any commit happens.
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

Related

Solr Cloud Data Import Handler slow with replication

I am setting up a Solr Cloud deployment with 3 nodes and 3 shards. Without replication my data import handler imports very quickly- around 1.2M documents in ~5minutes. This is great, however when I enable replication, i.e. re-create the collection with a replication factor of 2, the data import handler becomes significantly slower, taking around 1hr 30mins for the same 1.2M documents.
I am using solr 5.3.1 in cloud mode on 3 4x16 virtual servers with a zookeeper instance on each node. The data import comes from an MS SQL DB.
Most of my configuration is the defaults that come with Solr, I have tried changing the auto commit for hard and soft commits to being very long but no effect.
Any ideas/pointers would be much appreciated.
Thanks,
Ewen
Maybe not a proper answer but the issue seemed to resolve itself. Of course we must have done something to make this happen, however all we can think of that we did was removing the CONSOLE logging in the log4j properties file and deleting the 11GB log file it had created.
Guess this may at least may give something else for others to try who are having the same issue.
When you send a document to a collection, it first gets proxied to the leader shard for that document, then the leader shard applies it locally, then sends it to all active replicas, then returns to the client.
This means the 'send a document' request is held open until all replicas have either received the document or failed. This means that the time to insert a doc is the max time for any replica to insert the document.
So yes, a collection with a higher replication factor will be slower to insert documents, assuming a fixed number of indexer connections.
With respect to logging, Solr uses synchronous logging by default, so if you're writing logs to a very slow disk or nfs or something, that could certainly influence query time. I highly recommend async logging for everything, but that means messing with the default Solr settings.

What is the best approach to guarantee commits in Apache SOLR?

Question: How can I get "guarantee commits" with Apache SOLR where persisting data to disk and visibility are both equally important ?
Background: We have a website which requires high end search functionality for machine learning and also requires guaranteed commit for financial transaction. We just want to SOLR as our only datastore to keep things simple and do not want to use another database on the side.
I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query SOLR for the record after it has been persisted but this can have longer wait time or is there a better solution ?
Can anyone please suggest a solution for achieving "guaranteed commits" with SOLR ?
As you were told on the mailing list, Solr does not have transactions. If you index from a dozen clients, and a commit happens from somewhere (either autoSoftCommit, commitWithin on the udpate request, or an explicit commit from one of those dozen clients), all of the documents indexed by those dozen clients will be visible to all searchers.
With a transactional database, each of the dozen clients that is sending updates would have to issue a commit, which would only make the changes made by that specific client visible.
Solr usually does not make any guarantees regarding commits. If you issue ten commits in parallel, that will most likely exceed the maxWarmingSearchers configuration, which is typically set to 2. Most of those ten commits wouldn't actually create a new searcher, which is what makes new documents visible.
If you do manual commits in such a way that you are never exceeding maxWarmingSearchers, then when that commit finishes without error, you can take that as a sign that all changes are now visible.
The answer is that Solr is not designed to be the primary data store. Its data structures and indexing/retrieval designed for other use cases, even if it all seems like CRUD on the surface. You should have your data persisted somewhere else and then indexed in Solr - in the way that makes it easy to find - later. Same with Elasticsearch and other search-oriented software.
If you absolutely have to combine those things, look at the commercial products that included Solr on top of Cassandra or other similar databases.
Solr provides two type of commits to persist the data in solr.
Soft Commit: The soft commits persists the into Solr data structure. Solr guarantees visibility of the document after every soft commit. It does not actually stores the data into disk. So if the Solr instance goes down then this information can not be recovered.
Hard Commit: Every time application index the data to solr, it can perform the hard commit of the data. The hard commit persists the data into disk and it recoverable even the instance goes down. The disadvantage of frequent hard commit is, solr has to perform segment merges frequently, which is CPU intensive.
You can configure the autoCommit option in solrconfig.xml according to your needs.
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
There are merits and demerits of each approach. You can find more information on Apache Wiki Commits and an article from LucidWorks on commits in CloudSolr Understanding Transaction Logs, Soft Commit and Commit in SolrCloud

SOLR autoCommit vs autoSoftCommit

I'm very confused about and . Here is what I understand
autoSoftCommit - after a autoSoftCommit, if the the SOLR server goes down, the autoSoftCommit documents will be lost.
autoCommit - does a hard commit to the disk and make sure all the autoSoftCommit commits are written to disk and commits any other document.
My following configuration seems to be only with with autoSoftCommit. autoCommit on its own does not seems to be doing any commits. Is there something I am missing ?
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<autoSoftCommit>
<maxDocs>1000</maxDocs>
<maxTime>1200000</maxTime>
</autoSoftCommit>
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>120000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
</updateHandler>
why is autoCommit working on it's own ?
I think this article will be useful for you. It explains in detail how hard commit and soft commit work, and the tradeoffs that should be taken in account when tuning your system.
I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?
We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval, see the bulk indexing bit below.
These are places to start.
HEAVY (BULK) INDEXING
The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.
Set your soft commit interval quite long. As in10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn’t about near real time searching so don’t do the extra work of opening any kind of searcher.
Set your hard commit intervals to 15 seconds, openSearcher=false. Again the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first.
Only after you’ve tried the simple things should you consider refinements, they’re usually only required in unusual circumstances. But they include:
Turning off the tlog completely for the bulk-load operation
Indexing offline with some kind of map-reduce process
Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic, if the node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
etc.
INDEX-HEAVY, QUERY-LIGHT
By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.
Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer. Maybe even hours with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand.
Set your hard commit to 15 seconds, openSearcher=false
INDEX-LIGHT, QUERY-LIGHT OR HEAVY
This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update
Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
INDEX-HEAVY, QUERY-HEAVY
This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start
Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.
Set your hard commit interval to 15 seconds.
In my case (index heavy, query heavy), replication master-slave was taking too long time, slowing don the queries to the slave. By increasing the softCommit to 15min and increasing the hardCommit to 1min, the performance improvement was great. Now the replication works with no problems, and the servers can handle much more requests per second.
This is my use case though, I realized I don'r really need the items to be available on the master at real time, since the master is only used for indexing items, and new items are available in the slaves every replication cycle (5min), which is totally ok for my case. you should tune this parameters for your case.
You have openSearcher=false for hard commits. Which means that even though the commit happened, the searcher has not been restarted and cannot see the changes. Try changing that setting and you will not need soft commit.
SoftCommit does reopen the searcher. So if you have both sections, soft commit shows new changes (even if they are not hard-committed) and - as configured - hard commit saves them to disk, but does not change visibility.
This allows to put soft commit to 1 second and have documents show up quickly and have hard commit happen less frequently.
Soft commits are about visibility.
hard commits are about durability.
optimize are about performance.
Soft commits are very fast ,there changes are visible but this changes are not persist (they are only in memory) .So during the crash this changes might be last.
Hard commits changes are persistent to disk.
Optimize is like hard commit but it also merge solr index segments into a single segment for improving performance .But it is very costly.
A commit(hard commit) operation makes index changes visible to new search requests. A hard commit uses the transaction
log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have
been flushed to stable storage and no data loss will result from a power failure.
A soft commit is much faster since it only makes index changes visible and does not fsync index files or write
a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard
commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly
visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less
expensive" in terms of time, but not free, since it can slow throughput.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single
segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since
it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as
determined by the merge policy), and optimize just forces these merges to occur immediately.
auto commit properties we can manage from sorlconfig.xml files.
<autoCommit>
<maxTime>1000</maxTime>
</autoCommit>
<!-- SoftAutoCommit
Perform a 'soft' commit automatically under certain conditions.
This commit avoids ensuring that data is synched to disk.
maxDocs - Maximum number of documents to add since the last
soft commit before automaticly triggering a new soft commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automaticly
triggering a new soft commit.
-->
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
References:
https://wiki.apache.org/solr/SolrConfigXml
https://lucene.apache.org/solr/guide/6_6/index.html

Solr full-import performance

I have a small set of queries and entities and even though the performance is pretty bad, I just would like to know what tricks and configurations that i can do to increase the performance ?
Note I'm using Solr 4.1.
You should try to minimize the number of commits during your import. Even if you don't commit periodically when adding documents to Solr, Solr will do an auto commit based on solrconfig.xml autoCommit settings:
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
Increase both maxDocs and maxTime and see if you get better speeds. (maxTime is in milli seconds, so default setting is 15 secs only, which is very low for bulk imports.)
You can even try disabling auto-commit during your bulk import and issue one commit command after all your documents are added. If this does not throw an out-of-memory exception from Solr, it is the best speed you can get.
If you were doing an RDBMS import, then I would have suggested capturing as many fields as possible using JOINs and minimizing the number of sub-entities, since each sub-entity opens a separate connection to the DB. Since you are importing from mongo, this doesn't apply to you. You can experiment by creating a new mongo collection with all the data you need for Solr, keep a single entity in your data importer and see if it improves import speed.

Few questions about Solr. Transactions and Realtime search

I have a heldesk application in PHP/MySQL. I want to implement realtime Full text search and I have shortlisted Solr. MySQL database will store all the data and data required for search will be imported for building Solr index. All Search requests will be handled by Solr.
What I want is
Real time search. The moment someone updates a ticket, it should be available for search.
If multiple people update the ticket simultaneously, Solr should be able to handle the commits
As per my understanding of Solr, this is how I think the system will work. A user updates a ticket -> corrresponding database records modified -> a request is sent to Solr server to modify corresponding document in index.
I have read a book on Solr and below questions are troubling me.
The book mentions that
"commits are slow in Solr. Depending on the index size, Solr's
auto-warming configuration, and Solr's cache state prior to
committing, a commit can take a non-trivial amount of time. Typically,
it takes a few seconds, but it can take some number of minutes in
extreme cases"
If this is true then how will I know when the data will be availbale for search and how can I implemnt realtime search? Even if its taking a few seconds, it can't be real time. Also I don't want the ticket update operation to be slowed down (by adding extra step of updating Solr index)
It is also mentioned that
"there is no transaction isolation. This means that if more than one
Solr client were to submit modifications and commit them at
overlapping times, it is possible for part of one client's set of
changes to be committed before that client told Solr to commit. This
applies to rollback as well. If this is a problem for your
architecture then consider using one client process responsible for
updating Solr."
Doe it mean that that due to lack of transactional commits, Solr can mess up if multiple people update the ticket simultaneously?
Now the question before me is: Can I achieve the two using Solr? If yes, How?
Edit1:
Yeah! I came acorss a couple of similar questions but none has a staisfactory answer. So posting again. Sorry If you find it duplicate.
The functionality that you are requesting is known as Near Realtime Search also referred to as NRT. The work on NRT is still in progress, but there have been excellent incremental improvements to this support in Solr over the last couple of years. Please refer to the following links for more details on the current (versions 1.4 - 3.5) and future (ver 4.0) support for NRT.
NRT options
Solr Near Realtime Search for versions 3.5/3.4/3.3/3.2/1.4.1
Near Real Time Search ver 3.x
Near Realtime Search Tuning (ver 1.4 - 3.x)
Solr Near Realtime Search (ver 4.0)
Benchmarking the new Solr 'Near Realtime' improvements (ver 4.0)
Solr with Ranking Algorithm (ver 1.4 - 4.0)

Resources