Solr full-import performance - solr

I have a small set of queries and entities, yet full-import performance is pretty bad. What tricks and configuration changes can I use to improve it?
Note: I'm using Solr 4.1.

You should try to minimize the number of commits during your import. Even if you don't commit periodically when adding documents to Solr, Solr will auto-commit based on the autoCommit settings in solrconfig.xml:
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
Increase both maxDocs and maxTime and see if you get better speeds. (maxTime is in milliseconds, so the default setting is only 15 seconds, which is very low for bulk imports.)
You can even try disabling auto-commit during your bulk import and issue one commit command after all your documents are added. If this does not throw an out-of-memory exception from Solr, it is the best speed you can get.
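As a rough sketch of the "add everything, commit once" pattern for a client that posts documents itself: the code below splits the documents into batches, sends each batch without a commit, and issues a single commit at the end. The host, port, core name "mycore", the batch size, and the use of the JSON update format on /update are all assumptions for illustration; check what your Solr version accepts.

```python
import json
import urllib.request

# Assumed URL: host, port, and the core name "mycore" are placeholders.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def chunked(docs, size):
    """Yield successive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_requests(docs, batch_size=1000, update_url=SOLR_UPDATE_URL):
    """Return (url, body) pairs: one add per batch, then one final commit."""
    requests = [(update_url, json.dumps(batch)) for batch in chunked(docs, batch_size)]
    # The only commit in the whole import is this last request.
    requests.append((update_url, json.dumps({"commit": {}})))
    return requests


def send(url, body):
    """POST one JSON body to Solr (needs a running Solr instance)."""
    req = urllib.request.Request(
        url, data=body.encode("utf-8"), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

To run it against a live server you would loop over build_requests(docs) and call send(url, body) for each pair. This only helps if autoCommit is also disabled or relaxed on the server side, as described above.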
If you were doing an RDBMS import, then I would have suggested capturing as many fields as possible using JOINs and minimizing the number of sub-entities, since each sub-entity opens a separate connection to the DB. Since you are importing from MongoDB, this doesn't apply to you. You can experiment by creating a new MongoDB collection with all the data you need for Solr, keeping a single entity in your data importer, and seeing whether import speed improves.

Related

Solr Cloud Data Import Handler slow with replication

I am setting up a SolrCloud deployment with 3 nodes and 3 shards. Without replication my data import handler imports very quickly: around 1.2M documents in ~5 minutes. This is great. However, when I enable replication, i.e. re-create the collection with a replication factor of 2, the data import handler becomes significantly slower, taking around 1 hr 30 min for the same 1.2M documents.
I am using Solr 5.3.1 in cloud mode on three 4x16 virtual servers, with a ZooKeeper instance on each node. The data import comes from an MS SQL database.
Most of my configuration is the defaults that come with Solr. I have tried making the auto-commit intervals for hard and soft commits very long, but it had no effect.
Any ideas/pointers would be much appreciated.
Thanks,
Ewen
Maybe not a proper answer, but the issue seemed to resolve itself. Of course we must have done something to make this happen; however, all we can think of is that we removed the CONSOLE logging in the log4j properties file and deleted the 11GB log file it had created.
I guess this may at least give others who are having the same issue something else to try.
When you send a document to a collection, it first gets proxied to the shard leader for that document; the leader applies it locally, then sends it to all active replicas, then returns to the client.
This means the 'send a document' request is held open until all replicas have either received the document or failed, so the time to insert a document is the maximum time any replica takes to insert it.
So yes, a collection with a higher replication factor will be slower to insert documents, assuming a fixed number of indexer connections.
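To make the arithmetic concrete, here is a toy model of that request path. It assumes the leader's local apply and the replica forwarding happen one after the other, and that each indexer connection sends one document at a time; the millisecond figures in the usage note are made up for illustration and are not measurements.

```python
def insert_latency(leader_ms, replica_ms):
    """Time the client waits: the leader's local apply plus the slowest replica."""
    return leader_ms + max(replica_ms, default=0)


def docs_per_second(connections, latency_ms):
    """Throughput for a fixed number of synchronous indexer connections."""
    return connections * 1000.0 / latency_ms
```

With no replicas and a 2 ms leader, insert_latency(2, []) is 2 ms, so 4 connections manage 2000 docs per second. Add two replicas where one takes 40 ms and insert_latency(2, [3, 40]) is 42 ms, so the same 4 connections drop to roughly 95 docs per second. A single slow replica is enough to dominate the import time.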
With respect to logging, Solr uses synchronous logging by default, so if you're writing logs to a very slow disk or NFS or something, that could certainly influence query time. I highly recommend async logging for everything, but that means messing with the default Solr settings.

Multiple connections updating solr documents

I am designing a web application where multiple connections make changes to a database and Solr 5 using two-phase commit. My question is: is there a way to isolate the changes each connection makes, so that they are not visible to the other connections until they have been successfully committed to both the database and Solr?
After numerous searches I found an interesting article which suggests it is not possible. Has anyone done something similar?
Thanks in advance!
It is not possible. A Solr commit is a server-side operation that exposes all documents accumulated to date, whichever client sent them.
One alternative approach is to index into separate collections and then make them all available together through a single multi-collection alias. This, of course, requires that no document exists in two collections at once (or you will get duplicates).
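A minimal sketch of the alias idea, assuming SolrCloud and its Collections API CREATEALIAS action. The base URL, the alias name, and the collection names are hypothetical placeholders; the function only builds the request URL and does not contact Solr.

```python
from urllib.parse import urlencode


def create_alias_url(base_url, alias, collections):
    """Build a Collections API URL that points one alias at several collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return base_url.rstrip("/") + "/admin/collections?" + urlencode(params)
```

For example, create_alias_url("http://localhost:8983/solr", "tickets", ["tickets_a", "tickets_b"]) gives a URL you could request once both collections are ready; searches against "tickets" would then cover both. Note that this groups collections for searching and does not by itself give per-connection transaction isolation.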
I faced the same issue. Solr provides two kinds of commit.
Hard commit - writes your documents to disk after a given period of time.
Soft commit - makes your documents visible in near real time without waiting for a hard commit. Use it if you want documents to become searchable quickly.
You can configure both in solrconfig.xml, for example:
Hard commit - commits your documents to disk every 15 seconds. With openSearcher set to true, it also opens a new searcher so the documents become visible.
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>
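Soft commit - makes your documents searchable in near real time. maxTime is the interval in milliseconds; with the default of -1 shown here, automatic soft commits are disabled, so set a positive value to enable them.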
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

What is the best approach to guarantee commits in Apache SOLR?

Question: How can I get "guaranteed commits" with Apache Solr, where persisting data to disk and visibility are both equally important?
Background: We have a website that requires high-end search functionality for machine learning and also requires guaranteed commits for financial transactions. We want to use Solr as our only datastore to keep things simple, and do not want to use another database on the side.
I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query Solr for the record after it has been persisted, but this can mean a longer wait time. Is there a better solution?
Can anyone please suggest a solution for achieving "guaranteed commits" with Solr?
As you were told on the mailing list, Solr does not have transactions. If you index from a dozen clients, and a commit happens from somewhere (either autoSoftCommit, commitWithin on the update request, or an explicit commit from one of those dozen clients), all of the documents indexed by those dozen clients will be visible to all searchers.
With a transactional database, each of the dozen clients that is sending updates would have to issue a commit, which would only make the changes made by that specific client visible.
Solr usually does not make any guarantees regarding commits. If you issue ten commits in parallel, that will most likely exceed the maxWarmingSearchers configuration, which is typically set to 2. Most of those ten commits wouldn't actually create a new searcher, which is what makes new documents visible.
If you do manual commits in such a way that you are never exceeding maxWarmingSearchers, then when that commit finishes without error, you can take that as a sign that all changes are now visible.
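One way to keep manual commits from overlapping, sketched under the assumption that all commits come from a single client process: wrap the commit call in a lock so only one is in flight at a time. The class name SerializedCommitter and the injected do_commit callable are hypothetical; commit and waitSearcher are standard update request parameters, but the URL passed in is up to you.

```python
import threading
from urllib.parse import urlencode


def commit_url(update_url):
    """URL for an explicit commit that blocks until the new searcher is ready."""
    return update_url + "?" + urlencode({"commit": "true", "waitSearcher": "true"})


class SerializedCommitter:
    """Lets only one commit run at a time within this process."""

    def __init__(self, do_commit):
        self._lock = threading.Lock()
        self._do_commit = do_commit  # callable that actually talks to Solr

    def commit(self):
        # Other threads wait here until the current commit has returned.
        with self._lock:
            return self._do_commit()
```

This only serializes commits issued through this one object. Commits from other processes, commitWithin, or the server's own autoCommit and autoSoftCommit settings can still overlap with it, so those need to be accounted for separately.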
The answer is that Solr is not designed to be the primary data store. Its data structures and indexing/retrieval are designed for other use cases, even if it all seems like CRUD on the surface. You should have your data persisted somewhere else and then indexed in Solr, in a way that makes it easy to find, later. The same goes for Elasticsearch and other search-oriented software.
If you absolutely have to combine those things, look at the commercial products that include Solr on top of Cassandra or other similar databases.
Solr provides two types of commit.
Soft commit: A soft commit makes documents visible in Solr's in-memory data structures; Solr guarantees visibility of the documents after every soft commit. It does not actually write the index data to disk, so if the Solr instance goes down, those documents cannot be recovered from the index alone (they can only be replayed if the transaction log is enabled).
Hard commit: Every time the application indexes data into Solr, it can perform a hard commit. A hard commit persists the data to disk, so it is recoverable even if the instance goes down. The disadvantage of frequent hard commits is that Solr has to perform segment merges frequently, which is CPU intensive.
You can configure the autoCommit option in solrconfig.xml according to your needs.
<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
Each approach has merits and demerits. You can find more information in the Apache Wiki's coverage of commits and in the LucidWorks article on commits in SolrCloud, "Understanding Transaction Logs, Soft Commit and Commit in SolrCloud".

Sunspot with Solr 3.5. Manually updating indexes for real time search

I'm working with Rails 3 and Sunspot Solr 3.5. My application uses Solr to index user-generated content and make it searchable for other users. The goal is to let users search this data as soon as possible after it is uploaded. I don't know if this qualifies as real-time search.
My application has two models:
Posts
PostItems
I index posts by including data from post items, so that when a user searches on a description provided in a post_item record, the corresponding post object is returned in the search.
Users frequently update post_items so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item will be available during search.
So at the moment whenever I receive a new post_item object I run
post_item.post.solr_index!
which according to this documentation instantly updates the index and commits. This works, but is this the right way to handle indexing in this scenario? I read here that calling index while searching may break Solr. Also, frequent manual index calls are not the way to go.
Any suggestions on the right way to do this? Are there alternatives other than switching to ElasticSearch?
Try this gem: https://github.com/bdurand/sunspot_index_queue
You will then be able to batch reindex, say, every minute, and it definitely will not break the index.
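The general idea behind queued, batched reindexing can be sketched as follows. This is a language-neutral illustration written in Python, not the gem's actual implementation; the class and method names are made up. Each change records the id of the post that needs reindexing, duplicates collapse, and a periodic job drains the queue and reindexes the whole batch with one commit.

```python
class PendingIndexQueue:
    """Collects ids of posts that need reindexing; repeated ids collapse to one."""

    def __init__(self):
        self._pending = set()

    def enqueue(self, post_id):
        """Record that this post changed (cheap, no Solr call)."""
        self._pending.add(post_id)

    def drain(self):
        """Return the ids to reindex in this batch and empty the queue."""
        batch = sorted(self._pending)
        self._pending.clear()
        return batch
```

If a post receives three new post_items within the same minute, it is enqueued three times but reindexed once when the periodic job calls drain().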
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. Our conclusion was that Solr was built/optimized for indexing huge documents (Word/PDF content) in large numbers (billions?), but with the index updated once a day or every couple of days when nobody is searching.
It was the wrong choice for a consumer Rails application where documents are small and relatively few (in the millions), updates are random and continuous, and the search needs to be somewhat real time (a delay of 5-10 sec is fine).
Some of the tricks we applied to tune the server:
removed all commits (i.e., the ! methods) from Rails code,
use Solr auto-commit every 5/20 seconds,
have a master/slave configuration,
run index optimization (on master) every hour,
and more.
We still see high CPU usage on the slaves when the commit triggers. As a result some searches take a long time (> 60 seconds at times).
I also doubt that batching index updates with the sunspot_index_queue gem can remedy the high CPU issue.

Few questions about Solr. Transactions and Realtime search

I have a helpdesk application in PHP/MySQL. I want to implement real-time full-text search and I have shortlisted Solr. The MySQL database will store all the data, and the data required for search will be imported to build the Solr index. All search requests will be handled by Solr.
What I want is:
Real time search. The moment someone updates a ticket, it should be available for search.
If multiple people update the ticket simultaneously, Solr should be able to handle the commits.
As per my understanding of Solr, this is how I think the system will work. A user updates a ticket -> corresponding database records modified -> a request is sent to Solr server to modify corresponding document in index.
I have read a book on Solr and the following questions are troubling me.
The book mentions that
"commits are slow in Solr. Depending on the index size, Solr's
auto-warming configuration, and Solr's cache state prior to
committing, a commit can take a non-trivial amount of time. Typically,
it takes a few seconds, but it can take some number of minutes in
extreme cases"
If this is true, then how will I know when the data will be available for search, and how can I implement real-time search? Even if it takes only a few seconds, it can't be real time. Also, I don't want the ticket update operation to be slowed down (by adding the extra step of updating the Solr index).
It is also mentioned that
"there is no transaction isolation. This means that if more than one
Solr client were to submit modifications and commit them at
overlapping times, it is possible for part of one client's set of
changes to be committed before that client told Solr to commit. This
applies to rollback as well. If this is a problem for your
architecture then consider using one client process responsible for
updating Solr."
Does it mean that, due to the lack of transactional commits, Solr can mess up if multiple people update the ticket simultaneously?
Now the question before me is: Can I achieve the two using Solr? If yes, How?
Edit1:
Yeah! I came across a couple of similar questions but none has a satisfactory answer, so I am posting again. Sorry if you find it a duplicate.
The functionality that you are requesting is known as Near Realtime Search also referred to as NRT. The work on NRT is still in progress, but there have been excellent incremental improvements to this support in Solr over the last couple of years. Please refer to the following links for more details on the current (versions 1.4 - 3.5) and future (ver 4.0) support for NRT.
NRT options
Solr Near Realtime Search for versions 3.5/3.4/3.3/3.2/1.4.1
Near Real Time Search ver 3.x
Near Realtime Search Tuning (ver 1.4 - 3.x)
Solr Near Realtime Search (ver 4.0)
Benchmarking the new Solr 'Near Realtime' improvements (ver 4.0)
Solr with Ranking Algorithm (ver 1.4 - 4.0)

Resources