Few questions about Solr. Transactions and Realtime search - solr

I have a heldesk application in PHP/MySQL. I want to implement realtime Full text search and I have shortlisted Solr. MySQL database will store all the data and data required for search will be imported for building Solr index. All Search requests will be handled by Solr.
What I want is
Real time search. The moment someone updates a ticket, it should be available for search.
If multiple people update the ticket simultaneously, Solr should be able to handle the commits
As per my understanding of Solr, this is how I think the system will work. A user updates a ticket -> corrresponding database records modified -> a request is sent to Solr server to modify corresponding document in index.
I have read a book on Solr and below questions are troubling me.
The book mentions that
"commits are slow in Solr. Depending on the index size, Solr's
auto-warming configuration, and Solr's cache state prior to
committing, a commit can take a non-trivial amount of time. Typically,
it takes a few seconds, but it can take some number of minutes in
extreme cases"
If this is true then how will I know when the data will be availbale for search and how can I implemnt realtime search? Even if its taking a few seconds, it can't be real time. Also I don't want the ticket update operation to be slowed down (by adding extra step of updating Solr index)
It is also mentioned that
"there is no transaction isolation. This means that if more than one
Solr client were to submit modifications and commit them at
overlapping times, it is possible for part of one client's set of
changes to be committed before that client told Solr to commit. This
applies to rollback as well. If this is a problem for your
architecture then consider using one client process responsible for
updating Solr."
Doe it mean that that due to lack of transactional commits, Solr can mess up if multiple people update the ticket simultaneously?
Now the question before me is: Can I achieve the two using Solr? If yes, How?
Edit1:
Yeah! I came acorss a couple of similar questions but none has a staisfactory answer. So posting again. Sorry If you find it duplicate.

The functionality that you are requesting is known as Near Realtime Search also referred to as NRT. The work on NRT is still in progress, but there have been excellent incremental improvements to this support in Solr over the last couple of years. Please refer to the following links for more details on the current (versions 1.4 - 3.5) and future (ver 4.0) support for NRT.
NRT options
Solr Near Realtime Search for versions 3.5/3.4/3.3/3.2/1.4.1
Near Real Time Search ver 3.x
Near Realtime Search Tuning (ver 1.4 - 3.x)
Solr Near Realtime Search (ver 4.0)
Benchmarking the new Solr 'Near Realtime' improvements (ver 4.0)
Solr with Ranking Algorithm (ver 1.4 - 4.0)

Related

What is the best approach to guarantee commits in Apache SOLR?

Question: How can I get "guarantee commits" with Apache SOLR where persisting data to disk and visibility are both equally important ?
Background: We have a website which requires high end search functionality for machine learning and also requires guaranteed commit for financial transaction. We just want to SOLR as our only datastore to keep things simple and do not want to use another database on the side.
I can't seem to find any answer to this question. The simplest solution for a financial transaction seems to be to periodically query SOLR for the record after it has been persisted but this can have longer wait time or is there a better solution ?
Can anyone please suggest a solution for achieving "guaranteed commits" with SOLR ?
As you were told on the mailing list, Solr does not have transactions. If you index from a dozen clients, and a commit happens from somewhere (either autoSoftCommit, commitWithin on the udpate request, or an explicit commit from one of those dozen clients), all of the documents indexed by those dozen clients will be visible to all searchers.
With a transactional database, each of the dozen clients that is sending updates would have to issue a commit, which would only make the changes made by that specific client visible.
Solr usually does not make any guarantees regarding commits. If you issue ten commits in parallel, that will most likely exceed the maxWarmingSearchers configuration, which is typically set to 2. Most of those ten commits wouldn't actually create a new searcher, which is what makes new documents visible.
If you do manual commits in such a way that you are never exceeding maxWarmingSearchers, then when that commit finishes without error, you can take that as a sign that all changes are now visible.
The answer is that Solr is not designed to be the primary data store. Its data structures and indexing/retrieval designed for other use cases, even if it all seems like CRUD on the surface. You should have your data persisted somewhere else and then indexed in Solr - in the way that makes it easy to find - later. Same with Elasticsearch and other search-oriented software.
If you absolutely have to combine those things, look at the commercial products that included Solr on top of Cassandra or other similar databases.
Solr provides two type of commits to persist the data in solr.
Soft Commit: The soft commits persists the into Solr data structure. Solr guarantees visibility of the document after every soft commit. It does not actually stores the data into disk. So if the Solr instance goes down then this information can not be recovered.
Hard Commit: Every time application index the data to solr, it can perform the hard commit of the data. The hard commit persists the data into disk and it recoverable even the instance goes down. The disadvantage of frequent hard commit is, solr has to perform segment merges frequently, which is CPU intensive.
You can configure the autoCommit option in solrconfig.xml according to your needs.
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
There are merits and demerits of each approach. You can find more information on Apache Wiki Commits and an article from LucidWorks on commits in CloudSolr Understanding Transaction Logs, Soft Commit and Commit in SolrCloud

When to definitely use SOLR over Lucene in a Sitecore 7 build?

My client does not have the budget to setup and maintain a SOLR server to use in their production environment. If I understand the Sitecore 7 Content Search API correctly, it is not a big deal to configure things to use Lucene instead. For the most part the configuration will be similar and the code will be the same, and a SOLR server can be swapped in later.
The site build has
faceted search page
listing components on landing and on other pages that will leverage the Content Search API
buckets with custom facets
The site has around 5,000 pages and components not including media library items. Are there any concerns about simply using Lucene?
The main question is, when, during your architecture or design phase do you know that you should definitely choose SOLR over Lucene? What are the major signs that lead you recommend that?
I think if you are dealing with a customer on a limited budget then Lucene will work perfectly well and perform excellently for the scale of things you are doing. All the things you mention are fully supported by the implementation in Lucene.
In a Sitecore scenario I would begin to consider Solr if:
You need to index a large number of items - id say 50 thousand upwards - Lucene is happy with these sorts of number but Solr has improved query caching and is designed for these large numbers of items.
The resilience of the search tier is of maximum business importance (ie the site is purely driven by search) - Solr provides a more robust replication/sharding and failover system with SolrCloud.
Re-purposing of the search tier in other application is important (non Sitecore) - Solr is a search application so can be accessed over HTTP with XML/JSON etc which makes integration with external systems easier.
You need some specific additional feature of Solr that Lucene doesn't have.
.. but as you say if you want swap out Lucene for Solr at a later phase, we have worked hard to make sure that the process as simple as possible. Worth noting a few points here:
While your LINQ queries will stay the same your configuration will be slightly different and will need attention to port across.
The understanding of how Solr works as an application and how the schema works is important to know but there are some great books and a wealth of knowledge out there.
Solr has slightly different (newer) analyzers and scoring mechanisms so your search results may be slightly different (sometimes customers can get alarmed by this :P)
.. but I think these are things you can build up to over time and assess with the customer. Im sure there are more points here and others can chime in if they think of them. Hope this helps :)
Stephen pretty much covered the question - but I just wanted to add another scenario. You need to take into account the server setup in your production environment. If you are going to be using multiple content delivery servers behind a load balancer I would consider Solr from the start, as trying to make sure that the Lucene index on each delivery server is synchronized 100% of the time can be painful.
I would recommend planning an escape plan from Lucene as early as you start thinking about multiple CDs and here is why:
A) Each server has to maintain its own index copy:
Any unexpected restart might cause a few documents not to be added to the index on the one box, making indexes different from server to server.
That would lead to same page showing differently by CDs
Each server must perform index updates - use CPU & disk space; response rate drops after publish operation is over =/
According to security guide, CDs should have Sitecore Shell UI removed, so index cannot be easily rebuilt from Control Panel =\
B) Lucene is not designed for large volumes of content. Each search operation does roughly following:
Create an array with size equal to total number of documents in the index
If document matches search, set flag in the array
While this works like a charm for low sized indexes (~10K elements), huge performance degradation is produced once the volume of content grows.
The allocated array ends in Large Object Heap that is not compacted by default, thereby gets fragmented fast.
Scenario:
Perform search for 100K documents -> huge array created in memory
Perform one more search in another thread -> one more huge array created
Update index -> now 100K + 10 documents
The first operation was completed; LOH has space for 100K array
Seach triggered again -> 100K+10 array is to be created; freed memory 'hole' is not large enough, so more RAM is requested.
w3wp.exe process keeps on consuming more and more RAM
This is the common case for Analytics Aggregation as an index is being populated by multiple threads at once.
You'll see a lot of RAM used after a while on the processing instance.
C) Last Lucene.NET release was done 5 years ago.
Whereas SOLR is actively being developed.
The sooner you'll make the switch to SOLR, the easier it would be.

Solr appears to block update requests while committing

We're running a master-slave setup with Solr 3.6 using the following auto-commit options:
maxDocs: 500000
maxTime: 600000
We have approx 5 million documents in our index which takes up approx 550GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15GB). We don't have a particularly high write throughput - about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25mins during which time we're unable to add any new documents to Solr.
One of the answers in the following question suggests using 2 cores and swapping them once its fully updated. However this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
According to bounty reason and the problem mentioned at question here is a solution from Solr:
Solr has a capability that is called as SolrCloud beginning with 4.x version of Solr. Instead of previous master/slave architecture there are leaders and replicas. Leaders are responsible for indexing documents and replicas answers queries. System is managed by Zookeeper. If a leader goes down one of its replicas are selected as new leader.
All in all if you want to divide you indexing process that is OK with SolrCloud by automatically because there exists one leader for each shard and they are responsible for indexing for their shard's documents. When you send a query into the system there will be some Solr nodes (of course if there are Solr nodes more than shard count) that is not responsible for indexing however ready to answer the query. When you add more replica, you will get faster query result (but it will cause more inbound network traffic when indexing etc.)
For those who is facing a similar problem, the cause of my problem was i had too many fields in the document, i used automatic fields *_t, and the number of fields grows pretty fast, and when that reach a certain number, it just hogs solr and commit would take forever.
Secondarily, I took some effort to do a profiling, it end up most of the time is consumed by string.intern() function call, it seems the number of fields in the document matters, when that number goes up, the string.intern() seems getting slower.
The solr4 source appears no longer using the string.intern() anymore. But large number of fields still kills the performance quite easily.

Sunspot with Solr 3.5. Manually updating indexes for real time search

Im working with Rails 3 and Sunspot solr 3.5. My application uses Solr to index user generated content and makes it searchable for other users. The goal is to allow users to search this data as soon as possible from the time the user uploaded it. I don't know if this qualifies as Real time search.
My application has two models
Posts
PostItems
I index posts by including data from post items so that a when a user searches based on certain description provided in a post_item record the corresponding post object is made available in the search.
Users frequently update post_items so every time a new post_item is added I need to reindex the corresponding post object so that the new post_item will be available during search.
So at the moment whenever I receive a new post_item object I run
post_item.post.solr_index! #
which according to this documentation instantly updates the index and commits. This works but is this the right way to handle indexing in this scenario? I read here that calling index while searching may break solr. Also frequent manual index calls are not the way to go.
Any suggestions on the right way to do this. Are there alternatives other than switching to ElasticSearch
try to use this gem https://github.com/bdurand/sunspot_index_queue
you will than be able to batch reindex, let's say, every minute, and it definitely will not brake an index
If you are just starting out and have the luxury to choose between Solr and ElasticSearch, go with ElasticSearch.
We use Solr in production and have run into many weird issues as the index and search volume grew. The conclusion was Solr was built/optimzed for indexing huge documents(word/pdf content) and in large numbers(billions?) but updating the index once a day or a couple of days when nobody is searching.
It was a wrong choice for consumer Rails application where documents are small, small in numbers( in millions) updates are random and continuous and the search needs to be somewhat real time( a delay of 5-10 sec is fine).
Some of the tricks we applied to tune the server.
removed all commits (i.e., !) from rails code,
use Solr auto-commit every 5/20 seconds,
have master/slave configuration,
run index optimization(on Master) every 1 hour
and more.
and we still see high CPU usage on slaves when the commit triggers. As a result some searches take a long time(> 60 seconds at times).
Also I doubt if the batching indexing sunspot_index_queue gem can remedy the high CPU issue.

Indexing SQLServer data with SOLR

What is the best way of syncing the database change with solr incremental indexing? What is the best way of getting MSSQL server data to be indexed by solr?
Thank so much in addvance
Solr works with plugins. you will need to create your own data importer plugin that will be called in a periodically manner (based on notifications, time period that passed, etc). You will point your solr configuration to the class that will be called upon update.
Regarding your second Q, I used a text file, that holds a time date description. Each time Solr was started it looked at said file and retrieved from the DB the relevant data that was changed in the DB from that point on (the file is updated when the index is updated).
I would suggest reading a good solr/lucene book/guide such as lucidworks-solr-refguide-1.4 before getting started, so you will be sure that your architectural solution is correct

Resources