Building a Solr index and hot swapping it in Prod?

We are planning to set up a Solr cluster of around 30 machines, with a ZooKeeper ensemble of 3 nodes managing it.
We will get new production data every few days, and it will be quite different from what is currently in prod. Since the difference is
quite large, we are planning to use Hadoop to build the entire Solr index offline, copy the binaries to each machine, and do some kind of core swap.
I am still new to Solr and was wondering if this is a good idea. I could HTTP POST my data to the prod cluster, but each update could span multiple documents,
and I am not sure how that would impact read traffic while the writes happen.
Any pointers?
Thanks

I am not sure I completely understand your explanation, but it seems to me that you would like to migrate to a new SolrCloud environment with zero downtime.
First, you need to decide how many shards you want, how many replicas, etc.
You need to deploy the Solr nodes, then use the Collections API to create the collection as desired (https://cwiki.apache.org/confluence/display/solr/Collections+API).
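A minimal sketch of that step with SolrJ's CollectionAdminRequest (the collection name, config name, shard and replica counts, and ZooKeeper addresses below are all placeholders; the builder shown assumes SolrJ 7+):

```java
import java.util.Arrays;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Connect through the ZooKeeper ensemble rather than a single Solr node.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {
            // Create a collection with 10 shards and 3 replicas each (placeholder numbers).
            CollectionAdminRequest
                .createCollection("products", "products_config", 10, 3)
                .process(client);
        }
    }
}
```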
After all this, you should be ready to add content to your new Solr environment.
You can use Hadoop to populate the new SolrCloud cluster, for instance via SolrJ. Or you can use the Data Import Handler to migrate data from another Solr instance (or a relational database, etc.).
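If you go the SolrJ route (e.g. from a Hadoop job), batching documents rather than sending one request per document helps a lot. A hedged sketch; the batch size of 1000 is an arbitrary placeholder:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    /** Sends documents to Solr in batches instead of one request per document. */
    public static void index(SolrClient client, String collection,
                             List<Map<String, Object>> rows) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            SolrInputDocument doc = new SolrInputDocument();
            row.forEach(doc::addField);         // copy each field into the Solr document
            batch.add(doc);
            if (batch.size() == 1000) {         // flush in chunks of 1000
                client.add(collection, batch);  // CloudSolrClient routes docs to the right shard
                batch.clear();
            }
        }
        if (!batch.isEmpty()) client.add(collection, batch);
        client.commit(collection);              // or rely on autoCommit in solrconfig.xml
    }
}
```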
How you set up your SolrCloud collection is very important in terms of document routing, because routing controls which shard each document is stored in.
This is why it is not a good idea to copy raw index data onto a Solr node: you may mess up the routing.
I found these explanations of routing very useful: https://lucidworks.com/blog/solr-cloud-document-routing/
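For illustration, with Solr's default compositeId router you control co-location through the ID format itself: everything before the `!` is hashed to pick the shard. The names below are made up:

```java
import org.apache.solr.common.SolrInputDocument;

public class RoutingExample {
    public static SolrInputDocument orderDoc() {
        // With the compositeId router, Solr hashes the "acme" prefix (before '!')
        // to choose the shard, so all of this customer's documents land together
        // and can later be targeted with the _route_=acme! request parameter.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "acme!order-42");
        doc.addField("customer_name", "acme");
        return doc;
    }
}
```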

Related

Solr or Lucene in a single application

Hello, I have an already-working application for searching a database. The database holds around 50M indexed documents. Is there a way to run everything together, i.e. I don't want Solr over HTTP? Should I use Lucene directly or EmbeddedSolrServer? Or maybe you have another solution?
I currently have something like the first diagram, and I want to make this run in a single process.
And if I go with Lucene, can I reuse my existing Solr indexes?
solr-5.2.1
Tomcat v8.0
It is not recommended to run both the application and Solr in one Tomcat.
If Solr crashes, there is a chance of downtime for the application, so it is always better to run Solr independently. Embedding Solr is also not recommended:
The simplest, safest, way to use Solr is via Solr's standard HTTP interfaces. Embedding Solr is less flexible, harder to support, not as well tested, and should be reserved for special circumstances.
For reference: http://wiki.apache.org/solr/EmbeddedSolr
It depends. Solr adds quite a few features on top of Lucene; if you go with plain Lucene but want parts of the Solr feature set, you'll end up reimplementing features that you would otherwise get for free.
You can use EmbeddedSolr to run Solr inside your application and then use the EmbeddedSolrServer client in SolrJ to talk to it; the rest of your application would still use Solr as if it were a remote instance.
The problem with EmbeddedSolr is that you'll run into scalability issues as the index grows, since it is harder to scale across multiple servers and to separate concerns.
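For completeness, here is roughly what the embedded route looks like (note that EmbeddedSolrServer ships in the solr-core artifact, not plain SolrJ; the solr home path and core name below are placeholders):

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Point at an existing solr home containing the core's conf/ and data/ directories.
        Path solrHome = Paths.get("/var/solr");
        try (EmbeddedSolrServer solr = new EmbeddedSolrServer(solrHome, "mycore")) {
            QueryResponse rsp = solr.query(new SolrQuery("title:lucene"));
            rsp.getResults().forEach(d -> System.out.println(d.getFieldValue("id")));
        }
    }
}
```

Since the rest of your code only depends on the SolrClient interface, you could later swap the embedded server for a remote HttpSolrClient without touching your search logic.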

Using SolrCloud with RDBMS or using Solr as main data storage

I was wondering which scenario (or combination of them) would be better for my application, from the standpoint of performance, scalability, and high availability.
Here is my application:
Suppose I am going to have more than 10M documents, and the count grows every day (probably reaching more than 100M docs within a year). I want to use Solr to index these documents, but the problem is that I have some data fields that can change frequently (not too often, but they can change).
Scenarios:
1- Using SolrCloud as the database for all data (even the fields that can change).
2- Using SolrCloud as the database for static data and an RDBMS (such as Oracle) for the dynamic fields.
3- Using SolrCloud integrated with Hadoop (HDFS + MapReduce) for all data.
Best regards.
I'm not sure how SolrCloud works with the Data Import Handler (you might face a situation where indexing happens on only one instance).
On the other hand, I would store the data in an RDBMS, because from time to time you will need to reindex Solr to add new functionality to the index.
At the end of the day, I would use DB + Solr (with all the fields), with either Hadoop (I have not used it yet) or some other piece of software to post the data into SolrCloud.
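For the fields that change frequently, note that Solr also supports atomic updates, which let you change a single field without resending the whole document (this requires the update log enabled in solrconfig.xml and the other fields stored or in docValues). A hedged sketch; the collection and field names are placeholders:

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdate {
    /** Replaces one field of an existing document without reindexing the rest of it. */
    public static void setPrice(SolrClient client, String docId, double price) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId);
        // The {"set": value} map is Solr's atomic-update syntax for replacing a field.
        doc.addField("price", Collections.singletonMap("set", price));
        client.add("mycollection", doc);
    }
}
```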

Is Solr stable enough to use as main repository?

I'm testing Solr as my full-text search engine over 1,000,000 documents.
I also have user information that is related to the documents (as creator), and I want to store the users' hits.
Is it necessary to have a database engine to store all the data, or is Solr stable and safe enough to rely on?
Is there any risk of losing the data stored in Solr? (I know it can happen to the Solr index, and I can rebuild it, but what about the raw data?)
The only reason I want a second storage is to have another backup/version of all of my data (not for querying, ...).
Amir,
Solr is stable. If you are not convinced, have a look at the list of users here, which includes NASA, AT&T, and others:
http://wiki.apache.org/solr/PublicServers
Solr's main goal is to serve as a search engine, helping us implement search, NLP algorithms, Big Data workloads, etc.
Solr is not meant to be a main data store (although it might serve as one).
The reason for that ambiguous sentence is that, unlike a relational database, Solr can store both the original data and the index, or the index only, without the data itself.
If you store only the index, by specifying stored="false" per field in Solr's schema.xml, you get a much smaller Solr data volume and better performance, but when you query Solr you will receive back only the document IDs, and you will have to continue with your relational DB.
Of course, you can store some of the data (some document fields) and avoid storing the rest.
And of course, you should back up / replicate Solr to ensure disaster recovery, etc.
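To make the index-only pattern concrete, here is a hedged sketch of the query side, where Solr returns only IDs and the raw data is then fetched from the relational DB (the collection name is a placeholder):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

public class IndexOnlySearch {
    /** With stored="false" fields, Solr returns matching IDs; the raw data stays in the DB. */
    public static List<String> matchingIds(SolrClient client, String userQuery) throws Exception {
        SolrQuery q = new SolrQuery(userQuery);
        q.setFields("id");                       // ask only for the document ID
        List<String> ids = new ArrayList<>();
        client.query("documents", q).getResults()
              .forEach(d -> ids.add((String) d.getFieldValue("id")));
        return ids;                              // next: SELECT ... WHERE id IN (...) on the RDBMS
    }
}
```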

solr - can I use it for this?

Is Solr just for searching, i.e. not for updating or inserting data?
My site is currently MySQL-based, and while looking at Solr as an alternative, I see that you make your queries through HTTP requests.
My first thought was: how do you stop someone from making a request that updates or inserts data?
Obviously I'm not understanding Solr, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data on the file system and does not provide the features of a relational database (ACID transactions, nested entities, etc.).
The model usually followed is to use a relational database for your data management,
and replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.
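As an illustration of that model, a rough sketch of replicating rows from the relational source of truth into Solr with JDBC and SolrJ (the connection string, table, columns, and collection name are all made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DbToSolr {
    /** Copies rows from the relational database into Solr for full-text search. */
    public static void sync(SolrClient solr) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:mysql://dbhost/app", "user", "pass");
             Statement st = db.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, title, body FROM articles")) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("body", rs.getString("body"));
                solr.add("articles", doc);
            }
            solr.commit("articles");   // writes stay behind your app; end users only hit search
        }
    }
}
```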

building in support for future Solr sharding

We are building an application. Right now we have one Solr server, but we would like to design the app so that it can support multiple Solr shards in the future if we outgrow our indexing needs.
What are the key things to keep in mind when developing an application that can support multiple shards in the future?
We store the Solr URL (/solr/) in a DB, and it is used to execute queries against Solr. There is one URL for updates and one URL for searches in the DB.
If we add shards to the Solr environment at a future date, will the process for using the shards be as simple as updating the URLs in the DB, or are there other things that need to change? We are using SolrJ.
e.g. change the SolrSearchBaseURL in DB to:
https://solr2/solr/select?shards=solr1/solr,solr2/solr&indent=true&q={search_query}
And updating the SolrUpdateBaseURL in the DB to:
https://solr2/solr/ ?
Basically, what you are describing has already been implemented in SolrCloud. There, ZooKeeper maintains the state of your search cluster (which shards are in which collections, shard replicas, leader and replica nodes, and more), and it handles load on both the indexing and querying sides by using hashing.
You could, in principle, get by (at least in the beginning of your cluster's growth) with the system you have developed. But think about replication, adding load balancers, external cache servers (e.g. Varnish): in the long run you would end up implementing something like SolrCloud yourself.
Having said that, there are some caveats to hash-based indexing and, hence, searching. If you want logical partitioning of your data (say, by date), at this point there is no way to do it without custom code, though there is some work planned around this.
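For contrast with the URL-in-a-DB approach, a minimal sketch of a SolrJ client working against SolrCloud (the ZooKeeper addresses and collection name are placeholders; the builder assumes SolrJ 7+):

```java
import java.util.Arrays;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class CloudSearch {
    public static void main(String[] args) throws Exception {
        // The client reads the cluster state from ZooKeeper, so per-shard URLs are
        // never hard-coded; adding shards or replicas needs no application change.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("myapp");
            System.out.println(client.query(new SolrQuery("*:*")).getResults().getNumFound());
        }
    }
}
```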
