How do I go about migrating an existing Solr instance (4.2.1) with several cores to SolrCloud (4.6.1)? Will I have to re-index the data?
If re-indexing is feasible, I would vote for it.
Unless you are specifically interested in DocValues, you don't have to upgrade the index format. See codec history here:
https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/codecs/lucene46/package-summary.html
Make sure everything is backed up and indexes are in sync among replicas before trying.
It is unlikely (going from 4.2 to 4.6) that you need to re-index everything, but it is usually good practice to re-index your data anyway, since you will be creating newer-version Lucene indexes, which maximizes your chance of a smooth migration.
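If you do decide to re-index, a minimal sketch of one approach is to page through the stored fields of the old core with SolrJ and post them to the new SolrCloud collection. This assumes every field you need is stored; the URLs, collection name, and page size below are placeholders.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;

public class Reindex {
    public static void main(String[] args) throws Exception {
        // Old 4.2.1 core and new 4.6.1 SolrCloud collection (placeholder addresses).
        HttpSolrServer source = new HttpSolrServer("http://oldhost:8983/solr/mycore");
        CloudSolrServer target = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        target.setDefaultCollection("mycollection");

        int pageSize = 1000;
        long fetched = 0;
        long total = Long.MAX_VALUE;
        while (fetched < total) {
            // Page through the old core; this can only copy stored fields.
            SolrQuery q = new SolrQuery("*:*");
            q.setSort("id", SolrQuery.ORDER.asc); // stable order for paging
            q.setStart((int) fetched);
            q.setRows(pageSize);
            QueryResponse rsp = source.query(q);
            total = rsp.getResults().getNumFound();
            for (SolrDocument doc : rsp.getResults()) {
                doc.removeFields("_version_"); // avoid optimistic-locking clashes on the new cluster
                target.add(ClientUtils.toSolrInputDocument(doc));
            }
            fetched += rsp.getResults().size();
        }
        target.commit();
        source.shutdown();
        target.shutdown();
    }
}
```

Paging with start/rows gets slow deep into a large index (cursorMark only arrived in Solr 4.7), so for very large cores it may be worth splitting the job with filter queries.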
I'm using Solr 4.10.4 with MySQL on Windows.
Solr recommends setting the HTTP cache max-age setting to half of how often the index gets updated.
So, the question is: does Solr automatically perform full/delta imports? If so, how do I control that?
If not, I assume it's up to me to update the index regularly?
@Howie
Solr can be configured to pull data using a DataImportHandler.
You should look at this documentation for details: https://wiki.apache.org/solr/DataImportHandler
There is some documentation on scheduling the data pull, but it appears that scheduling is not a built-in feature and requires some additional changes; the section http://wiki.apache.org/solr/DataImportHandler#Scheduling discusses this. There is also a Stack Overflow question on the topic: How can I Schedule data imports in Solr.
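Since scheduling isn't built in, one common workaround is to trigger the /dataimport handler yourself on a timer. A minimal sketch using SolrJ and a ScheduledExecutorService, assuming the handler is registered at /dataimport; the core URL and interval are placeholders:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DihScheduler {
    public static void main(String[] args) {
        // Placeholder core URL; point this at the core that has DIH configured.
        final HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    // delta-import picks up rows changed since the last run, per the
                    // deltaQuery in data-config.xml; use full-import to rebuild instead.
                    SolrQuery cmd = new SolrQuery();
                    cmd.setRequestHandler("/dataimport");
                    cmd.set("command", "delta-import");
                    solr.query(cmd);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, 30, TimeUnit.MINUTES); // every 30 minutes; tune to your update rate
    }
}
```

A plain cron job hitting the same URL (http://localhost:8983/solr/mycore/dataimport?command=delta-import) works just as well if you'd rather not run a JVM process for this.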
Alternatively, you can post data to Solr as needed from your system.
Both strategies will work; it depends entirely on what is better for your system. I would recommend going through the appropriate docs on indexing (https://wiki.apache.org/solr/FrontPage#Search_and_Indexing) and then deciding which strategy works better for your application.
We are planning to setup a Solr cluster which will have around 30 machines. We have a zookeeper ensemble of 3 nodes which will be managing Solr.
We will have new production data every few days, which is going to be quite different from the data currently in prod. Since the data difference is
quite large, we are planning to use Hadoop to create the entire Solr index dump, copy these binaries to each machine, and maybe do some kind of core swap.
I am still new to Solr and was wondering if this is a good idea. I could HTTP POST my data to the prod cluster, but each update could span multiple documents.
I am not sure how this will impact the read traffic while the write happens.
Any pointers?
Thanks
I am not sure I completely understand your explanation, but it seems to me that you would like to migrate to a new SolrCloud environment with zero downtime.
First, you need to know how many shards you want, how many replicas, etc.
You need to deploy the Solr nodes, then use the Collections API to create the collection as desired (https://cwiki.apache.org/confluence/display/solr/Collections+API).
After all this you should be ready to add content to your new Solr environment.
You can use Hadoop to populate the new SolrCloud, for instance by using SolrJ. Or you can use the DataImportHandler to migrate data from another Solr (or a relational database, etc.).
It is very important how you create your SolrCloud in terms of document routing, because routing controls which shard a document is stored in.
This is why it is not a good idea to copy raw index data to a Solr node, as you may mess up the routing.
I found these explanations very useful about routing: https://lucidworks.com/blog/solr-cloud-document-routing/
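To make the routing point concrete: with the default compositeId router, the target shard is derived from a hash of the document id, and a `prefix!id` key co-locates related documents on one shard. A small sketch with SolrJ, where the ZooKeeper hosts, collection name, ids, and field names are placeholders:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RoutingExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        // With the compositeId router, the "customerA!" prefix routes all of
        // customerA's documents to the same shard; CloudSolrServer looks up the
        // cluster state in ZooKeeper and sends the update to the right leader.
        doc.addField("id", "customerA!doc123");
        doc.addField("title_t", "some document");
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```

Because the hash-to-shard mapping is fixed when the collection is created, copying raw index files onto nodes bypasses that mapping, which is exactly how documents end up on a shard where queries will never find them.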
I was wondering which scenario (or combination of scenarios) would be better for my application, from the aspects of performance, scalability, and high availability.
Here is my application:
Suppose I am going to have more than 10m documents, and the number grows every day (probably reaching more than 100m docs within a year). I want to use Solr as the tool for indexing these documents, but the problem is that I have some data fields that could change frequently (not too often, but they can change).
Scenarios:
1- Using SolrCloud as the database for all data (even the data that could change).
2- Using SolrCloud as the database for static data and an RDBMS (such as Oracle) for storing the dynamic fields.
3- Using the integration of SolrCloud and Hadoop (HDFS + MapReduce) for all data.
Best regards.
I'm not sure how SolrCloud works with DIH (you might face a situation where indexing happens only on one instance).
On the other hand, I would store the data in an RDBMS regardless, because from time to time you will need to reindex Solr to add some new functionality to the index.
At the end of the day, I would use a DB + Solr (all the fields), with either Hadoop (I have not used it yet) or some other piece of software to post the data into SolrCloud.
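For the frequently-changing fields specifically, note that Solr 4 supports atomic updates, which let you change individual field values without re-sending the whole document; it requires the updateLog to be enabled and the other fields to be stored. A minimal SolrJ sketch, with placeholder hosts, ids, and field names:

```java
import java.util.Collections;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("mycollection");

        // Atomic update: only the "price" field changes; Solr reconstructs the
        // rest of the document from its stored fields. Requires <updateLog/> in
        // solrconfig.xml and stored="true" on the fields not being updated.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc42");
        doc.addField("price", Collections.singletonMap("set", 19.99));
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```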
I'm using Solr 4.0 and have two rather large collections indexed. I would rather not have to reload the data when upgrading to 4.8, but I'm not finding much in the way of instructions on how to maintain my collections. Is there a procedure for this other than common sense?
The indexes are compatible. Upgrade your Solr, point the new version at the same index location, and everything just works.
I'm thinking about migrating from Solr 3 to SolrCloud or Elasticsearch, and was wondering whether it is possible to import data indexed with Solr 3.x into SolrCloud (Solr 4) and/or Elasticsearch?
They're all Lucene-based, but since they have different behaviors, I'm not really sure that it will work.
Has anyone ever done this? How did it go? Any related issues?
Regarding importing data from Solr to Elasticsearch, you can take a look at the Elasticsearch mock Solr plugin. It adds a Solr-like endpoint to Elasticsearch, so that you can use the indexer you've written for Solr (if you have one) to index documents in Elasticsearch.
Also, I've been working on an Elasticsearch Solr river which allows importing data from Solr to Elasticsearch through the SolrJ library. The only limitation is that it can import only the fields you configured as stored in Solr. I should be able to make it public pretty soon, just a matter of days. I'll update my answer as soon as it's available.
Regarding the upgrade of Solr from 3.x to 4.0: not a big deal. The index format has changed, but Solr will take care of upgrading the index. That happens automatically once you start Solr with your old index, but after that the index can no longer be read by a previous Solr/Lucene version. If you have a master/slave setup, you should upgrade the slaves first; otherwise the index on the master would be replicated to the slaves, which cannot read it yet.
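One caveat worth knowing: opening the old index only writes newly flushed segments in the new format; existing 3.x segments stay in the old format until they are merged away. If you want to force the whole index into the 4.0 format (for example, before a later upgrade that drops 3.x read support), an optimize will rewrite every segment. A rough sketch, with a placeholder core URL:

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ForceUpgrade {
    public static void main(String[] args) throws Exception {
        // optimize() merges all segments into one, rewriting them in the
        // current Lucene format. This is I/O-heavy on large indexes.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
        solr.optimize();
        solr.shutdown();
    }
}
```

Lucene also ships an offline IndexUpgrader tool (org.apache.lucene.index.IndexUpgrader) that accomplishes the same thing without a running Solr.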
UPDATE
Regarding the river that I mentioned: I made it public, you can download it from my github profile: https://github.com/javanna/elasticsearch-river-solr.