I'm thinking about migrating from Solr 3 to SolrCloud or Elasticsearch and was wondering if it is possible to import data indexed with Solr 3.x into SolrCloud (Solr 4) and/or Elasticsearch?
They're all Lucene based, but since they have different behaviors I'm not really sure that it will work.
Has anyone ever done this? How did it go? Any related issues?
Regarding importing data from Solr to Elasticsearch, you can take a look at the Elasticsearch mock Solr plugin. It adds a new Solr-like endpoint to Elasticsearch, so that you can use the indexer you've written for Solr (if you have one) to index documents in Elasticsearch.
Also, I've been working on an Elasticsearch Solr river which allows importing data from Solr to Elasticsearch through the SolrJ library. The only limitation is that it can only import the fields that you configured as stored in Solr. I should be able to make it public pretty soon, just a matter of days. I'll update my answer as soon as it's available.
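To give an idea of what such an import does on the Solr side, here is a minimal SolrJ sketch (assuming SolrJ 4.x's HttpSolrServer and a source Solr at localhost; the push to Elasticsearch is left as a comment because the client setup depends on your ES version). Only fields marked stored="true" in schema.xml come back in the results, which is where the limitation above comes from:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrExportSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical source Solr instance; adjust the URL to your setup.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        int start = 0;
        int rows = 100;
        while (true) {
            SolrQuery query = new SolrQuery("*:*").setStart(start).setRows(rows);
            QueryResponse response = solr.query(query);
            if (response.getResults().isEmpty()) {
                break;
            }
            for (SolrDocument doc : response.getResults()) {
                // Only fields marked stored="true" in schema.xml are present here;
                // this is where you would push the document to Elasticsearch.
                System.out.println(doc.getFieldValueMap());
            }
            start += rows;
        }
    }
}
```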
Regarding the upgrade of Solr from 3.x to 4.0: it's not a big deal. The index format has changed, but Solr takes care of upgrading the index. That happens automatically once you start Solr with your old index. After that, however, the index can no longer be read by a previous Solr/Lucene version. If you have a master/slave setup you should upgrade the slaves first, otherwise the index on the master would be replicated to the slaves, which cannot read it yet.
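If you ever want to rewrite an old index into the new format without relying on Solr's automatic upgrade, Lucene also ships an IndexUpgrader tool you can point at the index directory. A rough sketch, assuming Lucene 4.x on the classpath (back up the index first; the exact constructor arguments differ slightly between Lucene versions, so treat this as illustrative):

```java
import java.io.File;

import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
    public static void main(String[] args) throws Exception {
        // Path to the old index directory (hypothetical); make a backup before upgrading.
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
        // Rewrites every segment into the current Lucene index format.
        new IndexUpgrader(dir, Version.LUCENE_40, null, false).upgrade();
    }
}
```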
UPDATE
Regarding the river that I mentioned: I made it public, you can download it from my github profile: https://github.com/javanna/elasticsearch-river-solr.
Related
We are planning to setup a Solr cluster which will have around 30 machines. We have a zookeeper ensemble of 3 nodes which will be managing Solr.
We will have new production data every few days, which is going to be quite different from what is currently in production. Since the data difference is quite large, we are planning to use Hadoop to create the entire Solr index dump, copy these binaries to each machine, and maybe do some kind of core swap.
I am still new to Solr and was wondering if this is a good idea. I could HTTP POST my data to the prod cluster, but each update could span multiple documents.
I am not sure how this will impact the read traffic while the write happens.
Any pointers?
Thanks
I am not sure I completely understand your explanation, but it seems to me that you would like to migrate to a new SolrCloud environment with zero downtime.
First, you need to know how many shards you want, how many replicas, etc.
You need to deploy the Solr nodes, then use the Collections admin API to create the collection as desired (https://cwiki.apache.org/confluence/display/solr/Collections+API), for example as sketched below.
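Creating the collection usually boils down to a single call to that API. A minimal sketch, assuming a hypothetical collection and config set name and a Solr node listening on localhost:8983:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Hypothetical collection name, config set, shard and replica counts; adjust to your cluster.
        String url = "http://localhost:8983/solr/admin/collections"
                + "?action=CREATE&name=mycollection&numShards=3&replicationFactor=2"
                + "&collection.configName=myconfig";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // Solr responds with the status of the CREATE call
            }
        }
    }
}
```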
After all this you should be ready to add content to your new Solr environment.
You can use Hadoop to populate the new SolrCloud, for instance by using SolrJ as sketched below. Or you can use the Data Import Handler to migrate data from another Solr (or a relational database, etc.).
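A minimal SolrJ sketch of what each indexing task (e.g. inside a Hadoop reducer) might do, assuming SolrJ 4.x's CloudSolrServer and a hypothetical ZooKeeper address and collection name:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexIntoSolrCloud {
    public static void main(String[] args) throws Exception {
        // CloudSolrServer talks to ZooKeeper and routes documents to the right shard.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("mycollection"); // hypothetical collection name

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Hello SolrCloud");
        server.add(doc);

        server.commit();   // in bulk loads you would commit far less often
        server.shutdown();
    }
}
```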
It is very important how you create your SolrCloud in terms of document routing, because routing controls which shard a document will be stored in.
This is why it is not a good idea to copy raw index data to a Solr node: you may mess up the routing.
I found these explanations very useful about routing: https://lucidworks.com/blog/solr-cloud-document-routing/
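To illustrate routing: with the default compositeId router you can co-locate documents on a shard by prefixing the unique key. A small sketch (the field names and prefix are hypothetical):

```java
import org.apache.solr.common.SolrInputDocument;

public class RoutingExample {
    public static void main(String[] args) {
        SolrInputDocument doc = new SolrInputDocument();
        // With the compositeId router, everything before the '!' is hashed to pick
        // the shard, so all "customer42" documents land on the same shard.
        doc.addField("id", "customer42!order-1001");
        doc.addField("customer", "customer42");
        // ... index via CloudSolrServer as above; copying raw index files around
        // would bypass this hashing and break routing.
    }
}
```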
How do I go about migrating an existing Solr instance (4.2.1) with several cores to SolrCloud (4.6.1)? Will I have to re-index the data?
If re-indexing is feasible, I would vote for it.
Unless you are specifically interested in DocValues, you don't have to upgrade the index format. See codec history here:
https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/codecs/lucene46/package-summary.html
Make sure everything is backed up and indexes are in sync among replicas before trying.
It's unlikely (going from 4.2 to 4.6) that you need to re-index everything, but it's usually good practice to re-index your data, since you will be creating newer-version Lucene indexes, which maximizes your chance of a smooth migration.
I have a situation: I want to run my demo web application, built with EJB/Hibernate, in a JBoss cluster for high availability, and in my application we use Apache Solr (and one part uses Lucene as well) for text-based search.
I got the clustering information from the official JBoss website, but I am not able to find any information about how to sync up Solr or Lucene indexes and their data repositories.
I am sure that many people must have done clustering with Lucene or Solr; could anyone please point me to the right source on how to synchronize Solr or Lucene directories across multiple JBoss server instances?
I have an embedded Solr deployment, so, as Jayendra suggested below, Solr replication over HTTP is not possible for me. Is there any other way to do Solr replication with a repeater configuration (i.e. all my nodes acting as both master and slave)?
If you want to copy/sync data repositories for Solr, you can check out Solr Replication, which will allow you to sync data repositories across different Solr instances on different machines.
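For context, HTTP-based replication is normally configured via the ReplicationHandler in solrconfig.xml; once it is in place, a slave can also be told to pull the latest index on demand. A hedged sketch of triggering that over HTTP (core name and hosts are hypothetical):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TriggerReplication {
    public static void main(String[] args) throws Exception {
        // Ask the slave's ReplicationHandler to fetch the latest index from the master.
        String url = "http://slave-host:8983/solr/core1/replication"
                + "?command=fetchindex"
                + "&masterUrl=http://master-host:8983/solr/core1/replication";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        System.out.println("HTTP status: " + conn.getResponseCode());
    }
}
```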
The clustering technology of JBoss and WildFly is based on the Infinispan OSS project.
Infinispan provides a highly efficient distributed storage model, and the project includes an Apache Lucene index storage layer:
http://infinispan.org/docs/dev/user_guide/user_guide.html#integrations:lucene-directory
It should be easy to replace the Solr Directory with this implementation.
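As a rough illustration, the Infinispan Lucene directory is obtained through a builder and then handed to Lucene/Solr in place of a filesystem directory. A sketch under the assumption that the infinispan-lucene-directory module and a configured cache are available (the cache names, index name, and config file are hypothetical, and the exact builder API may differ between Infinispan versions):

```java
import org.apache.lucene.store.Directory;
import org.infinispan.Cache;
import org.infinispan.lucene.directory.DirectoryBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class InfinispanDirectoryExample {
    public static void main(String[] args) throws Exception {
        // Caches come from an Infinispan configuration file; "infinispan.xml" is hypothetical.
        DefaultCacheManager cacheManager = new DefaultCacheManager("infinispan.xml");
        Cache<?, ?> metadata = cacheManager.getCache("lucene-metadata");
        Cache<?, ?> chunks = cacheManager.getCache("lucene-chunks");
        Cache<?, ?> locks = cacheManager.getCache("lucene-locks");

        // Build a Lucene Directory backed by the Infinispan data grid.
        Directory directory = DirectoryBuilder
                .newDirectoryInstance(metadata, chunks, locks, "my-index")
                .create();

        // Hand this Directory to your Lucene IndexWriter / Solr directory factory.
        System.out.println(directory);
    }
}
```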
I found that Nutch 1.4 only contains one Indexer/solrindex. Is Solr the only way for Nutch to index the crawled data? If not, what are the other ways?
I'm also wondering why Nutch 1.4 uses Solr to index the data. Why not do it itself? Doesn't it increase the coupling of these two projects?
Solr uses Lucene internally. In 2005, Nutch became a subproject of Lucene. Historically, Nutch used Lucene indexes and was a full-fledged search engine (this was the case until version 1.0). It had crawling capability and even support for indexing data and a browser UI to query the indexed data (similar to a Google search).
As the initial design was built around Lucene (another Apache project which earned a lot of kudos at the time and still rocks), the Nutch code was NOT changed or made generic so that other indexing frameworks could be used. If you want to plug your own indexing framework into it, it takes a lot of effort.
In recent versions (Nutch 1.3 and later), the Nutch dev team realized that it was difficult to keep up with the work involved in indexing, given the changing needs and the expertise required. It was better to delegate the responsibility of indexing to Solr (a Lucene-based indexing framework) and have the Nutch developers focus only on the crawling part. So Nutch is no longer a full-fledged search engine, but it is a full-fledged web crawler.
Hope this answers your query. You can browse the Nutch news for more info.
Latest happenings:
There are ongoing efforts to create a generic library for crawlers (under Apache Commons). This project, commons-crawler, will contain all the functions required for a web crawler and can be used for creating crawlers. Future Nutch versions will use this library as a dependency.
Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?
I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation.
Here is my proposed architecture:
Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLinq) to prepare those documents for indexing
Index those documents into a Lucene.NET / Lucene (java) compatible file format
Deploy that file to all my Solr instances
Activate that replicated index
If that above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.
Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCX, DOC, OLE, etc.), index them, copy the index to my Solr instances, and somehow "activate" them so they are searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr.
The reason I bring .NET into the picture is because we are mostly a .NET shop. The only Unix/Java piece we will have is Solr, plus a front end that leverages the REST interface via SolrNet.
Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?
What should I avoid doing so that I don't lose faceted search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it's saying.
Generally, what you've described is almost exactly how Nutch works. Nutch is a crawling, indexing, index merging and query answering toolkit that's based on Hadoop core.
You shouldn't mix up Cloudera, Hadoop, Nutch and Lucene. You'll most likely end up using all of them:
Nutch is the name of the indexing / query answering (like Solr) machinery.
Nutch itself runs using a Hadoop cluster (which heavily uses its own distributed file system, HDFS).
Nutch uses the Lucene index format.
Nutch includes a query answering frontend, which you can use, or you can attach a Solr frontend and use Lucene indexes from there.
Finally, the Cloudera Hadoop Distribution (or CDH) is just a Hadoop distribution with several dozen patches applied to it, to make it more stable and to backport some useful features from development branches. Yeah, you'd most likely want to use it, unless you have a reason not to (for example, if you want a bleeding-edge Hadoop 0.22 trunk).
Generally, if you're just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index various crazy types of documents, including MS Word documents, PDFs, etc.
I personally don't see much point in using .NET technologies here, but if you feel comfortable with it, you can build the front ends in .NET. However, working with Unix technologies might feel fairly awkward for a Windows-centric team, so if I were managing such a project, I'd consider alternatives, especially if your crawling and indexing task is limited in scope (i.e. you don't want to crawl the whole internet for some purpose).
Have you looked at Lucandra (https://github.com/tjake/Lucandra), a Cassandra-based back end for Lucene/Solr? You can use Hadoop to populate the Cassandra store with the index of your data.