Migrate Solr index to DataStax (Enterprise Search)?

I have a Solr/Lucene 4.x index with ~1bn rows (180 GB) and want to migrate it into the DataStax variant of Solr. I couldn't find any HOWTO or migration guideline. Will simply copying the index dir into the DataStax solr.data// directory do the trick, plus posting the solrconfig.xml and schema.xml?
br
accid

The first question is how much of your data is "stored", since only stored fields can be read back out of the index. You need to export your existing Solr data to, say, CSV files, and then import that data into DataStax Enterprise.
You cannot directly move a Lucene/Solr index into DataStax Enterprise; for one thing, DSE stores some additional attributes for each Solr document.
The whole point of DSE is that Cassandra becomes your system of record, maintaining the raw data, while DSE/Solr simply indexes the data to support rich queries: DSE uses Cassandra to store the data and Solr to index it.

You can use something like solr2solr (https://github.com/dbashford/solr2solr) to copy your data from one server to the other, but you cannot reuse your index files.
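For the export step, a rough SolrJ sketch (SolrJ 4.x API; cursorMark paging requires Solr 4.7+) could look like the following. The host, core name, and field names are assumptions, and only stored fields come back:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class ExportStoredFields {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1000);
            q.setSort("id", SolrQuery.ORDER.asc); // cursorMark needs a sort on the unique key
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // emit stored fields as CSV; "id" and "body" are assumed field names
                    System.out.println(doc.getFieldValue("id") + "," + doc.getFieldValue("body"));
                }
                String next = rsp.getNextCursorMark();
                if (cursor.equals(next)) break; // cursor unchanged: no more results
                cursor = next;
            }
        }
    }

The resulting CSV can then be loaded into a Cassandra table (for example with cqlsh's COPY command) for DSE to index.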

How do partial updates work in DataStax Solr

Cassandra is a column-family datastore, which means that each column has its own timestamp/version and it is possible to update a specific column of a Cassandra row; this is often referred to as a partial update.
I am trying to implement a pipeline that makes the data in a Cassandra column family also searchable in a search engine like Solr or Elasticsearch.
I know DataStax Enterprise provides this Cassandra/Solr integration out of the box.
Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.
How are partial updates made in Cassandra written to Solr?
In other words, do partial updates made in Cassandra get written to Solr without the updates stepping on each other?
I can see where you might be coming from here, but it's also important for anyone reading this to know that the following statement is not correct:
"Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra."
To add some colour to this, let me try to explain. When an update is written to Cassandra, regardless of the content, the new mutation goes through the write path as outlined here:
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlHowDataWritten.html
DSE Search uses a "secondary index hook" on the table: incoming writes are pushed into an indexing queue, converted into Solr documents, and stored in the Lucene index. This page gives a high-level overview of the architecture:
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/searchArchitecture.html
This blog post is a bit old now but still outlines the underlying concepts:
http://www.datastax.com/dev/blog/datastax-enterprise-cassandra-with-solr-integration-details
So any update, whether to a single column or an entire row, is indexed at the same time: the row is re-indexed as one Solr document, so partial updates do not step on each other.
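To make that concrete, here is a minimal sketch with the DataStax Java driver (3.x API); the keyspace, table, and column names are assumptions:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class PartialUpdate {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // a partial update: only the email cell gets a new value/timestamp in Cassandra,
                // but DSE Search re-indexes the whole Solr document for primary key 42
                session.execute("UPDATE ks.users SET email = 'new@example.com' WHERE id = 42");
            }
        }
    }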

Solr reindex after schema change

I need to change the datatype of one field from "int" to "long" because some of the values exceed the upper limit of a 32-bit signed integer. I might also need to add and drop some fields in the future. Will my index be updated automatically after uploading the new schema.xml? If not, how should I go about re-indexing?
The Solr FAQ suggests removing the data via an update command that deletes everything. However, my team is using Cassandra as the primary database, and Cassandra and Solr are tightly coupled in DSE (i.e. whatever you do to your Solr index directly affects the Cassandra data). In our case, deleting the data in Solr results in the deletion of the underlying Cassandra rows. What is the best approach to deal with this? The Cassandra table (and Solr core) contains more than 2 billion rows, so creating a duplicate core and swapping the two afterwards is not practical.
Note: We are using DataStax Enterprise 4.0. I'm not sure whether the behavior described above also holds for open-source Solr.
You need to reindex the Solr data. Unfortunately, since you are changing the type of a field, you must first delete the old Solr index data and then reindex from the Cassandra data.
See page 109 of the DSE 4.0 PDF documentation for instructions on a full reindex from the Solr Admin UI, or page 126 for a Solr reload and full reindex from the command line: a curl command using the reindex=true and deleteAll=true parameters, roughly as sketched below.
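As a sketch, assuming DSE's usual keyspace.table core naming (here a hypothetical ks.tbl), the command-line variant looks roughly like:

    curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=ks.tbl&reindex=true&deleteAll=true"

With deleteAll=true only the old index data is dropped and the core is rebuilt from the Cassandra data, so the Cassandra rows themselves are untouched.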

Can I crawl with Nutch, store in Cassandra, index using Solr?

I'm developing a keyword-analytics app. I want to crawl the web using Nutch, index the output using Solr, and finally store the data in Cassandra.
Later I should be able to run search queries and analytics on Solr, and it must fetch the relevant data from Cassandra.
Is this setup possible? If yes, is there anything that I should keep in mind?
If you use DataStax's Cassandra (DataStax Enterprise), indexing Cassandra tables into Solr is much easier; see http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-solr
I think you can, but I am not a Cassandra user, so I have never tried it.
You will have to configure gora.properties (http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1/conf/gora.properties) to enable Cassandra; the Nutch 2 Tutorial (http://wiki.apache.org/nutch/Nutch2Tutorial) shows the same step for HBase. A sketch for Cassandra follows this answer.
To see where the data is mapped in Cassandra, take a look at the mappings in http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1/conf/gora-cassandra-mapping.xml
Nutch will store the data in Cassandra. About Solr I don't know (I have never used Solr).
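As a hedged sketch (Nutch 2.2.1 / Gora 0.3 era; the server address is an assumption), the gora.properties change would look like:

    # use the Cassandra-backed Gora store instead of the HBase one
    gora.datastore.default=org.apache.gora.cassandra.store.CassandraStore
    # Cassandra contact point (Thrift port in this era)
    gora.cassandrastore.servers=localhost:9160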
Programmatically it's possible: you can get the results from the Solr indexes. Keep the same unique id in both Cassandra and Solr, fetch matching ids from Solr, and then fetch the full records from Cassandra, as in the sketch below.
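A rough Java sketch of that pattern, using SolrJ and the DataStax Java driver (host, core, keyspace, table, and column names are all assumptions):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class SearchThenFetch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/pages");
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("crawl")) {
                // search in Solr, which returns only the shared unique key
                for (SolrDocument doc : solr.query(new SolrQuery("some keyword")).getResults()) {
                    String id = (String) doc.getFieldValue("id");
                    // fetch the full record from Cassandra by that key
                    Row row = session.execute("SELECT * FROM pages WHERE id = ?", id).one();
                    System.out.println(row);
                }
            }
        }
    }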

Getting raw text files from a Solr snapshot?

I have a Solr database snapshot. The database is an archive of published blog posts (plus a bunch of metadata for each post). The snapshot is tens of thousands of posts.
I want to run some machine learning algorithms and topic modeling on the posts. So I don't need the database per se; I just want to get the raw text of the posts and the metadata in some simple form. Can anyone tell me how to open or extract that info without actually installing Solr?
I assume that by "Solr database snapshot" you mean a Solr index.
A Solr index is basically a Lucene index, so you can use the Lucene APIs to read the index and extract the data from the stored fields.
This does not require Solr to be installed; a minimal sketch follows.
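For example, a minimal dump program against a Lucene 4.x index (the index path and field names are assumptions; note that FSDirectory.open takes a File in 4.x but a Path in 5.x+):

    import java.io.File;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Bits;

    public class DumpStoredFields {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(new File("/path/to/snapshot/index")))) {
                Bits liveDocs = MultiFields.getLiveDocs(reader); // null if the index has no deletions
                for (int i = 0; i < reader.maxDoc(); i++) {
                    if (liveDocs != null && !liveDocs.get(i)) continue; // skip deleted docs
                    Document doc = reader.document(i);
                    // only stored fields can be read back; "id" and "text" are assumed names
                    System.out.println(doc.get("id") + "\t" + doc.get("text"));
                }
            }
        }
    }

Only fields marked stored="true" in the schema are recoverable this way; indexed-only fields would have to be reconstructed from term vectors, if present.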

Migrate data from Solr 3

I'm thinking about migrating from Solr 3 to SolrCloud or Elasticsearch, and I was wondering whether it is possible to import data indexed with Solr 3.x into SolrCloud (Solr 4) and/or Elasticsearch.
They're all Lucene based, but since they have different behaviors I'm not really sure it will work.
Has anyone ever done this? How did it go? Were there any related issues?
Regarding importing data from Solr to Elasticsearch, you can take a look at the Elasticsearch mock solr plugin. It adds a Solr-like endpoint to Elasticsearch, so you can use the indexer you've written for Solr (if you have one) to index documents in Elasticsearch.
Also, I've been working on an Elasticsearch Solr river that imports data from Solr into Elasticsearch through the SolrJ library. The only limitation is that it can import only the fields you configured as stored in Solr. I should be able to make it public pretty soon, just a matter of days; I'll update my answer as soon as it's available.
Regarding the upgrade of Solr from 3.x to 4.0: not a big deal. The index format has changed, but Solr takes care of upgrading the index automatically once you start it against your old index. After that, however, the index can no longer be read by a previous Solr/Lucene version. If you have a master/slave setup, you should upgrade the slaves first; otherwise the index on the master would be replicated to slaves that cannot read it yet.
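If you prefer to upgrade the index format explicitly before the switch rather than relying on the in-place upgrade at first start, Lucene ships an IndexUpgrader tool. A hedged sketch of the invocation, with the jar name and index path as assumptions:

    # rewrites the index in the current Lucene 4.x format; back the index up first
    java -cp lucene-core-4.0.0.jar org.apache.lucene.index.IndexUpgrader -verbose /path/to/index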
UPDATE
Regarding the river I mentioned: I made it public; you can download it from my GitHub profile: https://github.com/javanna/elasticsearch-river-solr
