I need to change the datatype of one field from "int" to "long" because some of the values exceed the upper limit of a 32-bit signed integer. I might also need to add and drop some fields in the future. Will my index be updated automatically after uploading the new schema.xml? If not, how should I go about re-indexing?
The Solr FAQ suggests that I remove the data via an update command that deletes all data. However, my team is using Cassandra as the primary database, and it seems that Cassandra and Solr are tightly coupled (i.e. whatever you do to your Solr index directly affects the Cassandra data). In our case, deleting the data in Solr results in the deletion of the underlying Cassandra rows. What is the best approach to deal with this? The Cassandra table (and Solr core) contains more than 2 billion rows, so creating a duplicate core and swapping the two afterwards is not practical.
Note: We are using DataStax Enterprise 4.0. I'm not sure if the behavior I described above also holds for open-source Solr.
You need to reindex the Solr data. Unfortunately, since you are changing the type of a field, you need to delete the old index data for Solr first, and then reindex from the Cassandra data.
See page 109 of the DSE 4.0 documentation PDF for instructions on a full reindex from the Solr Admin UI, or page 126 for a Solr reload and full reindex from the command line (a curl command using the reindex=true and deleteAll=true parameters).
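For reference, the command-line variant looks roughly like this (host, port, and the keyspace.table core name are placeholders; check the exact syntax against the DSE 4.0 documentation):

    curl "http://localhost:8983/solr/admin/cores?action=RELOAD&name=keyspace.table&reindex=true&deleteAll=true"

As I understand it, deleteAll=true drops the existing Lucene index data and reindex=true then rebuilds the index from the data already in Cassandra, so the underlying Cassandra rows are not touched.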
There is a use case for us where we spin up an embedded Solr server (using the SolrJ EmbeddedSolrServer API) from a remote Solr instance, so that we can serve documents extremely fast in a query pipeline.
One of the things I am stuck on is determining whether the remote Solr instance has been modified in any way since the last sync. Obviously, a naive way to do this is to compare documents one at a time. However, that would be extremely inefficient and would completely negate the purpose of being fast.
Thanks for any tips or recommendations.
Each version of the Lucene index is assigned a version number. This version number is exposed through the Replication Handler (which you might already be using to replicate the index to your local embedded Solr instance):
http://host:port/solr/core_name/replication?command=indexversion
Returns the version of the latest replicatable index on the specified master or slave.
If you want to do it more manually, you can use the _version_ field that is automagically added to all documents in recent versions of Solr, and fetch any _version_ values that are larger than the current largest version in your index. This assumes you use the default _version_ numbering (which you more or less have to, since it's also used internally by SolrCloud).
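A rough sketch of that manual approach with SolrJ (the remote URL and the lastSeenVersion value are placeholders, and paging and deletes are left out) would pull only the documents that changed since the last sync:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class VersionSync {
        public static void main(String[] args) throws Exception {
            // Highest _version_ already present in the local embedded index (placeholder value).
            long lastSeenVersion = 1478906723440988160L;

            // Remote core we sync from (placeholder URL; newer SolrJ versions use HttpSolrClient instead).
            SolrServer remote = new HttpSolrServer("http://remote-host:8983/solr/core_name");

            SolrQuery q = new SolrQuery("*:*");
            // Exclusive lower bound: only documents added or updated after the last sync.
            q.addFilterQuery("_version_:{" + lastSeenVersion + " TO *]");
            q.addSort("_version_", SolrQuery.ORDER.asc);
            q.setRows(1000);

            for (SolrDocument doc : remote.query(q).getResults()) {
                // feed doc into the local EmbeddedSolrServer here
            }
        }
    }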
If you want to track individual documents, you can add a date field that is set for every document on the Solr side.
That is, add a new date field to the schema, named something like UpdateDateTime, which is updated every time a document is modified or newly added.
I am not sure how you are handling the deletion of documents on the Solr side. If you are not handling deletes, you can add another boolean field, say isDeleted.
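A minimal sketch of what that could look like in schema.xml (the field names come from the answer above; the field types assume the stock example schema):

    <!-- hypothetical tracking fields: last-modified timestamp and soft-delete flag -->
    <field name="UpdateDateTime" type="tdate"   indexed="true" stored="true" />
    <field name="isDeleted"      type="boolean" indexed="true" stored="true" default="false" />

You could then fetch everything changed since the last sync with a query like UpdateDateTime:[2014-01-01T00:00:00Z TO *] and filter on isDeleted:false (or query isDeleted:true to propagate deletes).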
Cassandra is a column-family datastore, which means that each column has its own timestamp/version, and it is possible to update a specific column of a Cassandra row; this is often referred to as a partial update.
I am trying to implement a pipeline which makes the data in Cassandra column family also searchable in a search engine like Solr or Elastic Search.
I know DataStax Enterprise Edition provides this Cassandra-Solr integration out of the box.
Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.
How are partial updates made in Cassandra written to Solr?
In other words, do partial updates made in Cassandra get written into Solr without the updates stepping on each other?
I can see where you might be coming from here, but it's also important for anyone reading this to know that the following statement is not correct:
Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.
To add some colour to this let me try to explain. When an update is written to Cassandra, regardless of the content, the new mutation goes into the write path as outlined here:
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlHowDataWritten.html
DSE Search uses a "secondary index hook" on the table: incoming writes are pushed into an indexing queue, turned into documents, and stored in the Lucene index. A high-level overview of the architecture is given here:
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/searchArchitecture.html
This blog post is a bit old now but still outlines the concepts of this:
http://www.datastax.com/dev/blog/datastax-enterprise-cassandra-with-solr-integration-details
So any update, regardless of whether it touches a single column or an entire row, will be indexed at the same time.
I have a pipeline of HBase, Lily, Solr and Hue set up for search and visualization. I am able to search the data indexed in Solr using Hue, but I cannot view all the required data, since I do not have all the fields from HBase stored in Solr. I'm not planning on storing all of the data either.
So is there a way of retrieving those fields from HBase along with the Solr response, for visualizing the data with Hue?
From what I know, I believe it is possible to set up a Solr SearchHandler to do this, but I haven't been able to find a concrete example to help me understand better (I am very new to both Solr and HBase, so examples help).
My question is similar to this question, but I am unable to comment there for further information.
Current solution, thanks to a suggestion by Romain:
Used an HTML widget to provide a link from each record on the Hue search page back to the HBase record in the HBase Browser.
One approach is to fetch the required id from Solr and then get the actual data from HBase. Solr gives you the count based on your query, plus faceting features; once the ids are fetched, you still have all of the data in HBase. Solr is best for index search, so given the speed and space compromise, this design can help. Another main reason is that HBase gives you good fetch times for an entire row when you look it up by row key, so overall performance also depends on your HBase row-key design.
I think you are using the Lily HBase indexer, if I am not wrong, so by default the doc id is the HBase row key, which might make things easier.
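A rough sketch of that pattern with SolrJ and the HBase client (it assumes an HBase 1.x client and SolrJ 4.x; the URL, table, column family, column names and query are all placeholders): query Solr for matching ids, then fetch each full row from HBase by row key.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class SolrThenHbase {
        public static void main(String[] args) throws Exception {
            // 1. Search in Solr; only the document id (the HBase row key) is requested.
            HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/collection1"); // placeholder URL
            SolrQuery q = new SolrQuery("some search terms");
            q.setFields("id");
            q.setRows(50);

            // 2. For each hit, fetch the full row from HBase by row key.
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) { // placeholder table
                for (SolrDocument doc : solr.query(q).getResults()) {
                    String rowKey = (String) doc.getFieldValue("id");
                    Result row = table.get(new Get(Bytes.toBytes(rowKey)));
                    byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("some_column")); // placeholder cf/column
                    System.out.println(rowKey + " -> " + (value == null ? null : Bytes.toString(value)));
                }
            }
        }
    }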
I have an index (Solr/Lucene v4.x) with ~1bn rows (180 GB) and want to migrate it to the DataStax variant of Solr. I couldn't find any HOWTO or migration guide. Will simply copying the index dir to DataStax's solr.data// do the trick, plus posting the solrconfig.xml and schema.xml?
The first question is how much of your data is "stored", since only stored fields can be exported. You would then export your existing Solr data to, say, CSV files, and import that data into DataStax Enterprise.
But you cannot directly move a Lucene/Solr index into DataStax Enterprise; for one thing, DSE stores some additional attributes for each Solr document.
The whole point of DSE is that Cassandra becomes your system of record, maintaining the raw data, and then DSE/Solr is simply indexing the data to support rich query. DSE uses Cassandra to store the data and Solr to index the data.
You can use something like https://github.com/dbashford/solr2solr to copy your data from one to the other, but you can't reuse your index files.
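For the export step, Solr's CSV response writer can dump stored fields directly; a rough sketch (host, core name and field list are placeholders, and for an index this size you would page through the results, e.g. with cursorMark on Solr 4.7+, rather than pull everything in one request):

    curl "http://host:port/solr/core_name/select?q=*:*&wt=csv&fl=id,field1,field2&rows=100000" > export.csv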
I need to update a few fields of each document in a Solr index, separately from the main indexing process. According to the documentation, "Create" and "Update" are mapped onto the "Add()" function. http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exists, will it replace the entire document or just the fields that I have specified?
If it replaces the entire document, then the only way I can think of to update is to fetch the document by unique id, modify the document object, and then "Add" it again. That doesn't sound feasible given the frequency of updates required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields for a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or the original source) and then resending the document (adding via SolrNet), is the only way to update documents in Solr.
Please see the previous question: Update specific field on Solr index for more details about Solr not supporting individual field updates and an open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in SOLR, you might need to rethink your entire solution. In typical solutions that use SOLR and require lots of frequent updates to documents, the way it is usually done is that the documents reside in some SQL or NoSQL database, and they are modified there. Then you use DIH or something similar to bulk update the SOLR index from the database, possibly just dropping the index and re-indexing all content. SOLR can index documents very quickly so that is typically not a problem.
Partial updating of documents is now supported in newer versions of Solr; 4.10, for example, handles it pretty well. Please look at the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The only detail is that you need to declare your fields as stored=true to allow for partial updates.
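A minimal example of such a partial (atomic) update over HTTP (host, port, core name, document id and field value are placeholders): only the fields wrapped in a modifier such as "set" are changed, and the rest of the stored document is left intact.

    curl "http://host:port/solr/core_name/update?commit=true" \
         -H "Content-Type: application/json" \
         -d '[{"id": "doc1", "price": {"set": 42}}]'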
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
In this specific module: Content: Schemas, Documents and Indexing