How do partial updates work in DataStax Solr?

Cassandra is a column family datastore, which means that each column has its own timestamp/version, and it is possible to update a specific column of a Cassandra row; this is often referred to as a partial update.
I am trying to implement a pipeline which makes the data in Cassandra column family also searchable in a search engine like Solr or Elastic Search.
I know DataStax Enterprise does provide this Cassandra-Solr integration out of the box.
Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.
How are partial updates done in Cassandra written to Solr?
In other words, do partial updates made in Cassandra get written into Solr without the updates stepping on each other?

I can see where you might be coming from here, but it's also important for anyone reading this to know that the following statement is not correct:
Given that Solr and Elasticsearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.
To add some colour to this, let me try to explain. When an update is written to Cassandra, regardless of the content, the new mutation goes through the write path as outlined here:
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlHowDataWritten.html
DSE Search uses a secondary index hook on the table: incoming writes are pushed into an indexing queue, turned into documents, and stored in the Lucene index. The architecture is described at a high level here:
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/searchArchitecture.html
This blog post is a bit old now but still outlines the concepts of this:
http://www.datastax.com/dev/blog/datastax-enterprise-cassandra-with-solr-integration-details
So any update, whether it touches a single column or an entire row, results in the whole document being indexed at the same time.
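The behaviour above can be sketched as a read-then-reindex step: a partial column mutation is merged into the stored row, and the complete row is then re-indexed as a single Solr document. This is an illustrative model only, not DSE's actual internals; all names below are hypothetical.

```python
# Illustrative sketch (not DSE internals): on a partial column update,
# the indexer sees the full row and re-indexes it as one whole document.

rows = {}      # stand-in for the Cassandra table
index = {}     # stand-in for the Lucene/Solr index

def apply_mutation(key, columns):
    """Apply a (possibly partial) column mutation, then re-index the full row."""
    row = rows.setdefault(key, {})
    row.update(columns)        # only the mutated columns change in the table...
    index[key] = dict(row)     # ...but the *entire* row becomes the Solr document

apply_mutation("user1", {"name": "Ada", "city": "London"})
apply_mutation("user1", {"city": "Paris"})   # partial update: one column

print(index["user1"])   # the indexed document still carries both fields
```

The point the sketch makes: because the full row is re-read at index time, a single-column update never produces a Solr document missing the other columns, so document-level versioning in Solr does not conflict with column-level timestamps in Cassandra.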

Related

Is Solr better than a normal RDBMS for searching with normal queries, i.e. not full-text search?

I am developing a web application where I want to use Solr for search only and keep my data in another database.
I will have two databases: one relational (SQL Server), and a copy of it in Solr.
I'll be searching for specific fields in the Solr documents, e.g. by id, name, and type, plus join queries, i.e. NOT full-text search.
I know Solr's strength is full-text search, achieved by building an inverted index over the document data. What I want to know is whether it also helps in my case: does it create another type of index over my documents that makes normal field searches faster than a SQL Server index?
Yes, it can help you.
You need to consider what your requirements and priorities are.
If you add Solr as an additional component used only for searching the application data, keep in mind that you have to keep Solr constantly in sync with the database, and that it needs additional infrastructure.
If performance is your main criterion and you don't want to put any search load on your RDBMS, then adding Solr to your system makes sense. Also consider how big your data in the RDBMS is, because RDBMS engines are themselves strong enough to handle searches over moderately sized data.
Weighing all the above aspects, you can make the decision.
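As a rough sketch of what the non-full-text case looks like in practice, a structured Solr lookup is just filter queries on exact fields. The host, core, and field names below are placeholders, not anything from the question.

```python
from urllib.parse import urlencode

# Sketch of a structured (non-full-text) Solr lookup: exact-match filter
# queries instead of a free-text q. Host/core/field names are hypothetical.
base = "http://localhost:8983/solr/products/select"
params = {
    "q": "*:*",
    "fq": ["id:42", "type:book"],   # exact-match filters, cached independently
    "fl": "id,name,type",           # return only the fields you need
    "wt": "json",
}
url = base + "?" + urlencode(params, doseq=True)
print(url)
```

Each fq clause hits the same inverted index that powers full-text search, which is why exact-field lookups in Solr are fast too; whether they beat a well-indexed SQL Server query depends on data size and load, as the answer says.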

Is it possible to retrieve HBase data along with Solr data?

I have a pipeline of HBase, Lily, Solr, and Hue set up for search and visualization. I am able to search the data indexed in Solr using Hue, except that I cannot view all the required data, since I do not have all the fields from HBase stored in Solr. I'm not planning on storing all of the data either.
So is there a way of retrieving those fields from HBase along with the Solr response, for visualizing the data in Hue?
From what I know, I believe it is possible to set up a Solr search handler to perform this, but I haven't been able to find a concrete example to help me understand better (I am very new to both Solr and HBase, so examples help).
My question is similar to this question, but I am unable to comment there for further information.
Current solution, thanks to a suggestion by Romain:
Used the HTML widget to provide a link from each record on the Hue search page back to the HBase record in the HBase Browser.
One approach is to fetch the required ids from Solr and then get the actual data from HBase. Solr gives you the count based on your query, plus faceting features; once the ids are fetched, the full data is always available in HBase. Solr is best for index search, so given the speed/space trade-off, this design can help. Another important point is that HBase gives you fast fetch times for an entire row when it is looked up by row key, so the overall performance also depends on your HBase row-key design.
I think you are using the Lily HBase indexer, if I am not wrong. By default the doc id there is the HBase row key, which might make things easy.
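The two-step pattern described above can be sketched as follows. The two fetch functions are stand-ins (real code would use a Solr client and, e.g., happybase for HBase); the table contents, column names, and query are made up for illustration.

```python
# Sketch of the two-step pattern: search Solr for matching doc ids, then
# fetch the full rows from HBase by row key. Since the Lily indexer uses
# the HBase row key as the Solr doc id, the ids map straight to rows.

def solr_search(query):
    # stand-in for GET /select?q=...&fl=id ; returns matching doc ids
    return ["row1", "row3"]

HBASE_TABLE = {   # stand-in for the HBase table
    "row1": {"cf:title": "Alpha", "cf:body": "full text kept only in HBase"},
    "row2": {"cf:title": "Beta",  "cf:body": "another unindexed field"},
    "row3": {"cf:title": "Gamma", "cf:body": "more unindexed data"},
}

def hbase_get(row_key):
    # stand-in for table.row(row_key) in happybase
    return HBASE_TABLE[row_key]

# Solr answers "which rows match"; HBase supplies the fields Solr never stored.
results = [hbase_get(k) for k in solr_search("title:*a*")]
print([r["cf:title"] for r in results])
```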

Solr reindex after schema change

I need to change the datatype of one field from "int" to "long" because some of the values exceed the upper limit of a 32-bit signed integer. I might also need to add and drop some fields in the future. Will my index be updated automatically after uploading the new schema.xml? If not, how should I go about re-indexing?
The Solr FAQ suggests that I remove the data via an update command that deletes all documents. However, my team is using Cassandra as the primary database, and it seems that Cassandra and Solr are tightly coupled (i.e. whatever you do in your Solr index will directly affect the Cassandra data). In our case, deleting the data in Solr results in the deletion of the underlying Cassandra rows. What is the best approach to deal with this? The Cassandra table (and Solr core) contains more than 2 billion rows, so creating a duplicate core and swapping the two afterwards is not practical.
Note: We are using DataStax Enterprise 4.0. I'm not sure if the behavior I described above is true for open-source Solr.
You need to reindex the Solr data. Unfortunately, since you are changing the type of a field, you need to delete the old index data for Solr first, and then reindex from the Cassandra data.
See page 109 of the PDF of the DSE 4.0 documentation for instructions on a full re-index from the Solr Admin UI, or page 126 for a Solr reload and full re-index from the command line (a curl command), using the reindex=true and deleteAll=true parameters.
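For reference, the command-line path boils down to a single HTTP call against the core admin endpoint with those two parameters. This is a sketch only; the host, keyspace, and table names are placeholders, and you should confirm the exact endpoint against the DSE docs for your version.

```python
from urllib.parse import urlencode

# Sketch of the DSE "reload core" call used for a destructive full re-index.
# Host, keyspace, and table names are placeholders.
host = "http://localhost:8983"
core = "my_keyspace.my_table"
params = {
    "action": "RELOAD",
    "name": core,
    "reindex": "true",    # rebuild the Solr index from the Cassandra data
    "deleteAll": "true",  # drop the old index first (needed for a type change)
}
url = f"{host}/solr/admin/cores?{urlencode(params)}"
print(url)
# e.g. urllib.request.urlopen(url) once a DSE node is reachable
```

Because deleteAll=true only clears the index (not the table) and reindex=true rebuilds it from the Cassandra data, this avoids the delete-by-query route that would remove the underlying Cassandra rows.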

Solr: can I use it for this?

Is Solr just for searching, i.e. not for 'updating' or 'inserting' data?
My site is currently MySQL-based, and on looking at Solr as an alternative option, I see you make your queries through HTTP requests.
My first thought was: how do you stop someone from making a request that updates or inserts data?
Obviously I'm not understanding Solr, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data on the file system and does not provide the features of a relational database (ACID transactions, nested entities, etc.).
The model usually followed is to use a relational database for your data management and replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing its URLs.

Solr/SolrNet: How can I update a document given a document unique ID?

I need to update a few fields of each document in a Solr index separately from the main indexing process. According to the documentation, "Create" and "Update" are both mapped onto the "Add()" function: http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exists, will it replace the entire document or just the fields that I have specified?
If it replaces the entire document, then the only way I can think of to update is to search for the document by unique id, modify the document object, and then "Add" it again. This doesn't sound feasible given the frequency of update operations required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields of a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or the original source) and then resending the document (adding it via SolrNet), is the only way to update documents in Solr.
Please see the previous question Update specific field on Solr index for more details about Solr not supporting individual field updates, and an open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in Solr, you might need to rethink your entire solution. In typical solutions that use Solr and require lots of frequent document updates, the documents reside in some SQL or NoSQL database and are modified there. You then use DIH or something similar to bulk-update the Solr index from the database, possibly just dropping the index and re-indexing all content. Solr can index documents very quickly, so that is typically not a problem.
Partial updating of documents is now supported in newer versions of Solr; 4.10, for example, does this pretty well. Please look at the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The one detail is that you need to declare your fields as stored="true" to allow partial updates.
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
In this specific module: Content: Schemas, Documents and Indexing
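As a concrete illustration, an atomic update request in the JSON format described on the page linked above names only the fields to change, keyed by the document's unique id. This is a hedged sketch: the id, field names, and core name in the endpoint comment are made up.

```python
import json

# Sketch of a Solr atomic ("partial") update payload: only the named fields
# change, identified by the document's unique key. Field names are examples.
doc = {
    "id": "doc-42",
    "price": {"set": 99},          # replace the value of a single field
    "tags": {"add": "clearance"},  # append to a multivalued field
}
payload = json.dumps([doc])
print(payload)
# POST this to /solr/<core>/update with Content-Type: application/json
```

Other modifiers include "inc" (increment a numeric field) and "remove"; fields not mentioned in the payload keep their stored values, which is why the stored="true" requirement above matters.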
