I have a pipeline of HBase, Lily, Solr and Hue set up for search and visualization. I am able to search the data indexed in Solr using Hue, but I cannot view all the required data because not all of the HBase fields are stored in Solr. I'm not planning to store all of the data in Solr either.
So is there a way of retrieving those fields from HBase along with the Solr response, so the data can be visualized with Hue?
From what I know, it should be possible to set up the Solr search handler to do this, but I haven't been able to find a concrete example to help me understand it better (I am very new to both Solr and HBase, so examples help).
My question is similar to this question. But I am unable to comment there for further information.
Current Solution thanks to suggestion by Romain:
I used the HTML widget to provide a link for each record on the Hue Search page back to the corresponding HBase record in the HBase Browser.
One approach is to fetch only the required IDs from Solr and then get the actual data from HBase. Solr gives you the hit count for your query plus faceting features; the full data always remains in HBase. Solr is best used for index search, so this design is a reasonable compromise between speed and index size. Another main reason is that HBase gives you good fetch times for an entire row when you look it up by row key, so overall performance also depends on your HBase row key design.
I think you are using the Lily HBase Indexer, if I am not wrong. By default the Solr document id is the HBase row key, which should make things easy.
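A minimal sketch of that lookup pattern in Python, assuming the Solr document id is the HBase row key. The function name enrich_with_hbase and the fetch_rows callable are hypothetical; fetch_rows stands in for whatever HBase client you use (for example a batched get over Thrift or REST), and FAKE_HBASE is just an in-memory stand-in so the example runs on its own.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def enrich_with_hbase(
    solr_docs: Iterable[Row],
    fetch_rows: Callable[[List[str]], Dict[str, Row]],
    id_field: str = "id",
) -> List[Row]:
    """Merge each Solr hit with the full row fetched from HBase by row key."""
    docs = list(solr_docs)
    # One batched lookup for all hits instead of one round trip per hit.
    rows = fetch_rows([doc[id_field] for doc in docs])
    enriched = []
    for doc in docs:
        # Start from the HBase row (empty if missing), then overlay Solr fields.
        merged = dict(rows.get(doc[id_field], {}))
        merged.update(doc)
        enriched.append(merged)
    return enriched


# In-memory stand-in for an HBase table (assumption, for illustration only).
FAKE_HBASE: Dict[str, Row] = {
    "row1": {"body": "full text of record 1", "owner": "alice"},
    "row2": {"body": "full text of record 2", "owner": "bob"},
}


def fake_fetch_rows(keys: List[str]) -> Dict[str, Row]:
    """Return the rows that exist for the given keys."""
    return {key: FAKE_HBASE[key] for key in keys if key in FAKE_HBASE}


hits = [{"id": "row1", "title": "first"}, {"id": "row3", "title": "missing"}]
result = enrich_with_hbase(hits, fake_fetch_rows)
```

Hits whose row key is missing from HBase come back with only their Solr fields, so the caller can decide how to display them.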
Cassandra is a column-family datastore, which means that each column has its own timestamp/version and a specific column of a Cassandra row can be updated on its own; this is often referred to as a partial update.
I am trying to implement a pipeline which makes the data in Cassandra column family also searchable in a search engine like Solr or Elastic Search.
I know Datastax Enterprise Edition does provide this Cassandra Solr Integration out of the box.
Given that Solr and ElasticSearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.
How are partial updates made in Cassandra written to Solr?
In other words, do partial updates made in Cassandra get written into Solr without the updates stepping on each other?
I can see where you might be coming from here, but it's also important for anyone reading this to know that the following statement is not correct:
Given that Solr and ElasticSearch maintain versioning at the document level and not at the field level, there is a conceptual disconnect between the data models of Solr and Cassandra.
To add some colour to this let me try to explain. When an update is written to Cassandra, regardless of the content, the new mutation goes into the write path as outlined here:
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlHowDataWritten.html
DSE Search uses a "secondary index hook" on the table: incoming writes are pushed into an indexing queue, turned into documents, and stored in the Lucene index. The architecture page gives a high-level overview here:
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/searchArchitecture.html
This blog post is a bit old now but still outlines the concepts of this:
http://www.datastax.com/dev/blog/datastax-enterprise-cassandra-with-solr-integration-details
So any update, regardless of whether it touches a single column or an entire row, will be indexed at the same time.
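To make the "updates don't step on each other" point concrete, here is a toy model in Python. It is not DSE's actual code; the class ToyIndexedTable and the assumption that the index document is rebuilt from the current state of the whole row after each write are mine, for illustration only.

```python
from typing import Dict


class ToyIndexedTable:
    """Toy table whose writes are mirrored into a document index."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, str]] = {}   # stand-in for the Cassandra table
        self.index: Dict[str, Dict[str, str]] = {}  # stand-in for the Lucene index

    def write(self, key: str, columns: Dict[str, str]) -> None:
        """Apply a (possibly partial) column update, then re-index the row."""
        self.rows.setdefault(key, {}).update(columns)
        self._reindex(key)

    def _reindex(self, key: str) -> None:
        # The document is built from the full current row, not just from the
        # columns in the latest write, so earlier columns are not lost.
        self.index[key] = {"id": key, **self.rows[key]}


table = ToyIndexedTable()
table.write("u1", {"name": "Ann"})
table.write("u1", {"city": "Oslo"})   # partial update of a different column
table.write("u1", {"name": "Anna"})   # partial update of an existing column
```

After the three writes the indexed document for "u1" carries both the latest name and the city, even though no single write contained both columns.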
Is Solr just for searching, i.e. it's not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at SOLR as an alt option, I see you make your queries through http requests.
My first thought was - how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding SOLR, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data in the file system and does not provide the features of a relational database (ACID, nested entities, etc.).
Usually, the model followed is to use a relational database for your data management.
Then replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.
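A rough sketch of both ideas in Python, under some assumptions: requests reach Solr only through a proxy you control, the URL layout is /solr/<core>/<handler>, and sqlite3 stands in for your relational database. The names is_read_only, rows_to_solr_docs and READ_ONLY_HANDLERS are made up for this example.

```python
import json
import sqlite3
from urllib.parse import urlsplit

# Handlers the public is allowed to reach (assumption: only plain search).
READ_ONLY_HANDLERS = {"select"}


def is_read_only(url: str) -> bool:
    """Return True only for /solr/<core>/<handler> with an allowed handler."""
    parts = [p for p in urlsplit(url).path.split("/") if p]
    return len(parts) == 3 and parts[0] == "solr" and parts[2] in READ_ONLY_HANDLERS


def rows_to_solr_docs(conn: sqlite3.Connection, query: str) -> str:
    """Turn the rows of a SQL query into a JSON list of Solr documents."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    return json.dumps([dict(zip(columns, row)) for row in cursor])


# The database stays the source of truth; Solr only receives a copy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id TEXT, name TEXT)")
conn.execute("INSERT INTO products VALUES ('1', 'lamp')")
payload = rows_to_solr_docs(conn, "SELECT id, name FROM products")
```

The payload would be POSTed to the core's /update handler from a trusted machine only, while the proxy rejects any public request for which is_read_only returns False.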
I have a data warehousing problem, needing to query over a large dataset. For the sake of this example, let's say a typical state would have 30 million users with activity stats for each. Ideally I could buy a data warehousing tool (Vertica, Infobright, etc.) but that's not in the cards or the budget.
Right now I'm considering using Solr to query HBase. While I believe HBase could scale up to the needs, I worry about Solr. It's optimized as a search engine, i.e. the first pages of results return before the last, and there's no support for something like a database cursor. Tests so far have shown that getting a large result set out of Solr is slower than I would've liked. For instance, a query that retrieves half of the available users (one which ultimately returned 500 MB of data) finished in under a minute in the community version of Infobright; in Solr it took 12 minutes.
Is there something other than Solr that's better suited to query this data? Are there any optimizations that would help with bulk data input and output?
I know this is a bit late but...
Depending on your search requirements Solr could be a good option. Keep in mind you most likely won't need to index everything in HBase. Are there certain fields you can pick out? Portions of text? You most certainly do NOT need to store this stuff in Solr if you're already storing it in HBase.
Solr is an excellent secondary index system to put on top of HBase, and Solr also has some great text analytics capabilities if that is what you need.
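On the bulk-export concern in the question: Solr 4.7 and later support cursor-based deep paging through the cursorMark parameter, which needs a sort that includes the unique key. Below is a minimal Python sketch of that loop which pulls back only row keys. The search callable is hypothetical (it should send the parameters to the core's /select handler and return the parsed JSON), and make_fake_search is only an in-memory stand-in so the example runs without a server.

```python
from typing import Callable, Dict, Iterator, List


def iter_all_docs(
    search: Callable[[Dict[str, str]], dict],
    query: str,
    unique_key: str = "id",
    rows: int = 1000,
) -> Iterator[dict]:
    """Yield every matching document by following Solr's cursorMark."""
    cursor = "*"  # Solr's marker for the start of the result set
    while True:
        response = search({
            "q": query,
            "fl": unique_key,              # fetch only the row key
            "rows": str(rows),
            "sort": f"{unique_key} asc",   # cursor paging needs the unique key in the sort
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:          # unchanged cursor means no more results
            break
        cursor = next_cursor


def make_fake_search(ids: List[str]) -> Callable[[Dict[str, str]], dict]:
    """Build an in-memory imitation of cursor paging (illustration only)."""
    ordered = sorted(ids)

    def search(params: Dict[str, str]) -> dict:
        cursor = params["cursorMark"]
        remaining = ordered if cursor == "*" else [i for i in ordered if i > cursor]
        page = remaining[: int(params["rows"])]
        return {
            "response": {"docs": [{"id": i} for i in page]},
            "nextCursorMark": page[-1] if page else cursor,
        }

    return search


keys = [doc["id"] for doc in iter_all_docs(make_fake_search(list("edcba")), "*:*", rows=2)]
```

The row keys collected this way can then be used to read the full records from HBase.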
You should also take a look at ElasticSearch, one of Solr's primary competitors.
Take a look at SolBase and Lily, two implementations that combine Solr with an HBase backend.
I have a very large Solr index. I want to tag all documents with the terms that best represent each document, like this. Do clustering results of this type also count as document tagging?
Which approach is better: index-time document tagging, or query-time document tagging like Carrot2?
Query time has the obvious drawback that this makes the query more expensive.
However, the clustering results at query time are supposedly better, because at that time, more information has been seen and user feedback can be incorporated.
Note that technically, this is probably more frequent pattern mining than cluster analysis.
Maybe you should just try this variant of frequent pattern mining on your whole data set. You might not even need to store which documents were tagged which way - the solr engine should already be optimized to retrieve them again when needed.
I understand from your question that you want to know how to implement something similar to Carrot2 faceting using Solr.
IMO you can add a multivalued field tag to your documents (see this Stack Overflow Question for an example) with the cluster names for that doc, and then build facets using that field as explained in Solr wiki here and here.
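As a rough illustration in Python, assuming the multivalued field is called tag and the response uses Solr's default JSON facet layout (a flat list of alternating terms and counts). The helper names facet_query and parse_facet_counts are made up for this example.

```python
from typing import Dict
from urllib.parse import urlencode


def facet_query(tag_field: str = "tag", limit: int = 20) -> str:
    """Build the query string for a facet-only request on the tag field."""
    return urlencode({
        "q": "*:*",
        "rows": 0,                  # only the facet counts are needed
        "facet": "true",
        "facet.field": tag_field,
        "facet.limit": limit,
        "facet.mincount": 1,        # skip tags that match no document
        "wt": "json",
    })


def parse_facet_counts(response: dict, tag_field: str = "tag") -> Dict[str, int]:
    """Convert Solr's flat [term, count, term, count, ...] list to a dict."""
    flat = response["facet_counts"]["facet_fields"][tag_field]
    return dict(zip(flat[0::2], flat[1::2]))


query_string = facet_query()
sample_response = {"facet_counts": {"facet_fields": {"tag": ["solr", 3, "hbase", 1]}}}
counts = parse_facet_counts(sample_response)
```

The query string is appended to the core's /select URL; the resulting counts give you the tag cloud for the whole index or for any filtered query.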
I need to update a few fields of each document in a Solr index separately from the main indexing process. According to the documentation, "Create" and "Update" are both mapped onto the "Add()" function. http://code.google.com/p/solrnet/wiki/CRUD
So if I add a document which already exists, will it replace the entire document or just the fields that I have specified?
If it replaces the entire document, then the only way I can think of to update is to look up the document by unique id, update the document object and then "Add" it again. This doesn't sound feasible given the frequency of update operations required. Is there a better way to update?
Thanks!
Unfortunately, Solr does not currently support updating individual fields for a given document in the index. The latter scenario you describe, retrieving the entire document contents (either from Solr or the original source) and then resending the document (adding via SolrNet), is the only way to update documents in Solr.
Please see the previous question: Update specific field on Solr index for more details about Solr not supporting individual field updates and an open JIRA issue for adding this support to Solr.
If you need to frequently update a lot of documents in SOLR, you might need to rethink your entire solution. In typical solutions that use SOLR and require lots of frequent updates to documents, the way it is usually done is that the documents reside in some SQL or NoSQL database, and they are modified there. Then you use DIH or something similar to bulk update the SOLR index from the database, possibly just dropping the index and re-indexing all content. SOLR can index documents very quickly so that is typically not a problem.
Partial updating of documents is now supported in newer versions of Solr; 4.10, for example, handles it well. Please look at the following page for more information:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
The only detail is that you need to declare your fields as stored=true to allow for partial updates.
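For reference, a small Python sketch of what such a partial ("atomic") update request body looks like in JSON. The "set" modifier is part of Solr's atomic update syntax; the helper name atomic_update and the field names in the usage lines are assumptions for the example.

```python
import json
from typing import Any, Dict


def atomic_update(doc_id: str, changes: Dict[str, Any], id_field: str = "id") -> str:
    """Build a JSON body that sets only the given fields of one document."""
    doc: Dict[str, Any] = {id_field: doc_id}
    for field, value in changes.items():
        # {"set": value} replaces this field and leaves the others untouched.
        doc[field] = {"set": value}
    return json.dumps([doc])


body = atomic_update("42", {"price": 9.99, "in_stock": True})
```

The body is POSTed to the core's /update handler with Content-Type application/json, followed by a commit.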
I also show how to do it in this training:
http://www.pluralsight.com/courses/enterprise-search-using-apache-solr
In this specific module: Content: Schemas, Documents and Indexing