I'm developing a keyword analytics app. I wish to crawl the web using Nutch, index the output using Solr and finally store the data in Cassandra.
I should later be able to do search queries and analytics on Solr and it must fetch the relevant data from Cassandra.
Is this setup possible? If yes, is there anything that I should keep in mind?
If you use DataStax's distribution of Cassandra, indexing Cassandra table(s) into Solr is much easier. See http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-solr
I think you can, but I am not a Cassandra user, so I have never tried it.
You will have to configure gora.properties (http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1/conf/gora.properties) to enable Cassandra. In the Nutch 2 Tutorial (http://wiki.apache.org/nutch/Nutch2Tutorial) they do that for HBase.
To see where the data is mapped in Cassandra, take a look at the mappings at http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1/conf/gora-cassandra-mapping.xml
Nutch will store the data in Cassandra. I don't know about Solr (I have never used it).
Programmatically it's possible. Keep the same unique ID in both Cassandra and Solr. Query the Solr index to get the matching IDs, then use those IDs to fetch the full records from Cassandra.
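A minimal sketch of that two-step lookup, assuming the index and the store share the same unique ID. The names `search_then_fetch`, `search_ids` and `fetch_row` are hypothetical, and the dictionaries are in-memory stand-ins for Solr and Cassandra so the example runs without either server:

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, object]


def search_then_fetch(
    search_ids: Callable[[str], Iterable[str]],
    fetch_row: Callable[[str], Optional[Row]],
    query: str,
) -> List[Row]:
    """Ask the index for matching ids, then load each full row from the store."""
    rows: List[Row] = []
    for doc_id in search_ids(query):
        row = fetch_row(doc_id)
        # The index and the store may be out of sync, so skip ids
        # that the store cannot resolve.
        if row is not None:
            rows.append(row)
    return rows


# In-memory stand-ins (assumptions, not real Solr/Cassandra clients).
fake_index = {"nutch": ["url-1", "url-3"]}
fake_store = {
    "url-1": {"id": "url-1", "content": "Nutch crawls the web"},
    "url-2": {"id": "url-2", "content": "unrelated page"},
}

results = search_then_fetch(
    lambda q: fake_index.get(q, []),  # stands in for a Solr query returning ids
    fake_store.get,                   # stands in for a Cassandra lookup by key
    "nutch",
)
print(results)
```

In a real setup the two callables would wrap whichever Solr and Cassandra clients you use; only the ID needs to be stored in Solr for this to work.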
I have a pipeline of HBase, Lily, Solr and Hue set up for search and visualization. I am able to search the data indexed in Solr using Hue, but I cannot view all the required data because not all the fields from HBase are stored in Solr. I'm not planning on storing all of the data in Solr either.
So is there a way of retrieving those fields from HBase along with the Solr response for visualizing the data with Hue?
From what I know, I believe it is possible to set up the Solr SearchHandler to do this, but I haven't been able to find a concrete example to help me understand it better (I am very new to both Solr and HBase, so examples help).
My question is similar to this question. But I am unable to comment there for further information.
Current solution, thanks to a suggestion by Romain:
I used an HTML widget to provide a link for each record on the Hue Search page back to the HBase record in the HBase Browser.
One approach is to fetch the required IDs from Solr and then get the actual data from HBase. Solr gives you the hit count for your query and also some faceting features. Once you have those, the full data is always available in HBase. Solr is best for index search, so given the trade-off between speed and space, this design can help. Another main reason is that HBase gives you good fetch times for an entire row when it is looked up by row key, so the overall performance also depends on your HBase row key design.
I think you are using the Lily HBase Indexer, if I am not wrong. In that case the doc ID is the HBase row key by default, which might make things easy.
I have set up Nutch 1.9 with Solr properly. Now I would like to retrieve this data via Java into a program, to analyse and display it. At the moment I can query the data with Solr. However, I cannot find any further information about the underlying database used by Nutch or how to retrieve data from it.
Any recommendations, how that can be done?
I appreciate your answer!
If you can see your data already indexed in Solr, then you do not need to retrieve anything from Nutch. What you need right now is the right Solr client to interact with Solr. The client will query Solr and parse the responses.
Since you are going to use Java, you should use SolrJ.
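To illustrate what "query Solr and parse responses" involves, here is a rough sketch of the request/response cycle over Solr's HTTP API, which a client library such as SolrJ handles for you. It is written in Python rather than Java and is not SolrJ code. The base URL, the core name `collection1`, the `title` field and the sample response body are all assumptions for illustration:

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def build_select_url(base_url: str, core: str, query: str, rows: int = 10) -> str:
    """Build a URL for the /select handler, asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def parse_docs(response_body: str) -> List[Dict[str, object]]:
    """Extract the matching documents from a Solr JSON response body."""
    return json.loads(response_body)["response"]["docs"]


# Assumed host, core and field names; adjust to your installation.
url = build_select_url("http://localhost:8983/solr", "collection1", "title:nutch")
print(url)

# A hand-written sample body in the shape Solr returns for wt=json.
sample_body = (
    '{"response": {"numFound": 1, "start": 0, '
    '"docs": [{"id": "http://example.com/", "title": "Nutch"}]}}'
)
docs = parse_docs(sample_body)
print(docs[0]["id"])
```

Against a running server, the body would come from something like `urllib.request.urlopen(url).read()` instead of the sample string.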
Is Solr just for searching, i.e. is it not for 'updating' or 'inserting' data?
My site is currently MySQL based, and in looking at Solr as an alternative, I see that you make your queries through HTTP requests.
My first thought was - how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding Solr, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data on the file system and does not provide the features of a relational database (ACID, nested entities, etc.).
The usual model is to use a relational database for your data management and replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.
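As a deliberately simplified sketch of what securing the URLs can mean: public traffic is only let through to the search handler, while the update handler stays reachable from trusted hosts only. The function name `is_read_only_request` and the allow-list are hypothetical, and in practice this kind of rule normally lives in a reverse proxy, firewall or servlet container rather than in application code. It is not a complete security measure on its own:

```python
from urllib.parse import urlsplit

# Handlers that public clients may call (an assumed allow-list).
READ_ONLY_HANDLERS = {"select"}


def is_read_only_request(url: str) -> bool:
    """Return True only if the last path segment is an allowed read-only handler."""
    path = urlsplit(url).path
    handler = path.rstrip("/").rsplit("/", 1)[-1]
    return handler in READ_ONLY_HANDLERS


print(is_read_only_request("http://localhost:8983/solr/collection1/select?q=*:*"))
print(is_read_only_request("http://localhost:8983/solr/collection1/update?commit=true"))
```

An allow-list is used instead of a block-list so that any handler not explicitly permitted is rejected by default.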
I have an index (Solr/Lucene v. 4.x) with ~1bn rows (180 GB) and want to migrate it into the DataStax variant of Solr. I couldn't find any HOWTO or migration guideline. Will simply copying the index dir to Datastax solr.data// do the trick, plus posting the solrconfig.xml and schema.xml?
br
accid
The first question is how much of your data is "stored", and then you need to export your existing Solr data to, say, CSV files, and then import that data into Datastax Enterprise.
But, you cannot directly move a Lucene/Solr index into Datastax Enterprise. For one thing, DSE stores some additional attributes for each Solr document.
The whole point of DSE is that Cassandra becomes your system of record, maintaining the raw data, and then DSE/Solr is simply indexing the data to support rich query. DSE uses Cassandra to store the data and Solr to index the data.
You can use something like https://github.com/dbashford/solr2solr to copy your data from one to the other, but you can't re-use your index files.
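The export step mentioned above (stored fields out to CSV) might look roughly like the sketch below. It assumes the documents have already been read from the old Solr index as dictionaries, which is only possible for fields that are "stored". The function name `docs_to_csv`, the field names and the `|` separator for multi-valued fields are illustrative choices, not anything prescribed by Solr or DSE:

```python
import csv
import io
from typing import Dict, Iterable, List


def docs_to_csv(
    docs: Iterable[Dict[str, object]],
    fields: List[str],
    multi_sep: str = "|",
) -> str:
    """Serialize documents to CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for doc in docs:
        row = {}
        for field in fields:
            value = doc.get(field, "")  # missing fields become empty cells
            if isinstance(value, list):
                # Flatten multi-valued fields into a single cell.
                value = multi_sep.join(str(v) for v in value)
            row[field] = value
        writer.writerow(row)
    return buffer.getvalue()


sample_docs = [
    {"id": "1", "title": "first", "tags": ["a", "b"]},
    {"id": "2", "title": "second"},
]
csv_text = docs_to_csv(sample_docs, ["id", "title", "tags"])
print(csv_text)
```

For an index of around a billion rows you would page through the results and write to files in batches rather than build one string in memory.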
I am currently collecting information on when I should use Nutch with Solr (domain: vertical web search).
Could you advise me?
Nutch is a framework for building web crawlers and search engines. Nutch can do the whole process, from collecting the web pages to building the inverted index. It can also push those indexes to Solr.
Solr is mainly a search engine with support for faceted searches and many other neat features. But Solr doesn't fetch the data, you have to feed it.
So maybe the first thing to ask in order to choose between the two is whether the data to be indexed is already available (in XML, in a CMS or in a database). In that case, you should probably just use Solr and feed it that data. On the other hand, if you have to fetch the data from the web, you are probably better off with Nutch.