Apache Nutch 1.9 database - database

I have set up Nutch 1.9 with Solr properly. Now I would like to retrieve this data via Java into a program, to analyse and display the data. At the moment I can query the data with Solr. However, I cannot find any further information about the underlying database used by Nutch or how to retrieve data from it.
Any recommendations on how that can be done?
I appreciate your answer!

If you can already see your data indexed in Solr, then you do not need to retrieve anything from Nutch. What you need right now is the right Solr client to interact with Solr. The client will query Solr and parse the responses.
Since you are going to use Java, you should use SolrJ.
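To make the "query Solr and parse responses" part concrete, here is a minimal sketch of the kind of HTTP request a client sends to Solr's /select handler. It deliberately uses only the JDK (Java 11+) so it runs without extra dependencies; SolrJ wraps this sort of call and gives you typed results instead of raw JSON. The base URL, the core name "collection1" and the field name "content" are assumptions about your installation, not something taken from the question.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class SolrSelect {
    // Builds the URL for Solr's /select handler.
    // baseUrl and core depend on your installation (assumed values are used in main).
    static String buildUrl(String baseUrl, String core, String query, int rows) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return baseUrl + "/" + core + "/select?q=" + q + "&rows=" + rows + "&wt=json";
    }

    // Sends the GET request and returns the raw JSON response body.
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumed local Solr, assumed core name and field name.
        String url = buildUrl("http://localhost:8983/solr", "collection1", "content:nutch", 10);
        System.out.println(fetch(url));
    }
}
```

With SolrJ you would not build the URL or handle the JSON yourself, which is the main reason to prefer it for anything beyond a quick test.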

Related

Can I crawl with Nutch, store in Cassandra, index using Solr?

I'm developing a keyword analytics app. I wish to crawl the web using Nutch, index the output using Solr and finally store the data in Cassandra.
I should later be able to do search queries and analytics on Solr and it must fetch the relevant data from Cassandra.
Is this setup possible? If yes, is there anything that I should keep in mind?
If you use DataStax's Cassandra distribution, indexing Cassandra tables into Solr is much easier. See http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-solr
I think you can, but I am not a Cassandra user, so I have never tried.
You will have to configure gora.properties (http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1/conf/gora.properties) to enable Cassandra. In the Nutch 2 Tutorial (http://wiki.apache.org/nutch/Nutch2Tutorial) they do that for HBase.
To see how the data is mapped in Cassandra, take a look at the mappings at http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1/conf/gora-cassandra-mapping.xml
Nutch will store the data in Cassandra. About Solr I don't know (I have never used Solr).
Programmatically it's possible: keep the same unique id in both Cassandra and Solr, get the matching ids from the Solr index, then fetch the full records from Cassandra by id.
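A rough sketch of that two-step lookup, with plain in-memory maps standing in for both systems. Nothing here talks to a real Solr or Cassandra: `searchIds` and `fetchRecords` are hypothetical helper names, and the only point is the shape of the flow (search returns ids, the store returns full records by id).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwoStepLookup {
    // Hypothetical stand-in for a Solr query: returns the ids of all
    // documents whose indexed text contains the search term.
    static List<String> searchIds(Map<String, String> index, String term) {
        List<String> ids = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Map.Entry<String, String> entry : index.entrySet()) {
            if (entry.getValue().toLowerCase().contains(needle)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    // Hypothetical stand-in for a Cassandra read by primary key.
    static List<String> fetchRecords(Map<String, String> store, List<String> ids) {
        List<String> records = new ArrayList<>();
        for (String id : ids) {
            String record = store.get(id);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();   // plays the role of Solr
        index.put("doc1", "apache nutch crawler");
        index.put("doc2", "cassandra storage");

        Map<String, String> store = new LinkedHashMap<>();   // plays the role of Cassandra
        store.put("doc1", "full record for doc1");
        store.put("doc2", "full record for doc2");

        List<String> ids = searchIds(index, "nutch");
        System.out.println(fetchRecords(store, ids));
    }
}
```

The thing to keep in mind is that the id has to be identical in both systems, otherwise the second step cannot find what the first step matched.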

Solr - can I use it for this?

Is Solr just for searching, i.e. not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at Solr as an alternative, I see you make your queries through HTTP requests.
My first thought was: how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding Solr, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data in index files on disk and does not provide the features of a relational database (ACID transactions, nested entities, etc.).
The usual model is to use a relational database for your data management and replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.
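As an illustration of what "securing the URLs" boils down to, here is a small sketch of the decision logic only: read requests pass, requests to the update handler pass only from trusted addresses. In practice this check normally lives in the servlet container, a reverse proxy or a firewall rather than in your own code. The class name, method names and IP addresses are made up for the example, and it only covers the /update handler, not admin endpoints.

```java
import java.util.Set;

class SolrAccessRule {
    // Solr's update handlers live under /update (for example /update or /update/json).
    static boolean isWritePath(String path) {
        return path.endsWith("/update") || path.contains("/update/");
    }

    // Reads are open to everyone; writes are allowed only from trusted client addresses.
    static boolean isAllowed(String path, String clientIp, Set<String> trustedIps) {
        return !isWritePath(path) || trustedIps.contains(clientIp);
    }

    public static void main(String[] args) {
        Set<String> trusted = Set.of("10.0.0.5");   // hypothetical indexing host
        System.out.println(isAllowed("/solr/collection1/select", "203.0.113.5", trusted));
        System.out.println(isAllowed("/solr/collection1/update", "203.0.113.5", trusted));
        System.out.println(isAllowed("/solr/collection1/update", "10.0.0.5", trusted));
    }
}
```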

Migrate data from Solr 3

I'm thinking about migrating from Solr 3 to SolrCloud or Elasticsearch and was wondering whether it is possible to import data indexed with Solr 3.x into SolrCloud (Solr 4) and/or Elasticsearch.
They're all Lucene based, but since they behave differently I'm not really sure that it will work.
Has anyone ever done this? How did it go? Any related issues?
Regarding importing data from Solr to Elasticsearch, you can take a look at the elasticsearch mock solr plugin. It adds a new Solr-like endpoint to Elasticsearch, so that you can use the indexer that you've written for Solr (if you have one) to index documents in Elasticsearch.
Also, I've been working on an Elasticsearch Solr river which would allow importing data from Solr to Elasticsearch through the SolrJ library. The only limitation is that it can import only the fields that you configured as stored in Solr. I should be able to make it public pretty soon, just a matter of days. I'll update my answer as soon as it's available.
Regarding the upgrade of Solr from 3.x to 4.0, not a big deal. The index format has changed, but Solr will take care of upgrading the index. That happens automatically once you start Solr with your old index. But after that the index cannot be read anymore by a previous Solr/lucene version. If you have a master/slave setup you should upgrade the slaves first, otherwise the index on the master would be replicated to the slaves which cannot read it yet.
UPDATE
Regarding the river that I mentioned: I made it public, you can download it from my github profile: https://github.com/javanna/elasticsearch-river-solr.

What is the role of NUTCH if we are going to make a search engine using Hadoop and Solr?

I want to build a search engine: crawl some sites, store their indexes and info in Hadoop, and then search with Solr.
But I am facing lots of issues. When I search on Google, different people give different suggestions and different ways of configuring a Hadoop-based search engine.
These are some of my questions:
1) How will the crawling be done? Is Nutch needed for the crawling or not? If yes, how do Hadoop and Nutch communicate with each other?
2) What is the use of Solr? If Nutch does the crawling and stores the crawled indexes and their information in Hadoop, what is the role of Solr?
3) Can we search using Solr and Nutch? If yes, where will they save the crawled indexes?
4) How does Solr communicate with Hadoop?
5) Please explain the steps one by one, if possible: how can I crawl some sites, save their info into a DB (Hadoop or any other), and then search it?
I am really stuck with this. Any help will be really appreciated.
A very big thanks in advance. :)
Please help me sort out this issue.
We are using Nutch as a web crawler and Solr for searching in some production environments, so I hope I can give you some information about 3).
How does this work? Nutch has its own crawl db and a set of seed websites where it starts crawling. It has plugins through which you can configure things like PDF crawling, which fields are extracted from HTML pages, and so on. While crawling, Nutch stores all links extracted from a page and follows them in the next cycle. All crawl results are stored in the crawl db. In Nutch you configure an interval after which crawled results are considered outdated, and the crawler then begins again from the defined seed sites.
The results inside the crawl db are synchronized to the Solr index, so you are searching on the Solr index. In this constellation Nutch is only there to get data from websites and provide it to Solr.
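To illustrate the cycle described above (fetch the current set of URLs, store the results, follow the extracted links in the next cycle), here is a toy model in Java. It is not Nutch code and does not use Nutch's API: the "web" is just an in-memory map from a URL to the links found on that page, the class and method names are invented, and the re-fetch interval is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ToyCrawl {
    // Runs up to `cycles` rounds. Each round fetches every URL in the current
    // frontier, records it in the crawl db, and queues the links found on it
    // for the next round. Returns the crawl db in fetch order.
    static Set<String> crawl(Map<String, List<String>> web, List<String> seeds, int cycles) {
        Set<String> crawlDb = new LinkedHashSet<>();
        List<String> frontier = new ArrayList<>(seeds);
        for (int i = 0; i < cycles && !frontier.isEmpty(); i++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                if (!crawlDb.add(url)) {
                    continue;                       // already fetched in an earlier step
                }
                for (String link : web.getOrDefault(url, List.of())) {
                    if (!crawlDb.contains(link)) {
                        next.add(link);             // follow this link in the next cycle
                    }
                }
            }
            frontier = next;
        }
        return crawlDb;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new LinkedHashMap<>();
        web.put("a", List.of("b", "c"));
        web.put("b", List.of("d"));
        web.put("c", List.of("a"));
        // After two cycles the seed and the pages it links to have been fetched.
        System.out.println(crawl(web, List.of("a"), 2));
    }
}
```

In the real setup the contents of the crawl db would then be pushed to the Solr index, which is the part you actually search.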

Nutch versus Solr

I am currently collecting information on when I should use Nutch with Solr (domain: vertical web search).
Could you advise me?
Nutch is a framework for building web crawlers and search engines. Nutch can do the whole process from collecting the web pages to building the inverted index. It can also push those indexes to Solr.
Solr is mainly a search engine with support for faceted searches and many other neat features. But Solr doesn't fetch the data, you have to feed it.
So maybe the first thing you have to ask in order to choose between the two is whether or not you already have the data to be indexed available (in XML, in a CMS or in a database). In that case, you should probably just use Solr and feed it that data. On the other hand, if you have to fetch the data from the web, you are probably better off with Nutch.
