Best Tika integration on Solr or Nutch - solr

What is the best way to integrate Apache Tika, given that I already have Nutch (2.2.1) and Solr (4.3) connected and working?
I understand that Tika can be integrated with Nutch and/or Solr, but which is the better choice?

Set up the Tika plugin in Nutch. Nutch will then parse the data for you and do all the hard work.
I would suggest setting it up in Solr as well: if you ever want to send documents to Solr directly, for example with curl, it helps to have extraction available there too. It takes little extra configuration and has no performance cost.
There is a guide to setting up Tika and the ExtractingRequestHandler here
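As a rough illustration of sending a document to Solr's ExtractingRequestHandler without curl, the sketch below builds the request using only the JDK (Java 11+). It assumes the handler is mapped to `/update/extract` and that the unique key field is `id`. The base URL, document id, and class and method names are hypothetical placeholders.

```java
import java.io.FileNotFoundException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

final class SolrExtractSketch {

    // Builds the URI for the extract handler. "literal.id" sets the document's
    // id field and "commit=true" makes the document searchable straight away.
    // Assumption: the handler is registered at /update/extract in solrconfig.xml.
    static URI extractUri(String baseUrl, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "/update/extract?literal.id=" + id + "&commit=true");
    }

    // Builds (but does not send) a POST request whose body is the raw file.
    // Solr hands the body to Tika, which extracts the text and metadata.
    static HttpRequest extractRequest(String baseUrl, String docId, Path file, String contentType)
            throws FileNotFoundException {
        return HttpRequest.newBuilder(extractUri(baseUrl, docId))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. For example, `extractUri("http://localhost:8983/solr", "doc 1")` yields `http://localhost:8983/solr/update/extract?literal.id=doc+1&commit=true`.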

Apply the Tika parser in Nutch's parsing phase.

Related

In which type of project can I add a Java program for Apache Solr in Eclipse?

I am going to use Apache Solr for the first time. I just want to write some basic code for Apache Solr in Java with Eclipse. Can you please suggest how to get familiar with Apache Solr?
Basically, Solr is a full-text search engine. You can use it as a data store for unstructured data. You can use SolrJ to communicate with Solr from Java.
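SolrJ is the usual route, but it needs an extra dependency on the classpath. As a dependency-free sketch of what a client does underneath, the example below queries Solr's HTTP `/select` endpoint with the JDK's own HTTP client (Java 11+). This is plain HTTP, not SolrJ. The core URL, the query, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class SolrSelectSketch {

    // Builds a query URI for the standard /select handler, asking for JSON.
    // coreUrl is an assumption, e.g. "http://localhost:8983/solr/collection1".
    static URI selectUri(String coreUrl, String query, int rows) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(coreUrl + "/select?q=" + q + "&rows=" + rows + "&wt=json");
    }

    // Sends the query and returns the raw JSON response body.
    static String search(String coreUrl, String query, int rows)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(selectUri(coreUrl, query, rows)).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

In Eclipse this fits in an ordinary Java project. With a Solr instance running, `SolrSelectSketch.search("http://localhost:8983/solr/collection1", "title:tika", 5)` would return the matching documents as a JSON string.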

Parse microdata using apache tika plugin on apache nutch

my objective is to
- crawl on urls and
- extract micro data and
- save to solr
I used this guide to set up Nutch, HBase, and Solr.
I'm using Nutch with HBase to crawl the URLs, and the Tika plugin for Nutch to parse the pages, but it only extracts metadata.
Did I miss something in the configuration? Please guide me or suggest alternatives.
You need to implement your own ParseFilter and put the extraction logic there. The filter receives a DocumentFragment generated by the Tika parser, and you could use e.g. XPath on it to get the microdata.
Note that the DOM generated by Tika is heavily normalised and modified, so your XPath expressions might not match. It may be better to rely on the old HTML parser instead.
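The XPath step could look roughly like the sketch below. It uses only the JDK's DOM and XPath APIs and leaves out the Nutch ParseFilter wiring, so the class and method names are hypothetical. Because it matches on the `itemprop` attribute rather than on element names, it is less sensitive to how the parser normalised the tags. Reading the value from a `content` attribute or else from the element text is a simplification of the microdata rules.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MicrodataXPathSketch {

    // Collects "itemprop=value" pairs from any DOM node, for example the
    // DocumentFragment that a parse filter receives.
    static List<String> extractItemProps(Node root) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every element carrying an itemprop attribute, whatever its name.
        NodeList hits = (NodeList) xpath.evaluate(".//*[@itemprop]", root, XPathConstants.NODESET);
        List<String> props = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            Element el = (Element) hits.item(i);
            // Simplification: prefer the content attribute, else use the text.
            String value = el.hasAttribute("content")
                    ? el.getAttribute("content")
                    : el.getTextContent().trim();
            props.add(el.getAttribute("itemprop") + "=" + value);
        }
        return props;
    }

    // Helper for trying the sketch out: parses a well-formed XHTML snippet
    // and moves its root element into a DocumentFragment.
    static DocumentFragment fragmentOf(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        DocumentFragment fragment = doc.createDocumentFragment();
        fragment.appendChild(doc.getDocumentElement());
        return fragment;
    }
}
```

Inside a real parse filter, the extracted pairs would be added to the parse metadata so that they can be mapped to Solr fields at indexing time.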
A more generic approach would be to use Apache Any23, as is done for instance in this storm-crawler module.
BTW, there is an open JIRA for a MicroDataHandler in Tika which hasn't been committed yet.
HTH

What are the benefits of applying Apache Tika to Solr instead of Nutch

I am trying to crawl data with Apache Nutch and index it with Apache Solr.
As part of this I want to parse the content as well. I am trying to figure out whether it is better to apply Tika in Nutch, in Solr, or in both.
Apply it as early as you can, but make sure to keep the original, full-fidelity document somewhere as well.
There is no point passing a binary file around if you know that in the end you are going to reduce it to a set of metadata fields and get rid of the rest.

Solr+Nutch+AjaxSolr query

1) I followed https://github.com/evolvingweb/ajax-solr/wiki/reuters-tutorial to set up AJAX Solr.
AJAX Solr is running, but it only searches the Reuters data. If I want to crawl the web using Nutch and integrate it with Solr, I have to replace Solr's schema.xml with Nutch's schema.xml, which will not match the AJAX Solr configuration. After replacing schema.xml, AJAX Solr won't work (correct me if I am wrong).
How can I integrate Solr with Nutch along with AJAX Solr, so that AJAX Solr can search other data from the web as well?
2) Are there any front-end APIs for Solr searching, other than AJAX Solr, that would help in searching the crawled web efficiently?
Look at Solr with multiple cores; it's better not to mix documents of a different nature in one collection.
There are many client APIs for Solr, such as SolrJ for Java (http://wiki.apache.org/solr/Solrj), SolPHP for PHP (http://wiki.apache.org/solr/SolPHP) and so on.

How to configure Apache Tika with apache Solr 1.4.1

I want to index a large number of PDF documents.
I have found a reference showing that this can be done using Apache Tika, but unfortunately I cannot find any reference that describes how to configure Apache Tika in Solr 1.4.1.
Once I do have it configured, how can I send documents to Solr directly without using curl?
I am using SolrNet for indexing.
See ExtractingRequestHandler
Support for ExtractingRequestHandler in SolrNet is not yet complete. You can either finish implementing it, or work around it and craft your own HttpWebRequests.

Resources