Which is the best integration point for Apache Tika, assuming that I have already connected and used Nutch (2.2.1) + Solr (4.3)?
I understand that Tika can be integrated within Nutch and/or Solr, but which one is the best decision?
Set up the Tika plugin in Nutch; Nutch will then parse the documents during the crawl and do the hard work for you.
I would suggest setting it up on Solr as well: you may wish to send documents to Solr directly (e.g. via the curl command), and having Tika on the Solr side covers that case too. It requires little extra configuration and has no performance cost when unused:
There is a guide to setting up Tika and the ExtractingRequestHandler here
Apply the Tika parser in Nutch's parsing phase.
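To enable Tika parsing in Nutch, the usual step is to include `parse-tika` in the `plugin.includes` property of `nutch-site.xml`. A sketch (the exact plugin list depends on your crawl; the set below is a typical minimal example, adjust to your setup):

```xml
<!-- nutch-site.xml: enable the Tika parser alongside the HTML parser.
     The plugin list is illustrative; keep the plugins your crawl needs. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```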
I am going to use Apache Solr for the first time. I just want to write some basic code for Apache Solr in Java with Eclipse. Can you please suggest how to get familiar with Apache Solr?
Basically, Solr is a full-text search engine. You can use it as a data store to keep data of an unstructured nature, and you can use SolrJ to communicate with Solr from Java.
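As a starting point, here is a minimal SolrJ sketch for the Solr 4.x line: it indexes one document and queries it back. It assumes the `solr-solrj` jar on your classpath and a Solr core running at the URL below; the URL and field names are illustrative and must match your own schema.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrHello {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr 4.x instance running locally; adjust the URL to your core.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        // Index one document (field names must exist in your schema.xml).
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Hello Solr");
        solr.add(doc);
        solr.commit();

        // Query it back.
        QueryResponse rsp = solr.query(new SolrQuery("title:Hello"));
        System.out.println(rsp.getResults().getNumFound() + " hit(s)");
    }
}
```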
My objective is to:
- crawl URLs,
- extract microdata, and
- save it to Solr
I used this guide to set up Nutch, HBase, and Solr.
I'm using Nutch to crawl URLs with HBase as the storage backend, and the Tika plugin for Nutch to parse pages, but it only extracts metadata.
Did I miss something in the configuration? Please guide me or suggest alternatives.
You need to implement your own ParseFilter and put the extraction logic there. You will get the DocumentFragment generated by the Tika parser and could use e.g. XPath to get the microdata.
Note that the DOM generated by Tika is heavily normalised/modified, so your XPath expressions may not match. It may be better to rely on the old HTML parser instead.
One generic way of doing this would be to use Apache Any23, as done for instance in this storm-crawler module.
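To illustrate the XPath approach in isolation, here is a self-contained sketch using only the JDK's built-in DOM and XPath support. The class name and sample HTML fragment are hypothetical stand-ins for the DocumentFragment a parser would hand your ParseFilter; in Nutch you would run the same XPath against that fragment instead of parsing a string.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class MicrodataSketch {
    public static void main(String[] args) throws Exception {
        // Well-formed fragment standing in for the DOM the parser produces.
        String html = "<div itemscope=\"\" itemtype=\"http://schema.org/Person\">"
                    + "<span itemprop=\"name\">Jane Doe</span>"
                    + "<span itemprop=\"jobTitle\">Engineer</span></div>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

        // Select every element carrying an itemprop attribute.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList props = (NodeList) xpath.evaluate("//*[@itemprop]", doc, XPathConstants.NODESET);
        for (int i = 0; i < props.getLength(); i++) {
            String key = props.item(i).getAttributes().getNamedItem("itemprop").getNodeValue();
            String value = props.item(i).getTextContent();
            System.out.println(key + " = " + value);
        }
    }
}
```

Keep in mind the caveat above: Tika's normalisation may rewrite the tree, so test your expressions against the actual fragment you receive.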
BTW, there is an open JIRA issue for a MicroDataHandler in Tika which hasn't been committed yet.
HTH
I am trying to crawl data with Apache Nutch and index it with Apache Solr.
As part of this I want to parse the content as well. I am trying to figure out whether it is better to apply Tika in Nutch, in Solr, or both.
Apply it as early as you can, but make sure to keep the original, full-fidelity document somewhere as well.
There is no point passing a binary file around if you know that in the end you will reduce it to a set of metadata fields and discard the rest.
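One way to follow that advice on the Solr side is to store the raw bytes alongside the extracted text. A schema.xml sketch (the field names and types are illustrative, not a prescribed layout):

```xml
<!-- schema.xml sketch: index the extracted text, and also store the raw
     document so the full-fidelity original can be recovered later. -->
<field name="content" type="text_general" indexed="true" stored="true"/>
<field name="raw_content" type="binary" indexed="false" stored="true"/>
```

Alternatively, keep the originals outside Solr (filesystem, HDFS) and store only a pointer; the point is that extraction should be early, but not destructive.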
1) I referred to https://github.com/evolvingweb/ajax-solr/wiki/reuters-tutorial for the Ajax-Solr setup.
Although Ajax-Solr is running, it only searches the Reuters data. If I want to crawl the web using Nutch and integrate it with Solr, I have to replace Solr's schema.xml file with Nutch's schema.xml file, which will not match the Ajax-Solr configuration. By replacing the schema.xml files, Ajax-Solr won't work (correct me if I am wrong).
How would I integrate Solr with Nutch along with Ajax-Solr, so that Ajax-Solr can search other data on the web as well?
2) Are there any front-end APIs for Solr searching, other than Ajax-Solr, that would help in efficiently searching the crawled web?
Look at Solr with multiple cores; it's better not to mix documents of a different nature in one collection.
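With the legacy multicore layout of the Solr 4.x era, that means declaring one core per document type in `solr.xml`, each with its own `schema.xml`, so the Nutch crawl does not disturb the Reuters setup. A sketch (core names and directories are illustrative):

```xml
<!-- solr.xml sketch: one core per collection, each with its own
     conf/schema.xml under its instanceDir. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="reuters" instanceDir="reuters"/>
    <core name="nutch" instanceDir="nutch"/>
  </cores>
</solr>
```

Ajax-Solr can then be pointed at the core whose schema it was configured for, e.g. `http://localhost:8983/solr/reuters/select`.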
There are many APIs for Solr, such as SolrJ for Java (http://wiki.apache.org/solr/Solrj), SolPHP for PHP (http://wiki.apache.org/solr/SolPHP), and so on.
I want to index a large number of pdf documents.
I have found a reference showing that it can be done using Apache Tika, but unfortunately I cannot find any reference that describes how to configure Apache Tika in Solr 1.4.1.
Once I have it configured, how can I send documents to Solr directly without using curl?
I am using SolrNet for indexing.
See ExtractingRequestHandler
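The handler is registered in `solrconfig.xml`; a sketch along the lines of the stock example config (the field mappings under `defaults` are illustrative and should match your schema):

```xml
<!-- solrconfig.xml: register the extracting request handler (Solr Cell / Tika).
     Requires the solr-cell and Tika jars on the lib path. -->
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

Whatever client you use (SolrNet, hand-rolled HttpWebRequests) then posts the PDF as a multipart request to `/update/extract`, the same endpoint a curl upload would hit.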
Support for ExtractingRequestHandler in SolrNet is not yet complete. You can either finish implementing it, or work around it and craft your own HttpWebRequests.