My objective is to
- crawl URLs,
- extract microdata, and
- save it to Solr.
I used this guide to set up Nutch, HBase, and Solr.
I'm using Nutch to crawl URLs with HBase as the storage backend, and I'm using the Tika plugin for Nutch to parse pages, but it only gets metadata.
Did I miss something in the configuration? Please guide me or suggest alternatives.
You need to implement your own ParseFilter and put the extraction logic there. You will get the DocumentFragment generated by the Tika parser and could use e.g. XPath to get the microdata.
Note that the DOM generated by Tika is heavily normalised/modified, so your XPath expressions might not match. It may be better to rely on the old HTML parser instead.
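Here is a minimal sketch of such a filter, assuming the Nutch 2.x (Gora-based) ParseFilter interface; the class name, the "microdata" metadata key, and the XPath expression are illustrative, not part of Nutch:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative ParseFilter that collects itemprop/value pairs from the DOM.
public class MicrodataParseFilter implements ParseFilter {

  private Configuration conf;

  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      // Select every element carrying a microdata itemprop attribute.
      NodeList nodes = (NodeList) xpath.evaluate("//*[@itemprop]", doc,
          XPathConstants.NODESET);
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < nodes.getLength(); i++) {
        Element el = (Element) nodes.item(i);
        sb.append(el.getAttribute("itemprop")).append('=')
          .append(el.getTextContent().trim()).append('\n');
      }
      // Stash the pairs in the page metadata; an IndexingFilter can later
      // turn them into Solr fields. The key name is an assumption.
      page.putToMetadata(new Utf8("microdata"),
          ByteBuffer.wrap(sb.toString().getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      // Do not fail the whole parse because of the filter.
    }
    return parse;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    return null;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}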
One generic way of doing this would be to use Apache Any23, as done for instance in this storm-crawler module.
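For reference, the basic Any23 extraction loop looks roughly like this (a sketch after the Any23 getting-started docs; the sample HTML and base URI are placeholders, and in a crawler you would feed in the fetched page content instead):

import java.io.ByteArrayOutputStream;

import org.apache.any23.Any23;
import org.apache.any23.source.DocumentSource;
import org.apache.any23.source.StringDocumentSource;
import org.apache.any23.writer.NTriplesWriter;
import org.apache.any23.writer.TripleHandler;

public class Any23Demo {
  public static void main(String[] args) throws Exception {
    Any23 runner = new Any23();
    // Placeholder page containing one schema.org microdata item.
    String html = "<div itemscope itemtype=\"http://schema.org/Person\">"
        + "<span itemprop=\"name\">Jane Doe</span></div>";
    DocumentSource source = new StringDocumentSource(html, "http://example.com/");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    TripleHandler handler = new NTriplesWriter(out);
    try {
      runner.extract(source, handler);  // runs all registered extractors
    } finally {
      handler.close();
    }
    System.out.println(out.toString("UTF-8"));  // extracted triples
  }
}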
BTW there is an open JIRA issue for a MicroDataHandler in Tika which hasn't been committed yet.
HTH
Related
I am going to use Apache Solr for the first time. I just want to write some basic code for Apache Solr in Java with Eclipse. Can you please suggest how to get familiar with Apache Solr?
Basically, Solr is a full-text search engine. You can use it as a data store to keep data of an unstructured nature. You can use SolrJ to communicate with Solr from Java.
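To give you a flavour of SolrJ, here is a minimal index-and-query sketch against the SolrJ 4.x API (HttpSolrServer; newer releases use HttpSolrClient instead). The URL, core name, and field names are assumptions about your setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrJHelloWorld {
  public static void main(String[] args) throws Exception {
    // URL and core name are assumptions; adjust to your install.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // Index one document (field names must exist in your schema.xml).
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("title", "Hello Solr");
    solr.add(doc);
    solr.commit();

    // Query it back.
    QueryResponse rsp = solr.query(new SolrQuery("title:solr"));
    for (SolrDocument d : rsp.getResults()) {
      System.out.println(d.getFieldValue("id") + " -> " + d.getFieldValue("title"));
    }
  }
}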
I am trying to crawl data with Apache Nutch and index it with Apache Solr.
As part of this I want to parse the content as well. I am trying to figure out whether it is better to apply Tika in Nutch, in Solr, or in both.
Apply it as early as you can, but make sure to keep the original, full-fidelity document somewhere as well.
There is no point passing a binary file around if you know that in the end you are going to reduce it to a set of metadata fields and get rid of the rest.
I have managed to get Apache Nutch to index a news website and pass the results off to Apache Solr.
I used this tutorial:
https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup (the only difference is that I have decided to use Cassandra instead).
As a test I am trying to crawl CNN, to extract the titles of articles and the dates they were published.
Question 1:
How do I parse data from the webpage to extract the date and the title?
I have found this article about a plugin. It seems a bit outdated and I am not sure that it still applies. I have also read that Tika can be used as well, but again most tutorials are quite old.
http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/
Another SO article is this:
How to extend Nutch for article crawling. I would prefer to use Nutch, only because that is what I have started with, but I do not really have a strong preference.
Anything would be a great help.
Norconex HTTP Collector will store with your document all possible metadata it can find, without restriction. That ranges from the HTTP header values obtained when downloading a page to all the tags in that HTML page.
That may well be too many fields for you. If so, you can reject the ones you do not want, or instead be explicit about the ones you want to keep by adding a "KeepOnlyTagger" to the <importer> section of your configuration:
<tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
        fields="title,pubdate,anotherone,etc"/>
You'll find how to get started quickly along with configuration options here: http://www.norconex.com/product/collector-http/configuration.html
Which is the best integration for Apache Tika, assuming that I have already connected and used Nutch (2.2.1) + Solr (4.3)?
I understand that Tika can be integrated within Nutch and/or Solr, but which one is the best choice?
Set up the Tika plugin with Nutch; Nutch will parse the data and do all the hard work for you.
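Enabling it usually comes down to making sure parse-tika is listed in the plugin.includes property of nutch-site.xml. The value below is illustrative; check it against the defaults in your nutch-default.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>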
I would suggest setting it up on Solr as well: you may wish to send documents to Solr directly (for example via the curl command), and it helps to have it there too. It comes with little extra configuration and no performance cost.
There is a guide to setting up Tika and the extracting request handler here.
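As an illustration, once the extracting request handler is enabled, a raw file can also be posted to it from Java through SolrJ (a sketch; /update/extract is Solr's default handler path, while the file name, content type, and literal.id value are placeholders):

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractDemo {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

    // Post the raw file to Solr's extracting request handler; Tika runs
    // inside Solr and turns it into indexable fields.
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("document.pdf"), "application/pdf");
    req.setParam("literal.id", "doc1");  // unique key for the new document
    req.setParam("commit", "true");      // commit immediately (fine for a demo)

    solr.request(req);
  }
}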
Apply the Tika parser in Nutch's parsing phase.
Recently I got involved in a task, part of which requires using Apache Solr (for document search) and Apache Tika (to extract the metadata or plain text from documents).
I haven't integrated Solr and Tika yet, but I have worked with both of them individually, so I have a set of questions related to Apache Solr and Apache Tika; they may be at a beginner or intermediate level.
These are the kinds of practical things I did with Solr: created a dummy database, wrote a program, configured schema.xml and the related files, ran the Solr server, wrote a program that fetches documents from the database and stores them in the Solr document index, made a simple client to fetch data from Solr via the JSON interface, and made a program that keeps a MySQL database in sync with Apache Solr's document index.
These are the kinds of practical things I did with Tika: compiled and installed Tika, and explored its document parsing capabilities.
My sample task statement:
Part of my project requires storing around 100,000 documents (the data in these 100,000 DOC, PDF, and TXT files is extracted by Apache Tika, pushed to a MySQL database, and later pushed to Apache Solr's document index) for full-text search, and searching them via a client interface (a browser).
At a simple programming level this task will get done.
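For the Tika extraction step I have something like this minimal sketch in mind (the file name is a placeholder, and in the real task the extracted text would go into MySQL rather than to stdout):

import java.io.File;

import org.apache.tika.Tika;

public class ExtractText {
  public static void main(String[] args) throws Exception {
    Tika tika = new Tika();
    File f = new File("document.pdf");  // placeholder; DOC, PDF, TXT all work

    // detect() sniffs the media type; parseToString() extracts plain text.
    System.out.println(tika.detect(f));
    System.out.println(tika.parseToString(f));
  }
}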
I would like to understand the challenges related to managing the index, or anything else in Solr, e.g.:
** At an advanced level, does it require modifying Solr's open-source code?
** Even when Solr is working properly, does it present any specific challenges?
** What key things need to be considered initially so that Solr works properly?
** Do you think any extra tool needs to be developed to monitor Solr's operation?
I hope that gives you an idea of the questions I have.
** Also, I would like to know if you have any experience of using Apache Tika with Apache Solr, and any challenges or key things to consider.
Would you recommend any specific sources, or do you have any documents or anything else which you feel would be helpful?