Nutch querying on the fly - Solr

I am a newbie to Nutch and Solr. Well, relatively much newer to Solr than to Nutch :)
I have been using Nutch for the past two weeks, and I wanted to know whether I can query or search my Nutch crawls on the fly (before the crawl completes). I am asking because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the Nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible.
I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. I see only the injected URLs in the Solr search results. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. But I did all the steps mentioned in the link. I think somewhere in the process a crawl should be happening, and that is the step being missed.
I just wanted to see if someone could help point out where I went wrong in the process. Forgive my foolishness and thanks for your patience.
Cheers,
Abi

This is not possible.
What you could do, though, is chunk the crawl cycle into smaller batches of URLs so that it publishes results more often, using this command:
nutch generate crawl/crawldb crawl/segments -topN <the limit>
If you are using the one-stop crawl command, it should be the same.
I typically use a 24-hour chunking scheme.
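For illustration, here is a rough sketch of one such chunk in a Nutch 1.x crawl cycle; the crawl/ directory layout, the -topN limit of 1000, and the local Solr URL are assumptions for the example, not values from the answer:

#!/bin/sh
# One chunk: generate a limited batch, fetch and parse it, update the
# crawldb, then push everything indexed so far to Solr.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=$(ls -d crawl/segments/* | tail -1)   # the segment just generated
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
# Repeat this block (e.g. from cron) until the crawl is as deep as you need.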

Related

Solr reindex is stopping prematurely when running Collective Solr for Plone

My team is working on a search application for our websites. We are using Collective Solr in Plone to index our intranet and documentation sites. We recently set up shared blob storage on our test instance of the intranet site because Solr was not indexing our PDF files. This appears to be working; however, each time I run the reindexing script (@@solr-maintenance/reindex) it stops after about an hour and a half. I know that it is not indexing our entire site, as numerous pages, files, etc. are missing when I run a query in the Solr dashboard.
The warning below is the last thing I see in the Solr log before the script stops. I am very new to Solr so I'm not sure what it indicates. When I run the same script on our documentation site, it completes without error.
2017-04-14 18:05:37.259 WARN (qtp1989972246-970) [ ] o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_284]
java.nio.file.NoSuchFileException: /var/solr/data/uvahealthPlone/data/index/segments_284
I'm hoping someone out there might have more experience with Collective Solr for Plone and could recommend some good resources for debugging this issue. I've done a lot of searching lately but haven't found much useful info.
This was a bug fixed some time ago with https://github.com/collective/collective.solr/pull/122

Parse data with Tika for Apache Solr

I have managed to get Apache Nutch to index a news website and pass the results off to Apache Solr.
I used this tutorial:
https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup. The only difference is that I have decided to use Cassandra instead.
As a test I am trying to crawl CNN, to extract the title of each article and the date it was published.
Question 1:
How do I parse data from the webpage to extract the date and the title?
I have found this article about a plugin. It seems a bit outdated and I am not sure it still applies. I have also read that Tika can be used, but again most tutorials are quite old.
http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/
Another SO article is this:
How to extend Nutch for article crawling. I would prefer to use Nutch, only because that is what I have started with, but I do not really have a strong preference.
Anything would be a great help.
Norconex HTTP Collector will store with your document all the metadata it can find, without restriction. That ranges from the HTTP header values obtained when downloading a page to all the tags in that HTML page.
That may well be too many fields for you. If so, you can reject the ones you do not want or, instead, be explicit about the ones you want to keep by adding a "KeepOnlyTagger" to the <importer> section of your configuration:
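<!-- Keep only the listed fields; all other extracted metadata is dropped. -->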
<tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
        fields="title,pubdate,anotherone,etc"/>
You'll find how to get started quickly, along with configuration options, here: http://www.norconex.com/product/collector-http/configuration.html

How to make a search engine with Nutch and Cassandra?

I am trying to implement a website search engine in Java as an applet. I have used Nutch as the web crawler and Cassandra as my database; I have to use a NoSQL database (because my teacher wants me to). Now my question is: what should I do next to complete my search engine?
I have googled a lot, but the sites are mostly about Nutch and Solr, and they build search engines by integrating those two, because Solr itself is somehow a database. I don't know what I should do. Do I have to use Solr too to complete my search engine? Is it wise to use two databases (Solr and Cassandra)? Or should I do something else?
Please remember that I have to use Cassandra.
And please, first explain to me if I have understood things the wrong way, and only then give me a minus mark :D
I will be really, really thankful for your help; I have gotten somewhat confused.
By the way, does Solr count as a NoSQL database? Excuse me, I am new to all of this.
Check out Solr's Data Import Handler and see if you feel it would work. It allows you to query your database and store the results in Solr, which Solr can then manipulate. Nutch also has very good integration with Solr, should you choose to use it.
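As a quick illustration, once a Data Import Handler is configured for a core, an import can be triggered and the results queried over plain HTTP. The core name mycore and the query below are made up for the example:

# Trigger a full import through the Data Import Handler endpoint
curl 'http://localhost:8983/solr/mycore/dataimport?command=full-import'
# Check on the import's progress
curl 'http://localhost:8983/solr/mycore/dataimport?command=status'
# Search the documents that were imported
curl 'http://localhost:8983/solr/mycore/select?q=title:cassandra&wt=json'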

How to configure Nutch and Solr in Ubuntu 10.10?

I'm trying to build a search engine for my final year project. I have done lots of research on this topic in the last 2 months.
And I found that I will need a crawler to crawl the Internet, a parser, and an indexer.
I am trying to use Nutch as the crawler and Solr to index the data crawled by Nutch. But I am stuck on the installation of both of them. I am trying to install Nutch and Solr on my system with the help of tutorials on the internet, but nothing has worked for me.
I need some kind of installation guide, or a link where I can learn how to install and integrate Nutch and Solr.
Next, I am stuck with the parser. I have no idea about this phase, and I need help on how to parse the data before indexing.
I don't want to build Google or something. All I need is certain items from certain websites to be searched.
I have Java experience and I can work with it comfortably, but I am not a professional like you guys. Please do tell me whether I am going in the right direction or not, and what I should do next.
I am using Ubuntu 10.10, and I have Apache Tomcat 7.
This is for Nutch installation and this is for integration with Solr.
Regarding the parsers: Nutch has its own set of parsers, so you don't have to bother about parsing. Trigger the crawl command and it's done automatically. Unless you want to parse things beyond what Nutch provides, it won't be an issue for you. If you want Nutch to parse some .xyz files, then you need to write parser plugins for that and integrate them with Nutch.
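For reference, here is a minimal sketch of that one-stop crawl command in Nutch 1.x; the urls/ seed directory, depth, topN, and Solr URL are example values, not ones from the answer:

# Crawl from the seed list, parse along the way, and index straight into Solr
bin/nutch crawl urls -solr http://localhost:8983/solr/ -dir crawl -depth 3 -topN 50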

Simple Nutch 1.3/Solr index explanation

After much searching, it doesn't seem like there's any straightforward explanation of how to use Nutch 1.3 with Solr.
I have a Solr index with other content in it that I'll be using on a website for search.
I'd like to add Nutch results to the index, which will add external sites to the website's search.
All of this is working just fine.
The question is, how do you freshen the index? Do you have to delete all of the Nutch results from Solr first? Or does Nutch take care of that? Does Nutch remove results that are no longer valid from the Solr index?
Shell scripts with no documentation or explanation of what they are doing haven't been helpful with answering these questions.
The Nutch schema defines id (= url) as the unique key. If you re-crawl a URL, the document will be replaced in the Solr index when Nutch posts the data to Solr.
Well, you need to implement incremental crawling in Nutch... which depends on your application. Some people want to recrawl every day, others every 3 months. The max is 90 days in any case (Nutch's default db.fetch.interval.max).
The general idea is to delete crawl segments that are older than your maximum recrawl interval, since they will be redundant by then, and to produce a fresh solrindex for use in Solr.
I'm afraid you have to do that yourself in scripting. One day I may put some scripts I wrote for this on the wiki, but they are not ready to publish as they stand.
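A rough sketch of what such a script might look like, assuming a 90-day maximum recrawl interval, segments under crawl/segments, and a local Solr URL (all of these are assumptions):

# Remove segments older than the maximum recrawl interval; every URL in
# them should have been re-fetched into a newer segment by now.
find crawl/segments -mindepth 1 -maxdepth 1 -type d -mtime +90 -exec rm -rf {} \;
# Rebuild the link database and push a fresh index to Solr.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*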
Try Lucidworks' enterprise Solr for testing/prototyping; it has a built-in web crawler.
http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise
It'll give you a feel for the whole Lucene stack. It has a MUCH better interface than any other Java software I've ever used. It's a joy to use.
