How to configure Nutch to avoid crawling nonsense calendar webpages

I am using Nutch to index a website. I notice that Nutch has crawled some junk webpages, such as http://******/category/events/2015-11. This webpage lists the events for November 2015, which is complete nonsense for me. I want to know whether Nutch can intelligently skip such webpages. It may be argued that I could use a regex to avoid this, but since calendar webpages do not all follow the same naming pattern, there is no way to write a perfect regex for them. I know Heritrix (the Internet Archive's crawler) is able to avoid crawling nonsense calendar webpages. Has anyone solved this issue?

There is no way to do this other than regex URL filtering. You can keep adding new patterns to the regex filter file whenever you see an undesired page make it into the crawled content.
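As a rough sketch, assuming the calendar pages end in a year-month path segment like /events/2015-11 or use month/year query parameters (both assumptions), exclusion rules along these lines could go into conf/regex-urlfilter.txt; the patterns are illustrative and will need tuning for your site:

# Skip paths ending in a year-month segment, e.g. /category/events/2015-11
-.*/(19|20)\d{2}-(0?[1-9]|1[0-2])/?$
# Skip query strings that page through calendar months or years (assumed parameter names)
-.*[?&](month|year|calendar)=\d+
# Accept everything else
+.

Rules are applied top to bottom; the first matching +/- pattern decides whether a URL is kept or ignored.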

Related

How to discourage scraping on a Drupal website?

I have a Drupal website that has a ton of data on it. However, people can quite easily scrape the site, because Drupal classes and IDs are pretty consistent.
Is there any way to "scramble" the code to make it harder to use something like PHP Simple HTML Dom Parser to scrape the site?
Are there other techniques that could make scraping the site a little harder?
Am I fighting a lost cause?
I am not sure if "scraping" is the official term, but I am referring to the process by which people write a script that "crawls" a website and parses sections of it in order to extract data and store it in their own database.
First, I'd recommend googling for web scraping and anti-scraping; you'll find some tools for fighting web scraping.
As for Drupal, there should be some anti-scraping plugins available (google for them).
You might also be interested in my categorized overview of anti-scraping techniques in another answer. It covers technical as well as non-technical users.
I am not sure, but I think it is quite easy to crawl a website where all content is public, no matter whether the IDs are sequential or not. You should take into account that if a human can read your Drupal site, a script can too.
Depending on your site's nature, if you don't want your content to be indexed by others, you should consider restricting access to registered users. Otherwise, I think you are fighting a lost cause.

Parse data with Tika for Apache Solr

I have managed to get Apache Nutch to index a news website and pass the results off to Apache Solr, using this tutorial: https://github.com/renepickhardt/metalcon/wiki/simpleNutchSolrSetup. The only difference is that I have decided to use Cassandra instead.
As a test I am trying to crawl CNN to extract each article's title and the date it was published.
Question 1:
How do I parse data from the webpage to extract the date and the title?
I have found this article about a plugin. It seems a bit outdated and I am not sure it still applies. I have also read that Tika can be used as well, but again most tutorials are quite old.
http://www.ryanpfister.com/2009/04/how-to-sort-by-date-with-nutch/
Another SO question is this:
How to extend Nutch for article crawling. I would prefer to use Nutch, only because that is what I started with, but I do not really have a strong preference.
Anything would be a great help.
Norconex HTTP Collector will store with your document all possible metadata it can find, without restriction. That ranges from the HTTP header values obtained when downloading a page to all the tags in that HTML page.
That may well be too many fields for you. If so, you can reject the ones you do not want or, instead, be explicit about the ones you want to keep by adding a "KeepOnlyTagger" to the <importer> section of your configuration:
<tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
        fields="title,pubdate,anotherone,etc"/>
You'll find how to get started quickly along with configuration options here: http://www.norconex.com/product/collector-http/configuration.html

Simple Nutch 1.3/Solr index explanation

After much searching, it doesn't seem like there's any straightforward explanation of how to use Nutch 1.3 with Solr.
I have a Solr index with other content in it that I'll be using on a website for search.
I'd like to add Nutch results to the index, which will add external sites to the website's search.
All of this is working just fine.
The question is, how do you freshen the index? Do you have to delete all of the Nutch results from Solr first? Or does Nutch take care of that? Does Nutch remove results that are no longer valid from the Solr index?
Shell scripts with no documentation or explanation of what they are doing haven't been helpful with answering these questions.
The Nutch schema defines id (= url) as the unique key. If you re-crawl the URL, the document will be replaced in the Solr index when Nutch posts the data to Solr.
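For reference, the relevant lines in the Solr schema that Nutch ships look roughly like this (a sketch; attribute details may differ between Nutch versions, so check the schema.xml in your Nutch conf/ directory):

<!-- The document key is the page URL, so re-indexing a URL overwrites the old document. -->
<field name="id" type="string" stored="true" indexed="true"/>
<uniqueKey>id</uniqueKey>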
Well, you need to implement incremental crawling in Nutch, and how depends on your application. Some people want to recrawl every day, others every 3 months. The maximum is 90 days in any case.
The general idea is to delete crawl segments that are older than your maximum recrawl interval, since they are redundant by then, and then produce a fresh Solr index for use in Solr.
I'm afraid you have to do that yourself with scripting. One day I may put some scripts I wrote for this on the wiki, but they are not ready to publish as they stand.
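A minimal sketch of such a script, assuming a crawl directory named crawl, a local Solr at http://localhost:8983/solr/, and a 90-day recrawl window (all placeholders):

#!/bin/sh
# Prune stale segments and rebuild the Solr index (Nutch 1.x command line).
CRAWL=crawl
SOLR=http://localhost:8983/solr/

# Segments older than the recrawl window have been refetched since and are redundant.
find $CRAWL/segments -maxdepth 1 -type d -mtime +90 -exec rm -rf {} \;

# Rebuild the link database and push a fresh index to Solr.
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch solrindex $SOLR $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*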
Try Lucidworks' enterprise Solr for testing/prototyping, which has a built-in web crawler.
http://www.lucidimagination.com/products/lucidworks-search-platform/enterprise
It'll give you a feel for the whole Lucene stack. It has a MUCH better interface than any other Java software I've ever used. It's a joy to use.

Nutch querying on the fly

I am a newbie to Nutch and Solr. Well, relatively much newer to Solr than Nutch :)
I have been using Nutch for the past two weeks, and I wanted to know if I can query or search my Nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the Nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible.
I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. I see only the injected URLs in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. But I did all the steps mentioned in the link. I think a crawl should be happening somewhere in the process, and it is being missed.
I just wanted to see if someone could help me point out where I went wrong in the process. Forgive my foolishness and thanks for your patience.
Cheers,
Abi
This is not possible.
What you could do, though, is chunk the crawl cycle into smaller batches of URLs so that it publishes results more often, with this command:
nutch generate crawl/crawldb crawl/segments -topN <the limit>
If you are using the one-stop crawl command, it should work the same way.
I typically use a 24-hour chunking scheme.
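As a sketch of the idea (the -topN value, paths, and Solr URL are placeholders), each short round fetches a limited batch and pushes it to Solr before the next round starts, so partial results become searchable early:

# Repeat small crawl rounds; each round indexes its segment into Solr.
while true; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb $SEGMENT
done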

Solr / SolrNet - Using wildcards for letter-by-letter search

Hey Guys,
I'm trying to implement some search functionality in an application we're writing.
Solr 1.4.1 running on Tomcat 7
A JDBC connection to an MS SQL Server with the view I'm indexing
Solr has finished indexing and the index is working.
To search and communicate with Solr I have created a little test WCF service (to be integrated with our main service later).
The purpose is to implement a text field in our main application. In this text field, users can start typing something like Paintbrush and gradually filter the list of objects as more and more characters are input.
This is working just fine and dandy with Solr, up to a certain point. I'm using the wildcard asterisk at the end of my query, and as such I'm throwing a lot of requests like
p*
pa*
pain*
paint*
etc. at the server, and it's returning results just fine (quite impressively fast, actually). The only problem is that once the user types the whole word, the query is paintbrush*, at which point Solr returns 0 results.
So it seems that query+wildcard can only be query+something and not query+nothing.
I managed to get this working under Lucene.Net, but Solr isn't doing things the same way, it seems.
Any advice you can give me on implementing such a feature?
There isn't much code to look at since I'm using SolrNet: http://pastebin.com/tXpe4YUe
I figure it has something to do with the analyzer and parser, but I'm not deep enough into Solr yet to know where to look :)
I wouldn't implement suggestions with prefix wildcard queries in Solr. There are other mechanisms better suited to do this. See:
Simple Solr schema problem for autocomplete
Solr TermsComponent: Usage of wildcards
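For example, Solr's TermsComponent (available in 1.4) can return indexed terms that start with a given prefix, which is usually a better fit for type-ahead than wildcard queries. Assuming the example /terms request handler is enabled in solrconfig.xml and the field is called name (both placeholders), a request looks like:

http://localhost:8983/solr/terms?terms.fl=name&terms.prefix=paint&terms.limit=10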
Stemming seems to be what caused the problem. I fixed it using a clone of text_ws instead of text for the type.
My changes to schema.xml: http://pastebin.com/xaJZDgY4
Stemming is disabled and lowercase indexing is enabled. As long as all queries are in lower case, they should always give results (if there are any at all).
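A minimal sketch of such a field type (the name is arbitrary and the actual pastebin changes may differ): whitespace tokenization plus lowercasing, with no stemming filter:

<fieldType name="text_ws_lower" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only and lowercase; no stemming filter. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>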
The issue seems to be that analyzers don't run on wildcard queries, so the logic that would make Johnny match Johni or Johnni is "broken" when using wildcards.
If you're facing similar problems and my solution here doesn't quite work, you can add debugQuery=on to your query string to see a bit more about what's going on. That helped me narrow down the problem.
