Nutch - Crawler not following next pages in paginated content - solr

I'm using Nutch 1.6 to crawl a paginated web page containing 20 products per page, with this command:
./nutch crawl urls -dir <dir> -depth 4 -topN 100 -threads 100
I'm getting the first 20 products and the links to the following pages, but the crawler is not following the links to the next pages. Am I missing a parameter?

Unfortunately, Nutch 1.6 lacks support for crawling Ajax-based sites. See this and this. There are no immediate plans to add it.

By default, regex-urlfilter.txt blocks URLs that have query-string parameters:
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
Modify that file so that URLs with query-string parameters are crawled:
# skip URLs containing certain characters as probable queries, etc.
-[*!#]
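To verify that a pagination URL now passes the filters, you can pipe it through Nutch's filter checker. This is only a quick sanity check, assuming Nutch 1.x and a placeholder URL; the exact invocation may differ slightly between versions:
# Accepted URLs are echoed back prefixed with '+', rejected ones with '-'
echo "http://www.example.com/products?page=2" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined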

Related

Nutch 2.X - Preferred URLs to fetch

I have this situation: there are over 160 URLs in my seed. I started my crawl one week ago. I now have a lot of crawled pages in my storage, but I can see in my Solr index that some URLs from the seed are not crawled at all (those URLs are not restricted by robots.txt), or only in very small numbers. Is it possible to tell Nutch to prefer some URLs?
Have you checked the topN value?
Or is Nutch still crawling? Indexing and sending data to Solr is done at the end of the process!
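If some seeds genuinely matter more than others, one option that may help (it exists in both Nutch 1.x and 2.x, as far as I know) is to give those seeds a higher score at inject time, since the generator selects the highest-scoring URLs for each fetch round. A sketch with placeholder URLs, using tab-separated metadata in the seed file:
# urls/seed.txt (metadata after the URL is tab-separated)
http://important.example.com/	nutch.score=10
http://other.example.com/
# Nutch 2.x inject form (Nutch 1.x additionally takes the crawldb path)
bin/nutch inject urls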

How do I tell Nutch to crawl *through* a URL without storing it?

Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.
Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.
But, I do want Nutch to crawl all the other pages, looking for links to pages that match—I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).
What's the normal or least painful way to set Nutch->Solr up to work like this?
Looks like the only way to do this is to write your own IndexingFilter plugin (or find someone's to copy from); a rough sketch of such a filter follows the references below.
[Will add my sample plugin code here when it's working properly]
References:
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
http://florianhartl.com/nutch-plugin-tutorial.html
How to filter URLs in Nutch 2.1 solrindex command
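For reference, here is a minimal sketch of what such a plugin might look like against the Nutch 1.x IndexingFilter interface (the Nutch 2.x signature differs). The package name and the index.url.include.regex property are invented for this example, and the usual plugin.xml / plugin.includes wiring is still needed:

package com.example.nutch.indexfilter; // hypothetical package

import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Returning null from filter() drops the document from indexing, while the
// page itself is still crawled and its outlinks are still followed.
public class UrlRegexIndexFilter implements IndexingFilter {

  private Configuration conf;
  private Pattern includePattern;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Keep only documents whose URL matches the configured regex.
    return includePattern.matcher(url.toString()).find() ? doc : null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // index.url.include.regex is a made-up property; the default ".*" keeps everything.
    includePattern = Pattern.compile(conf.get("index.url.include.regex", ".*"));
  }

  public Configuration getConf() {
    return conf;
  }
}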

Nutch: Data read and adding metadata

I recently started looking at Apache Nutch. I was able to set it up and crawl the web pages I'm interested in. I don't quite understand how to read this data. I basically want to associate each page's data with some metadata (some random data for now) and store it locally, to be used later for (semantic) search. Do I need to use Solr or Lucene for that? I am new to all of these. As far as I know, Nutch is used to crawl web pages. Can it also do things like adding metadata to the crawled data?
Useful commands.
Begin crawl
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Get statistics of crawled URLs
bin/nutch readdb crawl/crawldb -stats
Read segment (gets all the data from web pages)
bin/nutch readseg -dump crawl/segments/* segmentAllContent
Read segment (gets only the text field)
bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata
Get the list of known links to each URL, including both the source URL and the anchor text of the link.
bin/nutch readlinkdb crawl/linkdb/ -dump linkContent
Get all crawled URLs. Also gives other information, such as whether each URL was fetched, the fetch time, the modified time, etc.
bin/nutch readdb crawl/crawldb/ -dump crawlContent
For the second part, i.e. adding a new field, I am planning to use the index-extra plugin or to write a custom plugin.
Refer:
this and this
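If a custom plugin turns out to be unnecessary, Nutch 1.x also bundles an index-metadata plugin that copies parse/content metadata into index fields. A rough nutch-site.xml sketch; property names may vary between versions, and mymetafield is a placeholder key:
<property>
  <name>plugin.includes</name>
  <!-- default plugin list plus index-metadata -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>index.parse.md</name>
  <!-- comma-separated parse-metadata keys to copy into the index -->
  <value>mymetafield</value>
</property>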

Nutch solrindex command not indexing all URLs in Solr

I have a Nutch index crawled from a specific domain and I am using the solrindex command to push the crawled data to my Solr index. The problem is that it seems that only some of the crawled URLs are actually being indexed in Solr. I had the Nutch crawl output to a text file so I can see the URLs that it crawled, but when I search for some of the crawled URLs in Solr I get no results.
Command I am using to do the Nutch crawl: bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
This command is completing successfully and the output displays URLs that I cannot find in the resulting Solr index.
Command I am using to push the crawled data to Solr: bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
The output for this command says it is also completing successfully, so it does not seem to be an issue with the process terminating prematurely (which is what I initially thought it might be).
One final thing that I am finding strange is that the entire Nutch & Solr config is identical to a setup I used previously on a different server and I had no problems that time. It is literally the same config files copied onto this new server.
TL;DR: I have a set of URLs successfully crawled in Nutch, but when I run the solrindex command only some of them are pushed to Solr. Please help.
UPDATE: I've re-run all these commands and the output still insists it's all working fine. I've looked into any blockers for indexing that I can think of, but still no luck. The URLs being passed to Solr are all active and publicly accessible, so that's not an issue. I'm really banging my head against a wall here so would love some help.
I can only guess what happened, based on my experience:
There is a URL normalizer component (the urlnormalizer-regex plugin, configured via regex-normalize.xml) which truncates some URLs (removing URL parameters, session IDs, ...).
Additionally, Nutch enforces a uniqueness constraint: by default, each URL is only saved once.
So, if the normalizer truncates two or more URLs ('foo.jsp?param=value', 'foo.jsp?param=value2', 'foo.jsp?param=value3', ...) to exactly the same one ('foo.jsp'), they only get saved once, and Solr will only see a subset of all your crawled URLs.
cheers
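If you want to confirm whether normalization or filtering is collapsing or dropping URLs, something along these lines may help (assuming Nutch 1.x; the checker classes ship with Nutch, but their options can differ between versions, and the URLs below are placeholders):
# See what the configured normalizers turn a suspect URL into
echo "http://www.example.com/foo.jsp?param=value2" | bin/nutch org.apache.nutch.net.URLNormalizerChecker
# See whether the URL filters accept it ('+') or reject it ('-')
echo "http://www.example.com/foo.jsp?param=value2" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
# Look up what the crawldb actually stored for a given URL
bin/nutch readdb crawl/crawldb -url "http://www.example.com/foo.jsp"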

Apache Nutch does not index the entire website, only subfolders

Apache Nutch 1.2 does not index the entire website, only subfolders. My index page provides links to most areas/subfolders of my website, for example stuff, students, research... But Nutch only crawls one specific folder, "students" in this case. It seems as if links in other directories are not followed.
crawl-urlfilter.txt:
+^http://www5.my-domain.de/
seed.txt in the URLs-folder:
http://www5.my-domain.de/
Starting Nutch with (Windows/Linux both used):
nutch crawl "D:\Programme\nutch-1.2\URLs" -dir "D:\Programme\nutch-1.2\crawl" -depth 10 -topN 1000000
Different values for depth (5-23) and topN (100-1000000) have been tested. Providing more links in seed.txt doesn't help at all; links found in the injected pages are still not followed.
Interestingly, crawling gnu.org works perfectly. There is no robots.txt and there are no restrictive meta tags on my site.
Any ideas?
While attempting to crawl all links from an index page, I discovered that Nutch was limited to exactly 100 links out of around 1000. The setting that was holding me back was:
db.max.outlinks.per.page
Setting this to 2000 allowed Nutch to index all of them in one shot.
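In nutch-site.xml that looks roughly like this (the default is 100; a negative value should mean no limit):
<property>
  <name>db.max.outlinks.per.page</name>
  <value>2000</value>
</property>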
Check whether you have the intra-domain links limitation enabled (that property should be false in nutch-site.xml). Also check other properties such as the maximum intra/extra links per page and the HTTP content size. Sometimes they produce wrong results during crawling.
Ciao!
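The properties alluded to above are, as far as I can tell, these two; the values shown are only an example:
<property>
  <name>db.ignore.external.links</name>
  <!-- if true, outlinks pointing outside the seed domains are dropped -->
  <value>false</value>
</property>
<property>
  <name>http.content.limit</name>
  <!-- max bytes downloaded per page; a large index page can be truncated
       (losing later outlinks) if this is too small -->
  <value>1048576</value>
</property>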
