How to prevent crawling external links with Apache Nutch? (Solr)

I want to crawl only specific domains with Nutch. For this I set db.ignore.external.links to true, as suggested in this FAQ link.
The problem is that Nutch then crawls only the links in the seed list. For example, if I put "nutch.apache.org" into seed.txt, it only finds that same URL (nutch.apache.org).
I got this result by running the crawl script with a depth of 200. It finished after one cycle and generated the output below.
How can I solve this problem?
I'm using Apache Nutch 1.11.
Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now

You want to fetch only pages from a specific domain.
You already tried db.ignore.external.links, but that restricts the crawl to nothing beyond the seed.txt URLs.
You should try conf/regex-urlfilter.txt, as in the example from the Nutch 1.x tutorial:
+^http://([a-z0-9]*\.)*your.specific.domain.org/

Are you using the crawl script? If so, make sure you give a number of rounds greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt.
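For illustration, the same invocation with more rounds might look like the line below; the seed folder, crawl directory, and Solr URL are placeholders carried over from the command above, and the argument order of bin/crawl differs between Nutch releases, so check the usage message for your version.
bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 5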
And to restrict the crawl to a specific domain you can use the regex-urlfilter.txt file.

Add the following property in nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts will be ignored.
  This is an effective way to limit the crawl to include only initially injected hosts,
  without creating complex URLFilters.</description>
</property>

Related

How to index a NFS mount using Nutch?

I am trying to build a search tool, hosted on a CentOS 7 machine, which should index and search the directories of a mounted NFS export. I found that Nutch + Solr is the best bet for this. I have had a hard time configuring the URL for this, since it will not search any http locations.
The mount is located on /mnt
So my seeds.txt looks like this:
[root@sauron bin]# cat /root/Desktop/apache-nutch-1.13/urls/seed.txt
file:///mnt
and my regex-urlfilter.txt allows the file protocol for the same location:
# skip file: ftp: and mailto: urls
-^(http|https|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+^file:///mnt
However when I try to bootstrap from the initial seed list there are no updates done:
[root@sauron apache-nutch-1.13]# bin/nutch inject crawl/crawldb urls
Injector: starting at 2017-06-12 00:07:49
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: finished at 2017-06-12 00:10:27, elapsed: 00:02:38
I have also tried changing the seeds.txt to the following with no luck:
file:/mnt
file:////<IP>:<export_path>
Please let me know if I am doing something wrong here.
From the URI point of view, a file system is not really that different for Nutch; you just need to enable the protocol-file plugin and configure regex-urlfilter.txt like:
+^file:///mnt/directory/
-.
In this case you prevent it from indexing the parent directories of the one that you've specified.
Keep in mind that since you already have the NFS share mounted locally, it works as a normal local file system. More information can be found at https://wiki.apache.org/nutch/FAQ#How_do_I_index_my_local_file_system.3F.
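As a sketch, enabling the plugin usually means adding protocol-file to plugin.includes in nutch-site.xml; the plugin list below is just a common default with protocol-file swapped in for protocol-http, so adapt it to your installation rather than copying it verbatim.
<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>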

Solr: where to find the Luke request handler

I'm trying to get a list of all the fields, both static and dynamic, in my Solr index. Another SO answer suggested using the Luke Request Handler for this.
It suggests finding the handler at this url:
http://solr:8983/solr/admin/luke?numTerms=0
When I try this url on my server, however, I get a 404 error.
The admin page for my core is here http://solr:8983/solr/#/mycore, so I also tried http://solr:8983/solr/#/mycore/admin/luke. This also gave me another 404.
Does anyone know what I'm doing wrong? Which url should I be using?
First of all, you have to enable the Luke Request Handler. Note that if you started from the example solrconfig.xml, you probably don't need to enable it explicitly, because
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
does it for you.
Then, if you need to access the data programmatically, make an HTTP GET request to http://solr:8983/solr/mycore/admin/luke (no hash mark!). The response is in XML, but by specifying the wt parameter you can obtain other formats (e.g. http://solr:8983/solr/mycore/admin/luke?wt=json).
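For example, a quick way to hit it from the command line (using the core name mycore from the question):
curl "http://solr:8983/solr/mycore/admin/luke?numTerms=0&wt=json"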
If you only want to see the fields in the Solr web interface, select your core from the drop-down menu and then click on "Schema Browser".
In Solr 6, the solr.admin.AdminHandlers has been removed. If your solrconfig.xml has the line <requestHandler name="/admin/" class="solr.admin.AdminHandlers" />, it will fail to load. You will see errors in the log telling you it failed to load the class org.apache.solr.handler.admin.AdminHandlers.
You must instead include this line in your solrconfig.xml:
<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />
Note that the URL is core-specific, i.e. http://your_server.com:8983/solr/your_core_name/admin/luke
And you can specify the parameters fl, numTerms, id, docId as follows:
/admin/luke
/admin/luke?fl=cat
/admin/luke?fl=id&numTerms=50
/admin/luke?id=SOLR1000
/admin/luke?docId=2
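If the goal is just listing all static and dynamic fields, as in the original question, the handler's show=schema parameter is also worth trying, e.g.:
/admin/luke?show=schema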
You can use the standalone Luke tool, which allows you to explore the Lucene index.
You can also use the Solr admin page:
http://localhost:8983/solr/#/core/schema-browser

crawling all links of same domain in Nutch

Can anyone tell me how to crawl all the other pages of the same domain?
For example, I'm feeding the website http://www.techcrunch.com/ in seed.txt.
The following property is added in nutch-site.xml:
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>
And the following is added in regex-urlfilter.txt:
# accept anything else
+.
Note: if I add http://www.tutorialspoint.com/ to seed.txt, I'm able to crawl all its other pages, but not techcrunch.com's pages, even though it has many other pages too.
Please help?
In nutch-default.xml, set db.ignore.external.links to true and db.ignore.external.links.mode to byDomain, like this:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
By default, db.ignore.external.links.mode is set to byHost, which means that while crawling http://www.techcrunch.com/ the URL http://subdomain1.techcrunch.com will be treated as EXTERNAL and hence ignored. But you want subdomain1 pages to be crawled too, so set db.ignore.external.links.mode to byDomain.
No workaround is required in regex-urlfilter.txt; use regex-urlfilter.txt only for more complex situations.
I think you are using the wrong property. First, use db.ignore.external.links in nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This will limit your crawl to the hosts in your seeds file.
  </description>
</property>
Then you could also use a regex in regex-urlfilter.txt to limit the crawled domains to just techcrunch:
+^(http|https)://.*techcrunch.com/
However, I think your real issue is that Nutch obeys the robots.txt file, and in this case techcrunch has a Crawl-delay value of 3600 (see its robots.txt). The default value of fetcher.max.crawl.delay is 30 seconds, making Nutch dismiss all the pages from techcrunch.
From the fetcher.max.crawl.delay description in nutch-default.xml:
"If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be."
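If you decide to accept techcrunch's long Crawl-Delay anyway, a minimal nutch-site.xml override along the lines below should do it; this is only a sketch, and setting -1 means the fetcher will wait out whatever delay robots.txt demands, so the crawl becomes very slow.
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>-1</value>
</property>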
You may want to play with the fetcher.threads.fetch and fetcher.threads.per.queue values to speed up your crawl. You could also take a look at this and play with the Nutch code, or you may even want to use a different approach to crawl sites with long crawl delays.
Hope this is useful to you.
Cheers!

Nutch didn't crawl all URLs from the seed.txt

I am new to Nutch and Solr. I would like to crawl a website whose content is
generated by ASP. Since the content is not static, I created a seed.txt which
contains all the URLs I would like to crawl. For example:
http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...
The regex-urlfilter.txt has this filter:
# accept anything else
#+.
+^http://([a-z0-9]*\.)*abc.com/
I used this command to start the crawling:
/bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10
The seed.txt contains 40,000+ URLs. However, I found that the content of many of the
URLs cannot be found by Solr.
Question:
Is this approach workable for a large seed.txt?
How can I check whether a URL has been crawled?
Does seed.txt have a size limitation?
Thank you!
Check out the property db.max.outlinks.per.page in the Nutch configuration files.
The default value for this property is 100, and hence only 100 URLs will be picked up from the seed.txt and the rest will be skipped.
Change this value to a higher number to have all the URLs scanned and indexed.
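A sketch of the corresponding nutch-site.xml override; a negative value is commonly documented as "process all outlinks", but verify against the nutch-default.xml shipped with your version before relying on it.
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>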
topN indicates how many of the generated links should be fetched per round. You could have 100 links that have been generated, but if you set topN to 12, then only 12 of those links will get fetched, parsed and indexed.
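Since the crawl command in the question uses -topN 10, only ten URLs are fetched per round. Raising it along these lines (the value 50000 is purely illustrative) lets each round fetch far more of the injected URLs:
bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 50000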

Map static field between nutch and solr

I use Nutch 1.4 and I would like to map a static field to Solr.
I know there is the index-static plugin. I configured it in nutch-site.xml like this:
<property>
  <name>index-static</name>
  <value>field:value</value>
</property>
However, the value is not sent to Solr.
Does anyone have a solution?
It looks like the entry in nutch-default.xml is wrong.
According to the plugin source, "index.static" rather than "index-static" is the right name for the property:
String fieldsString = conf.get("index.static", null);
After using that in my nutch-site.xml, I was able to send multiple fields to my Solr server.
Also make sure that the plugin is added to the list of included plugins in the "plugin.includes" property.
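For reference, a working nutch-site.xml entry might look like the sketch below; the field names and values are hypothetical, and the exact value format (delimiters, multiple fields) depends on the index-static plugin version, so check its source as quoted above.
<property>
  <name>index.static</name>
  <value>source:nutch,collection:products</value>
</property>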
