Nutch didn't crawl all URLs from the seed.txt - solr

I am new to Nutch and Solr. I would like to crawl a website whose content is generated by ASP. Since the content is not static, I created a seed.txt containing all the URLs I would like to crawl. For example:
http://us.abc.com/product/10001
http://us.abc.com/product/10002
http://jp.abc.com/product/10001
http://jp.abc.com/product/10002
...
The regex-urlfilter.txt has this filter:
# accept anything else
#+.
+^http://([a-z0-9]*\.)*abc.com/
I used this command to start the crawling:
/bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 10
The seed.txt contains 40,000+ URLs. However, I found that the content of many of the URLs cannot be found in Solr.
Question:
Is this approach workable for a large seed.txt?
How can I check whether a URL was crawled?
Does seed.txt have a size limitation?
Thank you!

Check out the property db.max.outlinks.per.page in the Nutch configuration files.
The default value for this property is 100, so only 100 URLs will be picked up from the seed.txt and the rest will be skipped.
Change this value to a higher number to have all the URLs scanned and indexed.
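For example, a minimal override in nutch-site.xml could look like the sketch below (as far as I know a negative value removes the limit entirely, but double-check the property description in nutch-default.xml for your version):
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>Process all outlinks of a page instead of only the first 100 (a negative value disables the limit).</description>
</property>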

topN indicates how many of the generated links should be fetched. You could have 100 links that have been generated, but if you set topN to 12, only 12 of those links will get fetched, parsed, and indexed.
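For instance, reusing the command from the question (the Solr URL and paths are the ones given there), raising -topN to something closer to the size of the seed list would let far more of the 40,000+ injected URLs be fetched per round:
bin/nutch crawl urls -solr http://abc.com:8983/solr/ -dir crawl -depth 10 -topN 50000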

Related

Remove L parameter in request URL

I'm using the Solr extension with TYPO3 9.5.3 and I couldn't index the pages; I get this error: https://imgur.com/1e6LfIy
Failed to execute Page Indexer Request. Request ID: 5d78d130b8b4d
When I look at the Solr log, I see that TYPO3 adds &L=0 to the request URL, and the pages with &L=0 return a '404 page not found' error:
request url => 'http://example.com/index.php?id=5&L=0' (43 chars)
I added the following code to my TS setup, but that did not work and the request URL always ends with &L=0:
plugin.tx_solr.index.queue.pages.fields.url.typolink.additionalParams >
I'm not sure that's the only reason Solr doesn't index the pages (news can be indexed without any problem), but first, how can I solve the problem and remove &L=0 from the request URL in Solr?
Can you check your TypoScript to see whether you have a configuration like
config.defaultGetVars.L = 0
or whether other old language settings exist?
I'm not sure, but do you have an older language configuration left over where you define the language parameter?
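If such a leftover exists, a minimal TypoScript sketch for clearing it could look like this (assuming L is still registered in config.linkVars, which is only a guess about your setup):
# remove any legacy default L parameter
config.defaultGetVars.L >
# stop carrying L along in generated links, if it is listed there
config.linkVars := removeFromList(L)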

How to prevent crawling external links with apache nutch?

I want to crawl only specific domains with Nutch. For this I set db.ignore.external.links to true, as described in this FAQ link.
The problem is that Nutch then crawls only the links in the seed list. For example, if I put "nutch.apache.org" into seed.txt, it only finds that same URL (nutch.apache.org).
I got this result by running the crawl script with a depth of 200. It finished after one cycle and generated the output below.
How can I solve this problem?
I'm using Apache Nutch 1.11.
Generator: starting at 2016-04-05 22:36:16
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
Best Regards
You want to fetch only pages from a specific domain.
You already tried db.ignore.external.links, but this restricts the crawl to nothing but the seed.txt URLs.
You should try conf/regex-urlfilter.txt, as in the example from the Nutch 1 tutorial:
+^http://([a-z0-9]*\.)*your.specific.domain.org/
Are you using the "crawl" script? If yes, make sure you give a number of rounds greater than 1. If you run something like "bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 1", it will crawl only the URLs listed in seed.txt (see the example just below).
And to crawl a specific domain you can use the regex-urlfilter.txt file.
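For example, the same form of command with a higher number of rounds (the folder names and Solr URL are just the placeholders from above):
bin/crawl seedfoldername crawlDb http://solrIP:solrPort/solr 10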
Also add the following property in nutch-site.xml:
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters. </description>
</property>

Nutch to allow when the host name is having portnumber

I am using Nutch to push and index data to Solr. In Nutch, I have added abc.com:85 to domain-urlfilter.txt and +^http://abc\.com\:85 to regex-urlfilter.txt.
The problem is that Nutch is not indexing the data and throws this message: Total number of urls rejected by filters:1
I need the port number in the URL, which is why this configuration was done.
Could you please let me know how to make Nutch work with the port number :85 added?
The problem is the syntax: +^http://abc\.com\:85 is not correct. Please check the syntax here: Nutch regex-urlfilter syntax
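As a sketch of what the line might look like instead (assuming plain HTTP on port 85 and that abc.com stands in for your real host; note that the colon itself does not need to be escaped):
+^http://abc\.com:85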
Hope this helps,
Le Quoc Do

crawling all links of same domain in Nutch

Can anyone tell me how to crawl all the other pages of the same domain?
For example, I'm feeding the website http://www.techcrunch.com/ in seed.txt.
The following property is added in nutch-site.xml:
<property>
<name>db.ignore.internal.links</name>
<value>false</value>
<description>If true, when adding new links to a page, links from
the same host are ignored. This is an effective way to limit the
size of the link database, keeping only the highest quality
links.
</description>
</property>
And the following is added in regex-urlfilter.txt:
# accept anything else
+.
Note: if I add http://www.tutorialspoint.com/ in seed.txt, I'm able to crawl all the other pages, but not techcrunch.com's pages, even though it has many other pages too.
Please help.
In nutch-default.xml, set db.ignore.external.links to true and db.ignore.external.links.mode to byDomain, like this:
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>db.ignore.external.links.mode</name>
<value>byDomain</value>
</property>
By default db.ignore.external.links.mode is set to byHost, which means that while crawling http://www.techcrunch.com/ the URL http://subdomain1.techcrunch.com will be treated as EXTERNAL and hence ignored. But you want subdomain1 pages to be crawled too, so set db.ignore.external.links.mode to byDomain.
No workaround is required in regex-urlfilter.txt. Use regex-urlfilter.txt for more complex situations.
I think you are using the wrong property. First, use db.ignore.external.links in nutch-site.xml:
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This will limit your crawl to the host on your seeds file.
</description>
</property>
Then you could also use a regex in regex-urlfilter.txt to limit the crawled domains to just techcrunch:
+^(http|https)://.*techcrunch.com/
However, I think your real issue is that Nutch obeys the robots.txt file, and in this case techcrunch has a Crawl-delay value of 3600! See robots.txt. The default value of fetcher.max.crawl.delay is 30 seconds, making Nutch dismiss all the pages from techcrunch.
From fetcher.max.crawl.delay in nutch-default.xml:
"If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be."
You may want to play with the fetcher.threads.fetch and fetcher.threads.per.queue values to speed up your crawl. You could also take a look at this and play with the Nutch code, or you may even want to use a different approach to crawl sites with long crawl delays.
Hope this is useful to you.
Cheers!

Map static field between nutch and solr

I use Nutch 1.4 and I would like to map a static field to Solr.
I know there is the index-static plugin. I configured it in nutch-site.xml like this:
<property>
<name>index-static</name>
<value>field:value</value>
</property>
However, the value is not sent to Solr.
Does anyone have a solution?
It looks like the entry in nutch-default.xml is wrong.
According to the plugin source "index.static" instead of "index-static" is the right name for the property.
String fieldsString = conf.get("index.static", null);
After using that in my nutch-site.xml I was able to send multiple fields to my Solr server.
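For reference, a sketch of the corresponding nutch-site.xml entry (the field name and value are just the placeholders from the question):
<property>
<name>index.static</name>
<value>field:value</value>
</property>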
Also make sure that the plugin is added to list of included plugins in the "plugin.includes" property.
