Nutch 1.6 doesn't crawl new entries in seed.txt - Solr

I set up Solr 7.7.1 and Nutch 1.6 and ran a test crawl. For that I put a single URL in seed.txt and everything worked fine. After this test I removed the old core in Solr, created a new core, put multiple URLs in seed.txt, and started Nutch again for a new crawl. But every attempt returned the results of the previous test run. How can I remove the previous crawl data and get Nutch to crawl the new URLs I put in seed.txt?
Thanks in advance for your answers.

You should remove the crawl/ directory (if that is what it is named). This directory contains the previously crawled data (before it is sent to Solr). Most likely there is no new content when you re-run the crawl command, so Nutch simply sends the already stored data to Solr again.
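As a minimal sketch of a clean re-crawl, assuming the crawl data lives in a directory named crawl/ next to your urls/ folder and that you drive Nutch 1.6 with the old all-in-one crawl command (the core name "mycore" is a placeholder):
rm -rf crawl/
bin/nutch crawl urls -dir crawl -depth 2 -topN 1000 -solr http://localhost:8983/solr/mycore
Deleting crawl/ wipes the crawldb, linkdb, and segments, so the next run starts from whatever is currently in seed.txt.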

Related

Crawling with Nutch 2.3, Cassandra 2.0, and Solr 4.10.3 returns 0 results

I mainly followed the guide on this page. I installed Nutch 2.3, Cassandra 2.0, and Solr 4.10.3. Setup went well, but when I executed the following command, no URLs were fetched.
./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
Below are my settings.
nutch-site.xml
http://ideone.com/H8MPcl
regex-urlfilter.txt
+^http://([a-z0-9]*\.)*nutch.apache.org/
hadoop.log
http://ideone.com/LnpAw4
I don't see any errors in the log file. I am really lost. Any help would be appreciated. Thanks!
You will have to add a regex for the website you want to crawl to regex-urlfilter.txt, so that the URL you added to your seed file is picked up.
Right now it will only crawl "nutch.apache.org".
Try adding the line below:
+^http://([a-z0-9]*\.)*ideone.com/
Also try setting the Nutch logs to debug level and capture the logs while executing the crawl command.
They will clearly show why you are unable to crawl and index the site.
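As a rough sketch (assuming a default Nutch layout, where logging is configured in conf/log4j.properties and output goes to logs/hadoop.log), you could raise the log level with a line like:
log4j.logger.org.apache.nutch=DEBUG,cmdstdout
Then re-run the crawl and watch the log:
tail -f logs/hadoop.log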
Regards,
Jayesh Bhoyar
http://technical-fundas.blogspot.com/p/technical-profile.html
I ran into a similar problem recently. I think you can try the following steps to track it down.
1. Do some tests to make sure the DB works well.
2. Instead of running the crawl in batch, call Nutch step by step and watch how the log changes as well as how the DB content changes, in particular the new URLs (see the sketch below).
3. Turn off Solr and focus on Nutch and the DB.
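A sketch of those individual steps, assuming the Nutch 2.x command names from the 2.x tutorial (the -topN value is arbitrary):
bin/nutch inject urls/
bin/nutch generate -topN 50
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
After each step you can inspect the Cassandra-backed webpage table (and hadoop.log) to see where URLs stop flowing.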

Apache Nutch - indexing only the modified files in Solr

I am able to set up Apache Nutch and get the data indexed in Solr. While indexing I am trying to make sure only modified pages get indexed. Below are the 2 questions we have regarding this.
1. Is it possible to tell Nutch to send an 'If-Modified-Since' header while crawling the site and download a page only if it has changed since the last time it was crawled?
2. I can see that Nutch forms an MD5 digest of the retrieved page content, but even though the digest hasn't changed (compared to the previous version), it still indexes the page in Solr. Is there any setting within Nutch to make sure that if the content hasn't changed, it is not re-indexed in Solr?
Answering my own question here, hope it helps someone.
Once I set the AdaptiveFetchSchedule, I could see that Nutch was no longer pulling pages that hadn't changed. It honors the If-Modified-Since header.
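For reference, a minimal sketch of the relevant nutch-site.xml property (the class is the AdaptiveFetchSchedule that ships with Nutch; interval tuning properties are left at their defaults here):
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>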

Apache Nutch does not add sublinks to main site

Could anyone please give some guidance on how to properly configure Apache Nutch in order to get a reasonable number of records in the database as a result of crawling a web site? I would really appreciate that!
Here are the details:
I've got the following line in my bin/urls/seed.txt file:
http://transmetod.ru/
The following is the line from the regex-urlfilter.txt file (all other regexes are commented out):
+^http://([a-z0-9]*\.)*transmetod.ru/([a-z0-9]*\.)*
Basically I expect lots of records to appear in the database as a result of crawling, but the only thing I got there is a single record with the base URL (without any other records for additional sublinks).
This is the command line I use to run the apache-nutch-2.1 project:
./nutch crawl urls -depth 3 -topN 10000
Can anyone point out the mistake I've made or just give some advice?
P.S.: basically, when I built the project and ran it without any changes, I didn't get a bunch of records either... (if I remember things right)
Try changing your regex filter to:
+^http://([a-z0-9]*\.)*transmetod.ru/
Also, when you first run Nutch, it will crawl the URLs you put in your seed file.
The next time you run the crawl, using the same crawl folder, it should pick up the outlinks of the first page and crawl them.
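As a sketch of that two-round pattern (assuming a 1.x-style crawl that keeps its data in a crawl/ directory, as described above; the -depth and -topN values are arbitrary):
bin/nutch crawl urls -dir crawl -depth 1 -topN 1000
bin/nutch crawl urls -dir crawl -depth 1 -topN 1000
The first run fetches only the seed URL; the second run, pointed at the same -dir, generates from the outlinks discovered in the first round.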

Nutch solrindex command not indexing all URLs in Solr

I have a Nutch index crawled from a specific domain and I am using the solrindex command to push the crawled data to my Solr index. The problem is that it seems that only some of the crawled URLs are actually being indexed in Solr. I had the Nutch crawl output to a text file so I can see the URLs that it crawled, but when I search for some of the crawled URLs in Solr I get no results.
Command I am using to do the Nutch crawl: bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
This command is completing successfully and the output displays URLs that I cannot find in the resulting Solr index.
Command I am using to push the crawled data to Solr: bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
The output for this command says it is also completing successfully, so it does not seem to be an issue with the process terminating prematurely (which is what I initially thought it might be).
One final thing that I am finding strange is that the entire Nutch & Solr config is identical to a setup I used previously on a different server and I had no problems that time. It is literally the same config files copied onto this new server.
TL;DR: I have a set of URLs successfully crawled in Nutch, but when I run the solrindex command only some of them are pushed to Solr. Please help.
UPDATE: I've re-run all these commands and the output still insists it's all working fine. I've looked into any blockers for indexing that I can think of, but still no luck. The URLs being passed to Solr are all active and publicly accessible, so that's not an issue. I'm really banging my head against a wall here so would love some help.
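One way to compare what Nutch actually holds with what reaches Solr (a sketch, assuming the crawl directory from the commands above) is to dump crawldb statistics and, if needed, the full URL list:
bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump crawldb-dump
The -stats output shows how many URLs are in each status (db_fetched, db_unfetched, ...), which you can compare against the document count Solr reports.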
I can only guess what happened, from my experience:
There is a component called the URL normalizer (the regex-based one is configured in regex-normalize.xml) which truncates some URLs (removing URL parameters, session IDs, ...).
Additionally, Nutch applies a uniqueness constraint: by default each URL is only saved once.
So, if the normalizer truncates two or more URLs ('foo.jsp?param=value', 'foo.jsp?param=value2', 'foo.jsp?param=value3', ...) to exactly the same one ('foo.jsp'), they only get saved once, and Solr will only see a subset of all your crawled URLs.
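For illustration, a minimal sketch of a rule in the style of regex-normalize.xml that would collapse such URLs; the parameter name "param" is hypothetical, taken from the example above:
<regex-normalize>
  <regex>
    <pattern>[?&amp;]param=[^&amp;]*</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>
If your regex-normalize.xml (or the default rules for session IDs) contains a pattern like this, every variant of foo.jsp is reduced to the same URL before it is stored.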
cheers

solrindex way of mapping Nutch schema to Solr

We have several custom Nutch fields that the crawler picks up and indexes. Transferring these to Solr via solrindex (using the mapping file) appears to work fine: the log says everything is fine, yet the index in the Solr environment does not reflect the custom fields.
Any help will be much appreciated,
Thanks,
Ashok
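For context, a minimal sketch of what a custom field entry in Nutch's solrindex-mapping.xml typically looks like; the names myNutchField and my_solr_field are hypothetical, and the destination field must also be declared in Solr's schema for it to show up:
<mapping>
  <fields>
    <field dest="my_solr_field" source="myNutchField"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>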
What I would do is use a tool like tcpmon to monitor exactly what Nutch is sending to Solr. By examining the XML payload, you can determine whether Nutch is correctly sending those custom fields to Solr. If Nutch is sending them correctly, something is going wrong on the Solr side; if not, re-check your Nutch code.
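As a quick check on the Solr side (a sketch; my_solr_field is the hypothetical custom field from above, and the URL assumes a default single-core setup), you can ask Solr to return just that field:
curl "http://localhost:8983/solr/select?q=*:*&fl=id,my_solr_field&wt=json&rows=5"
If the field is missing from the response but present in the payload tcpmon shows, the Solr schema (or the field's stored attribute) is the place to look.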
