Does Solr do web crawling?

I am interested in doing web crawling, and I was looking at Solr.
Does Solr do web crawling, or what are the steps to do it?

Solr 5+ DOES in fact now do web crawling!
http://lucene.apache.org/solr/
Older Solr versions do not do web crawling on their own; historically Solr is a search server that provides full-text search capabilities, built on top of Lucene.
If you need to crawl web pages to feed into Solr, you have a number of options, including:
Nutch - http://lucene.apache.org/nutch/
Websphinx - http://www.cs.cmu.edu/~rcm/websphinx/
JSpider - http://j-spider.sourceforge.net/
Heritrix - http://crawler.archive.org/
If you want to make use of the search facilities provided by Lucene or SOLR you'll need to build indexes from the web crawl results.
See also:
Lucene crawler (it needs to build a Lucene index)
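To make that last step concrete, here is a minimal SolrJ sketch of pushing a document produced by an external crawler into Solr. The collection name, field names, and URL are placeholders (not taken from any of the crawlers above), and the Builder API shown assumes SolrJ 6 or newer:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CrawlResultIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder collection name and URL -- adjust to your setup.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/webcrawl").build()) {
            // In practice these values come from your crawler (Nutch, Heritrix, ...).
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "http://example.com/some-page");
            doc.addField("title", "Example page title");
            doc.addField("content", "Text extracted from the crawled page.");
            solr.add(doc);
            solr.commit();
        }
    }
}
Crawlers like Nutch ship their own Solr indexing step, so you would only write something like this when wiring a crawler to Solr yourself.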

Solr does not in and of itself have a web crawling feature.
Nutch is the "de-facto" crawler (and then some) for Solr.

Solr 5 started supporting simple web crawling (Java Doc). If you want search, Solr is the tool; if you want to crawl, Nutch/Scrapy is better :)
To get it up and running, you can take a detailed look here. However, here is how to get it up and running in one line:
java
-classpath <pathtosolr>/dist/solr-core-5.4.1.jar
-Dauto=yes
-Dc=gettingstarted -> collection: gettingstarted
-Ddata=web -> web crawling and indexing
-Drecursive=3 -> go 3 levels deep
-Ddelay=0 -> for the impatient use 10+ for production
org.apache.solr.util.SimplePostTool -> SimplePostTool
http://datafireball.com/ -> a testing wordpress blog
The crawler here is very "naive"; you can find all the code in Apache Solr's GitHub repo.
Here is what the output looks like:
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/gettingstarted/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=3, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://datafireball.com (depth: 0)
Entering crawl at level 1 (52 links total, 51 new)
POSTed web resource http://datafireball.com/2015/06 (depth: 1)
...
Entering crawl at level 2 (266 links total, 215 new)
...
POSTed web resource http://datafireball.com/2015/08/18/a-few-functions-about-python-path (depth: 2)
...
Entering crawl at level 3 (846 links total, 656 new)
POSTed web resource http://datafireball.com/2014/09/06/node-js-web-scraping-using-cheerio (depth: 3)
SimplePostTool: WARNING: The URL http://datafireball.com/2014/09/06/r-lattice-trellis-another-framework-for-data-visualization/?share=twitter returned a HTTP result status of 302
423 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update/extract...
Time spent: 0:05:55.059
In the end, you can see that all the data is indexed properly.
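If you would rather verify the result programmatically than through the admin UI, a small SolrJ query against the same gettingstarted collection will do. This is only a sketch and assumes SolrJ 6+ on the classpath:
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class VerifyCrawlIndex {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/gettingstarted").build()) {
            SolrQuery query = new SolrQuery("*:*");
            query.setRows(0); // only the total count is needed here
            QueryResponse response = solr.query(query);
            System.out.println("Documents indexed: "
                    + response.getResults().getNumFound());
        }
    }
}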

You might also want to take a look at
http://www.crawl-anywhere.com/
Very powerful crawler that is compatible with Solr.

I have been using Nutch with Solr on my latest project and it seems to work quite nicely.
If you are using a Windows machine then I would strongly recommend following the 'No cygwin' instructions given by Jason Riffel too!

Yes, I agree with the other posts here: use Apache Nutch.
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Note that your Solr version has to match the corresponding version of Nutch, because older versions of Solr store the indices in a different format.
Its tutorial:
http://wiki.apache.org/nutch/NutchTutorial

I know it's been a while, but in case someone else is looking for a Solr crawler like I was, there is a new open-source crawler called Norconex HTTP Collector.

I know this question is quite old, but I'll respond anyway for the newcomers who wander here.
In order to use Solr, you can use a web crawler that is capable of storing documents in Solr.
For instance, The Norconex HTTP Collector is a flexible and powerful open-source web crawler that is compatible with Solr.
To use Solr with the Norconex HTTP Collector, you need the Norconex HTTP Collector to crawl the website you want to collect data from, and you need to install the Norconex Apache Solr Committer to store the collected documents in Solr. Once the committer is installed, you will need to configure the crawler's XML configuration file. I would recommend that you follow this link to get started and test how the crawler works, and here to learn how to configure the configuration file. Finally, you will need this link to configure the committer section of the configuration file for Solr.
Note that if your goal is not to crawl web pages, Norconex also has a Filesystem Collector that can be used with the Solr Committer as well.

Def Nutch!
Nutch also has a basic web front end which will let you query your search results. You might not even need to bother with SOLR depending on your requirements. If you do a Nutch/SOLR combination you should be able to take advantage of the recent work done to integrate SOLR and Nutch ... http://issues.apache.org/jira/browse/NUTCH-442

Related

How do you configure Apache Nutch 2.3 to honour robots metatag?

I have Nutch 2.3 set up with HBase as the backend, and I run a crawl which includes indexing to Solr and Solr deduplication.
I have recently noticed that the Solr index contains unwanted webpages.
In order to get Nutch to ignore these webpages I set the following metatag:
<meta name="robots" content="noindex,follow">
I have visited the official Apache Nutch website, and it explains the following:
If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag
Searching the web for answers, I found recommendations to set Protocol.CHECK_ROBOTS or to set protocol.plugin.check.robots as a property in nutch-site.xml. Neither of these appears to work.
Currently, Nutch 2.3 ignores the noindex rule and therefore indexes the content to the external datastore, i.e. Solr.
The question is: how do I configure Nutch 2.3 to honour robots metatags?
Also, if Nutch 2.3 was previously configured to ignore the robots metatag and indexed a webpage during a previous crawl cycle, will the page be removed from the Solr index in future crawls, provided the robots metatag rules are now correct?
I've created a plugin to overcome the problem of Apache Nutch 2.3 NOT honouring the robots metatag rule noindex. The metarobots plugin forces Nutch to discard qualifying documents during indexing. This prevents the qualifying documents from being indexed to your external datastore, i.e. Solr.
Please note: this plugin prevents the indexing of documents that contain the robots metatag rule noindex; it does NOT remove any documents that were previously indexed to your external datastore.
Visit this link for instructions

Crawling with Nutch 2.3, Cassandra 2.0, and Solr 4.10.3 returns 0 results

I mainly followed the guide on this page. I installed Nutch 2.3, Cassandra 2.0, and Solr 4.10.3. Setup went well, but when I executed the following command, no URLs were fetched.
./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
Below are my settings.
nutch-site.xml
http://ideone.com/H8MPcl
regex-urlfilter.txt
+^http://([a-z0-9]*\.)*nutch.apache.org/
hadoop.log
http://ideone.com/LnpAw4
I don't see any errors in the log file. I am really lost. Any help would be appreciated. Thanks!
You will have to add a regex for the website you want to crawl to regex-urlfilter.txt, so that it picks up the link you have added in nutch-site.xml.
Right now it will only crawl "nutch.apache.org".
Try adding the line below:
+^http://([a-z0-9]*\.)*ideone.com/
Try setting the Nutch logs to debug level and collect the logs while executing the crawl command.
They will clearly show why you are unable to crawl and index the site.
Regards,
Jayesh Bhoyar
http://technical-fundas.blogspot.com/p/technical-profile.html
I ran into a similar problem recently. I think you can try the following steps to find out the problem.
1. Do some tests to make sure the DB works well.
2. Instead of running the crawl in batch, call Nutch step by step and watch the log changes as well as the changes in the DB content, in particular the new URLs.
3. Turn off Solr and focus on Nutch and the DB.

Nutch 2.X - Preferred URLs to fetch

I have this situation: there are over 160 URLs in my seed. I started my crawl one week ago. Now I have a lot of crawled pages in my storage, but I can see in my Solr index that some URLs from the seed are not crawled at all (those URLs are not restricted by a robots.txt) or only in very small numbers. Is it possible to tell Nutch to prefer some URLs?
Have you checked the topN value?
Or is Nutch still crawling? Indexing and sending data to Solr is done at the end of the process!

How do I tell Nutch to crawl *through* a url without storing it?

Let's say I have a Confluence instance, and I want to crawl it and store the results in Solr as part of an intranet search engine.
Now let's say I only want to store a subset of the pages (matching a regex) on the Confluence instance as part of the search engine.
But, I do want Nutch to crawl all the other pages, looking for links to pages that match—I just don't want Nutch to store them (or at least I don't want Solr to return them in the results).
What's the normal or least painful way to set Nutch->Solr up to work like this?
Looks like the only way to do this is to write your own IndexingFilter plugin (or find someone's to copy from).
[Will add my sample plugin code here when it's working properly]
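In the meantime, here is a rough sketch (not the author's plugin) of what such an indexing filter could look like. It assumes the Nutch 2.x IndexingFilter API; the package, class name, and regex are placeholders, so check the method signatures and the usual plugin.xml/nutch-site.xml plugin.includes wiring against your Nutch version:
package org.example.nutch;  // hypothetical package

import java.util.Collection;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.storage.WebPage;

// Drops documents whose URL does not match the "keep" pattern, while Nutch
// still fetches every page and follows its outlinks as usual.
public class UrlSubsetIndexingFilter implements IndexingFilter {

    // Placeholder: only index Confluence pages under a given space.
    private static final String KEEP_PATTERN = ".*/display/INTRANET/.*";

    private Configuration conf;

    @Override
    public NutchDocument filter(NutchDocument doc, String url, WebPage page)
            throws IndexingException {
        // Returning null tells Nutch not to send this document to Solr.
        return url.matches(KEEP_PATTERN) ? doc : null;
    }

    @Override
    public Collection<WebPage.Field> getFields() {
        return null; // no extra WebPage fields needed for a URL-only check
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
}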
References:
http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
http://florianhartl.com/nutch-plugin-tutorial.html
How to filter URLs in Nutch 2.1 solrindex command

Apache Nutch does not index the entire website, only subfolders

Apache Nutch 1.2 does not index my entire website, only subfolders. My index page provides links into most areas/subfolders of my website, for example stuff, students, research... But Nutch only crawls one specific folder, "students" in this case. It seems as if links in other directories are not followed.
crawl-urlfilter.txt:
+^http://www5.my-domain.de/
seed.txt in the URLs-folder:
http://www5.my-domain.de/
Starting Nutch with (Windows/Linux both used):
nutch crawl "D:\Programme\nutch-1.2\URLs" -dir "D:\Programme\nutch-1.2\crawl" -depth 10 -topN 1000000
Different variants of depth (5-23) and topN (100-1000000) were tested. Providing more links in seed.txt doesn't help at all; links found in injected pages are still not followed.
Interestingly, crawling gnu.org works perfectly. No robots.txt or blocking meta tags are used on my site.
Any ideas?
While attempting to crawl all links from an index page, I discovered that Nutch was limited to exactly 100 links out of around 1000. The setting that was holding me back was:
db.max.outlinks.per.page
Setting this to 2000 allowed Nutch to index all of them in one shot.
Check whether you have the intra-domain links limitation enabled (set the property to false in nutch-site.xml). Also check other properties such as the maximum intra/extra links per page and the HTTP content size. Sometimes they produce wrong results during crawling.
Ciao!
