solrindex way of mapping Nutch schema to Solr

We have several custom Nutch fields that the crawler picks up and indexes. Transferring these to Solr via solrindex (using the mapping file) appears to work, and the log shows everything is fine; however, the index on the Solr side does not reflect the custom fields.
Any help will be much appreciated,
Thanks,
Ashok

What I would do is use a tool like tcpmon to monitor exactly what Nutch is sending to Solr. By examining the XML payload, you can determine whether Nutch is correctly sending those custom fields to Solr. If it is, then something is going wrong on the Solr side; otherwise, re-check your Nutch code.
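In case it is useful, here is a minimal sketch of the two sides of that mapping. The custom field name (myCustomField) is hypothetical; the Nutch side lives in conf/solrindex-mapping.xml and the destination field must also be declared in Solr's schema.xml:
<!-- conf/solrindex-mapping.xml (Nutch side) -->
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <!-- hypothetical custom field produced by your indexing plugin -->
    <field dest="myCustomField" source="myCustomField"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
<!-- schema.xml (Solr side): the destination field must exist here -->
<field name="myCustomField" type="string" stored="true" indexed="true"/>
If the field is missing from the Solr schema, Solr may reject the document or silently drop the field depending on the schema configuration, which would match the symptom described above.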

Related

crawling with Nutch 2.3, Cassandra 2.0, and solr 4.10.3 returns 0 results

I mainly followed the guide on this page. I installed Nutch 2.3, Cassandra 2.0, and Solr 4.10.3. Setup went well, but when I executed the following command, no URLs were fetched.
./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
Below are my settings.
nutch-site.xml
http://ideone.com/H8MPcl
regex-urlfilter.txt
+^http://([a-z0-9]*\.)*nutch.apache.org/
hadoop.log
http://ideone.com/LnpAw4
I don't see any errors in the log file. I am really lost. Any help would be appreciated. Thanks!
You will have to add a regex for the website that you want to crawl to regex-urlfilter.txt, so that Nutch picks up the seed link you have added.
Right now it will only crawl "nutch.apache.org".
Try adding the line below:
+^http://([a-z0-9]*\.)*ideone.com/
Try setting the Nutch logs to debug level and capture the logs while executing the crawl command.
They will clearly show why you are unable to crawl and index the site.
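For reference, the debug level is usually set in conf/log4j.properties. A minimal sketch, assuming the stock Nutch log4j configuration and its cmdstdout appender:
# conf/log4j.properties - raise Nutch loggers to DEBUG
log4j.logger.org.apache.nutch=DEBUG,cmdstdout
# keep Hadoop noise down while debugging the crawl
log4j.logger.org.apache.hadoop=WARN,cmdstdout
Re-run the crawl afterwards and check logs/hadoop.log for the fetcher and indexer output.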
Regards,
Jayesh Bhoyar
http://technical-fundas.blogspot.com/p/technical-profile.html
I ran into a similar problem recently. I think you can try the following steps to find out where it breaks:
1. Do some tests to make sure the DB works well.
2. Instead of running the crawl as one batch, call the Nutch steps one by one and watch how the log and the DB content change, in particular the new URLs (see the sketch after this list).
3. Turn off Solr and focus on Nutch and the DB first.
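As a rough sketch of what running Nutch 2.x step by step looks like (exact flags and batch-id arguments vary between versions, so treat these as an outline rather than copy-paste commands):
bin/nutch inject urls/                # seed the webpage table in Cassandra
bin/nutch generate -topN 100          # select a batch of URLs to fetch
bin/nutch fetch -all                  # fetch the generated batch
bin/nutch parse -all                  # parse the fetched pages
bin/nutch updatedb -all               # write outlinks / newly discovered URLs back to the DB
bin/nutch solrindex http://localhost:8983/solr/ -all   # only at the end, push to Solr
After each step you can inspect the webpage table in Cassandra to see whether new rows and new URLs actually appear.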

Apache Nutch - indexing only the modified files in Solr

I am able to set up Apache Nutch and get the data indexed in Solr. While indexing, I am trying to make sure that only modified pages get indexed. Below are the two questions we have regarding this.
1. Is it possible to tell Nutch to send an 'If-Modified-Since' header while crawling the site and download a page only if it has changed since the last time it was crawled?
2. I can see that Nutch forms an MD5 digest of the retrieved page content, but even though the digest hasn't changed (compared to the previous version), it still indexes the page in Solr. Is there any setting within Nutch to ensure that if the content hasn't changed, the page is not indexed in Solr?
Answering my own question here; hope it helps someone.
Once I set the AdaptiveFetchSchedule, I could see that Nutch was no longer pulling pages that hadn't changed. It honors the If-Modified-Since header.
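For anyone looking for the concrete setting, this is roughly what it looks like in conf/nutch-site.xml (the interval value below is just illustrative; AdaptiveFetchSchedule then lengthens or shortens it per page depending on whether the content changes):
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- illustrative: start by re-checking pages roughly once a day -->
  <value>86400</value>
</property>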

Solr and Nutch - How to take control over Facets?

Sorry if this question might be too general. I'd be happy with good links to documentation, if there are any. Google won't help me find them.
I need to understand how facets can be extracted from a web site crawled by Nutch then indexed by Solr. On the web site, pages have meta tags, like <meta name="price" content="123.45"/> or <meta name="categories" content="category1, category2"/>. Can I tell Nutch to extract those and Solr to treat them as facets?
In the example above, I want to specify manually that the meta name "categories" is to be treated as a facet, but the content should be dynamically used as categories.
Does it make sense? Is it possible to do with Nutch and Solr, or should I rethink my way of using it?
I haven't used Nutch (I use Heritrix), but at the end of the day, Nutch needs to extract the "meta" tag values and index them in Solr (using SolrJ, for example) as separate Solr fields such as "price", "categories", etc.
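A minimal SolrJ sketch of what that indexing step could look like (the field names, the core name "myrep", and the document values are hypothetical; class names assume a reasonably recent SolrJ, while older versions use HttpSolrServer instead of HttpSolrClient):
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MetaFieldIndexer {
    public static void main(String[] args) throws Exception {
        // hypothetical core named "myrep"
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/myrep").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/products/1"); // page URL as unique key
        doc.addField("title", "Example product page");
        doc.addField("price", 123.45);                       // value taken from <meta name="price">
        doc.addField("categories", "category1");             // multiValued field: one value per call
        doc.addField("categories", "category2");

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}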
Then you do
http://localhost:8080/solr/myrep/select?q=mobile&facet=true&facet.limit=10&facet.field=categories
to get facets per category. Here is a page on facets:
http://wiki.apache.org/solr/SolrFacetingOverview
One of the options is to use Nutch with the metadata (urlmeta) plugin.
Although it is presented as an example, it is included with the distribution.
Assuming you know the other steps of configuring Nutch and crawling data with it, you need to configure Nutch to use the metadata plugin before indexing, like this.
Edit conf/nutch-site.xml
<property>
<name>plugin.includes</name>
<value>urlmeta|(rest of the plugins)</value>
</property>
The metadata tags that need to be indexed, like price, can be supplied as another property:
<property>
<name>urlmeta.tags</name>
<value>price</value>
</property>
Now you can run the Nutch crawl command. After crawling and indexing with Solr, you should see a price field in the index. Facet search can then be used by adding facet.field to your query.
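For example, a query along these lines (the core name is hypothetical) returns facet counts for the price field:
http://localhost:8983/solr/collection1/select?q=*:*&facet=true&facet.field=price&facet.limit=10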

Nutch solrindex command not indexing all URLs in Solr

I have a Nutch index crawled from a specific domain and I am using the solrindex command to push the crawled data to my Solr index. The problem is that it seems that only some of the crawled URLs are actually being indexed in Solr. I had the Nutch crawl output to a text file so I can see the URLs that it crawled, but when I search for some of the crawled URLs in Solr I get no results.
Command I am using to do the Nutch crawl: bin/nutch crawl urls -dir crawl -depth 20 -topN 2000000
This command is completing successfully and the output displays URLs that I cannot find in the resulting Solr index.
Command I am using to push the crawled data to Solr: bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
The output for this command says it is also completing successfully, so it does not seem to be an issue with the process terminating prematurely (which is what I initially thought it might be).
One final thing that I am finding strange is that the entire Nutch & Solr config is identical to a setup I used previously on a different server and I had no problems that time. It is literally the same config files copied onto this new server.
TL;DR: I have a set of URLs successfully crawled in Nutch, but when I run the solrindex command only some of them are pushed to Solr. Please help.
UPDATE: I've re-run all these commands and the output still insists it's all working fine. I've looked into any blockers for indexing that I can think of, but still no luck. The URLs being passed to Solr are all active and publicly accessible, so that's not an issue. I'm really banging my head against a wall here so would love some help.
I can only guess what happened, based on my experience:
There is a component called the URL normalizer (with its own configuration file) which rewrites some URLs (removing URL parameters, session IDs, ...).
Additionally, Nutch enforces a uniqueness constraint: by default, each URL is only saved once.
So, if the normalizer rewrites two or more URLs ('foo.jsp?param=value', 'foo.jsp?param=value2', 'foo.jsp?param=value3', ...) to exactly the same one ('foo.jsp'), only one of them is saved, and Solr will only see a subset of all your crawled URLs.
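To illustrate, assuming the regex URL normalizer is the one doing this, its rules typically live in conf/regex-normalize.xml and look roughly like the following (a deliberately aggressive, hypothetical rule that strips the whole query string; the stock file ships with more targeted rules for session IDs and the like):
<regex-normalize>
  <!-- hypothetical rule: drop the entire query string from every URL -->
  <regex>
    <pattern>\?.*$</pattern>
    <substitution></substitution>
  </regex>
</regex-normalize>
With a rule like that, 'foo.jsp?param=value' and 'foo.jsp?param=value2' both normalize to 'foo.jsp', and only one document survives.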
cheers

Does solr do web crawling?

I am interested in doing web crawling, and I was looking at Solr.
Does Solr do web crawling, or what are the steps to do web crawling?
Solr 5+ DOES in fact now do web crawling!
http://lucene.apache.org/solr/
Older Solr versions do not do web crawling alone, as historically it's a search server that provides full text search capabilities. It builds on top of Lucene.
If you need to crawl web pages to feed into Solr, you have a number of options, including:
Nutch - http://lucene.apache.org/nutch/
Websphinx - http://www.cs.cmu.edu/~rcm/websphinx/
JSpider - http://j-spider.sourceforge.net/
Heritrix - http://crawler.archive.org/
If you want to make use of the search facilities provided by Lucene or SOLR you'll need to build indexes from the web crawl results.
See this also:
Lucene crawler (it needs to build lucene index)
Solr does not in of itself have a web crawling feature.
Nutch is the "de-facto" crawler (and then some) for Solr.
Solr 5 started supporting simple web crawling (Java Doc). If you want search, Solr is the tool; if you want to crawl, Nutch/Scrapy is better :)
To get it up and running, you can take a detailed look here. However, here is how to get it up and running in one command:
java
-classpath <pathtosolr>/dist/solr-core-5.4.1.jar
-Dauto=yes
-Dc=gettingstarted -> collection: gettingstarted
-Ddata=web -> web crawling and indexing
-Drecursive=3 -> go 3 levels deep
-Ddelay=0 -> for the impatient use 10+ for production
org.apache.solr.util.SimplePostTool -> SimplePostTool
http://datafireball.com/ -> a testing wordpress blog
The crawler here is very "naive"; you can find all of the code in Apache Solr's GitHub repo.
Here is what the output looks like:
SimplePostTool version 5.0.0
Posting web pages to Solr url http://localhost:8983/solr/gettingstarted/update/extract
Entering auto mode. Indexing pages with content-types corresponding to file endings xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
Entering recursive mode, depth=3, delay=0s
Entering crawl at level 0 (1 links total, 1 new)
POSTed web resource http://datafireball.com (depth: 0)
Entering crawl at level 1 (52 links total, 51 new)
POSTed web resource http://datafireball.com/2015/06 (depth: 1)
...
Entering crawl at level 2 (266 links total, 215 new)
...
POSTed web resource http://datafireball.com/2015/08/18/a-few-functions-about-python-path (depth: 2)
...
Entering crawl at level 3 (846 links total, 656 new)
POSTed web resource http://datafireball.com/2014/09/06/node-js-web-scraping-using-cheerio (depth: 3)
SimplePostTool: WARNING: The URL http://datafireball.com/2014/09/06/r-lattice-trellis-another-framework-for-data-visualization/?share=twitter returned a HTTP result status of 302
423 web pages indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update/extract...
Time spent: 0:05:55.059
In the end, you can see all the data are indexed properly.
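As a quick sanity check (assuming the gettingstarted collection from the example above), you can ask Solr how many documents it now holds; the numFound value in the response should roughly match the number of pages reported as indexed:
curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=0&wt=json"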
You might also want to take a look at
http://www.crawl-anywhere.com/
Very powerful crawler that is compatible with Solr.
I have been using Nutch with Solr on my latest project and it seems to work quite nicely.
If you are using a Windows machine then I would strongly recommend following the 'No cygwin' instructions given by Jason Riffel too!
Yes, I agree with the other posts here, use Apache Nutch
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Note that your Solr version has to match the correct version of Nutch, because older versions of Solr store the indices in a different format.
Its tutorial:
http://wiki.apache.org/nutch/NutchTutorial
I know it's been a while, but in case someone else is searching for a Solr crawler like me, there is a new open-source crawler called Norconex HTTP Collector
I know this question is quite old, but I'll respond anyway for any newcomer who wanders in here.
In order to use Solr, you can use a web crawler that is capable of storing documents in Solr.
For instance, The Norconex HTTP Collector is a flexible and powerful open-source web crawler that is compatible with Solr.
To use Solr with the Norconex HTTP Collector, you will need the Norconex HTTP Collector itself, which crawls the website you want to collect data from, and you will need to install the Norconex Apache Solr Committer to store the collected documents in Solr. Once the committer is installed, you will need to configure the crawler's XML configuration file. I would recommend that you follow this link to get started and test how the crawler works, and here to learn how to configure the configuration file. Finally, you will need this link to configure the committer section of the configuration file for Solr.
Note that if your goal is not to crawl web pages, Norconex also has a Filesystem Collector that can be used with the Solr Committer as well.
Definitely Nutch!
Nutch also has a basic web front end which will let you query your search results. You might not even need to bother with SOLR depending on your requirements. If you do a Nutch/SOLR combination you should be able to take advantage of the recent work done to integrate SOLR and Nutch ... http://issues.apache.org/jira/browse/NUTCH-442
