Nutch versus Solr

I'm currently collecting information on whether I should use Nutch together with Solr (domain: vertical web search).
Could you offer any suggestions?

Nutch is a framework for building web crawlers and search engines. Nutch can do the whole process, from collecting the web pages to building the inverted index. It can also push those indexes to Solr.
Solr is mainly a search engine with support for faceted search and many other neat features. But Solr doesn't fetch the data; you have to feed it.
So maybe the first thing you have to ask in order to choose between the two is whether you already have the data to be indexed (in XML, in a CMS, or in a database). In that case, you should probably just use Solr and feed it that data. On the other hand, if you have to fetch the data from the web, you are probably better off with Nutch.
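To make the "feed it" part concrete, here is a minimal SolrJ sketch of pushing a document into Solr and running a faceted query. It assumes a recent SolrJ; the Solr URL, core name, and field names are illustrative assumptions, not something from the original answer:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class FeedSolrExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name and URL; adjust to your own Solr setup.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            // Solr does not fetch this data itself; something else has to push it.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "page-001");
            doc.addField("title", "Example page");
            doc.addField("category", "news");
            solr.add(doc);
            solr.commit();

            // A simple faceted query over the hypothetical "category" field.
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);
            query.addFacetField("category");
            QueryResponse response = solr.query(query);
            System.out.println(response.getFacetField("category").getValues());
        }
    }
}
```

If your data already lives in XML or a database, Solr-side tools such as the post tool or the DataImportHandler can play the same feeding role; the point is simply that something outside Solr has to do the fetching.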

Related

Sitecore 8.2 Update 1, Using SOLR for both CMS search and Web search

Can SOLR be used for both Sitecore search and website search with the latest release, 8.2.1? The solution needs to be deployed on the cloud.
Is it required to have both Coveo and SOLR in order to implement search for Sitecore 8.2.1?
What is the recommended best practice for a SOLR search implementation covering both CMS and web search?
Are there any code samples available?
I have already gone through the following links:
https://doc.sitecore.net/sitecore_experience_platform/setting_up__maintaining/search_and_indexing/indexing/configure_a_search_and_indexing_provider
https://doc.sitecore.net/sitecore%20experience%20platform/setting%20up%20%20maintaining/search%20and%20indexing/walkthrough%20setting%20up%20solr
Yes, SOLR can be used for both Sitecore and website search. It is recommended over Lucene in distributed environments with multiple CM and CD servers.
It is not required to have both SOLR and Coveo; SOLR is enough. However, Coveo provides features that SOLR doesn't have, like UI components that are easy to customize for content editors and marketers, specialized search usage analytics, machine learning, great relevance of search results, support for multiple languages, and so on. Coveo can only be used for website search, so if you use Coveo, you still need Lucene or SOLR for Sitecore search.
With SOLR, it is recommended to create separate Sitecore indexes for your website search needs and leave the default Sitecore indexes there for the Sitecore search. On the SOLR side, those new indexes should be stored in separate SOLR cores.
As for code samples, I don't know of any.

Sharing crawled nutch data between multiple solr indexes

We have thousands of solr indexes/collections that share pages being crawled by nutch.
Currently these pages are being crawled multiple times, once for each solr index that contains them.
Is it possible to crawl these sites once and share the crawl data between indexes?
Maybe by checking existing crawldbs to see if a site has already been crawled, and getting the data from there for parsing and indexing.
Or crawl all sites in one go, and then selectively submit crawl data to each index (e.g. one site per segment, though I'm not sure how to identify which segment belongs to which site, since segment names are numeric).
Any ideas or help appreciated :)
You will need to write a new indexer plugin to do that; look at Nutch's SolrIndexer to understand how to write a new indexer. In that indexer, you should do the following:
Define three or four Solr server instances, one for each core.
Inside the write method of the indexer, examine the type of the document and use the right Solr core to add it. To do this, you should have a field in Nutch that you can use to determine where to send the document.
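As a rough sketch of that routing idea (this is not Nutch's actual plugin interface; copy the real structure from the bundled Solr indexer as described above, and treat the core URLs and the "site" field as assumptions):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

/** Routes each crawled document to one of several Solr cores based on a document field. */
public class RoutingIndexer {

    private final Map<String, SolrClient> coresBySite = new HashMap<>();
    private final SolrClient fallback;

    public RoutingIndexer() {
        // One client per core; hypothetical core names and URLs.
        coresBySite.put("siteA", new HttpSolrClient.Builder("http://localhost:8983/solr/siteA").build());
        coresBySite.put("siteB", new HttpSolrClient.Builder("http://localhost:8983/solr/siteB").build());
        fallback = new HttpSolrClient.Builder("http://localhost:8983/solr/catchall").build();
    }

    /** Equivalent of the plugin's write step: pick the core from a field your crawl job sets. */
    public void write(SolrInputDocument doc) throws Exception {
        Object site = doc.getFieldValue("site"); // "site" field assumed to be added during parsing/indexing
        SolrClient target = coresBySite.getOrDefault(String.valueOf(site), fallback);
        target.add(doc);
    }

    public void commitAll() throws Exception {
        for (SolrClient c : coresBySite.values()) c.commit();
        fallback.commit();
    }
}
```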

What is the role of NUTCH if we are going to make a search engine using Hadoop and Solr?

I want to make a search engine in which I crawl some sites and store their indexes and info in Hadoop, and then do the searching with Solr.
But I am facing lots of issues. If I search on Google, different people give different suggestions and different ways of configuring a Hadoop-based search engine.
These are some of my questions:
1) How will the crawling be done? Is NUTCH needed for the crawling or not? If yes, how do Hadoop and NUTCH communicate with each other?
2) What is the use of Solr? If NUTCH does the crawling and stores the crawled indexes and their information in Hadoop, then what's the role of Solr?
3) Can we do the searching using Solr and Nutch? If yes, where will they save their crawled indexes?
4) How does Solr communicate with Hadoop?
5) Please explain the steps one by one if possible: how can I crawl some sites, save their info into a DB (Hadoop or any other), and then search?
I am really stuck with this. Any help will be really appreciated.
A very big thanks in advance. :)
Please help me sort out this huge issue.
We are using Nutch as a web crawler and Solr for searching in some production environments, so I hope I can give you some information about 3).
How does this work? Nutch has its own crawl db and a set of websites where it starts crawling. It has plugins where you can configure different things, like PDF crawling or which fields get extracted from HTML pages, and so on. When crawling, Nutch stores all links extracted from a website and will follow them in the next cycle. All crawling results are stored in the crawl db. In Nutch you configure an interval after which crawled results become outdated and the crawler begins again from the defined start sites.
The results inside the crawl db are synchronized to the Solr index, so you are searching on the Solr index. In this setup, Nutch is only there to get data from websites and provide it to Solr.
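Once the crawl results are in the Solr index, the search side only talks to Solr. Here is a minimal SolrJ query sketch, assuming a recent SolrJ, a core named "nutch", and the usual Nutch-style fields like "url", "title", and "content" (all of these names are assumptions):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build()) {
            // Query the fields that the Nutch indexing job filled in.
            SolrQuery query = new SolrQuery("content:hadoop");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("url") + " : " + doc.getFieldValue("title"));
            }
        }
    }
}
```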

Is Solr necessary to index crawled data for Nutch?

I found that Nutch 1.4 only ships with one indexer, solrindex. Is Solr the only way for Nutch to index the crawled data? If not, what are the other ways?
I'm also wondering why Nutch 1.4 uses Solr to index the data. Why not do it itself? Doesn't it increase the coupling of these two projects?
Solr uses Lucene internally. Since 2005, Nutch has been a subproject of Lucene. Historically, Nutch used Lucene indexes and was a full-fledged search engine (this was the case until version 1.0). It had crawling capability and even support for indexing data, plus a browser UI to query the indexed data (similar to a Google search).
As the initial design was based around Lucene (another Apache project which earned a lot of kudos at that period and still rocks), the Nutch code was NOT changed or made generic so that other indexing frameworks could be used. If you want to do that, it takes a lot of effort to plug your own indexing framework into it.
In recent versions (Nutch 1.3 and later), the Nutch dev team realized that it is difficult to keep up with the work involved in indexing, due to changing needs and the expertise required. It was better to delegate the responsibility of indexing to Solr (a Lucene-based indexing framework) and have the Nutch developers focus only on the crawling part. So now Nutch is not a full-fledged search engine, but it is a full-fledged web crawler.
Hope this answers your query. You can browse nutch news for more info.
Latest happenings:
Recently there have been efforts to create a generic library for crawlers (under Apache Commons). This project is commons-crawler, which will have all the functions required for a web crawler and can be used for creating crawlers. Future Nutch versions will use this library as a dependency.

Hadoop to create an Index and Add() it to distributed SOLR... is this possible? Should I use Nutch? ..Cloudera?

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?
I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation.
Here is my proposed architecture:
Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLinq) to prepare those documents for indexing
Index those documents into a Lucene.NET / Lucene (java) compatible file format
Deploy that file to all my Solr instances
Activate that replicated index
If that above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.
Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCX, DOC, OLE, etc.), index them, copy the index to my Solr instances, and somehow "activate" it so it is searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr.
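As a rough illustration of the "index into a Lucene-compatible file format" step, here is a minimal plain-Lucene (Java) sketch. It assumes a recent Lucene version; the output path and field names are made up, and the fields would have to line up with your Solr schema for the resulting index to be usable by Solr:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BuildIndexExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical on-disk location for the index that would later be copied to a Solr instance.
        try (Directory dir = FSDirectory.open(Paths.get("/data/index-shard-0"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("id", "logfile-0001", Field.Store.YES));
            doc.add(new TextField("content", "text extracted from the PDF/DOC/log file", Field.Store.NO));
            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```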
The reason I bring .NET into the picture is that we are mostly a .NET shop. The only Unix / Java we will have is Solr, plus a front end that leverages the REST interface via SolrNet.
Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?
What should I avoid doing so that I don't lose faceted search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it's saying.
Generally, what you've described is almost exactly how Nutch works. Nutch is a crawling, indexing, index-merging, and query-answering toolkit that's based on Hadoop core.
You shouldn't mix up Cloudera, Hadoop, Nutch, and Lucene; you'll most likely end up using all of them:
Nutch is the name of the indexing / query-answering (like Solr) machinery.
Nutch itself runs on a Hadoop cluster (and makes heavy use of Hadoop's own distributed file system, HDFS).
Nutch uses the Lucene index format.
Nutch includes a query-answering frontend, which you can use, or you can attach a Solr frontend and use the Lucene indexes from there.
Finally, the Cloudera Hadoop Distribution (CDH) is just a Hadoop distribution with several dozen patches applied to make it more stable and to backport some useful features from development branches. Yeah, you'd most likely want to use it, unless you have a reason not to (for example, if you want a bleeding-edge Hadoop 0.22 trunk).
Generally, if you're just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index various crazy types of documents, including MS Word documents, PDFs, etc.
I personally don't see much point in using .NET technologies here, but if you feel comfortable with them, you can do the front-ends in .NET. However, working with Unix technologies might feel fairly awkward for a Windows-centric team, so if I were managing such a project, I'd consider alternatives, especially if your task of crawling & indexing is limited (i.e. you don't want to crawl the whole internet for some purpose).
Have you looked at Lucandra (https://github.com/tjake/Lucandra), a Cassandra-based back end for Lucene/Solr? You can use Hadoop to populate the Cassandra store with the index of your data.
