Is Solr necessary to index crawled data for Nutch?

I found that Nutch 1.4 contains only one indexer, solrindex. Is Solr the only way for Nutch to index the crawled data? If not, what are the other ways?
I'm also wondering why Nutch 1.4 uses Solr to index the data. Why not do it itself? Doesn't this increase the coupling between the two projects?

Solr uses Lucene internally, and Nutch has been a subproject of Lucene since 2005. Historically (up to version 1.0), Nutch used Lucene indexes and was a full-fledged search engine: it could crawl, index the crawled data, and serve a browser UI for querying the index (similar to a Google search).
Because the initial design was built around Lucene (another Apache project that earned a lot of kudos at the time and still does), the Nutch code was never made generic enough to plug in other indexing frameworks. If you want to use a different one, it takes considerable effort to integrate it.
In recent versions (Nutch 1.3 and later), the Nutch dev team realized that it was difficult to keep up with the work involved in indexing, given the changing requirements and the expertise required. It was better to delegate the responsibility of indexing to Solr (a Lucene-based indexing framework) and let the Nutch developers focus only on the crawling part. So Nutch is no longer a full-fledged search engine, but it is a full-fledged web crawler.
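To make the delegation concrete: in Nutch 1.4 the indexing step is a Hadoop job that pushes crawled segments to a running Solr instance. A minimal sketch of driving that step from Java, equivalent to running "bin/nutch solrindex" (the Solr URL and crawl paths are illustrative):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.indexer.solr.SolrIndexer;
    import org.apache.nutch.util.NutchConfiguration;

    public class IndexToSolr {
        public static void main(String[] args) throws Exception {
            // Equivalent to:
            // bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20120101000000
            int res = ToolRunner.run(NutchConfiguration.create(), new SolrIndexer(),
                    new String[] {
                            "http://localhost:8983/solr/",   // Solr instance that receives the documents
                            "crawl/crawldb",                 // crawl database produced by the crawl
                            "-linkdb", "crawl/linkdb",       // link database (anchor texts)
                            "crawl/segments/20120101000000"  // a fetched segment to index
                    });
            System.exit(res);
        }
    }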
Hope this answers your query. You can browse the Nutch news page for more info.
Latest happenings:
An effort has recently started to create a generic library for crawlers (under Apache Commons). This project, commons-crawler, will have all the functions required for a web crawler and can be used for creating crawlers; further Nutch versions will use this library as a dependency.

Related

Sitecore 8.2 Update 1, Using SOLR for both CMS search and Web search

Can Solr be used for Sitecore search and website search with the latest release, 8.2.1? The solution needs to be deployed in the cloud.
Is it required to have both Coveo and Solr in order to implement search for Sitecore 8.2.1?
What is the recommended best practice for a Solr search implementation covering both CMS and web search?
Are there any code samples available?
I have already gone through the following links:
https://doc.sitecore.net/sitecore_experience_platform/setting_up__maintaining/search_and_indexing/indexing/configure_a_search_and_indexing_provider
https://doc.sitecore.net/sitecore%20experience%20platform/setting%20up%20%20maintaining/search%20and%20indexing/walkthrough%20setting%20up%20solr
Yes, Solr can be used for both Sitecore search and website search. It is recommended over Lucene in distributed environments with multiple CM and CD servers.
It is not required to have both Solr and Coveo; Solr is enough. However, Coveo provides features that Solr doesn't, such as UI components that are easy for content editors and marketers to customize, specialized search usage analytics, machine learning, strong relevance of search results, and support for multiple languages. Coveo can only be used for website search, so if you use Coveo, you still need Lucene or Solr for Sitecore search.
With Solr, it is recommended to create separate Sitecore indexes for your website search needs and leave the default Sitecore indexes in place for Sitecore search. On the Solr side, those new indexes should be stored in separate Solr cores.
As for code samples, I don't know of any.

Can GSA use data indexed by Apache Solr for search, as a combined solution?

It is observed that Google does not provide good indexing through its enterprise search solution, the Google Search Appliance (GSA), whereas Apache Solr has good indexing capability. Can we use Apache Solr to index documents and then have those documents searched through the GSA server, so that we get the best of both worlds? Kindly give your thoughts.
Can you please provide more details on why you think the GSA "does not provide good indexing"?
The GSA is generally recognised as the best, or at least one of the best, when it comes to result relevancy. When it comes to non-web content, Google supplies multiple connectors that allow you to index this content in the GSA, and if you have a content source that is neither web-based nor covered by one of the Google connectors, it is not difficult to write your own.
So I'm not sure why you think the indexing is not good; it would be really helpful if you could elaborate.
Mohan is incorrect when he says that you cannot serve Solr content via a GSA; you certainly can do this. What you will need to do is create a OneBox module so that you can federate Solr results in real time; they will be presented to the right of the main GSA results.
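For illustration, here is a rough sketch of such a OneBox provider as a servlet that queries Solr via SolrJ 3.x and emits OneBox XML. The element names (OneBoxResults, resultCode, MODULE_RESULT, U, Title) are quoted from memory of the OneBox provider schema, so verify them against the OneBox Developer's Guide; the Solr URL and field names are illustrative:

    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SolrOneBoxServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws java.io.IOException {
            String query = req.getParameter("query"); // the GSA passes the user query along
            resp.setContentType("text/xml");
            PrintWriter out = resp.getWriter();
            try {
                CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
                QueryResponse rsp = solr.query(new SolrQuery(query).setRows(8)); // a OneBox shows only a few results
                out.println("<OneBoxResults>");
                out.println("  <resultCode>success</resultCode>");
                for (SolrDocument doc : rsp.getResults()) {
                    // Real code must XML-escape these field values.
                    out.println("  <MODULE_RESULT>");
                    out.println("    <U>" + doc.getFieldValue("url") + "</U>");
                    out.println("    <Title>" + doc.getFieldValue("title") + "</Title>");
                    out.println("  </MODULE_RESULT>");
                }
                out.println("</OneBoxResults>");
            } catch (Exception e) {
                out.println("<OneBoxResults><resultCode>lookupFailure</resultCode></OneBoxResults>");
            }
        }
    }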
What is your data source?
If it is a website crawl, then to my limited knowledge the GSA provides more sophisticated crawling/indexing capability for websites than Solr, because Solr needs an external toolkit such as Tika or Nutch to crawl web resources, whereas the GSA has its own crawler, which makes crawling simple and effective.
Regarding your question on indexing through Solr and serving through the GSA, it is possible through a OneBox module (refer to BigMikeW's answer).
If you can provide some information about your data sources, it might help people suggest the best solution to increase indexing capability in the GSA.

Migrate data from Solr 3

I'm thinking about migrating from Solr 3 to SolrCloud or Elasticsearch, and I was wondering: is it possible to import data indexed with Solr 3.x into SolrCloud (Solr 4) and/or Elasticsearch?
They're all Lucene-based, but since they have different behaviors I'm not really sure it will work.
Has anyone ever done this? How did it go? Were there any issues?
Regarding importing data from Solr to Elasticsearch, you can take a look at the elasticsearch mock solr plugin. It adds a Solr-alike endpoint to Elasticsearch, so that you can use the indexer you've written for Solr (if you have one) to index documents in Elasticsearch.
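For example, an indexer written against SolrJ can be pointed straight at Elasticsearch. A minimal sketch, assuming SolrJ 3.x and assuming the plugin exposes its Solr-alike endpoint at /_solr (check the plugin's README for the actual path):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MockSolrIndexing {
        public static void main(String[] args) throws Exception {
            // The same SolrJ client you used against Solr, pointed at Elasticsearch
            SolrServer server = new CommonsHttpSolrServer("http://localhost:9200/_solr");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "migrated document");
            server.add(doc);   // goes through the Solr-alike endpoint into an ES index
            server.commit();
        }
    }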
Also, I've been working on an Elasticsearch Solr river, which allows you to import data from Solr to Elasticsearch through the SolrJ library. The only limitation is that it can import only the fields you configured as stored in Solr. I should be able to make it public pretty soon, just a matter of days; I'll update my answer as soon as it's available.
Regarding the upgrade of Solr from 3.x to 4.0: not a big deal. The index format has changed, but Solr will take care of upgrading the index; that happens automatically once you start Solr with your old index. After that, however, the index can no longer be read by a previous Solr/Lucene version. If you have a master/slave setup, you should upgrade the slaves first; otherwise the index on the master would be replicated to slaves that cannot read it yet.
UPDATE
Regarding the river I mentioned: I have made it public. You can download it from my GitHub profile: https://github.com/javanna/elasticsearch-river-solr.

Hadoop to create an index and Add() it to distributed Solr... is this possible? Should I use Nutch? Cloudera?

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?
I have a burst of information (log files and documents) that will be transported over the internet and stored in my datacenter (or on Amazon). It needs to be parsed, indexed, and finally made searchable by our replicated Solr installation.
Here is my proposed architecture:
1. Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLinq) to prepare those documents for indexing
2. Index those documents into a Lucene.NET / Lucene (Java) compatible file format
3. Deploy that file to all my Solr instances
4. Activate that replicated index
If the above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor-supported and has a ton of patches not included in the stock Hadoop install, I think it may be worth looking at.
Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCX, DOC, OLE, etc.), index them, copy the index to my Solr instances, and somehow "activate" it so it is searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr.
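To make the indexing step concrete, here is a minimal sketch of what I have in mind on the Java side, assuming the old Hadoop 0.20 mapred API and Lucene 3.x (field names and paths are only illustrative): each reducer builds a Lucene index shard on local disk, then ships it to HDFS for later deployment to the Solr instances.

    import java.io.File;
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        private IndexWriter writer;
        private File localShard;
        private JobConf conf;

        @Override
        public void configure(JobConf job) {
            this.conf = job;
            try {
                localShard = new File("/tmp/index-shard-" + job.get("mapred.task.id"));
                writer = new IndexWriter(FSDirectory.open(localShard),
                        new StandardAnalyzer(Version.LUCENE_30), true,
                        IndexWriter.MaxFieldLength.UNLIMITED);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        // key = document id, values = text extracted in the map phase (e.g. by Tika)
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            while (values.hasNext()) {
                Document doc = new Document();
                doc.add(new Field("id", key.toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("content", values.next().toString(), Field.Store.NO, Field.Index.ANALYZED));
                writer.addDocument(doc);
            }
        }

        @Override
        public void close() throws IOException {
            writer.optimize();  // a single segment makes the shard easier to deploy
            writer.close();
            // ship the finished shard to HDFS for later deployment to Solr
            FileSystem.get(conf).copyFromLocalFile(
                    new Path(localShard.getAbsolutePath()),
                    new Path("/indexes/" + localShard.getName()));
        }
    }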
The reason I bring .NET into the picture is because we are mostly a .NET shop. The only Unix / Java we will have is Solr and have a front end that leverages the REST interface via Solrnet.
Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?
What should I do to avoid losing faceted search? After reading the Nutch documentation, I believe it said that Nutch does not do faceting, but I may not have enough background in this software to understand what it's saying.
Generally, what you've described is almost exactly how Nutch works. Nutch is a crawling, indexing, index-merging, and query-answering toolkit that's based on Hadoop core.
You shouldn't conflate Cloudera, Hadoop, Nutch, and Lucene. You'll most likely end up using all of them:
Nutch is the name of the indexing / query-answering (like Solr) machinery.
Nutch itself runs on a Hadoop cluster (and heavily uses Hadoop's own distributed file system, HDFS).
Nutch uses the Lucene index format.
Nutch includes a query-answering frontend, which you can use, or you can attach a Solr frontend and use the Lucene indexes from there.
Finally, the Cloudera Hadoop Distribution (CDH) is just a Hadoop distribution with several dozen patches applied to make it more stable and to backport some useful features from development branches. Yes, you'd most likely want to use it, unless you have a reason not to (for example, if you want the bleeding-edge Hadoop 0.22 trunk).
Generally, if you're just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index various document types, including MS Word documents, PDFs, etc.
I personally don't see much point in using .NET technologies here, but if you feel comfortable with them, you can build the front-ends in .NET. However, working with Unix technologies may feel fairly awkward for a Windows-centric team, so if I were managing such a project, I'd consider alternatives, especially if your crawling and indexing task is limited in scope (i.e., you don't want to crawl the whole internet).
Have you looked at Lucandra (https://github.com/tjake/Lucandra)? It's a Cassandra-based back end for Lucene/Solr; you can use Hadoop to populate the Cassandra store with the index of your data.

Nutch versus Solr

I am currently collecting information on whether I should use Nutch with Solr (domain: vertical web search).
Could you advise me?
Nutch is a framework for building web crawlers and search engines. Nutch can handle the whole process, from collecting the web pages to building the inverted index. It can also push those indexes to Solr.
Solr is mainly a search engine with support for faceted search and many other neat features. But Solr doesn't fetch the data; you have to feed it.
So maybe the first thing you have to ask in order to choose between the two is whether or not you already have the data to be indexed (in XML, in a CMS, or in a database). In that case, you should probably just use Solr and feed it that data. On the other hand, if you have to fetch the data from the web, you are probably better off with Nutch.
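For the "feed it" case, here is a minimal SolrJ sketch (the Solr URL and field names are illustrative; the same can be achieved through Solr's HTTP update interface or the DataImportHandler):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class FeedSolr {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Build a document from data you already have (database, CMS, XML export...)
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "page-42");
            doc.addField("title", "Some page pulled from the CMS");
            doc.addField("content", "Body text extracted from the CMS or database");

            solr.add(doc);    // send the document to Solr
            solr.commit();    // make it searchable
        }
    }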
