Sharing crawled Nutch data between multiple Solr indexes

We have thousands of Solr indexes/collections that share pages being crawled by Nutch.
Currently these pages are crawled multiple times, once for each Solr index that contains them.
Is it possible to crawl these sites once and share the crawl data between indexes?
Maybe by checking the existing crawldbs to see whether a site has already been crawled, and getting the data from there for parsing and indexing.
Or crawl all sites in one go and then selectively submit crawl data to each index (e.g. one site per segment, though I'm not sure how to identify which segment belongs to which site, since segment names are numeric).
Any ideas or help appreciated :)

You will need to write a new indexer plugin to do that; look at the SolrIndexer of Nutch to understand how to write a new indexer. In that indexer, you should do the following:
Define three or four Solr server instances, one for each core.
Inside the write method of the indexer, examine the type of the document and use the right Solr core to add the document. To get this right, you should have a field in Nutch that you can use to determine where to send the document.
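The exact plugin interface depends on your Nutch version, but the routing idea itself is simple. Here is a minimal SolrJ sketch, assuming a hypothetical "site" field on each document that names the target core:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Sketch of the routing logic described above: one SolrClient per core,
// and a field on each document that decides which core receives it.
public class RoutingIndexer {

    private final Map<String, SolrClient> clientsBySite = new HashMap<>();

    public RoutingIndexer(Map<String, String> coreUrlsBySite) {
        coreUrlsBySite.forEach((site, url) ->
                clientsBySite.put(site, new HttpSolrClient.Builder(url).build()));
    }

    // Called once per crawled document; "site" is the hypothetical routing field.
    public void write(SolrInputDocument doc) throws IOException, SolrServerException {
        SolrClient target = clientsBySite.get((String) doc.getFieldValue("site"));
        if (target != null) {
            target.add(doc);
        }
    }

    public void commitAll() throws IOException, SolrServerException {
        for (SolrClient client : clientsBySite.values()) {
            client.commit();
        }
    }
}
```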

Related

Is Solr stable enough to use as main repository?

I'm testing Solr as my full-text search engine provider over 1,000,000 documents.
I also have user information data which is related to the documents (as creator), and I want to store the users' hits.
Is it necessary to have a database engine to store all the data, or is Solr stable and safe enough to rely on?
Is there any risk of losing the stored data in Solr? (I know it can happen to the Solr index and I can rebuild it, but what about the raw data?)
The only reason I want to have a second storage is to have another backup/version of all of my data (not for querying).
Amir,
Solr is stable. If you are not convinced, have a look at the list of users here...
http://wiki.apache.org/solr/PublicServers which includes NASA, AT&T etc...
Solr's main goal is to serve as a search engine, helping us implement search, NLP algorithms, Big Data workloads, etc.
Solr is not meant to be the main data store (although it might serve as one).
The reason for the ambiguous sentence above is that, unlike a relational database, Solr can store both the original data and the index, OR the index only without the data itself.
If you store only the index, by specifying stored="false" per field in Solr's schema.xml, then you get a much smaller Solr data volume and better performance, but when you query Solr you will receive back only the document ID, and you will have to continue with your relational DB....
Of course you can store some of the data (some document fields) and avoid storing others.
Of course, you should back up / replicate Solr to ensure disaster recovery, etc.
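For illustration, the stored/not-stored trade-off is set per field in schema.xml; the field names and types below are only examples:

```xml
<!-- Indexed and stored: searchable, and the original value comes back in query results -->
<field name="title"   type="text_general" indexed="true" stored="true"/>
<!-- Indexed only: searchable, but the raw text has to be fetched from your own data store -->
<field name="content" type="text_general" indexed="true" stored="false"/>
```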

Getting raw text files from a Solr snapshot?

I have a Solr database snapshot. The database is an archive of published blog posts (plus a bunch of metadata for each post). The snapshot is tens of thousands of posts.
I want to run some machine learning algorithms and topic modeling on the posts. So I don't need the database per se, I just want to get the raw text of the posts and the metadata in some simple form. Can anyone tell me how to open or extract that info without actually installing Solr?
I suppose you mean the Solr index when you say "Solr database snapshot".
A Solr index is basically a Lucene index, and you can use the Lucene APIs to read the index and extract data from the fields.
This does not require Solr to be installed.
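A minimal sketch of that approach with the Lucene API; the field names ("id", "content") are assumptions about the snapshot's schema, and only stored fields can be read back this way:

```java
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class DumpPosts {
    public static void main(String[] args) throws Exception {
        // args[0] = path to the snapshot's index directory (the one with the segment files)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
            for (int i = 0; i < reader.maxDoc(); i++) {
                Document doc = reader.document(i);
                // doc.get(...) returns null for fields that were not stored
                System.out.println(doc.get("id") + "\t" + doc.get("content"));
            }
        }
    }
}
```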

What is the role of NUTCH if we are going to make a search engine using Hadoop and Solr?

I want to make a search engine in which I crawl some sites and store their indexes and info in Hadoop, and then do the searching with Solr.
But I am facing lots of issues. If I search on Google, different people give different suggestions and different ways of configuring and setting up a Hadoop-based search engine.
These are some of my questions:
1) How will the crawling be done? Is Nutch needed for the crawling or not? If yes, how do Hadoop and Nutch communicate with each other?
2) What is the use of Solr? If Nutch does the crawling and stores the crawled indexes and their information in Hadoop, then what is the role of Solr?
3) Can we do the searching using Solr and Nutch? If yes, where will they save their crawled indexes?
4) How does Solr communicate with Hadoop?
5) If possible, please explain step by step how I can crawl some sites, save their info into a DB (Hadoop or any other), and then do the search.
I am really stuck with this. Any help will be really appreciated.
A very big thanks in advance. :)
Please help me sort out this huge issue.
We are using Nutch as a web crawler and Solr for searching in some production environments, so I hope I can give you some information about 3).
How does this work? Nutch has its own crawl db and a set of websites where it starts crawling. It has plugins where you can configure different things, like PDF crawling, which fields are extracted from HTML pages, and so on. While crawling, Nutch stores all links extracted from a website and follows them in the next cycle. All crawling results are stored in the crawl db. In Nutch you configure an interval after which crawled results become outdated and the crawler starts again from the defined start sites.
The results inside the crawl db are synchronized to the Solr index, so you search on the Solr index. In this constellation, Nutch is only there to get data from websites and provide it to Solr.
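Once Nutch has pushed its documents into Solr, the searching itself is plain Solr; a minimal SolrJ query sketch (the core URL and the "content", "url" and "title" field names are assumptions based on the default Nutch schema):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // "nutch" is a placeholder core name; point this at the core Nutch indexes into.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build()) {
            SolrQuery query = new SolrQuery("content:hadoop"); // query the crawled page text
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("url") + " - " + doc.getFieldValue("title"));
            }
        }
    }
}
```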

Solr schema validation

I don't understand from the Solr wiki whether Solr takes one schema.xml or can have multiple ones.
I took the schema from Nutch and placed it in Solr, and later tried to run the examples from Solr. The message was clear that there was an error in the schema.
If I have a Solr instance, am I stuck with a specific schema? If not, where is the information about using multiple ones?
From the Solr Wiki - SchemaXml page:
The schema.xml file contains all of the details about which fields
your documents can contain, and how those fields should be dealt with
when adding documents to the index, or when querying those fields.
Now you can only have one schema.xml file per instance/index within Solr. You can implement multiple instances/indexes within Solr by using the following strategies:
Running Multiple Indexes - please see this Solr Wiki page for more details.
There are various strategies to take when you want to manage multiple "indexes" in a Single Servlet Container
Running Multiple Cores within a Solr instance. - Again, see the Solr Wiki page for more details...
Multiple cores let you have a single Solr instance with separate
configurations and indexes, with their own config and schema for very
different applications, but still have the convenience of unified
administration. Individual indexes are still fairly isolated, but you
can manage them as a single application, create new indexes on the fly
by spinning up new SolrCores, and even make one SolrCore replace
another SolrCore without ever restarting your Servlet Container.
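For illustration, in the legacy multi-core setup each core is declared in solr.xml and points at its own instanceDir, which holds that core's conf/schema.xml; the core names below are just examples:

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- each instanceDir contains its own conf/schema.xml and conf/solrconfig.xml -->
    <core name="nutch"    instanceDir="nutch" />
    <core name="products" instanceDir="products" />
  </cores>
</solr>
```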

Nutch versus Solr

I am currently collecting information on whether I should use Nutch with Solr (domain: vertical web search).
Could you advise me?
Nutch is a framework for building web crawlers and search engines. Nutch can handle the whole process, from collecting the web pages to building the inverted index. It can also push those indexes to Solr.
Solr is mainly a search engine with support for faceted searches and many other neat features. But Solr doesn't fetch the data, you have to feed it.
So maybe the first thing you have to ask in order to choose between the two is whether or not the data to be indexed is already available (in XML, in a CMS, or in a database). In that case, you should probably just use Solr and feed it that data. On the other hand, if you have to fetch the data from the web, you are probably better off with Nutch.
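If the data already exists, "feeding" Solr simply means sending it documents; a minimal SolrJ sketch (the collection URL and field names are placeholders):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FeedExample {
    public static void main(String[] args) throws Exception {
        // "mycollection" is a placeholder; use your own core/collection name.
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");                                      // assumed unique key field
            doc.addField("title", "Example document");                        // assumed schema field
            doc.addField("content", "Text taken from your CMS or database."); // assumed schema field
            solr.add(doc);
            solr.commit(); // make the document visible to searches
        }
    }
}
```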
