Saving Nutch crawl data to Amazon S3

I am trying to evaluate if Nutch/Solr/Hadoop are the right technologies for my task.
PS: Previously I was trying to integrate Nutch (1.4) and Hadoop to see how it works.
Here is what I am trying to achieve overall,
a) Start with a Seed URL(s) and crawl and parse/save data/links
--Which Nutch crawler does anyway.
b) Then be able to query the crawled indexes from a Java client
--- (perhaps using the SolrJ client)
c) Since Nutch (as of 1.4.x) already uses Hadoop internally, I will just install Hadoop and configure it in the nutch-**.xml files.
d) I would like Nutch to save the crawled indexes to Amazon S3 and also Hadoop to use S3 as file system.
Is this even possible, or even worth it?
e) I read in one of the forums that in Nutch 2.0 there is a data layer using GORA that can save indexes to HBase etc. I don't know when the 2.0 release is due. :-(
Would anyone suggest grabbing the in-progress 2.0 trunk and starting to use it, hoping to get a released library sooner or later?
PS: I am still trying to figure out how/when/why/where Nutch uses Hadoop internally. I just cannot find any written documentation or tutorials. Any help on this aspect is also much appreciated.
If you are reading this line, then thank you so much for reading this post up to this point :-)

Hadoop can use S3 as its underlying file system natively. I have had very good results with this approach when running Hadoop in EC2, either using EMR or your own / third-party Hadoop AMIs. I would not recommend using S3 as the underlying file system when using Hadoop outside of EC2, as bandwidth limitations would likely negate any performance gains Hadoop would give you. The S3 adapter for Hadoop was developed by Amazon and is part of the Hadoop core. Hadoop treats S3 just like HDFS. See http://wiki.apache.org/hadoop/AmazonS3 for more info on using Hadoop with S3.
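For illustration, here is a minimal Java sketch of going through Hadoop's FileSystem API against an S3 bucket using the s3n:// connector mentioned on that wiki page. The bucket name and credentials are placeholders; on recent Hadoop releases you would use the s3a:// connector and fs.s3a.* properties instead.

// Minimal sketch: reading and writing S3 through Hadoop's FileSystem API.
// Assumes the s3n:// connector shipped with Hadoop core and a hypothetical
// bucket named "my-crawl-bucket"; credentials are placeholders.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3FileSystemSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials can also go in core-site.xml instead of code.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        // Hadoop treats the bucket like any other file system root.
        FileSystem s3 = FileSystem.get(URI.create("s3n://my-crawl-bucket/"), conf);

        // Write a small file, then list the directory to confirm it is there.
        try (FSDataOutputStream out = s3.create(new Path("s3n://my-crawl-bucket/test/hello.txt"))) {
            out.writeUTF("hello from hadoop");
        }
        for (FileStatus status : s3.listStatus(new Path("s3n://my-crawl-bucket/test/"))) {
            System.out.println(status.getPath());
        }
    }
}

To make S3 the default file system for the whole cluster, so that Nutch's Hadoop jobs write their output there, one documented approach is to point fs.default.name (fs.defaultFS on newer releases) at the bucket URI in core-site.xml, as described on the wiki page above.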
Nutch is designed to run as a job on a Hadoop cluster (when in "deploy" mode) and therefore does not include the Hadoop jars in its distribution. Because it runs as a Hadoop job, however, it can access any underlying data store that Hadoop supports, such as HDFS or S3. When run in "local" mode, you will provide your own local Hadoop installation. Once crawling is finished in "deploy" mode, the data will be stored in the distributed file system. It is recommended that you wait for indexing to finish and then download the index to a local machine for searching, rather than searching in the DFS, for performance reasons. For more on using Nutch with Hadoop, see http://wiki.apache.org/nutch/NutchHadoopTutorial.
Regarding HBase, I have had good experiences using it, although not for your particular use case. I can imagine that for random searches, Solr may be faster and more feature-rich than HBase, but this is debatable. HBase is probably worth a try. Until 2.0 comes out, you may want to write your own Nutch-to-HBase connector or simply stick with Solr for now.
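For item (b) in the question, querying the Solr index from Java with SolrJ takes only a few lines. Here is a minimal sketch, assuming a hypothetical Solr core named "nutch" populated by Nutch's solrindex step with the usual url/title/content fields; older SolrJ releases use HttpSolrServer or CommonsHttpSolrServer in place of HttpSolrClient.

// Minimal SolrJ sketch: query the crawled index from a Java client.
// Core name, host, and field names are assumptions based on the default Nutch schema.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchCrawledIndex {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build()) {
            SolrQuery query = new SolrQuery("content:hadoop"); // field names come from the Nutch schema
            query.setRows(10);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("url") + " -> " + doc.getFieldValue("title"));
            }
        }
    }
}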

Related

Migrating Solr Cloud cluster over new cloud vendor

We need to move our Solr Cloud cluster from one cloud vendor to another. The cluster is composed of 8 shards with a replication factor of 2, spread among 8 servers, with roughly 500GB of data in total.
I wonder what the common approaches are to migrate the cluster, and especially its data, with the least impact on availability and performance.
I was thinking of some sort of initial dump copy, then synchronizing them by catching up on the diff (which could be huge); after keeping them in sync, just switch over whenever everything is ready on the other side.
Is that doable? What tools should/could I use?
Thanks!
You have multiple choices depending on your existing setup and Solr version:
As mentioned earlier, make use of the backup and restore APIs from the Collections API.
If you have Solr 6 and above, I would recommend exploring the option of CDCR, which is Solr's native Cross Data Centre Replication.
Reindex onto the new cluster and then leverage Solr collection aliasing to change your application endpoints to the target provider upon completion of reindexing. (A rough sketch of the backup/restore and aliasing calls follows below.)
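As a rough sketch of the first and third options, the Collections API can be driven from Java with plain HTTP (Java 11+). The collection, alias, and backup names below are hypothetical, and the backup "location" must be storage that the relevant nodes can reach (a shared mount, or a directory you copy between clouds).

// Sketch: Collections API calls for backup/restore and alias switching.
// Hosts, collection names, and paths are placeholders, not a real setup.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrMigrationSketch {
    static final HttpClient HTTP = HttpClient.newHttpClient();
    static final String OLD_SOLR = "http://old-cloud-solr:8983/solr";
    static final String NEW_SOLR = "http://new-cloud-solr:8983/solr";

    static String get(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Back up the collection on the old cluster to shared storage...
        System.out.println(get(OLD_SOLR + "/admin/collections?action=BACKUP"
                + "&collection=products&name=products_snap&location=/mnt/backups"));
        // ...and restore it on the new cluster from the same location.
        System.out.println(get(NEW_SOLR + "/admin/collections?action=RESTORE"
                + "&collection=products&name=products_snap&location=/mnt/backups"));
        // Finally, point the alias the application queries at the new collection,
        // so the cut-over is a single metadata change.
        System.out.println(get(NEW_SOLR + "/admin/collections?action=CREATEALIAS"
                + "&name=products_live&collections=products"));
    }
}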

Storage in Apache Flink

After processing those millions of events/records, where is the best place to store the information, given that it is worth saving millions of events? I saw a pull request closed by this commit mentioning Parquet formats, but is the default HDFS? My concern is, after saving (where?), whether it is easy (and fast!) to retrieve that data.
Apache Flink is not coupled with specific storage engines or formats. The best place to store the results computed by Flink depends on your use case.
Are you running a batch or streaming job?
What do you want to do with the result?
Do you need batch (full scan), point, or continuous streaming access to the data?
What format does the data have? Flat structured (relational), nested, blob, ...?
Depending on the answers to these questions, you can choose from various storage backends such as
- Apache HDFS for batch access (with different storage formats such as Parquet, ORC, or custom binary)
- Apache Kafka if you want to access the data as a stream
- a key-value store such as Apache HBase and Apache Cassandra for point access to data
- a database such as MongoDB, MySQL, ...
Flink provides OutputFormats for most of these systems (some through a wrapper for Hadoop OutputFormats). The "best" system depends on your use case.
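As a concrete illustration of the HDFS/batch case, here is a minimal Flink DataSet job in Java that writes its result back to HDFS with writeAsText. The paths and the line format are hypothetical; for Parquet, ORC, or Kafka you would substitute the corresponding OutputFormat or sink.

// Minimal Flink batch sketch: aggregate events and write the result to HDFS as text.
// Input/output paths and the comma-separated line format are assumptions.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class WriteToHdfsSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read raw events from HDFS and turn each line into a (key, 1) pair...
        DataSet<Tuple2<String, Integer>> counts = env
                .readTextFile("hdfs:///events/input")
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String line) {
                        return new Tuple2<>(line.split(",")[0], 1);
                    }
                })
                .groupBy(0)
                .sum(1);

        // ...then write the aggregated result back to HDFS for later batch access.
        counts.writeAsText("hdfs:///events/output");
        env.execute("write results to HDFS");
    }
}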

How to quickly transfer data between servers using Sunspot / Websolr?

Since I suspect my setup is rather conventional, I'd like to start by providing a little context. Our Solr setup involves three environments:
Production - Solr server hosted on Websolr.
Staging - Also a Solr server hosted on Websolr.
Development - Supported via the sunspot_solr gem which allows us to easily set up our own local Solr server for development.
For the most part, this is working well. We have a lot of records so doing a full reindex takes a few hours (despite eager loading and using background jobs to parallelize the work). But that's not too terrible since we don't need to completely reindex very often.
But there's another scenario which is starting to become very annoying... We very frequently need to populate our local machine (or staging environment) with production data (i.e. basically grab a SQL dump from production and pipe it into our local database). We do this all the time for bugfixes and whatnot.
At this point, because our data has changed, our local Solr index is out of date. So, if we want our search to work correctly, we also need to reindex our local Solr server and that takes a really long time.
So now the question: Rather than doing a full reindex, I would like to simply copy the production index down on to my machine (i.e. conceptually similar to a SQL dump but for a Solr server rather than a database). I've Googled around enough to know that this is possible but have not seen any solutions specific to Websolr / Sunspot. These are such common tools that I figured someone else must have figured this out already.
Thanks in advance for any help!
One of the better kept secrets of Solr (and websolr): You can use the Solr Replication API to copy the data between two indices.
If you're making a copy of the production index "prod54321" into the QA index "qa12345", then you'd initiate the replication with the fetchindex command on the QA index's replication handler. Here's a quick command to approximate that, using cURL.
curl -X POST https://index.websolr.com/solr/qa12345/replication \
-d command=fetchindex \
-d masterUrl=https://index.websolr.com/solr/prod54321/replication
(Note the references to the replication request handler on both URLs.)

Nutch 2.1 (HBase, SOLR) with Amazon Web Services

I have experimented with Nutch 2.1 locally without any difficulty. I have also tried it on a 3-machine distributed cluster. We are now discussing whether to run it with Amazon Web Services or not. I do not have much experience with AWS. My question is: is it possible and necessary to try the Nutch 2.1 crawling and indexing parts in the cloud? What possible advantages and disadvantages would we have?
Thanks.
If you have a cluster with the same capacity as the AWS cluster you plan to invest in, then there is no advantage except for the first point (locality) below.
Here are several factors that you should think about before switching to AWS:
Locality of crawled hosts: Say you are sitting in Europe and the websites that you want to crawl are hosted far away, e.g. in Australia. If you buy AWS nodes located in Australia, crawling that data will be much faster than crawling from Europe.
Cost: For AWS machines you pay on an hourly basis. Can you afford that? If not, better to use your own machines.
Current cluster capacity: Does your current cluster have ample capacity and space to handle the amount of crawled data? I don't think there will be a problem in terms of computational speed, since Nutch runs on Hadoop, which was designed to run on commodity hardware. But can your cluster accommodate the entire data set fetched by the crawler?
Volume of data: What is a rough estimate of the amount of data to be crawled? If it is small, then it makes no sense to have an AWS cluster.
Time constraints: Is there any deadline for completing the crawl?
If you are doing this for a professional project, then these factors must be given a thought.
If you are doing it for fun/hobby/learning, go ahead and use the free-tier nodes of AWS. Those are low-capacity nodes provided free by Amazon. It's fun to learn new things :)
Advantages of AWS:
No need to buy machines to set up a cluster. You can get started without any hardware except a terminal PC.
Locality
No need to look after machines. If a node crashes badly, leave it (it's not your problem :P). Buy a new one, add it to the cluster, and go ahead.
Disadvantages of AWS:
Costly.
Copying data to any machine outside the AWS cluster is charged.
Your data is NOT persisted when you give up the procured AWS nodes. If you want to persist it, pay for and use the S3 storage service.

MapReduce in the cloud

Except for Amazon MapReduce, what other options do I have to process a large amount of data?
Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is under a limited CTP; however, you can provide your information and request CTP access at the link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation.
Besides that, you can also try Google BigQuery, for which you will have to move your data to Google's proprietary storage first and then run BigQuery on it. Remember that BigQuery is based on Dremel, which is similar to MapReduce but faster due to column-based processing.
Another option is to use Mortar Data; they have used Python and Pig intelligently to make it easy to write jobs and visualize the results. I found it very interesting; please have a look:
http://mortardata.com/#!/how_it_works
DataStax Brisk is good.
Full-on distributions:
- Apache Hadoop
- Cloudera’s Distribution including Apache Hadoop (that’s the official name)
- IBM Distribution of Apache Hadoop
- DataStax Brisk
- Amazon Elastic MapReduce
HDFS alternatives:
- MapR
- Appistry CloudIQ Storage Hadoop Edition
- IBM General Parallel File System (GPFS)
- CloudStore
Hadoop MapReduce alternatives:
- Pervasive DataRush
- Cascading
- Hive (an Apache subproject, included in Cloudera’s distribution)
- Pig (a Yahoo-developed language, included in Cloudera’s distribution)
Refer: http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/
If you want to process a large amount of data in real time (Twitter feeds, click streams from a website, etc.) using a cluster of machines, then check out Storm, which was open-sourced by Twitter recently.
Standard Apache Hadoop is good for batch processing of petabytes of data where latency is not a problem.
Brisk from DataStax, as mentioned above, is quite unique in that you can use MapReduce parallel processing on live data.
There are other efforts like Hadoop Online, which allows processing using pipelining.
Google BigQuery is obviously another option where you have CSV (delimited records) and you can slice and dice without any setup. It's extremely simple to use, but it is a premium service where you pay by the number of bytes processed (the first 100GB per month is free, though).
If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources about setting up such a cluster here.
However, this option is less cost effective than Amazon Elastic MapReduce, unless you have lots of jobs to run throughout the day, keeping your cluster fairly busy.
The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble heterogeneous hardware into a cluster with decent computing power, the kind that can live in a rack in your server room. Considering that the older hardware lying around is already paid for, the only costs to getting such a cluster going are new drives and perhaps enough memory sticks to maximize the capacity of those boxes. The cost effectiveness of such an approach is much better than Amazon's. The only caveat would be whether you have the bandwidth necessary to pull all the data into the cluster's HDFS on a regular basis.
Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/
